GAPS: Genetic Algorithm Optimised Parallelisation

Andy Nisbet
Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, U.K.
Supported by EPSRC project GR/K82291. [email protected]

Abstract

The compilation of FORTRAN programs for SPMD execution on parallel architectures often requires the application of program restructuring transformations such as loop interchange, loop distribution, loop fusion, loop skewing and statement reordering. Determining the optimal transformation sequence that minimises execution time for a given program is an NP-complete problem. The hypothesis of the research described here is that genetic algorithm (GA) techniques can be used to determine sequences of restructuring transformations that are better than, or as good as, those produced by more conventional compiler search techniques. The Genetic Algorithm Parallelisation System (GAPS) compiler framework is presented. GAPS uses GA optimisation to determine the restructuring transformation applied to each statement and its associated iteration space. The hypothesis of GAPS is tested by comparing the performance of SPMD code produced by PFA, by petit from the Omega Project at the University of Maryland, and by GAPS for a 1024x1024 ADI benchmark on an SGI Origin 2000. On this benchmark GAPS produces code with 21-25% improvements in parallel execution time when measured execution time is used as the evaluation function.

Keywords: Genetic Algorithm, Parallel, Compiler

1 Introduction

State of the art optimising compilers [1,2,3] utilise transformation sequences which attempt to minimise individual sources of overhead largely in isolation, since considering all overheads simultaneously is not practical. An infinite number of transformations can be applied, and predicting the eventual total overhead for a particular target architecture is intractable in general. Compiler technology requires a strategy which can apply transformations in order to globally minimise all sources of overhead. Such a strategy is possible with the use of genetic algorithm techniques. Section 2 describes related work. Section 3 discusses the basic concepts of GA optimisation [4] and the means by which GAs may be applied to the problem of compilation/parallelisation. Section 4 provides an overview of the GAPS framework and discusses aspects of the Unified Transformation Framework [3] of the Omega Project [5]. Section 5 describes the current GAPS configuration and presents results for this configuration in parallelising an ADI benchmark. Section 6 presents conclusions and future work.

2 Related Work

[6] described a knowledge-based parallelisation environment which attempted to match information derived through analysis and measurement of a program and its hardware target with information stored in a knowledge base. Such approaches can be very successful in selecting local optimisations for an application; however, they cannot easily consider the global effect of a local optimisation on program overhead. The GA approach advocated in this paper does not suffer from this deficiency. GAs were applied to the problem of partitioning tasks for parallel execution in [7]. The related technique of genetic programming was applied to the problem of autoparallelisation in [8,9,10], but no execution time results were presented for their parallelised programs. [11] discussed a technique to use GA optimisation for autoparallelisation, but no experimental results were presented.

3 Genetic Algorithm (GA) Optimisation

GAs iteratively seek better solutions to problems in the same way that evolution has optimised populations of living organisms for their habitat and environment. The process of reproduction creates new child organisms through the application of recombination and mutation operators to encodings of parent organisms. The processes of natural selection ensure that successful organisms reproduce more often than less successful ones. In a GA, an organism represents a potential solution in a problem domain. A population of organisms represents a subset of the possible solutions to a problem. Solutions are represented in a population as encodings. The success of a solution encoding in solving a problem is measured by its fitness value, which is calculated by an evaluation function. High fitness solution encodings are selected for reproduction more frequently than low fitness solutions in order to mimic natural selection. Therefore, GAs can evolve good solutions to complex optimisation problems through the manipulation of solution encodings.

[Figure 1: The GAPS Approach. Flowchart: the population is initialised from random methods, compiler strategies and programmer solutions; fitness is evaluated using overhead prediction or measured execution time; if performance is satisfactory the process stops, otherwise parent encoded solutions are selected, a reproduction operator is chosen (recombination and mutation operators, compiler-strategy operators, or operators using overhead and profile information), new solutions are generated, and the reproduction operator selection probabilities are recalculated.]

3.1 GA Optimised Compilation

GAs can be applied to compilation problems using three basic concepts: i) an encoding describes a legal ordered set of transformations(1); ii) mutation and recombination reproduction operators describe mechanisms that alter encodings; iii) an evaluation function represents a target architecture plus the application program to be compiled. Evaluation functions calculate the total overhead of the application program when transformed by an encoding, and give high fitness values to encodings which yield low overheads.

(1) Program semantics are preserved when a legal ordered set of transformations is applied to a program.
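To make these three concepts concrete, the following Python sketch shows a minimal steady-state GA loop of the shape depicted in figure 1. It is illustrative only: the encoding, the operators and the evaluation function are toy stand-ins (random_encoding, mutate, crossover and evaluate are hypothetical names), whereas in GAPS the encoding is a set of time-mappings and the evaluation function is an overhead prediction or a measured execution time.

    # Illustrative sketch only: a minimal steady-state GA loop of the shape shown
    # in figure 1.  Encoding, operators and evaluation function are toy stand-ins.
    import random

    def random_encoding(length=8):
        # i) an encoding: here just a list of small integers
        return [random.randint(0, 3) for _ in range(length)]

    def mutate(enc):
        # ii) a mutation operator: perturb one randomly chosen gene
        child = list(enc)
        child[random.randrange(len(child))] = random.randint(0, 3)
        return child

    def crossover(a, b):
        # ii) a recombination operator: single-point crossover
        point = random.randrange(1, len(a))
        return a[:point] + b[point:]

    def evaluate(enc):
        # iii) an evaluation function: lower "overhead" -> higher fitness
        overhead = sum(enc)                # toy overhead model
        return 1.0 / (1.0 + overhead)

    population = [random_encoding() for _ in range(20)]
    # a hybrid GA could seed one member with a compiler-generated encoding here
    for reproduction in range(1000):
        population.sort(key=evaluate, reverse=True)   # rank by fitness
        p1, p2 = population[0], population[1]         # favour fit parents
        child = mutate(crossover(p1, p2))
        population[-1] = child                        # steady-state replacement of the worst
    best = max(population, key=evaluate)
    print(best, evaluate(best))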

4 GAPS Framework

The GAPS framework (as depicted in figure 1) aims to evaluate the application and performance benefit of GA optimisation techniques to the compilation of loop-based programs for parallel architectures. Conventional compilation techniques will be used to create hybrid GAs [4] which either utilise problem-specific knowledge(2) to generate the initial population of solution encodings, or which use conventional compiler techniques as reproduction operators. This is in contrast to pure GAs, which use no problem-specific knowledge and are therefore independent of target architecture and application programs. Hybrid techniques can guarantee that the best solution generated is no worse than the conventional techniques used to hybridise the GA. Compilation with GA optimisation will investigate the effect and use of overhead prediction and measured/instrumented/profiled execution in constructing two distinct evaluation functions. Overhead prediction uses cost models of parallel architectures to estimate the total overhead of a transformed application program on a target architecture. Information logged during overhead prediction/instrumented execution will be used by hybrid GAs to identify transformation(s) which can reduce program overheads.

(2) Problem-specific knowledge concerns both the application and the execution environment of a parallel architecture.

4.1 GAPS Infrastructure

GAPS uses the Unified Transformation Framework (UTF) from the Omega Project. The UTF provides two simple mathematical abstractions, time-mappings and space-mappings, which concisely specify i) the restructuring transformations applied, and ii) the partition of computation for parallel execution. Single Program Multiple Data (SPMD) code can then be generated with block, cyclic or block-cyclic scheduling. The reader is referred to [3,5] for an in-depth description of time and space mappings.

4.1.1 Time-mappings

An iteration-space $I_p$ represents the set of iterations with which statement $p$ will be executed. The iteration-space of a statement $p$ surrounded by $m_p$ loops is specified as $[x_1,\ldots,x_{m_p}] \in I_p$ such that equation 1 holds, where $L_j$ and $U_j$ are functions representing the lower and upper bounds respectively of the $j$th loop around $p$, and $\vec{s}$ is a vector of symbolic constants.

$\forall j,\ 1 \le j \le m_p \Rightarrow L_j(x_1,\ldots,x_{j-1},\vec{s}) \le x_j \le U_j(x_1,\ldots,x_{j-1},\vec{s})$   (1)

$T_p : [i_{p_1},\ldots,i_{p_{m_p}}] \rightarrow [f_{p_0},\ldots,f_{p_n}] \mid C_p$   (2)

Time-mappings of the form of equation 2 specify the iteration reordering transformation applied to the iteration-space $I_p$ of statement $p$, where the $i_{p_1},\ldots,i_{p_{m_p}}$ iteration variables represent the $m_p$ loops nested around $p$. Note that for simplicity the range of each $T_p$ has $n+1$ components, where the positions $0,\ldots,n$ are referred to as levels. The $f_{p_i}$ expressions are quasi-affine(3) functions of the iteration variables and symbolic constants, whilst $C_p$ is an optional restriction on the domain of the mapping. The condition $C_p$ enables piecewise time-mappings to be specified as $\bigcup_i T_{p_i} \mid C_{p_i}$, where the $T_{p_i} \mid C_{p_i}$'s are time-mappings with disjoint domains. The mapping $T_p$ represents the fact that the iteration $[i_{p_1},\ldots,i_{p_{m_p}}]$ in the original iteration space of statement $p$ is mapped to iteration $[f_{p_0},\ldots,f_{p_n}]$ in the new iteration space if condition $C_p$ is true. As with unimodular transformations, the points in the new iteration space are then executed in the order specified by the lexicographic [12] operator $\prec$ as in equation 3.

$(x_1,\ldots,x_n) \prec (y_1,\ldots,y_n) \Leftrightarrow \exists m : (\forall i,\ 1 \le i < m \Rightarrow x_i = y_i) \wedge x_m < y_m$   (3)

The concept of a group is introduced in order to relate the structure and the number of loops in a transformed program to the set of time-mappings $\forall T_p$. Two statements $p, q$ are said to be in the same group $g_{p_k} \equiv g_{q_k}$ at level $k$ if $\forall i : (0 \le i \le k) \wedge (f_{p_i} \equiv f_{q_i})$, where $f_{p_i}$ is a constant and $f_{q_i}$ is a constant. The effect of two statements existing in the same group at a level $k$ is that the execution periods of any loops specified by $\forall i : ((0 \le i \le k) \wedge ((f_{p_i} \neq \text{constant}) \vee (f_{q_i} \neq \text{constant})))$ will overlap. The $g_{p_k}$'s are given non-negative integer values such that $0 \le g_{p_n} \le |\forall p| - 1 \wedge |\bigcup_p g_{p_n}| \equiv |\forall p|$. This means that although statements may be in the same group at levels $< n$, at level $n$ each statement must be in a group that is distinct from that of other statements. Thus, time-mappings specify a total order for all statements in the restructured source code. The code fragments and time-mappings in figures 2, 3 and 4 illustrate the effect of different groupings on the total order of statements and on the number and structure of loops.

The iteration reordering program transformations specified by the set of time-mappings $\forall T_p$ are only legal if the original program semantics are preserved. This is true if:

1. The new ordering of iterations respects all control-flow and data dependences in the original code. If $i$ is an iteration of statement $p$ and $j$ an iteration of statement $q$, and $d_{pq}$ indicates that a dependence exists from $i$ to $j$, then $T_p(i)$ must be executed before $T_q(j)$, as in equation 4, where $Sym$ is the set of all symbolic constants and $\prec$ is the lexicographic order operator.

$\forall i,j,p,q,Sym : i \rightarrow j \in d_{pq} \Rightarrow T_p(i) \prec T_q(j)$   (4)

2. The restructured program performs exactly the same set of computations as the original program, as specified by equation 5.

$\forall p,q,i,j,Sym : (p = q \wedge i = j) \Leftrightarrow T_p(i) = T_q(j)$   (5)

do i = 1,1024
1:  a(i) = c(i) + d(i)
enddo
do j = 1,1024
2:  b(j) = c(j) + d(j)
enddo

$T_1 : [i] \rightarrow [0, i, 0]$, $T_2 : [j] \rightarrow [1, j, 0]$, $g_{1_0} = 0$, $g_{2_0} = 1$
Figure 2: Loop nest (a).

do t = 1,1024
1:  a(t) = c(t) + d(t)
2:  b(t) = c(t) + d(t)
enddo

$T_1 : [i] \rightarrow [0, i, 0]$, $T_2 : [j] \rightarrow [0, j, 1]$, $g_{1_0} = 0$, $g_{2_0} = 0$, $g_{1_2} = 0$, $g_{2_2} = 1$
Figure 3: Loop nest (b).

do j = 1,1024
2:  b(j) = c(j) + d(j)
enddo
do i = 1,1024
1:  a(i) = c(i) + d(i)
enddo

$T_1 : [i] \rightarrow [1, i, 0]$ and $T_2 : [j] \rightarrow [0, j, 0]$, $g_{1_0} = 1$, $g_{2_0} = 0$
Figure 4: Loop nest (c).

(3) Quasi-affine expressions are affine functions plus integer division and remainder when dividing by a constant.
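A small Python sketch of the lexicographic order of equation 3 and the legality condition of equation 4, checked on concrete iteration instances, may help. It is illustrative only: the list-based representation of time-mappings and the example dependence are hypothetical simplifications, whereas equation 4 in the paper quantifies over all iterations and symbolic constants rather than enumerating them.

    # Illustrative sketch only: the lexicographic order of equation (3) and the
    # legality condition of equation (4) checked on concrete iteration instances.
    # apply_time_mapping and the dependence list are hypothetical simplifications.

    def lex_less(x, y):
        # (x1,...,xn) precedes (y1,...,yn) iff they agree on a prefix and then x is smaller
        for xi, yi in zip(x, y):
            if xi < yi:
                return True
            if xi > yi:
                return False
        return False

    def apply_time_mapping(mapping, iteration):
        # mapping components: integers are constants, strings name iteration
        # variables, e.g. [0, 'i', 0] applied to {'i': 5} gives (0, 5, 0)
        return tuple(c if isinstance(c, int) else iteration[c] for c in mapping)

    # Figure 2 mappings: T1:[i]->[0,i,0], T2:[j]->[1,j,0]
    T = {1: [0, 'i', 0], 2: [1, 'j', 0]}

    # A hypothetical dependence set: statement 1 at iteration i must precede
    # statement 2 at iteration j whenever i == j.
    dependences = [((1, {'i': k}), (2, {'j': k})) for k in range(1, 1025)]

    legal = all(
        lex_less(apply_time_mapping(T[p], ip), apply_time_mapping(T[q], jq))
        for (p, ip), (q, jq) in dependences
    )
    print("time-mappings legal for this dependence set:", legal)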

4.1.2 Space-mappings

$S_p : [i_{p_1},\ldots,i_{p_{m_p}}] \rightarrow [v_{p_1},\ldots,v_{p_{l_p}}]$   (6)

Space-mappings of the form of equation 6 specify the partition of a statement $p$'s iterations to a virtual processor array, where the $i_{p_1},\ldots,i_{p_{m_p}}$ are iteration variables, $l_p$ is the dimensionality of the virtual processor array, and the $v_{p_i}$'s are affine functions of the iteration variables. As with data-distributions, the virtual processor array is then folded onto the physical processor array in either a blocked, cyclic, or block-cyclic fashion. The space-mappings used by GAPS are currently those generated by petit, where $l_p = 1$. Such space-mappings correspond to loops that are partitioned in one dimension.
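The folding of the virtual processor array onto the physical processors can be pictured with the following sketch. It is illustrative only: the function and parameter names are hypothetical, and the block size used for block-cyclic folding is chosen arbitrarily.

    # Illustrative sketch only: block, cyclic and block-cyclic folding of a
    # one-dimensional virtual processor array onto P physical processors.
    # This mirrors the data-distribution analogy in the text, not GAPS/petit code.

    def fold(virtual_index, virtual_size, P, scheme="block", block=4):
        if scheme == "block":
            block_size = (virtual_size + P - 1) // P      # ceiling division
            return virtual_index // block_size
        if scheme == "cyclic":
            return virtual_index % P
        if scheme == "block-cyclic":
            return (virtual_index // block) % P
        raise ValueError("unknown scheme")

    # e.g. 1024 virtual processors (one per iteration) folded onto 8 physical ones
    owners = [fold(v, 1024, 8, "block") for v in range(1024)]
    print(owners[0], owners[127], owners[128], owners[1023])   # 0 0 1 7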

4.2 Time-mapping Encodings

A member of the population $P_i$ is represented by one-dimensional space-mappings (as generated by petit) and time-mapping encodings for each and every statement in the original program. The time-mappings generated by petit are of the form in equation 7, where $p_j$ is the number of the loop in position $j$ according to the given loop permutation $p$, and the $c_{0_p},\ldots,c_{m_p}$ and the $d_{1_p},\ldots,d_{m_p}$ are integer constants. The $c_{0_p},\ldots,c_{m_p}$ correspond to loop-distribution, loop-fusion and statement-reordering transformations, whilst the $d_{1_p},\ldots,d_{m_p}$ correspond to loop-alignment transformations. Time-mappings having this form simplify some aspects of code-generation.

$T_p : [i_{p_1},\ldots,i_{p_{m_p}}] \rightarrow [c_{0_p},\ i_{p_1} \cdot d_{1_p},\ c_{1_p},\ i_{p_2} \cdot d_{2_p},\ \ldots,\ i_{p_m} \cdot d_{m_p},\ c_{m_p}]$   (7)

GAPS currently generates time-mappings of the form in equation 8, where the $\sigma_{j_p}$ are $0$ or $i_{p_k} \cdot d_{k_p}$, such that the ordering of the iteration variables given by the permutation $p$ is respected, as in equation 9.

$T_p : [i_{p_1},\ldots,i_{p_{m_p}}] \rightarrow [c_{0_p},\ \sigma_{1_p},\ c_{1_p},\ \sigma_{2_p},\ \ldots,\ \sigma_{n_p},\ c_{n_p}]$   (8)

$((\sigma_{j_p} = 0) \vee (\sigma_{j_p} = i_{p_k} \cdot d_{k_p})) \wedge \Big(k = \sum_{l=1}^{j-1}(\sigma_{l_p} \neq 0) + 1\Big) \wedge \Big(m_p = \sum_{l=1}^{n}(\sigma_{l_p} \neq 0)\Big)$   (9)

Such time-mappings can represent a wider range of statement-reordering and loop-fusion transformations than those of the form in equation 7. An encoding of a time-mapping stores i) the levels at which $\sigma_{j_p} = i_{p_k} \cdot d_{k_p}$, and ii) the total order $\prec_j$ of each level of constants $c_{j_p}$, $\forall p$, which is specified in equation 10.

$\forall p,q : (c_{j_p} \prec_j c_{j_q}) \Leftrightarrow (c_{j_p} < c_{j_q})$   (10)

Note that the total order $\prec_j$ may store degenerate information at levels where $j \ge k$, if $g_{p_{k-1}} \not\equiv g_{q_{k-1}} \wedge c_{k_p} \neq c_{k_q}$. This degenerate information behaves in a similar manner to recessive genes, because the information stored by $\forall p : (c_{i_p}, \sigma_{i+1_p})$ for $0 \le i < k$ determines whether $\prec_k$ has an effect on the groups $\forall p, g_{p_k}$. The recessive nature of this degenerate information may enable the reproduction operators described in section 4.3 to reach the optimum solution encoding more easily, with fewer reproductions.
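One possible in-memory form of such an encoding is sketched below. This is illustrative only: the class and field names (TimeMappingEncoding, sigma, constant_rank) are hypothetical, and equal ranks are used as a simple stand-in for equal constants (statements in the same group at that level).

    # Illustrative sketch only: a time-mapping encoding storing i) the iterator
    # components sigma_jp (0 meaning "no iterator") and ii) the total order of
    # the level constants c_jp, represented here by ranks.  Names are hypothetical.
    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class TimeMappingEncoding:
        # statement -> iterator component (or 0) for each sigma position 1..n
        sigma: Dict[int, List[object]]
        # for each constant level j = 0..n, statement -> rank of c_jp;
        # equal ranks mean equal constants (same group at that level)
        constant_rank: List[Dict[int, int]]

    # Figure 3 (loop nest (b)): both statements share the loop and the level-0
    # constant, and statement 1 precedes statement 2 at the level-1 constant.
    enc_b = TimeMappingEncoding(
        sigma={1: ['i'], 2: ['j']},
        constant_rank=[{1: 0, 2: 0}, {1: 0, 2: 1}],
    )
    print(enc_b)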

4.3 Reproduction Operators

An operator and two encodings $P_i, P_j$ are randomly selected for reproduction in order to create two new encodings. The operators may effect some combination of loop-fusion, loop-distribution and statement-reordering.

4.3.1 Group-Based Mutation

As mentioned earlier, the effect of two statements existing in the same group at a level $k$ is that the execution periods of any loops specified by $\forall i : ((0 \le i \le k) \wedge ((\sigma_{i+1_p} \neq 0) \vee (\sigma_{i+1_q} \neq 0)))$ will overlap. The group-based mutation operator manipulates the grouping of a parent $P_i$'s statements in order to produce a new child encoding $C_i$ having a different group structure. A second child $C_j$ is created by another application of this operator to parent $P_j$. The alterations to group structure may change the number and structure of loops and the placement of post-wait/barrier synchronisation in the transformed SPMD program. The group-based mutation operator randomly selects a level $k$ and two statements $p, q$. The lexicographic ordering of $T'_{p_k}$ and $T'_{q_k}$ at level $k$ is then evaluated, where $T'_{p_k} = (c_{0_p},\ldots,c_{k_p})$. A new and different lexicographic ordering for $T'_{p_k}$ and $T'_{q_k}$ is then randomly selected and enforced at level $k$. In order to illustrate the effects of this operator, consider the code fragment in figure 2. The time-mapping encodings for this are: $\sigma_{1_1} = i$, $\sigma_{1_2} = j$ and $\prec_0 : c_{0_1} < c_{0_2} \wedge \prec_1 : c_{1_1} \equiv c_{1_2}$. If level 0 is randomly selected and the new lexicographic ordering at level 0 is randomly chosen to be $T'_{1_0} \equiv T'_{2_0}$, then the code will resemble that generated in figure 3. Note that, as in this case $T'_{1_n} \equiv T'_{2_n}$, the constants $c_{n_p}$'s are randomly selected so that $|G_n| = |\forall p|$ is maintained.
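The idea behind group-based mutation can be sketched as follows. This is illustrative only: it operates on the simplified, hypothetical constant-rank representation shown earlier, changes only the relative order at the selected level rather than of the full truncated mappings, does not check legality, and omits the re-randomisation of the level-n constants that keeps the level-n groups distinct.

    # Illustrative sketch only: group-based mutation on a simplified encoding in
    # which constant_rank[j][p] is the rank of statement p's constant c_jp at
    # level j (equal ranks = same group).  Legality is not checked here.
    import random

    def group_based_mutation(constant_rank):
        child = [dict(level) for level in constant_rank]      # copy the parent
        k = random.randrange(len(child))                      # random level k
        p, q = random.sample(sorted(child[k]), 2)             # two statements
        # force a new, different relative ordering of p and q at level k:
        # q before p, q equal to p, or q after p
        current = (child[k][q] > child[k][p]) - (child[k][q] < child[k][p])
        new = random.choice([c for c in (-1, 0, 1) if c != current])
        child[k][q] = child[k][p] + new
        # (GAPS additionally re-randomises the level-n constants so that the
        #  level-n groups remain distinct; omitted in this sketch)
        return child

    # figure 2 encoding: level 0 orders statement 1 before 2, level 1 constants equal
    parent = [{1: 0, 2: 1}, {1: 0, 2: 0}]
    print(group_based_mutation(parent))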

4.3.2 Group-Based Crossover

The group-based crossover operator combines information from two parents $P_i, P_j$ and produces a single child $C_i$ with the aim of creating new statement groupings. Child $C_j$ is produced by a second application of the operator on $P_j, P_i$. An application of group-based crossover on $P_i, P_j$ creates the child $C_i$ such that the total orders $\prec_k : 0 \le k \le n$ of the $c_{k_p}$ constants are taken from $P_i$ if $k \in K$ and from $P_j$ if $k \notin K$, whilst the $\sigma_{l_p} : 1 \le l \le n$ are taken from $P_i$. The size $|K|$ is randomly chosen such that $1 \le |K| \le 2(n+1)/3$, as results presented in [13] for the related problem of schedule optimisation suggested that a good crossover technique should randomly choose the number of crossover points to be between 1/3 and 2/3 of the total length of the ordered list. Note that in group-based crossover the structure of the groups for all statements at a particular level $k$ is potentially altered, whereas the group-based mutation operator directly alters the grouping of a statement-pair.

Consider figures 3 and 4 to be $P_i, P_j$ in order to illustrate the effects of group-based crossover, with $|K| = 1 \wedge 0 \in K$. The encoding for $P_i$ is: $\sigma_{1_1} = i$, $\sigma_{1_2} = j$ and $\prec_0 : c_{0_1} \equiv c_{0_2} \wedge \prec_1 : c_{1_1} < c_{1_2}$. The encoding for $P_j$ is: $\sigma_{1_1} = i$, $\sigma_{1_2} = j$ and $\prec_0 : c_{0_2} < c_{0_1} \wedge \prec_1 : c_{1_1} \equiv c_{1_2}$. The total order on the $c_{k_p}$ for child $C_i$ becomes $\prec_0 : c_{0_1} \equiv c_{0_2} \wedge \prec_1 : c_{1_2} < c_{1_1}$, whilst for child $C_j$, created by an application of group-based crossover on $P_j, P_i$ with the same set $K$, it becomes $\prec_0 : c_{0_2} < c_{0_1} \wedge \prec_1 : c_{1_1} < c_{1_2}$. Figures 5 and 6 illustrate the time-mappings and the restructured code for $C_i$ and $C_j$. Note that child $C_j$ has degenerate information stored by the total order of constants $c_{j_p}$'s for $\prec_1$. It is intended that degenerate grouping information may be activated through interactions between group-based crossover and mutation operators.

do i = 1,1024
2:  b(i) = c(i) + d(i)
1:  a(i) = c(i) + d(i)
enddo

$T_1 : [i] \rightarrow [0, i, 1]$, $T_2 : [j] \rightarrow [0, j, 0]$
Figure 5: Child $C_i$ for group-based crossover.

do i = 1,1024
2:  b(i) = c(i) + d(i)
enddo
do j = 1,1024
1:  a(j) = c(j) + d(j)
enddo

$T_1 : [i] \rightarrow [1, i, 0]$, $T_2 : [j] \rightarrow [0, j, 1]$
Figure 6: Child $C_j$ for group-based crossover.
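Group-based crossover can be sketched on the same simplified representation. This is illustrative only: the sigma components, which the text says are taken from P_i, are omitted, legality is again ignored, and the two constant levels below are a hand-built rendering of figures 3 and 4.

    # Illustrative sketch only: group-based crossover on the simplified encoding
    # (constant_rank[j][p] = rank of c_jp at level j).  The child takes the
    # constant order at levels in K from parent_i and at the remaining levels
    # from parent_j; |K| is drawn as described in the text.
    import random

    def group_based_crossover(parent_i, parent_j):
        n_plus_1 = len(parent_i)                              # number of constant levels
        size_K = random.randint(1, max(1, (2 * n_plus_1) // 3))
        K = set(random.sample(range(n_plus_1), size_K))
        return [dict(parent_i[j]) if j in K else dict(parent_j[j])
                for j in range(n_plus_1)]

    # figures 3 and 4 as parents (constant levels only, iterator components omitted)
    P_i = [{1: 0, 2: 0}, {1: 0, 2: 1}]    # loop nest (b)
    P_j = [{1: 1, 2: 0}, {1: 0, 2: 0}]    # loop nest (c)
    child_i = group_based_crossover(P_i, P_j)
    child_j = group_based_crossover(P_j, P_i)
    print(child_i, child_j)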

4.3.3 Iterator-Based Crossover

Iterator-based crossover aims to produce encodings which exploit the wide range of statement-reordering and loop-fusion transformations that GAPS-style time-mappings can represent; specifically, to fuse loop-nests and reorder statements having different nesting depths in an arbitrary fashion. An example of such an application of loop-fusion is illustrated in the original and the transformed loop nests of figures 8 and 9 respectively. Iterator-based crossover combines information concerning $\sigma_{j_p}$ from two parents $P_i, P_j$ and produces a single child $C_i$. A second application of this operator on $P_j, P_i$ creates child $C_j$. This operator randomly selects a set of statements $P : 1 \le |P| \le 2|\forall p|/3$ and imposes the positions at which $\sigma_{j_p} \neq 0 : p \in P$ from $P_i$ onto a child encoding. The total ordering of the $c_{k_p}$'s in the child, and the positions at which $\sigma_{j_q} \neq 0 : q \notin P$, are inherited from the encoding of $P_j$. Consider figures 7 and 8 to be parent encodings $P_i$ and $P_j$, whilst figures 9 and 10 are the children $C_i$ and $C_j$ that are produced when statement one is randomly selected.

do i = 1,1024
1:  a(i) = a(i) * b(i) + d(i)
    do j = 1,1024
2:    e(j,i) = b(j) + d(j)
    enddo
enddo

$T_1 : [i] \rightarrow [0, i, 0, 0, 0]$, $T_2 : [i, j] \rightarrow [0, i, 0, j, 1]$
Figure 7: $P_i$ for iterator-based crossover.

do i = 1,1024
1:  a(i) = a(i) * b(i) + d(i)
enddo
do j = 1,1024
  do k = 1,1024
2:  e(k,j) = b(k) + d(k)
  enddo
enddo

$T_1 : [i] \rightarrow [0, 0, 0, i, 0]$, $T_2 : [i, j] \rightarrow [1, i, 0, j, 0]$
Figure 8: $P_j$ for iterator-based crossover.

do i = 1,1024
  do j = 1,1024
    if i.EQ.1
1:    a(j) = a(j) * b(j) + d(j)
    endif
2:  e(j,i) = b(j) + d(j)
  enddo
enddo

$T_1 : [i] \rightarrow [0, 0, 0, i, 0]$ and $T_2 : [i, j] \rightarrow [0, i, 0, j, 1]$
Figure 9: $C_i$ for iterator-based crossover.

do i = 1,1024
1:  a(i) = a(i) * b(i) + d(i)
enddo
do i = 1,1024
  do j = 1,1024
2:  e(j,i) = b(j) + d(j)
  enddo
enddo

$T_1 : [i] \rightarrow [0, i, 0, 0, 0]$, $T_2 : [i, j] \rightarrow [1, i, 0, j, 0]$
Figure 10: $C_j$ for iterator-based crossover.

4.4 Evaluation Functions

In this paper we present results for two distinct evaluation functions, one which produces a prediction of overhead and another that actually measures the execution time of a parallelised program.

4.4.1 Predicted Overhead

The evaluation function produces a naive prediction of the loop and synchronisation overhead of the transformed program. Communication overhead is not considered by the evaluation function because partitioning decisions are currently specified by petit in the form of space-mappings. A simple one-dimensional loop-nest is modelled as having computation, loop and synchronisation costs. The cost of a barrier synchronisation is modelled as $PL$, whilst post-wait synchronisation, which requires fewer inter-processor messages and enables pipeline-style parallelism to be exploited, is modelled as $PL/2$, where $P$ is the number of processors and $L$ is the latency of inter-processor messages. Computation costs are modelled as the sum of the costs of the statements in the loop, whereas the overhead of the loop itself is modelled as $x$. The total loop cost is then equal to the number of iterations $y$ multiplied by the sum of the computation, synchronisation and loop-overhead costs. The values of $P, L, x$ and $y$ are set by default to $P = 10$, $L = 10$, $x = 1$ and $y = 40$. In order to apply this model of loop and synchronisation overhead to general loop-structures it is necessary to evaluate the cost of an inner loop before the cost of an outer loop can be evaluated, whilst the costs of non-nested loops, statements and synchronisation are summed.

Equations 11 and 12 specify that the members of the set $G'_{k+1}$ are the groups $g_{p_{k+1}}$ for those statements which were in the same group $g_{k_i}$ at level $k$. Equation 13 specifies that the cost of a group $g_{k_i}$ at level $k$ is equal to the sum of the costs of the groups $g'_{k+1_i}$ plus any synchronisation cost $Synch(g_{k_i})$ due to $g_{k_i}$. Note that the costs of the groups $g'_{k+1_i}$ may include loops. For the synchronisation cost $Synch(g_{k_i})$ in equation 14, $B_k(g_{k_i})$ and $P_k(g_{k_i})$ respectively represent the number of barrier and post-wait synchronisation constructs present in the transformed source because of $g_{k_i}$ at level $k$. Equations 15 and 16 then specify how an evaluation cost for a time-mapping encoding is produced.

$G_k = \bigcup_p g_{p_k}$   (11)

$G'_{k+1} = \bigcup_p g_{p_{k+1}} : \forall p,\ g_{p_k} \equiv g_{k_i} \wedge g_{k_i} \in G_k$   (12)

$\forall g_{k_i} \in G_k : C(g_{k_i}) = Synch(g_{k_i}) + \Big(\sum_{g'_{k+1_i} \in G'_{k+1}} C(g'_{k+1_i}) + x\Big)\,y$, where $(y = 1, x = 0$ if $\neg\exists(\sigma_{k+1_p} \neq 0)$, otherwise $x = 1, y = 40)$   (13)

$Synch(g_{k_i}) = (2 B_k(g_{k_i}) + P_k(g_{k_i}))\,PL/2$   (14)

$\forall g_{n_i} \in G_n : C(g_{n_i}) = 1 + Synch(g_{n_i})$   (15)

$Evaluation = \sum_{k=n}^{0} \sum_{g_{k_i} \in G_k} C(g_{k_i})$   (16)

Consider the code fragment in figure 11, with time-mappings and groupings for each statement. The costs of the groups at the $n$th (4th), 2nd and 0th levels, and the resulting evaluation, are:

Level 4 costs: $C(0) = 1$, $C(1) = 1$, $C(3) = 1$, $C(2) = 1$   (17)
Level 2 costs: $C(0) = 1$, $C(1) = 1$, $C(3) = 80$, $C(2) = 1$   (18)
Level 0 costs: $C(0) = 80 + PL/2$, $C(1) = 3280 + PL$, $C(2) = 80$   (19)
$Evaluation = 3440 + 3PL/2$   (20)

if myid.EQ.0
  do i = 1,1024
1:  a(i) = b(i) + c
  enddo
endif
post(myid)
if myid.GT.0
  wait(myid-1)
endif
do j = max(0,lb),min(1024,ub)
2: b(j) = c(j) * d(j) + a(j)
  do k = max(1,lb),min(1023,ub)
3:  e(k,j) = b(k) + d(j)
  enddo
enddo
barrier()
do l = max(2,lb),min(1023,ub)
4: a(l) = b(l-1) + b(l+1)
enddo

$T_1 : [i] \rightarrow [0, i, 0, 0, 0]$, $T_2 : [j] \rightarrow [1, j, 0, 0, 0]$, $T_3 : [j, k] \rightarrow [1, j, 1, k, 0]$, $T_4 : [l] \rightarrow [2, l, 0, 0, 0]$
$g_{1_0} = 0$, $g_{2_0} = 1$, $g_{3_0} = 1$, $g_{4_0} = 2$
$g_{1_2} = 0$, $g_{2_2} = 1$, $g_{3_2} = 3$, $g_{4_2} = 2$
$g_{1_4} = 0$, $g_{2_4} = 1$, $g_{3_4} = 3$, $g_{4_4} = 2$

Figure 11: Evaluation function example.
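The cost model can be sketched as a small recursive evaluation over a loop/synchronisation tree. This is illustrative only: the tree below is a hand-built rendering of figure 11 rather than something GAPS constructs explicitly, and the statement-level and group-level cases of equations 13 and 15 are folded into one recursive function.

    # Illustrative sketch only: the loop/synchronisation cost model with the
    # default constants P=10, L=10, x=1, y=40.  The "tree" below is a hand-built
    # rendering of figure 11; GAPS derives the structure from the groups G_k.
    P, L, x_loop, y_iters = 10, 10, 1, 40

    def cost(node):
        if node["kind"] == "stmt":
            return 1                                   # unit computation cost
        if node["kind"] == "barrier":
            return P * L                               # barrier synchronisation
        if node["kind"] == "postwait":
            return P * L / 2                           # post-wait synchronisation
        if node["kind"] == "loop":
            body = sum(cost(c) for c in node["body"])
            return (body + x_loop) * y_iters           # iterations * (body + loop overhead)
        if node["kind"] == "seq":                      # non-nested costs are summed
            return sum(cost(c) for c in node["body"])
        raise ValueError(node["kind"])

    stmt = {"kind": "stmt"}
    figure11 = {"kind": "seq", "body": [
        {"kind": "loop", "body": [stmt]},                             # statement 1
        {"kind": "postwait"},
        {"kind": "loop", "body": [stmt,                               # statement 2
                                  {"kind": "loop", "body": [stmt]}]}, # statement 3
        {"kind": "barrier"},
        {"kind": "loop", "body": [stmt]},                             # statement 4
    ]}
    print(cost(figure11))   # prints 3590.0 = 3440 + 3*P*L/2 with the defaults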

4.4.2 Measured Execution Time

This evaluation function compiles the SPMD code and runs the executable 3 times on 2 processors. The function returns the average of the 2 best runs as its evaluation, in order to minimise any interference from other jobs. The compilation routes from FORTRAN to an executable for PFA, GAPS and petit are illustrated in figure 12. Note that f2p merely translates FORTRAN into the petit-language.
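In outline, this evaluation function behaves as in the sketch below. It is illustrative only: the compiler command, the -lp4 link flag, the executable name and the way processors are requested are hypothetical placeholders, not GAPS' actual build scripts.

    # Illustrative sketch only: measure the execution time of a generated SPMD
    # program by running it three times and averaging the best two runs.
    # The compile/run commands are hypothetical placeholders.
    import subprocess, time

    def measured_execution_time(spmd_source="gaps_out.c", processors=2, runs=3):
        subprocess.run(["cc", "-O2", spmd_source, "-o", "gaps_exe", "-lp4"],
                       check=True)                     # link with the P4 runtime
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(["./gaps_exe", "-np", str(processors)], check=True)
            timings.append(time.perf_counter() - start)
        best_two = sorted(timings)[:2]                 # discard the slowest run
        return sum(best_two) / len(best_two)           # average of the 2 best runs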

5 Method and Results

The initial GAPS configuration is concerned purely with optimisation of restructuring transformations. GAPS currently optimises the application of loop-distribution, loop-fusion and statement-reordering transformations, whilst loop-permutation(4) decisions and iteration-partitioning are determined during a preprocessing step which invokes petit. GAPS seeds one member of the population of 20 with encodings of restructuring transformations generated by petit, whilst the remaining 19 members are randomly initialised. Note that illegal time-mapping encodings which do not preserve program semantics are allowed into the population. This removes the obligation that random initialisation must generate legal solutions. Further, the path in the solution space from a particular encoding to the optimal solution may be considerably shortened or only be reachable by intermediary illegal solutions. Two runs of GAPS were made, the first using a prediction of loop and synchronisation overhead as the evaluation function and the other using measured 2-processor execution time. Linearly normalised fitness then determines the selection probability of individuals ranked using the evaluation function. The steady-state reproduction policy uses the group-based mutation/crossover and the iterator-based crossover operators with equal and fixed selection probability. GAPS preserves elitism(5) by selecting individuals for deletion from the lower ranked half of the population.

Figure 13 plots the predicted overhead of the best legal solution in the population against the number of reproductions. Figure 14 presents the temporal performance of petit, PFA and the GAPS solution having the best predicted loop and synchronisation overhead. Figure 15 plots the measured 2-processor execution time evaluation function of every legal solution generated against the number of reproductions. Figure 16 presents the temporal performance of petit, PFA and the best solution generated by GAPS when using measured 2-processor execution time.

(4) Loop-permutation is equivalent to any combination of loop-interchange and loop-reversal.
(5) Elitism ensures that the best solution in a population is never deleted.

[Figure 12: Compilation Routes of PFA, GAPS and petit. FORTRAN source is compiled directly by PFA, or translated by f2p into the petit language and then compiled by either GAPS or petit into SPMD C-code, which is linked with the P4 runtime library to produce an executable.]

[Figure 13: Predicted overhead of ADI. Plot of the predicted overhead of the best solution (y-axis, 720000-800000) against reproductions (x-axis, 1-10000, logarithmic scale).]

[Figure 14: Performance using predicted overhead. Plot of 1/(execution time) against 1-8 processors for GAPS, petit (Omega) and PFA (Native SGI).]

[Figure 15: Measured execution time for ADI. Plot of the measured 2-processor execution time (y-axis, 55-85) of legal solutions against reproductions (x-axis, 0-20000).]

[Figure 16: Performance using measured execution time. Plot of 1/(execution time) against 2-16 processors for GAPS, petit (Omega Project) and PFA (Native SGI).]

6 Conclusions & Future Work

GAPS proposes a framework for the application of GA optimisation to autoparallelising compiler technology. Preliminary GAPS results for the ADI benchmark produce encouraging performance improvements of 21-25% in parallel execution time over petit when measured execution time is used as the evaluation function, whilst PFA does not parallelise the code. Note that this performance was achieved in spite of the fact that only 5.5% of GAPS' generated solutions (out of 20000 reproductions) corresponded to legal programs. GAPS with predicted loop and synchronisation overhead as an evaluation function produced performance improvements of between 10-17% in parallel execution time over petit, whilst legal solutions were generated 3.2% of the time. GAPS achieves the best performance using measured execution time as its evaluation function, but still has respectable results using a naive prediction of loop and synchronisation overhead. An important but obvious point to note is that measured execution time is a portable metric for performance on different target architectures. Thus, the GAPS framework using such a metric offers the possibility of an architecture independent parallelising compiler. Future work will investigate how to generate legal solutions with higher frequency. We plan to achieve this through the exploitation of data-dependence information within reproduction operators. Ongoing/future work also concerns i) an evaluation of the initial GAPS configuration on a number of benchmarks, and ii) the implementation and evaluation of the full GAPS framework, which can apply any transformation that can be expressed by UTF time and space mappings.

7 Acknowledgements

The author would like to thank i) the Omega Project from the University of Maryland (USA) and ii) colleagues from the Centre for Novel Computing at the University of Manchester (UK).

8 References

[1] The SUIF Compiler System, http://suif.stanford.edu/
[2] Polaris, Automatic Parallelization of Conventional Fortran Programs, http://polaris.cs.uiuc.edu/polaris/polaris.html
[3] W.A. Kelly (1996), Optimization within a Unified Transformation Framework, PhD Thesis, University of Maryland, USA.
[4] L. Davis (1991), Handbook of Genetic Algorithms, Van Nostrand Reinhold.
[5] The Omega Project, Frameworks and Algorithms for the Analysis and Transformation of Scientific Programs, http://www.cs.umd.edu/projects/omega/
[6] B.M. Chapman, H.P. Zima, J. Hulman, S. Andel (1994), Intelligent Parallelization Within the Vienna Fortran Compilation System, in Proc. 4th Workshop on Compilers for Parallel Computers, Delft.
[7] N. Mansour, G.C. Fox (1991), A Hybrid Genetic Algorithm for Task Allocation in Multicomputers, International Conference on Genetic Algorithms ICGA91.
[8] P. Walsh, C. Ryan (1995), Automatic Conversion of Programs from Serial to Parallel using Genetic Programming - The Paragen System, in Proceedings of ParCo '95, Springer-Verlag.
[9] C. Ryan, P. Walsh (1997), The Evolution of Provable Parallel Programs, in Proceedings of Genetic Programming 1997.
[10] P. Walsh, C. Ryan (1997), A New Technique for Loop Parallelization for Distributed Memory Machines using Genetic Programming, Technical Report, University College Cork, Ireland.
[11] K.P. Williams, S.A. Williams (1996), Genetic Compilers: A New Technique for Automatic Parallelisation, in Proceedings of the 2nd European School on Parallel Programming Environments for High Performance Computing, Alpe d'Huez, France, April.
[12] H. Zima, B. Chapman (1991), Supercompilers for Parallel and Vector Computers, ACM Press.
[13] G. Syswerda (1989), Uniform Crossover in Genetic Algorithms, International Conference on Genetic Algorithms 89.