GAPS: A Compiler Framework for Genetic Algorithm (GA) Optimised Parallelisation

Andy Nisbet
Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, U.K.
(Supported by EPSRC project GR/K82291, [email protected].)

Abstract

The compilation of FORTRAN programs for SPMD execution on parallel architectures often requires the application of program restructuring transformations such as loop interchange, loop distribution, loop fusion, loop skewing and statement reordering. Determining the optimal transformation sequence that minimises execution time for a given program is an NP-complete problem. The hypothesis of the research described here is that genetic algorithm (GA) techniques can be used to determine sequences of restructuring transformations which are better than, or as good as, those produced by more conventional compiler search techniques. The Genetic Algorithm Parallelisation System (GAPS) compiler framework is presented. GAPS uses GA optimisation to determine the restructuring transformation applied to each statement and its associated iteration space. The hypothesis of GAPS is tested with an evaluation of the performance of SPMD code produced by GAPS for an ADI 1024x1024 benchmark on an SGI Origin 2000.

1 Introduction

State-of-the-art optimising compilers [1,2,4] utilise transformation sequences which attempt to minimise individual sources of overhead largely in isolation, since considering all overheads simultaneously is not practical. An infinite number of transformation sequences can be applied, and determining the eventual total overhead for a particular target architecture is in general intractable. Compiler technology requires a strategy which can apply transformations so as to minimise all sources of overhead globally. Such a strategy is possible with the use of genetic algorithm techniques.

2 Genetic Algorithm (GA) Optimisation

GAs iteratively seek better solutions to problems in the same way that evolution has optimised populations of living organisms for their habitat and environment. The process of reproduction creates new child organisms through the application of recombination and mutation operators to encodings of parent organisms. The processes of natural selection ensure that successful organisms reproduce more often than less successful ones. In a GA, an organism represents a potential solution in a problem domain. A population of organisms represents a subset of the possible solutions to a problem. Solutions are represented in a population as encodings. The success of a solution encoding in solving a problem is measured by its fitness value, which is calculated by an evaluation function. High-fitness solution encodings are selected for reproduction more frequently than low-fitness solutions in order to mimic natural selection. Thus, GAs can evolve good solutions to complex optimisation problems via the manipulation of solution encodings.

2.1 GA Optimised Compilation

GAs can be applied to compilation problems using three basic concepts: i) an encoding describes a legal ordered set of transformations¹; ii) mutation and recombination reproduction operators describe mechanisms that alter encodings; iii) an evaluation function represents a target architecture plus the application program to be compiled. Evaluation functions calculate the total overhead of the application program when transformed by an encoding, and give high fitness values to encodings which yield low overheads.

¹ Program semantics are preserved when a legal ordered set of transformations is applied to a program.
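A minimal sketch in Python illustrates how these three concepts interact. Every helper passed in (random_legal_encoding, mutate, crossover, predict_overhead) is a hypothetical placeholder for the machinery described above, and the crossover rate and survivor fraction are arbitrary choices:

    import random

    def genetic_search(random_legal_encoding, mutate, crossover,
                       predict_overhead, pop_size=20, generations=100):
        # Encodings are legal ordered sets of transformations; the
        # evaluation function rewards low predicted total overhead.
        population = [random_legal_encoding() for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=predict_overhead)   # best (lowest overhead) first
            parents = population[:pop_size // 2]    # fitter encodings reproduce
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = random.sample(parents, 2)
                child = crossover(a, b) if random.random() < 0.7 else mutate(a)
                children.append(child)
            population = parents + children
        return min(population, key=predict_overhead)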

[Figure 1: The GAPS Approach. A flowchart of the GA loop: the population is initialised (from random methods, programmer solutions or compiler strategies) and its fitness evaluated; if performance is satisfactory the search stops, otherwise parent encodings are selected, a reproduction operator is selected (recombination/mutation operators, compiler-strategy operators, or operators using overhead-prediction/profile information), new solutions are generated, and reproduction-operator selection probabilities are recalculated from measured execution time or overhead prediction before re-evaluation.]

3 GAPS Framework

The GAPS framework (as depicted in figure 1) will evaluate the applicability and performance benefit of GA optimisation techniques in the compilation of loop-based programs for parallel architectures. Conventional compilation techniques will be used to create hybrid GAs which either utilise problem-specific knowledge² to generate the initial population of solution encodings, or which use conventional compiler techniques as reproduction operators. This is in contrast to pure GAs, which use no problem-specific knowledge and are therefore independent of target architectures and application programs. Hybrid techniques can guarantee that the best solution generated is no worse than that of the conventional techniques used to hybridise the GA. GAPS will investigate the effect and use of overhead prediction and instrumented/profiled execution as evaluation functions. Overhead prediction uses cost models of parallel architectures to estimate the total overhead of a transformed application program on a target architecture. Information from overhead prediction/instrumented execution will be used by hybrid GAs to identify transformation(s) which can reduce program overheads.

² Problem-specific knowledge concerns both the application and the execution environment of a parallel architecture.

3.1 Preliminary GAPS Configuration

The initial GAPS configuration is concerned purely with optimisation of restructuring transformations. GAPS currently optimises the application of loop-distribution, loop-fusion and statement-reordering transformations, whilst loop-permutation³ decisions and iteration-partitioning are determined during a preprocessing step which invokes petit⁴ [3]. GAPS seeds one member of the population of 20 with encodings of restructuring transformations generated by petit, whilst the remaining 19 members are randomly initialised. Linearly normalised fitness determines the selection probability of individuals ranked using predictions of loop and synchronisation overhead. Steady-state reproduction uses one mutation and two crossover operators with equal and fixed selection probability. GAPS preserves elitism⁵ by selecting individuals for deletion from the lower-ranked half of the population.

³ Loop-permutation is equivalent to any combination of loop-interchange and loop-reversal.
⁴ A prototype parallelising compiler.
⁵ Elitism ensures that the best solution in a population is never deleted.
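The selection and replacement scheme can be sketched as follows (an illustrative reading of this configuration rather than GAPS source; the normalisation constant is an assumption):

    import random

    def linearly_normalised_probabilities(n, top=2.0):
        # Fitness depends only on rank: the best individual receives `top`,
        # decreasing linearly towards zero; values are normalised into
        # selection probabilities. The constant `top` is assumed.
        step = top / n
        raw = [top - i * step for i in range(n)]
        total = sum(raw)
        return [f / total for f in raw]

    def steady_state_step(ranked, reproduce):
        # `ranked` is the population sorted best-first by predicted loop and
        # synchronisation overhead. Two parents are drawn by rank probability,
        # one child is created, and the victim is drawn from the lower-ranked
        # half, so the best member always survives (elitism).
        probs = linearly_normalised_probabilities(len(ranked))
        p1, p2 = random.choices(ranked, weights=probs, k=2)
        child = reproduce(p1, p2)
        victim = random.randrange(len(ranked) // 2, len(ranked))
        ranked[victim] = child
        return ranked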

3.2 GAPS Infrastructure

GAPS uses the Unified Transformation Framework (UTF) [4] from the Omega Project [3]. The UTF provides two simple mathematical abstractions, time-mappings and space-mappings, which respectively and concisely specify i) the restructuring transformations applied, and ii) the partition of computation for parallel execution. Single Program Multiple Data (SPMD) code can then be generated.

Time-mappings

$\forall j,\ 1 \le j \le m_p \Rightarrow L_j(x_1, \ldots, x_{j-1}, \vec{s}) \le x_j \le U_j(x_1, \ldots, x_{j-1}, \vec{s})$   (1)

$T_p : [i_{p_1}, \ldots, i_{p_{m_p}}] \rightarrow [f_{p_0}, \ldots, f_{p_n}] \mid C_p$   (2)

An iteration-space $I_p$ represents the set of iterations with which statement $p$ will be executed. The iteration-space of a statement $p$ surrounded by $m_p$ loops is specified as $[x_1, \ldots, x_{m_p}] \in I_p$ such that equation 1 holds, where $L_j$ and $U_j$ are functions representing the lower and upper bounds respectively of the $j$th loop around $p$, and $\vec{s}$ is a vector of symbolic constants. Time-mappings of the form of equation 2 specify the iteration-reordering transformation applied to the iteration-space $I_p$ of statement $p$, where the iteration variables $i_{p_1}, \ldots, i_{p_{m_p}}$ represent the $m_p$ loops nested around $p$. Note that for simplicity the range of each $T_p$ has $n+1$ components, where the positions $0, \ldots, n$ are referred to as levels. The $f_{p_i}$ expressions are quasi-affine⁶ functions of the iteration variables and symbolic constants, whilst $C_p$ is an optional restriction on the domain of the mapping. The condition $C_p$ enables piecewise time-mappings to be specified as $\bigcup_i T_{p_i} \mid C_{p_i}$, where the $T_{p_i} \mid C_{p_i}$'s are time-mappings with disjoint domains. The mapping $T_p$ represents the fact that the iteration $[i_{p_1}, \ldots, i_{p_{m_p}}]$ in the original iteration space of statement $p$ is mapped to iteration $[f_{p_0}, \ldots, f_{p_n}]$ in the new iteration space if condition $C_p$ is true. As with unimodular transformations, the points in the new iteration space are then executed in the order specified by the lexicographic operator $\prec$ as in equation 3.

⁶ Quasi-affine expressions are affine functions plus integer division and remainder when dividing by a constant.

$(x_1, \ldots, x_n) \prec (y_1, \ldots, y_n) \iff \exists m : (\forall i,\ 1 \le i < m \Rightarrow x_i = y_i) \wedge x_m < y_m$   (3)

The concept of a group is introduced in order to relate the structure and the number of loops in a transformed program to the set of time-mappings $\forall T_p$. Two statements $p, q$ are said to be in the same group $g_p^k \equiv g_q^k$ at level $k$ if $\forall i : (0 \le i \le k) \Rightarrow (f_{p_i} \equiv f_{q_i})$, where $f_{p_i}$ and $f_{q_i}$ are constants. The effect of two statements existing in the same group at a level $k$ is that the execution periods of any loops specified by $\forall i : ((0 \le i \le k) \wedge ((f_{p_i} \ne \text{constant}) \vee (f_{q_i} \ne \text{constant})))$ will overlap. The $g_p^k$'s are given positive integer values such that $0 \le g_p^n \le |\forall p| - 1 \wedge |\bigcup_p g_p^n| = |\forall p|$. This means that although statements may be in the same group at levels $< n$, at level $n$ each statement must be in a group that is distinct from that of every other statement. Thus, time-mappings specify a total order for all statements in the restructured source code. The code fragments and time-mappings in figures 2, 3 and 4 illustrate the effect of different groupings on the total order of statements and on the number and structure of loops.

The iteration-reordering program transformations specified by the set of time-mappings $\forall T_p$ are only legal if the original program semantics are preserved. This is true if i) the restructured program performs exactly the same set of computations, and ii) the new ordering of iterations respects all control-flow and data dependences in the original code. If $i$ is an iteration of statement $p$ and $j$ an iteration of statement $q$, and the dependence relation $d_{pq}$ indicates that a dependence exists from $i$ to $j$, then $T_p(i)$ must be executed before $T_q(j)$, as in equation 4, where $Sym$ is the set of all symbolic constants.

$\forall i, j, p, q, Sym : i \rightarrow j \in d_{pq} \Rightarrow T_p(i) \prec T_q(j)$   (4)
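To make equations 3 and 4 concrete, a minimal sketch in Python follows (not GAPS code; UTF manipulates symbolic dependence relations, so representing dependences as explicitly enumerated iteration pairs is a simplifying assumption):

    def lex_before(x, y):
        # Equation 3: x precedes y iff they agree on a prefix and x is
        # smaller at the first position where they differ.
        for xi, yi in zip(x, y):
            if xi != yi:
                return xi < yi
        return False

    def mappings_legal(dependences, time_map):
        # Equation 4, with dependences given as (p, i, q, j) tuples:
        # iteration i of statement p must still execute before
        # iteration j of statement q under the new time-mappings.
        return all(lex_before(time_map[p](i), time_map[q](j))
                   for (p, i, q, j) in dependences)

For example, with the figure 2 mappings time_map = {1: lambda i: (0, i[0], 0), 2: lambda j: (1, j[0], 0)}, any dependence from statement 1 to statement 2 is respected, because level 0 already orders the two statements.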

Space-mappings

$S_p : [i_{p_1}, \ldots, i_{p_{m_p}}] \rightarrow [v_{p_1}, \ldots, v_{p_{l_p}}]$   (5)

Space-mappings of the form of equation 5 specify the partition of a statement $p$'s iterations to a virtual processor array, where the $i_{p_1}, \ldots, i_{p_{m_p}}$ are iteration variables, $l_p$ is the dimensionality of the virtual processor array, and the $v_{p_i}$'s are affine functions of the iteration variables. As with data-distributions, the virtual processor array is then folded onto the physical processor array in either a blocked, cyclic or block-cyclic fashion. GAPS currently uses the space-mappings generated by petit, which partitions iteration-spaces in one dimension such that $l_p = 1$.
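The folding step can be pictured with a small sketch (illustrative only; the function and parameter names are assumptions, and GAPS inherits these decisions from petit rather than computing them this way):

    def physical_owner(v, n_virtual, n_phys, scheme="block", blk=1):
        # Fold virtual processor index v (0-based) onto a physical
        # processor under the three standard schemes.
        if scheme == "block":
            block = -(-n_virtual // n_phys)   # ceil(n_virtual / n_phys)
            return v // block
        if scheme == "cyclic":
            return v % n_phys
        if scheme == "block-cyclic":
            return (v // blk) % n_phys
        raise ValueError(scheme)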

3.3 Time-mapping Encodings

A member of the population $P_i$ is represented by one-dimensional space-mappings (as generated by petit) and time-mapping encodings for each and every statement in the original program. The time-mappings generated by petit are of the form in equation 6.

   do i = 1,1024
1:   a(i) = c(i) + d(i)
   enddo
   do j = 1,1024
2:   b(j) = c(j) + d(j)
   enddo

$T_1 : [i] \rightarrow [0, i, 0]$, $T_2 : [j] \rightarrow [1, j, 0]$; $g_1^0 = 0,\ g_2^0 = 1$

Figure 2: Loop nest (a).

   do t = 1,1024
1:   a(t) = c(t) + d(t)
2:   b(t) = c(t) + d(t)
   enddo

$T_1 : [i] \rightarrow [0, i, 0]$, $T_2 : [j] \rightarrow [0, j, 1]$; $g_1^0 = 0,\ g_2^0 = 0$, $g_1^2 = 0,\ g_2^2 = 1$

Figure 3: Loop nest (b).

   do j = 1,1024
2:   b(j) = c(j) + d(j)
   enddo
   do i = 1,1024
1:   a(i) = c(i) + d(i)
   enddo

$T_1 : [i] \rightarrow [1, i, 0]$, $T_2 : [j] \rightarrow [0, j, 0]$; $g_1^0 = 1,\ g_2^0 = 0$

Figure 4: Loop nest (c).

In equation 6, $p_j$ is the number of the loop in position $j$ according to the given loop permutation of $p$, and the $c_p^0, \ldots, c_p^{m_p}$ and the $d_p^1, \ldots, d_p^{m_p}$ are integer constants. The $c_p^0, \ldots, c_p^{m_p}$ correspond to loop-distribution, loop-fusion and statement-reordering transformations, whilst the $d_p^1, \ldots, d_p^{m_p}$ correspond to loop-alignment transformations. Time-mappings having this form simplify some aspects of code-generation.

$T_p : [i_{p_1}, \ldots, i_{p_{m_p}}] \rightarrow [c_p^0,\ i_{p_1} \cdot d_p^1,\ c_p^1,\ i_{p_2} \cdot d_p^2,\ \ldots,\ i_{p_{m_p}} \cdot d_p^{m_p},\ c_p^{m_p}]$   (6)

GAPS currently generates time-mappings of the form in equation 7, where each $\ell_p^j$ is either $0$ or $i_{p_k} \cdot d_p^k$ such that the given loop permutation of $p$ is respected, as in equation 8.

$T_p : [i_{p_1}, \ldots, i_{p_{m_p}}] \rightarrow [c_p^0,\ \ell_p^1,\ c_p^1,\ \ell_p^2,\ \ldots,\ \ell_p^n,\ c_p^n]$   (7)

$((\ell_p^j = 0) \vee (\ell_p^j = i_{p_k} \cdot d_p^k)) \wedge \left(k = \sum_{l=1}^{j-1}(\ell_p^l \ne 0) + 1\right) \wedge \left(m_p = \sum_{l=1}^{n}(\ell_p^l \ne 0)\right)$   (8)

Such time-mappings can represent a wider range of statement-reordering and loop-fusion transformations than those of the form in equation 6. An encoding of a time-mapping stores i) the levels at which $\ell_p^j = i_{p_k} \cdot d_p^k$, and ii) the total order $\prec_j$ of each level of constants $\forall p,\ c_p^j$, which is specified in equation 9.

$\forall p, q : (c_p^j \prec_j c_q^j) \iff (c_p^j < c_q^j)$   (9)

Note that the total order $\prec_j$ may store degenerate information at levels where $j \ge k$, if $g_p^{k-1} \not\equiv g_q^{k-1} \wedge c_p^k \ne c_q^k$. This degenerate information behaves in a similar manner to recessive genes, because the information stored by $\forall p : (c_p^i, \ell_p^{i+1}) \wedge 0 \le i < k$ determines whether level $k$ has an effect on the groups $\forall p,\ g_p^k$. The recessive nature of this degenerate information may enable the group-based mutation and crossover reproduction operators described in section 3.4 to produce the optimum solution encoding with fewer reproductions.
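One way to picture such an encoding is the following hypothetical data structure (an illustration of this section, not the GAPS implementation):

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class StatementEncoding:
        # Levels j at which l_p^j = i_pk * d_p^k, mapped to the alignment
        # factor d_p^k; any level not present has l_p^j = 0.
        iterator_levels: Dict[int, int]

    @dataclass
    class PopulationMember:
        statements: Dict[str, StatementEncoding]
        # constant_orders[j] is the total order prec_j of equation 9 as a
        # ranked list of groups of statement ids: statements sharing a
        # group share the constant c_p^j at level j. Degenerate entries at
        # deep levels are carried along, like recessive genes.
        constant_orders: List[List[List[str]]] = field(default_factory=list)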

3.4 Reproduction Operators

An operator and two encodings $P_i, P_j$ are randomly selected for reproduction in order to create two new encodings. The operators may effect some combination of loop-fusion/distribution and statement-reordering.

Group-Based Mutation

As mentioned earlier, the effect of two statements existing in the same group at a level $k$ is that the execution periods of any loops specified by $\forall i : ((0 \le i \le k) \wedge ((\ell_p^{i+1} \ne 0) \vee (\ell_q^{i+1} \ne 0)))$ will overlap. The group-based mutation operator manipulates the grouping of a parent $P_i$'s statements in order to produce a new child encoding $C_i$ having a different group structure. A second child $C_j$ is created by another application of this operator to parent $P_j$. The alterations to group structure may change the number and structure of loops and the placement of post-wait/barrier synchronisation in the transformed SPMD program. The group-based mutation operator randomly selects a level $k$ and two statements $p, q$. The lexicographic ordering of $T_p^{k\prime}$ and $T_q^{k\prime}$ at level $k$ is then evaluated, where $T_p^{k\prime} = (c_p^0, \ldots, c_p^k)$. A new and different lexicographic ordering for $T_p^{k\prime}$ and $T_q^{k\prime}$ is then randomly selected and enforced at level $k$. In order to illustrate the effects of this operator, consider the code fragment in figure 2. The time-mapping encodings for this are $\ell_1^1 = i$, $\ell_2^1 = j$ and $\prec_0 : c_1^0 < c_2^0 \wedge \prec_1 : c_1^1 \equiv c_2^1$. Suppose level 0 is randomly selected and the new lexicographic ordering at level 0 is randomly chosen to be $T_1^{0\prime} \equiv T_2^{0\prime}$. Note that in this case $T_1^{n\prime} \equiv T_2^{n\prime}$; therefore, in order to ensure $|G^n| = |\forall p|$, the constants $c_p^n$ are randomly selected to give $T_1^{n\prime} \prec T_2^{n\prime}$, producing code as in figure 3.

Group-Based Crossover

The group-based crossover operator combines information from two parents $P_i, P_j$ and produces a single child $C_i$ with the aim of creating new statement groupings. Child $C_j$ is produced by a second application of the operator on $P_j, P_i$. An application of group-based crossover on $P_i, P_j$ creates the child $C_i$ such that the total orders $\prec_k : 0 \le k \le n$ of the $c_p^k$ constants are taken from $P_i$ if $k \in K$ and from $P_j$ if $k \notin K$, whilst the $\ell_p^l : 1 \le l \le n$ are taken from $P_i$. The size $|K|$ is randomly chosen such that $1 \le |K| \le 2(n+1)/3$. Note that in group-based crossover the structure of the groups for all statements at a particular level $k$ is potentially altered, whereas the group-based mutation operator directly alters the grouping of a statement-pair. To illustrate the effects of group-based crossover, consider figures 3 and 4 to be $P_i, P_j$, with $|K| = 1 \wedge 0 \in K$. The encoding for $P_i$ is $\ell_1^1 = i$, $\ell_2^1 = j$ and $\prec_0 : c_1^0 \equiv c_2^0 \wedge \prec_1 : c_1^1 < c_2^1$. The encoding for $P_j$ is $\ell_1^1 = i$, $\ell_2^1 = j$ and $\prec_0 : c_2^0 < c_1^0 \wedge \prec_1 : c_1^1 \equiv c_2^1$. The total order on the $c_p^k$ for child $C_i$ becomes $\prec_0 : c_1^0 \equiv c_2^0 \wedge \prec_1 : c_2^1 < c_1^1$, whilst for child $C_j$, created by an application of group-based crossover on $P_j, P_i$ with the same set $K$, it becomes $\prec_0 : c_2^0 < c_1^0 \wedge \prec_1 : c_1^1 < c_2^1$. Figures 5 and 6 illustrate the time-mappings and the restructured code for $C_i$ and $C_j$. Note that child $C_j$ has degenerate information stored by the total order of the constants $c_p^j$ for $\prec_1$. It is intended that degenerate grouping information may be activated through interactions between the group-based crossover and mutation operators.

   do i = 1,1024
2:   b(i) = c(i) + d(i)
1:   a(i) = c(i) + d(i)
   enddo

$T_1 : [i] \rightarrow [0, i, 1]$, $T_2 : [j] \rightarrow [0, j, 0]$

Figure 5: $C_i$ group crossover.

   do i = 1,1024
2:   b(i) = c(i) + d(i)
   enddo
   do j = 1,1024
1:   a(j) = c(j) + d(j)
   enddo

$T_1 : [i] \rightarrow [1, i, 0]$, $T_2 : [j] \rightarrow [0, j, 1]$

Figure 6: $C_j$ group crossover.
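In terms of the hypothetical encoding structure sketched in section 3.3, group-based crossover can be outlined as follows (illustrative only; the repair that keeps level-$n$ groups distinct, as in the mutation example above, is omitted):

    import random
    from copy import deepcopy

    def group_based_crossover(parent_i, parent_j, n):
        # Child C_i: the per-level total orders of the c_p^k constants are
        # inherited from parent_i for levels in K and from parent_j
        # otherwise; the iterator placements l_p^j come from parent_i.
        size_k = random.randint(1, max(1, (2 * (n + 1)) // 3))
        K = set(random.sample(range(n + 1), size_k))
        child = deepcopy(parent_i)
        child.constant_orders = [
            deepcopy((parent_i if k in K else parent_j).constant_orders[k])
            for k in range(n + 1)
        ]
        return child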

   do i = 1,1024
1:   a(i) = a(i)*b(i)+d(i)
     do j = 1,1024
2:     e(j,i) = b(j)+d(j)
     enddo
   enddo

$T_1 : [i] \rightarrow [0, i, 0, 0, 0]$, $T_2 : [i, j] \rightarrow [0, i, 0, j, 1]$

Figure 7: $P_i$ iterator crossover.

Iterator-Based Crossover

Iterator-based crossover aims to fuse loop-nests and reorder statements having different nesting depths. An example of such an application of loop-fusion is illustrated in the original and the transformed loop nests of figures 8 and 9 respectively. Iterator-based crossover combines information concerning the $\ell_p^j$ from two parents $P_i, P_j$ and produces a single child $C_i$. A second application of this operator on $P_j, P_i$ creates child $C_j$. This operator randomly selects a set of statements $P : 1 \le |P| \le 2(|\forall p| + 1)/3$ and imposes the positions at which $\ell_p^j \ne 0 : p \in P$ from $P_i$ onto a child encoding. The total ordering of the $c_p^k$'s in the child, and the positions at which $\ell_q^j \ne 0 : q \notin P$, are inherited from the encoding of $P_j$. Consider figures 7 and 8 to be parent encodings $P_i$ and $P_j$, whilst figures 9 and 10 are the children $C_i$ and $C_j$ that are produced when statement one is randomly selected.

3.5 Evaluation Function

The evaluation function produces a naive prediction of the loop and synchronisation overhead of the transformed program. Communication overhead is not considered by the evaluation function because partitioning decisions are currently specified by petit in the form of space-mappings. A simple one-dimensional loop-nest is modelled as having computation, loop and synchronisation costs. The cost of a barrier synchronisation is modelled as $P \cdot L$, whilst post-wait synchronisation, which requires fewer inter-processor messages and enables pipeline-style parallelism to be exploited, is modelled as $P \cdot L / 2$, where $P$ is the number of processors and $L$ is the latency of inter-processor messages. Computation costs are modelled as the sum of the costs of the statements in the loop, whereas the overhead of the loop itself is modelled as $x$. The total loop cost is then equal to the number of iterations $y$ multiplied by the sum of the computation, synchronisation and loop-overhead costs. The default values are $P = 10$, $L = 10$, $x = 1$ and $y = 40$.

   do i = 1,1024
1:   a(i) = a(i)*b(i)+d(i)
   enddo
   do j = 1,1024
     do k = 1,1024
2:     e(k,j) = b(k)+d(k)
     enddo
   enddo

$T_1 : [i] \rightarrow [0, 0, 0, i, 0]$, $T_2 : [i, j] \rightarrow [1, i, 0, j, 0]$

Figure 8: $P_j$ iterator crossover.

   do i = 1,1024
     do j = 1,1024
       if i.EQ.1
1:       a(j) = a(j)*b(j)+d(j)
       endif
2:     e(j,i) = b(j)+d(j)
     enddo
   enddo

$T_1 : [i] \rightarrow [0, 0, 0, i, 0]$, $T_2 : [i, j] \rightarrow [0, i, 0, j, 1]$

Figure 9: $C_i$ iterator crossover.

   do i = 1,1024
1:   a(i) = a(i)*b(i)+d(i)
   enddo
   do i = 1,1024
     do j = 1,1024
2:     e(j,i) = b(j)+d(j)
     enddo
   enddo

$T_1 : [i] \rightarrow [0, i, 0, 0, 0]$, $T_2 : [i, j] \rightarrow [1, i, 0, j, 0]$

Figure 10: $C_j$ iterator crossover.

In order to apply this model of loop and synchronisation overhead to general loop-structures, it is necessary to evaluate the cost of an inner loop before the cost of an outer loop can be evaluated, whilst the costs of non-nested loops, statements and synchronisation are summed.
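Under the stated defaults, the model amounts to a recursive walk over a loop tree, costing inner loops first and summing siblings. The following sketch is an illustrative reading of the model; the tree representation and node tags are assumptions:

    P, L, X, Y = 10, 10, 1, 40   # processors, message latency, loop overhead, iterations

    def overhead(node):
        # A node is ('stmt', cost), ('sync', 'barrier' | 'post-wait')
        # or ('loop', [children]).
        kind = node[0]
        if kind == "stmt":
            return node[1]
        if kind == "sync":
            return P * L if node[1] == "barrier" else P * L / 2
        if kind == "loop":
            # inner loops are costed before the enclosing loop; the costs
            # of non-nested children are summed
            body = sum(overhead(child) for child in node[1])
            return Y * (body + X)
        raise ValueError(kind)

For example, overhead(("loop", [("stmt", 1), ("sync", "barrier")])) gives $y \cdot (1 + P \cdot L + x) = 40 \cdot 102 = 4080$ under the defaults.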

4 Results, Conclusions and Future Work

Figure 11 plots the predicted loop and synchronisation overhead of the best solution for the ADI 1024x1024 benchmark over the course of a GA run. Figure 12 presents the temporal performance of the ADI benchmark when transformed by the GAPS, PCA (Native SGI) and petit (Omega Project) compilers on an SGI Origin 2000.

[Figure 11: Predicted overhead of ADI. Predicted overhead of the best solution (falling from roughly 800000 to 720000) against reproductions (1 to 10000, log scale).]

[Figure 12: Performance of final ADI. 1/(execution time) (0.01 to 0.07) against 1-8 processors for GAPS, petit (Omega) and PCA (Native SGI).]

GAPS proposes a framework for the application of GA optimisation to autoparallelising compiler technology. Preliminary GAPS results for the ADI benchmark show small but encouraging performance improvements of 5-6% over UTF on 1-8 processors, whilst PCA does not parallelise the code. Ongoing work concerns i) an evaluation of the initial GAPS configuration on a number of benchmarks, and ii) the implementation and evaluation of the full GAPS framework, which can apply any transformation that can be expressed by UTF time- and space-mappings. The GA techniques employed by GAPS will be developed to encompass a larger number of operators, hybridisation and adaptive reproduction.

5 References

[1] The SUIF Compiler System, http://suif.stanford.edu/
[2] Polaris, Automatic Parallelization of Conventional Fortran Programs, http://polaris.cs.uiuc.edu/polaris/polaris.html
[3] The Omega Project, Frameworks and Algorithms for the Analysis and Transformation of Scientific Programs, http://www.cs.umd.edu/projects/omega/
[4] W.A. Kelly (1996), Optimization within a Unified Transformation Framework, PhD thesis, University of Maryland, USA.
