GAPS: Iterative Feedback Directed Parallelisation Using Genetic Algorithms

Andy Nisbet
Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, U.K.
[email protected]

Supported by EPSRC project GR/K82291.
Abstract

The compilation of FORTRAN programs for SPMD execution on parallel architectures often requires the application of program restructuring transformations such as loop interchange, loop distribution, loop fusion, loop skewing and statement reordering. Determining the optimal transformation sequence that minimises execution time for a given program is an NP-complete problem. The hypothesis of the research described here is that genetic algorithm (GA) techniques can be used to determine transformation sequences that are better than, or as good as, those produced by more conventional compiler search techniques. The Genetic Algorithm Parallelisation System (GAPS) compiler framework is presented. GAPS uses a novel iterative, feedback-directed approach to autoparallelisation that is based upon genetic algorithm optimisation. Traditional restructuring transformations are represented as mappings that are applied to each statement and its associated iteration space. The hypothesis of GAPS is tested by comparing the performance of SPMD code produced by PFA, by petit from the Omega Project at the University of Maryland, and by GAPS for an ADI 1024×1024 benchmark on an SGI Origin 2000. Encouraging initial results show that GAPS delivers performance improvements of up to 44% when the execution times of the code produced by GAPS, PFA and petit are compared. On this benchmark, GAPS produces code having 21–25% improvements in parallel execution time.
1 Introduction

State-of-the-art optimising compilers [1,2,4] utilise transformation sequences which attempt to minimise individual sources of overhead largely in isolation, since considering all overheads simultaneously is not practical. An infinite number of transformation sequences can be applied, and determining the eventual total overhead on a particular target architecture is in general intractable. Compiler technology requires a strategy which can apply transformations so as to globally minimise all sources of overhead. Such a strategy is possible with the use of genetic algorithm techniques.
2 Genetic Algorithm (GA) Optimisation

GAs [5] iteratively seek better solutions to problems in the same way that evolution has optimised populations of living organisms for their habitat and environment. The process of reproduction creates new child organisms through the application of recombination and mutation operators to encodings of parent organisms. The processes of natural selection ensure that successful organisms reproduce more often than less successful ones. In a GA, an organism represents a potential solution in a problem domain, and a population of organisms represents a subset of the possible solutions to a problem. Solutions are represented in a population as encodings. Feedback on the success of a solution encoding in solving the problem is measured by the encoding's fitness value, which is calculated by an evaluation function. High-fitness solution encodings are selected for reproduction more frequently than low-fitness ones in order to mimic natural selection.
[Figure 1 (flowchart): initialise population → evaluate fitness → if performance is satisfactory, STOP; otherwise select parent encodings and a reproduction operator (recombination/mutation, random methods, compiler strategy operators, or operators using overhead prediction/profile information) → generate new solutions → recalculate reproduction operator selection probabilities from measured execution time and overhead prediction → re-evaluate.]
Figure 1: The GAPS Approach.
Thus, GAs can evolve good solutions to complex optimisation problems via the manipulation of solution encodings.
2.1 GA Optimised Compilation

GAs can be applied to compilation problems using three basic concepts: i) an encoding describes a legal ordered set of transformations (an ordered set of transformations is legal if program semantics are preserved when it is applied to the program); ii) mutation and recombination reproduction operators describe mechanisms that alter encodings; iii) an evaluation function represents a target architecture plus the application program to be compiled. Evaluation functions calculate the total overhead of the application program when transformed by an encoding, and give high fitness values to encodings which yield low overheads.
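As a concrete illustration, the following Python sketch shows how these three concepts might combine into one GA step. The helper names apply_transformations, total_overhead and random_transformation are hypothetical, not part of GAPS; this is a minimal sketch of the idea, not the paper's implementation.

    import random

    # Sketch: an encoding is an ordered list of transformation descriptors
    # (concept i); mutate/recombine alter encodings (concept ii); evaluate
    # stands in for "target architecture plus application program"
    # (concept iii), giving high fitness to low-overhead encodings.

    def evaluate(encoding, program):
        transformed = apply_transformations(program, encoding)  # assumed helper
        return 1.0 / (1.0 + total_overhead(transformed))        # assumed helper

    def mutate(encoding):
        child = list(encoding)
        child[random.randrange(len(child))] = random_transformation()  # assumed
        return child

    def recombine(a, b):
        cut = random.randrange(1, min(len(a), len(b)))
        return a[:cut] + b[cut:]

    def ga_step(population, program):
        ranked = sorted(population, key=lambda e: evaluate(e, program),
                        reverse=True)
        parents = ranked[: len(ranked) // 2]   # fitter encodings reproduce more
        children = [mutate(recombine(*random.sample(parents, 2)))
                    for _ in range(len(ranked) - len(parents))]
        return parents + children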
3 GAPS Framework

The GAPS framework (depicted in figure 1) aims to evaluate the applicability and performance benefit of GA optimisation techniques in the compilation of loop-based programs for parallel architectures. Conventional compilation techniques can be used to create hybrid GAs which either utilise problem-specific knowledge (concerning both the application and the execution environment of a parallel architecture) to generate the initial population of solution encodings, or use conventional compiler techniques as reproduction operators. This is in contrast to pure GAs, which use no problem-specific knowledge and are therefore independent of target architecture and application program. Hybrid techniques can guarantee that the best solution generated is no worse than that of the conventional techniques used to hybridise the GA.
3.1 GAPS Infrastructure

GAPS uses the Unified Transformation Framework (UTF) [4] from the Omega Project [3]. The UTF provides two simple mathematical abstractions, time-mappings and space-mappings, which respectively and concisely specify i) the restructuring transformations applied and ii) the partition of computation for parallel execution. Single Program Multiple Data (SPMD) code can then be generated.
3.1.1 Time-mappings
The total order in which statements, and any iterations associated with a statement, are executed by a program is described by a time-mapping $T_p$ and an iteration-space $I_p$ for each and every statement $p$ in the program. Figures 2, 3 and 4 illustrate the effect of different time-mappings on the number and structure of loops for a simple two-statement code fragment. The iteration variables $i_{p_1}, \ldots, i_{p_{m_p}}$ in the domain of a time-mapping $T_p$ represent the $m_p$ loops surrounding a particular statement $p$. In the examples in figures 2, 3 and 4, each of the time-mappings has one iteration variable, signifying that each statement is surrounded by only one loop. A mapping $T_p$ maps an iteration-space having iteration variables $[i_{p_1}, \ldots, i_{p_{m_p}}]$ in the domain onto an iteration-space in the range $[f_{p_0}, \ldots, f_{p_n}]$, where the $f_{p_k}$ expressions are quasi-affine functions (affine functions plus integer division and remainder when dividing by a constant) of the iteration variables and symbolic constants. Note that each of the positions $0, 1, \ldots, n$ in the range of a time-mapping is referred to as a level. The points in the new iteration spaces for each and every time-mapping range are then executed in the lexicographic order specified by the $\prec$ operator defined in equation 1.

    do i = 1,1024
    1:  a(i) = c(i) + d(i)
    enddo
    do j = 1,1024
    2:  b(j) = c(j) + d(j)
    enddo

$T_1 : [i] \to [0, i, 0]$, $T_2 : [j] \to [1, j, 0]$
$I_1 : \{[i] : 1 \le i \le 1024\}$, $I_2 : \{[j] : 1 \le j \le 1024\}$
$g_1^0 = 0$, $g_2^0 = 1$

Figure 2: Loop nest (a).

    do t = 1,1024
    1:  a(t) = c(t) + d(t)
    2:  b(t) = c(t) + d(t)
    enddo

$T_1 : [i] \to [0, i, 0]$, $T_2 : [j] \to [0, j, 1]$
$I_1 : \{[i] : 1 \le i \le 1024\}$, $I_2 : \{[j] : 1 \le j \le 1024\}$
$g_1^0 = 0$, $g_2^0 = 0$, $g_1^2 = 0$, $g_2^2 = 1$

Figure 3: Loop nest (b).

    do j = 1,1024
    2:  b(j) = c(j) + d(j)
    enddo
    do i = 1,1024
    1:  a(i) = c(i) + d(i)
    enddo

$T_1 : [i] \to [1, i, 0]$, $T_2 : [j] \to [0, j, 0]$
$I_1 : \{[i] : 1 \le i \le 1024\}$, $I_2 : \{[j] : 1 \le j \le 1024\}$
$g_1^0 = 1$, $g_2^0 = 0$

Figure 4: Loop nest (c).
$$(x_1, \ldots, x_n) \prec (y_1, \ldots, y_n) \iff \exists m : (\forall i, 1 \le i < m \Rightarrow x_i = y_i) \land x_m < y_m \qquad (1)$$
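Equation 1 is the standard lexicographic order on integer tuples; a minimal Python rendering (Python tuples in fact compare this way natively):

    def lex_precedes(x, y):
        """Equation 1: x precedes y iff the tuples agree up to some
        position m-1 and x_m < y_m."""
        for xi, yi in zip(x, y):
            if xi != yi:
                return xi < yi
        return False  # equal tuples: neither strictly precedes the other

    assert lex_precedes((0, 3, 0), (0, 3, 1))     # first difference at the last level
    assert lex_precedes((0, 9, 9), (1, 0, 0))     # first difference at level 0
    assert not lex_precedes((1, 2, 3), (1, 2, 3))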
The concept of a group is introduced in order to relate the structure and the number of loops in a transformed program to its associated time-mappings. Two statements $p, q$ are said to be in the same group $g_p^k \equiv g_q^k$ at level $k$ if $\forall i : (0 \le i \le k) \Rightarrow (f_{p_i} \equiv f_{q_i})$, where $f_{p_i}$ and $f_{q_i}$ are constants. The effect of two statements existing in the same group at a level $k$ is that the execution periods of any loops specified by $\forall i : (0 \le i \le k) \land ((f_{p_i} \ne \mathrm{constant}) \lor (f_{q_i} \ne \mathrm{constant}))$ will overlap. Note that although statements may be in the same group at levels $< n$, at level $n$ each statement must be in a group that is distinct from that of every other statement. Thus, time-mappings specify a total order for all statements in the restructured source code. In the examples in figures 2, 3 and 4, the $g_p^k$ are given non-negative integer values to signify the lexicographic order of the time-mappings. In figure 2 the values $g_1^0 = 0$ and $g_2^0 = 1$ signify that statement 1 is lexicographically less than statement 2 at level 0, as $0 < 1$. Conceptually, this means that statement 1 appears before statement 2 with separate (distributed) loops, because here the loop iteration variables occur after level 0. However, in figure 3, statement 1 is lexicographically less than statement 2 at level 2, and
lexicographically equal at level 1, where the loop iteration variables for statements 1 and 2 occur. Consequently, the loops for statements 1 and 2 are fused and statement 1 occurs before statement 2. In figure 5 a 2D loop having poor cache locality is presented (assuming FORTRAN memory layout). In figure 6 a permuted and distributed version of this code fragment with improved cache locality is presented. Note how the time-mapping $T_1$ in figure 6 for statement 1 permutes the loops by reversing the order of the loop iteration variables $i, j$, whilst $T_2$ in the same figure does not alter the iteration variable order.

    do i = 1,1024
      do j = 1,2048
    1:  b(i,j) = c(i) + d(i)
    2:  a(j,i) = c(j) + d(j)
      enddo
    enddo

$T_1 : [i, j] \to [0, i, 0, j, 0]$, $T_2 : [i, j] \to [0, i, 0, j, 1]$
$I_1 : \{[i, j] : 1 \le i \le 1024 \land 1 \le j \le 2048\}$
$I_2 : \{[i, j] : 1 \le i \le 1024 \land 1 \le j \le 2048\}$
$g_1^0 = 0$, $g_2^0 = 0$, $g_1^4 = 0$, $g_2^4 = 1$

Figure 5: 2D Loop.

    do i = 1,1024
      do j = 1,2048
    2:  a(j,i) = c(j) + d(j)
      enddo
    enddo
    do j = 1,2048
      do i = 1,1024
    1:  b(i,j) = c(i) + d(i)
      enddo
    enddo

$T_1 : [i, j] \to [1, j, 0, i, 0]$, $T_2 : [i, j] \to [0, i, 0, j, 0]$
$I_1 : \{[i, j] : 1 \le i \le 1024 \land 1 \le j \le 2048\}$
$I_2 : \{[i, j] : 1 \le i \le 1024 \land 1 \le j \le 2048\}$
$g_1^0 = 1$, $g_2^0 = 0$

Figure 6: Permuted and distributed 2D loop.
Legality of Time-mappings  The iteration-reordering program transformations specified by time-mappings are only legal if the original program semantics are preserved. This is true if i) the restructured program performs exactly the same set of computations and ii) the new ordering of iterations respects all control-flow and data dependences in the original code. If $i$ is an iteration of statement $p$ and $j$ an iteration of statement $q$, and the dependence relation $d_{pq}$ indicates that a dependence exists from $i$ to $j$, then $T_p(i)$ must be executed before $T_q(j)$, as in equation 2, where $Sym$ is the set of all symbolic constants.

$$\forall i, j, p, q, Sym : (i \to j) \in d_{pq} \Rightarrow T_p(i) \prec T_q(j) \qquad (2)$$
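A direct, if naive, reading of equation 2 as an executable check (a sketch over explicitly enumerated dependence instances; the Omega machinery establishes the property symbolically for all iterations):

    def time_mapping_legal(dependences, T):
        """dependences: iterable of (p, i, q, j), meaning iteration i of
        statement p must execute before iteration j of statement q.
        T[p] maps an iteration tuple to its time-mapping range tuple.
        Python's native tuple '<' is exactly the lexicographic order
        of equation 1."""
        return all(tuple(T[p](i)) < tuple(T[q](j))
                   for (p, i, q, j) in dependences)

    # e.g. with the figure 2 mappings T1: [i] -> [0, i, 0] and
    # T2: [j] -> [1, j, 0], a dependence from iteration (5,) of
    # statement 1 to iteration (5,) of statement 2 is respected:
    T = {1: lambda i: (0, i[0], 0), 2: lambda j: (1, j[0], 0)}
    assert time_mapping_legal([(1, (5,), 2, (5,))], T)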
3.1.2 Space-mappings

Space-mappings of the form of equation 3 specify the partition of a statement $p$'s iterations onto a virtual processor array,

$$S_p : [i_{p_1}, \ldots, i_{p_{m_p}}] \to [v_{p_1}, \ldots, v_{p_{l_p}}] \qquad (3)$$

where the $i_{p_1}, \ldots, i_{p_{m_p}}$ are iteration variables, $l_p$ is the dimensionality of the virtual processor array, and the $v_{p_i}$ are affine functions of the iteration variables. As with data-distributions, the virtual processor array is then folded onto the physical processor array in either a blocked, cyclic, or block-cyclic fashion. GAPS currently uses the space-mappings generated by petit, which partitions iteration-spaces in one dimension such that $l_p = 1$.
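The paper does not give folding formulas; the sketch below uses the conventional definitions of blocked, cyclic and block-cyclic distribution of a one-dimensional virtual processor array onto physical processors:

    def fold(v, n_virtual, n_phys, scheme="blocked", block=1):
        """Map virtual processor index v (0 <= v < n_virtual) onto a
        physical processor, using the conventional distribution formulas."""
        if scheme == "blocked":
            size = -(-n_virtual // n_phys)   # ceiling division: block size
            return v // size
        if scheme == "cyclic":
            return v % n_phys
        if scheme == "block-cyclic":
            return (v // block) % n_phys
        raise ValueError(scheme)

    # e.g. 8 virtual processors onto 2 physical ones:
    # blocked                -> [0,0,0,0,1,1,1,1]
    # cyclic                 -> [0,1,0,1,0,1,0,1]
    # block-cyclic (block=2) -> [0,0,1,1,0,0,1,1]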
3.2 Time-mapping Encodings

A member of the population $P_i$ is represented by one-dimensional space-mappings (as generated by petit) and time-mapping encodings for each and every statement in the original program. The time-mappings generated by petit are of the form in equation 4,

$$T_p : [i_{p_1}, \ldots, i_{p_{m_p}}] \to [c_p^0,\, i_{\pi_p(1)} + d_p^1,\, c_p^1,\, i_{\pi_p(2)} + d_p^2,\, \ldots,\, i_{\pi_p(m)} + d_p^m,\, c_p^m] \qquad (4)$$

where $\pi_p(j)$ is the number of the loop in position $j$ according to the given loop permutation $\pi_p$, and the $c_p^0, \ldots, c_p^m$ and the $d_p^1, \ldots, d_p^m$ are integer constants. The $c_p^0, \ldots, c_p^m$ correspond to loop-distribution, loop-fusion and statement-reordering transformations, whilst the $d_p^1, \ldots, d_p^m$ correspond to loop-alignment transformations. Time-mappings having this form simplify some aspects of code generation.

GAPS currently generates time-mappings of the form in equation 5, where the $\theta_p^j$ are $0$ or $i_{p_k} + d_p^k$ such that the ordering $\pi_p$ is respected, as in equation 6.

$$T_p : [i_{p_1}, \ldots, i_{p_{m_p}}] \to [c_p^0, \theta_p^1, c_p^1, \theta_p^2, \ldots, \theta_p^n, c_p^n] \qquad (5)$$

$$((\theta_p^j = 0) \lor (\theta_p^j = i_{p_k} + d_p^k)) \land \Big(k = \sum_{l=1}^{j-1} (\theta_p^l \ne 0) + 1\Big) \land \Big(m_p = \sum_{l=1}^{n} (\theta_p^l \ne 0)\Big) \qquad (6)$$

Such time-mappings can represent a wider range of statement-reordering and loop-fusion transformations than those of the form in equation 4. An encoding of a time-mapping stores i) the levels at which $\theta_p^j = i_{p_k} + d_p^k$ and ii) the total order $\prec_j$ of the constants $c_p^j$ at each level $j$, as specified in equation 7.

$$\forall p, q : (c_p^j \prec_j c_q^j) \iff (c_p^j < c_q^j) \qquad (7)$$

Note that the total order $\prec_j$ may store degenerate information at levels $j \ge k$ if $g_p^{k-1} \not\equiv g_q^{k-1}$ and $c_p^k \ne c_q^k$. This degenerate information behaves in a similar manner to recessive genes, because the information stored by $(c_p^i, \theta_p^{i+1})$ for all $p$ and $0 \le i < k$ determines whether level $k$ has an effect on the groups $g_p^k$. The recessive nature of this degenerate information may enable the group-based mutation and crossover reproduction operators described in section 3.3 to produce the optimum solution encoding with fewer reproductions.
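One way to picture such an encoding (a hypothetical data layout for illustration, not the actual GAPS implementation; the operator sketches in section 3.3 use plain dicts of constant lists for brevity):

    from dataclasses import dataclass

    @dataclass
    class StatementEncoding:
        # item i): theta[j] is True at levels where theta_p^j = i_{p_k} + d_p^k,
        # False at levels where theta_p^j = 0
        theta: list
        # item ii): constants[j] realises the total order <_j: comparing the
        # constants of two statements at level j gives their order there
        constants: list

    # Figure 2 encodings: T_1 = [0, i, 0] and T_2 = [1, j, 0]
    stmt1 = StatementEncoding(theta=[True], constants=[0, 0])
    stmt2 = StatementEncoding(theta=[True], constants=[1, 0])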
3.3 Reproduction Operators

Group-Based Mutation  This operator manipulates the grouping of a parent $P_i$'s statements in order to produce a new child encoding $C_i$ having a different group structure. A second child $C_j$ is created by another application of this operator to parent $P_j$. The alterations to group structure may change the number and structure of loops and the placement of post-wait/barrier synchronisation in the transformed SPMD program.

The group-based mutation operator randomly selects a level $k$ and two statements $p, q$. The lexicographic ordering of $T_p^{k\prime}$ and $T_q^{k\prime}$ at level $k$ is then evaluated, where $T_p^{k\prime} = (c_p^0, \ldots, c_p^k)$. A new and different lexicographic ordering for $T_p^{k\prime}$ and $T_q^{k\prime}$ is then randomly selected and enforced at level $k$. To illustrate the effects of this operator, consider the code fragment in figure 2, whose time-mapping encodings are $\theta_1^1 = i$, $\theta_2^1 = j$ and $\prec_0 : c_1^0 < c_2^0 \land \prec_1 : c_1^1 \equiv c_2^1$. Suppose level 0 is randomly selected and the new lexicographic ordering at level 0 is randomly chosen to be $T_1^{0\prime} \equiv T_2^{0\prime}$. In this case $T_1^{n\prime} \equiv T_2^{n\prime}$; therefore, in order to ensure that the time-mappings specify a total order, the constants $c_p^n$ are randomly selected to give $T_1^{n\prime} \prec T_2^{n\prime}$, producing code as in figure 3.
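A runnable sketch of this operator over a simplified representation (statements map to their constant lists; the enforcement logic below is one plausible realisation, not the actual GAPS code, and theta placements and alignment constants are omitted):

    import copy
    import random

    def prefix(c, s, k):
        # constant prefix (c_s^0, ..., c_s^k) of statement s
        return tuple(c[s][: k + 1])

    def group_based_mutation(c, n):
        """c maps each statement label to its constants [c^0, ..., c^n].
        Randomly pick a level k and statements p, q, then enforce a new,
        different lexicographic ordering of their constant prefixes."""
        child = copy.deepcopy(c)
        k = random.randrange(n + 1)
        p, q = random.sample(sorted(child), 2)
        cur = ('<' if prefix(child, p, k) < prefix(child, q, k) else
               '>' if prefix(child, p, k) > prefix(child, q, k) else '==')
        new = random.choice([o for o in ('<', '==', '>') if o != cur])
        child[p][: k + 1] = child[q][: k + 1]  # first make the prefixes equal
        if new == '<':
            child[p][k] = child[q][k] - 1
        elif new == '>':
            child[p][k] = child[q][k] + 1
        elif prefix(child, p, n) == prefix(child, q, n):
            # full tie: randomly order the final-level constants so the
            # time-mappings still specify a total order, as in the paper
            child[p][n] = child[q][n] + random.choice([-1, 1])
        return child

    # figure 2 constants: T1 = [0, i, 0] -> [0, 0], T2 = [1, j, 0] -> [1, 0]
    print(group_based_mutation({1: [0, 0], 2: [1, 0]}, n=1))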
Group-Based Crossover  This operator combines information from two parents $P_i, P_j$ and produces a single child $C_i$ with the aim of creating new statement groupings. Child $C_j$ is produced by a second application of the operator on $P_j, P_i$. An application of group-based crossover on $P_i, P_j$ creates the child $C_i$ such that the total orders $\prec_k : 0 \le k \le n$ of the $c_p^k$ constants are taken from $P_i$ if $k \in K$ and from $P_j$ if $k \notin K$, whilst the $\theta_p^l : 1 \le l \le n$ are taken from $P_i$. The size $|K|$ is randomly chosen such that $1 \le |K| \le 2(n+1)/3$. Note that group-based crossover potentially alters the structure of the groups for all statements at a particular level $k$, whereas the group-based mutation operator directly alters the grouping of a statement pair.

To illustrate the effects of group-based crossover, consider figures 3 and 4 to be $P_i, P_j$, with $|K| = 1$ and $0 \in K$. The encoding for $P_i$ is $\theta_1^1 = i$, $\theta_2^1 = j$ and $\prec_0 : c_1^0 \equiv c_2^0 \land \prec_1 : c_1^1 < c_2^1$. The encoding for $P_j$ is $\theta_1^1 = i$, $\theta_2^1 = j$ and $\prec_0 : c_2^0 < c_1^0 \land \prec_1 : c_1^1 \equiv c_2^1$. The total order on the $c_p^k$ for child $C_i$ becomes $\prec_0 : c_1^0 \equiv c_2^0 \land \prec_1 : c_2^1 < c_1^1$ (the level-1 tie inherited from $P_j$ is randomly resolved so that the time-mappings still specify a total order, as in group-based mutation), whilst for child $C_j$, created by an application of group-based crossover on $P_j, P_i$ with the same set $K$, it becomes $\prec_0 : c_2^0 < c_1^0 \land \prec_1 : c_1^1 < c_2^1$. Figures 7 and 8 illustrate the time-mappings and the restructured code for $C_i$ and $C_j$. Note that child $C_j$ has degenerate information stored by the total order of the constants $c_p^j$ for $\prec_1$. It is intended that degenerate grouping information may be activated through interactions between the group-based crossover and mutation operators.

    do i = 1,1024
    2:  b(i) = c(i) + d(i)
    1:  a(i) = c(i) + d(i)
    enddo

$T_1 : [i] \to [0, i, 1]$, $T_2 : [j] \to [0, j, 0]$

Figure 7: $C_i$ group crossover.

    do i = 1,1024
    2:  b(i) = c(i) + d(i)
    enddo
    do j = 1,1024
    1:  a(j) = c(j) + d(j)
    enddo

$T_1 : [i] \to [1, i, 0]$, $T_2 : [j] \to [0, j, 1]$

Figure 8: $C_j$ group crossover.
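A matching sketch of group-based crossover over the same simplified representation (the paper exchanges the per-level total orders themselves; this sketch exchanges the per-level constants, which induces those orders):

    import copy
    import random

    def group_based_crossover(c_i, c_j, n):
        """Parents map statements to constants [c^0, ..., c^n]. The child
        takes the level-k constants (hence the order <_k) from c_i for
        k in K and from c_j otherwise; theta placements (not shown)
        would also come from c_i."""
        size = random.randint(1, max(1, 2 * (n + 1) // 3))
        K = set(random.sample(range(n + 1), size))
        child = copy.deepcopy(c_i)
        for s in child:
            for k in range(n + 1):
                if k not in K:
                    child[s][k] = c_j[s][k]
        # (a full tie at the final level would then be randomly resolved,
        # as in group-based mutation)
        return child

    # Figures 3 and 4 as parents P_i, P_j (two children per crossover):
    p_i = {1: [0, 0], 2: [0, 1]}   # fused, statement 1 first     (figure 3)
    p_j = {1: [1, 0], 2: [0, 0]}   # distributed, statement 2 first (figure 4)
    child_i = group_based_crossover(p_i, p_j, n=1)
    child_j = group_based_crossover(p_j, p_i, n=1)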
3.4 Method and Results

The initial GAPS configuration is concerned purely with the optimisation of restructuring transformations. GAPS currently optimises the application of loop-distribution, loop-fusion and statement-reordering transformations, whilst loop-permutation (equivalent to any combination of loop-interchange and loop-reversal) and iteration-partitioning decisions are determined during a preprocessing step which invokes petit (a prototype parallelising compiler) [3]. GAPS seeds one member of the population of 20 with encodings of the restructuring transformations generated by petit, whilst the remaining 19 members are randomly initialised. Note that illegal time-mapping encodings, which do not preserve program semantics, are allowed into the population. This removes the obligation that random initialisation must generate legal solutions. Further, the path in the solution space from a particular encoding to the optimal solution may be considerably shortened, or may only be reachable, via intermediary illegal solutions.

For legal time-mapping encodings, the evaluation function actually compiles, runs and measures the time taken to execute the SPMD code on two processors. Illegal time-mapping encodings are automatically given an execution time equal to the maximum possible value of a double, without running any code. Linearly normalised fitness then determines the selection probability of individuals ranked using the evaluation function. Thus, legal time-mapping encodings will always have a greater probability of selection for reproduction than illegal encodings. Steady-state reproduction policy uses group-based mutation and crossover
operators with equal and fixed selection probability. GAPS preserves elitism (ensuring that the best solution in a population is never deleted) by selecting individuals for deletion from the lower-ranked half of the population. The elitist policy ensures that GAPS will always produce code that is at least as good as that produced by petit, which is used to seed one member of the initial population. Figures 9 and 11 plot the measured 2-processor execution time returned by the evaluation function for every legal solution generated, against the number of reproductions, for the ADI 1024×1024 and Shallow 1024×1024 benchmarks respectively, over the course of a GAPS optimisation run. Note that the results for Shallow are incomplete because insufficient machine time was available to produce and evaluate 20000 individuals prior to the production of this paper; full Shallow results will be presented at the workshop. Figure 10 presents the temporal performance (1/execution time) of the ADI benchmark when transformed by the GAPS, PFA (native SGI) and petit (Omega Project) compilers on an SGI Origin 2000. The wall-clock time taken to produce and evaluate 20000 ADI individuals was approximately 24 hours.
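The evaluation and selection policy just described can be sketched as follows (preserves_semantics and compile_spmd are hypothetical helpers; the illegal-encoding penalty mirrors the paper's "maximum possible value of a double", and the linear normalisation constants are assumed, not taken from GAPS):

    import subprocess
    import sys
    import time

    ILLEGAL_TIME = sys.float_info.max  # illegal encodings never outrank legal ones

    def measure_execution_time(encoding):
        """Measured 2-processor execution time of the generated SPMD code;
        illegal encodings are penalised without being compiled or run."""
        if not preserves_semantics(encoding):   # assumed legality check
            return ILLEGAL_TIME
        binary = compile_spmd(encoding)         # assumed: generate + compile
        start = time.perf_counter()
        subprocess.run([binary], check=True)    # assumed: runs on 2 processors
        return time.perf_counter() - start

    def selection_probabilities(exec_times, lo=0.5, hi=1.5):
        """Linearly normalised fitness: rank individuals by execution time
        (best first), assign linearly decreasing weights, normalise to 1."""
        n = len(exec_times)
        order = sorted(range(n), key=lambda i: exec_times[i])
        weights = [0.0] * n
        for rank, i in enumerate(order):
            weights[i] = hi - (hi - lo) * (rank / (n - 1)) if n > 1 else 1.0
        total = sum(weights)
        return [w / total for w in weights]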
[Figure 9: GAPS ADI Performance — measured 2-processor execution time (y-axis, 55 to 85) of every legal solution plotted against the number of individuals produced (x-axis, 0 to 20000).]

[Figure 10: Performance of final ADI — temporal performance, 1/(execution time) (y-axis, 0 to 0.2), against number of processors (x-axis, 2 to 16) for GAPS, petit (Omega Project) and PFA (Native SGI).]

[Figure 11: GAPS Shallow Performance — measured 2-processor execution time (y-axis, 95 to 120) of every legal solution plotted against the number of individuals produced (x-axis, 0 to 7000).]

4 Conclusions and Future Work
GAPS proposes a framework for the application of GA optimisation to autoparallelising compiler technology. Preliminary GAPS results for the ADI benchmark produce encouraging performance improvements 6
of 44% and 37% in parallel execution time over PFA and petit respectively on 16 processors. One of the reasons for this performance improvement is a particular loop-fusion that enables common subexpression elimination to be performed. Note that this performance was achieved even though only 5.5% of the solutions GAPS generated (out of 20000 reproductions) corresponded to legal programs. On the Shallow benchmark, GAPS generated legal solutions at a rate of 4.1% and (with incomplete results) produced an 18% improvement in 2-processor execution time over petit after creating fewer than 7000 individuals. Future work will investigate the effect of utilising data-dependence information within reproduction operators in order to generate legal solutions with higher frequency. One possible concern about the use of GA optimisation within a parallelising compiler is that the high number of reproductions required to produce an optimal or near-optimal solution may be prohibitively expensive. We reject this concern on the following grounds:–
- Expert programmers often require months to tune an application for a new architecture.
- GAPS iteratively produces new solutions, so users of an application undergoing parallelisation can always use the current best solution; they need not wait for all GAPS reproductions to complete.
- The production, legality checking and evaluation of individuals could easily be parallelised and made asynchronous. Currently, GAPS (synchronously) generates, checks legality and then waits until legal individuals have been evaluated before starting a new reproduction step.
- An estimation/prediction of parallel execution time could be used instead of measured execution time if an application undergoing parallelisation itself had a prohibitively long execution time.
Other ongoing/future work also concerns i) an evaluation of the initial GAPS configuration on a number of benchmarks, ii) the implementation and evaluation of the full GAPS framework, which can apply any transformation that can be expressed by UTF time- and space-mappings, and iii) the development of profile-directed reproduction operators which focus the manipulation of time-mapping encodings on statements in the computationally expensive loops of a transformed program.
5 References
[1] The SUIF Compiler System, http://suif.stanford.edu/
[2] Polaris: Automatic Parallelization of Conventional Fortran Programs, http://polaris.cs.uiuc.edu/polaris/polaris.html
[3] The Omega Project: Frameworks and Algorithms for the Analysis and Transformation of Scientific Programs, http://www.cs.umd.edu/projects/omega/
[4] W.A. Kelly (1996), Optimization within a Unified Transformation Framework, PhD thesis, University of Maryland, USA.
[5] L. Davis (ed.) (1991), Handbook of Genetic Algorithms, Van Nostrand Reinhold, ISBN 0-442-00173-8.