Feedback Assisted Iterative Compilation

M.F.P. O'Boyle
ICSA, Edinburgh University, United Kingdom.
[email protected]

G.G. Fursin
ICSA, Edinburgh University, United Kingdom.
[email protected]

P.M.W. Knijnenburg
LIACS, Leiden University, the Netherlands.
[email protected]
Abstract

The rapid rate of architectural change and the sheer complexity of modern processors place enormous stress on compiler technology. What is required is an approach which achieves good performance by exploiting many complex hardware features simultaneously and, moreover, is capable of evolving and adapting to architectural change. This paper describes such an approach using feedback-directed program restructuring. We have implemented a compiler framework that decouples compilation strategy from other compiler components and developed three strategies that search the optimisation space by means of profiling to find the best possible program variant. These strategies have no a priori knowledge of the target machine and can be run on any platform. The compiler was evaluated on four platforms and applied to three full SPEC FP benchmarks. We show that our approach is able to give significant performance improvements over static approaches.
1. Introduction

The growth in the use of computing technology is matched by a continuing demand for higher performance in all areas of computing. This demand for ever greater performance has led to an exponential growth in hardware performance and architectural evolution. Such a rapid rate of architectural change has placed enormous stress on compiler technology, to the point where current compiler technology can no longer keep up. Traditional approaches to compiler optimisation, based on static analysis and a hardwired compiler strategy built around a simplified internal machine model, can no longer be used in a computing environment that is continually changing. This is especially true for embedded processors, which have a high rate of architectural change. Moreover, for these platforms there is no need for backward compatibility, and hence for each new platform many applications need to
be rewritten. Furthermore, modern architectures have very complex internal organisations, such as high issue widths, out-of-order execution, and deep memory hierarchies. Simplified machine models that only take into account a small part of an actual architecture therefore provide only very rough performance estimates, too imprecise to statically select the best optimisations. What is required is an approach which evolves and adapts to architectural change without sacrificing performance and, moreover, takes into account the runtime behaviour of platforms and applications. Compilers must therefore be able to navigate a transformation space and have a metric, whether it be static analysis or profile information, to determine the worth of a transformation. This knowledge must not be hardwired within the compiler, as it will quickly be made redundant. This paper examines feedback assisted compilation strategies based on traversing an optimisation space. Early results suggest that such an approach can give significant performance improvements over purely static approaches without excessive compilation time, and can adapt to changing hardware.

The use of transformations to improve program performance has been extensively studied for over 30 years. Such work is based primarily on static analysis and is characterised by trying (i) to determine how a program would perform on a particular processor, and (ii) to develop a program transformation such that the code is more likely to execute efficiently. Such an approach relies on modelling those features of the architecture and the program that are considered important. Although there has been improvement, much work remains, because it is extremely difficult to accurately model program/machine interaction: the optimisation space is highly non-linear with many local minima. For instance, Figure 1 shows a typical transformation space; in this case the performance of matrix multiplication on the UltraSparc for N = 512 and varying tile and unroll sizes. It is periodic with high frequency oscillation and many local minima. Within such spaces, it is difficult to find the absolute minimum using static analysis.
The problem is made worse in that the number of transformations that may be applied is potentially infinite, and therefore the transformation space considered must be highly restricted. Furthermore, due to architectural evolution, any fixed optimisation strategy is swiftly made redundant, a problem that becomes more severe with each new generation of processor.

This paper examines another approach based on iterative compilation [7, 9]. Different transformations are applied, corresponding to points in the transformation space, and their worth evaluated by executing the program. Several evaluations, based on a compiler search strategy, are made up to a certain pre-defined limit, with the compiler selecting the best one. Although such an approach is usually ruled out because of excessive compilation time, it is precisely the approach used by expert programmers where the application is to be executed many times. Embedded systems are an extreme example of this, and hence part of our study examines multimedia codes for embedded systems. In fact there exists a spectrum of techniques, from those based purely on static analysis to purely iterative ones which make few assumptions about the program or processor. Our framework allows the integration of these approaches by using the results of different static analyses as "seed points" in the optimisation space, i.e. they form part of the initial set of points with which to start the iterative evaluation. In this paper, however, we investigate using solely profile information, with no program analysis or machine models. For an investigation into combining static and dynamic information, see [10].

This paper is organised as follows. Section 2 describes the overall compiler infrastructure implemented. Section 3 gives an overview of the transformation space representation and the iterative compilation strategies implemented. Section 4 evaluates this approach on three SPEC FP benchmarks across four platforms. Section 5 reviews related work, and Section 6 provides some concluding remarks.
[Figure 1. Transformation Space: UltraSparc N = 512. A surface plot of execution time against tile size (0 to 100) and unroll factor (0 to 20), showing a periodic space with high frequency oscillation and many local minima.]

2. Compiler Infrastructure

This section describes the compiler infrastructure implemented in order to support feedback directed restructuring. By decoupling compiler strategy from the machine model, we develop compilation strategies that can adapt to architectural change.
2.1. Overall Coordinator

At the heart of the system is an overall coordinator responsible for reading in the original program and producing an optimised version. The coordinator selects different transformations available from the transformation server, based on a decision from a separate strategy module. The coordinator then compiles and runs the transformed program on the target machine and gathers profile information in order to guide the optimisation process. The coordinator keeps track of the different program versions and their associated profile information, which is made available to the strategy module. Figure 2 shows the overall structure of our feedback directed compilation system.

In order that the compiler is as modular as possible, with little hardwired functionality, the coordinator has no access to the internal program representation used by the transformation server. Instead, it represents the program as a simple decorated syntax tree. Communication with the transformation server is via a socket link where transformations are given as a series of commands. By using text based commands as the means of describing transformations, there is a simple, well-defined, high level interface between the coordinator and transformation server. This allows modification of the internal program representation and transformation implementation without affecting either the compiler strategy or the overall coordination. If, for instance, an alternative transformation server were available that provided a different or better range of transformations, it could be directly added to the architecture. Communication with the compiler strategy is via a module interface, where the coordinator passes the results of previous iterations and the transformations available from the transformation server. From this information, it is the responsibility of the strategy module either to select a new program version to execute or to terminate the process.

[Figure 2. Compiler Framework: the coordinator reads the source program, requests transformations from the transformation server under direction of the strategy module, compiles and executes each version on the target machine, and feeds the resulting profile information back into the process to produce the optimised program.]
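As an illustration of this feedback loop, the following minimal Python sketch expresses the coordinator as a higher-order function. The names and signatures are our invention for exposition, not the system's actual interface; the strategy, transformation server and target machine are supplied as callbacks.

```python
# Minimal sketch of the coordinator's feedback loop. All names are
# illustrative stand-ins for the components described above.

def coordinate(program, next_transformation, apply_transformation,
               compile_and_run):
    """Iteratively transform, execute and profile `program`,
    returning the fastest version found."""
    best_version, best_time = program, compile_and_run(program)
    history = [(program, best_time)]   # (version, execution time) pairs
    while True:
        # The strategy sees all previous results; it returns None when
        # its budget is exhausted or no options remain.
        choice = next_transformation(history)
        if choice is None:
            break
        # Transformations are requested from the server as textual
        # commands, so the coordinator never sees the internal IR.
        candidate = apply_transformation(best_version, choice)
        time = compile_and_run(candidate)  # profile on the target machine
        history.append((candidate, time))
        if time < best_time:
            best_version, best_time = candidate, time
    return best_version
```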
2.2. Transformation Server
The transformation server receives commands as input over the socket interface. The named programs are parsed and then translated into the MARS [11] internal representation. This is a high level abstraction of the key components of the program, namely iteration spaces, arrays, and access functions. Iteration spaces and the index domains of the arrays are represented as integer lattice points within convex polyhedra, stored as ordered sets of linear constraints. Access functions are stored as linear transformations on the iteration vector. This representation allows rapid high-level analysis of program structure and the simple application of high level transformations (and checking of their validity) via the transformation toolbox. For instance, unimodular and non-singular loop transformations are implemented as a simple matrix multiplication with the appropriate loop constraint matrix and the enclosed access matrices. Data transformations, such as global index reordering, array padding, etc., can be similarly achieved by multiplication of the appropriate index domain and all corresponding accesses. The key benefit of this formulation for iterative compilation is that we may apply long sequences of transformations without generating an intermediate form that may be difficult for later optimisations to analyse. Thus, uniform program representations become increasingly important when we wish to deal with large transformation spaces.
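To make the matrix formulation concrete, here is a toy sketch (ours, not the MARS implementation) of how loop interchange, a unimodular transformation, is applied to a polyhedral iteration space and an access function by plain matrix multiplication. The loop bounds and access are illustrative.

```python
import numpy as np

# The iteration space {(i, j) : 0 <= i < N, 0 <= j < M} is stored as
# linear constraints A x <= b on the iteration vector x = (i, j), and
# an array access a[i][j] as an access matrix F applied to x.
N, M = 4, 6
A = np.array([[-1, 0],    # -i <= 0
              [ 1, 0],    #  i <= N - 1
              [ 0, -1],   # -j <= 0
              [ 0, 1]])   #  j <= M - 1
b = np.array([0, N - 1, 0, M - 1])
F = np.eye(2, dtype=int)  # access a[i][j]

# Loop interchange is the unimodular matrix U mapping (i, j) -> (j, i).
U = np.array([[0, 1],
              [1, 0]])
Uinv = np.linalg.inv(U).astype(int)  # here U is its own inverse

# With x' = U x, the constraints and accesses are rewritten in terms
# of the new iteration vector by a single matrix multiplication each.
A_new = A @ Uinv
F_new = F @ Uinv
print(A_new, b, F_new)  # same polyhedron and access, loops swapped
```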
2.3. Strategy Module

The strategy of a compiler is usually hardwired and cannot be easily modified when moving from one target to the next. By separating what the compiler should do from how, we are able to investigate different compiler approaches to optimisation. Whether or not a particular transformation is applicable or legal need not be stored in the strategy but is available through separate method calls, allowing an orthogonal breakdown of strategy, in contrast to current transformation technology. The main objective of such a compiler strategy is to decide on the next transformation to apply, guided by information in the form of static analysis, execution time, or heuristics intended to reduce the transformation space to be considered. While the majority of research in optimisation via high level restructuring has relied on static information, in this paper we are primarily concerned with techniques that have no architectural knowledge and are solely based on dynamic profile information. For a hybrid approach, see [10].

As the potential search space is extremely large, the strategies typically have a fixed budget limiting the number of executions allowed. In the next section, we describe the iterative compilation strategies implemented within our framework.
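This decoupling can be pictured as a narrow interface between coordinator and strategy. The sketch below is illustrative only; the class and method names are our invention, not the system's actual module interface. A strategy receives the profile history and the available transformations, queries legality separately, and either picks the next transformation or terminates.

```python
from abc import ABC, abstractmethod

class Strategy(ABC):
    """Decides what to try next, given only profile feedback and the
    menu of transformations the server offers. It holds no hardwired
    machine model, so the same strategy runs on any platform."""

    @abstractmethod
    def next_transformation(self, history, available):
        """history: list of (transformation sequence, execution time);
        available: transformations offered by the server.
        Return the next transformation to try, or None to stop."""

class FixedBudget(Strategy):
    """Trivial example: try legal transformations in order until a
    fixed number of evaluations has been spent."""
    def __init__(self, budget):
        self.budget = budget

    def next_transformation(self, history, available):
        if len(history) >= self.budget:
            return None           # budget exhausted: terminate
        # Legality is queried via a separate (hypothetical) method
        # call, keeping strategy orthogonal to implementation.
        legal = [t for t in available if t.is_legal()]
        return legal[0] if legal else None
```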
3. Compiler Strategy

It is an optimising compiler's responsibility to select the best transformation sequence so as to minimise execution time. This section first describes how the transformation space exploration is represented. Based on this formulation, we then present the strategies currently implemented which traverse this transformation space so as to find the best transformation sequence.
3.1. Transformation Space Representation

Although many compilation systems may try certain high level transformations within a particular compilation phase, there is no explicit acknowledgement that this is just one particular transformation selected from a diverse range of options, against which it may be compared. In order that an iterative approach may usefully explore an interesting optimisation space, the transformation space must be represented in a way that is uniform and amenable to traversal. In our implementation we use trees for standard transformations and grids for parameterised ones.

3.1.1 Trees

Many transformations can be considered as yes/no decisions: apply the transformation or not. One example would be whether or not to fuse two loop nests. Sequences of such transformations can be readily represented in a tree structure. To illustrate this, consider the diagram in Figure 3. Each path of the tree represents an ordered sequence of transformations and, when annotated with execution time (shown inside each node), can be used by a search strategy as a means of guiding transformation space exploration. If, for instance, a path is not promising for two or more branches, then it may be abandoned. For example, in the case where interchange of loops 1 and 2 is followed by distribution, the execution time, 14, is greater than the original 12, and hence this path may be discarded from further investigation. Other branch and bound techniques could also be applied. There is some scope for eliminating duplicates here. If a transformation is followed by its inverse (distribution followed by fusion, or interchanging the same loops twice), then this branch can be removed, as it is equivalent to an identity transformation and has no effect on the program. For instance, in Figure 3, a subtree is removed (shown by the dotted line) when interchange is followed by interchange. Similarly, if two distinct paths encode different transformation sequences but have the same overall effect, then one can be discarded. Determining equivalence of transformations outside the non-singular class is a challenging task, and at present we only discard transformation sequences containing identity-forming subsequences rather than searching for equivalences. While some transformations are parameterised by program structure, e.g. the number of loops in a loop nest in the case of loop interchange, others depend on data sizes and are best treated differently. For instance, determining the best value of k in Figure 3 for unrolling or tiling is better handled separately.

[Figure 3. Tree of Transformations: each node is annotated with the execution time of the corresponding program version (12 at the root) and each edge is labelled with a transformation such as interchange(1,2), unroll(k), tile(k) or distribute. The path interchange(1,2) followed by distribute yields time 14, and the subtree where interchange(1,2) is followed by its own inverse is pruned, shown by a dotted line.]

3.1.2 Grids

While many transformation sequences can easily be represented as trees, this is not the case for many parameterised transformations. For instance, if we wish to tile a loop and consider tile sizes from 1 to 100, it is not convenient to consider each of the hundred versions as a separate branch in a tree. Instead, we search for the parameters using a grid based approach, as described in [2, 7]. In essence, each parameterised transformation can be considered as an axis in a transformation space, with a point on the axis representing the parameter value of the corresponding transformation. For instance, determining the best loop tile factor and loop unroll factor would create a 2-dimensional optimisation space. Given such a space, we use a grid-based approach [2, 7] to find the best parameters. To illustrate this, consider the diagram in Figure 4. It shows a grid representing the tiling and unrolling optimisation space. A number of points are initially sampled, shown by the widely spaced points. These are evaluated and placed in a priority queue based on performance. Good regions are selected for further investigation, as shown by the greater number of points in the upper left-hand region of Figure 4. Depending on the budget available (see the following section), further refinement can take place.

[Figure 4. Transformation Grid: unroll factor (1 to 20) against tile size (1 to 100), with widely spaced initial sample points and a more densely sampled good region in the upper left.]
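As an illustration of this grid refinement, the following self-contained sketch works under our own simplifying assumptions: `evaluate` stands for compiling and timing a program variant with a given tile and unroll factor, and the step sizes and budget are arbitrary choices, not the values used by the system.

```python
def grid_search(evaluate, tiles=range(1, 101), unrolls=range(1, 21),
                budget=50):
    """Coarse sampling followed by refinement around the best point."""
    results = {}   # (tile, unroll) -> execution time

    def probe(t, u):
        if (t, u) not in results and t in tiles and u in unrolls \
                and len(results) < budget:
            results[(t, u)] = evaluate(t, u)

    # Phase 1: sample a widely spaced coarse grid.
    for t in tiles[::20]:
        for u in unrolls[::5]:
            probe(t, u)
    # Phase 2: repeatedly probe neighbours of the best point found so
    # far, i.e. refine the most promising region, until the budget is
    # spent or no unprobed neighbours remain.
    while len(results) < budget:
        (t, u), _ = min(results.items(), key=lambda kv: kv[1])
        before = len(results)
        for t2, u2 in [(t - 5, u), (t + 5, u), (t, u - 2), (t, u + 2)]:
            probe(t2, u2)
        if len(results) == before:
            break
    return min(results.items(), key=lambda kv: kv[1])
```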
3.2. Transformation Space Exploration

Our current compiler strategy uses a tree based search of the transformation space, except in the case of parameterised
transformations such as loop tiling or array padding, where a grid based formulation is used. This combined tree/grid is traversed and the best performing optimisations are added to a priority queue. A simple strategy would be to evaluate all possible transformation sequences and pick the best. Clearly, this is not practical if we consider arbitrary length transformation sequences and an unbounded parameter space. The key feature of any practical iterative strategy is the idea of a fixed compilation budget defining the number of different program versions that may be tested. Given such a budget, the challenge is to devise a strategy that will make the most efficient use of a limited resource. In this paper we describe two strategies currently implemented within our compiler. However, it should be stated that the subject of determining optimal time-bounded strategies is very much in its infancy. The two strategies can usefully be thought of as transformation orientated and cost orientated.
3.2.1 Transformation Based Strategy

This strategy splits the compilation budget B into two components: B1 for evaluating single transformations and B2 for evaluating combined sequences. The budget B1 is divided evenly across the program, so that given a program divided into a set of code sections C, each code section is allocated a budget of B1/|C|. As we are primarily concerned with loop-based transformations, we divide the program into code sections consisting of loop nests. Of course, there are other ways of examining a program, including an array orientated viewpoint for data transformations [3]. Given its budget of B1/|C|, for each loop nest we apply a single transformation, determine its worth, and repeat with a new transformation until the budget is exhausted. The worth of each transformation for each code section is recorded. In the second phase, the remaining budget B2 is again divided among the code sections and limits the number of combinations that can be considered. Depending on the budget available, sequences of length 2 drawn from the top ranking transformations are applied before longer combinations involving lower ranked transformations are considered. When the budget B2 is finally exhausted, the best transformation sequence is applied to each code section, giving the final code. Of course, this technique relies on the assumption that transformation application to distinct program sections is orthogonal (to prevent combinatorial explosion in evaluating each combination), i.e. that the combined effect of transformations applied to distinct program sections is equivalent to the sum of the individual transformation effects. This is, of course, an over-simplification, as transformation applications do interfere. Furthermore, it is assumed that there is enough budget to examine the entire program. In the case of a budget relatively small compared to program size, code sections are selected arbitrarily.
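The two-phase budget split can be sketched compactly as below. This is a simplification under our own assumptions: `run(section, seq)` stands for applying a transformation sequence to one code section and timing the result, and `transformations` is the menu offered by the server.

```python
from itertools import combinations

def transformation_based(run, sections, transformations, B1, B2):
    """Per-section budget split: B1 for singles, B2 for pairs."""
    per_section1 = max(1, B1 // len(sections))
    per_section2 = max(1, B2 // len(sections))
    final = {}
    for section in sections:
        # Phase 1: try single transformations until this section's
        # share of B1 is spent, recording the worth of each.
        singles = {t: run(section, [t])
                   for t in transformations[:per_section1]}
        ranked = sorted(singles, key=singles.get)
        # Phase 2: try length-2 sequences drawn from the top-ranked
        # transformations until this section's share of B2 is spent.
        best_seq, best_time = [ranked[0]], singles[ranked[0]]
        for i, (a, c) in enumerate(combinations(ranked, 2)):
            if i >= per_section2:
                break
            t = run(section, [a, c])
            if t < best_time:
                best_seq, best_time = [a, c], t
        final[section] = best_seq
    return final   # best transformation sequence per code section
```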
Dynamic Reordering. Given a limited budget, we would ideally like the best transformations to be among the first tried. As this is not possible, the transformations are initially tried in an arbitrary fixed order. However, we can change this order dynamically. If, for instance, a transformation appears to be successful in one section of the program, then it is considered earlier in later suitable sections of the program where the code structure is similar. Of course, detecting whether two sections of a program are sufficiently similar to benefit from the same particular transformation is non-trivial, and at present only loop depth and array subscript patterns are considered. We are currently implementing this approach.

3.2.2 Cost Based Strategy

The major difference between this strategy and the previous one is that rather than spreading the budget across the whole program evenly, we use more accurate profile information to guide the process. This is precisely the approach taken by expert programmers when attempting to improve program performance. The sections of code are sorted by execution time and the most time-intensive routines prioritised. One approach would be to expend all effort on the most expensive section until it no longer dominates execution time. However, if this section is near optimal or cannot be improved by further transformation, this is a waste of resources. Instead, those most time-consuming routines that together account for up to X% of the overall execution time are considered simultaneously. Thus resources are focused on the most time-consuming areas, but more than one region is considered for optimisation. If we set X = 100 and thus consider the whole program, then we have the same algorithm as above. Currently, X = 50, but this figure can be modified over time. For instance, it may make sense to consider a wide range of code sections initially and then reduce the area of interest as more information becomes available.
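The selection of hot regions reduces to a few lines; the sketch below is our illustration, assuming profile data is a mapping from code section to measured execution time and X is the coverage threshold described above.

```python
def select_hot_sections(profile, X=50):
    """Return the most time-consuming sections that together account
    for up to X percent of overall execution time."""
    total = sum(profile.values())
    selected, covered = [], 0.0
    for section, time in sorted(profile.items(), key=lambda kv: -kv[1]):
        if covered >= total * X / 100:
            break
        selected.append(section)
        covered += time
    return selected

# Hypothetical profile: with X = 50 only the dominant loop nest is
# selected, since it already covers 60% of the execution time; with
# X = 100 the whole program is considered, as in the previous strategy.
profile = {"loop1": 60.0, "loop2": 25.0, "loop3": 10.0, "loop4": 5.0}
print(select_hot_sections(profile, X=50))   # ['loop1']
```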
3.3. Implemented Strategies

In this section we discuss the three strategies that have been implemented.

3.3.1 Strategy 1

This strategy uses data and loop transformations in a cost-conscious manner. Rather than search through the large space of all possible loop and data transformations, it targets those sections of the program that dominate program execution and considers restricted loop and data transformations in separate phases, reducing the number of combinatorial options by imposing a phase order. Initially, the program is profiled and those subroutines that dominate execution time are recorded. Within each marked routine, those loop nests that dominate execution are also marked, as are the arrays referenced within them. After this initial marking phase, we consider data transformations on those arrays marked as significant. As data transformations are global in effect, they are considered first of all, on the assumption that local loop transformations can later compensate [12]. In this strategy the only data transformation considered is array padding, which is applied to the first dimension of the marked arrays interprocedurally. If there are p padding factors to consider and a arrays, then the number of different padding combinations is p^a. To reduce this complexity, we pad each array by the same amount, reducing the complexity to p. Within this strategy we consider all padding factors between 1 and 15, applying the padding, running the program, and selecting the pad size with the smallest execution time. To this new padded program, we then apply loop transformations. Loop tiling is considered for all those loop nests marked initially as significant. Each loop nest is considered in turn and tiled with tile sizes ranging from 2 to 64. When the best tile size is determined, it is recorded before moving on to the next loop nest. Once again, to avoid combinatorial explosion, each loop is optimised in isolation, ignoring the effect of transforming one loop on the rest of the program. Once the tile factors for each significant loop have been determined, they are all applied to give a new program. Finally, loop distribution is applied to those nests where tiling was of no benefit, evaluated, and recorded to give the final program.
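Strategy 1's phase ordering can be sketched as follows. This is a simplified illustration under our own assumptions: `run(pad, tiles)` stands for building and timing the program variant with a uniform pad size and the given per-nest tile sizes, and the final loop distribution step is omitted for brevity.

```python
def strategy1(run, nests, pad_range=range(1, 16), tile_range=range(2, 65)):
    """Phase-ordered search: uniform array padding first, then tiling
    each significant loop nest in isolation."""
    # Phase 1: uniform padding is searched first since its effect is
    # global; padding every array by the same amount reduces the
    # search from p**a combinations to p.
    best_pad = min(pad_range, key=lambda p: run(p, {}))
    # Phase 2: tile each marked loop nest in isolation, keeping the
    # padding fixed, to avoid combinatorial explosion.
    tiles = {}
    for nest in nests:
        tiles[nest] = min(tile_range,
                          key=lambda t: run(best_pad, {nest: t}))
    # All best per-nest tile factors are then applied together.
    return best_pad, tiles
```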
3.3.2 Strategy 2

In the previous strategy, we considered array padding followed by loop tiling and loop distribution, since data transformations have a global impact. In Strategy 2, we first apply loop tiling and then array padding. We do this because it is possible that benefits available from loop tiling have been missed due to inappropriate padding. For the rest, Strategy 2 is similar to Strategy 1.

3.3.3 Strategy 3

This strategy again focuses on the three transformations considered before: array padding, loop tiling, and loop unrolling. Once again, profiling is used to determine those arrays and loop nests of interest. In this strategy, each transformation is applied over a range of values independently and good regions are selected. Thus, for each array and each loop nest, we determine a range of values within which there are certain points where program behaviour is improved. Each transformation is evaluated independently to avoid combinatorial explosion. Rather than combine the individually best padding, tiling, and unroll factors, we randomly search within the restricted good region of values for the best combination. This avoids the coupled behaviour of transformations (where the best form of one transformation combined with the best form of another gives a sub-optimal result), without having to exhaustively search a large space.
4. Experimental Results

In this section we discuss the results obtained using the three strategies for iterative compilation, on three SPEC FP benchmarks and four platforms.

4.1. Benchmarks and Platforms

We use the following SPEC FP benchmarks: tomcatv, swim, and mgrid, run with the reference data input sets. For each benchmark and platform, we have used two aggressive compiler optimization levels. Optimization level -O4 enables inline expansion of small procedures, software pipelining, and all -O3 optimizations. Optimization level -O5 extends -O4 in that it also applies loop transformations. Hence the improvement results below are with respect to a compiler that already statically optimises loops. Since we almost always obtain an improvement, this shows that our approach significantly improves on static optimisation techniques. We have used the following platforms and compilers:

- Alpha 21164 at 500MHz, in-order execution. OS: Digital UNIX V4.0D. Compiler: Compaq Fortran V5.0, optimization levels -O5 and -O4.
- Alpha 21264 at 500MHz, out-of-order execution. OS: Digital UNIX V4.0E. Compiler: Compaq Fortran V5.2, optimization levels -O5 and -O4.
- Alpha 21264 at 667MHz (EV67), out-of-order execution. OS: Compaq Tru64 UNIX V5.0A. Compiler: Compaq Fortran V5.4, optimization levels -O5 and -O4.
- Pentium II at 350MHz. OS: Windows 2000 Professional. Compiler: Compaq Fortran V6.1, optimization levels optimize:5 and optimize:4.
We have run our iterative strategies on top of the backend compilers at the given optimization levels, for all benchmarks on all platforms, and report speedups with respect to each optimization level.
4.2. Speedup Results

As can be readily seen from Tables 1, 2, and 3, we obtain significant improvements overall, up to 40%. For Strategies 1 and 2, we let the search continue until no further possibilities could be tried by the overall coordinator; this required between 70 and 200 program evaluations. We restricted the number of program evaluations to 20 for Strategy 3. We observe from the tables that Strategies 1 and 2 perform about equally well: only small differences in speedup are found, and across the benchmarks in some cases Strategy 1 performs better and sometimes Strategy 2. In only one case (EV67 on tomcatv with -O4) does Strategy 1 outperform Strategy 2 significantly, and in one case (PII on swim with -O4) Strategy 2 outperforms Strategy 1 significantly. Moreover, for mgrid, array padding could not be applied and therefore both strategies produce the same optimisation. This shows that in some cases the order in which transformations are applied by the strategy (in our case, a fixed order) can be important, but the effect is not as pronounced as one might expect. A solution to this problem would be to try out many different orders. This will increase compilation time significantly, of course, but in cases where this can be afforded, it could pay off.

Strategy 3, where random parameters are used, performs less well than Strategies 1 and 2, but it still obtains good speedups. In three cases, improvements are substantially less than for both other strategies. In some cases, we obtained no speedup at all; in these cases, however, the improvements found by the other strategies are also small. It appears that in these cases 20 evaluations using random parameters are not sufficient and a larger number is required. In the vast majority of cases, however, the degradations are much smaller. It should be noted that in no case does Strategy 3 outperform the other strategies. However, Strategy 3 requires an order of magnitude fewer program evaluations. Indeed, we included this strategy in order to test our approach for small budgets of program evaluations. We found that in this case a random strategy is preferable: the other strategies have more difficulty in restricting their budget, since they traverse a tree depth first and hence leave out of consideration many options that may contain the best optimization. Moreover, Strategy 3 scales perfectly in the number of allowed evaluations.

We conclude that iterative compilation is a viable approach to program optimization in cases where substantial compilation time can be afforded, as is the case for embedded systems. Significant improvements are obtained over the most aggressive compilation levels of standard compilers that rely on static analysis. Moreover, by choosing an optimisation set randomly, the approach can be successfully applied using a small number of program evaluations, although this comes at the cost of slightly reduced improvements.

Table 1. Strategy 1: Percentage improvement

                21164   21264   EV67    PII
tomcatv  -O5     12.3    25.4   19.0   22.3
tomcatv  -O4     13.5    25.4    6.2   24.8
swim     -O5     27.9    38.2    2.7   20.3
swim     -O4     22.6    40.0    8.2   20.6
mgrid    -O5      4.3    10.5    5.0   18.0
mgrid    -O4     32.6    15.4    1.0   18.1

Table 2. Strategy 2: Percentage improvement

                21164   21264   EV67    PII
tomcatv  -O5     11.8    21.9   13.7   20.9
tomcatv  -O4     13.6    27.8    5.6   24.7
swim     -O5     27.0    32.6    3.1   18.5
swim     -O4     21.8    39.1    7.4   29.5
mgrid    -O5      4.3    10.5    5.0   18.0
mgrid    -O4     32.6    15.4    1.0   18.1

Table 3. Strategy 3: Percentage improvement

                21164   21264   EV67    PII
tomcatv  -O5      8.3    23.7    9.2   14.6
tomcatv  -O4      9.9    22.0    5.3   24.6
swim     -O5     20.1    31.8    0     11.9
swim     -O4     19.0    33.0    4.7   14.5
mgrid    -O5      0       8.3    0     16.1
mgrid    -O4     30.5    14.8    0.8   14.5

5. Related Work and Discussion
Feedback directed optimisation [14] is a basic technique used in computer architecture, where hardware resources are dedicated to tracing and predicting program behaviour [13]. Similarly, in low-level compilers profile guided compilation is widely used to determine execution paths, allowing improved program optimisation [5]. For an excellent survey on feedback techniques, the reader is referred to [14]. Due to the problem of compile-time unknowns, several researchers have considered using runtime information. For instance, Rauchwerger [?] uses runtime tests to see if a loop may be executed in parallel when this is undecidable at compile time. Others use runtime information to select the best implementation, typically defining one or more options statically which are then considered at runtime. For example, in [6], whether or not a portion of the iteration space should be tiled depends on runtime characteristics, and in [4], different synchronisation algorithms are called depending on runtime behaviour. In [16, 1], systems for generating highly optimised versions of BLAS routines are described which probe the underlying hardware for platform specific parameters. Our approach can be seen as an extension of these last two approaches into a general compilation framework. Wolf, Maydan and Chen [17] have described a compiler that also searches for the optimal optimisation, based on a fixed order of the transformations and a static cost model to evaluate the different optimisations. Feedback directed high level transformations have also recently become more popular. In [15] a framework is described which allows remote on-line optimisation of a program while it is running, gaining the benefits of actual knowledge of runtime parameters without the overhead of compilation on the critical path. Our approach is similar in spirit in that different optimisations are tried and the best selected: theirs on-line, ours off-line. However, the main distinction is that we have developed generic search strategies based on systematically investigating a transformation optimisation space. Others have considered continuous compilation [?] in the background, targeting runtime hotspots. Again, such an approach is machine specific and only considers a fixed, predetermined number of optimisations. Our technique uses an actual specific data set to tune the program, just as expert programmers do. Such an approach is feasible if the control path is relatively data invariant, as is typical in multimedia and scientific applications.
Furthermore, in [8] we have shown that the best sequence of transformations does vary with data size, but that we can optimize for a set of data sizes by minimizing the average execution time. In this case, optimizations are found that are almost as good as those found for one single data size.
6. Conclusion

This paper has described an aggressive compiler framework that outperforms static optimisation approaches and that allows optimisers to adapt to new platforms by way of feedback directed iterative compilation. By decoupling strategy from implementation, we have implemented three architecture-blind, generic optimisation approaches. These rely on framing the problem of optimisation as that of traversing a transformation space in order to minimise the objective function of execution time. We have shown that for three full SPEC FP benchmarks, across four platforms and using two aggressive optimization levels in the backend compiler, iterative compilation outperforms this backend compiler by up to 40%. We conclude that iterative compilation is a viable approach to program optimization in cases where high performance is crucial and long compilation runs can be afforded.

References

[1] J. Bilmes, K. Asanović, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: A portable, high-performance, C coding methodology. Proc. ICS'97, pages 340-347, 1997.
[2] F. Bodin, T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and E. Rohou. Iterative compilation in a non-linear optimisation space. Proc. 1st Workshop on Feedback Directed and Dynamic Compilation (held in conjunction with PACT'98), 1998.
[3] M.F.P. O'Boyle and P.M.W. Knijnenburg. Nonsingular data transformations: Definition, validity, and applications. Int'l J. of Parallel Programming, 27(3):131-159, 1999.
[4] P. Diniz and M. Rinard. Dynamic feedback: An effective technique for adaptive computing. Proc. PLDI, Las Vegas, 1997.
[5] R. Gupta, D. Berson, and J. Fang. Path profile guided partial dead code elimination using predication. Proc. PACT, 1995.
[6] S.F. Hummel, I. Banicescu, C.-T. Wang, and J. Wein. Load balancing and data locality via fractiling: An experimental study. Proc. LCR, Kluwer Academic Press, New York, May 1995.
[7] T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, F. Bodin, and H.A.G. Wijshoff. A feasibility study in iterative compilation. Proc. International Symposium on High Performance Computing, Kyoto, May 1999.
[8] T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and H.A.G. Wijshoff. Iterative compilation in program optimization. Proc. CPC2000, pages 35-41, 2000.
[9] T. Kisuki, P.M.W. Knijnenburg, and M.F.P. O'Boyle. Combined selection of tile sizes and unroll factors using iterative compilation. Proc. PACT2000, pages 237-246, 2000.
[10] P.M.W. Knijnenburg, T. Kisuki, K. Gallivan, and M.F.P. O'Boyle. The effect of cache models on iterative compilation for combined tiling and unrolling. Proc. FDDO-3, pages 31-40, 2000.
[11] M.F.P. O'Boyle. MARS: A distributed memory approach to shared memory compilation. Proc. Languages, Compilers and Runtime Systems for Scalable Computing, Springer-Verlag, Pittsburgh, May 1998.
[12] M.F.P. O'Boyle and P.M.W. Knijnenburg. Efficient parallelization using combined loop and data transformations. Proc. PACT'99, pages 283-291, 1999.
[13] E. Rotenberg et al. Trace processors. IEEE Micro, December 1997.
[14] M. Smith. Overcoming the challenges to feedback-directed optimizations. Proc. Dynamo'00, 2000.
[15] M.J. Voss and R. Eigenmann. ADAPT: Automated de-coupled adaptive program transformation. Proc. ICPP, 2000.
[16] R.C. Whaley and J.J. Dongarra. Automatically tuned linear algebra software. Proc. Alliance, 1998.
[17] M.E. Wolf, D.E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. Int'l J. of Parallel Programming, 26(4):479-503, 1998.