Computer Aided Hand Tuning (CAHT): Applying Case-Based Reasoning to Performance Tuning

Antoine MONSIFROT and François BODIN
IRISA, University of Rennes, France
ABSTRACT
For most parallel and high-performance systems, tuning guides provide users with advice on optimizing the execution time of their programs. Execution time may be very sensitive to small program changes. Such modifications may be local (on loops) or global (on data structures and layout).
In this paper, we propose to help end-users with the tuning process through an interactive tool complementary to existing compilers and automatic parallelizers. Our goal is to provide a live tuning guide capable of detecting optimization opportunities that are not caught by existing tools. Our first prototype, called caht (Computer Aided Hand Tuning), targets SMP architectures for OpenMP programs. caht relies on a very general technique, case-based reasoning [16]. This technique is well suited to experimenting with and building an easily expandable and flexible system. Our first implementation applies to scientific codes written in Fortran 77.
1. INTRODUCTION
Performance tuning is the last development phase. Execution time may be very sensitive to small program changes, made locally on loops and/or globally on data structures. This tuning is usually performed by experts who must master the target architecture, the available compilers or parallelizers, and code transformations, as well as the intimate structure of the application and its numerical algorithms. Most parallel and high-performance system vendors provide these experts with tuning guides [13, 14, 11, 10]. The correctness of the changes to the code remains the responsibility of the experts.
2. CASE-BASED REASONING: AN EXPERIMENTAL APPROACH TO HAND TUNING
The expert knowledge associated with optimization and parallelization is not well structured and does not lend itself to conventional rule-based systems. In contrast, the case-based reasoning approach [16] aims to solve a given problem by adapting the solution of a similar, already encountered one. Case-based reasoning is based on four main operations: identification, retrieval, reuse and retention. This is very similar to the code tuning process. For code tuning, the expert mostly applies code transformations according to the peculiarities of the target architecture when certain statement structures are found. Furthermore, performance tuning on SMP architectures focuses on loop nests, making most of the code changes local. Performance also depends on the particular target architecture. The case-based reasoning approach is flexible enough to derive a target-specific version of the system at a reduced cost.
Compilers successfully optimize programs when two conditions are met. First, the required code transformations are available in the compiler and can be safely applied. Second, the optimizations are applied to the significant parts of the code. However, in many cases, the compiler or automatic parallelizer fails to optimize performance, either because applying a code transformation might not preserve the semantics of the program, or because the rule to apply a transformation depends on some knowledge of the target machine or application field that cannot be transferred into the compiler heuristics.

In Section 2 we briefly present the case-based reasoning technique and show how it can be specialized for dealing with code tuning. Section 3 presents the implementation of the system. In Section 4 we evaluate caht on a parallel loop benchmark, then on a real application (DeFT [8]).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. ICS '01 Sorrento, Italy © ACM 2001 1-58113-410-x/01/06…$5.00
In the context of performance tuning, we first identify code fragments for investigation; these can be obtained by profiling. In order to retrieve cases, we check a set of properties on code fragments (i.e. loops) to identify opportunities for tuning. These properties are called indices in the remainder of this section. The conjunction of several indices on a code fragment identifies an opportunity for tuning. Preliminary studies [18, 7] report good potential with abstractions based upon abstract syntax, data accesses, control flow, etc. In the identified loops, the solution to reuse may be applied automatically, if it consists of predefined automatic transformations, or it may need to be performed by the user. Finally, when new optimizations are found, the user should be able to retain a new case in the system. We address this last issue in Section 3.
In our experimental environment, a case is a pair C = (Problem, Solution) where Problem is a set of properties of loops and Solution is a sequence of transformations which represents a tuning or parallelization technique with its associated documentation. In our prototype the safety of a transformation has to be checked by the user, but caht provides help for the insertion of debug routines (see Section 3.1). The remainder of this section describes the internals of our system. The indices that can be used to trigger the cases are described in Section 2.1. Cases are described in Section 2.2.
2.1 Performance Related Indices
Finding optimization opportunities consists in identifying a set of properties on loops indicating that a program transformation might improve the execution time. In our context, the properties are related to the exploitation of the target architecture and to the removal of expensive computations. We distinguish two kinds of indices, the loop property indices and the loop nest patterns:
Category: Loop Structure

  #1   Maximum depth of the loop nest
  #2   Perfectly nested loops
  #3   Affine loops
  #4a  Number of statements in the nest
  #4b  Number of array references in the nest
  #5   Smallest loop iteration number
  #6   Smallest iteration number is on the outer loop
  #7   Has a loop exit via goto(s)
  #8   Has goto's
  ...
  #9   Steps of the loops
  #10  Else-if construct
  #11  If condition on loop index
  #12  Does not contain function calls
  #13  Call to an inline candidate function
  #14  Fusable loops (same iteration space and access to same arrays)

Category: Arithmetic Expressions

  #15  Division by an invariant
  #16  /x/y
  #17  Power with an integer constant
  #18  Polynomial expression
  #19  Mixed data types

Table 1: Code and expression structure indices.
Loop Property Indices: A set of indices describes the loop properties, according to computation and data access properties. The first set of indices, in Table 1, deals with the structure of the loop nests. The second set of indices is related to data dependencies and data accesses, as shown in Table 2. Contrary to the analyses used in classical compilers, the results of the analysis that produces the indices are not always interpreted in a conservative way. For instance, index #29 in Table 2 indicates whether a loop is parallel according to array references only; control flow statements (goto, ...) or scalar references might still forbid parallel execution of the loop. This interpretation is necessary to propose all potential optimization opportunities to the user.

Statement Patterns: Patterns in caht are similar in spirit to idioms [19], with the major difference that in a case-based reasoning system the situation only needs to be recognized with a high probability. First, this allows for many simplifications in the system, but it also allows similar "performance situations" to be generalized to code fragments that have about the same characteristics. This freedom is part of the essence of the system, going beyond automatic optimizations. These patterns are directly associated with cases. For instance, let us consider the pattern for matrix multiply loops. The system assumes that the pattern for matrix multiplication, given in Table 6, is described
Category: Array Accesses

  #20  Ratio of known bad access strides
  #21  All array access strides are on the leading dimension or invariant
  #22  Non-linear array accesses
  #23  Percentage of invariant array accesses if loop X is innermost
  #24  Multiple accesses to the same array element
  #25  Power-of-2 array leading dimension
  #26  Percent of array access strides on dimension Y if loop X is innermost

Category: Data Dependencies

  #27  Uniform data dependencies
  #28  Positive data dependency vectors
  #29  Parallel loop
  #30  Data dependency matrix indicates a do-across loop
  #31  Permutable loops

Table 2: Data accesses and data dependencies indices.
      SUBROUTINE LOOP(AD,BD,CD,N,PS)
      REAL*8 AD(N,3),BD(N,N,3)
      REAL*8 CD(N,N,3)
      INTEGER N,PS,I,J,K
      DO I = 1,PS
        DO J = 1,PS
          DO K = 1,3
            AD(J,K) = AD(J,K) + BD(I,J,K) + CD(J,I,K)
          ENDDO
        ENDDO
      ENDDO
      END
Figure 1: Loop example
[Figure 2 diagram: Fortran source is processed by foresys/tsf into abstract loop representations; indices are calculated and matched against cases (case 1, case 2, case 3) to produce a diagnosis.]
by a loop nest structure that is three levels deep; there are data accesses to at least two-dimensional arrays with linear index expressions; the expression must contain at least an addition and a multiplication; and finally there is an assignment to an array that also appears on the right-hand side. These properties do not guarantee that a matching loop nest is a matrix multiplication, but there is a high probability that it is. If it is not, it shares the same computation and data access structure, so it is likely that the code transformation for matrix multiplication also applies.
Figure 2: System overview

2.2 Performance Tuning Cases

Most of the cases we have constructed were directly extracted from tuning guides [13, 14, 11]. Cases are obtained by combining indices and/or by finding loop nest patterns. Patterns capture the syntactic nature of the loop, while indices give a more abstract view of the loop properties.

The cases listed in Table 3 address loop tuning for improving data locality. The cases described in Table 4 correspond to improvements of the loop structure. For instance, consider the loop in Figure 1. The following cases are detected: fully unroll the innermost loop (small iteration count), unroll-and-jam the outer loop, and loop blocking. These three optimizations can all be applied. A special case, compiler friendly, indicates to the user whether a good level of compiler optimization can be expected for the loop, that is, whether a good compiler will perform the necessary transformations itself. Table 5 shows a set of cases corresponding to sub-expressions sometimes found in programs that may be particularly inefficient when not handled by compilers.

Table 6 presents a set of cases corresponding to computation kernels for which a well-known efficient implementation exists (and may, for instance, be provided by target-specific libraries). Finally, in Table 7, we describe a few cases to achieve loop parallelization. The techniques to be used are specific to each target machine, especially in the case of sparse computations.

3. CAHT: A PRELIMINARY IMPLEMENTATION

Our prototype has been implemented using tsf [5], a foresys [20] extension that provides access to abstract syntax trees via a custom scripting language. It can extract and insert information into the foresys program database. From tsf we build an abstract representation of loops that is used to compute the indices and to check the patterns. Figure 2 shows the global organization of the system. The loop abstraction is presented in Section 3.2. Case creation is then presented in Section 3.3.

3.1 TSF

tsf is an extension of foresys [20], a Fortran 77 and 90 re-engineering system. In terms of what it allows one to do, tsf is not very different from tools such as Sage++ [4], mt1 [2] or Polaris [3]. It provides functions to analyze and to modify abstract syntax trees, whose complexity is hidden by an "object" abstraction. However, tsf differs greatly in the manner the program is presented to the user. tsf is centered around a scripting language that provides interactive and simple access to Fortran programs. The scripts are dynamically loaded, translated on the fly to Lisp, and interpreted. They are interactive and user oriented, contrary to tools such as Sage++ that are batch oriented and aimed at compiler writers. The tsf script mechanism allows users to write their own program transformations as well as to access analyses (such as data dependencies, interprocedural analysis of parameter passing, etc.) that are usually internal to restructuring tools.

The tsf scripting language is in spirit very close to the Matlab language, but instead of dealing with matrices it deals with Fortran statements. Just as Matlab provides access to complex operators, tsf scripts give access to various program transformations (such as loop blocking, etc.) via built-in functions. The tsf script in Figure 3 illustrates the style of programming with a simple program that extracts the index expressions of array accesses. The scripts can be used to implement the reuse operations associated with the case-based reasoning technique.
ILP and Data Locality Cases

Name            Indices           Corresponding Transformation
ILP             #5 #6 #12         Unrolling the innermost loop [1]
Data locality   #4 #23 #26 #31    Unroll and jam [6]
Data locality   #1 #20 #31        Loop interchange [1]
Data locality   #12 #26 #31       Loop blocking for data locality or TLB efficiency issues [1]
Data locality   #4 #20            Array dimension interchange
Data structure  #25               Array padding [1]

Table 3: Cases for ILP and data locality optimization.
Loop Transformation Cases

Name                          Indices            Corresponding Transformation
Loop overhead                 #4 #5 #12          Loop elimination ([14], page 133)
Loop overhead                 #6 #31             Loop interchange ([14], page 128)
Loop overhead, data locality  #14                Loop fusion [1]
ILP                           #7                 Unrolling + test precomputation ([14], page 140)
Function call                 #13                Function inlining ([14], page 152)
Register pressure             #4                 Loop distribution [1]
Sequence of else-if           #10                Tests can be sorted to reduce execution time ([14], page 145)
Redundant array accesses      #24                Array accesses can potentially be replaced by scalar variable accesses
Peelable conditional          #11                The IF can be removed from the loop via peeling
Fusion                        #14                Loops can be fused to improve data locality, ILP or parallelism
Compiler friendly             #3 #4 #27 #29 ...  Loop is compiler friendly and the compiler can be expected to behave properly

Table 4: Loop transformation cases.

Expression Optimization Cases

Name               Indices  Corresponding Transformation
Division           #15      Replace division by multiplication by the inverse and add the remainder for error propagation ([14], page 86)
Power computation  #17      Replace power by multiplications ([14], page 84)
Horner             #18      Use Horner's scheme for the polynomial computation ([14], page 86)
Mixed data type    #19      Sort the expression ([14], page 90)

Table 5: Expression cases.
Kernel Cases

Name                   Indices              Library
Matrix Multiplication  MxM Pattern          Call optimized BLAS (ATLAS [22])
LU Factorization       LU Pattern           Call optimized BLAS
SAXPY                  SAXPY Pattern        Call optimized BLAS
Vector Copy            Vector Copy Pattern  Call optimized BLAS

Table 6: Kernel detection for which efficient implementations are known.

Parallelization Cases

Name                            Indices                     Corresponding Transformation
Do-across loop                  #27 #28 #30 #31             Loop skewing ([14], page 133)
Scalar reduction                x = x op ...                Array expansion [17]
Sparse reduction                Sparse Reduction Pattern
Not automatically parallelized  No data dependencies found  Potential parallelization (usually inhibited by control flow operations); add OpenMP directives
Table 7: Parallelization cases.

The script mechanism can also help check the changes in the tuned program. After each modified loop nest, a call to a debug routine can be inserted for each array modified in the loop. The debug routine computes a checksum of the array (based on xor). Calls to the debug routine have to be inserted twice, and consistently, in the optimized and the non-optimized versions of the code. Executing these two codes allows the checksums to be compared, identifying when and where changes occurred in the computation. The small size of the checksum makes this practical even for long execution times. The flexibility of the scripting mechanism is essential to specialize the system for each code according to the programmer's method.
3.2 CAHT Abstract Loop Representation

Instead of directly accessing the front-end syntax tree, caht manipulates an abstract form of loops. The abstract form is used to compute and add indices and is essential to help adding new cases. It also makes caht independent from the compiler front-end technology used.

The abstract loop representation stores data accesses, converted to linear form when possible. Variable types are given, with the sizes of arrays when they are statically known. Data dependencies are computed using the description of these accesses. Representations of the statements in the loop body and of the nesting structure are stored in a prefix tree format in order to simplify pattern matching. In this format, variable names are removed, loops are converted to a normal form (do-enddo), etc.

When available, profiling data such as the execution time and the number of iterations of the loop are stored in the abstract loop form.

3.3 Adding a New Case

Easy addition of new cases is an essential property of the system. For each new target machine, new cases must be easily added to take into account the peculiarities of the programming environment and the hardware, as well as the available libraries. Adding a new case consists in writing a small C++ function; a typical case is 10 to 20 lines of C++. The parameters are always the same: the set of values for each index and the abstract loop representation. The compiler friendly case is shown in Figure 4. The values of NbSt and NbRef are both set to 25 because larger loops may be worth distributing to avoid spill code due to register pressure (we consider target processors with 32 logical registers).

4. EXPERIMENTS

To evaluate the effectiveness of the approach, we first used a loop benchmark [9] that highlights frequent loop nest patterns. The second test was run on a real Fortran program (DeFT [8]) of realistic size (75,863 lines of Fortran). The code was run on a 4-processor Sun Enterprise 450 (using Guide [15]) and on a 4-processor SGI Onyx. The automatic parallelizer used for both target machines is KAP [15].

4.1 Parallel Loop Experiment

To experiment with the technique, we chose a 64-loop benchmark from [9] that was designed to study automatic parallelizer performance. The code was compiled using the -O3 optimization level on the SGI machine and -fast on the Sun machine. The results provided by caht with the current case database are the following:
SCRIPT AccessAnalysis()                        //beginning of the script
  expArray := $csel                            //Fortran selection in expArray
  IF (expArray.VARIANT == "vardim") THEN       //check if the selection is
                                               //an array access
    externalTool := "../CBRanalyzer"           //analyzer program path
    STARTTOOL(externalTool)                    //start the analyzer
    reachedIndex := expArray.REACHABLEINDEX    //get surrounding index variables
    SEND(reachedIndex)                         //send enclosing loop indexes to
                                               //the analyzer
    nbdim := expArray.DIMENSION                //number of array dimensions
    WHILE (nbdim != 0)                         //for all dimensions
      dim := expArray.DIMENSION(nbdim)         //get the access expression
      SEND(dim.LINEARFORM(reachedIndex))       //send the array access
                                               //expression in linear form
      nbdim := nbdim - 1                       //next dimension
    ENDWHILE
    CLOSETOOL(externalTool)                    //stop the analysis program
  ENDIF
ENDSCRIPT                                      //end of the script
Figure 3: Example of tsf script for extracting array access indexes.

int casCompilerFriendly(IndiceS SetIndice, AbstractLoop *loop)
{
    if (SetIndice.getIndiceValue(AffineLoop) &&                  /* check indice #3  */
        (SetIndice.getIndiceValue(NumberOfStatement) < NbSt) &&  /* check indice #4a */
        (SetIndice.getIndiceValue(NumberOfArrayRef) < NbRef) &&  /* check indice #4b */
        !SetIndice.getIndiceValue(GotoStatement) &&              /* check indice #8  */
        SetIndice.getIndiceValue(allGoodStride))                 /* check indice #21 */
    {
        return TRUE;
    }
}