In Proc. DAGS'94 Symposium on Parallel Computing and Problem Solving Environments, F. Makedon, ed., pp. 11-25, Dartmouth College, July 1994

An environment for the rapid prototyping and development of numerical programs and libraries for scientific computation. L. DeRose, K. Gallivan, E. Gallopoulos, B. Marsolf and D. Padua. June 1994. CSRD Report No. 1370

Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign 1308 West Main Street Urbana, Illinois 61801

An Environment for the Rapid Prototyping and Development of Numerical Programs and Libraries for Scientific Computation

L. De Rose, K. Gallivan, E. Gallopoulos, B. Marsolf, and D. Padua

Center for Supercomputing Research and Development and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801
{derose, gallivan, stratis, marsolf, padua}@csrd.uiuc.edu

Abstract

Interactive array languages are powerful programming tools for the development of numerical programs and libraries. They provide an environment that tends to increase productivity in software development. The trade-off is that in order to provide this nicer programming environment, array languages are usually interpreted, with the resulting negative effect on performance. In this paper we present our project to develop an environment that takes advantage of both the power of interactive array languages and the performance of compiled languages, for the rapid prototyping and development of numerical programs and libraries for scientific computation. The performance of the resulting code can be further improved by the application of optimizations and the exploitation of parallelism and data distribution.

1 Introduction

Interactive array languages such as APL [13, 21] and MATLAB [19] are powerful programming tools for the development of numerical programs and libraries. Many computational scientists consider it easier to prototype algorithms and applications using array languages than conventional languages such as Fortran and C. One reason is the interactive nature of the language, which facilitates debugging and analysis. A second reason is that interactive array languages are usually contained within problem-solving environments which include easy-to-use facilities for displaying results both graphically and in tabular form [12]. Third, in these

Acknowledgments: Supported by the CSRD Affiliates under grant from the U.S. National Security Agency. Supported by the National Science Foundation under Grant No. US NSF CCR-9120105 and by ARPA under a subcontract from the University of Minnesota of Grant No. ARPA/NIST 60NANB2D1272. Supported by the National Science Foundation under Grant No. US NSF CCR-9120105. Supported in part by Army contract DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army or the Government.


languages it is not necessary to specify the dimension, shape, or type of elements of arrays. While some researchers may consider that lack of typing increases the probability of error, in practice programmers find that this is a convenient feature. Finally, these languages also have an extensive set of functions and higher-level operators, such as array and matrix addition, multiplication, division, matrix transpose, and vector reductions, that facilitate the development of scientific programs.

The downside is that interactive array languages are implemented with interpreters and therefore their execution is sometimes inefficient. The interpreter spends its time reading and parsing the program, dynamically determining the type of the operations, dynamically allocating storage for the resulting variables, and performing the operations. In fact, a large fraction of the interpreter's execution time is wasted doing work that could be done statically by a compiler. For example, the compiler could determine the type of elements and the dimensions and shape of many operands in the program by doing global flow analysis. In this way, the execution could be made more efficient by eliminating the need for some or all of the run-time bookkeeping operations. A study of the effectiveness of this type of approach on APL programs is presented in [6].

When the bulk of the computations is done by the high-level array functions, the inefficiency of the interpreter is less of a problem, because these functions are not interpreted and the bookkeeping operations only need to be performed when the function is invoked and/or returns. However, for some applications and algorithms, such functions are not sufficient and the program needs to execute a significant number of loops and scalar operations.
In some preliminary experiments we have conducted, it was observed that interpreting programs executing mainly loops and scalar operations could be up to two orders of magnitude slower than executing their compiled versions. To improve performance, the ideal environment for the development of numerical programs and libraries should use an interpreted array language for the conception and development phases and a compiler for the production phase. Performance can be further improved by exploiting parallelism and data distribution, taking into consideration the semantics of the array language to improve the quality of the resulting parallel code, as shown in Figure 1. It is possible for the compiler to generate code for any programming language paradigm, with two well-suited choices being a compiled array language, like Fortran 90 [1], and an object-oriented data parallel language, for instance pC++ [5].

In this paper we present our project to develop such an environment. This environment consists of three main parts: a front-end compiler that translates MATLAB programs into an internal form, and two tools to transform this internal form into a target language, either Fortran 90 plus directives for parallelism and data distribution, or pC++. A description of the project is presented in Section 2, with the presentation of the three main parts of this environment. Section 3 describes the current status of our work. Some experimental results are shown in Section 4, and finally our conclusions are presented in Section 5.
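The per-operation bookkeeping that an interpreter performs can be caricatured with a small sketch (ours, in plain Python; not the paper's MATLAB system): a loop that re-checks operand types on every scalar operation versus one where those decisions were made statically ahead of time.

```python
# Hypothetical sketch: emulate the dynamic bookkeeping an array-language
# interpreter performs on every scalar operation (type dispatch), versus a
# "compiled" loop where the types were resolved statically.

def interpreted_sum(values):
    total = 0
    for v in values:
        # dynamic type dispatch on every single operation
        if isinstance(v, complex) or isinstance(total, complex):
            total = complex(total) + complex(v)
        elif isinstance(v, float) or isinstance(total, float):
            total = float(total) + float(v)
        else:
            total = total + v
    return total

def compiled_sum(values):
    # types resolved ahead of time: only the plain loop remains
    total = 0.0
    for v in values:
        total += v
    return total

data = [float(i) for i in range(10000)]
assert interpreted_sum(data) == compiled_sum(data)
```

Both loops compute the same result; the dispatch branches are pure overhead, which is exactly the work a compiler can eliminate through static inference.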

2 Project description

An overall view of the environment is shown in Figure 2; the environment consists of three main parts: the front-end MATLAB compiler, a Fortran 90 generator, and a pC++ generator, as previously described. Within these main parts, additional analysis is done and optimizations are applied to improve the performance of the resulting code.

[Figure 1 diagram; labeled components: USER, INTERACTIVE ARRAY LANGUAGE, COMPILER, PARALLELIZER, SEQUENTIAL CODE, PARALLEL CODE]

Figure 1: Ideal programming environment for the development of programs and libraries for numerical computation.

The MATLAB language was chosen as the input language because it is used extensively by computational scientists and is available on a wide range of platforms. Fortran 90 is a natural choice as an output language, first because Fortran has been used as a scientific programming language for the past 30 years. Secondly, because Fortran 90 is also an array language, it has many features that facilitate the compilation process, especially for vector computations and handling memory management. Thirdly, parallelism and data distribution can be exploited with the migration to High Performance Fortran (HPF) [17], which is a superset of Fortran 90. Moreover, most of today's massively parallel machines have Fortran compilers that implement a subset of Fortran 90, containing the array language features and some mechanism for data distribution.

For the growing body of people who are using C++ for the development of numerical algorithms, pC++ is one way of providing a parallel version of this work. pC++ is a barrier-based extension to C++ which allows object-parallel programming by defining collections of elements and by defining methods which operate, in parallel, on the elements within a collection. By generating pC++, we have the ability to apply object-oriented techniques to scientific programming and explore the issues of using a barrier-based language. Also, pC++ is available for several massively parallel machines. In addition, by allowing the environment to generate these two different languages, we will be able to compare the effectiveness of the two approaches for parallel processing. Eventually this should improve the performance in both cases through the exchange of information and through experimentation with different parallelization techniques.

2.1 The Front-end Compiler

The front-end compiler reads in the MATLAB language and converts it into an abstract syntax tree (AST). During this process, a mechanism combining static and dynamic methods for type, rank, and shape inference is executed [9].

One characteristic that makes interactive array languages easy to use is their lack of type

[Figure 2 diagram; labeled components: User Interface, MATLAB, Front End Compiler (Lexical Analyzer, Syntax Analyzer, Inference, Intermediate Code, Code Optimization, Context Sensitivity), Maple, Convertor to Sage++, Sage++ Objects, Sage++, Code Generator for Fortran 90 (Algebraic Restructuring, Primitive Translation), Fortran 90 Code, Fortran 90 Compiler, Polaris, Sequential Code, pC++ Code, pC++ Compiler, Parallel Code]
Figure 2: Program development environment.


declarations. The fact that the language is not typed simplifies the programmer's job, because every function and operator is polymorphic, and therefore type casting or function cloning to accept different input and output types is not necessary. Although run-time type-checking provides more flexibility for the user, it normally means sacrificing performance in order to achieve this goal. To translate a typeless language such as MATLAB into a language such as Fortran or C++, a type-inference mechanism is necessary. MATLAB works with only two types for variables, real and complex, which our type-inference approach has to support. The solution we adopted is to use a combination of static and dynamic analyses. The static inference uses a type algebra, similar to the one described in [22] for SETL. This algebra operates on the type of the MATLAB objects and is implemented with the use of tables for all arithmetic operations. The dynamic analysis is used for all expressions whose type cannot be inferred statically by the compiler. To this end we introduce conditional statements to select the operation and variables of the appropriate type. Although this solution introduces some overhead in the computation and increases the size of the code, in several cases it is cheaper than executing complex arithmetic all the time.

Shape inference is also important for the efficiency of the code. If the sizes of the arrays are known at compile time, they can be declared statically, and the overhead of dynamic allocation can be avoided. However, in some cases it is impossible to determine shape, because it can be dependent on the input data, and thus the only alternative is to allocate the array dynamically. To avoid the problem of rank inference, we could declare all variables as matrices, since MATLAB considers all variables to be two-dimensional arrays, including scalars and vectors.
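The table-driven type algebra for the two MATLAB variable types can be sketched as a simple lookup (a minimal rendering of ours, not the project's implementation); an absent entry signals that the type must be resolved dynamically:

```python
# Sketch of a table-driven type algebra over MATLAB's two variable types.
# Entries are illustrative; the real tables cover all arithmetic operations.

REAL, COMPLEX = "real", "complex"

TYPE_TABLE = {
    ("+", REAL, REAL): REAL,
    ("+", REAL, COMPLEX): COMPLEX,
    ("+", COMPLEX, REAL): COMPLEX,
    ("+", COMPLEX, COMPLEX): COMPLEX,
    ("*", REAL, REAL): REAL,
    ("*", REAL, COMPLEX): COMPLEX,
    ("*", COMPLEX, REAL): COMPLEX,
    ("*", COMPLEX, COMPLEX): COMPLEX,
}

def infer(op, t1, t2):
    # None means "fall back to dynamic analysis" (run-time conditionals
    # selecting the operation of the appropriate type)
    return TYPE_TABLE.get((op, t1, t2))

assert infer("+", REAL, REAL) == REAL
assert infer("*", REAL, COMPLEX) == COMPLEX
```

Whenever `infer` returns None, the generated code would carry the run-time conditionals described above, trading a little code size for avoiding complex arithmetic everywhere.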
However, this approach has a negative impact on performance because the unnecessary use of indices and the overdimensioning of variables will cause poor memory and cache utilization. Hence, it is also important to provide the ability to recognize scalars and vectors.

Functions must also be considered during the inference pass. This problem is important because a MATLAB program may consist of several function calls, and knowledge of the statements inside the functions might be very helpful in defining the rank, shape, and type of the variables. Thus, functions cannot be treated as simple black boxes. There are two types of functions in MATLAB: intrinsic (or built-in) functions, and M-files. Built-in functions range from elementary mathematical functions such as sqrt, log, and sin to more advanced matrix functions such as inv (for matrix inverse), qr (for orthogonal triangular decomposition), and eig (for eigenvalues and eigenvectors). M-files consist of a sequence of standard MATLAB statements, possibly including references to other M-files. An important characteristic of functions in MATLAB is that they are side-effect free unless declared otherwise(1). Therefore, arguments and variables defined and manipulated inside the file are local to the function and do not operate on the workspace of its caller. This is an important characteristic that improves parallelization, as will be discussed later.

We are considering three approaches for the treatment of functions. The first is to compile each function independently and save all the information about the function that is necessary for the inference process. This information is then used during the compilation of the main program anytime the function is called. The second approach is to inline the function every time it is called, and compile the whole program at once.
This approach is more appealing, because type, rank, and shape information can be easily propagated inside the functions. This approach, however, requires special handling for the case of recursive functions [16]. The third approach is to apply interprocedural analysis. This would overcome the code explosion that could occur with inlining and would make it possible to deal with recursive functions. However, it comes at the expense of a more complex implementation.

We will implement a mix of the three approaches described above. For built-in functions, we will use the first and third approaches, by constructing a database that, given the types of the input arguments and the function being used, will contain the possible types for the output parameters and, whenever possible, information about special characteristics of the output, such as matrix shapes (e.g., lower triangular, diagonal). For M-files, we will use the second approach.

(1) Newer versions of MATLAB accept global declarations, which we will not support with this version of the compiler.

  [L, U, P] = lu(A);
  y = L \ (P * b);
  x = U \ y;

Figure 3: MATLAB code segment for LU decomposition
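The database approach for built-in functions might be sketched as a lookup table keyed by function name and input types; the entries below are illustrative stand-ins of ours, not the project's actual tables:

```python
# Hypothetical sketch of the built-in function database: for a function and
# its input types, record output types plus structural facts (e.g.,
# triangularity) that later optimizations can exploit.

BUILTIN_DB = {
    ("lu", ("real_matrix",)): {
        "outputs": ("real_matrix", "real_matrix", "real_matrix"),
        "shapes": ("lower_triangular", "upper_triangular", "permutation"),
    },
    ("sqrt", ("real",)): {
        # sqrt of a negative real is complex, so the type is data-dependent
        "outputs": ("real_or_complex",),
        "shapes": (None,),
    },
}

def lookup(name, input_types):
    # None means no entry: the compiler must fall back to dynamic analysis
    return BUILTIN_DB.get((name, tuple(input_types)))

info = lookup("lu", ["real_matrix"])
assert info["shapes"][0] == "lower_triangular"
```

Recording shape facts such as "lower_triangular" is what later lets the context-sensitivity pass pick specialized routines instead of general ones.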

2.2 Optimizations

During the compilation of the MATLAB code, optimizations can be applied to improve the performance or numerical properties of the code. These optimizations can be applied at several different places within the environment and can be done either interactively or semi-automatically. However, for best results, the optimizations need to be applied at the most appropriate level. The optimizations to be used include the standard techniques [2] as well as other techniques such as context sensitivity, algebraic restructuring, and primitive-set to primitive-set translation.

The standard techniques to be used include common subexpression elimination, copy propagation, and dead code elimination. These methods can be applied both to the AST and to the target code resulting from the code generators. However, the efficiency of the target code may be improved by applying the optimizations during the early stages of the compilation, when more information may be available, as opposed to applying them to the target code.

The performance of the target code can be improved with the use of context sensitivity by the front-end compiler. The semantic information that is available in the source program can be analyzed by the front-end compiler to allow the utilization of optimized functions in the compiled code, instead of needing to use generalized methods. This information could otherwise be very difficult to recover, or could even be lost, after the compilation process. For example, consider the MATLAB code for the solution of a linear system Ax = b using an LU decomposition, as presented in Figure 3. The first statement calls a built-in function (lu) that returns a lower triangular matrix L, an upper triangular matrix U, and a permutation matrix P. For a regular compiler, the second statement should perform a matrix-vector multiplication (P*b) and solve the linear system Ly = Pb. Finally, the last statement should solve the linear system Ux = y.
However, by taking into consideration the semantics of the array language, and knowing the properties of L and P, a smart compiler would know that the matrix-vector multiplication (P*b) is only a permutation of the vector b, and that the two linear systems to be solved are triangular systems that can be solved using a faster routine. Therefore, we observe that a large gain in performance can be obtained by using the inherent information of the program. Furthermore, by knowing this information, a more appropriate parallel algorithm can be used to improve the performance of the parallel code.

Another technique, which we call algebraic restructuring, uses the algebraic rules defined for the variables, whether they are scalars, vectors, or matrices, to restructure the operations performed on the variables. To perform such manipulations, symbolic computation tools, such as Maple [7], can be employed. In some cases applying these rules may be similar to the standard loop-based restructuring strategies already used, such as blocking of matrices, but we also want to be able to handle special matrix classes and more complex operators. Our goal in applying the algebraic rules to matrices and vectors is to achieve better restructuring than when the rules are only applied to elements. As an example of algebraic restructuring, the associative rule for matrices can be applied to transform a triangular solve with a column sweep,

  (L4^-1 (L3^-1 (L2^-1 (L1^-1 f))))

into the product form [11],

  (((L4^-1 L3^-1)(L2^-1 L1^-1)) f)

thereby generating more parallelism at the matrix operation level. In applying the rules, however, one must decide what goal the optimizations should try to achieve (improved serial performance, improved parallelism, or improved numerical stability) and what machine resources are available. For example, the product rule that was just described can require more processors and generate more operations than the column sweep, which might be undesirable on systems with few processors.

Primitive-set to primitive-set translation can also be used to translate the code to the level of numerical operations that will work best for the target machine and application [10]. Instead of dealing with the code only at the matrix operation level, this phase is able to convert the algorithms to matrix-vector operations or vector-vector operations. This optimization technique should be guided by factors such as the machine architecture, the availability of low-level libraries, and the problem size, with the goal of achieving the best performance. As an example of the use of multiple levels of primitives to perform the same operation, matrix multiplication can be implemented using any of the three levels of BLAS primitives. The operation C = A * B is performed in Figure 4 using primitives from the three BLAS levels: the vector triad (level 1), the vector-matrix product (level 2), and submatrix multiplication (level 3). Although all three segments perform the same matrix operation, the selection of the best method to use is based upon the resources available on the target architecture. The performance of these code segments can vary greatly depending upon factors such as the number of processors to be used, the size of the cache, and the presence of vector registers.
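The equivalence of the column-sweep and product forms above rests on associativity of matrix multiplication, which a small sketch can verify (our illustration with concrete 3x3 factors, where each Mi stands for Li^-1):

```python
# Verify that the column-sweep form M4(M3(M2(M1 f))) and the product form
# ((M4 M3)(M2 M1)) f compute the same vector. Integer entries keep the
# arithmetic exact. Mi plays the role of Li^-1 in the text.

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(len(x))) for i in range(len(A))]

M1 = [[1, 0, 0], [2, 1, 0], [3, 4, 1]]
M2 = [[1, 0, 0], [-1, 1, 0], [0, 5, 1]]
M3 = [[1, 0, 0], [7, 1, 0], [1, 1, 1]]
M4 = [[1, 0, 0], [0, 1, 0], [2, -3, 1]]
f = [1, 2, 3]

# column sweep: four dependent matrix-vector products
sweep = matvec(M4, matvec(M3, matvec(M2, matvec(M1, f))))
# product form: the two inner matrix products are independent of each other
product = matvec(matmul(matmul(M4, M3), matmul(M2, M1)), f)
assert sweep == product
```

The product form exposes two independent matrix products that can proceed in parallel, at the cost of extra operations, which is exactly the serial-work versus parallelism trade-off discussed above.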

2.3 Fortran 90 generator

After the generation of the AST, the generation of Fortran 90 is straightforward, because most of the problems that normally appear in the code generation phase of a regular compiler [2] (e.g., register allocation) do not need to be addressed in ours, since Fortran 90 is a high-level language. In addition, several of the array constructs in MATLAB have a direct translation to Fortran 90. Examples of these constructs are array operations such as addition and multiplication, which have the same semantics in Fortran 90, and some vector reductions and matrix operations, such as dot product, matrix multiplication, and transposition, that can be translated into Fortran 90 functions. Nevertheless, the translation process is not always straightforward, because there are several idiosyncratic features in MATLAB that cannot be converted directly into Fortran 90. Examples of these features are the operators "/" and "\" that are used for division of matrices, and functions that can have several output variables. These issues, however, are treated by the front-end compiler during the generation of the AST.

Vector-Triad:
  Do i = 1, n
    C(i,1:n) = 0
    Do j = 1, n
      C(i,1:n) = C(i,1:n) + A(i,j) * B(j,1:n)
    End Do
  End Do

Vector-Matrix Product:
  Do i = 1, n
    C(i,1:n) = matmul(A(i,1:n), B)
  End Do

Submatrix Products:
  Do i = 1, n, k
    Do j = 1, n, k
      C(i:i+k-1,j:j+k-1) = matmul(A(i:i+k-1,1:n), B(1:n,j:j+k-1))
    End Do
  End Do

Figure 4: Three possible options for matrix multiply using BLAS primitives.

Most of the parallelization of the Fortran 90 code will be done by interfacing with Polaris, a parallelizing compiler for massively parallel machines [20]. We are also investigating the automatic generation of data partitioning and distribution through the use of directives (other techniques for automatic data partitioning have been described in [8], [15], and [18]). With the use of heuristics, our compiler will select distributions based on its knowledge of the best data partitioning for each built-in function. Thus, in the LU example in Figure 3, the compiler can generate directives for data distribution for solving the linear system using block factorization, as shown in Figure 5, using HPF notation.

  !HPF$ distribute A(block,*)
  !HPF$ align with A :: L, U
  !HPF$ align (:) with A(:,1) :: b, t, y, x

Figure 5: Data distribution directives for solving the linear system Ax = b.

Directives can also be used to facilitate the exploitation of functional parallelism [14] and loop parallelism. As described before, MATLAB functions are side-effect free; hence, the work executed by a parallelizing compiler to detect functional parallelism can be simplified significantly if this information is given as a directive. Also, by knowing that a function is side-effect free, a parallelizing compiler will not need to perform interprocedural analysis to decide whether a loop that has a function or subroutine call in its body can be parallelized.
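The three variants in Figure 4 can be sketched in plain Python (an illustrative stand-in of ours; loops replace the BLAS calls) to check that they compute the same product while differing only in the granularity of the innermost primitive:

```python
# Three formulations of C = A * B, mirroring the three BLAS levels in
# Figure 4: a vector triad (level 1), a vector-matrix product (level 2),
# and blocked submatrix products (level 3).

def mm_level1(A, B, n):
    # vector triad: C(i,:) += A(i,j) * B(j,:)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][k] += A[i][j] * B[j][k]
    return C

def mm_level2(A, B, n):
    # vector-matrix product: C(i,:) = A(i,:) * B
    return [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]

def mm_level3(A, B, n, blk=2):
    # submatrix products over blk x blk tiles of C
    C = [[0.0] * n for _ in range(n)]
    for i in range(0, n, blk):
        for j in range(0, n, blk):
            for ii in range(i, min(i + blk, n)):
                for jj in range(j, min(j + blk, n)):
                    C[ii][jj] = sum(A[ii][k] * B[k][jj] for k in range(n))
    return C

n = 4
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j + 1) for j in range(n)] for i in range(n)]
assert mm_level1(A, B, n) == mm_level2(A, B, n) == mm_level3(A, B, n)
```

On a real machine the choice among the three is driven by cache size, vector registers, and processor count, as the text notes; the sketch only demonstrates their functional equivalence.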

2.4 pC++ generator

The generation of pC++ from the AST form is performed with the help of Sage++ [4], a class library for building C++ restructuring tools. The main classes supported by the library are projects, files, and statements, which are arranged in a hierarchy, with each project consisting of one or more files and each file consisting of one or more statements. The statement class is used to represent all types of statements, including header, declaration, control, and executable statements. Statement objects within a file can be modified, changed, or added to restructure the program. To save the modified program, Sage++ can either output the file in its internal format or output the file as source code. For our project we start Sage++ with an empty file and translate the AST nodes into Sage++ statements within the file. When all of the statements have been added to the file, Sage++ is used to output the file as pC++ source code.

Sage++ is being used for several reasons. First, it makes code generation easier because we do not have to deal with all of the details of the target language. Second, because Sage++ can restructure the target language, it provides the tools necessary to modify the target code which is generated for the algorithm, allowing direct influence over the generated code without needing to change the input source code.

The converter to Sage++, therefore, is responsible for translating the nodes in the AST into statements in the Sage++ file. This translation is accomplished by traversing the AST, in order, and examining the nodes. For each executable node in the tree, a decision needs to be made about how to implement the code in pC++, based on the operation being performed and the data types of the operands. These factors are used to determine what type of data structures to use, what level of primitives to use for the operations, and what data and operations to place into classes.
When the operands are scalars this translation can be easy, but when the operands are vectors and matrices the translation becomes more complicated. For the generation of parallel code, the data distribution must also be determined. pC++ provides parallelism to C++ by allowing objects to be grouped into sets which are distributed over processors. Each set is represented as one element in a collection, where each element of a collection can be operated on in parallel. Thus, to generate pC++ code the converter must create collections of objects and define operations on the elements of the collections. Also, in order to provide load balancing it is necessary to determine how many objects should be grouped into each element of the collection. Additional details about the generation of pC++ code can be found in [10].
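The converter's per-node decision can be caricatured as follows (a toy sketch with names of our own; the real system emits Sage++ statement objects, not strings): walk the AST in order and choose an implementation from the operation and the operand kinds.

```python
# Toy sketch of the AST-to-target translation: each executable node is
# mapped to target code based on its operation and operand types. The real
# converter builds Sage++ objects; strings are used here for illustration.

def emit(node):
    kind = node["op"]
    if kind == "assign":
        return f'{node["dest"]} = {emit(node["src"])};'
    if kind == "mul":
        l, r = node["args"]
        # scalar * scalar stays a plain expression; matrix operands would
        # instead map to a collection method or library primitive
        if l["type"] == "scalar" and r["type"] == "scalar":
            return f'{l["name"]} * {r["name"]}'
        return f'matmul({l["name"]}, {r["name"]})'
    raise ValueError(f"unhandled node kind: {kind}")

ast = {"op": "assign", "dest": "c",
       "src": {"op": "mul", "args": [{"name": "a", "type": "scalar"},
                                     {"name": "b", "type": "scalar"}]}}
assert emit(ast) == "c = a * b;"
```

The same dispatch point is where decisions about primitive level, data structures, and collection membership would hook in for the vector and matrix cases.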

3 Current Status

At the current time we have a system that can parse MATLAB code into the AST and generate Fortran 90 or pC++ code for most operations and built-in functions(2). We have implemented code optimizations at multiple levels, including scalar optimizations, for common subexpression elimination and copy propagation, and the primitive-set to primitive-set translation, for mapping MATLAB operations to multiple libraries.

(2) Currently the code is generated assuming all variables are real, because type inference has not yet been implemented.


The compiler is capable of mapping MATLAB operations to subroutine calls for a specified library. This is done by preparing a mapping from each subroutine in the library to the corresponding MATLAB operation. The mapping matches the MATLAB operation being performed on certain data types to a corresponding subroutine call. The compiler can currently map a MATLAB operation to a single subroutine call or to a subroutine call within a loop (nested loops are also supported). The compiler is currently being modified to support other combining rules when mapping the MATLAB operations to subroutine calls, with the new rules allowing mapping one MATLAB operation to multiple subroutine calls and mapping multiple MATLAB operations to a single subroutine call. These mappings are used to create the reconfigurable libraries.
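Such a mapping might be sketched as a table from (operation, operand types) to a library call plus an optional loop nest; the entries below are hypothetical illustrations, not the project's actual tables:

```python
# Hypothetical sketch of the operation-to-subroutine mapping: a MATLAB
# operation on given operand types selects a library subroutine, possibly
# wrapped in a loop nest. Entry contents are illustrative only.

MAPPING = {
    ("*", "sparse_matrix", "sparse_matrix"):
        {"call": "amub", "loop": None},          # single sparse-library call
    ("*", "dense_matrix", "dense_matrix"):
        {"call": "daxpy", "loop": ("i", "j")},   # level-1 BLAS in a loop nest
}

def select(op, t1, t2):
    # None means no rule exists for this combination yet
    return MAPPING.get((op, t1, t2))

rule = select("*", "sparse_matrix", "sparse_matrix")
assert rule["call"] == "amub" and rule["loop"] is None
```

Swapping tables is what makes the libraries "reconfigurable": the same MATLAB operation can be retargeted to a different subroutine set without touching the source program.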

Original MATLAB code: C = A * B;

The translated code calling a splib subroutine: amub(A.nrows, A.ncols, A.values, A.colnums, A.rowptrs, B.values, B.colnums, B.rowptrs, C.values, C.colnums, C.rowptrs, C.nlen, _aT__1, _aT__2);

The translated code using a level 1 BLAS primitive: for (_aT__1 = 1 ; _aT__1