Optimizing High Performance Software Libraries

Samuel Z. Guyer    Calvin Lin
Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712

May 23, 2000

Abstract

This paper describes how the use of software libraries, which is prevalent in high performance computing, can benefit from compiler optimizations in much the same way that conventional computer languages do. We explain how the compilation of these informal languages differs from the compilation of more conventional computer languages. In particular, such compilation requires precise pointer analysis, requires domain-specific information about the library's semantics, and requires a configurable compilation scheme. Finally, we show that the combination of dataflow analysis and pattern matching forms a powerful tool for performing configurable analysis and transformations.
1 Introduction

High performance computing, and scientific computing in particular, relies heavily on software libraries. Libraries are attractive because they provide an easy mechanism for reusing code. Moreover, each library typically encapsulates a particular domain of expertise, such as graphics or linear algebra, and the use of such libraries allows programmers to think at a higher level of abstraction. In many ways, libraries are simply informal domain-specific languages that use only procedural interfaces. The procedural interface is significant because it couches these informal languages in a familiar form that is less imposing than a new computer language, which typically introduces its own new syntax and semantics.

Unfortunately, while these libraries are not viewed as languages by users, they also are not viewed as languages by compilers. With a few exceptions, compilers treat each invocation of a library routine the same as any other procedure call. Thus, many optimization opportunities are lost because the semantics of these informal languages are ignored. As a trivial example, an invocation of the C standard math library's exponentiation function, pow(a,b), can be simplified to the value 1 by constant folding when its second argument evaluates to 0.

This paper argues that there are many such opportunities for optimization, if only compilers could be made aware of a library's semantics. These optimizations, which we term library-level optimizations, include choosing specialized library routines in favor of more general ones, eliminating unnecessary library calls, moving library calls around, and customizing the implementation of a library routine for a particular call site.

Figure 1 shows our system architecture for performing library-level optimizations [23]. In this approach, the annotations capture semantic information about library routines. These annotations are provided by a library expert and placed in a separate file that accompanies the usual header files and source code. This information is read by our compiler, dubbed the Broadway compiler, which performs source-to-source library-level optimizations that transform both the library source and the application source. The resulting integrated system of library and application code is then compiled and linked using conventional tools.

This system architecture offers three practical benefits. First, because the annotations are specified in a separate file from the library source, our approach applies to existing libraries and existing applications. Second, the annotations describe the library, not the application, so the application programmer does nothing more than use the Broadway Compiler in place of a standard C compiler. Finally, the non-trivial cost of writing the library annotations can be amortized over many applications.

This system architecture also provides an important conceptual benefit in terms of separation of concerns. The compiler encapsulates all compiler analysis and optimization machinery, but no library-specific knowledge.
[Figure 1 diagram: the library annotations, header files, and library source code, together with the application source code, feed the Broadway Compiler, which produces integrated and optimized source code.]
Figure 1: Architecture of the Broadway Compiler system

Meanwhile, the annotations describe domain expertise and library knowledge. Together, the annotations and compiler free the applications programmer to focus on application design rather than on embedding library-level optimizations in the source code.

The annotation language faces two competing goals. To keep the annotations simple, the language needs to have a small set of useful constructs that apply to a wide range of software libraries. At the same time, to provide power, the language has to convey sufficient information for the Broadway compiler to perform a wide range of optimizations. The remainder of this paper describes how we have designed a compiler and annotation language that addresses these goals.

This paper makes the following contributions. (1) We demonstrate the potential opportunities and pitfalls of library-level optimization. (2) We identify common classes of library-level optimizations, and compare them to conventional optimizations. (3) We show how these common classes of optimizations can be performed by combining two well-understood technologies: dataflow analysis and pattern matching. (4) We describe extensions to our earlier annotation language [23], which conveys semantic information to a library-level optimizer, and we argue that these extensions are sufficient to handle a large variety of library-level optimizations.
2 Opportunities

In this section we present a number of examples demonstrating the potential opportunities and benefits of library-level optimization. Our approach is to look at how traditional optimizers improve code and to extend these techniques to libraries. We use four different libraries to show the spectrum of possible optimizations and to give a sense of the challenges we face in automating them. In subsequent sections we discuss these challenges in detail, and show how our solution addresses them.

Traditional compilers employ a number of general techniques to improve program performance. While not an exhaustive taxonomy, the following list covers most of the common optimizations:
- Eliminate redundant computations. These include partial redundancy elimination, common subexpression elimination, loop-invariant code motion, and value numbering.

- Perform computations at compile-time. This may be as simple as constant folding or as complex as partial evaluation.

- Exploit special cases. Examples include algebraic identities and simplifications, as well as strength reductions.

- Schedule code. Take advantage of non-blocking loads and asynchronous I/O operations to hide the cost of long-latency operations.

- Enable other improvements. Use code transformations such as inlining, cloning, loop transformations, and lowering the internal representation.

These same categories form the basis for our library-level optimization strategy. In some cases, the library-level optimizations are identical to their classical counterparts. In other cases we need to reformulate the optimizations for the unique requirements of libraries. However, we will show that in all cases a human must provide some knowledge of the semantics of the library, as a compiler cannot automatically discover the effective optimizations for an arbitrary library.
2.1 C Standard Math Library

The standard math library extends the double-precision capabilities of C to include a variety of higher-level math functions. However, unlike the built-in arithmetic operators, these extensions receive no support in terms of compiler optimization. The code fragments in Figure 2 illustrate the untapped opportunities. A conventional optimizing compiler will discover three optimization opportunities on the built-in operators: (1) strength reduction on the computation of d1, (2) loop-invariant code motion on the computation of d3, and (3) replacement of division with bitwise right-shift in the computation of int1. The resulting optimized code is shown in the middle fragment. However, there are three analogous optimization opportunities on the math library computations that a conventional compiler will not discover: (4) strength reduction of the power operator in the computation of d2, (5) loop-invariant code motion on the cosine operator in the computation of d4, and (6) replacement of sine divided by cosine with tangent in the computation of d5. The code fragment on the right shows the result of applying these optimizations.
[Figure 2 shows three code fragments: the original loop computing d1 through d5 and int1 (including d5 = sin(y)/cos(y)), the conventionally optimized version, and the library-level optimized version, in which d1 = 0.0, d2 = 1.0, d3 = 1.0/z, d4 = cos(z), and d5 = tan(y) are hoisted out of the loop and int1 is computed with a right shift (>> 2). Labels 4, 5, and 6 in the figure mark the library-level opportunities.]
Figure 2: A conventional compiler will optimize built-in math operators, but not math library operators.

Notice that each application of a library-level optimization is likely to yield much greater performance improvement than the analogous conventional optimization. For example, removing an unnecessary multiplication may save a handful of cycles, while removing an unnecessary cosine computation may save hundreds or thousands of cycles.
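To make these rewrites concrete, the following fragment is an illustrative sketch in the spirit of Figure 2; the operands of d1, d2, and int1 are assumptions, since the figure's exact code is not reproduced here:

    /* Before: all six computations sit inside the loop. */
    for (i = 1; i < n; i++) {
        d1 = x * 0.0;            /* (1) multiply by zero              */
        d2 = pow(x, 0.0);        /* (4) power with constant exponent  */
        d3 = 1.0 / z;            /* (2) loop-invariant division       */
        d4 = cos(z);             /* (5) loop-invariant cosine         */
        d5 = sin(y) / cos(y);    /* (6) sine divided by cosine        */
        int1 = j / 4;            /* (3) division by a power of two    */
    }

    /* After library-level optimization: invariant and foldable
       computations are hoisted or simplified. */
    d1 = 0.0;
    d2 = 1.0;
    d3 = 1.0 / z;
    d4 = cos(z);
    d5 = tan(y);
    for (i = 1; i < n; i++) {
        int1 = j >> 2;
    }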
2.2 GNU Multiple Precision Library

The GNU Multiple Precision Library (GMP) [20] provides numerical operators for arbitrary precision integer, rational and floating-point arithmetic. Like the standard math library, the GMP library could benefit from simple optimizations such as strength reduction, code motion, and algebraic identities. However, an interesting property of the library is that it has two distinct interfaces: (1) a basic interface, which provides high-level, user-oriented functions, and (2) an advanced interface, which provides fast, low-level functions. The following excerpt from the GMP documentation concisely expresses the philosophy behind the different interfaces:

    The mpn functions [the advanced interface] are designed to be as fast as possible, not to provide a coherent calling interface. The different functions have somewhat similar interfaces, but there are variations that make them hard to use. These functions do as little as possible apart from the real multiple precision computation, so that no time is spent on things that not all callers need.

Compiler writers take a similar approach to the design and processing of intermediate code representations. The compiler refines an input program from the initial abstract syntax tree, through successively lower levels of abstraction. Each level has optimizations that work best with those abstractions. Once the opportunities at one level are exhausted, compilation moves to the next lower level, where new opportunities are often revealed. For example, having explicit array subscript operators facilitates array dependence analysis; subsequent lowering of the subscript operator often allows CSE and code motion on the resulting address computations. Note that a programmer could avoid using the subscript operator altogether by manually coding the array accesses with pointer arithmetic. While some performance improvement may result, such programming is tedious and error-prone without the convenient array abstraction. In addition, such programming makes array dependence analysis practically impossible.

The same principle applies to GMP and to many other libraries that have both high-level and low-level interfaces. We can view the high-level interface as the programmer's interface, and the low-level interface as the compiler target. Our goal in this situation is to automatically convert a high-level program into an efficient, low-level program. This allows the programmer to focus on clean design, while the compiler takes care of performance. Figure 3 depicts the overall approach.
                       High-level            Low-level
    Array subscript    A[i][j]               *(A + i*__lda + j)
    GMP arithmetic     mpz_, mpq_, mpf_      mpn_
Figure 3: Automatic refinement works for libraries as well as built-in operators.
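As a concrete illustration (not drawn from the paper), the fragment below contrasts the two GMP interfaces using standard GMP calls; the operand values and limb counts are arbitrary:

    #include <gmp.h>

    void gmp_interfaces_sketch(void)
    {
        /* Basic (mpz_) interface: the programmer's view.  Sign, size, and
           storage management are handled by the library. */
        mpz_t a, b, c;
        mpz_init_set_ui(a, 12345);
        mpz_init_set_ui(b, 67890);
        mpz_init(c);
        mpz_add(c, a, b);                       /* c = a + b */
        mpz_clear(a); mpz_clear(b); mpz_clear(c);

        /* Advanced (mpn_) interface: a natural compiler target.  The caller
           manages raw limb arrays of equal length; mpn_add_n returns the
           carry out of the most significant limb. */
        mp_limb_t x[2] = {12345, 0}, y[2] = {67890, 0}, z[2];
        mp_limb_t carry = mpn_add_n(z, x, y, 2);
        (void) carry;
    }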
2.3 MPI

The Message Passing Interface (MPI) [14] has become a widely used standard for distributed-memory multiprocessor programming. Unfortunately, it can be challenging to obtain good performance with MPI because MPI provides so many different ways to send and receive data. In particular, there are twelve different ways (or “modes”) to perform point-to-point communication. The following table shows the combinations; the horizontal axis selects a handshaking protocol, while the vertical axis indicates the effect on the calling process.

                   Default          Synchronous       Ready             Buffered
    Blocking       MPI_Send         MPI_Ssend         MPI_Rsend         MPI_Bsend
    Nonblocking    MPI_Isend        MPI_Issend        MPI_Irsend        MPI_Ibsend
    Persistent     MPI_Send_init    MPI_Ssend_init    MPI_Rsend_init    MPI_Bsend_init
The proliferation of communication modes presents the programmer with several problems. First, different modes can perform differently depending on the underlying parallel architecture [8]. For example, non-blocking send performs poorly on machines in which the CPU is responsible for communication. Second, some modes are only effective in certain programming situations. For example, even with the proper hardware, non-blocking send is useless unless there are computations that can go on concurrently. Third, some modes are difficult to use correctly. For example, “ready” send requires that the corresponding receive already be posted by the destination process; otherwise the behavior is undefined.

Figure 4 shows the steps required to turn a simple send into a non-blocking send. (In this example, the arguments to the MPI functions have been simplified.) First, we use refinement to replace the blocking send with a non-blocking send and a corresponding wait. Second, we reschedule the functions to allow independent computation to occur during the communication.

    Original:
        while (c) {
            compute_on(data);
            MPI_Send(data, dest_id);
            compute_on(other_data);
            MPI_Recv(data, src_id);
        }

    Non-blocking:
        while (c) {
            compute_on(data);
            MPI_Isend(data, dest_id, &req_handle);
            MPI_Wait(req_handle);
            compute_on(other_data);
            MPI_Recv(data, src_id);
        }

    Schedule concurrency:
        while (c) {
            compute_on(data);
            MPI_Isend(data, dest_id, &req_handle);
            compute_on(other_data);
            MPI_Wait(req_handle);
            MPI_Recv(data, src_id);
        }
Figure 4: Automatically convert blocking to non-blocking, and overlap communication and computation.

Current optimizing compilers achieve similar effects when scheduling instructions for superscalar, pipelined, or out-of-order execution processors. Instructions are reordered to increase utilization of the execution units and pipeline stages, while still respecting the dataflow dependences of the program. In MPI, the simple communication modes compose the programmer's view, in which messages have simple blocking semantics. The advanced modes represent the underlying communication architecture, which exposes the non-blocking modes and the explicit hand-shaking algorithms. Ideally, the compiler could automatically convert a program that uses the simple communication modes into one that exploits the more complex and efficient modes. The compiler would identify the opportunities to employ the complex functions and would ensure their correct use. As with GMP, we view the basic functions as the programmer's interface and the advanced functions as the compiler target. In addition, the compiler should schedule the MPI calls to maximize concurrency. [1]

[1] Note that some MPI optimizations are beyond the scope of our current annotation language. In particular, some optimizations, including the ready-
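For reference, the fragment below spells out what the simplified calls in Figure 4 elide. It is a sketch only; the count, tag, source, and destination are assumed parameters, and compute_on stands for the application routine from the figure:

    #include <mpi.h>

    extern void compute_on(double *buf);

    /* Overlap one send with independent computation, using the full MPI
       argument lists that Figure 4 simplifies away. */
    void overlapped_exchange(double *data, double *other_data, int count,
                             int dest, int src, int tag)
    {
        MPI_Request req;
        MPI_Status  status;

        MPI_Isend(data, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
        compute_on(other_data);          /* independent work proceeds       */
        MPI_Wait(&req, &status);         /* complete the non-blocking send  */
        MPI_Recv(data, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &status);
    }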
The publicly available MPICH [22] implementation of MPI provides a third-level interface that offers additional opportunities. It has an internal interface, the MPI Abstract Device Interface (ADI), which isolates device-dependent code from device-independent code [31]. The ADI is a simple, high-performance layer used to implement the standard MPI interface. Not only are these functions low-level and difficult to use, but programmers should not access them directly because they are not part of the public MPI interface. However, we can view the ADI as a compiler target, which allows enhancements that aren't possible at the standard interface level. By automating this program refinement, we maintain ease of programming and preserve the standard interface.
2.4 PLAPACK

The PLAPACK library is a set of routines for coding parallel linear algebra algorithms in C or Fortran [32]. PLAPACK consists of parallel versions of the same kernel routines found in the BLAS [15] and LAPACK [1]. At the highest level, it provides an interface that hides much of the parallelism from the programmer. PLAPACK provides abstractions which can be useful for performing optimizations. For example, PLAPACK programs manipulate linear algebra objects indirectly through handles called views. A view consists of data, possibly distributed across processors, and an index range that selects some or all of the data. A typical algorithm operates by partitioning the views and working on one small piece at a time. While most PLAPACK procedures are designed to accept any type of view, the actual parameters often have special distributions. Recognizing and exploiting these special distributions can yield significant performance gains [3].
(1) Determine the best partition size:
    while (1) {
        PLA_Obj_split_size( A, PLA_SIDE_LEFT, &size_l, &tmp );
        PLA_Obj_split_size( A, PLA_SIDE_TOP,  &size_t, &tmp );
        size = min3( nb, size_l, size_t );
        if (size == 0) break;
(2) Logically partition A:
PLA_Obj_split_4( A, size, size, &A11, &A12, &A21, &A22);
(3) Factor the small piece A11:
Factor(A11);
(4) Solve A21:

    procedure PLA_Obj_horz_split_2( obj, height, upper, lower )
    {
        on_entry { obj --> __view_1,
                   DATA of __view_1 --> __data }
        access   { __view_1, height }
        modify   { }
        on_exit  { upper --> new __view_2, DATA of __view_2 --> __data,
                   lower --> new __view_3, DATA of __view_3 --> __data }
    }
Figure 8: Annotations for pointer structure and dataflow dependences.

In some cases, a modern compiler could derive this information automatically from the library source code. However, there are situations when this is undesirable, infeasible, or even impossible. Many libraries encapsulate functionality for which no source code is available, such as low-level I/O or interprocess communication. Even with the source code, it is easier to provide the information declaratively, especially if it is well known. Finally, we may want to model abstract relationships that are not explicitly represented in the library implementation. For example, a file descriptor logically refers to a file on disk through a series of operating system data structures. Using annotations, we can explicitly represent the file and its relationship to the descriptor, which might make it possible to recognize when two descriptors access the same file.

3.3.1 Pointer Annotations: on_entry and on_exit

The on_entry and on_exit annotations convey the effects of the procedure on pointer-based data structures by describing the pointer configuration before and after execution. Each annotation contains a list of expressions of the following form:

    [ label of ] identifier --> [ new ] identifier

The binary --> operator indicates that the object named by the identifier on the left points to the object named on the right. An object can point to multiple targets by using the optional label. In the on_entry annotations, these expressions describe the state of the incoming arguments, and give names to the internal objects. In Figure 8, the on_entry annotation has two pointer expressions. The first indicates that the formal parameter obj is a pointer and assigns the name __view_1 to the target of the pointer. The second says that __view_1 itself is a pointer and assigns the name __data to its target.

In the on_exit annotations, the expressions can create new objects (using the new keyword), and can alter the relationships between the existing objects. In Figure 8, the on_exit annotation declares that the split procedure creates two new objects, __view_2 and __view_3, and that the two parameters upper and lower point to the new objects. In addition, the two new objects point to the same data object __data from the input. This annotation describes the configuration depicted earlier in Figure 6.
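As a further illustration of this syntax, consider a hypothetical routine lib_alias(src, dst_ptr), invented here rather than taken from the paper, that makes the object reached through dst_ptr refer to the same underlying data as src. Using only the constructs described above, its annotation might read:

    procedure lib_alias( src, dst_ptr )
    {
        on_entry { src --> __obj_1,
                   DATA of __obj_1 --> __data }
        access   { __obj_1 }
        modify   { }
        on_exit  { dst_ptr --> new __obj_2,
                   DATA of __obj_2 --> __data }
    }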
3.3.2 Dataflow Annotations: access and modify

The access and modify annotations declare the objects that the procedure accesses or modifies. These annotations may refer to formal parameters, or to any of the internal objects introduced by the pointer annotations. The annotations in Figure 8 show that the procedure uses the height argument and reads the input view __view_1. In addition, we automatically add the accesses and modifications implied by the pointer annotations: a dereference of a pointer is an access, and setting a new target is a modification.
3.4 Using Dependence Analysis

Simply having the proper dependence analysis information for library routines enables several classical optimizations. By separating accesses from modifications, we can identify library calls that are dead code, and remove them. The compiler can also identify purely functional routines by looking at whether the objects accessed are different from those that are modified. Loop-invariant code motion and common subexpression elimination can then process these routines as they would any other operators. However, as described in the next section, most of the library-level optimizations require more than just dependence information.
4 Library-Level Optimizations

In our system, library-level optimizations are made possible by combining two tools: dataflow analysis and pattern-based transformations. Both tools are configurable, so that each library can specify its own analyses and transformations. While our formulations of the tools are not especially sophisticated, they work together in a complementary fashion to enable powerful optimizations. Patterns work well for identifying local syntactic properties and for specifying transformations, but cannot easily express properties about the context within the overall program. Consider the following fragment from an application program:

    PLA_Obj_horz_split_2( A, size, &A_upper, &A_lower);
    ...
    if ( is_Local(A_upper) ) {
        ...
    } else {
        ...
    }
The use of PLA_Obj_horz_split_2 ensures that A_upper resides locally on a single processor. Therefore, we can simplify the subsequent if statement, replacing it with the then-branch using the following pattern transformation:

    if ( cond ) thenbody else elsebody
        replace-with   thenbody
        when           cond == true
The transformation depends on two conditions that are not captured in the pattern. First, we need to know that the is_Local function does not have any side-effects. Second, we need to track the library-specific local property of A_upper through any intervening statements to make sure that it is not invalidated. It would be awkward to use patterns to express this condition; it requires a complex type of pattern capable of matching a sequence of statements that do not modify A_upper and that do not have any incoming control-flow. Our program analysis framework fills in the missing capabilities, giving the patterns access to globally derived information such as dataflow dependences and control-flow information. We use abstract interpretation to further extend these capabilities, allowing each library to specify its own dataflow analysis problems, such as matrix distribution analysis. The combined system allows the annotations to exploit the best attributes of each tool.

The following sequence outlines our process for library-level optimization:

1. Pointers and dependences. Analyze the program using combined pointer and dependence analysis, referring to the annotations when necessary to incorporate library behavior.

2. Classical optimizations. Apply the traditional optimizations that rely on dependence information only.

3. Abstract interpretation. Use the dataflow analysis framework to derive library-specific properties as specified by the annotations.

4. Patterns. Search the program for possible optimization opportunities, which are specified in the annotations as syntactic patterns.
5. Enabling conditions. For patterns that match, the annotations may specify additional constraints. The constraints are expressed in terms of dataflow dependences and the results of the abstract interpretation.

6. Actions. When a pattern satisfies all the constraints, perform the specified code transformation.

This process supports a variety of library-level optimizations for a variety of different libraries. In the next part of this section we describe the mechanisms in more detail and show how we can use them to take advantage of the opportunities mentioned in Section 2. Finally, we suggest the annotations needed to specify the optimizations.
4.1 Mechanisms

Both dataflow analysis and pattern matching have a wide range of realizations, from simple to complex. By combining them, we can keep each individual tool relatively simple. The patterns need not capture complex context-dependent information because it is better expressed by the program analysis framework. This also keeps the annotation language itself simple, making it easier to express optimizations and easier to get them right.

4.1.1 Dataflow Analysis

In Section 3 we showed how annotations can integrate library functions into pointer and dependence analysis. For library-specific analysis, we provide a more general abstract interpretation framework. Our version of abstract interpretation allows the annotations to specify the flow value, which represents the library-specific program state, and the transfer functions, which describe how the library functions affect that state. The compiler reads the specification and runs the given problem on the general-purpose dataflow framework. The framework takes care of propagating the flow values through the program graph, handling control-flow such as conditionals and loops, and testing for convergence. Once complete, it stores the final, stable flow values for each program point.

We restrict the possible flow values to a small but useful set of options: (1) a set of objects, (2) an assignment of library-specific properties to objects, or (3) pairs of objects. The term object refers to any storage represented in the program analyzer, such as a local variable, a global variable, or a heap-allocated data structure. The properties are simply names (like a C enumerated type) intended to capture the possible abstract states of individual objects as they are processed by the program. The pairs of objects can maintain relations between objects, such as equivalent or disjoint matrix views.

Each library function annotation provides a transfer function for the various abstract interpretation passes needed by the library. The transfer functions are specified as a case analysis on the incoming flow value. Each case consists of a condition, which tests the incoming flow value, and a consequence, which sets the outgoing flow value. The conditions are written in terms of the particular kind of flow value (sets, mappings or pairs).

    procedure PLA_Obj_horz_split_2( obj, height, upper, lower )
    {
        on_entry { obj --> __view_1,
                   DATA of __view_1 --> __data }
        access   { __view_1, height }
        modify   { }
        analyze Distribution {
            (__view_1 == General)   =>  __view_2 = RowPanel, __view_3 = General;
            (__view_1 == RowPanel)  =>  __view_2 = RowPanel, __view_3 = RowPanel;
            (__view_1 == ColPanel)  =>  __view_2 = Local,    __view_3 = ColPanel;
            (__view_1 == Local)     =>  __view_2 = Local,    __view_3 = Empty;
        }
        on_exit  { upper --> new __view_2, DATA of __view_2 --> __data,
                   lower --> new __view_3, DATA of __view_3 --> __data }
    }
Figure 9: Annotations for matrix distribution analysis.

Figure 9 shows the annotations that define the matrix distribution transfer function for PLA_Obj_horz_split_2. The analyze keyword indicates the property to which the transfer function applies. The possible values of the property are General, ColPanel, RowPanel, Local and Empty. We integrate the transfer function with the dependence annotations because we need to refer to the underlying structures. Distribution is a property of the views, not the surface variables. Notice the last case: if we deduce that a particular view is Empty, we can remove any code that computes on that view.
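To make the case analysis concrete, here is a minimal sketch, not the Broadway compiler's actual implementation, of how the Distribution transfer function of Figure 9 could be represented once the annotation has been read; the type and function names are invented for illustration:

    /* Flow value: the abstract distribution state of a single view. */
    typedef enum { GENERAL, ROWPANEL, COLPANEL, LOCAL, EMPTY } Distribution;

    typedef struct {
        Distribution upper;   /* outgoing value for __view_2 */
        Distribution lower;   /* outgoing value for __view_3 */
    } SplitResult;

    /* Transfer function for PLA_Obj_horz_split_2: a case analysis on the
       incoming flow value of __view_1, mirroring the analyze clause. */
    static SplitResult horz_split_transfer(Distribution view1)
    {
        SplitResult r;
        switch (view1) {
        case GENERAL:  r.upper = ROWPANEL; r.lower = GENERAL;  break;
        case ROWPANEL: r.upper = ROWPANEL; r.lower = ROWPANEL; break;
        case COLPANEL: r.upper = LOCAL;    r.lower = COLPANEL; break;
        case LOCAL:    r.upper = LOCAL;    r.lower = EMPTY;    break;
        default:       r.upper = GENERAL;  r.lower = GENERAL;  break;
        }
        return r;
    }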
4.1.2 Pattern-Based Transformations

Our pattern-based transformations consist of three parts: a pattern, which describes the target syntax, enabling conditions, which must be satisfied, and an action, which specifies modifications to the code. The pattern is simply a code fragment that acts as a template, with special meta-variables that behave as “wildcards”. The enabling conditions can then test different properties of the matching application code, including dataflow dependences, control-flow context, and abstract interpretation results. The compiler evaluates the conditions using information from program analyses. The actions can specify several different kinds of code transformations, including moving, removing, or substituting the matching code.

The pattern consists of a C code fragment with meta-variables that bind to different components in the application code. The meta-variables have the following types that require the matching application code to have particular properties:

    Type         Syntax           Description
    Identifier   ${id:name}       Matches any identifier
    Constant     ${const:name}    Matches any constant value
    Expression   ${expr:name}     Matches any expression
    Statement    ${stmt:name}     Matches any statement
    Object       ${obj:name}      Matches any expression that evaluates to a
                                  specific storage location
The Object meta-type uses the pointer analysis results to allow uniform identification of storage locations, regardless of the specific syntax used to access them. For example, in the following code fragment, the Object meta-type would properly identify *p and x as two names for the same storage, thus enabling the conversion to tan(x):

    double x;
    double * p = &x;
    double z = sin(x)/cos(*p);    /* Same as sin(x)/cos(x) => tan(x) */
When a pattern matches, the compiler binds each meta-variable to the corresponding piece of code in the application. The annotations can then refer to those components in the enabling conditions and code transformations. The enabling conditions are boolean expressions constructed from predicates on the meta-variables. We have identified a small number of fundamental operators and predicates that access program analysis information:

- Refer to internal objects by dereferencing pointers.

- Refer to dataflow dependent computations such as past assignments (reaching definitions) and future uses.

- Test the abstract interpretation results.

- Test the dataflow dependences of objects, such as the presence of side-effects, and control-dependence equivalence.

- Test the control-flow context, such as inside a loop or recursion.

When the enabling conditions are satisfied, the compiler invokes the action, which actually transforms the application code. The actions always operate relative to the site of the matching code. Our actions are primarily declarative, with a few crucial procedural capabilities. The basic actions can (1) remove the matching code, (2) replace the matching code, (3) move the code, (4) insert entirely new code, or (5) trigger one of the enabling transformations such as inlining or loop unrolling.

We use code templates much like the patterns to specify new code or replacement code. Here, the compiler expands the embedded meta-variables, replacing them with the actual code bound to them. We also support queries on the meta-variables, such as the C datatype of the binding. This allows us to declare new variables that have the same type as existing variables. For the Object meta-type, we can choose to expand the meta-variable back to its original code, such as *p in the example above, or to the actual objects, such as x.

When moving or inserting new code, the annotations support a variety of useful positional indicators that describe where to make the changes relative to the site of the matching code. For example, the earliest possible point and the latest possible point are defined by the dependences between the matching code and its surrounding context. Using these indicators, we can perform the MPI scheduling described in Section 2: move the MPI_Isend to the earliest point and the MPI_Wait to the latest point. Other positional indicators might include enclosing loop headers or footers, and the locations of reaching definitions or uses.

Figure 10 demonstrates some of the annotations that use pattern-based transformations to optimize the examples presented in this paper.
    pattern expr{ sin(${obj:x}) / cos(${obj:x}) }expr
    {
        replace expr{ tan($x) }expr
    }

    pattern stmt{ MPI_Isend( ${obj:buffer}, ${expr:dest}, ${obj:req_ptr}) }stmt
    {
        move @earliest;
    }

    pattern stmt{ PLA_Obj_horz_split_2( ${obj:A}, ${expr:size},
                                        ${obj:upper_ptr}, ${obj:lower_ptr}) }stmt
    {
        on_entry { A --> __view_1, DATA of __view_1 --> __data }

        when (Distribution of __view_1 == Empty)
            remove;

        when (Distribution of __view_1 == Local)
            replace stmt{ PLA_Obj_view_all($A, $upper_ptr); }stmt
    }
Figure 10: Example annotations for pattern-based transformations.
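As an illustration of the third pattern's effect (the application fragment below is assumed, not taken from the paper), when the analysis of Section 4.1.1 proves that the view behind A is Local, the matching call is rewritten:

    /* Before: the application splits A into an upper and a lower view. */
    PLA_Obj_horz_split_2( A, size, &A_upper, &A_lower );

    /* After: since the view is Local, the lower piece is Empty, and the
       split is replaced by a view of the whole object, as specified by the
       replace action in the pattern. */
    PLA_Obj_view_all( A, &A_upper );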
5 Related Work

Our work builds upon the huge body of work in static analysis and optimization, including classical optimizations [28], partial evaluation [5, 12], abstract interpretation [13, 25], and pattern matching. Our work attempts to extend these approaches to work for library-level optimizations.

Our work is not the first to use annotations to convey information to compilers. For example, DyC [21] uses annotations to guide dynamic compilation, and other annotations, such as HPF's data distribution directives [24], are complex enough that they effectively redefine the language. Still other compilers use hints or pragmas to guide optimizations such as inlining and loop unrolling, and to indicate that certain procedures are pure functions. To our knowledge, none of these hints or pragmas allows information to be provided in a configurable manner that can convey domain-specific information.

One approach to optimizing libraries is to provide special support from within a conventional compiler. For example, semantic expansion recognizes specific class libraries, such as a particular complex number or array library, essentially extending the language to include these libraries [2]. Similarly, some C compilers recognize invocations of malloc() when performing pointer analysis. Our goal is to provide compiler support in a configurable manner so that it can apply to many libraries, not just a favored few.

There are many other approaches to specializing libraries. Vandevoorde uses formal semantics and inference capabilities to specialize procedure implementations [33]. However, complete axiomatic theories are difficult to write and do not exist for many domains. In addition, this approach depends on theorem provers, which are computationally intensive and only partially automated. Meta-programming systems such as meta-object protocols [10], programmable syntax macros [34], and the Magik compiler [17] can also be used to customize library implementations, as well as to extend language semantics and syntax. While these techniques can be quite powerful, they require users to manipulate ASTs and other compiler internals without the availability of dataflow information.

Automatically generated libraries, such as PHiPAC [6], ATLAS [35] and FFTW [18], have been successful at generating efficient machine-specific implementations for specific domains, namely, dense linear algebra and the Fast Fourier Transform. None of these attempts to provide techniques that apply to other domains. Similarly, some manually written libraries attempt to provide flexibility to choose from amongst different algorithms based on input values to library routines [4, 7], but none of these approaches support the transformational capabilities that compilers provide.

Software generators [30] have also been used to build libraries. These systems share many of the goals of our work, but take a different approach. First, generators provide sophisticated transformations of high-level language constructs, but most generators do not support dataflow analysis or the optimizations enabled by it. Second, generators often introduce new programming language syntax. Our approach works within conventional programming languages and development paradigms, and even works for legacy systems. We believe that our technique is complementary to software generators,
providing the high-level optimization passes that many generators lack.
6 Conclusions

This paper has outlined the various challenges and possibilities for performing library-level optimizations. In particular, we have argued that such optimizations require precise pointer analysis, domain-specific information, and a configurable compilation scheme. We have also shown how configurable dataflow analysis and pattern matching can combine to form a powerful tool for performing library-level optimizations.

A large portion of our Broadway compiler is currently complete. We have implemented a flow- and context-sensitive pointer analysis, a configurable abstract interpretation pass, and the basic annotation language [23] without pattern matching. Experiments with this basic configuration have shown that significant performance improvements are possible for applications written in the PLAPACK parallel linear algebra library. One common routine, PLA_Trsm(), was customized to improve its performance by a factor of three, yielding overall speedups of 26% for one application (Cholesky factorization) and 9.5% for another (a Lyapunov program) [23].

While we believe there is much promise for library-level optimizations, several open issues remain. We are in the process of defining the details of our annotation language extensions for pattern matching, and we are implementing its associated pattern matcher. Finally, we need to evaluate the limits of our scheme, and of our use of abstract interpretation and pattern matching in particular, with respect to both optimization capabilities and ease of use.
References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, second edition, 1995.

[2] P. V. Artigas, M. Gupta, S. P. Midkiff, and J. E. Moreira. High performance numerical computing in Java: language and compiler issues. In Workshop on Languages and Compilers for Parallel Computing, 1999.

[3] G. Baker, J. Gunnels, G. Morrow, B. Riviere, and R. van de Geijn. PLAPACK: high performance through high level abstractions. In Proceedings of the International Conference on Parallel Processing, 1998.

[4] M. Barnett, S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor collective communication library. In Proceedings of Supercomputing '94, November 1994.

[5] A. Berlin and D. Weise. Compiling scientific programs using partial evaluation. IEEE Computer, 23(12):23–37, December 1990.

[6] Jeff Bilmes, Krste Asanović, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.

[7] Eric A. Brewer. High-level optimization via automated statistical modeling. In Proceedings of the Fifth Symposium on Principles of Parallel Programming, July 1995.

[8] Bradford Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby. The case for high level parallel programming in ZPL. IEEE Computational Science and Engineering, 5(3):76–86, July–September 1998.

[9] David R. Chase, Mark Wegman, and F. Kenneth Zadeck. Analysis of pointers and structures. ACM SIGPLAN Notices, 25(6):296–310, June 1990.

[10] S. Chiba. A metaobject protocol for C++. In Proceedings of the Conference on Object Oriented Programming Systems, Languages and Applications, pages 285–299, October 1995.

[11] Jong-Deok Choi, Michael Burke, and Paul Carini. Efficient flow-sensitive interprocedural computation of pointer-induced aliases and side effects. In Conference Record of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Charleston, South Carolina, January 10–13, 1993, pages 232–245. ACM Press, New York, NY, 1993.

[12] Charles Consel and Olivier Danvy. Tutorial notes on partial evaluation. In Proceedings of the 1993 ACM Symposium on Principles of Programming Languages, pages 493–501, Charleston, South Carolina, 1993.

[13] Patrick Cousot and Radhia Cousot. Abstract interpretation frameworks. Journal of Logic and Computation, 2(4):511–547, August 1992.

[14] Jack J. Dongarra, Steve W. Otto, Marc Snir, and David Walker. An introduction to the MPI standard. Communications of the ACM, to appear.
[15] J. J. Dongarra, I. Duff, J. Du Croz, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–28, 1990.

[16] Maryam Emami, Rakesh Ghiya, and Laurie J. Hendren. Context-sensitive interprocedural points-to analysis in the presence of function pointers. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 242–256, Orlando, Florida, June 20–24, 1994.

[17] Dawson R. Engler. Incorporating application semantics and control into compilation. In Proceedings of the Conference on Domain-Specific Languages (DSL-97), pages 103–118, Berkeley, October 15–17, 1997. USENIX Association.

[18] Matteo Frigo and Stephen G. Johnson. An adaptive software architecture for the FFT. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1381–1384, 1998.

[19] Rakesh Ghiya and Laurie J. Hendren. Connection analysis: A practical interprocedural heap analysis for C. International Journal of Parallel Programming, 24(6):547–578, December 1996.

[20] Torbjorn Granlund. The GNU Multiple Precision Arithmetic Library. Free Software Foundation, April 1996.

[21] B. Grant, M. Philipose, M. Mock, C. Chambers, and S. J. Eggers. An evaluation of staged run-time optimizations in DyC. In SIGPLAN Conference on Programming Language Design and Implementation, pages 223–233, 1999.

[22] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, 1996.

[23] Samuel Z. Guyer and Calvin Lin. An annotation language for optimizing software libraries. In Second Conference on Domain Specific Languages, pages 39–52, October 1999.

[24] High Performance Fortran Forum. High Performance Fortran Specification. November 1994.

[25] Neil D. Jones and Flemming Nielson. Abstract interpretation: a semantics-based tool for program analysis. In Handbook of Logic in Computer Science, pages 527–629. Oxford University Press, 1994.

[26] Arvind Krishnamurthy and Katherine Yelick. Optimizing parallel programs with explicit synchronization. In Programming Language Design and Implementation, June 1995.

[27] S. P. Midkiff and D. A. Padua. Issues in the optimization of parallel programs. In Proceedings of the 1990 International Conference on Parallel Processing, volume 2 (Software), pages 105–113, August 1990.

[28] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, San Francisco, CA, 1997.

[29] M. Shapiro and S. Horwitz. The effects of the precision of pointer analysis. In 4th International Static Analysis Symposium, Lecture Notes in Computer Science, Vol. 1302, 1997.

[30] Y. Smaragdakis and D. Batory. Application generators. Encyclopedia of Electrical and Electronics Engineering, Supplement 1, 2000.

[31] Rajeev Thakur, William Gropp, and Ewing Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. Technical Report MCS-P592-0596, Argonne National Laboratory, Mathematics and Computer Science Division, May 1996.

[32] Robert van de Geijn. Using PLAPACK – Parallel Linear Algebra Package. The MIT Press, 1997.

[33] Mark T. Vandevoorde. Exploiting Specifications to Improve Program Performance. PhD thesis, MIT, Department of Electrical Engineering and Computer Science (also MIT/LCS/TR-598), 1994.

[34] Daniel Weise and Roger Crew. Programmable syntax macros. In Proceedings of the Conference on Programming Language Design and Implementation, pages 156–165, June 1993.

[35] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In SC'98, http://www.supercomp.org/sc98/TechPapers/sc98 FullAbstracts/Whaley814/INDEX.HTM, 1998.
[36] Robert P. Wilson. Efficient, Context-sensitive Pointer Analysis for C Programs. PhD thesis, Stanford University, Department of Electrical Engineering, 1997.