Ph.D. Proposal: High-level Optimization for Software Libraries

Samuel Z. Guyer
The University of Texas at Austin
Austin, TX 78712 USA
May 19, 1999

Abstract

We propose a new compiler-based technique for optimizing software libraries and the applications that use them. Our compiler, which we call Broadway, performs a series of source-to-source transformations on both the library and the application to produce an integrated and customized system. The key to this technique is an annotation language that captures high-level information about the semantics of library functions. The annotations are supplied by a library expert, and they allow the compiler to perform library-specific dataflow analysis and optimization. We believe that our technique can overcome some of the performance penalties of using libraries, while preserving their many benefits. Our initial experiments with a prototype Broadway compiler show that this technique can yield significant performance improvements for real applications.

We envision libraries as flexible software components that can easily adapt to their surroundings. Adaptivity allows libraries to simultaneously satisfy the various needs of the developer, the application, and the underlying machine. In conventional software development, these needs are often at odds, creating a tradeoff between usability and performance. In contrast, our technique addresses performance automatically and allows the developer to focus on program design.

1 Introduction

Software libraries are one of the most popular ways to promote modularity and reuse in conventional software systems. As applications become larger and more ambitious, programmers are likely to rely more and more on libraries to manage complexity. While this trend is beneficial for application design and maintenance, it often comes at the expense of performance. The fixed interfaces and hard boundaries imposed by libraries hinder many kinds of performance improvements. Programmers who are interested in performance often end up fighting modularity through a difficult and invasive process of manual optimization. In this research, we propose a compiler-based technique that mitigates this tradeoff, allowing libraries to contribute to both good design and good performance.

Problem

Before discussing the proposed solution in detail, it is worth summarizing how libraries are used in current programming practice, and how this can lead to performance problems. A library is a set of pre-compiled functions that provide a computational service to client applications. The functions are accessed through a well-defined public interface, and together they often implement a particular domain of computation, such as file access or matrix operations. A library will often make use of other libraries, resulting in a layered system. However, the design and implementation of a library typically take place in isolation, independent of any specific target application.

High-performance applications can present a difficult challenge for library designers. A library is frequently reused in many different applications, each with its own performance characteristics. A particular library function that performs well in one situation may perform poorly in another. For example, consider a library function that multiplies two matrices together. The general-purpose implementation, which consists of three nested loops, performs many more computations than are necessary when the input matrices are special cases, such as triangular or symmetric. No single matrix multiply algorithm is ideal in all situations.
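To make the special-case waste concrete, here is a hedged sketch in C; the function names and signatures are ours, not taken from any particular library. The general routine touches every entry of A, while a variant specialized for an upper-triangular A can skip the entries that are known to be zero.

    /* General-purpose multiply: three nested loops over every entry. */
    void MultiplyMatrix(int n, const double A[n][n], const double B[n][n],
                        double C[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    /* Specialized multiply for upper-triangular A: A[i][k] is zero for
     * k < i, so the inner loop starts at k = i, roughly halving the work. */
    void TriangularMultiplyMatrix(int n, const double A[n][n],
                                  const double B[n][n], double C[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                C[i][j] = 0.0;
                for (int k = i; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }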

The problem is compounded in applications with many layers of libraries, causing performance to deteriorate rapidly. Library designers often address this problem by providing several different implementations of the same functionality, each optimized for a specific situation. We call this phenomenon interface bloat because the library becomes bloated with functions that do not differ in their capabilities, but only in their performance characteristics. In the example above, the library designer could provide a separate matrix multiply function for each special case. We refer to the specialized functions collectively as the advanced interface of the library, and to the general-purpose functions as the basic interface.

Interface bloat is not an acceptable solution because it places an unreasonable burden on the application developer. First, the use of specialized functions exposes performance decisions, effectively "hard-wiring" the application to specific performance assumptions. If those assumptions change (by porting, for example), then the application needs to be rewritten. Second, the advanced interface is typically more difficult to use. Some advanced functions have strict constraints on when they can be used, while others may require complicated and invasive code changes. Furthermore, it is up to the application developer to select the appropriate functions from the advanced interface. In order to know which functions are appropriate, the developer needs to understand the performance characteristics of both the application and the library, and how they interact. Finally, all of this work is performed manually, which is tedious and error-prone. The two fundamental technical problems at work are (1) libraries and applications do not cooperate to improve performance, and (2) obtaining good performance from a library can be a difficult and invasive task.

Solution

We propose a new type of compiler that can analyze and optimize the way that an application uses libraries. Our compiler, which we call Broadway, starts with a working application and performs many different code transformations to promote better performance. These include choosing specialized library functions, eliminating unnecessary library calls, moving library calls around, and customizing parts of the library implementation. We call this process library-level optimization, and it allows us to overcome the two fundamental problems described above. First, compiling the library and application together ensures that they will perform well together. Second, automating the optimizations relieves the application developer of the responsibility for this painful task.

The key to our solution is an annotation language that captures information about how to analyze and optimize library functions. The annotations are supplied by a library expert, and accompany the usual header files and object code that compose a library. The Broadway compiler reads the annotations and applies a series of source-to-source transformations to both the library and application source. The result is an integrated system of library and application code, which is ready to be compiled and linked using conventional tools. Figure 1 shows the overall architecture of the system.
A conventional compiler cannot perform library-level optimization because it can only reason about the low-level primitives that are built into the language. While a library often specifies a rich set of high-level operations, a conventional compiler is not aware of their semantics. Even if the compiler has access to the implementation of the library, it cannot derive the properties that are useful for optimization. For example, given the implementation of the matrix multiply function, a compiler cannot be expected to discover that matrix multiply is associative, but not commutative. Given a particular property, the compiler might be able to verify it against the implementation using a theorem prover, but even this is a significant task that cannot be fully automated.

[Figure 1 appears here: the library's annotations, header files, and source code, together with the application source code, flow into the Broadway Compiler, which produces optimized source.]
Figure 1: Architecture of the Broadway Compiler system.

Rather than rely on the compiler to derive properties of the library, we use annotations to convey the information explicitly. The annotations allow our compiler to treat a library as a domain-specific programming language that is embedded in the base language. By analogy, the basic interface serves as the syntax that the programmer uses, while the advanced interface serves as the machine code that the compiler targets. We believe that this approach can yield many of the benefits of domain-specific compilation, without the cost of developing a new compiler for each domain. One of the goals of the annotation language is to allow a library expert, who probably has no compiler background, to easily specify library-level optimizations.

Our approach establishes the right division of responsibility for program performance. A library expert designs the annotations, which form a repository for knowledge about the performance properties of the library. Developing the annotations is a considerable task, but this work is amortized over many applications. The compiler automatically performs the tedious, but mechanical, task of analyzing the whole application and applying the specified optimizations. Thus, the application developer is free to use the basic interface of the library and focus entirely on good program design.

We envision libraries as flexible software components that can easily adapt to their surroundings. Adaptivity allows libraries to satisfy the various needs of the developer, the application, and the underlying machine. In pursuing our vision, we hope to develop ideas about how to build better libraries that lend themselves to our approach. In fact, these ideas may apply to any type of modular software engineering method, including object-oriented programming.

We have obtained encouraging results from experiments using an initial version of the annotation language and a limited prototype Broadway compiler. We have found that the prototype compiler can transform a simple, but inefficient application into a sophisticated one that performs exceptionally well. The experiments have also provided feedback on the annotation language and compiler design. The next phase of this research, which is already under way, is to develop a full-scale compiler built around our revised annotation language. As the implementation becomes available, we can test our approach on a wider variety of applications and libraries.

Contributions

This research makes four primary contributions:


1. Overall approach. We introduce a compiler-based technique that allows libraries to provide both good program design and good program performance. The main idea is to pass semantic information about library operations to the compiler so that it can optimize applications automatically. As a consequence, complex libraries become more usable because the responsibility for library performance is lifted from the programmer.

2. Compiler mechanisms. Library-level optimization appears to require different or modified compiler mechanisms than traditional optimization. A component of this research is to determine which mechanisms are the most useful and how to make them configurable for different libraries. We are currently exploring two directions. First, we want to integrate library functions into the traditional optimization phases, such as partial redundancy elimination, code motion, and dead-code elimination. Second, we provide a configurable abstract interpretation phase for collecting information about how the library is used.

3. Annotation language and evaluation. The annotation language is the key to our approach because it conveys library-specific information that would otherwise not be available to the compiler. The challenge in designing the language is to support the most powerful analysis and optimization mechanisms while maintaining usability. We will evaluate the effectiveness of the language in allowing the library expert to configure the compiler without necessarily understanding, or even being aware of, the underlying compiler mechanisms.

4. Implementation and experiments. We will evaluate the effectiveness of our approach experimentally by building the Broadway compiler and testing it on a variety of applications and libraries. Since learning and annotating a library can be time-consuming, we will choose a representative set of libraries.

The remainder of this proposal is organized as follows. Section 2 summarizes related work. Section 3 presents the technical details of the proposed research. Section 4 describes the completed work and discusses our initial results. Section 5 outlines the plan for completing the project. Appendix A provides details about the current annotation language and examples. Appendix B contains a fragment of the annotation file used to specify the PLAPACK optimizations.

2 Related Work

Our work is related to several areas of compilers and software engineering. Here, we summarize this work and contrast it with our approach.

Software generators [15, 16] and program transformation systems [14] are compilers for domain-specific programming languages. These systems share many of the goals of our work, but take a different approach. First, generators provide sophisticated transformations of high-level language constructs, but most generators do not support dataflow analysis or the optimizations enabled by it. Our approach uses relatively simple transformations, but offers a complete suite of configurable dataflow analysis passes. Second, generators often introduce new programming language syntax. Our approach works within conventional programming languages and development paradigms, and even works for legacy systems.

We believe that our technique is complementary to software generators, providing the high-level optimization passes that many generators lack.

There has been considerable work in formal semantics and formal specifications. In particular, Vandevoorde uses powerful analysis and inference capabilities to specialize procedure implementations [18]. However, complete axiomatic theories are difficult to write and do not exist for many domains. In addition, this approach depends on theorem provers, which are computationally intensive and only partially automated. Our work differs from these primarily in the scope and completeness of our annotations, which describe only specific implementation properties instead of complete behaviors.

Partial evaluation improves performance by specializing routines for specific inputs [4, 5, 7]. The technique combines inlining, constant propagation, and constant folding to evaluate as much of the program as possible at compile time. Our work differs in two important ways. First, our technique supports a higher-level form of specialization that applies to library operations, not just language primitives. Second, our approach can perform optimizations, such as loop-invariant code motion, that cannot be expressed using partial evaluation.

Open and extensible compilers give the programmer complete access to the internal representation of the program [10, 9]. While these systems are quite general, they impose a considerable burden. To use them, the programmer needs to understand (1) general compiler implementation techniques, (2) how to configure the specific compiler they are using, and (3) how to express and execute their optimizations. Similarly, meta-object protocols provide sophisticated mechanisms for modifying the compilation of object-oriented programs [6, 12], but they can be difficult to use. Our compiler limits configurability to a small but powerful set of capabilities, and provides a simple way to access them.

3 Proposed work

This section describes the technical details of the proposed work. It consists of three major components: (1) the annotation language for capturing library-specific information, (2) a set of configurable compiler mechanisms, and (3) a compiler implementation.

3.1 Annotation language

The goal of the annotation language is to convey library-specific information to the compiler in a simple, declarative manner. While it is clear that more sophisticated specifications could support more sophisticated optimizations, our goal is to show that a few simple annotations can enable many useful optimizations. Simplicity is important because we expect our language users to be library experts who do not necessarily have expertise in compilers or formal specifications.

We studied several libraries to determine the most useful ways of optimizing them. First, we noticed that library operations could easily be integrated into many traditional optimizations, such as dead-code elimination, copy propagation, and loop-invariant code motion. These optimizations are effective and well understood, and they require only minimal information to enable. For example, to enable loop-invariant code motion, the annotations need only indicate which library procedures have no side effects. Second, we observed that many library-specific optimizations replace a general-purpose library call with a more specific one that takes advantage of information about the calling context. This form of specialization not only improves performance, it often creates additional opportunities for traditional optimizations.
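As a minimal sketch of the first observation, suppose a hypothetical library routine NormMatrix is annotated as side-effect free; the Matrix type and routine name are ours, not from any particular library. The annotation is all the compiler needs to hoist the call:

    typedef struct Matrix Matrix;

    /* Annotated as side-effect free: reads A, writes nothing else. */
    extern double NormMatrix(const Matrix *A);

    void scale_before(const Matrix *A, double *x, const double *y, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] = y[i] / NormMatrix(A);   /* re-evaluated every iteration */
    }

    void scale_after(const Matrix *A, double *x, const double *y, int n)
    {
        /* Hoisted: the annotation proves the call is loop-invariant. */
        double norm = NormMatrix(A);
        for (int i = 0; i < n; i++)
            x[i] = y[i] / norm;
    }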


Thus, our annotation language consists of two classes of annotations: basic annotations for enabling traditional optimizations, and advanced annotations for specifying library-specific specialization.

Basic Annotations. Our basic annotations provide capabilities similar to interval analysis [13]. Interval analysis concisely summarizes the effects of a procedure, so that the compiler can analyze any code that calls the procedure without reanalyzing the procedure itself. Our language allows the library annotator to explicitly summarize the dataflow and pointer effects of each library procedure [20]. In some cases, a modern compiler could derive this information automatically from the library source. However, there are conditions under which this is infeasible or impossible. Many libraries encapsulate functionality for which no source code is available, such as low-level I/O or communication routines. Even if source is available, it may be simpler to provide the information declaratively, especially if it is well known.

Advanced Annotations. The second class of annotations defines library-specific analyses and optimizations. Our compiler provides a configurable abstract interpretation engine to gather information about how the application manipulates library objects. The annotations define a dataflow analysis problem consisting of a set of abstract object states and the effects of each library procedure on those states. The abstract states form a dataflow lattice, and the library procedure effects serve as dataflow transfer functions. The analyzer propagates this information through the program to derive the abstract states of the actual program variables. A separate set of annotations uses this information to trigger library procedure specializations. Each specialization tests the abstract states of its input parameters to determine whether the library call can be replaced by code that takes advantage of the context. A detailed discussion of the current annotation language is presented in Appendix A.

We are considering several other advanced annotations as well. One annotation, which we call compositional, would specify that a series of library function calls can be replaced with a more efficient series of calls. The underlying mechanism is similar to algebraic simplification and value numbering, but it is not yet clear how to express this information in the annotation language. Another advanced annotation would specify code scheduling constraints. For example, in a communication library that has split-phase operations, we want to tell the compiler to push the start and wait operations as far apart as possible.

The use of an annotation language to convey library semantics raises the issue of correctness. Even a library expert may provide incorrect annotations, which can result in optimizations that improperly change the behavior of the application. Errors in the annotations are likely to be subtle problems, such as inconsistencies between different annotations, or missing special cases. Unfortunately, there is no absolute way to check the correctness of a set of annotations. Therefore, our goal is to minimize the possible effects of incorrect annotations. First, we keep the language simple, so that it is easy to get the annotations right. Second, the compiler treats incomplete information conservatively by filling in the gaps with the most restrictive assumptions.
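Before turning to the compiler mechanisms, the following sketch shows the end-to-end effect of the two annotation classes on application code. It uses the fictional matrix library of Appendix A; the extern declarations and the opaque Matrix type are ours.

    typedef struct Matrix Matrix;   /* opaque library handle (hypothetical) */
    extern void LUFactorMatrix(Matrix *A, Matrix *L, Matrix *U);
    extern void MultiplyMatrix(Matrix *A, Matrix *B, Matrix *C);
    extern void TriangularMultiplyMatrix(Matrix *A, Matrix *B, Matrix *C);

    void update_before(Matrix *A, Matrix *B, Matrix *C, Matrix *L, Matrix *U)
    {
        LUFactorMatrix(A, L, U);   /* analyze annotation: U's data is Upper */
        MultiplyMatrix(U, B, C);   /* general-purpose call                  */
    }

    void update_after(Matrix *A, Matrix *B, Matrix *C, Matrix *L, Matrix *U)
    {
        LUFactorMatrix(A, L, U);
        /* The specialize annotation fires because abstract interpretation
         * assigned U the state Upper. */
        TriangularMultiplyMatrix(U, B, C);
    }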

3.2 Compiler mechanisms

This section describes the compiler mechanisms that we plan to use in the Broadway compiler. Our goal is to find a small set of powerful optimizations that are effective for optimizing library functions and applicable to many different libraries.

We determined the initial set of mechanisms, described below, by observing the manual optimization process for a few different libraries and applications. We noticed that this process can be decomposed into simple, systematic steps. In many cases, these steps correspond to existing optimization passes, such as dead-code elimination. These passes are extended to library functions by reading the corresponding annotations whenever the semantics of a library function are needed.

3.2.1 Mechanisms

Abstract interpreter. One of the most important mechanisms in our compiler is the abstract interpretation engine. This engine can be configured to collect library-specific information about how the application uses the library functions. It is similar to constant propagation, except that it assigns abstract, rather than concrete, values to the variables. For example, an abstract interpretation over integer variables could classify them as positive, negative, or zero. The annotations specify the dataflow lattice and transfer functions (see Appendix A for details) that describe the effects of each library function. Abstract interpretation is crucial to many of the optimizations because it yields library-specific information about what the application computes. The result of this analysis is a library-specific abstract state (from the lattice) assigned to each variable. This information is used to trigger several other optimizations. While the basic facility has been easy to implement, there are many improvements that would make it more powerful, but also more complex. Possible improvements include allowing simple arithmetic lattices and supporting conditional analysis [19].

Library-specific transformations. Library function specialization is implemented using a form of program transformation that is triggered by semantic information from the abstract interpreter rather than just syntactic patterns. The transformations behave like macros, except that they are only triggered when the library-specific abstract states of the input variables match the pattern specified in the annotations. These transformations are among the most useful for optimizing libraries because they allow us to take into account more than just the local syntactic context of a library call: we can take into account the global semantic context of the program and the computation it performs.

Pointer analysis. Many library objects have complex internal structure, which means that pointer analysis is needed for most optimizations. Since pointer analysis is a difficult problem, we will choose an existing algorithm that suits libraries. Information about pointer relationships is provided by the annotations (see Appendix A), using expressions to describe how a library function changes pointer relationships [11].

Liveness for heap objects. We are reformulating liveness analysis to handle dynamically allocated heap objects. Initial experiments indicate that this is a very important optimization: many of the specializations result in "no-op" code that needs to be removed. The basic liveness analysis algorithm is well understood, and we know how to modify it for dynamic heap objects. The liveness range of these objects is defined by their allocation and deallocation, rather than by syntactic scope. Liveness for heap objects is related to the abstract data structure model. For now, we only need to know when objects are created and destroyed, and we treat any ambiguous situations conservatively. Later, when the pointer analysis is working, object lifetimes can be established more precisely.
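A hedged sketch of the heap-liveness situation; the allocation and deallocation routines are hypothetical, in the style of the Appendix A example:

    typedef struct Matrix Matrix;
    extern void NewMatrix(int width, int height, Matrix **new_matrix);
    extern void FreeMatrix(Matrix *m);   /* hypothetical deallocator */

    void compute(int n)
    {
        Matrix *T;
        NewMatrix(n, n, &T);   /* T's liveness range begins at allocation... */

        /* Suppose specialization has removed every intervening use of T.
         * Liveness analysis over heap objects then finds T dead over its
         * whole allocation-to-deallocation range, so both calls here are
         * no-ops that can be deleted. */

        FreeMatrix(T);         /* ...and ends at deallocation. */
    }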


Partial redundancy elimination. Partial redundancy elimination subsumes the functionality of common subexpression elimination and loop-invariant code motion. The algorithm is well understood, but it needs to identify available expressions. The annotations provide this information for each library function by indicating which arguments are produced without side effects.

Enabling transformations. Abstract interpretation information can help decide when to perform enabling transformations such as procedure integration, procedure cloning, and loop transformations. This may prove more effective than conventional heuristics. These transformations do not improve performance themselves, but are critical to enabling other optimizations. Another interesting property of enabling transformations is that they are always safe.

3.2.2 Compilation strategy

The specific ordering and control of the compiler mechanisms described above is still an active part of our research. Our analysis of the manual optimization process led us to the following preliminary strategy.

1. Analysis. The compiler first performs dataflow analysis to determine the pointer relationships between objects. It then performs the various library-specific analyses specified in the annotations.

2. Enabling transformations. The results from the library-specific analysis are used to select library functions to inline. A library function is a good candidate for inlining when abstract interpretation assigns special-case abstract states to its inputs. With the library function inlined, the compiler re-analyzes the program.

3. Specialization. By integrating the library function into the calling application, the compiler exposes many opportunities to specialize internal library functions (the advanced functions).

4. Traditional optimizations. The final step in the process is to clean up the code using traditional optimizations such as redundancy elimination and dead-code elimination. Redundant computations are often introduced when library functions are considered in the context of the calling application; for example, we can often eliminate redundant argument checking, as sketched below. Dead code typically results from specializations, which cause variables to become unused.
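For instance, when two library routines that each validate their argument are inlined into the same caller, the second check becomes provably redundant. A hedged sketch, with names of our own invention; work_one and work_two stand in for the remainder of the inlined bodies:

    #include <stdlib.h>

    typedef struct Matrix Matrix;
    extern void work_one(Matrix *A);
    extern void work_two(Matrix *A);

    /* After inlining, each library routine contributed its own check. */
    void caller_before(Matrix *A)
    {
        if (A == NULL) abort();   /* check from the first inlined routine */
        work_one(A);
        if (A == NULL) abort();   /* A is unchanged, so this is redundant */
        work_two(A);
    }

    /* Redundancy elimination keeps only the first check. */
    void caller_after(Matrix *A)
    {
        if (A == NULL) abort();
        work_one(A);
        work_two(A);
    }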

3.3 Implementation

Our implementation uses a C compiler construction library called C-Breeze that we wrote ourselves. It includes a parser and un-parser, an internal abstract syntax tree, and a dataflow analysis framework. We used this framework for projects in the graduate compiler course. Many of the traditional optimization passes have already been implemented.

4 Work completed

We have completed significant work on both the design and implementation of the Broadway compiler system. We started by examining libraries and applications to determine the types of optimizations we would want to support and the information required to enable them. This work resulted in the initial versions of the annotation language and the compiler used in the experiments described below.


We developed the compiler on top of our own open C compiler system, C-Breeze. This section describes our experiences in applying our system to two PLAPACK applications: a Cholesky factorization program and a code for solving Lyapunov equations [3]. PLAPACK is a production-quality library for coding parallel linear algebra algorithms in C or Fortran [17]. It consists of parallel versions of the same kernel routines found in the BLAS [8] and LAPACK [1]. At the highest level, it provides an interface that hides much of the parallelism from the programmer.

For these experiments, our compiler performs all analysis automatically. Except for inlining, we perform the transformations manually according to the mechanisms outlined in Section 3.2. While our compiler is not yet complete, the individual transformations are all well understood; moreover, the analysis and the overall compilation strategy are the enabling technologies behind these results. The PLAPACK annotations were written by a person who is not a member of the PLAPACK implementation team. For purposes of comparison, the baseline programs were supplied by the PLAPACK group and written using the cleanest PLAPACK interface. The hand-optimized programs were written by PLAPACK experts. All results were obtained on a 40-node Cray T3E.

4.1 Optimizing PLAPACK

Figure 2 shows the progressive performance improvements gained from a series of optimizations on the Cholesky factorization program. A detailed discussion of these optimizations would require an in-depth description of the PLAPACK library, which is beyond the scope of this proposal; detailed insights into these optimizations can be found in the literature [2]. The particular annotations are sophisticated variations of those given in Section 3.1. In addition to analyzing matrix special cases, the PLAPACK optimizations exploit knowledge about how various routines affect the distribution of matrices across processors. For example, one routine produces a local matrix when given a fully distributed matrix, which is useful information since subsequent routines can then be specialized to invoke local matrix operations rather than parallel matrix operations.
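For instance, the Appendix B annotations let the compiler replace a parallel matrix multiply with a purely local one once analysis shows that all three views are Local. A hedged sketch; PLA_Gemm, PLA_Local_gemm, and the PLA_Obj type are from PLAPACK, while the wrapper functions are ours:

    #include "PLA.h"

    /* Before: the general routine, correct for any distribution. */
    void gemm_before(int transa, int transb, PLA_Obj alpha, PLA_Obj a,
                     PLA_Obj b, PLA_Obj beta, PLA_Obj c)
    {
        PLA_Gemm(transa, transb, alpha, a, b, beta, c);
    }

    /* After: analysis proved a, b, and c are all Local, so the specialize
     * annotation substitutes the local multiply. */
    void gemm_after(int transa, int transb, PLA_Obj alpha, PLA_Obj a,
                    PLA_Obj b, PLA_Obj beta, PLA_Obj c)
    {
        PLA_Local_gemm(transa, transb, alpha, a, b, beta, c);
    }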

Figure 2: Several transformations contribute to the overall performance improvement (measured in MFLOPS). Bar (a) is the baseline, and bars (b) through (h) represent a series of optimizations.

PLAPACK programs manipulate linear algebra objects indirectly through handles called views. A view consists of data, possibly distributed across processors, and an index range that selects some or all of the data. A typical algorithm operates by partitioning the views and working on one small piece at a time.

While most PLAPACK procedures are designed to accept any type of view, the actual parameters often have special distributions. When this information is propagated into the procedure, it yields a variety of specialization opportunities. In the figure, bars (b), (c), (d), and (h) all represent the specialization or removal of code based on the distribution. Bar (e) shows the results of high-level copy propagation, which eliminates unnecessary copying of views. Bar (f) shows the benefits of high-level dead-code elimination, which removes views that are never used. Finally, bar (g) represents a simple arithmetic simplification.

4.2 Performance Evaluation

[Figure 3 appears here: two plots of MFLOPS per processor versus matrix size on 36 processors of a Cray T3E, one for Cholesky factorization and one for the Lyapunov equation, each comparing the Broadway-optimized, hand-optimized, and baseline programs.]
Figure 3: Performance comparison of hand-optimized and Broadway-optimized PLAPACK applications.

[Figure 4 appears here: MFLOPS per processor versus matrix size for the PLA_Trsm kernel on 36 processors of a Cray T3E, comparing the Broadway-optimized, hand-optimized, and baseline versions.]
Figure 4: Performance comparison of the hand-customized and Broadway-customized PLA_Trsm() function for the Cholesky program.

Figure 3 shows the performance improvement of the Cholesky and Lyapunov programs. For fairly large matrices (6144 × 6144), the Broadway-optimized Cholesky program is 26% faster than the baseline, and the hand-optimized program is 22% faster than the baseline. For the Lyapunov program, the Broadway system does not perform as well as the manual approach: for 250 × 250 matrices, for example, it improves performance by 11%, compared to the hand-optimized improvement of 26%.

Much of the performance benefit for these two programs comes from specializing the PLA_Trsm() routine. Figure 4 shows the performance difference between the generic PLA_Trsm() routine called by Cholesky and the optimized version produced by our compiler. Lyapunov sees smaller performance gains because it spends more time outside of PLA_Trsm(), including a majority of its time in PLA_Gemm(), a matrix multiplication routine. When our compiler transformations are complete, we should see additional performance improvements by optimizing all of the PLAPACK routines, including PLA_Gemm(). Finally, we observe similar results for different numbers of processors. Figure 5 shows how the performance of the various Cholesky programs scales with the number of processors. The results reveal several interesting points.

[Figure 5 appears here: MFLOPS versus number of processors (up to 40) for the Cholesky programs on a 3072 × 3072 matrix, comparing the Broadway-optimized, hand-optimized, and baseline versions on the Cray T3E.]
Figure 5: Scalability of the Cholesky programs as the number of processors grows.



- A small effort yields a large benefit because the annotations only contain library knowledge, while all compilation expertise resides in the Broadway Compiler. The library annotator supplies the small but critical bits of information, such as specifying the conditions required to substitute a specific PLAPACK routine for a more general one, while the compiler analyzes the program, identifies opportunities for transformations, and manages a number of optimization passes. Figure 2 shows the benefits of this separation of concerns, as many different transformations contribute to the overall performance improvement of the Cholesky program. For example, the procedure substitutions enable high-level copy propagation and high-level dead-code elimination.

- The cost of the annotations is amortized across multiple PLAPACK programs, while the cost of manual optimizations is not. Both the Cholesky and Lyapunov programs benefit from the same optimization of specializing the PLA_Trsm() routine, so we use the same PLAPACK annotations and the same optimization strategy for the two applications. However, the PLA_Trsm() routine is specialized in slightly different ways for the two applications because the contexts are slightly different. Thus, we cannot manually optimize PLA_Trsm() once for both programs.

- Our approach can apply all optimizations uniformly. There is no fundamental reason why the PLA_Trsm() procedure that was hand-optimized for Cholesky factorization is not as efficient as ours, but the manual approach did not employ one transformation that it could have. Given the appropriate annotations, our approach will not miss such opportunities.

- The effect of customization is more important for small matrices. For example, for a 1024 × 1024 matrix, the Broadway-optimized Cholesky factorization is 2.95 times faster than the baseline, and the hand-optimized version is 2.47 times faster. When matrices are small, the improvements are larger because there is more overhead relative to matrix operations. Optimizing the small-matrix cases is important for scaling to larger numbers of processors, and for supporting sparse matrix operations.

5 Plan

The work plan is organized around the four main contributions: the overall approach, the annotation language and evaluation, the compiler mechanisms, and the implementation and experiments. In general, we will proceed iteratively, making progress in each area and feeding the results into subsequent work. Currently, our work focuses on determining the proper compiler mechanisms and refining the annotation language. Towards the end, we will spend more time on the experiments and the evaluation.

The next step in our research is to implement more of the Broadway compiler system. The results presented in Section 4 were obtained using the first version of the language and compiler. That version of the compiler is only capable of program analysis; all optimizations were performed by hand. We have since revised the annotation language and outlined some of the optimization algorithms. We will start by implementing the passes that proved most useful in the PLAPACK experiments.

In the early stages of implementation, we will control the optimization process manually. While each individual pass will be fully automated, we need to experiment with different pass orderings and optimization strategies. Muchnick [13] presents a recommended ordering of optimizations, but it is not clear that it applies to library-level optimizations. In addition, we will explore the different ways of optimizing multiple libraries: top-down, bottom-up, or simultaneously.

The early experiments will focus on replicating the PLAPACK results using increasing amounts of automation. As the automation matures, we will widen the scope to include more PLAPACK applications. Expanding the experiments will ensure that the design is not tied too closely to a particular application. We will also find out whether or not specific PLAPACK optimizations taken from one application help improve the performance of others. These experiments will require a more thorough annotation of PLAPACK.

To ensure wide applicability, we need to test our system on other libraries. Fully annotating a library requires a significant level of expertise in the use of the library. To minimize the cost of gaining this expertise, we will choose a small but representative set of libraries for our experiments. We are considering three candidates: the standard math library, MPI (Message Passing Interface), and OpenGL. Each of these libraries has unique requirements, and all three are used in applications that need good performance.

We have identified many other interesting topics and experiments that are related to this research. While many of them are beyond the scope of this dissertation, it may be possible to address one or two.

Machine-specific optimizations. Making good performance decisions often requires knowledge of the underlying machine. This is particularly important for parallel architectures, whose performance characteristics can vary considerably. One possible solution is to parameterize our optimizations by particular performance properties. We could also provide annotations that describe how to measure those parameters.

Analysis of domain-specific enablers. Enabling transformations, such as inlining, loop peeling, and loop unrolling, are typically driven by heuristics such as the size of the code fragment involved, or the composition of instructions it contains. The Broadway compiler chooses when to perform enabling transformations using library-specific information.
We expect this strategy to predict more accurately when such transformations would be beneficial.

Compositional optimizations. We have identified a useful optimization in which the compiler recognizes a sequence of library function calls and replaces them with a more efficient sequence. We call this a compositional optimization.

Run-time analysis and specialization. Static program analysis can often fail to yield interesting information. In these cases, we could use the same compiler mechanisms to instrument the program to perform analysis and library-function selection at run time.
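As a hedged sketch of what such instrumentation might produce, using the fictional matrix library of Appendix A plus a hypothetical run-time predicate of our own invention:

    typedef struct Matrix Matrix;
    extern int  MatrixIsUpperTriangular(const Matrix *A);  /* hypothetical test */
    extern void MultiplyMatrix(Matrix *A, Matrix *B, Matrix *C);
    extern void TriangularMultiplyMatrix(Matrix *A, Matrix *B, Matrix *C);

    /* When static analysis cannot pin down A's special case, the compiler
     * could emit a run-time dispatch instead of giving up entirely. */
    void multiply_dispatch(Matrix *A, Matrix *B, Matrix *C)
    {
        if (MatrixIsUpperTriangular(A))
            TriangularMultiplyMatrix(A, B, C);
        else
            MultiplyMatrix(A, B, C);
    }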

References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, second edition, 1995.

[2] G. Baker, J. Gunnels, G. Morrow, B. Riviere, and R. van de Geijn. PLAPACK: high performance through high level abstractions. In Proceedings of the International Conference on Parallel Processing, 1998.

[3] P. Benner and E. S. Quintana-Orti. Parallel distributed solvers for large stable generalized Lyapunov equations. Parallel Processing Letters, 1998 (to appear).

[4] A. Berlin. Partial evaluation applied to numerical computation. In Proceedings of the 1990 ACM Conference on Lisp and Functional Programming, Nice, France, 1990.

[5] A. Berlin and D. Weise. Compiling scientific programs using partial evaluation. IEEE Computer, 23(12):23–37, December 1990.

[6] S. Chiba. A metaobject protocol for C++. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages and Applications, pages 285–299, October 1995.

[7] Crispin Cowan, Tito Autrey, Charles Krasic, Calton Pu, and Jonathan Walpole. Fast concurrent dynamic linking for an adaptive operating system. In Proceedings of the International Conference on Configurable Distributed Systems, May 1996.

[8] J. J. Dongarra, I. Duff, J. Du Croz, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–28, 1990.

[9] Dawson R. Engler. Incorporating application semantics and control into compilation. In Proceedings of the Conference on Domain-Specific Languages (DSL-97), pages 103–118, Berkeley, October 15–17, 1997. USENIX Association.

[10] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, December 1996.

[11] Joseph Hummel, Alexandru Nicolau, and Laurie J. Hendren. A language for conveying the aliasing properties of dynamic, pointer-based data structures. In Howard Jay Siegel, editor, Proceedings of the 8th International Symposium on Parallel Processing, pages 208–216, Los Alamitos, CA, USA, April 1994. IEEE Computer Society Press.

[12] John Lamping, Gregor Kiczales, Luis H. Rodriguez Jr., and Erik Ruf. An architecture for an open compiler. In Proceedings of the IMSA'92 Workshop on Reflection and Meta-level Architectures, 1992.

[13] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, San Francisco, CA, 1997.

[14] J. N. Neighbors. Draco: a method for engineering reusable software systems. In T. J. Biggerstaff and C. Richter, editors, Software Reusability, volume I: Concepts and Models, chapter 12, pages 295–319. ACM Press, 1989.

[15] Y. Smaragdakis and D. Batory. Application generators. Encyclopedia of Electrical and Electronics Engineering, to appear.

[16] Yannis Smaragdakis and Don Batory. DiSTiL: a transformation library for data structures. In USENIX Conference on Domain-Specific Languages (DSL-97), October 1997.

[17] Robert van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.

[18] Mark T. Vandevoorde. Exploiting Specifications to Improve Program Performance. PhD thesis, MIT, Department of Electrical Engineering and Computer Science (also MIT/LCS/TR-598), 1994.

[19] Mark Wegman and Frank Kenneth Zadeck. Constant propagation with conditional branches. In Brian K. Reid, editor, Conference Record of the 12th Annual ACM Symposium on Principles of Programming Languages, pages 291–299, New Orleans, LA, January 1985. ACM Press.

[20] Robert P. Wilson and Monica S. Lam. Efficient context-sensitive pointer analysis for C programs. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI), pages 1–12, La Jolla, California, June 18–21, 1995.


A Annotation language

This appendix describes the annotation language in more detail. The first section shows an example annotation file for a fictional library of matrix operations. Each subsequent section describes a particular annotation by showing the grammar and referring to the example.

A.1 Example: Matrix library annotations

%{
#include "Matrix.h"
%}

global { system_state, global_lock }

property SpecialCase = { Unknown = any, Upper, Lower, Diagonal,
                         Identity, Zero, NotSpecial = none };

procedure NewMatrix(width, height, new_matrix)
{
  on_exit  { new_matrix-->Data1 }
  access   { width, height }
  analyze SpecialCase { Data1 = Unknown; }
}

procedure LUFactorMatrix(A, L, U)
{
  on_entry { A-->DataA }
  on_exit  { L-->DataL; U-->DataU; }
  access   { DataA }
  modify   { DataL, DataU }
  analyze SpecialCase { DataL = Lower, DataU = Upper; }
}

procedure MultiplyMatrix(A, B, C)
{
  on_entry { A-->DataA; B-->DataB; C-->DataC; }
  access   { DataA, DataB }
  modify   { DataC }
  analyze SpecialCase {
    ((A.SpecialCase == Upper) && (B.SpecialCase == Lower))    => C = NotSpecial;
    ((A.SpecialCase == Upper) && (B.SpecialCase == Diagonal)) => C = Upper;
    ((A.SpecialCase == Zero) || (B.SpecialCase == Zero))      => C = Zero;
    (A.SpecialCase == Identity) => C = B;
  }
  specializations {
    ((A.SpecialCase == Zero) || (B.SpecialCase == Zero)) => no_op;
    (A.SpecialCase == Upper) => "TriangularMultiplyMatrix(A, B, C)"
  }
}

A.2 Overall format

Annotations → Header Annotation list
Header → %{ C-Code %}
Annotation list → Property annotation [ Annotation list ]
                | Global annotation [ Annotation list ]
                | Procedure annotation [ Annotation list ]
Procedure annotation → procedure PROCEDURE NAME ( Formal param list ) { Procedure annotation list }
Procedure annotation list → Procedure body annotation [ Procedure annotation list ]
Procedure body annotation → OnEntry annotation
                          | OnExit annotation
                          | Access annotation
                          | Modify annotation
                          | Fact annotation
                          | Analysis annotation
                          | Specialize annotation
Formal param list → PARAMETER NAME [ , Formal param list ]

The annotation file starts with the Header, which is a C code fragment that gives the compiler access to the header files of the library. Each subsequent annotation defines either a global object, an analysis property, or a library procedure. The example contains annotations for two global objects, one analysis property and three library procedures. The annotations contained in each procedure section apply only to that procedure and its arguments.

A.3 Basic Annotations

A.3.1 Object structure

Two annotations are used to describe how object components are changed by the procedure. The on_entry and on_exit annotations establish the internal structure of objects as they enter and leave the procedure, respectively. The names given to internal objects should reflect any structure sharing. Any object names not present in the formal parameter list or on_entry annotation are assumed to be allocated inside the procedure. The on_exit annotation can indicate that an object no longer points to something using the -->0 notation. This information is used to determine liveness.

OnEntry annotation → on_entry { Points to list }
OnExit annotation → on_exit { Points to list }
Points to list → Points to ; [ Points to list ]
Points to → IDENTIFIER [ --> Points to ]
          | IDENTIFIER --> 0

The example annotations indicate that each matrix object points to a separate buffer that contains the actual matrix data. It is only necessary to annotate the significant pointer facts. For example, the MultiplyMatrix function does not change the pointer relationships, so there is no need for an on_exit annotation.

16

A.3.2 Access and modify

The access and modify annotations describe the procedure arguments that are referenced and modified, respectively. These serve as the uses and defs of the procedure. The identifiers may be formal parameters, internal structures (from the on_entry annotation), or global objects.

Access annotation → access { Identifier list }
Modify annotation → modify { Identifier list }
Identifier list → IDENTIFIER [ , Identifier list ]

In the example, the MultiplyMatrix function performs the matrix operation C = A × B. To implement this operation, the function accesses the data in matrices A and B, and updates the data in C.

A.3.3 Global objects

The global annotation declares global state objects and gives them names that may be used in other annotations. Global objects are analyzed in the same manner as the other variables.

Global annotation → global { Identifier list } ;

A.4 Advanced Annotations

A.4.1 Properties

The property annotation specifies an abstract interpretation over the objects in the program. The abstract interpretation engine analyzes the program and assigns one of the possible property values to each variable at each point in the program. The list of property values gives names to the possible abstract values that form the dataflow analysis lattice. In addition, names can be given to the lattice top and bottom elements, which we call any and none, respectively. When a variable gets the value any, its state is not known, but it could take on any value at a later point in the program. The value none indicates that the state of the variable cannot be determined by static program analysis. The analyze annotations (defined below) describe how library procedures change the states of objects from one value to another.

Property annotation → property PROPERTY NAME = { Property value list } ;
Property value list → Property value [ , Property value list ]
Property value → PROPERTY VALUE NAME
               | PROPERTY VALUE NAME = any
               | PROPERTY VALUE NAME = none

The example specifies a property that is intended to capture some of the common special-case matrix types, such as upper-triangular, diagonal, and identity. The special value Unknown indicates that a matrix could be one of the special cases, but it is not known at that point in the program. The special value NotSpecial indicates that a matrix is definitely not one of the special cases.

A.4.2 Analysis

The analyze annotation describes how a library procedure affects the properties of its parameters. It contains a list of mappings that define the dataflow transfer functions for abstract interpretation. The condition can test any property, but the consequence can only modify the specified property. If more than one mapping applies, the most specific one is chosen, i.e., the mapping for which the most conditions hold. Ties are currently broken in text order.

17

Analysis annotation → analyze PROPERTY NAME { Mapping list }
Mapping list → Mapping ; [ Mapping list ]
Mapping → [ Condition => ] Consequence
Condition → ( Condition )
          | Var property == Var property
          | Var property != Var property
          | Condition || Condition
          | Condition && Condition
Consequence → IDENTIFIER = PROPERTY VALUE NAME [ , Consequence ]
            | IDENTIFIER = IDENTIFIER [ , Consequence ]
Var property → IDENTIFIER
             | IDENTIFIER . PROPERTY NAME
             | CONSTANT

The example annotations show how the three library functions can affect the special-case types of their arguments. The NewMatrix function always produces a general-case matrix of unknown contents. The LUFactorMatrix function always produces an upper-triangular and a lower-triangular matrix. The effects of the MultiplyMatrix function depend on the states of the input matrices. The example rules show several common situations. Note the case involving the identity matrix; in this case the result matrix takes on the state of the second input.

A.4.3 Specialize

The specialize annotation defines context-specific replacements for general procedures. Like the analyze annotation, each replacement is guarded by a condition on the variables. The replacement can either be no_op, which indicates that the call may be completely removed, or it may be a C code fragment. If the replacement is a C fragment, then the annotation behaves like a macro, replacing the code and substituting the parameters into the fragment.

Specialize annotation → specialize Condition => Replacement
Replacement → no_op
            | C-Code

The example shows two possible specializations for matrix multiply. If either matrix is zero, then the library routine has no effect and can be removed. If the first matrix is upper-triangular, then we can call a special upper-triangular matrix multiply that requires many fewer floating-point operations.

A.4.4 Facts

Facts are a catch-all for assertions about a procedure's behavior. Currently, we support only one type of fact: declaring equality, which is useful for copy propagation.

Fact annotation → fact { Fact list }
Fact list → Fact ; [ Fact list ]
Fact → IDENTIFIER == IDENTIFIER

B PLAPACK Annotations

%{
#include "PLA.h"
%}

// The view components have a distribution
property Distribution = { Matrix = any, ColPanel, RowPanel, Local,
                          Vector, Mvector, Mscalar, Empty };

// Special case properties of the data
property Special = { Zero, One, MinusOne, Identity, Tridiagonal,
                     Upper, Lower, Symmetric };

// Split size analysis of integers
property SplitSize = { SplitTop, SplitLeft, SplitRight, SplitBottom };

// Individual library routines

PLA_Vector_create(datatype, length, template, align, new_obj)
{
  on_exit  { new_obj-->View1-->Data1-->template; }
  access   { datatype, length, template, align }
  modify   { new_obj }
  analyze Distribution { View1 = Vector; }
}

int PLA_Matrix_create(datatype, global_length, global_width, template,
                      global_align_row, global_align_col, mat)
{
  on_exit  { mat-->View1-->Data1-->template; }
  access   { datatype, global_length, global_width, template,
             global_align_row, global_align_col }
  modify   { mat }
  analyze Distribution { View1 = Matrix; }
}

PLA_Mscalar_create(datatype, owner_row, owner_col, length, width,
                   template, new_obj)
{
  on_exit  { new_obj-->View1-->Data1-->template; }
  access   { datatype, owner_row, owner_col, length, width, template }
  modify   { new_obj }
  analyze Distribution { View1 = Mscalar; }
}

int PLA_Obj_view_all(obj, out)
{
  on_entry { obj-->View1-->Data1; }
  on_exit  { out-->View2-->Data1; }
  access   { View1 }
  modify   { out }
  fact     { View2 == View1 }
  analyze Distribution { View2 = View1; }
}

PLA_Obj_view(old_obj, length, width, align_row, align_col, new_obj)
{
  on_entry { old_obj-->View1-->Data1; }
  on_exit  { new_obj-->View2-->Data1; }
  access   { View1 }
  modify   { new_obj }
  analyze Distribution { }
}

PLA_Obj_view_swap(one, two)
{
  on_entry { one-->View1-->Data1; two-->View2-->Data2; }
  on_exit  { one-->View2-->Data2; two-->View1-->Data1; }
  access   { one, two }
  modify   { one, two }
}

int PLA_Obj_split_size(in obj, in side, out size, out owner)
{
  on_entry { obj-->View1-->Data1; }
  on_exit  { size-->Data1; }
  access   { View1, side }
  modify   { size, owner }
  analyze SplitSize {
    (side == PLA_SIDE_TOP)    => size = SplitTop;
    (side == PLA_SIDE_LEFT)   => size = SplitLeft;
    (side == PLA_SIDE_RIGHT)  => size = SplitRight;
    (side == PLA_SIDE_BOTTOM) => size = SplitBottom;
  }
}

int PLA_Obj_vert_split_2(obj, length, left, right)
{
  on_entry { obj-->View1-->Data1; length-->Data2; }
  on_exit  { left-->View2-->Data1; right-->View3-->Data1; }
  access   { View1, length }
  modify   { left, right }
  analyze Distribution {
    ((View1.Distribution == Matrix) && (length.SplitSize == SplitLeft) &&
     (Data1 == Data2)) => left = Local, right = Matrix;
    ((View1.Distribution == ColPanel) && (length.SplitSize == SplitLeft) &&
     (Data1 == Data2)) => left = ColPanel, right = ColPanel;
    ((View1.Distribution == RowPanel) && (length.SplitSize == SplitLeft) &&
     (Data1 == Data2)) => left = Local, right = RowPanel;
    ((View1.Distribution == Local) && (length.SplitSize == SplitLeft) &&
     (Data1 == Data2)) => left = Local, right = Empty;
    ((View1.Distribution == Empty) && (length.SplitSize == SplitLeft) &&
     (Data1 == Data2)) => left = Empty, right = Empty;
  }
  specialize {
    ((View1.Distribution == Local) && (length.SplitSize == SplitLeft) &&
     (Data1 == Data2)) => "PLA_Obj_view_all(obj, left);"
  }
}

int PLA_Gemm(transa, transb, alpha, a, b, beta, c)
{
  on_entry { alpha-->View1-->Data1; a-->View2-->Data2; b-->View3-->Data3;
             beta-->View4-->Data4; c-->View5-->Data5; }
  access   { Data1, Data2, Data3, Data4, Data5 }
  modify   { Data5 }
  specialize {
    ((View2.Distribution == Empty) || (View3.Distribution == Empty) ||
     (View5.Distribution == Empty)) => no_op;
    ((View2.Distribution == Local) && (View3.Distribution == Local) &&
     (View5.Distribution == Local))
      => "PLA_Local_gemm(transa, transb, alpha, a, b, beta, c);"
  }
}