Compiling Scientific Code for Complex Memory Hierarchies

Steve Carr     Ken Kennedy
Department of Computer Science
Rice University
Houston, TX 77251-1892

Abstract

The trend in high-performance microprocessor design is toward increasing computational power on the chip. At the same time, memory size is increasing but memory speed is not. The result is an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs. Can we enhance our compiler technology to obviate the need for explicit programming of the memory hierarchy? This paper surveys some recent experiments with compiler techniques designed to address this problem. The results, though only preliminary, show great promise.

1 Introduction

The trend in high-performance microprocessor design is to put an increasing amount of computation power on a single chip. This power comes not only in the form of reduced cycle times, but also as multiple instruction issue units and pipelined floating-point computation units. The resulting microprocessors can process dramatically more data than previous models. At the same time, the requirements for memory size per processor are also increasing. The result is an imbalance between the rate at which computations can be performed on the microprocessors and the rate at which operands can be delivered onto the chip. In addition, the latencies for accesses to main memory, in basic machine cycles, have been increasing. It is quite common to find latencies of 10 to 20 cycles, and on some machines, like the Cray 2, the latency is as high as 50 cycles.

* Research supported by NSF Grant CCR-8809615 and by IBM Corporation.

To compensate for these trends, machine designers have turned increasingly to complicated memory hierarchies. The Intel i860 has an on-chip cache memory for 8K bytes of data, and there are several systems that use two levels of cache with a Mips R3000. Still, these systems perform poorly on many calculations because the distance between use points for the same data item is not small enough for the datum to be kept in cache or registers. Scientific calculations present special difficulties when the sizes of the working matrices exceed cache. As a result of these problems, many programmers have begun to restructure their codes by hand to improve performance in the memory hierarchy. The entire Lapack project is designed to replace the algorithms in Linpack and Eispack with block algorithms that have much better cache performance. Unfortunately, the resulting programs are highly machine-specific and must be retailored to the memory hierarchy of each new machine on which they are implemented. This means that we have lost one of the most important advantages of Fortran: the ability to write machine-independent programs.

In the past, machine independence has been achieved through sophisticated compiler optimization techniques. The Fortran I compiler included enough optimizations to make it possible for scientists to abandon machine language programming, ushering in the era of widespread application of the computer to scientific problems. More recently, advanced vectorizing compilers have made it possible to write "vectorizable" code and expect the compiler to translate it to efficient machine code for the target vector machine. Vectorizing compilers did not eliminate the need to restructure programs for vector execution; instead, they made it possible to write machine-independent vector programs in a sublanguage of Fortran 77. Is it possible to achieve the same success for memory hierarchy management?

More precisely, is it possible to identify a sublanguage of Fortran that is general enough to conveniently program scientific applications but restrictive enough for a compiler to generate good code, good enough to discourage machine-dependent programming practices? In this paper, we survey some of our recent experiments with compiler techniques to help manage both cache memory and floating-point registers. These techniques are, in a very real sense, analogs of the techniques used for vectorization and parallelization, in that the same philosophy of loop restructuring is applied to a problem with a different goal. We begin with an overview of the fundamental tool for performing these techniques: data dependence. Next, we present compiler techniques for managing data cache memories and, finally, we present techniques for improving the use of floating-point registers.

2 Dependence

The fundamental tool used by the compiler for memory hierarchy management is the same tool used in vectorization and parallelization: dependence. We say that a dependence exists between two statements if there exists a control flow path from the first statement to the second, and both statements reference the same memory location [Kuc78].

• If the first statement writes to the location and the second reads from it, there is a true dependence, also called a flow dependence.
• If the first statement reads from the location and the second writes to it, there is an antidependence.
• If both statements write to the location, there is an output dependence.
• If both statements read from the location, there is an input dependence.

A dependence is said to be carried by a particular loop if the references at the source and sink of the dependence are on different iterations of the loop and the dependence is not carried by any outer loop [AK87]. When applied to memory hierarchy management, a dependence can be thought of as an opportunity for reuse.
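For instance, in the loop below (an illustration of ours, not an example from the paper), the value stored into A(I) is read on the next iteration as A(I-1). The flow dependence is therefore carried by the I-loop with a distance of one iteration, and it marks a value that could be kept in a register instead of being reloaded from memory.

      DO 10 I = 2, N
 10     A(I) = A(I-1) + B(I)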

3 Cache Memory

For the past three years, we at Rice have been studying support for machine-independent programming, where the principal goal has been to eliminate programming that is specific to a particular machine's memory hierarchy.

The main technique employed by programmers to improve cache behavior of codes is the use of block algorithms, which work on subproblems that fit into cache. The fundamental question is: can a compiler automatically construct a block algorithm from the corresponding point algorithm? Compiler researchers have been investigating this question for many years and have developed transformations, such as strip-mine-and-interchange, that improve the cache utilization of loop nests [AS79, Wol87, IT88, Wol89, Por89, CK89]. The goal of these transformations is to automatically produce an algorithm blocked for a specific machine. Although these techniques have been shown to be effective for "simple" loops, they are not directly applicable to the complex loop nests common in linear algebra codes such as Lapack [Por89]. These codes often contain loops which have complex dependence patterns that are non-trivial to analyze or which have non-rectangular-shaped iteration spaces that present problems in deriving loop bounds in the transformed loop nest.

To investigate the generality of automatic blocking techniques, we established a goal of looking at linear algebra codes such as those found in Lapack. Dongarra and Sorensen have contributed corresponding point and block versions of many of the algorithms used in Lapack, and we are investigating whether the block algorithms can be derived automatically by a compiler. In this section, we report some encouraging preliminary results of that study. In particular, we have found transformation algorithms that can be successfully used on triangular loops, which are quite common in linear algebra. In addition, we have discovered an algorithmic approach that can be used to analyze and block programs that exhibit complex dependence patterns. The latter method has been successfully applied to block LU decomposition without pivoting.

3.1 Strip-Mine-And-Interchange

To improve the memory behavior of loops that access more data than can be handled by a cache, the iteration space of a loop can be blocked into sections whose reuse can be captured by the cache. Strip-mine-and-interchange is a transformation that achieves this result [Wol87, Por89]. The effect is to shorten the distance between the source and sink of a dependence so that it is more likely for the datum to reside in cache when the reuse occurs.

Consider the following loop nest.

      DO 10 I = 1,N
        DO 10 J = 1,M
 10       A(I) = A(I) + B(J)

Assuming that the value of M is much greater than the size of the cache, the cache would provide no benefit for B, while the reuse for A could be handled by a register. To capture reuse for both A and B we can use strip-mine-and-interchange. First, we strip mine the loop as shown below.

      DO 10 I = 1,N,S
        DO 10 II = I,MIN(I+S-1,N)
          DO 10 J = 1,M
 10         A(II) = A(II) + B(J)

Next, we interchange the strip loop and the inner loop to give:

      DO 10 I = 1,N,S
        DO 10 J = 1,M
          DO 10 II = I,MIN(I+S-1,N)
 10         A(II) = A(II) + B(J)

Now, the reuse in B(J) can be captured in a register and reuse in A(II) can be captured in cache if S is not larger than the cache size. Note that it is no longer possible to hold A in a register in the inner loop of this code. Hence, it may also be desirable to perform strip-mine-and-interchange on the J-loop so that we can, once again, get reuse of A from registers. Unroll-and-jam can then be applied to the resulting nest.
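One possible shape for the result, as a sketch: strip-mining the J-loop with a strip size T (a name we introduce; the paper leaves it unspecified) and interchanging once more makes A(II) invariant with respect to the innermost loop, so it can again be held in a register, while the S-element section of A and the T-element section of B supply the cache reuse. Unroll-and-jam of the II-loop would then let the B values be reused from registers as well.

      DO 10 I = 1,N,S
        DO 10 J = 1,M,T
          DO 10 II = I,MIN(I+S-1,N)
            DO 10 JJ = J,MIN(J+T-1,M)
 10           A(II) = A(II) + B(JJ)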

3.1.1 Non-Rectangular Iteration Spaces

When the iteration space of a loop is not rectangular, the transformation described above cannot be directly applied as shown. The problem is that when performing interchange of loops that iterate over a triangular region, the loop bounds must be modified to preserve the semantics of the loop [Wol86]. Below, we derive the formula for determining loop bounds when strip-mine-and-interchange is performed on a triangular iteration space. The general form of a strip-mined triangular loop is given below. The I and J loops have been normalized to give a lower bound of 1; α and β are integer constants (β may be a loop invariant) and α > 0.

      DO 10 I = 1,N,S
        DO 10 II = I,MIN(I+S-1,N)
          DO 10 J = 1, α*II+β
 10         loop body

Figure 1 gives a graphical description of the iteration space of this loop.

[Figure 1: Strip-mined iteration space; for each II in the strip from I to MIN(I+S-1,N), J runs from 1 up to the line J = α*II+β.]

To interchange the II and J loops, we have to account for the fact that the line J = α*II+β intersects the iteration space at the point (I, α*I+β). Therefore, when we interchange the loops, the II-loop must iterate over a trapezoidal region, as shown in Figure 2, requiring its lower bound to begin at I until (J - β) > αI.

[Figure 2: Strip-mine-and-interchanged iteration space; for each J, the II-loop covers a trapezoidal slice of the strip.]

This gives the following loop nest.

      DO 10 I = 1,N,S
        DO 10 J = 1, α*MIN(I+S-1,N)+β
          DO 10 II = MAX(I,(J-β)/α),MIN(I+S-1,N)
 10         loop body

The formula shown for deriving the new loop bounds can be trivially extended to handle the cases where α < 0 and where strip-mine-and-interchange is performed on both loops.
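As a concrete instance (our own, with α = 1 and β = 0), consider a loop that sweeps a lower-triangular iteration space:

      DO 10 I = 1,N
        DO 10 J = 1,I
 10       A(I) = A(I) + B(J)

After strip-mining the I-loop and interchanging with the bounds derived above, this becomes

      DO 10 I = 1,N,S
        DO 10 J = 1,MIN(I+S-1,N)
          DO 10 II = MAX(I,J),MIN(I+S-1,N)
 10         A(II) = A(II) + B(J)

Here the general bounds α*MIN(I+S-1,N)+β and MAX(I,(J-β)/α) reduce to MIN(I+S-1,N) and MAX(I,J), and the II-loop traverses exactly the trapezoidal region of Figure 2.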

3.1.2 Complex Dependence Patterns

Loop nests that cannot be fully blocked and retain their original semantics are common to linear algebra codes. These loops often contain dependence patterns that are too complex for standard dependence abstractions, such as direction vectors, to describe [Wol82]. Because of the limited information, the potential for blocking cannot be discovered. As an example, consider the strip-mined version of LU decomposition below.

      DO 10 K = 1,N-1,KS
        DO 10 KK = K,K+KS-1
          DO 20 I = KK+1,N
 20         A(I,KK) = A(I,KK)/A(KK,KK)
          DO 10 J = KK+1,N
            DO 10 I = KK+1,N
 10           A(I,J) = A(I,J)-A(I,KK)*A(KK,J)

To complete the blocking of the loop, the KK-loop would have to be distributed around the loop that surrounds statement 20 and around the loop nest that surrounds statement 10 before being interchanged to the innermost position. However, there is a recurrence between statements 10 and 20 carried by the KK-loop that prevents distribution. If we analyze the accessed regions of the array A using array summary information originally designed for interprocedural analysis, we find that the region accessed by statement 20 for the entire execution of the KK-loop is a subset of the region accessed by statement 10 [TIF86, BC86, CK87, BK89] (Figure 3 gives a graphical description of the data regions). This means that the recurrence only exists for a portion of the data accessed in the second loop and that index-set splitting can be used to allow partial blocking [Wol87]. The index set of the loop surrounding statement 10 can be split into one loop that accesses the common region and one loop that accesses the disjoint region. Since no recurrence exists between the new loop and statements 10 and 20, the KK-loop can then be distributed around each disjoint region (or loop) and partial blocking can be performed on the newly created loop.

[Figure 3: LU decomposition data regions; the region of A accessed by statement 20 over the KK-loop is a subset of the region accessed by statement 10.]

Below is LU decomposition after strip-mining, index-set splitting and loop distribution.

      DO 10 K = 1,N-1,KS
        DO 20 KK = K,K+KS-1
          DO 30 I = KK+1,N
 30         A(I,KK) = A(I,KK)/A(KK,KK)
          DO 20 J = KK+1,K+KS-1
            DO 20 I = KK+1,N
 20           A(I,J) = A(I,J)-A(I,KK)*A(KK,J)
        DO 10 KK = K,K+KS-1
          DO 10 J = K+KS,N
            DO 10 I = KK+1,N
 10           A(I,J) = A(I,J)-A(I,KK)*A(KK,J)

It should be noted that there is a dependence that prevents the interchange of the KK and I loops in the new loop body, but index-set splitting can be applied to the I-loop to work around the dependence so that partial blocking can be performed. We applied our algorithm by hand to LU decomposition and compared its performance with the original point algorithm and a version hand coded by Sorensen. In the table below, "Block1" refers to the Sorensen version and "Block2" refers to the code resulting from our algorithm. In addition, we used an automatic system to perform unroll-and-jam and scalar replacement on our blocked code, producing the version referred to as "Block2+". The results reported are from an experiment run on a Mips M120 using a 300x300 array of single-precision Reals. The reader should note that these final transformations could have been applied to the Sorensen version as well, with similar improvements.

    Point    Block1   Block2   Block2+   Speedup
    8.36s    6.69s    6.40s    3.55s     2.35

4 Registers

Although conventional compilation systems do a good job of allocating scalar variables to registers, their handling of subscripted variables leaves much to be desired. Most compilers fail to recognize even the simplest opportunities for reuse of subscripted variables.

For example, in the code shown below,

      DO 10 I = 1, N
        DO 10 J = 1, M
 10       A(I) = A(I) + B(J)

most compilers will not keep A(I) in a register in the inner loop. This happens in spite of the fact that standard optimization techniques are able to determine that the address of the subscripted variable is invariant in the inner loop. On the other hand, if the loop is rewritten as

      DO 20 I = 1, N
        T = A(I)
        DO 10 J = 1, M
 10       T = T + B(J)
 20     A(I) = T

even the most naive compilers allocate T to a register in the inner loop. The principal reason for the problem is that the data-flow analysis used by standard compilers is not powerful enough to recognize most opportunities for reuse of subscripted variables. Standard register allocation methods, following Chaitin et al. [CAC+81], concentrate on allocating scalar variables to registers throughout a procedure. Furthermore, they treat subscripted variables in a particularly naive fashion, if at all, making it impossible to determine when a specific element might be reused. This is particularly problematic for floating-point register allocation, because most of the computational quantities held in such registers originate in subscripted arrays. Using a transformation called scalar replacement, we can give a coloring register allocator the opportunity to allocate values held in arrays to registers [CCK88, CCK90]. As was the case above, many inner-loop-carried dependences can be replaced with a sequence of scalar temporaries to effect register allocation. The transformed code has the dependence information encoded into the sequence of scalar temporaries, giving a coloring allocator the information necessary to allocate array values to registers.
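As a small sketch of that idea (ours, reusing the one-iteration-distance flow dependence illustrated in Section 2), a value carried around the inner loop can be kept in a single temporary:

      DO 10 I = 2, N
 10     A(I) = A(I-1) + B(I)

becomes

      T = A(1)
      DO 10 I = 2, N
        T = T + B(I)
 10     A(I) = T

Each iteration now performs one load and one store instead of two loads and a store; the value of A(I-1) reaches the next iteration through the register holding T, and that fact is visible to a coloring allocator.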

4.1 Unroll-And-Jam

When inner-loop-carried dependences do not provide enough opportunities for scalar replacement, we would like to perform the transformation on dependences carried by an outer loop. Because it is inconceivable to keep all of the values corresponding to an outer-loop-carried dependence in registers for the entire execution of the inner loop, the goal is to convert those dependences to inner-loop-carried or loop-independent dependences. Unroll-and-jam is a transformation that can be used to attain this result [AC72, AN87, CCK88]. The transformation unrolls an outer loop and then fuses the inner loops back together. For example, the loop:

      DO 10 J = 1, 2*M
        DO 10 I = 1, N
 10       A(I,J) = A(I,J+1) + A(I,J-1)

after unrolling becomes:

      DO 10 J = 1, 2*M, 2
        DO 20 I = 1, N
 20       A(I,J) = A(I,J+1) + A(I,J-1)
        DO 10 I = 1, N
 10       A(I,J+1) = A(I,J+2) + A(I,J)

and after fusion becomes:

      DO 10 J = 1, 2*M, 2
        DO 10 I = 1, N
          A(I,J)   = A(I,J+1) + A(I,J-1)        (S1)
 10       A(I,J+1) = A(I,J+2) + A(I,J)          (S2)

The true dependence from A(I,J) to A(I,J-1) is originally carried by the outer loop, but after unroll-and-jam the dependence edge goes from A(I,J) in S1 to A(I,J) in S2 and is loop independent. We can now allocate a temporary for A(I,J) and remove one load.
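A sketch of the fused loop after that replacement (the temporary name T is ours):

      DO 10 J = 1, 2*M, 2
        DO 10 I = 1, N
          T = A(I,J+1) + A(I,J-1)
          A(I,J) = T
 10       A(I,J+1) = A(I,J+2) + T

The load of A(I,J) in S2 disappears; the value reaches S2 through the register holding T.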

Unroll-and-jam not only improves register allocation opportunities; it also improves cache performance by shortening distances between sources and sinks of dependences. As a result, cache hits are more likely [Por89]. Finally, unroll-and-jam can be used to improve pipeline utilization in the presence of recurrences [CCK88].

4.1.1 Non-Rectangular Iteration Spaces

When the iteration space of a loop is not rectangular, unroll-and-jam cannot be directly applied as in the previous section. For each iteration of the outer loop, the number of iterations of the inner loop may vary, which makes unrolling by a fixed amount illegal. Below, we show how to perform unroll-and-jam on triangular- and trapezoidal-shaped iteration spaces by extending triangular strip-mine-and-interchange. Another way to think of unroll-and-jam is as an application of strip mining, loop interchange and loop unrolling. Using this idea, we will begin with the example loop from the previous section after triangular strip-mine-and-interchange. To complete the triangular unroll-and-jam, we need to unroll the inner loop. Unfortunately, the iteration space is a trapezoidal region (see Figure 2), making the number of iterations of the inner loop vary with J. To overcome this, we can split the index set of J into one loop that iterates over the rectangular region below the line J = αI+β and one loop that iterates over the triangular region above the line. Since we know the length of the rectangular region, the first loop can be unrolled to give the following loop nest.

      DO 10 I = 1,N,S
        DO 20 J = 1, α*I+β
 20       loop body unrolled S-1 times
        DO 10 II = I+1,I+S-1
          DO 10 J = MAX(1, α*I+β+1), α*II+β
 10         loop body
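To make this concrete (our own instance, with α = 1, β = 0 and S = 2, and N assumed even), applying the transformation to the triangular loop sketched in Section 3.1.1 gives:

      DO 10 I = 1, N, 2
        DO 20 J = 1, I
          A(I)   = A(I)   + B(J)
 20       A(I+1) = A(I+1) + B(J)
 10     A(I+1) = A(I+1) + B(I+1)

Each B(J) fetched in the first inner loop is now used for both A(I) and A(I+1), and the triangular remainder reduces to a single statement.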

Depending upon the values of α and β, it may also be possible to determine the size of the triangular region; therefore, it may be possible to completely unroll the second loop nest to eliminate the overhead. The formula above can be trivially extended to handle α < 0. While the previous method applies to many of the common irregular-shaped iteration spaces, there are still some important loops that it will not handle.

In linear algebra, signal processing and partial differential equation codes, we often find loops with trapezoidal-shaped iteration spaces. Consider the following example.

      DO 10 I = 1,N
        DO 10 J = I,MIN(I+M,L)
 10       loop body

Here, as is often the case, a MIN function has been used to handle boundary conditions, resulting in the loop's trapezoidal shape. Using index-set splitting to create one triangular and one rectangular region will allow us to use triangular and regular unroll-and-jam on the resulting loop nests [CCK90].
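A sketch of the split (assuming 0 < L-M < N so that both pieces are non-empty; otherwise the bounds need guarding or one piece simply executes no iterations):

      DO 10 I = 1, L-M
        DO 10 J = I, I+M
 10       loop body
      DO 20 I = L-M+1, N
        DO 20 J = I, L
 20       loop body

The first nest has a constant inner trip count of M+1 and can be handled with regular unroll-and-jam; the second is triangular and can be handled with the method above.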

4.1.2 Unroll Amounts

The major difficulty with unroll-and-jam is determining the amount of unrolling necessary to produce optimal results. The current implementation picks the two outer loops that carry the most dependences (true and input) and unrolls them by a factor provided as a parameter. We have developed, but not fully implemented, an efficient heuristic that chooses this parameter automatically. When we apply unroll-and-jam to a loop nest, we begin by assuming that optimization for cache performance has already been done. With this in mind, our heuristic will concentrate on the balance between memory accesses and floating-point operations to prevent program loops from being bound by memory access time.

A computer is said to be balanced when it can operate in a steady state manner with both memory accesses and floating-point operations being performed at peak speed. A measure for machine balance is the ratio of the peak rate at which words can be fetched from memory to the peak rate at which floating-point operations can be performed. Program loops also have a balance associated with them. A measure for loop balance is the ratio of the number of memory words accessed in the loop to the number of floating-point operations performed in the loop plus the number of idle pipeline cycles. A more detailed explanation of balance can be found in [CCK88].

We would like to convert loops that are bound by memory access time to loops that are balanced. When a loop's balance matches a machine's balance, memory and floating-point cycles are not wasted, resulting in a faster execution time compared to the unbalanced loop. Since unroll-and-jam can introduce more floating-point operations without a proportional increase in memory accesses, it has the potential to improve a loop's balance. However, using unroll-and-jam carelessly can be counterproductive. If we were to create a loop body that does not fit in the instruction cache, or create too much register pressure and introduce spill code, the gains attained with improved balance will be offset. Given these factors, our heuristic becomes an integer optimization problem where the objective is to perform unroll-and-jam on a loop nest so that its balance matches the target machine's balance. The constraints on the objective are the loop body remaining smaller than the instruction cache and the register pressure remaining smaller than the floating-point register set size.
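As a rough worked example (the machine numbers are invented for illustration, not measurements): suppose a machine can fetch one word per cycle and complete two floating-point operations per cycle, a machine balance of 0.5 words per flop. The scalar-replaced loop from Section 4, whose inner iteration does one load of B(J) and one addition, has a loop balance of 1.0 and is therefore bound by memory on that machine. Unroll-and-jam of the I-loop by one, with scalar replacement (N assumed even), gives:

      DO 10 I = 1, N, 2
        T1 = A(I)
        T2 = A(I+1)
        DO 20 J = 1, M
          T1 = T1 + B(J)
 20       T2 = T2 + B(J)
        A(I) = T1
 10     A(I+1) = T2

The inner loop now does one load and two additions per iteration, a balance of 0.5 that matches the hypothetical machine, so under this model no further unrolling is needed.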

4.1.3 Results

We have implemented a prototype source-to-source translator based on Pfc, a dependence analyzer and parallelization system for Fortran programs. The prototype restructures loops using unroll-and-jam and then replaces subscripted variables with scalars, using both the knapsack and greedy algorithms to moderate register pressure. For our test machine, we chose the Mips M120 (based on the Mips R2000) because it had a good compiler and a reasonable number of floating-point registers (16) and because it was available to us. In examining these results, the reader should be aware that some of the improvements cited may have been achieved because the transformed versions of the program have, in addition to better register utilization, more operations to fill the pipeline on the M120. The experimental design is illustrated in Figure 4. In this scheme, Pfc serves as a preprocessor, rewriting Fortran programs to improve register allocation. Both the original and transformed versions of the program are then compiled and run using the standard product compiler for the target machine.

Linpack Benchmark. An important test for a restructuring compiler is to achieve the same results as a hand-optimized version of code. We chose the routine Smxpy from Linpack, which computes a vector-matrix product and was hand unrolled to improve performance [DBMS79]. We re-rolled it to get the original loop, shown below:

      DO 10 J = 1, N
        DO 10 I = 1, N
 10       Y(I) = Y(I) + X(J)*M(I,J)

and then fed this loop to Pfc, which performed unroll-and-jam 15 times (the same as the hand-optimized version) plus scalar replacement.
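For illustration only (the experiment used an unroll amount of 15; we show 2 and assume N is even so that no cleanup loop is needed), the transformed loop has roughly this shape after unroll-and-jam of the J-loop and scalar replacement of the invariant X values:

      DO 10 J = 1, N, 2
        X0 = X(J)
        X1 = X(J+1)
        DO 10 I = 1, N
 10       Y(I) = Y(I) + X0*M(I,J) + X1*M(I,J+1)

Each load and store of Y(I) is now amortized over two columns of M; the hand-unrolled Linpack version obtains the same effect with its larger unroll amount.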

As the table below shows, our automatic techniques achieved slightly better performance from the easy-to-understand original loop, without the effort of hand optimization. The speedup column represents the improvement of the automatic version over the hand-optimized version. To make this measurement, we iterated over each loop 30 times using arrays of dimension 1000.

    Original   Linpack   Automatic   Speedup
    28.0s      18.47s    17.52s      1.05

Matrix Multiplication. Next, we performed our test on matrix multiplication with 500 × 500 arrays of single-precision Real numbers. We used various unroll values on both outer loops plus scalar replacement. We show their separate effects in the table below. The first entry in the speedup column represents the speedup due to scalar replacement alone, whereas the second entry shows the speedup due to unroll-and-jam plus scalar replacement.

    Unroll Value   Original    Scalar     Speedup
    0              179.82s     162.3s     1.11
    1               83.68s      77.92s    1.07   2.31
    2               67.09s      62.64s    1.07   2.87
    3               63.17s      60.49s    1.04   2.97
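A sketch of what an unroll value of 1 on both outer loops produces for matrix multiplication (the loop ordering and the array names A, B and C are assumed here, since the paper does not show the source; N is assumed even): both outer loops are unrolled once and jammed, and the four C elements plus the shared A and B values are scalar replaced.

      DO 20 J = 1, N, 2
        DO 20 I = 1, N, 2
          C00 = C(I,J)
          C10 = C(I+1,J)
          C01 = C(I,J+1)
          C11 = C(I+1,J+1)
          DO 30 K = 1, N
            A0 = A(I,K)
            A1 = A(I+1,K)
            B0 = B(K,J)
            B1 = B(K,J+1)
            C00 = C00 + A0*B0
            C10 = C10 + A1*B0
            C01 = C01 + A0*B1
 30         C11 = C11 + A1*B1
          C(I,J)     = C00
          C(I+1,J)   = C10
          C(I,J+1)   = C01
 20       C(I+1,J+1) = C11

The inner loop now issues four loads for eight floating-point operations, versus two loads for two operations in the scalar-replaced point version, which is consistent with the growing combined speedups in the table.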

A Trapezoidal Loop. To test the effectiveness of unroll-and-jam on loops with irregular-shaped iteration spaces, we chose the following trapezoidal loop, which computes the adjoint-convolution of two time series.

      DO 10 I = 1,N1
        DO 10 K = I,MIN(I+N2,N3)
 10       F3(I) = F3(I)+DT*F1(K)*WORK(I-K)

We performed index-set splitting by hand and then ran the resulting code through Pfc, which performed triangular and regular unroll-and-jam with an unroll value of 3 plus scalar replacement. We ran it on arrays of single-precision Reals of size 100 so that 75% of the execution was done in the triangular region. Below is a table of the results.

    Iterations   Original   Transformed   Speedup
    5000         15.59s     9.08s         1.72

A Complete Application. We also performed the test on a complete application, onedim. This program computes the eigenfunctions and eigenenergies of a time-independent Schrodinger equation. The table below shows the results when scalar replacement and unroll-and-jam are applied with an unroll value of 2.

    Original   Transformed   Speedup
    47.9s      33.34s        1.44

[Figure 4: Experimental design. A Fortran source program is rewritten by the source-to-source transformer (Pfc) and then compiled by the native Fortran compiler; the improved time is compared with the original time.]

5 Summary and Future Work

We have set out to determine whether a compiler can automatically manage a machine's memory hierarchy. Our early results are encouraging: we can successfully handle the floating-point register set and we have taken encouraging steps in handling cache management. Our results show that dramatic improvements are possible over the untransformed code and that we can attain algorithms that run at least as fast as routines transformed by hand (by experts). However, there is much left to be done if we are to handle an acceptable portion of real-world programs. For example, our methods must be extended to handle pivoting, or severe limits will be placed on the premise that memory hierarchy management can be done by the compiler.

Our goal is to show wide applicability of memory hierarchy management techniques, leading to a definition of a "blockable" programming style that can be translated into blocked code for specific machines by their native compilers. This result would represent a major step in the direction of supporting machine-independent parallel programming. In order to reach this goal, our research is pursuing the following directions:

• We are completing the implementation of the automatic determination of unroll amounts.
• We are performing a complete investigation of the Lapack codes to determine the applicability of current techniques and to develop new techniques to handle more real-world programming styles.
• We are studying the automatic determination of multi-dimensional block sizes.

Although the results presented in this paper are encouraging, our experimentation is not complete.

Once our implementation is finished, we plan to do a thorough study of the RiCEPS Fortran benchmark suite being developed at Rice. Given the rise of high-performance architectures with large memory latencies, such as the Intel i860, the success of this research will allow advanced microprocessors to achieve their full potential as building blocks for future supercomputers.

6 Acknowledgments

Preston Briggs, Keith Cooper and Chau-Wen Tseng made many helpful suggestions during the preparation of the document. Dongarra and Sorensen have provided us with block and point versions of the Lapack algorithms and have made many useful suggestions. Michael Lewis provided the challenge of trapezoidal unroll-and-jam. John Cocke and David Callahan provided many of the original ideas from which this work derives. To all of these people go our heartfelt thanks.

References

[AC72]   F.E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.

[AK87]   J.R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.

[AN87]   A. Aiken and A. Nicolau. Loop quantization: An analysis and algorithm. Technical Report 87-821, Cornell University, March 1987.

[AS79]   W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, Dept. of Computer Science, University of Illinois, 1979.

[BC86]   M. Burke and R. Cytron. Interprocedural dependence analysis and parallelization. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, July 1986.

[BK89]   V. Balasundaram and K. Kennedy. A technique for summarizing data access and its use in parallelism enhancing transformations. In Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.

[CAC+81] G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, and P.W. Markstein. Register allocation via coloring. Computer Languages, 6:45-57, January 1981.

[CCK88]  D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5, 1988.

[CCK90]  D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.

[CK87]   D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing, Athens, Greece. Springer-Verlag, 1987.

[CK89]   S. Carr and K. Kennedy. Blocking linear algebra codes for memory hierarchies. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, December 1989.

[DBMS79] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart. LINPACK User's Guide. SIAM Publications, Philadelphia, 1979.

[IT88]   F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.

[Kuc78]  D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[Por89]  A.K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[TIF86]  R. Triolet, F. Irigoin, and P. Feautrier. Direct parallelization of call statements. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, pages 176-184, July 1986.

[Wol82]  M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.

[Wol86]  M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.

[Wol87]  M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.

[Wol89]  M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.
