BLOCKING LINEAR ALGEBRA CODES FOR MEMORY HIERARCHIES

STEVE CARR AND KEN KENNEDY
Abstract. Because computation speed and memory size are both increasing, the latency of memory, in basic machine cycles, is also increasing. As a result, recent compiler research has focused on reducing the effective latency by restructuring programs to take more advantage of high-speed intermediate memory (or cache, as it is usually called). The problem is that many real-world programs are non-trivial to restructure, and current methods will often fail. In this paper, we present some encouraging preliminary results of a project to determine how much restructuring is possible with automatic techniques.
1. Introduction. Over the past decade we have seen dramatic reductions in the cycle times of microprocessors, while memories for the same processors have been growing in size. These two trends have yielded computer systems in which memory latency is quite large in terms of basic machine cycles; latencies of 10 to 20 cycles are not unusual. To address this problem, system designers have incorporated cache memories into high-performance systems. This strategy reduces the latency for most accesses to one or two cycles. While cache memory works well for most calculations, it is less effective in scientific computing, because the sizes of the working arrays typically grow as the problem size grows. Hence, one problem might fit entirely in cache, while another that is slightly larger will not fit. In the second of these cases the effect on performance can be disastrous. Previous studies have shown that computers with large latencies can spend nearly half the execution time waiting for data to be delivered to cache [Por89].

To compensate for large latencies, scientists have begun to restructure their algorithms by hand to increase data locality within loops and, thus, to diminish the effect of latency on performance. The principal technique is to employ block algorithms, which work on subproblems that fit into the cache. Unfortunately, this technique forces the programmer to spend time on the tedious details of restructuring his code for performance on each different machine architecture for which the program is targeted. The resulting program is highly machine-dependent. The problem is even worse for shared-memory parallel machines, because the processors share global memory but usually have private caches. Much of the motivation for the Lapack project is to provide versions of the Linpack and Eispack libraries that are blocked to achieve high performance on parallel processors.

If the trend continues, applications written in Fortran will need to be significantly restructured for each new machine architecture. The danger is that an increasing fraction of the human resources available for science and engineering will be spent on conversion of high-level language programs from one parallel machine to another, an unacceptable eventuality.
Research supported by NSF Grant CCR-8809615
In the past, we have avoided machine-dependent programming by making our compilers more intelligent. The Fortran I compiler included enough optimizations to make it possible for scientists to abandon machine language programming, ushering in the era of widespread application of the computer to scientific problems. More recently, advanced vectorizing compilers have made it possible to write "vectorizable" code and expect the compiler to translate it into efficient machine code for the target vector machine. Vectorizing compilers did not eliminate the need to restructure programs for vector execution; instead, they made it possible to write machine-independent vector programs in a sublanguage of Fortran 77.

Is it possible to achieve the same success for memory hierarchy management? More precisely, is it possible to identify a sublanguage of Fortran that is general enough to conveniently program scientific applications but restrictive enough for a compiler to generate good code, good enough to discourage machine-dependent programming practices? For the past two years, we at Rice have been studying this issue.

Compiler researchers have been investigating the automation of blocking techniques for many years and have developed transformations, such as strip-mine-and-interchange, that improve the cache utilization of loop nests [AS79, Wol87, IT88, Wol89, Por89]. The goal of these transformations is to produce automatically an algorithm blocked for a specific machine from the corresponding point algorithm. Although these techniques have been shown to be effective for "simple" loops, they are not directly applicable to the complex loop nests common in linear algebra codes such as Lapack [Por89]. These codes often contain loops that have complex dependence patterns that are non-trivial to analyze, or that have irregularly shaped iteration spaces that present problems in deriving loop bounds for the transformed loop nest.

To investigate the generality of automatic blocking techniques, we established a goal of looking at linear algebra codes such as those found in Lapack. Dongarra and Sorensen are contributing point and block versions of all the algorithms used in Lapack, and we are investigating whether the block algorithms can be derived automatically in a compiler. In this paper we report some encouraging preliminary results of that study. In particular, we have found transformation algorithms that can be used successfully on triangular loops, which are quite common in linear algebra. In addition, we have discovered an algorithmic approach that can be used to analyze and block programs that exhibit complex dependence patterns. The latter method has been applied successfully to block LU decomposition without pivoting.

2. Dependence. The fundamental tool available to the compiler is the same tool used in vectorization and parallelization, namely dependence. We say that a dependence exists between two statements if there exists a control flow path from the first statement to the second, and both statements reference the same memory location [Kuc78]. If the first statement writes to the location and the second reads from it, there is a true dependence, also called a flow dependence. If the first statement reads from the location and the second writes to it, there is an antidependence. If both statements write to the location, there is an output dependence. If both statements read from the location, there is an input dependence.
A dependence is said to be carried by a particular loop if the references at the source and sink of the dependence occur on different iterations of that loop and the dependence is not carried by any outer loop. When applied to memory hierarchy management, a dependence can be thought of as an opportunity for reuse: the value referenced at the source may still be in cache, or in a register, when the sink is executed.
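To make these definitions concrete, the following small loop (an illustrative fragment constructed for this discussion, not taken from the codes studied here) exhibits both a loop-carried true dependence and a loop-independent antidependence.

      DO 10 I = 2,N
        A(I) = B(I) + C(I)
*       True (flow) dependence carried by the I-loop: this statement
*       reads A(I-1), which was written by the statement above on the
*       previous iteration, so the value is an opportunity for reuse.
        D(I) = A(I-1) * 2.0
*       Antidependence: B(I) is read by the first statement and
*       written here on the same iteration.
        B(I) = E(I) + 1.0
   10 CONTINUE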
3. Iteration Space Blocking.

3.1. Overview. To improve the memory behavior of loops that access more data than can be handled by the cache, the iteration space of a loop can be blocked into sections whose reuse can be captured by the cache. Strip-mine-and-interchange is a transformation that achieves this result [Wol87, Por89]. The effect is to shorten the distance between the source and sink of a dependence so that it is more likely for the datum to reside in cache when the reuse occurs. Consider the following loop nest.
      DO 10 I = 1,N
        DO 10 J = 1,M
   10     A(I) = A(I) + B(J)
Assuming that the value of M is much greater than the size of the cache, the cache would provide no benefit for B, while the reuse for A could be handled by a register. To capture reuse for both A and B we can use strip-mine-and-interchange. First, we strip-mine the loop as shown below.

      DO 10 I = 1,N,S
        DO 10 II = I,MIN(I+S-1,N)
          DO 10 J = 1,M
   10       A(II) = A(II) + B(J)
And then we interchange the strip loop and the inner loop to give:

      DO 10 I = 1,N,S
        DO 10 J = 1,M
          DO 10 II = I,MIN(I+S-1,N)
   10       A(II) = A(II) + B(J)
Now, the reuse in B(J) can be captured in a register and the reuse in A(II) can be captured in the cache if S is not larger than the cache size. It will also be desirable to perform strip-mine-and-interchange on the J-loop to attain better cache performance for both A and B, as sketched below.
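A minimal sketch of that further step, assuming strip sizes SI and SJ (names introduced here only for illustration) chosen so that roughly SI + SJ array elements fit in the cache:

      DO 10 I = 1,N,SI
        DO 10 J = 1,M,SJ
*         Within each (I,J) tile, the SI elements of A and the SJ
*         elements of B are reused repeatedly while resident in the
*         cache; B(JJ) is again invariant in the innermost loop and
*         can be kept in a register.
          DO 10 JJ = J,MIN(J+SJ-1,M)
            DO 10 II = I,MIN(I+SI-1,N)
   10         A(II) = A(II) + B(JJ)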
3.2. Triangular Iteration Spaces. When the iteration space of a loop is not rectangular, the transformation described above cannot be applied directly as shown. The problem is that when interchanging loops that iterate over a triangular region, the loop bounds must be modified to preserve the semantics of the loop [Wol86]. Below, we derive the formula for determining loop bounds when strip-mine-and-interchange is performed on a triangular iteration space. The general form of a strip-mined triangular loop is given below; the I and J loops have been normalized to give a lower bound of 1, α and β are integer constants (β may be a loop invariant), and α > 0.

      DO 10 I = 1,N,S
        DO 10 II = I,MIN(I+S-1,N)
          DO 10 J = 1,α*II+β
   10       loop body
Figure 1a gives a graphical description of the iteration space of this loop. To interchange the II and J loops, we must account for the fact that the line J = α*II+β intersects the iteration space at the point where J = α*I+β. Therefore, when we interchange the loops, the II-loop must iterate over a trapezoidal region, as shown in Figure 1b, requiring its lower bound to remain I until (J-β)/α > I. This gives the following loop nest.
      DO 10 I = 1,N,S
        DO 10 J = 1,α*(MIN(I+S-1,N))+β
          DO 10 II = MAX(I,(J-β)/α),MIN(I+S-1,N)
   10       loop body
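For instance, with α = 1 and β = 0 (that is, an original inner loop of DO 10 J = 1,I, giving a lower-triangular iteration space), the bounds specialize to the nest below; this instance is added here only as an illustration of the formula.

      DO 10 I = 1,N,S
        DO 10 J = 1,MIN(I+S-1,N)
          DO 10 II = MAX(I,J),MIN(I+S-1,N)
   10       loop body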
The formula shown for deriving the new loop bounds can be trivially extended to handle the case where α < 0 and the case where strip-mine-and-interchange is performed on both loops (see the appendix).

Fig. 1. (a) Strip-mined. (b) Strip-mine-and-interchanged.

4. Complex Dependence Patterns. Loop nests that cannot be fully blocked while retaining their original semantics are common in linear algebra codes. These loops often contain dependence patterns that are too complex for standard dependence abstractions, such as direction vectors, to describe [Wol82]. Because of the limited information, the potential for blocking cannot be discovered. As an example, consider the strip-mined version of LU decomposition below.

      DO 10 K = 1,N-1,KS
        DO 10 KK = K,K+KS-1
          DO 20 I = KK+1,N
   20       A(I,KK) = A(I,KK)/A(KK,KK)                        (S1)
          DO 10 J = KK+1,N
            DO 10 I = KK+1,N
   10         A(I,J) = A(I,J) - A(I,KK) * A(KK,J)             (S2)
To complete the blocking of the loop, the KK-loop would have to be distributed around the loop surrounding statement S1 and around the loop nest surrounding statement S2 before being interchanged to the innermost position. However, there is a recurrence between statements S1 and S2, carried by the KK-loop, that prevents distribution. If we analyze the accessed regions of the array A using array summary information originally designed for interprocedural analysis, we find that the region accessed by statement S1 for the entire execution of the KK-loop is a subset of the region accessed by statement S2 [TIF86, BC86, CK87, BK89] (Figure 2 gives a graphical description of the data regions). This means that the recurrence exists only for a portion of the data accessed in the second loop and that index-set splitting can be used to allow partial blocking [Wol87]. The index set of the loop surrounding statement S2 can be split into one loop that accesses the common region and one loop that accesses the disjoint region. Since no recurrence exists between the new loop and statements S1 and S2, the KK-loop can then be distributed around each disjoint region (or loop), and partial blocking can be performed on the newly created loop. Below is LU decomposition after strip-mining, index-set splitting, and loop distribution.

      DO 10 K = 1,N-1,KS
        DO 20 KK = K,K+KS-1
          DO 30 I = KK+1,N
   30       A(I,KK) = A(I,KK)/A(KK,KK)
          DO 20 J = KK+1,K+KS-1
            DO 20 I = KK+1,N
   20         A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
        DO 10 KK = K,K+KS-1
          DO 10 J = K+KS,N
            DO 10 I = KK+1,N
   10         A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
Fig. 2. Regions Accessed in LU Decomposition.
It should be noted that there is a dependence that prevents the interchange of the KK and I loops in the new loop body, but index-set splitting can be applied to the I-loop to work around the dependence so that partial blocking can be performed.

We applied our algorithm by hand to LU decomposition and compared its performance with the original program and with a version hand-coded by Sorensen. In the table below, "Block 1" refers to the Sorensen version and "Block 2" refers to the program resulting from our algorithm. In addition, we used an automatic system to perform two additional transformations, unroll-and-jam and scalar replacement, on our blocked code, producing the version referred to as "Block 2+" [CCK90]. The results reported are from an experiment run on a MIPS M120 using a 300x300 array of single-precision reals. The reader should note that these final transformations could have been applied to the Sorensen version as well, with similar improvements.

      Original   Block 1   Block 2   Block 2+   Speedup
       8.36s      6.69s     6.40s     3.55s      2.35

5. Previous Work. Wolfe has done a significant amount of work on this problem [Wol86, Wol87, Wol89]. In particular, he discusses blocking for triangular-shaped iteration spaces and for LU decomposition, but he does not present compiler algorithms; instead, he illustrates the transformations with a few examples. Our work takes this a step further by showing how a compiler could automate these steps with information generally available to it. Irigoin and Triolet describe a general technique for blocking iteration spaces for memory that uses a new dependence abstraction, called a dependence cone [IT88]. This technique does not work on non-perfectly nested loops, which are common in linear algebra codes, nor does it appear to be easily implemented in a compiler.

6. Conclusions and Future Work. We set out to determine whether a compiler can automatically determine how to block algorithms from linear algebra. Our preliminary results are encouraging: we can block triangular loops, and we have found a method for blocking LU decomposition without pivoting. These algorithms run at least as fast as routines blocked by hand (by experts). However, there is much left to do. For example, our current methods are unable to deal with pivoting. If we are unable to overcome this limitation, it will severely limit the applicability
of these techniques, and we may have to reject the premise that memory hierarchy management by compiler is possible. On the other hand, if we succeed in showing wide applicability for these techniques, we should be able to define a "blockable" programming style that can be translated into blocked code for specific machines by their native compilers. This would represent a major step in the direction of supporting machine-independent parallel programming.

REFERENCES
[AS79] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, Dept. of Computer Science, University of Illinois, 1979.
[BC86] M. Burke and R. Cytron. Interprocedural dependence analysis and parallelization. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, July 1986.
[BK89] V. Balasundaram and K. Kennedy. A technique for summarizing data access and its use in parallelism enhancing transformations. In Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.
[CCK90] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.
[CK87] D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing, Springer-Verlag, Athens, Greece, 1987.
[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.
[Kuc78] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.
[Por89] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.
[TIF86] R. Triolet, F. Irigoin, and P. Feautrier. Direct parallelization of call statements. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, pages 176-184, July 1986.
[Wol82] M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.
[Wol86] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.
[Wol87] M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.
[Wol89] M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.
Appendix: Strip-Mine-And-Interchange Formulas

1) one loop: case α > 0

      DO 10 I = 1,N,S
        DO 10 J = 1,α*(MIN(I+S-1,N))+β
          DO 10 II = MAX(I,(J-β)/α),MIN(I+S-1,N)
   10       loop body

3) both loops: case α > 0

      DO 10 I = 1,N,IS
        DO 10 J = 1,α*(MIN(I+IS-1,N))+β,JS
          DO 10 II = MAX(I,(J-β)/α),MIN(I+IS-1,N)
            DO 10 JJ = J,MIN(α*II+β,J+JS-1)
   10         loop body
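As an illustration of the both-loops formula (a sketch added here, again taking α = 1 and β = 0, i.e., an original inner loop of DO 10 J = 1,I, with assumed strip sizes IS and JS), the blocked lower-triangular nest becomes:

      DO 10 I = 1,N,IS
        DO 10 J = 1,MIN(I+IS-1,N),JS
          DO 10 II = MAX(I,J),MIN(I+IS-1,N)
            DO 10 JJ = J,MIN(II,J+JS-1)
   10         loop body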