Journal of Parallel and Distributed Computing 5, 587-616 (1988)
Strategies for Cache and Local Memory Management by Global Program Transformation

DENNIS GANNON
Department of Computer Science, Indiana University, Bloomington, Indiana 47401, and Center for Supercomputing Research and Development, University of Illinois, Urbana, Illinois 61801

WILLIAM JALBY
INRIA, Domaine de Voluceau-Rocquencourt, B.P. 105, 78150 Le Chesnay, France, and Center for Supercomputing Research and Development, University of Illinois, Urbana, Illinois 61801

AND

KYLE GALLIVAN
Center for Supercomputing Research and Development and Department of Computer Science, University of Illinois, Urbana, Illinois 61801

Received March 15, 1987
In this paper we describe a method for using data dependence analysis to estimate cache and local memory demand in highly iterative scientific codes. The estimates take the form of a family of "reference" windows for each variable that reflects the current set of elements that should be kept in cache. It is shown that, in important special cases, we can estimate the size of the window and predict a lower bound on the number of cache hits. If the machine has local memory or cache that can be managed by the compiler, these estimates can be used to guide the management of this resource. It is also shown that these estimates can be used to guide program transformations in an attempt to optimize cache performance. © 1988 Academic Press, Inc.
1. INTRODUCTION

Perhaps the most critical feature in the design of a shared memory parallel processor is the organization and the performance of the memory system. Generally, the shared memory is implemented as a set of independent modules
(which, in turn, may be interleaved) connected to the processors via either a bus or a switch network. This organization exhibits several potential performance problems. First, because of the long journey each memory reference must traverse in going to and from this shared resource, the latency to access a datum may be high; this effect is very important in systems with a large number of processors, where the number of network stages that each memory request must go through grows as log(p), where p is the number of processors. Second, due to contention at both the network level and the memory level (routing and memory bank conflicts), the practical bandwidth (and also the latency) can be severely degraded. An example of such a phenomenon is "hot spot" contention [18].

The use of a hierarchical memory system has already proven efficient on sequential computers in overcoming these problems and speeding up memory accesses. In such a system, the memory is organized in several levels which might be fully shared (each processor may access the whole level), partially shared (the processors "inside a cluster" share the access to a given level), or fully private (each processor has its own level, access to which is restricted to itself; therefore accesses do not suffer from going through the communication medium or from interfering with the other processors). The transfers between these levels are either entirely hardware managed (such as with a cache, where the user has no explicit control over the loading and unloading strategies) or fully software managed (such as registers, where the user or, preferably, the compiler explicitly moves data between levels). For example, the Alliant FX/8 uses two levels of shared memory (trading size for speed): the main memory is connected via a bus to a high bandwidth cache which is in turn shared by the processors through a crossbar. Access to vectors from the cache is two to three times faster than accessing them from memory. Additionally, each processor has its own instruction cache and its own set of vector registers. In other machines only the main memory level is fully shared and each processor has a private data cache, such as with the Sequent Balance and Encore Multimax, or a local memory which might be physically distinct from the main memory (CRAY-2) or just a portion of the shared memory, such as on the IBM RP3 or BBN Butterfly. In this last case, the difference between local and shared requests is that the local ones do not go through the network. Finally, the Cedar system has perhaps one of the most ambitious designs of memory hierarchy. At the processor level, we have instruction caches and vector registers. The processors are then grouped into clusters which share a local cluster memory which is accessed through a shared cluster cache. Finally, the clusters have access to the globally shared memory through the network. Additionally, each cluster has a prefetch unit controlled by software for prefetching data and therefore hiding the latency induced by access to the global memory level.

However, the overall performance of all these hierarchical memory systems is highly dependent upon the address reference stream of the program (more
precisely, its locality). Several studies [5, 8, 10] have shown how algorithm reorganization may result in considerable performance improvement. It is crucial to note that even in the case of hardware managed systems (Alliant FX/8), reorganizing the program in order to make references to the same variable closer in time (or reducing the size of the working set) will speed up program execution [8-10]. Similar phenomena were already observed in studies considering the effect of program organization on paged memory behavior [15], and our approach can be considered to be following in the spirit of [2].

In this paper we consider the problem of automating the process of transforming programs to optimize the utilization of the memory hierarchy. For the sake of simplicity, in our first approach we will assume that the transfers between levels are completely under software control. In that case, locality optimization is a two-level process: first, for a given program, one must solve the allocation problem (which data are to be kept in cache and for how long) and then restructure the program to minimize the number of data transfers. In fact, our basic strategy may still be applied to hardware managed systems. This is because reducing the number of transfers between levels is a dual problem to maximizing the reuse of data (optimizing the "hit ratio").

Our key idea in solving the allocation problem is to consider it at a macroscopic level (loop level and sections of arrays) rather than at the microscopic level (machine level instructions and individual array elements). Using the fact that for scientific codes most of the CPU time is spent in loop-like structure execution, we will globally study the interaction between two statements in a loop by analyzing the sets of all the addresses referenced by each of them during the whole loop execution.

First we show that the theory of data dependence analysis used in automatic vectorizing compilers can be extended so that a more refined algebraic structure can be given to a class of data dependences associated with array index expressions that are common in scientific code. We call this class of dependences "uniformly generated." Next we associate a "reference window" with each data dependence. The reference window of a data dependence between two statements describes the set of elements (a section of the array) that must be kept in the fast memory level to make sure that any data referenced by both statements will stay in the fast level as long as both statements continue to be executed and continue to reference that data item. More generally, we try, for a given dependence, to determine the amount of space required to ensure that each piece of data will be loaded just once. In fact, the reference window can be considered part of the "working set" (i.e., data which are going to be reused later and which may result in cache hits). In Sections 3 and 4, we show that "uniformly generated" dependences have special properties that relate the structure of the data dependence graph to the lattice of reference windows and, given information about the loop bounds, we can estimate the size and
"hit ratio" of the various windows. The problem is then very similar to a classical bin packing problem: the size of a window is its cost, and the "hit ratio" is the benefit associated with the window. In fact, all the analysis necessary to manage the data between the different levels can be done symbolically at compile time, while the final decision might be taken only at run time by substituting values into the symbolic expressions. In Section 5, it is shown that program transformations like loop interchange and blocking can have a substantial effect on the size of the windows and therefore on the demand for space in the fast level. While this fact is well known to most programmers, it is shown here that the data dependence modeling can be used as a mechanism to predict when a loop interchange can improve performance. In Section 6 we show how this mechanism can be used to decide which data should be moved from the global memory of a multiprocessor system to the local memory. We also briefly discuss the implications for multiprocessors with shared cache.

2. DEFINITIONS

In this paper we use the standard definitions for data dependences given in many places (for details on the various definitions and restructuring transformations see [4, 12, 14, 16, 17, 19]). A flow dependence from a statement $S_1$ to a statement $S_2$ exists when a value computed in $S_1$ is stored in a location associated with some variable name $x$ which is later referenced and used in $S_2$; it is denoted

$\delta_x\colon S_1 \to S_2.$
An antidependence from $S_1$ to $S_2$ exists when a variable $x$ referenced by $S_1$ must be used before it is overwritten by $S_2$; it is denoted

$\bar{\delta}_x\colon S_1 \to S_2.$

An output dependence from $S_1$ to $S_2$ exists when both statements modify a common variable $x$ and $S_1$ must complete before $S_2$ does. This is denoted by

$\delta^o_x\colon S_1 \to S_2.$

In order to track memory references another dependence type, known as an input dependence, is used. Unlike the other three types of dependences, an input dependence does not impose a constraint on the potential parallel execution of the two statements, but we still use a notation similar to the others:

$\delta^i_x\colon S_1 \to S_2.$
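As a small illustration (ours, with hypothetical statement contents), all four relations can be read off mechanically from the read and write sets of a pair of statements. A minimal Python sketch:

def classify(writes1, reads1, writes2, reads2):
    """Dependence types from statement S1 to a later statement S2,
    given the variables each statement reads and writes."""
    deps = []
    if writes1 & reads2:
        deps.append("flow")    # S1 writes, S2 later reads
    if reads1 & writes2:
        deps.append("anti")    # S1 reads, S2 later overwrites
    if writes1 & writes2:
        deps.append("output")  # both write the same variable
    if reads1 & reads2:
        deps.append("input")   # both read the same variable
    return deps

# S1: x = y + z   followed by   S2: y = x * 2
print(classify({"x"}, {"y", "z"}, {"y"}, {"x"}))  # ['flow', 'anti']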
In the case of references to elements of structured variables such as vectors or arrays, most references occur within loops. In this paper we consider only simple "for loop" iterations, although much of what we say applies to "while loops" and other tail recursive control structures. For each data dependence between two references nested within a loop, we extend the work of [6] and associate with the dependence a set of distance vectors, defined as follows. Consider a nested sequence of k loops of the form

For i1 = L1 to U1
  For i2 = L2 to U2
    ...
      For ik = Lk to Uk
        ...
        S1
        ...
        S2
        ...
      endfor
    ...
  endfor
endfor

The module $Z^k$ is called the extended iteration space and the product $\prod_{i=1}^{k} D_i$, where $D_i$ is the range $[L_i \cdots U_i]$ of the $i$th induction variable, is called the bounded iteration space. Both the extended and the bounded iteration spaces have a total order which is defined by the point in time at which the element is executed; i.e., $(u_1, u_2, \ldots, u_k) < (v_1, v_2, \ldots, v_k)$ exactly when the leading nonzero component of $(v_1 - u_1, \ldots, v_k - u_k)$ is positive. If the two statements reference a variable $x$ as $x[f(t)]$ in $S_1$ and $x[g(t)]$ in $S_2$, where $f$ and $g$ map the iteration space into the subscript space $Z^d$, then the set of distance vectors of the dependence is

$V_{f,g} = \{v \in Z^k \mid v > 0 \text{ and } f(t) = g(t + v)\}.$
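For small loops the set $V_{f,g}$ can be recovered by brute force directly from this definition, which is handy for checking the algebra that follows. A Python sketch (ours; the subscript functions and bounds are chosen to match the loop analyzed below):

from itertools import product

def distance_vectors(f, g, bounds):
    """Enumerate {v > 0 : f(t) = g(t + v)} over the bounded iteration
    space; v > 0 means lexicographically positive (forward in time)."""
    pts = list(product(*[range(lo, hi + 1) for lo, hi in bounds]))
    vs = set()
    for t in pts:
        for t2 in pts:
            v = tuple(b - a for a, b in zip(t, t2))
            if v > tuple(0 for _ in v) and f(t) == g(t2):
                vs.add(v)
    return vs

# S1 writes x[i,j]; S2 reads x[i-3,j+5]:
print(distance_vectors(lambda t: (t[0], t[1]),
                       lambda t: (t[0] - 3, t[1] + 5),
                       [(1, 8), (1, 8)]))   # -> {(3, -5)}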
Note that we insist that the distance vectors point forward in time (which translates to the requirement that the leading nonzero component of the vector be positive). There are a number of important, very common special cases. We say that the dependence is uniformly generated if there is a linear function $h: Z^k \to Z^d$ and two vectors $C_f$ and $C_g$ such that

$f(t) = h(t) + C_f, \quad g(t) = h(t) + C_g,$

where $t \in Z^k$. In this case it is easy to see that $V_{f,g} = \{v \in Z^k \mid v > 0,\ h(v) = C_f - C_g\}$ and that the right-hand side is constant in time (independent of $t$). Clearly if a nonnegative member of $h^{-1}(C_f - C_g)$ does not exist the dependence does not exist. If $h^{-1}(C_f - C_g)$ is a single vector $(v_1, v_2, \ldots, v_k)$ then we say that the dependence is uniquely generated and denote it by

$\delta_x(v_1, v_2, \ldots, v_k)\colon S_1 \to S_2.$

Another method of describing the set of distance vectors is to represent it as the sum of a positive vector plus the kernel of $h$. To do this let $v = (v_1, v_2, \ldots, v_k)$ be a positive vector in $h^{-1}(C_f - C_g)$; then clearly
$V_{f,g} = \{v + w \mid w \in \mathrm{Ker}(h),\ v + w > 0\},$

where $\mathrm{Ker}(h)$ is the kernel of the mapping, i.e., the set of all $w \in Z^k$ such that $h(w) = 0$. To illustrate these ideas consider the loop

For i = 1 to n
  For j = 1 to n
S1:     x[i,j] = 13.5
S2:     y[i] = x[i-3,j+5]
S3:     z[j] = 1.4
S4:     w[j] = f(y[3], z[j-7]) + 19.0
  endfor
endfor
This program has six dependences and we will look at three of them. On the variable x, there is a flow dependence from each (i, j) iteration to iteration (i + 3, j - 5), when $i \le n - 3$ and $j > 5$. In this case $h(i, j) = (i, j)$, $C_f = (0, 0)$, and $C_g = (-3, 5)$. Because $\mathrm{Ker}(h)$ is trivial we can write this as

$\delta_x(3, -5)\colon S_1 \to S_2.$

For the variable y we note that y[i] is modified on each j iteration. Thus for each i we have a family of output dependences, one from each value of j to the next. In this case $h(i, j) = i$. This is a self-dependence, so we have $C_f = C_g$ and the set of distance vectors is just equal to $\mathrm{Ker}(h)$, which is generated by the vector (0, 1). This dependence vector is written as

$\delta^o_y(0, 1)^+\colon S_2 \to S_2,$

where the superscript "+" is used to denote that a full module of dependences is generated by this vector; i.e., we have $\delta^o_y(0, k)$ for all $k \ge 1$, and $(0, k) = k(0, 1)$. Such dependences are called cyclic self-dependences and are very important for cache management. The third dependence involves the variable z. Here we have $h(i, j) = j$, $C_f = 0$, and $C_g = -7$. The kernel of $h$ is generated by the vector (1, 0) and $h(0, 7) = C_f - C_g$. Hence the set of dependence vectors can be described as

$\delta_z((0, 7) + (1, 0)^+)\colon S_3 \to S_4,$

where the vector summation denotes the one-parameter family of vectors

$(0, 7) + p \cdot (1, 0) = (p, 7) \quad \text{for all } p \in Z.$
It is important to note that for $p > 0$ this is a flow dependence, but for $p < 0$ the direction in time is reversed and in fact we have described a set of antidependences

$\bar{\delta}_z(-p, -7)\colon S_4 \to S_3 \quad \text{for all } p < 0.$
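The other two families in this example can be checked the same way, reusing the brute-force distance_vectors sketch from above (again our illustration, with small hypothetical bounds):

# y[i] is written by S2 at every (i, j): f = g, and the distances
# fill out Ker(h), the module generated by (0, 1).
print(distance_vectors(lambda t: (t[0],), lambda t: (t[0],),
                       [(1, 4), (1, 4)]))   # -> {(0, 1), (0, 2), (0, 3)}

# z[j] written by S3, z[j-7] read by S4: the family (p, 7), p >= 0.
# The reversed pairs (-p, -7) are the antidependences noted above.
print(distance_vectors(lambda t: (t[1],), lambda t: (t[1] - 7,),
                       [(1, 9), (1, 9)]))   # -> {(p, 7) : 0 <= p <= 8}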
Of course not all data dependences are uniformly generated. For example,
For i = 1, n
S1:   x[2*i+3] = 29.9
S2:   y[i] = x[4*i+7] - 39.0
endfor

has an antidependence from $S_2$ to $S_1$ which, at iteration $i$, has a distance vector

$\bar{\delta}_x(i + 2)\colon S_2 \to S_1, \quad 1 \le i \le (n - 3)/2.$

This dependence is not uniformly generated because the vector is of the wrong form (it depends upon time) and it is carried only by the even iterations. In this situation we use the classical "direction" vector notation. In the case above, the distance is always positive, so we denote it as $\bar{\delta}_x(+)$. If, on the other hand, the lower bound of the for loop were -3, the distances would have ranged from -1 to $(n - 3)/2$ and we would use the notation $\bar{\delta}_x(-/+)$. If the lower bound of the loop were -2, the vector would be $\bar{\delta}_x(0/+)$, etc.
3. THE REFERENCE WINDOW FOR A DEPENDENCE VECTOR
For our purposes we are interested in the following question: Let two references to a variable be linked by a data dependence. Under what conditions can we be assured that if the first reference to the variable brings the current value into cache, then the second reference will result in a cache hit?
In general this is a very hard problem whose answer depends on more than a single data dependence. One must know not only more global program information but also a great deal about the way the cache replacement policy works. As described in Section 1, we take the approach that the compiler can restructure the program and suggest a replacement schedule that will be reasonably good. The consequence of letting the compiler manage the cache is that we may focus attention on one variable at a time and estimate the effect of program restructuring on the demands that are made on cache resources by that variable.
If the compiler has complete control of the cache and it is to be managed much like a massive set of registers, the basic cache optimization problem can be stated as follows: when we read a scalar or an element of a structured variable, should we, or should we not, keep the element in cache? If we have complete knowledge of the data dependence structure of the program, a reasonable solution is to see if that reference was the source (tail) of a data dependence. If it was, the element will be referenced again, and if there is room we should keep it. To make this idea more precise, we need the following.

DEFINITION. The reference window, $W(\delta_x)_t$, for a dependence $\delta_x\colon S_1 \to S_2$ on a variable $x$ at time $t$, is defined to be the set of all elements of $x$ that are referenced by $S_1$ at or before $t$ that are also referenced (according to the dependence) after $t$ by $S_2$.
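The definition can be implemented literally, which is useful for checking the window formulas that follow. A deliberately naive Python sketch (ours), in which a statement is modeled as a function from an iteration time to the element it references:

def window(src, sink, times, t):
    """W(delta)_t: elements referenced by the source statement at or
    before time t that the sink statement references again after t."""
    before = {src(u) for u in times if u <= t}
    after = {sink(u) for u in times if u > t}
    return before & after

# The output-dependence example below: S1 writes x[i], S2 writes x[i-3].
n = 10
print(sorted(window(lambda i: i, lambda i: i - 3, range(1, n + 1), 6)))
# -> [4, 5, 6], i.e., {x[i-2], x[i-1], x[i]} at t = i = 6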
For example, the loop

for i = 1, n
S1:   x[i] = 1.4
S2:   x[i-3] = y[i]
end

has an output dependence: the elements x[i-3] overwritten in statement S2 have been previously generated by assignments in S1. Consequently, $\delta^o_x(3)\colon S_1 \to S_2$ has an associated window

$W(\delta^o_x(3))_{t=i} = \{x[i-2], x[i-1], x[i]\}.$

One of the most important characteristics of the reference window associated with each dependence is its size. For a uniformly generated dependence the window can be approximated in a simple way and this approximation can be used to estimate the window's size. The basic idea of this approximation is to enclose the window in a "frame" of fixed size which moves through the bounded iteration space. Let $D^k$ be the bounded iteration space and let $x$ be a structured variable. If we have a dependence

$\delta_x\colon x[h(t) + C_1] \to x[h(t) + C_2]$
we can define the following special subspaces of $Q^k$, where $Q$ is the field of rationals. Let $e_i$ for $i = 1, \ldots, k$ define the natural basis of $Z^k$ that corresponds to the induction variables $i_1, i_2, \ldots, i_k$. Define the subsets of $Z^k$

$V_j = \mathrm{span}(e_j, e_{j+1}, \ldots, e_k),$

where $\mathrm{span}(u_1, \ldots, u_m)$ is defined as

$\{u \in Q^k \mid u = \sum_{r=1}^{m} \alpha_r u_r, \text{ with } \alpha_r \in Q\}.$
We have the decreasing chain

$Q^k = V_1 \supset V_2 \supset \cdots \supset V_k.$
We can now characterize the window for $\delta_x$ as follows.

THEOREM 3.1. Let $[u_j, p_j]$ be the range of induction variable $i_j$ for $1 \le j \le k$ and let $h: Z^k \to Z^d$ be a linear transformation. Let $r_1$ be the largest integer such that $\mathrm{Ker}(h) \subset V_{r_1}$, and let $v \in h^{-1}(C_1 - C_2)$. Let $r_2$ be the index of the leading nonzero term in $v$ and set $r = \min(r_1, r_2)$. Define the set $B$ to be $\prod_{j=1}^{k} [0, p_j - u_j]$ and define $X_v \subset Z^k$ to be the convex set

$X_v = -s \cdot v + (V_{r+1} \cap B),$

where $s$ is a rational in the range $[0, 1]$. We then have

$W(\delta_x)_{t=(i_1, i_2, \ldots, i_k)} \subset \{x[w] \mid w \in h(i_1, \ldots, i_r, u_{r+1}, u_{r+2}, \ldots, u_k) + C_1 + h(X_v) \subset Z^d\}.$
Proof. Let $x[s_1] = x[s_2]$ be two references to the same element of $x$, where $s_1 = h(t_1) + C_1$, $s_2 = h(t_2) + C_2$, and $t_1 \le t < t_2$. Because $s_1 = s_2$, we have $h(t_2 - t_1) = C_1 - C_2$. Letting $\bar{v} = t_2 - t_1$, we have $\bar{v} - v \in \mathrm{Ker}(h)$. Because $\mathrm{Ker}(h) \subset V_r$, we can let $g = \bar{v} - v$, which must take the form

$g = (0, \ldots, 0, g_r, g_{r+1}, \ldots, g_k).$

By the definition of $t_1$ and $t_2$ we have

$t_1 \le t < t_1 + \bar{v}.$

Assume $t_1$ is of the form

$t_1 = (i_1', i_2', \ldots, i_k').$

Because $v \in V_r$ and $\bar{v} = v + g$, the first $r - 1$ terms of $\bar{v}$ are zero. Also, since $t_1 \le t < t_1 + \bar{v}$, we have $i_j' = i_j$ for all $j < r$. Consequently, if we let

$t_0 = (i_1, i_2, \ldots, i_r, u_{r+1}, u_{r+2}, \ldots, u_k)$

then we can write

$t_1 = t_0 - s \cdot \bar{v} + w,$

where $s \le 1$ is chosen to satisfy $i_r' = i_r - s \cdot \bar{v}_r$ and $w$ is of the form

$(0, 0, \ldots, 0, w_{r+1}, w_{r+2}, \ldots, w_k)$

with $w_j = i_j' + s \cdot \bar{v}_j - u_j$. Because $t_2 = t_1 + \bar{v}$ and $t_1$ are both in the iteration space, we have by convexity that $t_1 + s \cdot \bar{v}$ must be in the rational closure of the iteration space for all $s \in [0, 1]$. Hence for all $j \ge r + 1$ we have $u_j \le i_j' + s \cdot \bar{v}_j \le p_j$ and hence $w_j \in [0, p_j - u_j]$. Consequently, $w \in V_{r+1} \cap B$. Applying $h$ to $t_1$ and adding $C_1$ we have

$s_1 = h(t_1) + C_1 = h(t_0) + C_1 + h(-s \cdot v + w) - s \cdot h(g).$

The last term vanishes and the theorem is proved.
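The containment can be verified numerically for the dependence $\delta_x(3, -5)$ of Section 2, where S1 writes x[i,j] and S2 reads x[i-3,j+5]. Here $h$ is the identity, $C_1 = (0, 0)$, $v = (3, -5)$, $r = 1$, and $t_0 = (i_1, 1)$; the loop bounds in this Python sketch are our own choices:

from itertools import product
from math import ceil, floor

n, m = 10, 8
iters = sorted(product(range(1, n + 1), range(1, m + 1)))

def window_at(t):
    """The window of delta_x(3,-5) at time t, straight from the definition."""
    written = {p for p in iters if p <= t}
    read_later = {(i - 3, j + 5) for (i, j) in iters if (i, j) > t}
    return written & read_later

def frame_at(t):
    """Integer points of h(t0) + C1 + h(X_v) from Theorem 3.1."""
    pts = set()
    for x in range(-3, 1):          # first component of -s*v for s in [0, 1]
        s = -x / 3
        lo, hi = ceil(5 * s), floor(5 * s) + m - 1
        pts.update((t[0] + x, 1 + y) for y in range(lo, hi + 1))
    return pts

t = (5, 4)
print(window_at(t) <= frame_at(t))  # True: the frame encloses the window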
The advantage of the formulation in Theorem 3.1 is that the window has been enclosed in a moving frame. The term

$h(i_1, i_2, \ldots, i_r, u_{r+1}, u_{r+2}, \ldots, u_k)$

describes how the frame moves in time. The term

$C_1 + h(X_v) \quad \text{with} \quad X_v = \{-s \cdot v + V_{r+1} \cap B,\ s \in [0, 1]\}$

is the time-independent "frame" for the window. Note that $h(X_v)$ is independent of the choice of $v$ because any other choice will differ from $v$ by a member of $\mathrm{Ker}(h)$. In the following section it is shown that this formulation provides the necessary machinery to compute the size of reference windows.

4. HIT RATIOS AND SELECTING REFERENCE WINDOWS
Assume, for now, that we have a machine where the compiler can select which memory references to keep in cache. (For local memory this is always the case.) Clearly we would like to select those references belonging to reference windows associated with dependences that somehow generate many cache hits. Our problem is twofold:

1. How do we compute the total size of a reference window and what is the cache hit ratio if we keep the entire window in cache?
2. How do we decide which windows to keep and which to discard when the total is too big to fit in cache?
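Question 2 is the bin packing problem mentioned in the Introduction. A simple greedy heuristic over (size, benefit) pairs conveys the flavor of the decision; the sketch and its hit counts are ours, not the paper's:

def select_windows(windows, capacity):
    """windows: (name, size, expected_hits) triples. Keep the windows
    with the best hits per cache element until capacity is exhausted."""
    chosen, used = [], 0
    for name, size, hits in sorted(windows, key=lambda w: w[2] / w[1],
                                   reverse=True):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

# Hypothetical hit counts for the two windows kept in the example below:
print(select_windows([("W3", 3, 300), ("W4", 5, 450)], capacity=6))
# -> ['W3']; both fit once capacity >= 8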
Another question that we would like to answer is: if we have no control over the cache, can we estimate the hit ratios for hardware cache policies other than the one above? While the mechanisms described in this paper can be used to solve this problem, we do not consider it here.

Our basic cache scheduling algorithm will be as follows. First, at compile time, we select a set of dependence windows that we consider important. Our policy for cache replacement will be the following: we will read an element into the cache as soon as it enters the window and remove it from cache as soon as it leaves the window. A simple way to estimate the total number of elements in cache for this policy is simply to sum the window sizes for each of the selected dependences. Unfortunately, as we have seen, the relationship between reference windows in a system of dependences for a given variable can be rather complex. In particular, reference windows overlap and an element might be counted several times by this scheme. For example,

for i = 1 to m
S1:   x[i] = 1.5
      for j = 5 to 9
S2:       v[i,j] = x[j]
      end
S3:   z[i] = x[i-3]
endfor
In this case there are six dependences, listed as

$W_1(\delta_x(0/+)\colon S_1 \to S_2)_{t=i} = \{x[5], x[6], \ldots, x[\min(i - 1, 9)]\}$
$W_2(\delta_x(+/-)\colon S_2 \to S_3)_{t=(i,j)} = \{x[\max(5, i - 3)], \ldots, x[9]\}$
$W_3(\delta_x(3)\colon S_1 \to S_3)_{t=i} = \{x[i - 3], x[i - 2], x[i - 1]\}$
$W_4(\delta_x(1, 0)^+\colon S_2 \to S_2)_{t=(i,j)} = \{x[5], \ldots, x[9]\}$
$W_5(\delta_x(0/+)\colon S_3 \to S_2)_{t=i} = \{x[5], \ldots, x[\min(9, i - 3)]\}$
$W_6(\delta_x(+/-)\colon S_2 \to S_1)_{t=(i,j)} = \{x[\max(5, i + 1)], \ldots, x[9]\}.$
Note that, in fact, there are only two significant windows here. One is the window of size 3, $W_3$, and the other is the window of size 5, $W_4$. Each of these is based on uniquely generated dependences. The other windows are subsets of $W_4$. The correct cache policy for this program is to select $W_3$ and $W_4$ to be kept in cache. Also note that $W_3$ sweeps over the entire x array while $W_4$ is constant in time (after i = 1). At certain times they are disjoint, but at other times they overlap, and it is only during the period of overlap that the other dependences exist at all.
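These subset claims are easy to confirm mechanically from the window definition. In the Python sketch below (ours), statement instances carry lexicographic time stamps so that S1 precedes S2, which precedes S3, within an outer iteration; element boundaries can shift by one relative to the formulas above depending on where inside the iteration t is taken:

m = 12
S1 = [((i, 0), i) for i in range(1, m + 1)]         # S1 writes x[i]
S2 = [((i, j), j) for i in range(1, m + 1)
                  for j in range(5, 10)]            # S2 reads x[j]
S3 = [((i, 10), i - 3) for i in range(1, m + 1)]    # S3 reads x[i-3]

def win(src, sink, t):
    """The reference window at time t, straight from the definition."""
    return ({e for u, e in src if u <= t}
            & {e for u, e in sink if u > t})

t = (8, 10)                       # just after iteration i = 8 completes
W3, W4 = win(S1, S3, t), win(S2, S2, t)
print(sorted(W3), sorted(W4))     # [6, 7, 8] and [5, 6, 7, 8, 9]
for name, W in [("W1", win(S1, S2, t)), ("W2", win(S2, S3, t)),
                ("W5", win(S3, S2, t)), ("W6", win(S2, S1, t))]:
    print(name, W <= W4)          # all True: each is a subset of W4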
Obviously, the fact that there may be a subset of dependences that "span" the other dependences can cause difficulties in the estimation of the data locality of a code. (A spanning dependence of a set of dependences is one whose window contains the windows of the other dependences in the set.) Because the reference windows of nonuniformly generated dependences tend either to be subsets of those of uniformly generated dependences or to have low hit ratios, we drop them from consideration for inclusion in the cache. (We attempt to justify this restriction better in the next section.) The identification of spanning uniform dependences is still an open question. If one works in the unbounded iteration space, $Z^k$, it is possible to develop theorems which relate spanning dependences to topological properties of the data dependence graph (see the appendix of [11] for these theorems). Unfortunately, these results can sometimes be misleading when used to approximate the behavior of the windows in the bounded iteration space. They can, however, indicate what occurs in the bounded iteration space in areas which are "sufficiently far from the boundary." At present, we choose spanning uniformly generated dependences via heuristics.

For now our approach is based on the following strategy. Let UG(x) be the subgraph of the atomic data dependence graph consisting of those reference nodes involving the variable x and those edges corresponding to uniformly generated dependences. For each connected component $C \subset UG(x)$ we estimate the size and the "hit ratio" of the reference window for the spanning dependence. If the total size is greater than the cache capacity, we attempt to restructure the program to reduce the size. If the program cannot be restructured but the cache can be managed by tagging the references that should be retained in cache, we attempt to solve the corresponding bin packing problem to select the reference windows that will give the best performance.

Our next task is to describe the machinery for estimating window sizes and hit ratios of uniformly generated dependences. In some cases the task is easy. For example, the loop

for i = 1 to n
  for j = 1 to m
S1:   ... X[i,j] ...
S2:   ... X[i-3,j+5] ...
  endfor
endfor

defines a dependence with distance vector $\delta_x(3, -5)$. At iteration $(i_0, j_0)$, $X[i_0, j_0]$ is referenced. This element is not referenced again until iteration $(i_0 + 3, j_0 - 5)$. Consequently, statement S1 will have to access elements
X[i, j],   X[i, j+1],   ..., X[i, m],
X[i+1, 1], ..., X[i+1, j], X[i+1, j+1], ..., X[i+1, m],
X[i+2, 1], ..., X[i+2, j], X[i+2, j+1], ..., X[i+2, m],
X[i+3, 1], ..., X[i+3, j-5]
This set of elements defines the window $W(\delta_x)$ for this dependence at iteration $(i + 3, j - 5)$. Clearly any smaller set would not include the reference $X[i, j]$ and would cause a cache miss for statement S2. In the more general case of a uniformly generated dependence $\delta$ from a term of the form $x[h(t) + C]$, where $h: Z^k \to Z^d$, we need to consider the formulation from Theorem 3.1. Let $t = (i_1, \ldots, i_k)$ and let $v$ and $r$ be chosen according to the conditions of the theorem. We have

$W(\delta_x)_t \subset h(i_1, \ldots, i_r, u_{r+1}, \ldots, u_k) + C + h(X_v),$

where

$X_v = (-s \cdot v + V_{r+1} \cap B).$

The problem of estimating the size of the window then reduces to computing the size of $h(X_v)$. If the reference is a self-cycle then $v$ is the zero vector and the problem can be reduced in dimension and shown to be equivalent to computing

$|g(\hat{B})|,$

where $\hat{B} = \prod_{j=r+1}^{k} [0, p_j - u_j]$ and $g: Z^{k-r} \to Z^d$ is the restriction of $h$ to $Z^{k-r}$ derived by setting the first $r$ subscripts to 0. Hence, the problem is equivalent to computing the size of the image of a linear transformation applied to a box tied to the origin and contained in the positive cone of $Z^{k-r}$. If the vector $v$ is not the zero vector, then a change of basis can transform the polytope into a box, thereby reducing the problem to the type above.

The calculation of the "hit ratio" corresponding to a particular dependence also reduces to this type of problem. Suppose the total number of references to a variable $x$ is $R$. Define the image in the computation of $x$ to be the set $\mathrm{Im}(x) = \{w \in Z^d \mid x[w] \text{ is referenced in the computation}\}$. If the reference window remains in cache then each of the $|\mathrm{Im}(x)|$ referenced elements of the array is loaded only once and all remaining $R - |\mathrm{Im}(x)|$ references are hits. Consequently we have

$hr(x) = \frac{R - |\mathrm{Im}(x)|}{R}.$
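As a worked instance (our arithmetic), in the matrix multiplication loop considered next the variable b is referenced $R = n_1 n_2 n_3$ times but touches only $|\mathrm{Im}(b)| = n_1 n_3$ distinct elements, so keeping its window resident gives $hr(b) = 1 - 1/n_2$:

n1, n2, n3 = 64, 64, 64
R = n1 * n2 * n3        # total references to b
im = n1 * n3            # |Im(b)|: distinct elements of b touched
print((R - im) / R)     # 0.984375 = 1 - 1/n2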
Therefore the computation of hit ratios reduces to the problem of computing the size of $h(D^k)$, where $D^k$ is the $k$-dimensional box in the positive cone of $Z^k$ bounded by the subscript ranges of the loops. Consequently, the machinery needed to compute the reference window is identical to the machinery needed to compute the hit ratio. The mechanism used to estimate the size of the window is presented in Theorem 4.1. This presentation, however, requires the definition of a weighting function required by the theorem.

DEFINITION. Suppose $h: Z^k \to Z^d$ is a linear transformation of rank $m$ . . .

THEOREM 4.1. Let $B = \prod_{j=1}^{k} [0, d_j]$ be a box in $Z^k$ and let $h: Z^k \to Z^d$ be a linear transformation of rank $m \le d$. Let the set $S \subset Z^m$ be the set of the $C(k, m)$ ways (each represented as an $m$-vector) to choose $m$ of the integers 1 to $k$, and let $F = \{F_i = \mathrm{span}(e_{i_1}, \ldots, e_{i_m}) \mid i = (i_1, \ldots, i_m) \in S\}$ be the set of submodular faces of $B$ which have dimension $m$. Then, if the function $\mu$ and the set $Y$ are defined as above, it follows that

$|h(B)| \le \sum_{i \in S} \mu(F_i)\, d_{i_1} \cdots d_{i_m}.$

Proof. For the proof of this theorem and a method to produce the required basis $Y$ see [7].

As an example consider the matrix multiplication routine

for i = 0 to n1-1
  for j = 0 to n2-1
    for k = 0 to n3-1
      a[i,j] += b[i,k]*c[k,j];
    end;
  end;
end;
For each of the three variables the dependences are self-cycles

$\delta_a(0, 0, 1)^+, \quad \delta_b(0, 1, 0)^+, \quad \delta_c(1, 0, 0)^+.$

In terms of Theorem 3.1 we have the function $h$, kernel position $r$, the vector $v$, and set $X$ given by

$h_a(i, j, k) = (i, j), \quad r_a = 3$
$h_b(i, j, k) = (i, k), \quad r_b = 2$
$h_c(i, j, k) = (k, j), \quad r_c = 1.$

In each case the vector $v$ may be taken to be zero because these are self-cycles. The corresponding $X$ sets are

$X_a = \{(0, 0, 0)\}$
$X_b = \{(0, 0, k) \mid 0 \le k < n_3\}$
$X_c = \{(0, j, k) \mid 0 \le j < n_2,\ 0 \le k < n_3\}.$