Storage Size Reduction by In-place Mapping of Arrays

Remko Tronçon¹, Maurice Bruynooghe¹, Gerda Janssens¹, and Francky Catthoor²

¹ Katholieke Universiteit Leuven, Department of Computer Science,
Celestijnenlaan 200A, B-3001 Heverlee, Belgium
{remko,maurice,gerda}@cs.kuleuven.ac.be
² IMEC/DESICS, Kapeldreef 75, B-3001 Heverlee, Belgium
[email protected]
Abstract. Programs for embedded multimedia applications typically manipulate several large multi-dimensional arrays. The energy consumption per access increases with the size of these arrays, and accesses to them account for a substantial part of the power consumption. In this paper, an analysis is developed that computes a bounding box for the elements of an array that are simultaneously in use. The size of the original array can be reduced to the size of the bounding box, and accesses to it can be redirected using modulo operations on the original indices. This substantially reduces the size of the memories and the power consumed in accessing them.
1 Introduction
The design of embedded systems starts with code in a high-level programming language (typically C) implementing the algorithms required for processing the input signals received by these systems (e.g., video images). Typical for the algorithms involved is that they manipulate large multi-dimensional arrays. A direct translation of this code into an embedded system results in a design with large memory banks, and a large share of the total cost and power consumption of such designs is due to data transfer and storage. The DESICS group at IMEC has developed a Data Transfer and Storage Exploration (DTSE) methodology [4] that aims to reduce the memory requirements and data transfer costs of such designs. In a first step, the program is brought into single assignment form [9], where each memory cell is written at most once. Typically, this further increases the memory usage of the program, as extra dimensions are introduced for arrays with elements that are written several times. However, this form simplifies the dependencies between read and write statements, and hence facilitates transformations of loop nests that bring consumers of data closer to the producers and shorten the time span between the creation (write event) of a data element and its last consumption (read event). These transformations reduce the number of data elements that are in use (live) at any time in the program. Moreover, all introduced overhead is afterwards removed again, so even in the worst case no penalty is introduced. In practical cases, the more explicit search space allows better solutions than what can be achieved using the multiple assignment code.
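The single assignment step can be pictured with a small, hypothetical C fragment (our own example, not taken from the paper; array names and values are placeholders):

    /* Hypothetical illustration of single assignment form (example ours). */
    void multiple_assignment(void) {
        int B[10];
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 10; j++)
                B[j] = i + j;        /* each B[j] is written 5 times */
        (void)B;
    }

    void single_assignment(void) {
        /* An extra dimension makes every cell written at most once: the
           array grows, but read/write dependencies become explicit. */
        int B2[5][10];
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 10; j++)
                B2[i][j] = i + j;    /* each B2[i][j] is written once */
        (void)B2;
    }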
    for (i = 0; i <= 4; i++)
      for (j = 0; j <= 9; j++) {
        if (j >= 2)
          /* P_1 */ A[i][j] = ... ;
        /* P_2 */
        if (i >= 2 && j >= 2 && j <= 8)
          ... = A[i][j] + A[i-2][j+1];
      }

Fig. 1. Live elements of an array. The picture on the right (a 5 × 10 grid with rows indexed by i = 0..4 and columns by j = 0..9, annotated with 'mod 8' along the j axis; omitted here) indicates for program point P_2 which elements of the array contain a value that will be used in the future, given the current values of i and j.
To reduce the size of the arrays, De Greef [7, 5, 6] considers all possible linearizations of an array (for an n-dimensional array there are n! orderings of the dimensions; moreover, the elements in a given dimension can be placed in increasing or in decreasing order, hence there are 2^n n! possible linearizations) and computes the largest distance occurring at any time in the linearized array between two live elements. This distance for the chosen linearization, plus one, is then the size required for the array.

In this paper we develop a different approach to approximate the optimal size of an n-dimensional array. In a first phase we compute in each program point a description of the data elements of the array that are live (they contain a value that will be read in the future) as a set of areas, where each area is described by a conjunction of equality (=) and inequality (≤) constraints. The purpose of the second phase is to compute values (w_1, ..., w_n) that can be used as operands in modulo operations that redirect all accesses to the array; i.e., an access A[exp_1]...[exp_n] is replaced by an access A[exp_1 mod w_1]...[exp_n mod w_n]. This mapping has to preserve the correctness of the program, i.e., two distinct elements live in the same program point should not be mapped by the modulo operations to the same element. The values (w_1, ..., w_n) (the window or bounding box) determine the size required for storing the array.

A small example illustrating our approach is depicted in Fig. 1. If we look at program point P_2, we can see that only a part of the array is in use; this part is determined by the current values of the surrounding iterators i and j. As can be seen in the illustration, if we apply 'mod 8' in the second dimension of all the accesses, the last 2 elements are mapped back to the first 2 (empty) locations. The same can be done for the first dimension, this time with 'mod 2'. So, we can replace each access A[exp_1][exp_2] by A[exp_1 mod 2][exp_2 mod 8], resulting in a 2 × 8 (= 16) element array instead of the original 50-element array. Note that if we linearize the array first, we can only apply 'mod 20', because the distance between the first and last live element is at least 19.
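The transformed program for Fig. 1 could then look as follows (a minimal sketch; the producer and consumer expressions, elided as '...' in Fig. 1, are replaced by hypothetical stand-ins):

    #include <stdio.h>

    int A[2][8];   /* was int A[5][10]: only the 2 x 8 window is stored */

    int main(void) {
        int sum = 0;
        for (int i = 0; i <= 4; i++)
            for (int j = 0; j <= 9; j++) {
                if (j >= 2)
                    /* P_1: every access is redirected by the window */
                    A[i % 2][j % 8] = i + j;          /* stand-in producer */
                /* P_2 */
                if (i >= 2 && j >= 2 && j <= 8)       /* stand-in consumer */
                    sum += A[i % 2][j % 8] + A[(i - 2) % 2][(j + 1) % 8];
            }
        printf("%d\n", sum);
        return 0;
    }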
Assumptions. We apply our transformation to programs written in a subset of the C language. This subset only allows assignments, if-then-else statements, and for-loops. Additionally, we assume that the code is in accordance with the following requirements:

– The program is in single assignment form, i.e., each array element is written at most once (it can be read several times).
– The index expressions used in the array accesses, the conditions of the if-then-else statements, and the lower and upper bounds of the iterators in the for-loops are linear in the iterators of the surrounding for-loops, i.e., of the form c_0 + c_1 i_1 + ... + c_k i_k with c_0, ..., c_k constants and i_1, ..., i_k the iterators of the surrounding for-loops.
– Array elements that are assigned a value will also be read in the future.

While these assumptions are quite strong, programs produced by applying the DTSE methodology [4] meet them. In Section 6, we will reconsider them.

Organisation of the paper. Section 3 describes how to compute the live elements in each program point and discusses the complexity. Section 4 describes how to compute the size of the window. Section 5 reports on the results obtained with a prototype in the CLP(Q) extension of SICStus Prolog. Finally, Section 6 discusses possible extensions and related work.
2 Preprocessing
In a very first step of our approach, we perform some transformations on the source program in order to simplify the code:

– As only one array at a time is analysed, only the for-loop statements and the statements that access the array of interest are kept.
– Some transformations are applied to simplify the structure of the if-statements. This includes hoisting accesses common to the then and else branches out of the if-statement, and replacing nested if-statements by simple ones (a small sketch follows below).

These steps decrease the number of program points and hence the cost of the further analysis. Note that the final program can be transformed into a program consisting only of for-loops, and assignments annotated with the conditions under which the assignment is executed, e.g.,

    for (...)
      ⟨Cond_1⟩ : A[exp_i][exp_j] = ;
    for (...)
      ⟨Cond_2⟩ : A[exp_k][exp_l] = A[exp_m][exp_n];
    ...
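To illustrate the second transformation above, a hypothetical before/after sketch (identifiers and structure are ours, not from the paper):

    /* Hypothetical sketch of the if-simplification step (example ours). */

    /* Before: the write to A is buried in nested, branching control flow. */
    void before(int A[5][10], int i, int j, int c1, int c2) {
        if (c1) {
            if (c2)
                A[i][j] = 1;   /* access common to both branches */
            else
                A[i][j] = 2;
        }
    }

    /* After: the nested if is flattened and the common access to A is
       hoisted, leaving one assignment under a single condition; in the
       paper's notation: <c1> : A[i][j] = ... */
    void after(int A[5][10], int i, int j, int c1, int c2) {
        if (c1)
            A[i][j] = c2 ? 1 : 2;
    }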
The assignments may contain an empty left-hand side or right-hand side.

    P_0^b → W_0^b = ∅,  R_0^b = R'_0^b
    for (i_1 = l_1; i_1 ≤ u_1; i_1++)
        P_1^b → W_1^b(i_c1) = W_0^b ∪ W'_1^b(i_c1),
                R_1^b(i_c1) = R_0^a ∪ R'_1^b(i_c1)
        for (i_2 = l_2; i_2 ≤ u_2; i_2++)
            P_2^b → W_2^b(i_c1, i_c2) = W_1^b(i_c1) ∪ W'_2^b(i_c1, i_c2),
                    R_2^b(i_c1, i_c2) = R_1^a(i_c1) ∪ R'_2^b(i_c1, i_c2)
            ...
            for (i_m = l_m; i_m ≤ u_m; i_m++)
                P_m^b → W_m^b(i_c1, ..., i_cm) = W_{m-1}^b(i_c1, ..., i_{c,m-1}) ∪ W'_m^b(i_c1, ..., i_cm),
                        R_m^b(i_c1, ..., i_cm) = R_{m-1}^a(i_c1, ..., i_{c,m-1}) ∪ R'_m^b(i_c1, ..., i_cm)
                ⟨Cond⟩ A: A[exp_1^w][exp_2^w]...[exp_n^w] = A[exp_1^r][exp_2^r]...[exp_n^r]
                P_m^a → W_m^a(i_c1, ..., i_cm) = W_{m-1}^b(i_c1, ..., i_{c,m-1}) ∪ W'_m^a(i_c1, ..., i_cm),
                        R_m^a(i_c1, ..., i_cm) = R_{m-1}^a(i_c1, ..., i_{c,m-1}) ∪ R'_m^a(i_c1, ..., i_cm)
            end for
            ...
            P_2^a → W_2^a(i_c1, i_c2) = W_1^b(i_c1) ∪ W'_2^a(i_c1, i_c2),
                    R_2^a(i_c1, i_c2) = R_1^a(i_c1) ∪ R'_2^a(i_c1, i_c2)
        end for
        P_1^a → W_1^a(i_c1) = W_0^b ∪ W'_1^a(i_c1),
                R_1^a(i_c1) = R_0^a ∪ R'_1^a(i_c1)
    end for
    P_0^a → W_0^a = W'_0^a,  R_0^a = ∅

Fig. 2. Schematic representation of a program with one read and one write operation A, under condition ⟨Cond⟩.
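For concreteness, the schema of Fig. 2 could be instantiated for m = 2 as follows (a hypothetical sketch using the loop bounds of Fig. 1; for simplicity the assignment A here only writes, and its value expression is ours). Each comment marks a program point of Fig. 2:

    #include <stdio.h>

    int A[5][10];

    int main(void) {
        /* P_0^b : W_0^b = empty */
        for (int i1 = 0; i1 <= 4; i1++) {        /* l_1 = 0, u_1 = 4 */
            /* P_1^b */
            for (int i2 = 0; i2 <= 9; i2++) {    /* l_2 = 0, u_2 = 9 */
                /* P_2^b (= P_m^b, since m = 2) */
                if (i2 >= 2)                     /* <Cond> */
                    A[i1][i2] = i1 + i2;         /* assignment A */
                /* P_2^a (= P_m^a) */
            }
            /* P_1^a */
        }
        /* P_0^a : R_0^a = empty */
        printf("%d\n", A[4][9]);
        return 0;
    }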
3 Liveness Analysis
In this section we describe how to compute, for each program point, the sets containing the elements that are live for a given array. As explained in the introduction, elements are live or in use at a program point if they have been written when control reaches the program point and will be read later on. Clearly, for a program point inside a nest of for-loops, which elements are live depends on the current values of the iterators of the surrounding for-loops.
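In terms of the sets of Fig. 2, where W_P collects the elements already written and R_P the elements still to be read at program point P, this presumably amounts to an intersection at each program point (a plausible reading of the figure's notation, not a formula stated by the text at this point):

    Live_P(i_c1, ..., i_ck) = W_P(i_c1, ..., i_ck) ∩ R_P(i_c1, ..., i_ck)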
3.1 Written Elements
Past Iteration Spaces. To compute the written elements due to an assignment labelled A, we first determine for each program point P_i the past iteration space
of the assignment A. The past iteration space defines the set of iterations for which the statement A has been executed, given that control is at P_i and given the values of the surrounding iterators. In the example of Figure 1, the past iteration space at P_1 will be a 2-dimensional set of points (depending on i_c and j_c) of the form

    PI_1(i_c, j_c) = {(x, y) | ...}

where (x, y) ∈ PI_1(i_c, j_c) means: if the current value of iterator i is i_c and the current value of j is j_c, then A has been executed in the iteration where i was x and j was y. We represent these past iteration spaces by (parameterized) integral polyhedra.

Definition 1 (Integral Polyhedron). An integral polyhedron is the set of solutions to a finite system of linear inequalities on integer-valued variables. Equivalently, it is the intersection of a finite number of linear half-spaces in Z^n.

Note that an n-dimensional parameterized polyhedron with 2 parameters, PI(i_c, j_c), can always be represented by a normal (n + 2)-dimensional polyhedron, by taking the parameters i_c and j_c as extra dimensions. In the example, PI_1 is then defined as PI_1 = {(x, y, i_c, j_c) | ...}.

We now define the past iteration spaces for the basic case of one assignment (and its condition) surrounded by m for-loops. With m surrounding for-loops, we can distinguish 2m + 2 different past iteration spaces, associated with the program points P_0^b, ..., P_m^b, P_m^a, ..., P_0^a as shown in Figure 2: P_0^b is the first program point, P_k^b (k > 0) is the first program point of the k-th for-loop, P_m^a is the first program point after the assignment, and P_k^a (k < m) is the first program point after exiting the (k + 1)-th for-loop. (In the other program points, the past iteration space is identical to that of the preceding program point in the schema of Figure 2.) We use i_c1, ..., i_cm as variables denoting the current values of the iterators i_1, ..., i_m, respectively. Note that l_k and u_k are respectively the lower and upper bound of the for-loop with iterator i_k; recall that these bounds are linear combinations of the surrounding current iterator values i_c1, ..., i_{c,k-1}.

First we define the iteration spaces PI_k^b associated with the points P_k^b. With Cond the condition under which the assignment is executed, we define:

    PI_0^b ≡ ∅
    PI_1^b(i_c1) ≡ PI_0^b ∪ {(j_1, ..., j_m) | l_1 ≤ j_1 < i_c1 ∧ l_2 ≤ j_2 ≤ u_2 ∧ ... ∧ l_m ≤ j_m ≤ u_m ∧ Cond}
    PI_2^b(i_c1, i_c2) ≡ PI_1^b(i_c1) ∪ {(j_1, ..., j_m) | j_1 = i_c1 ∧ l_2 ≤ j_2 < i_c2 ∧ l_3 ≤ j_3 ≤ u_3 ∧ ... ∧ l_m ≤ j_m ≤ u_m ∧ Cond}
    ...
    PI_k^b(i_c1, ..., i_ck) ≡ PI_{k-1}^b(i_c1, ..., i_{c,k-1}) ∪ {(j_1, ..., j_m) | j_1 = i_c1 ∧ ... ∧ j_{k-1} = i_{c,k-1} ∧ l_k ≤ j_k < i_ck ∧ l_{k+1} ≤ j_{k+1} ≤ u_{k+1} ∧ ... ∧ l_m ≤ j_m ≤ u_m ∧ Cond}   (for all 2 ≤ k ≤ m)
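As a concrete illustration (our own instantiation, not worked out in the paper), consider the program of Fig. 1, where m = 2, l_1 = 0, u_1 = 4, l_2 = 0, u_2 = 9, and Cond is j_2 ≥ 2. The definitions above give

    PI_1^b(i_c1) = {(j_1, j_2) | 0 ≤ j_1 < i_c1 ∧ 2 ≤ j_2 ≤ 9}
    PI_2^b(i_c1, i_c2) = PI_1^b(i_c1) ∪ {(j_1, j_2) | j_1 = i_c1 ∧ 2 ≤ j_2 < i_c2}

i.e., all completed iterations of the outer loop plus the already executed part of the current inner loop.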
In particular, for k = m:

    PI_m^b(i_c1, ..., i_cm) ≡ PI_{m-1}^b(i_c1, ..., i_{c,m-1}) ∪ {(j_1, ..., j_m) | j_1 = i_c1 ∧ ... ∧ j_{m-1} = i_{c,m-1} ∧ l_m ≤ j_m < i_cm ∧ Cond}

This is equivalent to PI_k^b(i_c1, ..., i_ck) ≡ {(j_1, ..., j_k) | (j_1, ..., j_k)