IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 24, NO. 2, FEBRUARY 2005
Memory Layout Techniques for Variables Utilizing Efficient DRAM Access Modes in Embedded System Design

Yoonseo Choi, Taewhan Kim, and Hwansoo Han

Abstract—The delay of memory access is one of the major bottlenecks in embedded systems' performance. In software compilation, it is known that memory-access delay varies widely depending on how the variables in code are stored in and retrieved from the memories. In this paper, we propose effective storage-assignment techniques for variables to maximize the use of memory bandwidth. Specifically, we study the problem of DRAM memory layout for storing the nonarray variables in code to achieve maximum utilization of page and/or burst modes for the memory accesses. The contributions of our work are, for each of the page and burst modes: 1) we prove that the problem is NP-hard and 2) we propose an exact formulation of the problem and efficient memory-layout algorithms, called Solve-MLP for the page mode and Solve-MLB for the burst mode. From experiments with a set of benchmark programs, we confirm that our proposed techniques use on average 28.2% and 10.1% more page accesses, and 82.9% and 107% more burst accesses, than the order-of-first-use assignment and the technique of Panda et al. (Proc. Int. Conf. Computer-Aided Design, 1997; ACM Trans. Design Automation Electron. Syst., 1997), respectively.

Index Terms—DRAM access modes, embedded systems, storage assignment.
Manuscript received April 22, 2003; revised February 16, 2004. This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc). This paper was recommended by Associate Editor R. Camposano. Y. Choi and H. Han are with the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science & Technology, Daejeon, Korea (e-mail: [email protected]; [email protected]). T. Kim is with the School of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea (e-mail: [email protected]). Digital Object Identifier 10.1109/TCAD.2004.837721

I. INTRODUCTION

In embedded systems, memory is one of the major sources of performance bottleneck and power consumption [3]. A large number of techniques to reduce memory-access delay and increase memory bandwidth in the execution of code have been proposed. In the compiler and computer-architecture literature, for example, cache organization, cache-block alignment, prefetching of predicted data from memory, and software pipelining are considered [4]–[6]. In the high-level/system-synthesis community, alleviating the performance bottleneck due to memory access is one of the major research topics. Wuytack et al. [7] addressed the problem of flow-graph balancing to minimize the required memory bandwidth; they added ordering constraints to the flow graph to minimize the number of memory ports in system-level design. Catthoor et al. [8] addressed the customization of memory architecture, including memory allocation, data packing into memories, and memory-port sharing in embedded multimedia system design.

It should be noted that most modern DRAMs used in embedded system design support efficient memory-access modes. In particular, the page and burst access modes are extensively supported in DRAMs [9], [10]. In general, the latency of a random access is much higher than that of a page/burst access. Further, memory operates at lower power per bit in page/burst mode than in random-access mode. For example, IBM's Cu-11 Embedded DRAM macro [9] supports a random access of 10 ns and a page-mode access of 5 ns, with an active current of roughly 60 mA/Mb in a random cycle and 13 mA/Mb in a page cycle. Consequently, a full exploitation of these memory-access modes is essential to alleviate the performance bottleneck caused by the delay of memory accesses.

Panda et al. [1] modeled a number of realistic page-access modes in DRAMs and proposed an algorithm for arranging nonarray variables in memory and organizing array variables via loop-transformation techniques in high-level synthesis to utilize the access modes. Their formulation of the problem of arranging nonarray variables in a memory is analogous to the one in [2], which was originally used to reduce the number of cache misses. Though the formulation looks reasonable, it is, in a strict sense, not an exact formulation. Khare et al. [11] extended the work in [1] to support the burst mode in a dual-memory architecture, focusing mainly on an efficient interleaving of memory accesses. Grun et al. [12] proposed an approach that allows the compiler to exploit detailed timing information of the memories; they showed that a further optimization (using page and burst modes) is possible when accurate timing of memory accesses to multiple memories is extracted. They also presented an approach [13] for extracting, analyzing, and clustering the most active memory-access patterns in an application and customizing the memory architecture. Ayukawa et al. [14] proposed an access-sequence control scheme for relieving the page-miss penalty in random-access mode; they introduced an embedded DRAM macro attached to special hardware logic for access-sequence control. McKee et al. [15] described a stream memory controller (SMC) whose access-ordering circuitry attempts to maximize memory-system performance based on the device characteristics; their compiler detects streams, while access ordering and issue are determined by hardware. Chame et al. [16] manually optimized an application for a PIM-based (embedded-DRAM) system by applying loop unrolling and memory-access reordering to increase the number of page-mode accesses. Shin et al.
[17] designed and implemented a compilation technique that exploits page mode automatically; their technique unrolls loops and reorders memory accesses by hoisting some code in the loop body. Hettiaratchi et al. [18] proposed an address-assignment technique for array variables to reduce the energy consumed by memory accesses by maximizing page-mode accesses; they formulated the problem as a multiway graph partitioning and solved it using a Kernighan–Lin-based partitioning algorithm.

In this paper, to complement the prior work [1], [2], [11]–[18], we propose a new approach to the problem of arranging the nonarray variables in code in DRAM so that the efficient page and/or burst modes are maximally utilized. The key features are, for each of the page and burst modes:

1) We show that the problem is NP-hard. Consequently, finding optimal solutions within a reasonable amount of time for relatively large problem instances looks infeasible. This validates our strategy of designing an efficient heuristic for solving the problem.

2) Unlike the previous approaches, we solve the original problem based on an exact and direct formulation. Consequently, in terms of the quality of results, our proposed heuristic is superior to existing techniques.

II. PRELIMINARIES AND MOTIVATING EXAMPLE

Before we illustrate how the memory layout for variables affects the execution of page- and/or burst-mode memory accesses, we briefly explain what the normal, page, and burst modes are. A normal-mode access starts with a row-decoding stage, in which the entire row that contains a set of words¹ is copied into a row buffer. In the succeeding column-decoding stage, the element (i.e., word) in
¹The value of m is memory-dependent. The m words in each row of memory are collectively called a page of size m.
Fig. 1. Graphical notation of stages for a sequence of DRAM memory accesses.
the buffer corresponding to the column address is selected and is read or written according to the status of the r/w signal. Finally, a precharging stage is performed to prepare for the row decoding of the next memory-access operation. In page mode, on the other hand, an initial access consisting of row decoding followed by column decoding is performed. Then, if the word to be accessed by the next memory operation is already in the page that was retrieved just before, the row decoding is omitted; only an additional column decoding using the corresponding column address is needed to access the word from the buffer. Such a subsequent access is called a page access, and the initial access followed by page access(es) is called page-mode access(es). Similarly, in burst mode, starting from the result of an initial access, if the word to be accessed is the one at the next succeeding column of the row (i.e., of the same page), the access corresponding to the column decoding is called a burst access. Fig. 1 shows a graphical notation of the stages for a sequence of DRAM memory accesses, where N and P indicate a normal access and a page access in page mode, respectively. Note that in Fig. 1 the initial row-decode stage, the column-decode stage, and the precharge stage following consecutive page accesses are collectively denoted as a normal access. It is clear from the figure that if a sequence of k memory references is implemented with the normal mode only, k row-decode, k column-decode, and k precharge stages are needed. However, if it is implemented with both normal and page modes, the best solution uses one row-decode stage, k column-decode stages (one for the initial access, k − 1 for page accesses), and one precharge stage. A page access not only has a much shorter access delay, but also consumes much less energy than the initial normal access.
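The stage counting above can be made concrete; a toy sketch of the idealized model in the text (all k references assumed to hit one page in the page-mode case), not tied to any particular DRAM:

```python
def stage_counts(k, page_mode):
    """Return (row decodes, column decodes, precharges) for k references.

    Normal mode only: every reference needs all three stages.
    Page mode, all k words in one page: one row decode, k column
    decodes (1 initial + k-1 page accesses), one precharge.
    """
    return (1, k, 1) if page_mode else (k, k, k)
```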
Consequently, applying as many page accesses as possible to a sequence of memory accesses is key to reducing the overall memory-access latency and memory-access energy consumption. Further, an aggressive use of burst accesses rather than normal-mode accesses definitely reduces the memory-access delay. We formally define the page and burst accesses as follows.

Definition 1: For a sequence of variable references a_1, a_2, ..., a_{i−1}, a_i, ..., and an assignment of the variables to the memory, let v(a_i) denote the variable referenced by a_i. Further, let f_page(v(a_i)) and f_addr(v(a_i)) be the page and the relative location of the memory at which variable v(a_i) is located, respectively.

• When normal mode and page mode are available, the access of a_i is called a page access if and only if f_page(v(a_i)) = f_page(v(a_{i−1})).

• When normal mode and burst mode are available, the access of a_i is called a burst access² if and only if f_page(v(a_i)) = f_page(v(a_{i−1})) and f_addr(v(a_i)) = f_addr(v(a_{i−1})) + 1.

²There are a number of techniques for implementing burst modes. Our definition is an abstraction of them.

Fig. 2. Example illustrating the effects of memory layout on the number of page/burst accesses. (a) A memory layout and its corresponding sequences of access. (b) Another memory layout and its corresponding sequences of access.
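Definition 1 can be read off as a small classifier; a sketch assuming a layout given as a map from each variable to a word address, with pages of m consecutive words (the helper name and representation are ours, not from the paper):

```python
def classify_accesses(seq, addr, m):
    """Label each reference in seq as 'normal', 'page', or 'burst'.

    addr maps a variable to its word address, so f_page(v) = addr[v] // m
    and f_addr(v) = addr[v].  The first reference and every page miss are
    normal; a same-page access at the next consecutive address is a burst
    access (and hence also satisfies the page-access condition).
    """
    labels = []
    for i, v in enumerate(seq):
        if i == 0 or addr[v] // m != addr[seq[i - 1]] // m:
            labels.append('normal')
        elif addr[v] == addr[seq[i - 1]] + 1:
            labels.append('burst')
        else:
            labels.append('page')
    return labels
```

For instance, with addr = {'a': 0, 'b': 1} and m = 2, the sequence a, b, a is labeled normal, burst, page.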
One of the most effective ways to reduce memory-access latency using page or burst mode is to find the (relative) placement of variables in memory (i.e., the memory layout for variables) that offers the maximum number of page/burst accesses. Fig. 2 illustrates how different memory layouts affect the memory-access latency. The left side of Fig. 2(a) shows a memory layout for a sequence of variable accesses in a program segment, where variables a, b, and c are in the first page and d, e, and f are in the second page. The sequence of normal and page accesses corresponding to the layout is shown in the upper box of Fig. 2(a); it consists of six page accesses (i.e., six column decodes) and six normal-mode accesses (i.e., six row decodes + column decodes + precharges). On the other hand, the upper box of Fig. 2(b) shows the sequence of normal and page accesses corresponding to another layout, shown on the left side of Fig. 2(b). Note that the number of page accesses used is two more than in the case of Fig. 2(a), resulting in a 22% reduction (i.e., from 36 to 28 cycles) in memory-access latency. The lower boxes of Fig. 2(a) and (b) show the sequences of memory-access modes used for the corresponding memory layouts when the burst and normal modes are available. This also clearly reveals that the arrangement of variables in memory drastically affects the memory-access latency and energy consumption. Thus, the optimization problem we want to solve is to find an efficient arrangement of variables in memory, for a given sequence of variable accesses, so that page/burst accesses are extensively utilized. Note that the order of the access sequence of variables, as well as the memory layout, heavily affects the page-/burst-mode exploitation. However, in a general compilation procedure the spilled variables are determined after the code-scheduling and register allocation/assignment tasks, resulting in a fixed order of the access sequence. Thus, in our work, we assume that the proposed layout techniques are applied to the fixed access sequence of these spilled variables, and we do not consider coupling the techniques with the determination of the access order.
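The effect of layout on the page-access count, as in Fig. 2, can be checked in a few lines; a sketch with a made-up access sequence (the exact sequence of Fig. 2 is not reproduced here):

```python
def page_access_count(seq, layout, m):
    """Count page accesses of an access sequence under a memory layout.

    layout lists the variables in address order, so the i-th variable
    sits in page i // m; an access is a page access when it hits the
    same page as the immediately preceding access (Definition 1).
    """
    page = {v: i // m for i, v in enumerate(layout)}
    return sum(1 for u, v in zip(seq, seq[1:]) if page[u] == page[v])

seq = ['a', 'd', 'b', 'e', 'c', 'f']                              # hypothetical
print(page_access_count(seq, ['a', 'b', 'c', 'd', 'e', 'f'], 3))  # 0
print(page_access_count(seq, ['a', 'd', 'b', 'e', 'c', 'f'], 3))  # 4
```

The second layout groups the variables the way they are accessed, turning four of the five transitions into page hits.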
Fig. 3. Sequence of variable accesses and its access graph representation.
We discuss the optimization problem with page mode and its solution in Section III, followed by the problem with burst mode and its solution in Section IV. We show experimental results on a number of benchmark programs in Section V. Finally, concluding remarks are given in Section VI.

III. MEMORY LAYOUT UTILIZING PAGE MODE

A. Problem Formulation

The optimization problem is, given a sequence of variable accesses, to find a memory layout with the maximum number of page accesses (e.g., for IBM Cu-11 Embedded DRAM [9]). We call this problem memory layout with page mode (MLP). An instance of MLP is characterized by (S, V, m), where S is the variable sequence to be accessed, V is the set of variables in S, and m is the page size; a layout partitions the variables in V into disjoint groups, each of which has at most m variables.

Problem 1 (Decision-MLP): For an MLP instance (S, V, m) and k, where m ≥ 3, is there a memory layout L in which the number of page accesses for S is greater than or equal to k?

To show that MLP is NP-hard, we show that decision-MLP is NP-complete by reducing an NP-complete graph problem (decision-AccGP, shown later) to decision-MLP. First, we build a graph, called the access graph, from the access sequence.

Definition 2: The access graph of S is a multigraph G(V, E), where the node set V is the set of variables in S, and there are n edges between two nodes v_i and v_j if and only if v_i and v_j are adjacent to each other in S exactly n times.

Fig. 3 shows the access graph, as in Definition 2, for a sequence of variable accesses S. For example, the access graph has three edges between nodes a and c, since variables a and c are adjacent to each other three times in S. From the observation that the access sequence is in fact a path that covers every edge of the access graph once and only once (i.e., an Euler path), we can easily check the following.
Property 1: A multigraph is an access graph if and only if all of its node degrees are even or exactly two node degrees are odd.

A graph partitioning Π^m_G(V,E) of a multigraph G(V, E) partitions V into t disjoint subsets V_1, V_2, ..., V_t, where V_1 ∪ V_2 ∪ ... ∪ V_t = V, each containing at most m vertices. The gain of Π^m_G(V,E) is defined as

    g(Π^m_G(V,E)) = Σ_{(v_i, v_j) ∈ E; v_i, v_j ∈ V_l; l = 1, ..., t} w(v_i, v_j)        (1)
where w(v_i, v_j) represents the number of edges between v_i and v_j in the access graph G. An optimal graph partitioning of G(V, E) is a Π^m_G(V,E) that maximizes g(·). We claim that the partitioning problem of an access graph is NP-complete.

Problem 2 (Decision-GP): For a multigraph G(V, E) and k, is there a Π^m_G(V,E) (m ≥ 3) such that g(Π^m_G(V,E)) ≥ k?

Problem 3 (Decision-AccGP): For an access graph G(V, E) and k, is there a Π^m_G(V,E) (m ≥ 3) such that g(Π^m_G(V,E)) ≥ k?

Note that decision-GP is NP-complete, since the partitioning problem for graphs without multiple edges is known to be NP-complete [19].
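Definition 2 and the gain of (1) translate directly into code; a sketch using a dict keyed by unordered node pairs (our representation, not the paper's):

```python
from collections import Counter

def access_graph(seq):
    """Access graph of Definition 2: the returned Counter maps
    frozenset({u, v}) to the number of times u and v are adjacent
    in the access sequence seq."""
    w = Counter()
    for u, v in zip(seq, seq[1:]):
        if u != v:                       # self loops carry no gain
            w[frozenset((u, v))] += 1
    return w

def gain(partition, w):
    """Gain g of a partitioning per (1): total multiplicity of edges
    whose two endpoints fall in the same group of the partition."""
    return sum(wt for edge, wt in w.items()
               if any(edge <= group for group in partition))
```

For S = a, c, a, c, b, the graph has w(a, c) = 3 and w(c, b) = 1, and the partitioning {{a, c}, {b}} has gain 3.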
Theorem 1: Decision-AccGP is NP-complete.

Proof: Decision-AccGP is in NP since, for a partitioning of G, its gain can be checked in polynomial time. By Property 1, if every multigraph in decision-GP is restricted to contain an Euler path, decision-GP is equivalent to decision-AccGP. Otherwise, decision-GP is reducible to decision-AccGP in polynomial time: for every node pair v_i and v_j in the multigraph G of decision-GP with w edges between them, we augment w + 2 additional edges, for a total of 2w + 2 edges between them. The new multigraph G′ produced by the augmentation always contains a closed Euler path, because the degree of every node in G′ is even; consequently, G′ is also an access graph (by Property 1). By considering the same partitions in G and G′, we can easily check that G has a partition of gain k if and only if G′ has a partition of gain 2k + 2.

Theorem 2: Decision-MLP is NP-complete.

Proof: For an MLP instance (S, V, m) and a memory layout, the number of corresponding page accesses can be calculated in polynomial time; thus, decision-MLP is in NP. Next, we show that decision-AccGP is polynomial-time reducible to decision-MLP. Let G(V, E) be an access graph in decision-AccGP, and let Π^m_G be a partition of G. We construct a problem instance (S, V′, m′) and a layout L of decision-MLP from the instance G(V, E) and Π^m_G of decision-AccGP. We set V′ = V and m′ = m. To derive S, we apply Fleury's algorithm [20] to the access graph G to find an Euler path. (Note that an Euler path always exists in G by Property 1, and the time complexity of Fleury's algorithm is O(|E|²).) Let S = (a_1, a_2, ..., a_N) be the sequence of nodes on the path. Further, let each node subset (i.e., group) in Π^m_G be a distinct page in L, so that the number of groups equals the number of pages.
We claim that for (S, V′, m′) and L, the number of page accesses equals g(Π^m_G) (which implies that the solution L for the derived decision-MLP instance (S, V′, m′) corresponds exactly to the solution Π^m_G for the given decision-AccGP instance G): Since S = (a_1, a_2, ..., a_N) is an Euler path of G, for each pair of adjacent accesses (a_i, a_{i+1}) in S there is a corresponding edge (v(a_i), v(a_{i+1})) in G(V, E). If v(a_i) and v(a_{i+1}) are in the same group of Π^m_G, they contribute 1 to g(Π^m_G); since v(a_i) and v(a_{i+1}) are then in the same page of L, a_{i+1} becomes a page access in L. On the other hand, if v(a_i) and v(a_{i+1}) are in different groups of Π^m_G, the edge does not contribute to g(Π^m_G), and a_{i+1} is a normal access in L since v(a_i) and v(a_{i+1}) are not in the same page. Thus, the number of page accesses under L is g(Π^m_G).

Conversely, we claim that for a decision-MLP instance (S, V, m) with layout L and k page accesses, we can construct a decision-AccGP instance G′(V′, E′) and a partitioning Π′^m_G with g(Π′^m_G) = k, where S = (a_1, a_2, ..., a_N). We construct the multigraph G′(V′, E′) as in Definition 2. By mapping each page of L to a disjoint group, we obtain a graph partitioning Π′^m_G of G′. For each pair a_i, a_{i+1} in S, there is exactly one corresponding edge (v(a_i), v(a_{i+1})) in G′. Thus, if v(a_i) and v(a_{i+1}) are in the same page of L, a_{i+1} is a page access and the corresponding edge contributes 1 to g(Π′^m_G); otherwise, a_{i+1} is a normal access and the corresponding edge does not contribute to g(Π′^m_G). Thus, g(Π′^m_G) = k.

For an instance (S, V, m) of the MLP problem, the memory layout L implied by a partitioning Π^m_G of the access graph G is the one in which each page contains the set of variables of one group of Π^m_G.
Note that the number of page accesses under L is the same as g(Π^m_G), as described in the proof of Theorem 2. Thus, we have the following theorem.

Theorem 3: For an instance (S, V, m) of problem MLP, every memory layout L implied by an optimal Π^m_G of the corresponding instance G(V, E) of problem AccGP is optimal.³

³AccGP is the optimization version of the decision problem decision-AccGP.
Fig. 4. Sequence of variable accesses and (a) the partitioning of its access graph. (b) Implied memory layout of (a).
Proof: Let Π be a graph partitioning with optimal gain g. Let L′ be a memory layout implied by some Π′ (which may not be an optimal partitioning) with gain g′. From the transformation procedure from MLP to AccGP in the proof of Theorem 2, if the number of page accesses under L′ were larger than that under L, then g′ > g, contradicting the optimality of Π. Thus, no memory layout implied by a nonoptimal partitioning can have more page accesses than a memory layout implied by an optimal partitioning.

For example, Fig. 4(a) shows an optimal graph partitioning, where one partition consists of the grey nodes and the other of the white nodes, and the number of heavy edges indicates the gain of the partitioning, as in Fig. 2(b). Fig. 4(b) shows the implied memory layout, leading to eight page accesses, the same as the gain of the partitioning. Theorem 3 indicates that we can solve the MLP problem by formulating it as the graph-partitioning problem AccGP and applying any efficient partitioning heuristic.

Note that a number of works [1], [2] have addressed graph partitioning to find a minimum-latency layout. However, their graph derivation differs from ours in that the weight assigned to edge (u, v) in [1] and [2] represents not the number of consecutive accesses of (u, v) or (v, u) in the access sequence S, but the total number of pairs of accesses of u and v in S whose distance is within the page size m. By their definition, distance(u, v) is the number of distinct variable nodes encountered on a path from u to v, or v to u, in the access sequence S, including u and v. In their graph, called a closeness graph, if the access interval of two variables is close enough to be within the page size, those accesses are considered to have a high possibility of executing in page mode. However, this is not always true, since whether an access a_i is a page access is determined only by the location of the immediately previously accessed variable v(a_{i−1}).
Other accesses, such as a_{i−2} or a_{i+1}, do not affect a_i. Consequently, counting all pairs of accesses of u and v within distance m tends to overestimate the number of page accesses. Fig. 5 shows an example of how the edge weights of the graph for a sequence of variable references are assigned by the approaches of [1], [2] and by ours. Fig. 5(a) shows the closeness graph of [1] and [2], in which w(a, e) = 3, since a and e occur within distance m (= 3) three times in S. However, a and e appear consecutively only once in S, as denoted in parentheses, resulting in at most one page access if a and e are allocated to the same page. Our graph is a simple graph, as in Fig. 5(b), except for the self loop. Note that, as indicated by the graph on the left side of the figure, a partition with maximal gain for the graph of [1] and [2] does not necessarily produce the maximum number of page accesses, whereas an optimal partition (shown on the right side of the figure) of our graph directly results in the maximum number of page accesses. (Dark vertices belong to one page, lighter ones to another, and the heavy edges indicate the gain.)

B. Proposed Algorithm

Due to the NP-completeness of the memory-layout problem, we develop an efficient greedy heuristic, called Solve-MLP (memory layout for page mode), based on a node-clustering procedure performed on the edge-weighted (simple) access graph G′(V, E′, W) for S (e.g., Fig. 6), which is derived from the access graph G(V, E) (e.g., Fig. 3) by setting, for each pair of nodes u and v, an edge whose weight equals the number of edges w(u, v) in G. The inputs to the proposed algorithm are the edge-weighted access graph and the page size m. The algorithm is an iterative process: at each iteration, a group of m nodes is extracted from the access graph, and the process repeats on the remaining nodes until all nodes are extracted. Each extracted group represents a distinct page, and the variables corresponding to the nodes in the same group are the variables in the same page. Note that, unlike in the burst-mode case (discussed later), the relative placement order of the variables within a page is not important, because the number of page accesses is determined by the grouping of the variables, not by their order within a group.

Each iteration of the algorithm starts from a seed, the edge with the largest weight. Let (u, v) be the selected edge, and set P = {u, v}. We iteratively expand P by adding nodes, one at a time, until |P| = m or no nodes are left. The node to include at each step is selected based on the following two measures. For each node x not in P
    attract_in(P, x) = Σ_{e=(x,y) ∈ E and y ∈ P} w(e)        (2)

    attract_out(P, x) = Σ_{e=(x,y) ∈ E and y ∉ P} w(e).        (3)
We select the node with the largest value of attract_in(P, ·). If there are ties and |P| = m − 1, we choose the one with the smallest value of attract_out(P, ·), because attract_out(P, ·) directly contributes to the number of normal accesses, which require a long access delay. When |P| < m − 1, however, we choose among the tied nodes at random, because there is no clear clue at that point. Once a node is selected, it is added to P and deleted from the access graph G. The grouping process then repeats for the remaining nodes in G. Note that Solve-MLP does not increase the page requirement, since at most one page produced by Solve-MLP has fewer than m variables and all other pages have exactly m variables, where m is determined by the given memory architecture.

Fig. 7 shows the steps of clustering nodes by Solve-MLP. First, we select the edge (a, c) because it has the largest edge weight, and set P = {a, c}, as shown in Fig. 7(a). Then, we expand P by adding node e because its attract_in(P, ·) is among the largest, as indicated in Fig. 7(b). Now we complete the group of size four by including one more node. Since there are two nodes, d and g, whose attract_in(P, ·) values are the largest, we choose g because its attract_out(P, ·) value is smaller than that of d, as shown in Fig. 7(c). Consequently, the variables placed in the same page are a, c, e, and g. We repeat the grouping process for the updated access graph in Fig. 7(d).

Fig. 8 summarizes the flow of Solve-MLP. Since the inner while-loop iterates m times and the time for computing attract_in(P, ·) at each iteration is bounded by O(m·|V|), the time complexity of the inner while-loop is bounded by O(m²·|V|), assuming that the number of ties at the last iteration is relatively small ( ... ; max_out_degree(·) > 1) and/or length violation (i.e., max_path_len(·) > m). We select the node connected to an arc in T with the least value of attract_out_burst(·).
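The Solve-MLP clustering loop described above can be sketched as below. This is our reconstruction, with one simplification: ties on attract_in are always broken by the smallest attract_out, whereas the paper breaks ties randomly when |P| < m − 1. Here w maps unordered pairs frozenset({u, v}) to edge weights.

```python
def solve_mlp(w, nodes, m):
    """Greedy page clustering in the spirit of Solve-MLP.

    w: dict frozenset({u, v}) -> weight in the edge-weighted access
    graph (no self loops); nodes: iterable of variables; m: page size.
    Returns a list of pages (sets of at most m variables).
    """
    remaining = set(nodes)

    def attract_in(x, P):
        # total weight of edges from x into the page under construction
        return sum(wt for e, wt in w.items() if x in e and (e - {x}) <= P)

    def attract_out(x):
        # total weight of edges from x to still-unassigned nodes
        return sum(wt for e, wt in w.items()
                   if x in e and (e - {x}) <= remaining)

    pages = []
    while remaining:
        live = {e: wt for e, wt in w.items()
                if e <= remaining and len(e) == 2}
        if live and m >= 2:
            P = set(max(live, key=live.get))   # seed: heaviest edge
        else:
            P = {next(iter(remaining))}        # no edges left: lone seed
        remaining -= P
        while len(P) < m and remaining:
            best = max(remaining,
                       key=lambda x: (attract_in(x, P), -attract_out(x)))
            P.add(best)
            remaining.discard(best)
        pages.append(P)
    return pages
```

With the toy graph w(a, c) = 3, w(c, e) = 2, w(e, g) = 1, w(b, d) = 2 (hypothetical weights) and m = 3, the heuristic seeds on (a, c), pulls in e, and then packs b, d, g into the second page.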
Since computing attract_out_burst(·) at each iteration of the while-loop takes O(|V|²) and the number of iterations is at most |A|, the time complexity of Solve-MLB is bounded by O(|A|·|V|²).

Fig. 14 shows the steps of finding the path cover by Solve-MLB. First, we select the arc ⟨e, a⟩ because it has the largest arc weight, and set C = {e, a}, as shown in Fig. 14(a). Then, we collect the arcs with the next-largest weight, ⟨c, e⟩ and ⟨a, d⟩. Among them, we select ⟨c, e⟩ and add it to the path cover C because its attract_out_burst(C, ·) value is smaller, as indicated in Fig. 14(b). Similarly, in the next iteration, we add arc ⟨b, d⟩ to C, as shown in Fig. 14(c). Finally, only one candidate arc, ⟨d, f⟩, remains, and it is added to C, as shown in Fig. 14(d), resulting in the final path cover {c → e → a, b → d → f}.

TABLE I
SIZE OF BENCHMARK PROGRAMS

V. EXPERIMENTAL RESULTS

We conducted a set of experiments to check the effectiveness of the proposed memory-layout optimization algorithms Solve-MLP and Solve-MLB. Our algorithms were implemented in C++ and executed on a Pentium III 500-MHz Linux machine. We tested them on the benchmark examples in Table I [23]–[25] and compared our results with those produced by the order-of-first-use (OFU) variable assignment and the closeness-graph-based (CGB) algorithm [1], [2], which was overviewed in Section III. OFU is a naive approach that places variables in memory in the order in which they appear in the code; the CGB algorithm clusters variables into pages based on the degree of "closeness" between nodes (variables) in a graph.

• Memory optimization for maximizing DRAM's page accesses: Table II summarizes the number of page accesses obtained by OFU, CGB, and Solve-MLP for various page sizes. We also include the results produced by a branch-and-bound (BB) technique to check how close the results of our heuristic are to optimal. For large designs like ELLIP, for which BB could not produce a final (optimal) solution due to memory overflow, we simply use the best solution found so far.
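The path-cover construction walked through in Fig. 14 can be sketched as a greedy loop. This is our reconstruction from the text — arcs taken in decreasing weight, rejected on branching, cycle, or length violations — with ties broken arbitrarily rather than by attract_out_burst:

```python
def greedy_path_cover(arcs, m):
    """Build vertex-disjoint paths of at most m nodes from weighted arcs.

    arcs: dict (u, v) -> weight in the directed access graph.  An arc is
    rejected if u already has a successor or v a predecessor (branching
    violation), if it would close a cycle, or if the merged path would
    exceed m nodes (length violation).  Returns the set of chosen arcs.
    """
    succ, pred = {}, {}

    def path_nodes(u):
        # number of nodes on the maximal path through u
        a = u
        while a in pred:
            a = pred[a]
        n = 1
        while a in succ:
            a = succ[a]
            n += 1
        return n

    chosen = set()
    for (u, v), _ in sorted(arcs.items(), key=lambda kv: -kv[1]):
        if u == v or u in succ or v in pred:
            continue                      # branching violation
        a = v                             # cycle check: does v lead back to u?
        while a in succ:
            a = succ[a]
        if a == u:
            continue
        succ[u], pred[v] = v, u
        if path_nodes(u) > m:             # length violation: undo
            del succ[u], pred[v]
            continue
        chosen.add((u, v))
    return chosen
```

With hypothetical arc weights ⟨e,a⟩ = 5, ⟨c,e⟩ = 4, ⟨a,d⟩ = 3, ⟨b,d⟩ = 2, ⟨d,f⟩ = 1 (the paper does not list Fig. 14's weights) and m = 3, the cover comes out as {c → e → a, b → d → f}, matching the walkthrough: ⟨a,d⟩ is rejected because it would stretch the first path to four nodes.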
Table I summarizes the sizes of the designs tested in the experiments, where |V| and |S| represent the number of variables and the length of the access sequence of each design, respectively. BIQUAD (biquad one section) and FIR are taken from the DSPstone benchmark suite [23]; variables declared as static were regarded as automatic variables. Complex multiplier (COMP) and elliptical wave filter (ELLIP) are taken from the high-level synthesis benchmark designs [24]. Gauss–Legendre weights and abscissas (GAULEG), Gauss–Hermite weights and abscissas (GAUHER), and Gauss–Jacobi weights and abscissas (GAUJAC) are taken from [25]. (GAULEG, GAUHER, and GAUJAC are the main routines for Gaussian quadrature, one of the most widely used techniques in numerical integration.) ADPCM and IDCT are from "adpcm.c" of adpcm and "idct.c" of mpeg2 in [27]. The entries marked with a "-" indicate that a single page contains all the variables of the corresponding design. In summary, the average improvements of Solve-MLP over OFU and the CGB algorithm [1] are 28.2% and 10.1%, respectively. In addition, Solve-MLP uses only 2.5% fewer page accesses on average than the optimal BB technique, which clearly indicates that the simple heuristic Solve-MLP is sufficient in practice.

• Memory optimization for maximizing DRAM's burst accesses: Table III summarizes the number of burst accesses obtained by OFU, CGB, and Solve-MLB for various page sizes. The average improvements of Solve-MLB over OFU and CGB are 82.9% and 107%, respectively, and Solve-MLB uses 7.6% fewer burst accesses than the optimal solutions. The authors of [1] and [2] considered only page mode: in their closeness graph, the weight of edge (u, v) is the total number of pairs of accesses of u and v in S, regardless of their access order, as long as the distance between them is within the page size. For the comparison of burst accesses, we adapted CGB to burst mode by turning the closeness graph into a directed graph; namely, we split an edge (u, v) into two arcs ⟨u, v⟩ and ⟨v, u⟩ according to the access order within the distance of the page size. The improvement of Solve-MLB over OFU and CGB is much greater than that of Solve-MLP. This is because the number of burst-mode accesses is greatly affected not only by the pages of the accessed variables but also by their relative locations within the pages. Consequently, the directed-graph modeling, which accurately reflects the number of burst accesses, matters even more here than the graph modeling does for page-mode accesses. The large improvement over OFU and CGB clearly shows that Solve-MLB performs well in maximizing the number of burst accesses.

• Checking the reductions in total memory-access latency and energy consumption: Figs.
15 and 16 graphically show how much the total memory access latency and energy consumption, respectively, are actually saved by our Solve-MLP over the conventional methods. The number of variables and the length of the access sequence of each example code are given on the x axis of each figure. We used the access figures of the IBM Cu-11 Embedded DRAM macro [9] as a reference in the experiments, in which a random (i.e., normal) access cycle is 12 ns at best and a page access is 4 ns at best. In addition, under a supply voltage of 1.5 V, the active current is 59.19 mA on average for a random access and 11.21 mA for a page access. It is shown that careful layouts of variables for maximizing page/burst accesses contribute considerably to the reduction of total access latency and memory access energy consumption. Note that the access sequence of a program on a specific real processor can be extracted only after code selection and register allocation, the machine-dependent parts of compilation, have been performed. In our experiments, rather than extracting the access sequences of complete code for a specific processor, we manually extracted the access sequences from the 3-address code under the assumption that only a small number of registers are available, which is true for most synthesis-optimized designs with embedded DRAMs. However, to estimate real processor cycles or energy overhead accurately, a complete experimental environment that provides assembly code ready to execute on the processor and cycle-accurate simulation is needed; this issue should be addressed in the future.

VI. CONCLUSION

In this paper, we proposed a set of comprehensive solutions to the problems of storage assignment for variables to achieve full utilization of two efficient DRAM access modes: the page and burst modes. Since memory access is one of the major bottlenecks in embedded
TABLE II NUMBERS OF PAGE ACCESSES USED BY OFU, CGB [1], [2], Solve-MLP, AND BB
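As a concrete illustration of what Table II counts, the sketch below (our own illustration, not the paper's code; the function name and the four-word page size are assumptions) tallies page-mode versus random-mode accesses for a given variable layout and access sequence, treating an access as a page-mode access when it falls in the same page as the immediately preceding access.

```python
PAGE_SIZE = 4  # words per page (assumed for illustration)

def count_page_accesses(layout, access_seq):
    """layout: variable -> word address; returns (#page-mode, #random-mode)."""
    page_hits, prev_page = 0, None
    for v in access_seq:
        page = layout[v] // PAGE_SIZE
        if page == prev_page:
            page_hits += 1          # same page as the previous access
        prev_page = page
    # the first access and every page switch incur a random-mode access
    return page_hits, len(access_seq) - page_hits

# two pages of four words each; a, b share page 0 and c, d share page 1
layout = {"a": 0, "b": 1, "c": 4, "d": 5}
print(count_page_accesses(layout, ["a", "b", "a", "c", "d", "b"]))  # (3, 3)
```

A memory layout algorithm then seeks the `layout` that maximizes the first component of this count for the given access sequence.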
TABLE III NUMBERS OF BURST ACCESSES USED BY OFU, CGB [1], [2], Solve-MLB, AND BB
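The directed closeness graph used to adapt CGB to the burst mode, as described above, can be built roughly as follows. This is a hedged sketch under our own naming; we assume the "distance" between two accesses is their separation in the access sequence, bounded by the page size.

```python
from collections import defaultdict

PAGE_SIZE = 4  # window width in accesses, assumed equal to the page size

def directed_closeness_graph(access_seq):
    """Weight of arc (u, v) = number of times u is accessed before v
    within a window of PAGE_SIZE positions in the sequence."""
    weight = defaultdict(int)
    for i, u in enumerate(access_seq):
        for v in access_seq[i + 1 : i + PAGE_SIZE]:
            if v != u:
                weight[(u, v)] += 1  # u precedes v inside the window
    return dict(weight)

g = directed_closeness_graph(["a", "b", "a", "c"])
```

Merging the weights of ⟨u, v⟩ and ⟨v, u⟩ into a single edge weight for (u, v) would recover the order-insensitive closeness graph of [1], [2].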
Fig. 15. Access latency of randomly generated code.
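The savings plotted in Figs. 15 and 16 follow directly from the per-access figures quoted in the text (12 ns random and 4 ns page access; 59.19 mA and 11.21 mA average active current at 1.5 V). A minimal estimate, with function names of our own choosing:

```python
RANDOM_NS, PAGE_NS = 12.0, 4.0     # best-case access times (ns)
RANDOM_MA, PAGE_MA = 59.19, 11.21  # average active currents (mA)
VDD = 1.5                          # supply voltage (V)

def access_latency_ns(n_page, n_random):
    """Total access latency of a sequence, given its page/random breakdown."""
    return n_page * PAGE_NS + n_random * RANDOM_NS

def access_energy_pj(n_page, n_random):
    # E = V * I * t per access; with I in mA and t in ns the product is in pJ
    return VDD * (n_page * PAGE_MA * PAGE_NS + n_random * RANDOM_MA * RANDOM_NS)

# e.g., a sequence with three page-mode and three random-mode accesses
print(access_latency_ns(3, 3))  # 48.0 ns
```

Converting random accesses into page accesses thus cuts both terms at once, which is why the latency and energy curves track each other.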
systems’ performance, the proposed techniques can be effectively used in DRAM-access-intensive embedded system applications. For each of the page and burst modes, the contributions are: 1) we showed that the problem is NP-hard and 2) we proposed a direct and exact formulation of the problem and efficient memory layout algorithms, called Solve-MLP for the page mode and Solve-MLB for the burst mode. From a set of experiments with benchmark examples, it was shown that the proposed techniques used on average 10.1%–28.2%
Fig. 16. Energy consumption in access of randomly generated code.
more page accesses and 82.9%–107% more burst accesses than the existing techniques. Note that there are many existing algorithms that utilize memory access modes for array variables (e.g., [13]). Our algorithms are specialized in dealing with scalar variables and can be applied independently of those algorithms for array variables because, typically, nonarray variables are mixed with array variables and they appear together in access sequences. Note, however, that, in general, the elements of an array variable (e.g., A[0], A[1], ..., A[9]) are stored consecutively in one or several pages of memory, possibly according to either row-major or
column-major order. Thus, we can reasonably assume that there are no or few page/burst accesses between array variables and nonarray variables. Consequently, the algorithms for array variables and our algorithms can be applied independently, even though the overall results may not necessarily be optimal. The proposed technique Solve-MLP cannot be applied to applications in which all the spilled variables are stored in one page of memory, which is quite common in general. Nevertheless, Solve-MLP and Solve-MLB have their strengths in the following situations. 1) Even when all spilled variables are contained in one page of memory, if we use a memory with a burst mode, the use of Solve-MLB is still effective, because it finds a maximum-length path cover, which directly leads to a maximum number of burst accesses. 2) In many applications where arrays and scalars are stored together, the holes left in pages by the arrays are used by scalars. If these holes are small enough, an extended application of the existing array alignment techniques to scalar alignment, utilizing the underlying ideas of Solve-MLP and Solve-MLB, may produce better results than applying the existing techniques solely to array alignment. Note that our proposed techniques are applicable not only to embedded DRAMs, but also to many other devices that have page and/or burst modes, such as flash memory [21]. Admittedly, the effectiveness of our approach in current applications is quite limited. Our approach may, however, be applied effectively to specially designed processors, such as energy-efficient (clustered) processors (e.g., [28] and [29]) with heavily distributed register files in which only a few locations of each cluster are available.
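The claim that a maximum-length path cover directly yields a maximum number of burst accesses can be seen from how burst accesses are counted: an access is served in burst mode when its address immediately follows that of the preceding access, so every arc of a path whose variables are laid out at consecutive addresses contributes one burst access per traversal. A hedged sketch of the count (our own names; page-boundary checks omitted for brevity):

```python
def count_burst_accesses(layout, access_seq):
    """Count accesses whose address is one past the previous access's
    address (page-boundary checks omitted for brevity)."""
    return sum(
        1
        for u, v in zip(access_seq, access_seq[1:])
        if layout[v] == layout[u] + 1  # v stored right after u: burst mode
    )

layout = {"a": 0, "b": 1, "c": 2}  # a path a -> b -> c laid out consecutively
print(count_burst_accesses(layout, ["a", "b", "c", "a", "b"]))  # 3
```

Laying the variables of each path in the cover at consecutive addresses turns every covered arc traversal in the sequence into a burst access, so the longer the path cover, the larger this count.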
In addition, we cautiously expect, though it is not certain, that in the future some multibank SRAMs used as level-1 memories in power-critical embedded systems will also support page or burst access modes, owing to the energy efficiency of these modes, even for small-sized memories. In short, our work may be ineffective in most current embedded system designs, but we hope that in future power-critical embedded system designs, the theoretical results and neat features of the work, as implied by the special cases mentioned above, will be put to valuable use. Finally, to validate the effectiveness of the work even in these limited applications, it is required to test the approach in more realistic design environments.

REFERENCES

[1] P. R. Panda, N. D. Dutt, and A. Nicolau, “Exploiting off-chip memory access modes in high-level synthesis,” in Proc. Int. Conf. Computer-Aided Design, 1997, pp. 333–340.
[2] ——, “Memory data organization for improved cache performance in embedded processor applications,” ACM Trans. Design Automation Electron. Syst., vol. 2, no. 4, pp. 384–409, 1997.
[3] N. D. Dutt, “Memory organization and exploration for embedded systems-on-silicon,” presented at the Int. Conf. VLSI Computer-Aided Design, Seoul, Korea, 1997.
[4] T. C. Mowry, M. S. Lam, and A. Gupta, “Design and evaluation of a compiler algorithm for prefetching,” Architec. Support Program. Lang. Oper. Syst., vol. 27, no. 9, pp. 62–73, 1992.
[5] M. S. Lam, “Software pipelining: An effective scheduling technique for VLIW machines,” in Proc. SIGPLAN ’88 Conf. Program. Lang. Design Implement., vol. 23, 1988, pp. 318–328.
[6] A. W. Appel, Modern Compiler Implementation in C. Cambridge, U.K.: Cambridge Univ. Press, 1998, ch. 21, pp. 502–509.
[7] S. Wuytack, F. Catthoor, G. de Jong, B. Lin, and H. De Man, “Flow graph balancing for minimizing the required memory bandwidth,” in Proc. Int. Symp. Syst. Synthesis, 1996, pp. 127–132.
[8] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A.
Vandecappelle, Custom Memory Management Methodology. Norwell, MA: Kluwer, 1998.
[9] IBM Cu-11 Embedded DRAM Macro (2002). [Online]. Available: http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/4CBB96F927E2D6D287256B98004E1D98/$file/Cu11_embedded_DRAM.10.pdf
[10] CS70DL Embedded DRAM (1999). [Online]. Available: http://www.fme.fujitsu.com/products/asic/pdf/CS70DLFS.pdf
[11] A. Khare, P. R. Panda, N. D. Dutt, and A. Nicolau, “High-level synthesis with synchronous and RAMBUS DRAMs,” in Proc. Workshop Synthesis Syst. Integr. Mixed Technol., 1998.
[12] P. Grun, N. D. Dutt, and A. Nicolau, “Memory aware compilation through accurate timing extraction,” in Proc. Design Automation Conf., 2000, pp. 316–321.
[13] ——, “APEX: Access pattern based memory architecture exploration,” in Proc. Int. Symp. Syst. Synthesis, 2001, pp. 25–32.
[14] K. Ayukawa, T. Watanabe, and S. Narita, “An access-sequence control scheme to enhance random-access performance of embedded DRAMs,” IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 800–806, May 1998.
[15] S. A. McKee, W. A. Wulf, J. H. Aylor, R. H. Klenke, M. H. Salinas, S. I. Hong, and D. Weikle, “Dynamic access ordering for streamed computations,” IEEE Trans. Comput., vol. 49, no. 11, pp. 1255–1271, Nov. 2000.
[16] J. Chame, J. Shin, and M. Hall, “Compiler transformations for exploiting bandwidth in PIM-based systems,” presented at the 27th Annu. Int. Symp. Comput. Architecture, Workshop Solving Memory Wall Problem, Vancouver, BC, Canada, 2000.
[17] J. Shin, J. Chame, and M. Hall, “A compiler algorithm for exploiting page-mode memory access in embedded-DRAM devices,” presented at the 4th Workshop Media Streaming Process., Istanbul, Turkey, 2002.
[18] S. Hettiaratchi, P. Cheung, and T. Clarke, “Energy efficient address assignment through minimized memory row switching,” in Proc. Int. Conf. Computer-Aided Design, 2002, pp. 577–582.
[19] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: Freeman, 1979, p. 209.
[20] D. B. West, Introduction to Graph Theory. Englewood Cliffs, NJ: Prentice-Hall, 1996, pp. 85–91.
[21] Burst/Page Mode Family (2002). [Online].
Available: http://www.amd.com/us-en/FlashMemory/ProductInformation/0,,37_1447_1451,00.html
[22] M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita, “Power analysis and minimization techniques for embedded DSP software,” IEEE Trans. VLSI Syst., vol. 5, no. 1, pp. 123–135, Jan. 1997.
[23] V. Zivojnovic, J. Velarde, and C. Schlager, “DSPstone: A DSP-oriented benchmarking methodology,” in Proc. Int. Conf. Signal Process. Applicat. Technol., 1994, pp. 715–720.
[24] Benchmark Archives at CBL. [Online]. Available: http://www.cbl.ncsu.edu/CBL_Docs/Bench.html
[25] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing. Cambridge, U.K.: Cambridge Univ. Press, 1993, pp. 152–155.
[26] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang, “Storage assignment to decrease code size,” ACM Trans. Program. Lang. Syst., vol. 18, no. 3, pp. 235–253, 1996.
[27] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: A tool for evaluating and synthesizing multimedia and communications systems,” in Proc. Int. Symp. Microarchitec., 1997, pp. 330–335.
[28] M. F. Jacome, G. De Veciana, and V. Lapinskii, “Exploring performance tradeoffs for clustered ASIPs,” in Proc. Int. Conf. Computer-Aided Design, 2000, pp. 504–510.
[29] S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, “Register organization for media processing,” in Proc. Int. Symp. High-Performance Comput. Architec., 2000, pp. 375–386.