Code Placement Techniques for Cache Miss Rate Reduction

HIROYUKI TOMIYAMA and HIROTO YASUURA, Kyushu University
In the design of embedded systems with cache memories, it is important to minimize the cache miss rates to reduce the power consumption of the systems as well as to improve the performance. In this article, we propose two code placement methods (a simplified method and a refined one) to reduce miss rates of instruction caches. We first define a simplified code placement problem without an attempt to minimize the code size. The problem is formulated as an integer linear programming (ILP) problem, by which an optimal placement can be found. Experimental results show that the simplified method reduces cache misses by an average of 30% (max. 77%). However, the code size obtained by the simplified method tends to be large, which inevitably leads to a larger memory size. In order to overcome this limitation, we further propose a refined code placement method in which a code size constraint provided by the system designers must be satisfied. The effectiveness of the refined method is also demonstrated.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—code generation; optimization; B.1.4 [Control Structures and Microprogramming]: Microprogram Design Aids—languages and compilers; optimization

General Terms: Design, Performance

Additional Key Words and Phrases: Code placement, instruction cache, integer linear programming
1. INTRODUCTION

Many recent mid- to high-range embedded systems, such as communication and multimedia systems, consist of RISC processors with cache memories. In the design of embedded systems with caches, it is important to minimize the cache miss rates to enhance system performance. Cache misses slow down program execution because they require extra cycles to transfer code or data between the cache and the main memory. For

This work has been partially supported by a Grant-in-Aid for Scientific Research of the Ministry of Education, Science and Culture of Japan.
Authors' addresses: H. Tomiyama, Department of Computer Science and Communication Engineering, Graduate School of Information Science and Electrical Engineering, Kyushu University, 6-1 Kasuga-koen, Kasuga, Fukuoka 816, Japan; email: ⟨[email protected]⟩.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 1997 ACM 1084-4309/97/1000-0410 $03.50
ACM Transactions on Design Automation of Electronic Systems, Vol. 2, No. 4, October 1997, Pages 410–429.
example, in a computer system that consists of a CPU, a cache, and a main memory, the execution time of a program on the system is defined as follows:

$$\text{CPU time} = \left( \text{CPI} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Cache miss rate} \times \text{Miss penalty} \right) \times \text{IC} \times \text{Clock cycle time}, \qquad (1)$$
where CPI and IC denote the clock cycles per instruction and the number of instructions executed, respectively [Hennessy and Patterson 1996]. Let us assume that the CPI is 1, memory accesses per instruction are 1.3, and the cache miss penalty for each read miss and write miss is 10 cycles. If the miss rate is reduced from 7% to 4%, the execution time becomes 20% shorter.

In terms of power consumption, a lower miss rate is also desirable. Cache misses consume more power not only because the main memory is activated but also because off-chip buses are driven. Liu and Svensson [1994] reported that the power for off-chip driving is dominant, up to 70% of the total chip power, and that its share keeps growing as transistor sizes scale down.

So far, various techniques for miss rate reduction have been proposed from the viewpoints of both hardware and software. Hardware-based miss rate reduction techniques include larger caches, higher associativity, hardware prefetching of code and data, and so on. Since these techniques require extra hardware cost, they cannot be applied in low-cost system design. Software-based techniques are compiler optimizations such as compiler-controlled prefetching of code and data, loop transformations for data caches, data layout for data caches, and code placement for instruction caches. Such software techniques achieve miss rate reduction without the overhead of modifying the hardware.

In this article, code placement is viewed as an optimization problem, and we present two code placement methods (a simplified method and a refined method). In these methods, application programs are placed in the main memory in such a way that instruction cache misses are minimized. Integer linear programming (ILP) is used to formulate a component problem (the trace placement problem). The simplified method employs trace selection (proposed by Hwu and Chang [1989]) and trace placement, which is formulated as an ILP problem.
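The arithmetic behind the 20% figure quoted above can be checked quickly. The sketch below is ours, not part of the original text; it evaluates only the bracketed factor of Eq. (1), since IC and the clock cycle time cancel when two miss rates are compared.

```python
# A quick numeric check of Eq. (1) with the figures quoted in the text
# (CPI = 1, 1.3 memory accesses per instruction, 10-cycle miss penalty).

def cycles_per_instruction(cpi, accesses_per_instr, miss_rate, miss_penalty):
    """The bracketed per-instruction factor of Eq. (1)."""
    return cpi + accesses_per_instr * miss_rate * miss_penalty

before = cycles_per_instruction(1.0, 1.3, 0.07, 10)  # miss rate 7%
after = cycles_per_instruction(1.0, 1.3, 0.04, 10)   # miss rate 4%
reduction = 1.0 - after / before                     # fraction of time saved
print(round(reduction, 2))  # 0.2, i.e., about 20% shorter
```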
Although the simplified method achieves a significant reduction in cache misses, it enlarges the code size. To solve this problem, we propose a refined code placement method that combines trace selection, trace merging, and trace placement. Trace merging is designed to satisfy the code size constraint set by the system designers. Our proposed methods are effective in reducing the power consumption of embedded systems as well as enhancing performance, without increasing hardware cost.
This article is organized as follows. In the following section, related work is discussed. In Section 3, we define a simplified code placement problem and formulate it as an ILP problem; experimental results are also shown in that section. In Section 4, we propose a refined code placement algorithm and show the experimental results. Section 5 presents conclusions and addresses our future work.

2. RELATED WORKS

Code placement methods for instruction caches have been studied by many researchers in the field of computer architecture [Hwu and Chang 1989; McFarling 1989, 1991]. Most methods use profile information containing the execution frequencies of basic blocks and functions. In particular, the IMPACT-I (Illinois Microarchitecture Project using Advanced Compiler Technology—Stage I) C Compiler developed by Hwu et al. introduced effective techniques for reducing cache miss rates [Hwu and Chang 1989]. The following techniques are employed in IMPACT-I.

Function inline expansion. Function calls with high execution counts are replaced with the function body where possible. Function inline expansion decreases both first-reference misses and conflict misses, and also eliminates the overhead of the function calls. However, it may lead to an increase in code size.

Trace selection. For each function, basic blocks that tend to execute in sequence are grouped into a trace. Trace selection helps to decrease both first-reference misses and conflict misses. We explain the trace selection technique in more detail in the following section.

Function layout. For each function, the trace containing the function entrance is placed first, and then the most important descendant is selected to be placed immediately after it. Function layout decreases cache conflicts among traces within the function.

Global layout. Functions that are close to each other in the execution order are placed so as not to cause cache conflicts. Global layout decreases cache conflicts among functions.
McFarling [1989] proposed a function placement method for instruction caches. The main characteristic of this method is that dependencies among functions are analyzed first. If function A is called in function B, the two functions are placed so that they do not conflict with each other. McFarling [1991] also proposed a technique to determine which functions should be merged, without considering code placement inside functions. Another effective technique, proposed by Chow (referred to in McFarling [1989]), is to sort functions according to their frequencies. This technique reduces cache conflicts among functions with high frequencies.

In the preceding methods, code placement is restricted to within a function, and therefore global optimization of code placement is difficult to achieve.
Fig. 1. Overview of the simplified code placement method.
Our proposed methods overcome this limitation by allowing code to be placed beyond function boundaries.

3. A SIMPLIFIED CODE PLACEMENT METHOD

3.1 Assumptions

In the simplified code placement problem, we concentrate on systems for which the following assumptions hold.

—The system has a Harvard architecture whose instruction and data caches are separated physically as well as logically.
—The instruction cache is direct mapped, or set associative with the LRU (least recently used) replacement algorithm.¹
—The CPU does not fetch instructions that may not be executed. This assumption means that the hit rate of branch prediction is 100% if the CPU has a pipelined architecture.
—The system can have a secondary cache, but our method takes no account of its behavior, since our objective is to minimize the miss rate of the primary (level-1) instruction cache.
—No other task runs during the execution of a program.

3.2 Code Placement Method Overview

We show an overview of the simplified code placement method in Figure 1. The inputs of our algorithm are an assembly code, profile information on the program, and a specification of the memory organization. The profile information is one or more sequences of basic blocks that are accessed when a typical data set is input to the program. For example, in Figure 2(a), a program consists of two functions, A and B. Each node in the graph, denoted by b_i, represents a basic block, and each directed edge represents a control dependency between basic blocks. A number associated with each
¹ The LRU algorithm replaces the cache line that has been unused for the longest time. A cache line (block) is the minimum unit of instructions/data that can be present in the cache [Hennessy and Patterson 1996].
edge (b_i, b_j) is given by

$$\sum_{k} \mathrm{Prob}_k \cdot F_{(i,j),k}\,, \qquad (2)$$

Fig. 2. Weighted control flow graph and traces.
where Prob_k is the probability that data set D_k will be input to the application program, and F_(i,j),k is the number of times that data set D_k passes through edge (b_i, b_j) in one execution. Furthermore, we assume that function B is called at the end of basic block b_4 in function A. The profile information for the program contains the following basic block sequence:
$$(b_0, b_2, b_3, b_4, b_6, b_7, b_8, b_3, b_4, b_6, b_8, b_3, \ldots, b_3, b_5). \qquad (3)$$
This basic block sequence indicates a program execution: basic block b_0 is executed first, then b_2; the five basic blocks {b_3, b_4, b_6, b_7, b_8} are executed iteratively; and finally b_5 is executed. A specification of the memory organization consists of the line size, the numbers of sets and ways of the instruction cache, and the size of the instruction memory. The output is an assembly code of the program whose cache miss rate is minimized. The simplified method consists of two techniques, trace selection and trace placement (see Figure 1). Trace selection decreases first-reference misses and conflict misses, and trace placement decreases conflict misses.

3.3 Trace Selection

Trace selection preprocesses a given assembly code and the profile information to construct a weighted control flow graph. Figure 2(a) shows a weighted control flow graph. In this figure, the likelihood of b_2 being executed after b_0 is greater than that of b_1. In this case, b_2 should be placed just after b_0, since the two basic blocks are probably on the same cache line. Similarly, b_4 should be placed before b_3, b_5 after b_3, b_7 after b_6, and b_8 after b_7. Consequently, the weighted control flow graph (a) is partitioned into the four paths (linear subgraphs) shown in Figure 2(b). Each path is called a
Fig. 3. Trace placement.
trace. The term “trace” was introduced by Fisher [1981] in the context of global microcode compaction. The idea of traces was extended in the IMPACT-I C Compiler [Hwu and Chang 1989] to reduce miss rates of instruction caches. The trace selection problem is formally defined as follows: for a given weighted control flow graph, partition the graph into traces in such a way that the sum of the weights of the edges inside the traces is maximized.
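A greedy heuristic for this partitioning (the kind of algorithm used for trace selection in our experiments, Section 3.6) can be sketched as follows. This is our illustrative sketch; the block names and edge weights below are stand-ins for the profile of Figure 2(a), not the paper's actual data.

```python
# Hedged sketch of greedy trace selection on the weighted control flow
# graph: edges are visited in decreasing weight order, and an edge (u, v)
# is accepted when u has no chosen successor, v has no chosen predecessor,
# and the link would not close a cycle inside one trace.

def select_traces(blocks, edges):
    """blocks: list of block ids; edges: {(u, v): weight}. Returns traces."""
    nxt, prv = {}, {}
    for (u, v), _w in sorted(edges.items(), key=lambda e: -e[1]):
        if u in nxt or v in prv or u == v:
            continue
        w = v                     # walk forward from v; reaching u => cycle
        while w in nxt:
            w = nxt[w]
        if w == u:
            continue
        nxt[u], prv[v] = v, u
    traces = []
    for b in blocks:              # a block without a predecessor starts a trace
        if b not in prv:
            trace = [b]
            while trace[-1] in nxt:
                trace.append(nxt[trace[-1]])
            traces.append(trace)
    return traces

blocks = [f"b{i}" for i in range(9)]
weights = {("b4", "b3"): 10, ("b0", "b2"): 9, ("b3", "b4"): 8,
           ("b6", "b7"): 7, ("b7", "b8"): 6, ("b8", "b3"): 5,
           ("b4", "b6"): 4, ("b2", "b3"): 3, ("b0", "b1"): 2,
           ("b3", "b5"): 1}
print(select_traces(blocks, weights))
# [['b0', 'b2'], ['b1'], ['b4', 'b3', 'b5'], ['b6', 'b7', 'b8']]
```

With these assumed weights, the greedy pass recovers the four traces of Figure 2(b), including the loop trace (b_4, b_3, b_5) that starts with the heavily weighted back edge.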
Note that the last instruction of each trace must be either an unconditional jump operation or an exit of the function. This property enables us to place traces in arbitrary order. A trace is the minimum unit of machine instructions that can be placed without inserting extra jump operations.

3.4 Trace Placement

After trace selection, the order of the extracted traces is determined so as to minimize cache misses caused by cache conflicts. To accomplish this, we introduce a new concept, the pseudo-memory, an imaginary main memory. First, the traces are placed in the pseudo-memory in an arbitrary order, with the restriction that no two traces may be placed in the same pseudo-memory block. Here, a memory block is a block in the main memory that is mapped onto one cache line, and a pseudo-memory block is a block in the pseudo-memory. An example of trace placement in a pseudo-memory is depicted in Figure 3(a), in which the b_i's and p_j's represent basic blocks and pseudo-memory blocks, respectively. By virtue of the pseudo-memory, trace placement can be considered as a matching problem between pseudo-memory blocks and memory blocks. For example, in Figure 3(a) and (b), where the m_k's represent memory blocks, the arrows between p_j's and m_k's represent matches. In the following section, we formulate the trace placement problem as an ILP problem. In the rest of this section, we describe how the trace
Fig. 4. Contents of the cache during execution.
placement reduces cache misses, using the examples shown in Figures 2 and 3. Once the traces have been placed in the pseudo-memory, the sequence of pseudo-memory blocks that are accessed during an execution of the program is uniquely determined. In this article, we call the sequence of pseudo-memory blocks accessed during the execution of the program on given data (an input data set to the program) the access sequence of pseudo-memory blocks, or simply the access sequence. Given the pseudo-memory illustrated in Figure 3(a) and the basic block sequence (3) in Section 3.2, the following access sequence of pseudo-memory blocks is obtained:
$$(p_0, p_1, p_3, p_5, p_6, p_7, p_3, p_5, p_7, p_3, \ldots, p_3, p_4). \qquad (4)$$
The preceding access sequence tells us that the four pseudo-memory blocks p_3, p_5, p_6, and p_7 are accessed iteratively. We assume a direct-mapped cache with 4 cache lines. If we place the traces in the main memory in the same order as in Figure 3(a), cache misses occur frequently because of the cache conflicts between p_3 and p_7. The contents of the cache during the execution are illustrated in Figure 4(a). In Step 7, p_3 is displaced from the cache by p_7, although p_3 will be executed just after p_7. An optimal trace placement is shown in Figure 3(b), and the contents of the cache during the execution are shown in Figure 4(b), where no cache conflict occurs.

Because of the restriction that different traces must be placed in different pseudo-memory blocks, our method leaves many small fragments in the main memory where no instructions reside. In practice, these memory fragments must be filled with no-ops or other instructions. These extra instructions are never executed, but they cause an increase in code size. A refined method that overcomes this shortcoming is proposed in Section 4.
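The effect illustrated in Figure 4 can be reproduced with a few lines of simulation. This sketch is ours: the access sequence is an assumed expansion of sequence (4) with three loop iterations, and the "tuned" placement is a hypothetical conflict-free assignment (respecting the consecutive-blocks-per-trace restriction), not the paper's actual Figure 3(b) layout.

```python
# A small direct-mapped cache simulation, in the spirit of Figure 4.

def count_misses(access_seq, placement, n_lines):
    """placement maps a pseudo-memory block to its main-memory block;
    a direct-mapped cache maps memory block m to line (m mod n_lines)."""
    lines = [None] * n_lines
    misses = 0
    for p in access_seq:
        line = placement[p] % n_lines
        if lines[line] != p:
            misses += 1
            lines[line] = p
    return misses

seq = ["p0", "p1", "p3", "p5", "p6", "p7",
       "p3", "p5", "p7", "p3", "p5", "p7", "p3", "p4"]

naive = {f"p{i}": i for i in range(8)}            # Figure 3(a) order: p3, p7 clash
tuned = {"p0": 6, "p1": 7, "p2": 8, "p3": 0,      # hypothetical: loop blocks p3,
         "p4": 1, "p5": 13, "p6": 14, "p7": 15}   # p5, p6, p7 in distinct lines

print(count_misses(seq, naive, 4), count_misses(seq, tuned, 4))  # 12 7
```

Under the naive placement, p_3 and p_7 both map to line 3 and evict each other on every loop iteration; the tuned placement leaves only the seven cold misses.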
3.5 ILP Formulation of the Trace Placement Problem

In this section, we define the trace placement problem and formulate it as an ILP problem.

3.5.1 Definitions. We use the following definitions:

N_set : the number of sets of the instruction cache
N_way : the number of ways of the instruction cache
N_mem : the number of blocks of the instruction memory
N_p : the number of blocks of the pseudo-memory

Main memory blocks are numbered from 0 to (N_mem − 1) consecutively; each memory block has a unique integer number that is used for its identification. Pseudo-memory blocks are likewise numbered from 0 to (N_p − 1) consecutively. Moreover, we use the definitions:

p_i : the i-th pseudo-memory block (i = 0, 1, ..., N_p − 1)
P : the set of pseudo-memory blocks; P = ∪_i {p_i}
x_i : the memory block number where p_i is placed
X : the vector of the x_i's; X = (x_0, x_1, ..., x_{N_p−1})
t_j : the ordered set of pseudo-memory blocks that constitute the j-th trace; t_j ∈ P × P × ...
T : the set of the t_j's; T = ∪_j {t_j}
In the example illustrated in Figure 3(a), the four traces (b_0, b_2), (b_1), (b_4, b_3, b_5), and (b_6, b_7, b_8) are placed in pseudo-memory blocks (p_0, p_1), (p_2), (p_3, p_4), and (p_5, p_6, p_7), respectively. Then T in this example is defined as follows:

$$t_0 = (p_0, p_1), \quad t_1 = (p_2), \quad t_2 = (p_3, p_4), \quad t_3 = (p_5, p_6, p_7),$$
$$T = \{t_0, t_1, t_2, t_3\}. \qquad (5)$$
For each p_i, we define sets of pseudo-memory blocks c_{i,l}:

$$c_{i,l} = \{\, p_j \mid p_j\ (i \neq j) \text{ appears between the } l\text{-th and the } (l+1)\text{-th appearance of } p_i \text{ in the access sequence of pseudo-memory blocks} \,\}. \qquad (6)$$

In the access sequence, the same c_{i,l} may appear more than once. In order to remove this redundancy, we define C_i as the set of the c_{i,l}'s, that is, C_i = {c_{i,l}}, and let a_{i,k} (k = 0, 1, ..., |C_i| − 1) denote an element of C_i. Note that a_{i,k} ≠ a_{i,k′} if k ≠ k′. As an example, we show how the a_{3,k}'s are defined. In the access sequence of pseudo-memory blocks (4) in Section 3.4, there are two different intervals between accesses to p_3, viz. {p_5, p_6, p_7} and {p_5, p_7}. Then the a_{3,k}'s
are defined as follows:

$$a_{3,0} = \{p_5, p_6, p_7\}, \quad a_{3,1} = \{p_5, p_7\}.$$

Next, we define e_{i,k} as the number of times a_{i,k} appears in the access sequence of pseudo-memory blocks. In the preceding example, the e_{3,k}'s are defined as follows:

$$e_{3,0} = 10, \quad e_{3,1} = 5.$$
We assume an n-way set associative cache. If more than (n − 1) pseudo-memory blocks in a_{i,k} are mapped onto the same cache set as p_i, cache misses occur at least e_{i,k} times. We represent A_{i,k} as the tuple of a_{i,k} and e_{i,k} (i.e., A_{i,k} = (a_{i,k}, e_{i,k})), and define the set of the A_{i,k}'s as A_i:

$$A_i = \bigcup_k \{A_{i,k}\}. \qquad (7)$$
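The interval sets a_{i,k} and their counts e_{i,k} can be extracted mechanically from an access sequence, following definition (6). The helper below is our sketch (names are ours), shown on a short truncated stand-in for sequence (4).

```python
# Hedged sketch: deriving the interval sets a_{i,k} and their occurrence
# counts e_{i,k} for one pseudo-memory block from an access sequence.
from collections import Counter

def interval_sets(access_seq, p):
    """Map each distinct set of other blocks seen between consecutive
    accesses to p (an a_{i,k}) to its occurrence count (the e_{i,k})."""
    counts = Counter()
    between, seen = set(), False
    for q in access_seq:
        if q == p:
            if seen:
                counts[frozenset(between)] += 1
            between, seen = set(), True
        elif seen:
            between.add(q)
    return dict(counts)

seq = ["p0", "p1", "p3", "p5", "p6", "p7", "p3", "p5", "p7", "p3"]
# For p3, this short sequence yields the two interval sets of the text,
# {p5, p6, p7} and {p5, p7}, each seen once here.
print(interval_sets(seq, "p3"))
```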
3.5.2 Problem Definition. The trace placement problem is defined as follows: for given N_set, N_way, N_mem, P, T, and the A_i's for all p_i's, find X that minimizes the cache misses.

This problem can be neatly formulated as an ILP problem. The objective function is defined in formula (8). The value of M(X) represents the number of cache misses when the program is executed once:

$$M(X) = \sum_{p_i \in P} \; \sum_{(e_{i,k},\, a_{i,k}) \in A_i} e_{i,k} \times \mathit{replaced}(a_{i,k}) + \mathit{Constant}. \qquad (8)$$

Here, Constant is the number of cache misses caused by first references, which is determined by trace selection independently of trace placement. The value of the function replaced(a_{i,k}) is 1 if the number of pseudo-memory blocks in a_{i,k} that are mapped onto the same cache set as p_i is greater than or equal to N_way. In other words, replaced(a_{i,k}) is 1 if p_i is removed from the cache by another pseudo-memory block in a_{i,k}, so that p_i needs to be transferred to the cache again at its next access. The function replaced(a_{i,k}) is formulated as follows:
$$\mathit{replaced}(a_{i,k}) = \begin{cases} 1 & \text{if } \displaystyle\sum_{p_{i'} \in a_{i,k}} \mathit{conflict}(x_i, x_{i'}) \ge N_{way} \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$
The value of conflict(x_i, x_{i′}) is 1 if the two pseudo-memory blocks p_i and p_{i′} are mapped onto the same cache set; otherwise it is 0. The function conflict(x_i, x_{i′}) is defined by the following formula:

$$\mathit{conflict}(x_i, x_{i'}) = \begin{cases} 1 & \text{if } (x_i \bmod N_{set}) = (x_{i'} \bmod N_{set}) \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$
In formula (10), (x_{i′} mod N_set) denotes the integer residue of x_{i′} divided by N_set. The constraints on M(X) are expressed in the following three formulas:

$$0 \le x_i \le N_{mem} - 1, \qquad 0 \le i \le N_p - 1 \qquad (11)$$
$$i \neq i' \;\Rightarrow\; x_i \neq x_{i'} \qquad (12)$$
$$(\ldots, p_i, p_{i+1}, \ldots) \in T \;\Rightarrow\; x_i = x_{i+1} - 1. \qquad (13)$$
Formula (11) ensures that all pseudo-memory blocks are mapped to physical memory blocks. Formula (12) ensures that no two pseudo-memory blocks may be placed in the same memory block. In other words, the first two constraints guarantee that each pseudo-memory block corresponds to a unique memory block. Formula (13) ensures that consecutive pseudo-memory blocks constituting a trace are placed in consecutive main memory blocks.

3.5.3 Linearization. The functions replaced(a_{i,k}) and conflict(x_i, x_{i′}) defined previously are nonlinear. In this section, we explain the linearization of these two functions. First, we prepare new variables y_{i,i′} and z_{i,i′} whose ranges are
$$y_{i,i'} \in \{0, 1\}, \qquad z_{i,i'} \in \mathbf{Z}, \qquad (14)$$

where Z is the set of integers. Intuitively, y_{i,i′} holds the value of conflict(x_i, x_{i′}), and z_{i,i′} holds the integer value of (x_i − x_{i′})/N_set. Then formula (10) is replaced by formulas (14) through (17):

$$0 \le (x_i - x_{i'}) - N_{set} \cdot z_{i,i'} < N_{set} \qquad (15)$$
$$(x_i - x_{i'}) - N_{set} \cdot z_{i,i'} + y_{i,i'} \cdot U \neq 0 \qquad (16)$$
$$(x_i - x_{i'}) - N_{set} \cdot z_{i,i'} - (1 - y_{i,i'}) \cdot U \le 0. \qquad (17)$$
Here, U is a large integer.² Next, we prepare variables w_{i,k} whose ranges are

$$w_{i,k} \in \{0, 1\}. \qquad (18)$$

² U is sufficiently large if it is larger than the length of the access sequence of pseudo-memory blocks.
Intuitively, w_{i,k} holds the value of replaced(a_{i,k}). Then formula (9) is replaced by the three formulas (18) through (20):

$$\sum_{p_{i'} \in a_{i,k}} y_{i,i'} + (1 - w_{i,k}) \cdot U \ge N_{way} \qquad (19)$$
$$\sum_{p_{i'} \in a_{i,k}} y_{i,i'} - w_{i,k} \cdot U < N_{way}. \qquad (20)$$
The objective function M(X) is redefined as follows:

$$M(X) = \sum_{p_i \in P} \; \sum_{(e_{i,k},\, a_{i,k}) \in A_i} e_{i,k} \cdot w_{i,k} + \mathit{Constant}. \qquad (21)$$
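Before handing the linearized model to a solver, a candidate placement can be scored directly from the nonlinear definitions (8)–(10). The sketch below is ours (data structures and names are assumptions); it reuses the p_3 example, with (a_{3,0}, e_{3,0}) = ({p_5, p_6, p_7}, 10) and (a_{3,1}, e_{3,1}) = ({p_5, p_7}, 5).

```python
# Hedged sketch: evaluating the objective M(X) directly from Eqs. (8)-(10)
# for a candidate placement, without the ILP linearization.

def conflict(xi, xj, n_set):
    """Eq. (10): 1 if two memory blocks map onto the same cache set."""
    return int(xi % n_set == xj % n_set)

def replaced(a, i, x, n_set, n_way):
    """Eq. (9): 1 if at least N_way blocks of a share p_i's cache set."""
    return int(sum(conflict(x[i], x[j], n_set) for j in a) >= n_way)

def misses(x, A, n_set, n_way, constant=0):
    """Eq. (8): x maps block id -> memory block number; A maps block id
    -> list of (e_ik, a_ik) pairs."""
    return constant + sum(e * replaced(a, i, x, n_set, n_way)
                          for i, pairs in A.items() for e, a in pairs)

A = {"p3": [(10, ["p5", "p6", "p7"]), (5, ["p5", "p7"])]}
naive = {"p3": 3, "p5": 5, "p6": 6, "p7": 7}      # p3 and p7 share set 3
tuned = {"p3": 0, "p5": 13, "p6": 14, "p7": 15}   # all in distinct sets

print(misses(naive, A, 4, 1), misses(tuned, A, 4, 1))  # 15 0
```

With N_set = 4 and N_way = 1, the naive layout pays e_{3,0} + e_{3,1} = 15 conflict misses on p_3 alone, while the conflict-free layout pays none; the ILP searches for an X that drives this sum to its minimum.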
We now obtain the mathematical formulation of the trace placement problem, in which the objective function is formula (21) and the constraints are formulas (11)–(20), all in linear form.

3.6 Experiments

In order to evaluate the simplified code placement method, we apply the four code placement methods described below to four benchmark programs, and calculate the cache miss counts and miss rates for each method.

NONE. Benchmark programs are compiled with the SunPro SPARCompiler C 3.0. No code placement technique for instruction caches is applied.

TS. After translating the benchmark programs into assembly code with the same compiler as NONE, trace selection is applied. A greedy algorithm is used for trace selection.

TSFS. After trace selection, functions are sorted according to their frequencies. This technique is helpful in reducing cache conflicts among functions that are frequently executed.

TSTP. The simplified code placement method proposed in this section is applied, which employs both the trace selection and trace placement techniques. We use a local search algorithm for the ILP problem.
We use the SPARC instruction set as the target architecture, and use four UNIX application programs as benchmarks. Characteristics of the benchmarks are summarized in Table I. We assume that the code size of each C library function is zero, since we could not analyze their control flow dependencies.

First, we vary the associativity of the instruction cache from 1 to 8 and calculate the cache miss count and the miss rate for each of the four methods. The cache size and the line size are fixed at 1K bytes and 32 bytes, respectively. The experimental results are summarized in Table II. Second, we vary the cache size from 256 bytes to 2K bytes and again calculate the cache miss counts and miss rates. Here, we assume
Table I. Benchmark Programs

Table II. Cache Miss Counts and Cache Miss Rates when Varying Associativity
direct-mapped caches (i.e., associativity 1), and the cache line size is fixed at 32 bytes. The results are summarized in Table III. These results indicate that the proposed method (TSTP) achieves the lowest cache miss rate under all conditions. TSTP achieves a 30% decrease in cache misses on average (max. 77%) compared with NONE. As for the cache miss rate, TSTP reduces it by 2.2% (max. 6.6%). In most cases, TSTP achieves lower miss rates than NONE with a cache of double the size, and there
Table III. Cache Miss Counts and Cache Miss Rates when Varying Cache Size
exist some cases where TSTP achieves lower miss rates than NONE with a cache of quadruple the size. TS achieves an 11% decrease in cache misses and a 0.9% reduction in the miss rate on average, which confirms the effectiveness of trace selection. However, TSTP outperforms the others owing to trace placement that is not limited by function boundaries. Table II also gives an interesting indication: if programs are placed in the main memory optimally or suboptimally, higher associativity of instruction caches with the LRU algorithm does not guarantee lower miss rates.

Even though our simplified method achieves a significant reduction in cache miss rates, it requires a long computation time to solve the ILP problem, and obtaining an optimal solution for large programs may not be practical. We have implemented a solver for the ILP problem which employs a local search algorithm. The solver required 1.5–6.3 hours for grep and 6.8–38 hours for sed to find a locally optimal solution on a SPARCstation 5 (microSPARC-II, 85 MHz, 32 MB, Solaris 2.4). More details on computation time are shown in Table IV.

As mentioned in Section 3.4, our simplified method leaves a lot of small unusable code fragments in the main memory, which results in a large code size.
Table IV. Computation Time (CPU Time)
Experimental results show that the code size with the proposed optimization is increased by 13–30% (see Table V). The memory size usually varies discretely (e.g., 512 K bytes, 1 M bytes, 2 M bytes, and so on). Because of this, a slight increase in code size may not necessarily lead to an increase in memory size. However, in most embedded system designs, the code size is one of the most severe design constraints for reducing system cost. Since the simplified code placement method may violate the code size constraint, we further propose a refined code placement method in which a code size constraint provided by the system designers must be satisfied.

4. A REFINED CODE PLACEMENT METHOD

In this section, we present a technique called trace merging, which extends the simplified method previously discussed. This technique limits the increase in code size. Using trace merging, we propose a refined code placement method that minimizes the cache miss rate while explicitly satisfying the code size constraint provided by the system designers.

4.1 Trace Merging

As mentioned in Section 3.4, the simplified code placement method leaves a lot of unusable code in the main memory, and inevitably makes the code size larger. Let S_line and s_i denote the size of a cache line and the size of the i-th trace, respectively. The total size of the unusable code, S_unused, is expressed by the following equations:
$$S_{unused} = \sum_i \mathit{unused}(s_i) \qquad (22)$$

$$\mathit{unused}(s_i) = \begin{cases} 0 & \text{if } (s_i \bmod S_{line}) = 0 \\ S_{line} - (s_i \bmod S_{line}) & \text{otherwise.} \end{cases} \qquad (23)$$
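Equations (22) and (23) transcribe directly into code. The sketch below is ours, with illustrative trace sizes:

```python
# Equations (22) and (23) in code form: the padding a trace needs to be
# rounded up to a cache-line boundary, and the total over all traces.

def unused(size, s_line):
    """Eq. (23): padding that rounds one trace up to a line boundary."""
    return 0 if size % s_line == 0 else s_line - size % s_line

def total_unused(trace_sizes, s_line):
    """Eq. (22): total padding over all traces."""
    return sum(unused(s, s_line) for s in trace_sizes)

print(total_unused([40, 64, 20], 32))  # 24 + 0 + 12 = 36 padding bytes
```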
The two equations (22) and (23) suggest three approaches to decreasing the unusable code:

(1) use a cache with a smaller line size;
(2) make the number of traces small (decreasing the number of traces will result in a smaller code size); and
Table V. Comparison of Code Size
(3) make (s_i mod S_line) zero or close to S_line. If the size of each trace is a multiple of the cache line size, the code size does not increase.

Using a cache with a small line size surely reduces the unusable code. However, the cache line size gravely affects the miss rate, and very small line sizes generally result in high miss rates. Moreover, modifying the cache organization may not be an option in some cases of system design. Hence, software-based techniques are required for code size reduction. We propose a software technique, called trace merging, to realize the last two approaches. Trace merging combines several traces in such a way that the size of the merged trace becomes a multiple of the cache line size. Trace merging is performed between trace selection and trace placement in the simplified method.

As the number of traces decreases, the cache misses may increase. This is because, in the ILP formulation, there is a constraint that consecutive pseudo-memory blocks must be placed in consecutive memory blocks (see Section 3.4). If two traces are merged, an additional such constraint for the merged trace has to be included in the ILP formulation. Normally, as the constraints of an ILP problem increase, the quality of its solution becomes worse. Hence, it is desirable to keep as many pseudo-memory blocks as possible nonconsecutive, since nonconsecutive blocks do not have to satisfy the constraint of consecutive placement in the main memory. In other words, trace merging should be applied as little as possible.

4.2 A Refined Code Placement Algorithm

Based on the previous discussion, we propose a refined code placement method. The algorithm is shown in Figure 5. We use the following notations in the algorithm:

N_mem : the maximum number of memory blocks for the program; N_mem is the constraint on code size set by the system designers
S_inst : the length of instructions of the CPU; if the CPU has variable instruction lengths, S_inst denotes the smallest one
S_line : the cache line size
N_set : the number of cache sets
N_way : the number of cache ways
T : the set of traces before trace merging
T_merged : the set of traces after trace merging
AP : the assembly code of the application program to be placed
PI : the profile information on the application program
Fig. 5. Refined code placement algorithm.
Function Sizewith(T) returns the total size of the traces in T including the unusable code, and Sizewithout(T) returns the total size of the traces in T excluding the unusable code. The refined code placement algorithm is summarized as follows.

(1) Perform trace selection, and set T to the set of selected traces.
(2) Set Tmerged to ∅, move all traces in T whose size is a multiple of the cache line size to Tmerged, and set n to 2.
(3) If Sizewith(T ∪ Tmerged) ≤ Nmem, go to Step (7).
(4) If there exists T′ ⊆ T such that |T′| = n and Sizewithout(T′) is a multiple of Sline, go to Step (6).
(5) Increment n by 1. If n > Sline/Sinst, merge all traces in T into a new trace t′, add t′ to Tmerged, and go to Step (8). Otherwise, go to Step (3).
(6) Merge the traces in T′ into a new trace t′, remove all traces in T′ from T, add t′ to Tmerged, and go to Step (3).
(7) Move all traces in T into Tmerged.
(8) Perform trace placement.

4.3 Experiments

We have applied the refined code placement method to the benchmark program "grep" specified in Table I. We vary the constraint on code size, and calculate cache miss counts and miss rates for various cache organizations. The constraint on code size is given in terms of the number of memory blocks for the program, denoted by Nmem. The target architecture is the SPARC architecture. The experiments are performed on a SPARCstation 5 (microSPARC-II, 85 MHz, 32 MB, Solaris 2.4).

First, we assume a 1 KB cache with 32-byte cache lines, and vary the constraint on code size and the cache associativity. Experimental results are shown in Table VI.

Table VI. Miss Rate and Computation Time Varying Associativity

For each associativity, Nmem is set to 2,673, 2,271, and 2,166. In Case (a), trace merging is not applied (i.e., the simplified method is used). In Case (b), the range of the variable n in the refined code placement algorithm is restricted to 1 ≤ n ≤ 2; in other words, only two traces are merged at a time. In Case (c), trace merging is repeated without any limitation on the range of n, so that no unusable code remains. Case (d) shows the results of a naive placement, in which no code placement optimization is applied. The numbers in parentheses in the first column give the code size in kilobytes. The fourth column of each table shows the CPU time required to find a locally optimal solution of the ILP problem in trace placement.

Next, we assume a direct-mapped cache with 32-byte cache lines, and vary the constraint on code size and the cache size. The results are shown in Table VII.

In Table VI and Table VII, the constraint on code size causes no significant difference in cache miss counts. The refined code placement achieves about a 36% decrease in cache misses on average compared with the naive placement. Table VII also indicates that the cache miss rates with the refined code placement optimization are still lower than those obtained with a cache of twice the size and the naive placement. On the other hand, the experimental results show a tendency for trace merging to shorten the computation time. These results demonstrate that the trace merging technique is effective in reducing computation times as well as code sizes, while keeping miss rates low.
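The merging loop of the refined algorithm (Steps 2 through 7) can be rendered in a few lines of Python. This is our own illustrative sketch, not the authors' implementation: each trace is represented simply by its size in bytes, and we assume that one memory block equals one cache line, so the Nmem constraint becomes n_mem * s_line bytes.

```python
import itertools

def size_with(traces, s_line):
    # total size including padding up to the next line boundary
    return sum((s + s_line - 1) // s_line * s_line for s in traces)

def refined_merge(trace_sizes, s_line, s_inst, n_mem):
    """Sketch of the trace merging loop (Steps 2-7 of the refined algorithm)."""
    # Step (2): traces already a multiple of the line size need no merging
    T = [s for s in trace_sizes if s % s_line != 0]
    T_merged = [s for s in trace_sizes if s % s_line == 0]
    n = 2
    # Step (3): repeat while the code size constraint is violated
    while size_with(T, s_line) + sum(T_merged) > n_mem * s_line:
        # Step (4): find n traces whose combined size is a line multiple
        found = None
        for combo in itertools.combinations(T, n):
            if sum(combo) % s_line == 0:
                found = combo
                break
        if found:
            # Step (6): merge them into one trace with no unusable code
            for s in found:
                T.remove(s)
            T_merged.append(sum(found))
        else:
            # Step (5): enlarge the subset size, or give up and merge all
            n += 1
            if n > s_line // s_inst:
                T_merged.append(sum(T))
                T = []
                break
    # Step (7): remaining traces are kept as they are
    T_merged.extend(T)
    return T_merged
```

For example, with 32-byte lines, 4-byte instructions, and a 4-block budget, the traces of sizes 40, 24, 30, and 34 bytes are merged pairwise into two 64-byte traces, eliminating all padding.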
Table VII. Miss Rate and Computation Time Varying Cache Size
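Miss counts such as those reported in Tables VI and VII are obtained by trace-driven cache simulation. A minimal sketch for a direct-mapped instruction cache follows; this is our own illustrative code, not the simulator used for the experiments:

```python
def count_misses(addresses, cache_bytes, line_bytes=32):
    """Trace-driven miss counting for a direct-mapped instruction cache."""
    n_sets = cache_bytes // line_bytes
    tags = [None] * n_sets          # one resident block per set
    misses = 0
    for addr in addresses:
        block = addr // line_bytes  # memory block holding this instruction
        idx = block % n_sets        # direct-mapped index
        if tags[idx] != block:      # miss: fetch the line, evict the old one
            tags[idx] = block
            misses += 1
    return misses
```

With a 64-byte cache and 32-byte lines, the address sequence 0, 4, 32, 0 causes two misses (one per distinct block), while 0, 64, 0 causes three, since blocks 0 and 2 conflict in the same set.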
When trace merging is not applied, the code size increases by 25% compared with the naive placement. Even when Nmem is 2,166, the refined method makes the code size 1.6% larger. The cause of this increase lies in the way trace selection is performed, as exemplified by the program illustrated in Figure 6(a). The naive placement places the basic blocks in main memory in the order shown in Figure 6(b), and the placement after trace selection is shown in Figure 6(c). No jump operation is required in the naive placement, whereas a jump operation is required after trace selection at the exit of b1. The code size after trace selection therefore becomes larger because of the added jump operation. This example illustrates that trace selection works against an explicit attempt to minimize the code size. Therefore, if the code size is the first priority, traces should be selected in such a way that the number of traces is minimized, before trace merging and trace placement are applied.

5. CONCLUSIONS

In this article, we have proposed two code placement methods (a simplified method and a refined one) for miss rate reduction of instruction caches. The methods are effective in reducing the power consumption of embedded systems as well as in enhancing their performance. The simplified code placement problem is formulated as an ILP problem, by which an optimal placement can be obtained. Although an optimal placement for large applications cannot be obtained in a reasonable time, the suboptimal solutions obtained by our methods are sufficiently effective. Experimental results show that the simplified method reduces cache misses by an average of 30% (max. 77%). We have further proposed a refined code placement method in which a code size constraint provided by the system designers must be satisfied. Experimental results show that the refined method achieves almost the same miss rates as the simplified method.
Fig. 6. An example of trace selection that makes code size larger.
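The effect shown in Figure 6 can be quantified by counting the fall-through edges that a layout breaks, since every broken fall-through edge costs one inserted jump. A small sketch (the block names b1 through b4 are hypothetical, for illustration only):

```python
def extra_jumps(fallthrough, layout):
    """Count jumps that must be inserted when a layout breaks
    fall-through edges b -> fallthrough[b]."""
    pos = {b: i for i, b in enumerate(layout)}
    jumps = 0
    for b, succ in fallthrough.items():
        # a jump is needed unless succ is placed immediately after b
        if pos[succ] != pos[b] + 1:
            jumps += 1
    return jumps
```

If b2 falls through from b1 and b4 from b3, the naive order [b1, b2, b3, b4] needs no jumps, while a trace layout [b1, b3, b4, b2] breaks the b1-to-b2 edge and needs one jump at the exit of b1, as in Figure 6(c).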
Although our code placement methods achieve significant reductions in cache miss rates, they require a long execution time to solve the ILP problem. Our current implementation employs a local search algorithm, and it can take a prohibitively long computation time; more efficient algorithms, such as heuristics, are required for large applications. Our code placement methods are designed to minimize the average execution time of application programs, without an attempt to minimize the worst-case execution time. The proposed methods are therefore suitable for applications that require high average speed rather than for real-time applications, since the worst-case execution time is generally the first priority in hard real-time systems. Developing code placement methods that minimize the worst-case execution time is also an interesting research avenue for us.

ACKNOWLEDGMENTS
The authors would like to thank Kaoru Yamamoto of the Institute of Systems and Information Technologies/KYUSHU for carefully correcting grammatical errors and inadequate expressions.
Received January 1997; revised June 1997; accepted July 1997