Code Reordering for Multi-level Cache Hierarchies
John Kalamatianos and David R. Kaeli
Northeastern University, Department of Electrical and Computer Engineering, Boston, MA 02115, USA
E-mail: kalamat, [email protected]

Abstract
As the gap between memory and processor performance continues to grow, it becomes increasingly important to exploit cache memory effectively. Both hardware and software techniques can be used to better utilize the cache. Many software solutions produce new program layouts to better utilize the available memory and cache address space. In this paper we present a new link-time code reordering algorithm targeted at reducing the frequency of misses in all levels of a cache hierarchy. In past work we focused solely on eliminating cache conflicts in the first-level cache using either local (the calling frequency between procedures) or global (an approximation of the worst-case number of conflict misses that can occur between two procedures) temporal information. We employed cache line coloring to guide the placement of procedures in the first-level cache address space. In this work we propose exploiting temporal-based procedure and intraprocedural basic block reordering to avoid misses in a multi-level cache hierarchy. We use global temporal information to reposition procedures and use visitation counts to restructure a procedure's body so that the most frequently visited sequences of basic blocks lie at the beginning of the executable. We then propose a modified cache line coloring algorithm which attempts to find a conflict-free mapping between two procedures in a given multi-level cache hierarchy. Currently, our implementation targets two levels of cache but can be easily extended to additional levels. We also perform memory-based procedure placement to keep executable images compact. Using both C and C++ applications, we explore the effectiveness of reordering code when considering multiple cache levels. Using a modified version of the SimpleScalar toolset, we have found that our new placement algorithms can reduce the total cycle count by as much as 18%. We also discuss why multi-level cache reordering does not provide an advantage in all cases, and suggest how to remedy this situation.
1 Introduction

Cache memories are found on most microprocessors designed today. A cache exploits the temporal and spatial locality present in programs developed in high-level languages. Caching the instruction stream can be very beneficial since instruction references exhibit a high degree of spatial and temporal locality. Still, cache misses will occur for one of three reasons: (i) first-time reference, (ii) finite cache capacity, or (iii) memory address conflict. Our work here is focused on reducing memory address conflicts. Our goal is to rearrange a program in the available memory space in order to reduce the probability of conflict during run time. This is accomplished by avoiding the mapping of two highly interacting code units to the same portion of the cache. We work at a fine code unit
granularity, rearranging basic blocks contained within procedures, but we do not attempt to split procedures [1]. We organize basic blocks into sequences based on the transition frequencies between pairs of basic blocks, and place them so that a procedure's body consists of hot and cold regions. At a coarser level, we identify pairs of procedures that would compete for the same cache address space if allowed to overlap. We use temporal procedure interaction occurring in a finite window, equal in size to the number of unique cache lines which could possibly live in the first-level cache of the cache hierarchy under study. Procedure interaction for procedures outside of this window is not of interest, since it is estimated that the cache's finite capacity will be exhausted if a reference falls outside of this window. We then place the highly interacting procedures in a multi-level cache hierarchy using cache line coloring. The coloring algorithm takes into account the hot areas of each procedure when performing the mapping. Its goal is to achieve a conflict-free mapping at all cache levels. Our algorithm works even if the caches differ in any of the following parameters: size, line size, or degree of associativity. In this paper we describe the different steps of our algorithm as they are applied to a two-level cache hierarchy. We compare the performance of the reordered executable to that of an unmodified executable, as well as to results when only first-level cache reordering is applied. Furthermore, we simulate the effects of adding intraprocedural basic block reordering and memory-based procedure placement. This paper is organized as follows. Section 2 briefly reviews past work on code reordering. In Section 3 we describe our code repositioning algorithm, while in Section 4 we report simulation results. We conclude the paper in Section 5.
2 Related work

There has been a considerable amount of work done on profile-guided code positioning for improved instruction cache performance [1, 2, 3, 4, 5, 6]. We limit our discussion here to work that is directly related to this paper. Hashemi et al. [2] use a Call Graph (CG) to guide procedure placement. Their placement algorithm utilizes information about the cache organization to color procedures so that they do not overlap in the cache address space. A slightly modified version of their algorithm has been used to perform cache coloring in the first-level (L1) cache in our work. In [6, 7] two temporal-based procedure reordering algorithms are proposed. Both of them suggest building a graph whose edge weights capture the temporal interaction between procedure pairs. Procedure interaction is defined within a finite window that includes live procedures [7] or live cache lines [6]. Edge weights are updated by counting either the number of times a live procedure follows another [7], or the maximum overlap, computed as the number of unique live cache lines used by two live procedures [6]. Placement is done using cache line coloring based on a locally optimal search [7] or based on heuristics [6]. In [8], Gloy describes procedure reordering for multiple caches. Besides trying to minimize conflicts in multiple caches, he also uses heuristics to improve spatial locality at the page level. His simulation results clearly motivate the need for mapping procedures over the entire cache hierarchy of a system. Our work focuses on both procedures and basic blocks. We rearrange basic blocks within a procedure body, in a way similar to the algorithm proposed in [1], since it is important to improve instruction fetch efficiency. We use temporal information at the procedure level to guide a modified cache line coloring algorithm that targets multiple levels of caching. Our coloring approach works for caches of any size, line size and associativity, in contrast to the approach described in [8], where proper procedure mapping is maintained only if all caches have
the same size and line size. In addition, the approach described in [8] does not explicitly take into account the degree of associativity when checking for color conflicts. We also describe how heuristics are used when placing procedures in the cache and memory address space in order to maintain the working set size of the application.
3 Code Reordering Algorithm

Next, we discuss the different steps of our code reordering framework.
3.1 Profiling Step

During profiling we collect information on a basic block basis. There are two major graphs that are built with profile information: (i) the Conflict Miss Graph (CMG) [6] and (ii) the Dynamic Control Flow Graph (DCFG). The CMG is an undirected graph where every node is a procedure, and every edge is weighted by an estimate of the worst-case number of conflict misses that can occur between the procedures connected via that edge. In order to generate the CMG edge weights we use a queue with a size equal to the size of the cache under consideration. The queue is maintained with the LRU replacement policy and basically models a fully associative cache. It is used to approximate code liveness in a cache. Two procedures interact if they are simultaneously live. A procedure is live if it has at least one cache line in that queue (given a cache configuration, we partition a procedure's body into cache lines). Edge weights are generated assuming a worst-case scenario (worst case implies that procedures completely overlap in the cache address space every time they are found to be live). On every profile entry, we update the edge weights between the procedure Pi whose code is currently activated and all live procedures Pj that have been activated since the last activation of Pi. Every edge weight is incremented by the minimum of: (i) the accumulated number of unique live cache lines of Pj (since Pi's last occurrence) or (ii) the number of unique cache lines of Pi's current activation (excluding those cache lines that would result in cold-start misses). A detailed example of CMG construction can be found in [6]. CMG edge weights are more accurate than CG edge weights (a CG edge weight is equal to the number of times one procedure calls another procedure) because:

1. CMG edge weights record the number of cache lines that may conflict per call (CG edge weights only capture the frequency of calls), and

2. the CMG captures the interaction between procedures that do not directly call each other (the CG only captures interaction between caller-callee pairs).

For further discussion on the differences between a CG and a CMG, see [9]. Most of the CMG edges will possess a very low edge weight, and hence can be ignored during code placement. Therefore, a pruning step is utilized in order to focus on the procedure pairs with the heaviest interaction. The pruning step we implemented is identical to the method described in [6]. To prune the graph, we select a threshold value by first sorting the edge weights in descending order and then, starting at the heaviest edge, summing the edge weights until their sum is equal to or greater than a certain value expressed as a percentage of the sum of all edge weights (the value is set to 80% for all of the experiments). All edges with a weight under the threshold are deleted from the CMG. Procedures that have at least one edge remaining after the pruning are labeled popular. The remaining procedures are categorized as unpopular. All CMG edges with weights greater than the threshold value are called popular edges. Pruning essentially partitions the CMG into a Popular Procedure Graph (PPG) and an UnPopular Procedure Graph (UPPG). A DCFG is a directed graph where every node models a basic block, and every edge is weighted by the number of transitions between two blocks. The number of edges recorded in the DCFG depends on the transitions between basic blocks that are recorded in the profile. We generate a DCFG for every popular procedure.
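A minimal sketch of the CMG bookkeeping is shown below. It is a simplified rendering under assumptions of ours: the profile has been pre-digested into (procedure, cache-line-set) activations, and all identifiers (build_cmg, trace, capacity) are illustrative rather than taken from our implementation.

```python
from collections import OrderedDict

def build_cmg(trace, capacity):
    """Approximate CMG edge weights from a profile.

    trace: iterable of (proc, lines) activations, where `lines` is the set
    of cache lines touched by one activation of `proc`.
    capacity: number of unique lines held by the L1 cache; the LRU queue
    models a fully associative cache to approximate code liveness.
    """
    lru = OrderedDict()   # live cache line -> procedure owning it (LRU order)
    ever_seen = set()     # lines referenced at least once (cold-start filter)
    acc = {}              # acc[pi][pj]: unique live lines of pj since pi's last activation
    cmg = {}              # frozenset({pi, pj}) -> worst-case conflict-miss estimate

    for pi, lines in trace:
        warm = len(set(lines) & ever_seen)   # (ii) lines that cannot be cold-start misses
        live = set(lru.values())
        for pj, pj_lines in acc.get(pi, {}).items():
            if pj in live:                   # pj activated since pi's last run and still live
                edge = frozenset((pi, pj))
                cmg[edge] = cmg.get(edge, 0) + min(pj_lines, warm)
        acc[pi] = {}                         # pi's accumulation window restarts
        fresh = len(set(lines) - set(lru))   # pi's unique lines not already live
        for other in acc:
            if other != pi:
                acc[other][pi] = acc[other].get(pi, 0) + fresh
        for line in lines:                   # touch lines, evicting beyond capacity
            lru.pop(line, None)
            lru[line] = pi
        ever_seen.update(lines)
        while len(lru) > capacity:
            lru.popitem(last=False)
    return cmg
```

The pruning step then reduces this graph to its heaviest edges, as described above.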
3.2 Intra-procedural Basic Block Reordering Algorithm

After building the DCFGs, we sort the DCFG edges in decreasing weight order. Then we form compound nodes of basic blocks as we traverse the sorted list of edge weights. There are three possible cases:

Case I: If both basic blocks are unmapped, then form a new compound node by placing the source followed by the destination basic block.

Case II: If one block is mapped and the other is unmapped, and if the mapped basic block is the tail of its compound node, insert the unmapped block at the end of the compound node (right after the mapped block); otherwise create a new compound node having the unmapped block as its only member.

Case III: If both blocks are mapped but belong to different compound nodes, and if the source basic block is the tail of its associated compound node and the destination basic block is the head of its associated compound node, then concatenate these two nodes, with the node of the source block placed before the node of the destination (i.e., the two nodes are merged into a single node).

After completing this step, there may be several unmerged compound nodes remaining. We then traverse the sorted edge list again, applying Case III. At this point, there should remain only one compound node containing all of the activated basic blocks in the program (see the sketch below). One underlying principle of basic block reordering is to try to make every conditional branch fall through during execution. The unactivated basic blocks are listed at the tail of the remaining compound node in the order they are found in the original executable. The reordering algorithm finishes with a code fix-up step, where unconditional branches are inserted wherever needed to maintain correct program semantics. Subsequently, we apply a partitioning algorithm that separates a procedure's body into a hot and a cold segment. The idea behind the partitioning step is to effectively reduce the size of the procedure that needs to be colored in the cache address space. Therefore, the hot region includes the most frequently accessed basic blocks (in the order they are found after reordering) while the cold region includes all rarely accessed and unactivated parts of a procedure's body. The partitioning step works as follows: we compute the total sum of the DCFG edge weights. We find an upper bound on the sum, based on a fixed threshold (set to 90% for all the experiments; 90% was selected through experimentation), and determine the hot region of a procedure by tagging all basic blocks whose total sum of edge weights is less than or equal to the upper bound. Clearly, only tagged basic blocks belong to the hot region. Tagging starts from the first basic block in the reordered procedure body. The hot region of the procedure then determines its size during the next step, which is cache line coloring.
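The chain-forming cases can be sketched as follows (a minimal illustration, assuming basic blocks are opaque ids and edges arrive as (weight, source, destination) triples; the fix-up and partitioning steps are omitted):

```python
def reorder_blocks(edges):
    """Greedy compound-node (chain) formation over DCFG edges.

    edges: list of (weight, src, dst) basic block transitions.
    Returns the surviving chains, hottest-first.
    """
    order = sorted(edges, key=lambda e: e[0], reverse=True)
    chain_of = {}                            # block id -> the chain (list) holding it

    def case_iii(src, dst):
        cs, cd = chain_of[src], chain_of[dst]
        # Concatenate only when src is a tail and dst a head of distinct chains.
        if cs is not cd and cs[-1] == src and cd[0] == dst:
            cs.extend(cd)
            for b in cd:
                chain_of[b] = cs

    for _, src, dst in order:
        if src == dst:
            continue                         # self-loops cannot extend a chain
        cs, cd = chain_of.get(src), chain_of.get(dst)
        if cs is None and cd is None:        # Case I: fresh chain [src, dst]
            chain_of[src] = chain_of[dst] = [src, dst]
        elif (cs is None) != (cd is None):   # Case II: exactly one block mapped
            mapped, unmapped = (dst, src) if cs is None else (src, dst)
            chain = chain_of[mapped]
            if chain[-1] == mapped:          # mapped block is its chain's tail
                chain.append(unmapped)
                chain_of[unmapped] = chain
            else:                            # otherwise start a singleton chain
                chain_of[unmapped] = [unmapped]
        else:                                # Case III: both blocks mapped
            case_iii(src, dst)

    for _, src, dst in order:                # second pass applies Case III alone
        if src != dst:
            case_iii(src, dst)

    seen, chains = set(), []                 # deduplicate shared chain objects
    for c in chain_of.values():
        if id(c) not in seen:
            seen.add(id(c))
            chains.append(c)
    return chains
```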
3.3 Cache-conscious Procedure Reordering Algorithm

Next we present the cache line coloring algorithm used to lay out the PPG. The first step is to prune the PPG as described in Section 3.1. The pruning step reduces the PPG into (usually) one strongly connected subgraph
that includes all heavily interacting procedures. After pruning, we process all of the popular edges, working in descending edge weight order. We then form compound nodes by merging procedures together. The goal is to find a relative procedure placement in a cache hierarchy so that the number of conflicts is minimized. In order to accomplish this task we had to define how and when procedures can cause conflicts when mapped into a cache hierarchy. Therefore, we slice up the cache address space on a cache set basis (note that n cache lines can be mapped to a single set in an n-way set-associative cache), for the specific cache organization of interest. A procedure is mapped in the cache as follows: each procedure is partitioned into cache lines and mapped in a modulo fashion onto the cache address space. If two procedures have at least one cache line mapped to the same cache set, then they overlap. The degree of overlap (ovd) for a given cache set and group of procedures is equal to the number of cache lines of all the procedures that are mapped onto that cache set. If the maximum degree of overlap for a set of procedures (over all cache sets in a given cache) is greater than the degree of associativity ca of that cache, then there exist potential conflicts between those procedures. Thus, we must relocate at least ovd - ca procedures in order to fully eliminate any possible conflicts between this set of procedures.
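A minimal sketch of the overlap test (procedures reduced to a starting color and a line count; identifiers are ours):

```python
def max_overlap_degree(placements, num_sets, assoc):
    """Compute the maximum degree of overlap (ovd) for a set of procedures.

    placements: list of (start_set, num_lines) per procedure; a procedure's
    lines map modulo num_sets starting at its color.  Potential conflicts
    exist wherever ovd exceeds the associativity `assoc`.
    """
    ovd = [0] * num_sets
    for start, nlines in placements:
        for i in range(nlines):
            ovd[(start + i) % num_sets] += 1
    worst = max(ovd)
    return worst, worst > assoc

# A configuration like Figure 1's (the exact placements in the figure are
# illustrative): 2-way set-associative cache with 12 sets; P1 and P2 have
# 3 lines each, P3 has 2.
print(max_overlap_degree([(0, 3), (2, 3), (1, 2)], num_sets=12, assoc=2))  # (3, True)
print(max_overlap_degree([(0, 3), (2, 3), (3, 2)], num_sets=12, assoc=2))  # (2, False)
```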
Figure 1: Degree of overlap computation for a 2-way set-associative cache with 12 cache sets (P1 and P2 occupy 3 cache lines each; P3 occupies 2).

Figure 1 presents an example of how to calculate the ovd. In this example we show a 2-way set-associative cache containing 12 cache sets. Initially, procedures P1, P2 and P3 are mapped in such a way that the maximum ovd is equal to 3. Since ovd > ca = 2, conflicts can occur in the shaded cache set of the cache address space. By moving one of the three procedures (P3 in Figure 1), we can decrease the maximum ovd to 2 and avoid any conflicts between the three procedures. Notice that this approach works for any cache size, line size, or degree of associativity. Another important issue that has to be addressed when mapping to multiple levels of cache is to ensure that the mapping at cache levels 0, ..., i is maintained when mapping at cache level i + 1. In this work we present the necessary and sufficient conditions for preserving the mapping in a two-level cache hierarchy. When we place a procedure in the L1 cache, there exists a fixed number of valid L2 cache positions that maintain the L1 cache mapping. Assume that L1ls and L2ls are the L1 and L2 cache line sizes, SL1 and SL2 are the numbers of cache sets for L1 and L2 respectively, and l1 and l2 are the starting cache set indices in the L1 and L2 cache address spaces for a mapped procedure. The indices l1 and l2 are the colors of that procedure in the L1 and L2 cache respectively. We present the valid l2 indices given an l1 index in four different cases. Depending on the cache organization, after fixing the l1 and l2 cache set indices, we may limit the range of valid memory addresses a procedure can occupy. This is recorded with the formulas for L2tag and L2cl, which denote the valid values for the tag and cache line
index field of an address accessing the L2 cache. In the formulas below, k = |log2(L2ls) - log2(L1ls)|, l = |(log2(SL2) + log2(L2ls)) - (log2(SL1) + log2(L1ls))| and size is the address width in bits.

Case I: L2ls >= L1ls and SL1 * L1ls <= SL2 * L2ls. The low k bits of l1 fall into the L2 cache line index field, while the l high-order bits of the L2 set index are unconstrained:

    l2 = floor(l1 / 2^k) + (SL1 / 2^k) * i,    i = 0, ..., 2^l - 1    (1)
    L2cl = ((l1 mod 2^k) * L1ls) + j,    j = 0, ..., L1ls - 1

Cases II, III and IV cover the remaining alignments of the two set index fields: Case II applies when L2ls < L1ls and SL2 * L2ls < SL1 * L1ls, while Cases III and IV split into subcases (IIIa: log2(L1ls) + log2(SL1) >= log2(L2ls) + log2(SL2); IIIb, IVa and IVb cover the complementary alignments shown in Figure 2). Each case yields analogous expressions, equations (2) through (6), for the valid l2 indices and, where the address range is constrained, for L2tag and L2cl; in equation (6), for instance, the unconstrained tag bits contribute valid values indexed by i = 0, ..., 2^(size - log2(SL2) - log2(L2ls) - l) - 1.

Figure 2 shows the different cases. Given an L3 cache, the number of valid L3 cache indices is given by the exact same formulas presented above, provided that we replace the L1 label with L2 and the L2 label with L3. Clearly, we can extend this further to provide a valid mapping for an arbitrary number of cache levels. However, the number of valid colors for a given cache level increases as the depth of the cache hierarchy increases. For example, the total number of valid L3 indices is equal to |(l1, l2)| * 2^l, where l = |(log2(SL3) + log2(L3ls)) - (log2(SL2) + log2(L2ls))| and |(l1, l2)| is the number of valid (l1, l2) pairs of colors for the procedure under investigation.
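The valid l2 set can also be recovered by brute force directly from the definitions, which provides a check on the closed-form cases above. A minimal sketch (the address sweep is kept small; the example uses the L1/L2 organization of our simulated machine):

```python
def valid_l2_colors(l1, L1ls, S_L1, L2ls, S_L2, addr_bits=18):
    """Enumerate the L2 set indices reachable by addresses mapping to L1 set l1.

    Brute-force check of the closed-form cases: sweep a small address space
    and record the L2 color of every address whose L1 color is l1.
    """
    colors = set()
    for addr in range(1 << addr_bits):
        if (addr // L1ls) % S_L1 == l1:         # address lands in L1 set l1
            colors.add((addr // L2ls) % S_L2)   # record its L2 set index
    return sorted(colors)

# 16Kb direct-mapped L1 (32-byte lines, 512 sets) under a 256Kb 2-way L2
# (64-byte lines, 2048 sets): here l = 3, so each L1 color leaves 2^3 = 8
# valid L2 colors.
print(len(valid_l2_colors(l1=5, L1ls=32, S_L1=512, L2ls=64, S_L2=2048)))  # 8
```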
[Figure 2 diagrams the address bit fields (tag, cache set index, cache line index) for L1 and L2 in each case, marking the k and l bit offsets: one layout for Cases I and IIIa, one for Cases II and IVb, one for Case IIIb, and one for Case IVa.]
Figure 2: Different cases for coloring over two levels of cache.

In case the L1 and L2 caches have the same cache line size, k = 0 and l = |log2(SL2) - log2(SL1)|. Cases I, IIIb and IVa become equivalent and provide valid L2 colors as follows:

    l2 = l1 + SL1 * i,    i = 0, ..., SL2/SL1 - 1    (7)

which agrees with the findings in [8]. Cases II, IIIa and IVb lead to a single valid L2 color, given by:

    l2 = l1 & (SL2 - 1)    (8)
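For the equal-line-size case, equations (7) and (8) reduce to a few lines of code (a minimal sketch; we assume the case split reduces to comparing set counts once the line sizes match):

```python
def valid_l2_colors_equal_lines(l1, S_L1, S_L2):
    """Valid L2 colors for an L1 color l1 when both caches share a line size."""
    if S_L2 >= S_L1:
        # Equation (7): the l extra set-index bits of the L2 index are free.
        return [l1 + S_L1 * i for i in range(S_L2 // S_L1)]
    # Equation (8): the L2 index is fully determined by the low bits of l1.
    return [l1 & (S_L2 - 1)]

print(valid_l2_colors_equal_lines(l1=5, S_L1=512, S_L2=2048))  # [5, 517, 1029, 1541]
print(valid_l2_colors_equal_lines(l1=5, S_L1=512, S_L2=128))   # [5]
```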
When forming compound nodes during L1 cache coloring we use the same heuristics as presented in [2] and [6]. The L2 cache line coloring is only attempted after fixing the L1 colors. The four possible states that can be encountered are:
State I: Procedures P1 and P2 are not colored. We place them in the L1 cache using the first available L1 colors. The larger procedure is placed first. A new compound node is formed consisting of these two procedures. L2 cache mapping is also performed using the first available valid L2 colors.
State II: Both procedures P1 and P2 are mapped, but belong to different compound nodes. We first find the larger node in terms of the total number of L1 cache colors occupied by its members. We then place the smaller node at the side of the larger one so that the distance between the two procedures is minimized. If there exist L1 cache color conflicts between the two procedures, we shift the smaller node away from the larger one until no more conflicts occur. If conflicts still occur, we restore the original mapping. We then color the procedures in the L2 cache. Let us assume that P1 belongs to the smaller node. We first select a valid L2 color for P1 so that L2 cache conflicts and the sum of distances between P1 and the procedures in the larger node are minimized. The distance heuristic is used to avoid spreading procedures around the L2 cache address space, which in turn can cause large gaps in the final code layout. Then we relocate the remaining procedures in the smaller node so that L2 color conflicts and the sum of distances between them (including P1) are minimized. Finally, the two compound nodes are merged into a new one.
State III: One procedure is mapped and the other procedure is unmapped. We place the unmapped procedure at the side of the compound node so that the distance between the two procedures is minimized. If L1 color conflicts occur, we attempt to shift the newly mapped procedure away from the compound node until all conflicts are eliminated. If conflicts still occur, we do not move the procedure. The new procedure is added to the compound node. We then seek a valid L2 mapping that eliminates L2 cache conflicts and minimizes the distance between the two procedures.
State IV: Both procedures P1 and P2 are mapped and belong to the same compound node. If L1 cache color conflicts occur, we move the procedure closest to one of the compound node's boundaries to that boundary. If L1 cache color conflicts remain, we attempt to shift the procedure away from the compound node so that the L1 cache color conflicts between the two procedures are eliminated. If conflicts still occur, we restore the original mapping. If the selected procedure has been repositioned, then we try to find another L2 color. We iterate over all of its valid L2 colors, selecting the one that reduces the L2 cache conflicts.
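States II through IV rely on the same primitive: sliding a procedure along the cache colors until its lines no longer collide with what is already placed. One plausible rendering (identifiers are illustrative; the real heuristics also track inter-procedure distances and may restore the original mapping, as described above):

```python
def place_with_shift(occupancy, nlines, start, direction, num_sets, assoc):
    """Shift a procedure away from a compound node until conflict-free.

    occupancy[s]: number of lines already mapped to cache set s.
    Returns the first color where all of the procedure's lines fit within
    the associativity, or None after a full sweep (callers then restore
    the original mapping).
    """
    for step in range(num_sets):
        color = (start + direction * step) % num_sets
        sets = [(color + i) % num_sets for i in range(nlines)]
        if all(occupancy[s] + 1 <= assoc for s in sets):
            for s in sets:                  # commit the placement
                occupancy[s] += 1
            return color
    return None

# A 2-way cache with 8 sets where sets 0-2 are already full: a 3-line
# procedure starting at color 0 is shifted to color 3.
occ = [2, 2, 2, 1, 0, 0, 0, 0]
print(place_with_shift(occ, nlines=3, start=0, direction=1, num_sets=8, assoc=2))  # 3
```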
3.4 Memory-based Procedure Placement Algorithm

Next we describe how we place procedures in the main memory address space. We begin by selecting the procedure that is currently mapped closest to the beginning of both the L1 and L2 cache address spaces. We then perform the following steps:
Step 1: we advance a pointer that always points to the next free memory address by the size of the current procedure,

Step 2: we find the L1 and L2 cache colors corresponding to that address,

Step 3: we check to see if there are any procedures that have been assigned to those colors,

Step 4: if there is no mapped procedure, then we advance the L1 color by 1 (we proceed to the next L1 cache set), find its equivalent L2 color and go to Step 3. Otherwise we proceed to Step 5,

Step 5: if more than one procedure is mapped to those colors, we pick the procedure that is connected to the previously mapped procedure by the heaviest edge in the CMG. The goal is to place highly interacting pairs of procedures together in main memory so that the working set of the application does not increase after reordering. If none of the procedures is connected to the previously mapped procedure via a CMG edge, then we select one at random,

Step 6: once we have selected a procedure P to be mapped, we find its new memory address and calculate the size of the gap created between P and the previously mapped procedure. If the gap size is larger than the page size, we relax the L2 coloring for P and try to find a valid L2 coloring that shrinks the gap to smaller than a memory page. If we are able to find a valid coloring, we select the new address; otherwise we fall back to the original memory address. We then return to Step 1.
Notice that the heuristic used in Step 6 may modify the L2 color selected by the procedure coloring algorithm. This is done to avoid large increases in the working set of the program. Although the new L2 color may not be optimal (in terms of conflict reduction), it does preserve the original L1 coloring. The algorithm completes when all popular procedures have been mapped. We then sort any memory gaps created during the previous step in increasing size order, and all unpopular procedures in decreasing size order. Placement of unpopular procedures takes place using a best-fit algorithm so that the empty space among popular procedures is minimized (see the sketch below). We then perform the same task for any unactivated procedures. Finally, the remaining unmapped unpopular and unactivated procedures are appended to the end of the image.
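The best-fit gap-filling step can be sketched as follows (a minimal illustration under our own naming; gaps and procedures are reduced to sizes):

```python
def fill_gaps(gaps, procs):
    """Best-fit placement of unpopular procedures into memory gaps.

    gaps: list of (start_addr, size); procs: list of (name, size).
    Returns (placements, leftovers); leftovers are appended to the image end.
    """
    gaps = sorted(gaps, key=lambda g: g[1])                  # increasing gap size
    procs = sorted(procs, key=lambda p: p[1], reverse=True)  # decreasing proc size
    placed, leftovers = [], []
    for name, size in procs:
        fit = next((i for i, (_, gsz) in enumerate(gaps) if gsz >= size), None)
        if fit is None:
            leftovers.append(name)
            continue
        addr, gsz = gaps.pop(fit)
        placed.append((name, addr))
        if gsz > size:                                       # keep the remainder as a smaller gap
            gaps.append((addr + size, gsz - size))
            gaps.sort(key=lambda g: g[1])
    return placed, leftovers

# Two gaps, three unpopular procedures: "c" does not fit and is appended.
print(fill_gaps([(0x1000, 96), (0x2000, 40)], [("a", 64), ("b", 40), ("c", 50)]))
```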
4 Experimental Results

To evaluate the merit of our algorithm, we compare several variations:

- CMG-based procedure reordering (P_cmg^sc)
- CMG-based procedure and intra-procedural basic block reordering (P_cmg^bbsc)
- CMG-based procedure reordering for multiple cache levels (P_cmg^mc)
- CMG-based procedure and intra-procedural basic block reordering for multiple cache levels (P_cmg^bbmc)
We also provide results for the standard static DFS layout employed by the Compaq C/C++ compilers used to compile our benchmarks (we call this version unoptimized, Unopt). We use cycle-based simulation to quantify the performance of code reordering. We modified the SimpleScalar v3.0 simulator for the Alpha AXP architecture [10] to simulate an accurate memory system. More specifically, we fully model bus contention on all levels of the memory hierarchy and provide write buffer support between the L1 data and the L2 unified cache. We also modified the simulator so that code reordering can be simulated. The effects of the extra code required for preserving program semantics are also properly simulated in the out-of-order CPU model. A brief description of the model is shown in Table 1. Due to the lack of adequate system call support we could not simulate some of the benchmarks using SimpleScalar (lcom, gs, ixx, porky, bore and eon); we are working on this and will provide results in the final version of this paper. Instead we have developed an ATOM-based model using the same modified memory system which we used in our SimpleScalar simulations. The memory system accurately models timing constraints, and bus and cache line access contention. It also considers the side effects of code reordering (extra branches being inserted/deleted). All applications are compiled with the Compaq C V5.2 and C++ V5.5 compilers on Alpha-based workstations running Unix V4.0. Table 2 presents our benchmarks along with some of their static and dynamic characteristics. Columns 3 and 4 list the training and test inputs used during the experiments. In order to reduce the time needed for profiling/simulation, we simulated 300M executed instructions for both the SimpleScalar and ATOM-based experiments (perl, gcc, eon, porky and bore). We warm up the simulator with 50M instructions before recording results. If an application required less than 300M instructions to complete, we ran it (or profiled it) to completion (eqn, troff, lcom, ixx, edg, gs). Columns 5 and 6 display the static count and total size in Kbytes of all procedures in the application, respectively. Columns 7 and 8 present the same statistics for popular procedures only. Column 9 shows the total size in Kbytes of the hot regions of popular procedures. The label bb is appended to indicate that a hot region is defined only after basic block reordering is applied.
L1 I-cache: 16Kb, 32 byte line size, direct mapped, write back, 1 cycle hit, 32 byte wide bus to L2
L1 D-cache: 16Kb, 32 byte line size, 2-way (LRU), write through, 1 cycle hit, 32 byte wide bus to L2
Write buffer: 12 entries, 32 byte entry size, retire-at-8, read-from-WB load hazard policy
L2 unified cache: 256Kb, 64 byte line size, 2-way (LRU), write back, fetch-on-write on write miss, 7 cycle hit
Main memory: 85 cycles, 8 byte wide bus, 8K page size, 2 memory ports
Instruction, data TLB: 32 entries, 4-way (LRU), 30 cycle miss latency
I-fetch unit: up to 4 insts/cycle from the same cache block, 8-entry instruction queue, hybrid predictor (8Kb bimodal + 8Kb gshare with 12-bit history register + 8Kb 2-bit selector), 16 entry RAS, 2Kb 2-way (LRU) tagged BTB, 2 cycle misfetch penalty, 8 cycle extra misprediction penalty, speculative update in the ID stage
Register file: 32 integer and 32 FP registers
Decode unit: 4 insts/cycle
Issue unit: 4-wide out-of-order integer operation issue, 2-wide out-of-order FP operation issue, 64 instruction window size, 32 entry load/store queue
Execute unit: 4 integer ALUs, 1 Mul/Div integer unit, 2 FP ALUs, 1 Mul/Div FP unit
Commit unit: 4 insts/cycle

Table 1: Machine description.

Table 3 presents the number of Instructions Per Cycle (IPC) measured with SimpleScalar and the average number of Memory Cycles Per Instruction (MCPI) measured with the ATOM-based model. The higher the IPC, the better, since more instructions were executed per cycle. The lower the MCPI, the better, since the average number of cycles needed to service an instruction is lower.
[Figure 3 contains two bar plots, "Cycle reduction" (SimpleScalar benchmarks) and "Memory cycles reduction" (ATOM-based benchmarks), showing the performance improvement (%) of mc-bb-cmg, mc-cmg, sc-bb-cmg and sc-cmg for each benchmark.]
Figure 3: Performance improvement measured as the total number of cycles relative to the unoptimized case.

Figure 3 shows the performance improvement expressed as the percentage of cycles saved over the unoptimized case. If a bar is positive, then we have an improvement. We present two plots, one for the SimpleScalar experiments and another for the ATOM-based results.
Program       Description              Train       Test       Procs #   Procs Kb   Pop #   Pop Kb   Pop Kb (bb)
perl (C)      script language          jumble      scrabbl        671        444      23     51.2           8.8
gcc (C)       C compiler               expr        cp-decl       2328       1393     481      614           345
edg (C)       C/C++ front end          pic.cc      input.cc      3486       1530     138      113            65
gs            postscript interpreter   photon6.ps  tiger.ps      4400       1535     415      154           111
troff (C++)   document formatter       gcc.1       flex.1        1818        447     182       69            43
eqn (C++)     equation formatter       eqn.inp     same           850        246      99       52            30
bore (C++)    SUIF code transf. tool   combine     cse           2471       1230     308      134            91
porky (C++)   SUIF scalar optimizer    cse         combine       3478       1945     290      124            87
lcom (C++)    HDL compiler             circuit3    circuitx      1527        398      40       37            15
ixx (C++)     IDL parser               widget      layouts       1581        469      87       56            29
eon (C++)     ray-tracing tool         chair       same          2763        706      90       26            19
Table 2: Attributes of traced applications. Training/test inputs are described in columns 3-4. The remaining attributes include the number and total size of static procedures, the number and total size of popular procedures, and the total size of hot regions.

Figure 4 displays the L1/L2 cache miss ratios. The two sets of plots on the left side correspond to the ATOM-based results while the results on the right side correspond to SimpleScalar results. The upper plots show the L1 cache miss ratios while the lower ones show the L2 cache miss ratios. Note that the L2 cache miss ratios do not include references that hit in the L1, so a higher L2 miss ratio, given a higher L1 hit ratio, can be desirable. As we can see from the results, utilizing intraprocedural basic block reordering is always beneficial. In some cases we can reduce the cycle count by over 16%. This benefit is partly due to reducing the working set size of each procedure, which in turn allows the coloring algorithm to make better placement decisions. Providing more compact hot regions in an image also improves the efficiency of the instruction fetch unit by increasing the sequentiality of the dynamic instruction stream (see the IPC results). On the other hand, the improvement coming from applying multi-level cache coloring (mc-cmg and mc-bb-cmg in Figure 3) seems to be application dependent. There are two reasons for this. First, memory defragmentation is performed at the expense of L2 cache coloring. For instance, in troff and porky, more than 93% of the popular procedures are moved in the L2 during memory placement. This means that most of the work done by the L2 coloring algorithm is undone during memory placement. One reason why movement is performed this often is the disparity in L1 and L2 cache sizes (8KB and 256KB, respectively). To address this issue, we could limit the distance that memory placement can move procedures in the L2. Also, adjusting the threshold for activating the heuristic could provide even better results. The other issue is that this set of applications does not experience a significant number of misses due to instruction accesses in the shared L2 cache. Most of the misses are due to data-code interference. However, some of the applications do benefit from mc coloring, and we anticipate that applications with larger working sets, such as databases and CAD tools, will also benefit since the footprint of their images in an L2 cache will be reduced.
Program   Unopt          P_cmg^sc   P_cmg^mc   P_cmg^bbsc   P_cmg^bbmc
perl      1.25 (300M)        1.28       1.39         1.45         1.41
gcc       1.02 (300M)        0.99       1.00         1.15         1.15
edg       0.66 (304M)        0.68       0.63         0.69         0.69
troff     1.59 (144M)        1.78       1.74         1.91         1.90
eqn       1.93 (70M)         1.65       1.60         2.05         2.08
gs        1.43 (99.6M)       1.43       1.43         1.40         1.40
bore      1.64 (300M)        1.63       1.64         1.62         1.61
lcom      1.46 (32.4M)       1.46       1.46         1.45         1.44
ixx       1.53 (48.7M)       1.64       1.58         1.43         1.43
eon       1.69 (300M)        1.64       1.61         1.57         1.52
porky     1.70 (300M)        1.68       1.69         1.66         1.65

Table 3: IPC/MCPI for the single- and multi-level cache coloring configurations (IPC for the SimpleScalar benchmarks perl, gcc, edg, troff and eqn; MCPI for the ATOM-based benchmarks; simulated instruction counts in parentheses).
5 Conclusions

Cache performance is critical for today's microprocessors. Research has shown that compiler optimizations can significantly reduce the number of cache misses, and every opportunity should be taken by the compiler to do so. In this paper we propose a new link-time algorithm for code reordering which targets the elimination of conflict misses in a memory hierarchy with multiple caches. The algorithm exploits temporal locality on a procedure basis and intra-procedural spatial locality on a basic block basis. Furthermore, it employs a color-based placement algorithm that not only colors procedures based on their most frequently accessed segment, but also places them in multiple cache levels so that conflicts are minimized. Our approach can be further enhanced in different ways. First, we plan to consider jointly reordering data and code at the L2 cache level so that conflicts between them are also minimized. We could also incorporate temporal-based page allocation into our code reordering framework to further improve performance by reducing the number of TLB misses. Finally, we would like to perform a sensitivity analysis on the L2 cache size and the heuristics controlling memory placement to investigate the performance potential of multiple-level cache coloring.
References

[1] K. Pettis and R. Hansen. Profile-Guided Code Positioning. In Proceedings of the International Conference on Programming Language Design and Implementation, pages 16-27, June 1990.

[2] A.H. Hashemi, D.R. Kaeli, and B. Calder. Efficient Procedure Mapping using Cache Line Coloring. In Proceedings of the International Conference on Programming Language Design and Implementation, pages 171-182, June 1997.

[3] W.M. Hwu and P.P. Chang. Achieving High Instruction Cache Performance with an Optimizing Compiler. In Proceedings of the International Symposium on Computer Architecture, pages 242-251, May 1989.

[4] J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In Proceedings of the International Conference on High Performance Computer Architecture, pages 360-369, January 1995.
Figure 4: L1/L2 cache miss ratios.

[5] S. McFarling. Program Optimization for Instruction Caches. In Proceedings of the International Symposium on Architectural Support for Programming Languages and Operating Systems, pages 183-191, April 1989.

[6] J. Kalamatianos and D.R. Kaeli. Temporal-Based Procedure Reordering for Improved Instruction Cache Performance. In Proceedings of the International Conference on High Performance Computer Architecture, pages 244-253, February 1998.

[7] N. Gloy, T. Blackwell, M.D. Smith, and B. Calder. Procedure Placement using Temporal Ordering Information. In Proceedings of the International Symposium on Microarchitecture, pages 303-313, December 1997.

[8] N. Gloy. Code Placement using Temporal Profile Information. PhD thesis, Harvard University, 1998.

[9] J. Kalamatianos, A. Khalafi, D.R. Kaeli, and W. Meleis. Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance. IEEE Transactions on Computers, 48(2):168-175, February 1999.

[10] D. Burger and T. Austin. The SimpleScalar Tool Set, v2.0. Technical Report UW-CS-97-1342, University of Wisconsin, Madison, June 1997.