Trace Reduction for LRU-based Simulations

Yannis Smaragdakis
University of Texas at Austin

Abstract

LRU buffer management is a policy under which an element buffer of size k always stores the k most recently used elements. Many variants of the policy are widely used in memory systems (e.g., virtual memory subsystems). We study a simple algorithmic problem (called the trace reduction problem): given a sequence of references to elements (a reference trace), compute the shortest sequence with identical LRU behavior for buffers of at least a certain size. Despite the straightforward statement of the problem, its solution is non-trivial. We offer an algorithm and discuss its practical applications. These are mostly in the area of trace-driven program simulation: for quite realistic buffer sizes, reference traces from real programs can be reduced in size by up to several orders of magnitude. This compares favorably with all previous trace reduction techniques in the literature.

1 Introduction

An LRU stack of size k is a data structure storing the k most recently accessed elements of a larger set. References to set elements may cause the subset stored in an LRU stack to change. A full record of these changes is said to constitute the (LRU) behavior of the stack under the sequence of references. We will discuss the LRU trace reduction problem: given a sequence of references, find the shortest sequence with identical behavior for an LRU stack of (at least) a given size.

As an example, consider a set with four elements (a, b, c, and d) and the reference sequence:

    abcbacdabd                                                (1)

For an LRU buffer of size k = 3, the behavior of the sequence is:

    ⟨a, NF⟩, ⟨b, NF⟩, ⟨c, NF⟩, ⟨d, b⟩, ⟨b, c⟩, Last            (2)

The above behavior consists of pairs of elements entering and leaving the LRU stack. The special value Last signals the end of the sequence, while NF denotes that the LRU stack is not full and, hence, the insertion of one element does not cause the eviction of another. It is easy to confirm that the sequence:

    abcadb                                                    (3)

has identical LRU behavior to (1) for a stack of size k = 3 (or greater). In fact, something more can be said: there is no sequence shorter than (3) with the same LRU behavior. Hence, sequence (3) is a solution to the LRU trace reduction problem for input (1) and a stack size of 3. In this paper we will describe a general algorithm for computing such solutions.

The algorithm has already found application in the area of trace-driven simulation of replacement policies for storage hierarchies. Storage hierarchies are used in various parts of modern computer systems (CPU, disk, and network caches, for instance). In such hierarchies, one level acts as a fast buffer that stores elements from the next level of storage. When the buffer is full, a decision needs to be made about the buffer element that will get replaced.
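The example above can be checked mechanically. The following sketch is our own illustration (the helper name lru_behavior is not from the paper): it simulates an LRU buffer and records the behavior pairs, confirming that sequences (1) and (3) have identical behavior for k = 3.

```python
from collections import OrderedDict

def lru_behavior(trace, k):
    """Return the LRU behavior: (fetch, evict) pairs, with 'NF' while filling."""
    stack = OrderedDict()               # least recently used element first
    behavior = []
    for block in trace:
        if block in stack:
            stack.move_to_end(block)    # already resident: no behavior event
            continue
        stack[block] = True
        if len(stack) <= k:
            behavior.append((block, "NF"))          # stack not yet full
        else:
            victim, _ = stack.popitem(last=False)   # evict least recently used
            behavior.append((block, victim))
    return behavior

# Sequences (1) and (3) have identical behavior for a stack of size 3:
b1 = lru_behavior("abcbacdabd", 3)
b3 = lru_behavior("abcadb", 3)
assert b1 == b3 == [("a", "NF"), ("b", "NF"), ("c", "NF"), ("d", "b"), ("b", "c")]
```

The final assertion reproduces behavior (2), minus the terminating Last marker.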
The Least Recently Used (LRU) policy (to which the "LRU stack" data type owes its name) is the most extensively studied replacement policy for storage hierarchies. Under LRU, the element replaced is the one used least recently (i.e., furthest in the past); the buffer is essentially an LRU stack. Replacement policies can be studied cost-effectively using trace-driven simulation: reference traces from actual programs are collected and re-used in simulations of various policies. The size of such traces is usually enormous. As a

result, traces are often hard to store and process (simulation time is commonly proportional to the length of the trace). Using our algorithm to solve the LRU trace reduction problem, traces can be drastically reduced in size, and the results of the simulation are entirely accurate, provided the simulated LRU stack is of at least a certain size.

What makes this application of even greater interest are the many variations of LRU that have been studied (e.g., [FeLW78, GlCa97]). These policies handle a small part of the buffer using LRU and the rest using a different policy. Under the guarantees offered by our algorithm, simulation with reduced traces is exact for all such policies (provided that the LRU part of the buffer is larger than the reduction stack size of k elements).

The results of trace reduction using our algorithm are significant: for quite realistic buffer sizes, reference traces can be reduced by several orders of magnitude. All previous techniques that achieve similar compression results have one of three drawbacks: they are either inaccurate (i.e., they introduce error by dropping information from the trace), or they require a decompression phase (and, hence, do not speed up simulation, which is performed on the decompressed result), or they require modifications to the simulator to accept information in a different format (i.e., the result of the reduction is not itself a reference trace).

2 LRU Trace Reduction

2.1 Background

In this section we will define more formally the LRU trace reduction problem as well as some helpful concepts.

An LRU stack of size k is a data type defining a single operation touch_k(m), where m identifies a "block" (we assume no properties for blocks other than identity). The semantics of the operation is defined as follows:

- If fewer than k distinct blocks have been "touched" in the past, touch_k(m) returns the special value NF.
- If m is among the k most recently "touched" blocks, touch_k(m) returns the special value none.
- Otherwise, touch_k(m) returns an identifier of the (k+1)-th most recently "touched" block.

A reference trace is a finite sequence of references to blocks. We say that a reference trace r is applied to an LRU stack by performing the operation touch_k for each block of the trace in increasing order (i.e., touch_k(r(0)), touch_k(r(1)), ..., touch_k(r(#r − 1))¹). We define the complete (LRU) trace of a reference trace r to be a sequence c_k,r with c_k,r(i) = ⟨r(i), touch_k(r(i))⟩, 0 ≤ i < #r. The complete trace is terminated by a special element Last (that is, c_k,r(#r) = Last). We also define the relevant event sequence re_k,r to consist of all indices i (in increasing order) such that touch_k(r(i)) ≠ none.

In the following, we will drop subscripts when they can be deduced from the context (both for the above definitions and for ones still to be introduced). Thus we may often refer to touch_k, c_k,r, and re_k,r as touch, c, and re, for simplicity.

We define the behavior of a reference trace r relative to an LRU stack to be the subsequence b of the complete trace c, consisting only of relevant events. That is, b(i) = c(re(i)) = ⟨r(re(i)), touch(re(i))⟩, for 0 ≤ i < #re. By analogy to the complete trace, b(#re) = Last. Thus, the behavior of a reference trace contains all references made to blocks that were not among the k most recently touched, and the results of the respective touch operations.

The first component r(re(i)) of a pair ⟨r(re(i)), touch(re(i))⟩ will be called the fetch part (since it corresponds to a block "fetched" in the LRU stack) and the second component will be called the evict part (since it corresponds to a block "evicted" from the stack). We will write b(i).fetch and b(i).evict for the fetch and evict parts of the i-th element of a sequence b.

Data structures conforming to the LRU stack specification are usually studied in the context of storage hierarchies, where they correspond to the Least Recently Used replacement policy. This policy is known to belong in the general class of stack algorithms [MGST70]. In fact, LRU belongs to the subset of stack algorithms defined in [MGST70] that base their replacement decisions on a priority list for elements. A property of these algorithms is that if two reference traces have identical behavior for a buffer of size k, they will also have identical behavior for all buffers larger than k.

We can now define the problem we are trying to solve:

The LRU Trace Reduction Problem: Given a reference trace, compute a trace such that it has the same behavior for an LRU stack of size k and no shorter trace has that behavior.

Due to the aforementioned property of LRU stacks, the resulting trace will also have identical behavior for larger buffers. A trace satisfying the conditions of the LRU trace reduction problem will be called a reduced trace. In general, there will be more than one reduced trace, but all of them will be of the same length.

¹With #r we denote the length of a sequence r.
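The definitions above translate directly into code. The sketch below is our own illustration (helper names make_touch and complete_trace are not from the paper): it computes the complete trace c and relevant event sequence re for the introductory example, and derives the behavior b from them.

```python
from collections import OrderedDict

def make_touch(k):
    """Return a touch(m) closure over an LRU stack of size k."""
    stack = OrderedDict()                 # least recently touched block first
    def touch(m):
        if m in stack:
            stack.move_to_end(m)
            return "none"                 # m is among the k most recent
        stack[m] = True
        if len(stack) <= k:
            return "NF"                   # stack not yet full
        return stack.popitem(last=False)[0]   # (k+1)-th most recent block
    return touch

def complete_trace(r, k):
    """c(i) = (r(i), touch(r(i))), terminated by 'Last'."""
    touch = make_touch(k)
    return [(m, touch(m)) for m in r] + ["Last"]

c = complete_trace("abcbacdabd", 3)
re = [i for i, e in enumerate(c[:-1]) if e[1] != "none"]   # relevant events
b = [c[i] for i in re] + ["Last"]                          # behavior sequence

assert re == [0, 1, 2, 6, 8]
assert b == [("a", "NF"), ("b", "NF"), ("c", "NF"), ("d", "b"), ("b", "c"), "Last"]
```

The final assertion recovers behavior (2) from the Introduction.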

2.2 Developing a Lower Bound

Before we present an algorithm for the LRU trace reduction problem, it is useful to examine some of the characteristics of the reduced trace. A simple observation regarding the reduced reference trace is that it depends only on the behavior of the original trace: two input traces with the same behavior will have the same set of reduced traces. Hence, our problem is to compute a reduced trace from a given behavior sequence (derived from the input trace through LRU simulation).

Consider the sequence formed by taking the "fetch" parts (i.e., first elements of pairs) of the elements in a behavior sequence. Clearly, this needs to be a subsequence of any reference trace having this behavior. The LRU trace reduction problem is equivalent to adding the smallest possible set of elements to the behavior sequence, so that it becomes a complete LRU trace. We will attempt to develop a lower bound for the contents of this set. Later we will show that this bound is tight: our algorithm produces a trace containing exactly the extra references specified by the lower bound.

Consider a sequence of pairs (of the form found in a complete LRU trace or a behavior sequence). Any consecutive subsequence of a sequence seq is called an interval of seq. Two concepts that are important for our later development are those of a "run" and a "tight run":

Definition 1 (run) A fetch-evict run (or just run) in a sequence of pairs of the form ⟨r(i), touch(r(i))⟩ is an interval, such that:
- it begins with a pair ⟨r(s), touch(r(s))⟩ and ends either with a pair ⟨r(f), r(s)⟩ or with Last;
- no other element in the run has block r(s) in its fetch part (i.e., for every i with s < i < f, it is r(i) ≠ r(s)).

Intuitively, a run describes the behavior of an LRU stack between the point where an element is last referenced before it is evicted, and the corresponding eviction point. We will describe intervals (and, hence, runs) using the starting and finishing indices in the sequence where they occur.
A run (s0, f0) (that is, a run starting at index s0 and finishing at index f0) will be said to contain another run (s1, f1) iff s0 < s1 and f1 < f0 (note that this definition prescribes that runs extending till the end of a sequence do not contain one another).

Definition 2 (tight run) A tight fetch-evict run (or just tight run) is a run that contains no other runs.

For illustration purposes, consider the example reference trace presented in the Introduction. Its complete LRU trace appears below (indexed for easier reference):

      0        1        2        3          4          5          6       7          8       9          10
    ⟨a, NF⟩  ⟨b, NF⟩  ⟨c, NF⟩  ⟨b, none⟩  ⟨a, none⟩  ⟨c, none⟩  ⟨d, b⟩  ⟨a, none⟩  ⟨b, c⟩  ⟨d, none⟩  Last

There are five runs in this trace: (3, 6), (5, 8), (7, 10), (8, 10), and (9, 10). All of them are tight. According to the following lemma, this is not a coincidence.
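The run definition can be checked with a small search over the complete trace. The sketch below is our own illustration (find_runs and is_tight are not from the paper); it recovers exactly the five runs listed above and confirms they are all tight.

```python
def find_runs(c):
    """All fetch-evict runs (s, f) in a complete LRU trace c (ending in 'Last')."""
    last = len(c) - 1                      # index of the Last marker
    runs = []
    for s in range(last):
        block = c[s][0]                    # candidate run starts with block's fetch
        for i in range(s + 1, last):
            if c[i][0] == block:           # block fetched again: no run starts at s
                break
            if c[i][1] == block:           # block evicted: run (s, i)
                runs.append((s, i))
                break
        else:
            runs.append((s, last))         # no re-fetch, no eviction: run ends at Last
    return runs

def is_tight(run, runs):
    s0, f0 = run
    return not any(s0 < s1 and f1 < f0 for s1, f1 in runs)

# Complete LRU trace of abcbacdabd for k = 3, as tabulated above:
c = [("a", "NF"), ("b", "NF"), ("c", "NF"), ("b", "none"), ("a", "none"),
     ("c", "none"), ("d", "b"), ("a", "none"), ("b", "c"), ("d", "none"), "Last"]

assert sorted(find_runs(c)) == [(3, 6), (5, 8), (7, 10), (8, 10), (9, 10)]
assert all(is_tight(r, find_runs(c)) for r in find_runs(c))   # Lemma 1 holds here
```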

Lemma 1 All runs in a complete LRU trace are tight.

Proof: Assume that a run (s0, f0) in c is not tight. Then there exists a run (s1, f1) with s0 < s1 and f1 < f0. The last inequality means that c(f1) ≠ Last (because there is a later element in the trace). Hence, the result of the operation touch(r(f1)) must have been r(s1) (by definition of a run). By definition of touch, this is the (k+1)-th most recently touched element. This is, however, impossible: r(s0) was last touched less recently than r(s1) (no reference to r(s0) can exist between indices s0 and f1, by definition of a run), and it is still among the k most recently touched elements (since it has not been evicted from the stack by point f1). Hence, every run in c is tight. ∎

Continuing our example, consider the behavior sequence (2) from the Introduction (reproduced below with indices):

      0        1        2        3       4       5
    ⟨a, NF⟩  ⟨b, NF⟩  ⟨c, NF⟩  ⟨d, b⟩  ⟨b, c⟩  Last              (4)

Not all runs in this sequence are tight: for instance, run (0, 5) contains both runs (1, 3) and (2, 4). To derive a complete LRU trace from a behavior sequence, we need to add "extra" references to ensure that all runs are tight. The following lemma is straightforward:

Lemma 2 If a run (s0, f0) of a behavior sequence b contains another run (s1, f1), then the reduced trace r contains a reference to b(s0).fetch between re⁻¹(s1) and re⁻¹(f1). (Recall that re was defined to be monotonically increasing, hence its inverse re⁻¹ always exists.)

Proof: Consider the complete LRU trace c produced from r. From Lemma 1, we have that all runs in c are tight. Since b is a subsequence of c, either interval (re⁻¹(s0), re⁻¹(f0)) or interval (re⁻¹(s1), re⁻¹(f1)) is not a run in c. Quick inspection of all possible cases for reference addition shows that the only way the tightness constraint can be preserved is if there exists i with re⁻¹(s1) < i < re⁻¹(f1), such that c(i).fetch = b(s0).fetch = c(re⁻¹(s0)).fetch. ∎
In the example of sequence (4), run (0, 5) contains run (1, 3) and run (2, 4). Lemma 2 implies that there needs to be a reference to b(0).fetch = a between indices 1 and 3, as well as between indices 2 and 4. A reference to a after position 2 satisfies both requirements, as seen in sequence (3). The corresponding complete LRU trace is:

      0        1        2        3          4       5       6
    ⟨a, NF⟩  ⟨b, NF⟩  ⟨c, NF⟩  ⟨a, none⟩  ⟨d, b⟩  ⟨b, c⟩  Last    (5)
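One can verify mechanically that abcadb indeed yields complete trace (5). A sketch (the simulation helper complete_trace is our own illustration, not the paper's code):

```python
from collections import OrderedDict

def complete_trace(r, k):
    """Complete LRU trace: one (reference, touch-result) pair per reference."""
    stack = OrderedDict()                  # least recently touched block first
    c = []
    for m in r:
        if m in stack:
            stack.move_to_end(m)
            c.append((m, "none"))          # m among the k most recent
        elif len(stack) < k:
            stack[m] = True
            c.append((m, "NF"))            # stack still filling up
        else:
            stack[m] = True
            c.append((m, stack.popitem(last=False)[0]))   # eviction
    return c + ["Last"]

assert complete_trace("abcadb", 3) == [
    ("a", "NF"), ("b", "NF"), ("c", "NF"), ("a", "none"),
    ("d", "b"), ("b", "c"), "Last"]
```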

In the general case, we are looking for the smallest number of extra references that, if added to a behavior sequence, will turn it into a complete LRU trace. Lemma 2 can be applied to all pairs of runs contained within one another to give a set of constraints for the extra references. In this way, we obtain for every block a set of constraints represented as intervals. The meaning of every constraint is that a reference to the block must appear within the given interval. (In the following, we will not distinguish between a constraint and the interval representing it.) As we saw in our example, several constraints can be satisfied by adding a single reference.

We can recast the problem in a more general form: given intervals (s0, f0), ..., (sk, fk), compute a minimal set of points S, such that for each (si, fi), there exists p ∈ S with si < p < fi. It is not hard to show that the problem is almost identical to the activity-selection problem (e.g., [CoLR90], p. 330): given activities represented as intervals, compute the maximum-size set of non-overlapping activities. The extra requirement in our case is that we also want to compute the intersection of intervals that a certain point in the above minimal set will satisfy. Because of this requirement, a slightly different algorithm than that presented in [CoLR90], p. 330, is more suitable.

Clearly, a reference satisfies constraints (s0, f0), ..., (sk, fk) iff it belongs to the interval (max{s0, ..., sk}, min{f0, ..., fk}). The interval is empty if max{s0, ..., sk} ≥ min{f0, ..., fk}. Examining the constraints in increasing starting-index order (so that max{s0, ..., sk} = sk) leads to the following lemma:

Lemma 3 Consider all constraints for a single block (assuming at least one exists), represented as intervals (s0, f0), ..., (sn, fn) with si < sj for i < j.

1. The least number of extra references required to satisfy all such constraints is equal to #e − 1, where e is the sequence with e(0) = 0 and e(i+1) is the least index such that s(e(i+1)) ≥ min{f(e(i)), ..., f(e(i+1)−1)} or, if none exists, e(i+1) = n + 1 and e(i+1) is the last element of the sequence e.

2. The corresponding intervals (#e − 1 in total) where the extra references must lie are (s(e(i+1)−1), min{f(e(i)), ..., f(e(i+1)−1)}), for 0 ≤ i < #e − 1. For a sequence seq and a block identifier bl, we call the set of these intervals int_seq,bl.

Proof by simple induction (omitted). Very similar in spirit to Theorem 17.1 in [CoLR90], p. 332.

We will also use the name conjunctive constraints for the elements of int_seq,bl (again, we may drop any subscripts whose value is clear from the context). Lemma 3 essentially describes an algorithm for computing a minimal set of intervals where references satisfying all constraints must lie (the same algorithm can be applied to the activity-selection problem). The algorithm gives a lower bound for the set of extra references required to produce a reduced trace. Notice that it is not clear that there is always a trace with no more extra references than these specified by the lower bound. Also, the procedure does not determine the exact place in an interval where extra references need to be inserted, or the relative positions of extra references. The algorithm described in the next section ensures that the lower bound is tight and provides a very efficient way of computing a reduced trace (in essence, without having to recognize long runs, which would be computationally costly).
2.3 The Algorithm

We will present an algorithm to solve the LRU trace reduction problem. The algorithm takes a behavior sequence as input (derived from a reference trace through simple LRU simulation) and outputs a reduced trace. Additionally, it uses a data structure queue, which is an LRU stack augmented by two operations:

- blocks_after(block): returns the set of blocks touched less recently than block, but still in the data structure (i.e., within the last k distinct blocks touched). If block has the special value NF, the returned set is empty (this is useful for uniform treatment of the boundary case where the structure is being filled up).
- more_recent(block1, block2): returns a boolean value indicating whether block1 was touched more recently than block2. If block1 has the special value LowerLimit, or block2 has the special value NF, False is returned.

We use the form queue.operation to designate the invocation of an operation on queue. The input sequence, behavior, is represented as an array for simplicity. The full algorithm (called olr, for optimal LRU reduction) appears below:


olr(behavior, k)
 1  lookahead ← 0; current ← 0
 2  fetched_in_future ← ∅; previous_evict ← LowerLimit
 3  queue ← empty(k)                  ▹ empty(k) denotes an empty LRU stack of size k
 4  while behavior[current] ≠ Last do
 5      must_touch ← queue.blocks_after(behavior[current].evict)
 6      lookahead_done ← False
 7      while behavior[lookahead] ≠ Last and ¬lookahead_done do
 8          if behavior[lookahead].evict ∈ fetched_in_future then
 9              lookahead_done ← True
10          else if queue.more_recent(previous_evict, behavior[lookahead].evict) then
11              produce_reference(behavior[lookahead].evict)
12          must_touch ← must_touch \ {behavior[lookahead].evict}
13          previous_evict ← behavior[lookahead].evict
14          fetched_in_future ← fetched_in_future ∪ {behavior[lookahead].fetch}
15          lookahead ← lookahead + 1
16      for x ∈ must_touch do
17          produce_reference(x)
18      produce_reference(behavior[current].fetch)
19      fetched_in_future ← fetched_in_future \ {behavior[current].fetch}
20      current ← current + 1

produce_reference(block)
 1  queue.touch(block)
 2  output(block)

produce reference(block) 1 queue:touch(block) 2 output(block) olr essentially outputs the fetch parts of pairs in the behavior sequence, interspersed with references to elements in the stack. It works by advancing two pointers (current, lookahead) through the input (behavior sequence). After each iteration of the loop in lines 7-15, the lookahead pointer has advanced to the end of the rst tight run beginning at or after position current. With each iteration of the loop in lines 4-20, the current pointer advances to the next pair in the behavior sequence. The output (reduced) trace is created through successive calls to function produce reference (lines 11, 17, and 18). Before a reference to a block is output, it gets applied to the queue data structure. This ensures that queue re ects the state of an LRU stack of size k, to which the output trace is being applied. During the execution of the algorithm, two sets of blocks are maintained: fetched in future contains the set fbehavior[i]:fetch : current  i < lookaheadg (this property holds everywhere other than between lines 14 and 15). The must touch set is initialized to contain the blocks touched less recently than behavior[current]:evict. Finally, previous evict is introduced for convenience: it holds the same block as behavior[lookahead ? 1]:evict, except in the rst iteration when it holds the value LowerLimit. An important observation concerns the if statement in lines 10-12 of the algorithm: Observation 1 After the execution of lines 10-12 the following property holds: if current  i < j  lookahead, then behavior[j ]:evict was touched more recently than behavior[i]:evict (in the current state of queue). That is, lines 10-11 make sure (by adding references to blocks) that eviction order matches touch order by increasing recency (up to the lookahead point). Line 12 removes any touched elements from the must touch set (if the element was in the set). 
It is illustrative to follow the execution of the algorithm's outer loop (lines 4-20) in a simple example (for a stack of size k = 7). Consider the following state of execution (the contents of queue appear ordered by recency of touches, with the rightmost element being the most recently touched):

    queue = a, b, c, d, e, f, g

    behavior = ... ⟨h, b⟩ ⟨i, c⟩ ⟨j, f⟩ ⟨k, d⟩ ⟨l, g⟩ ⟨b, i⟩ ...
                      ↑             ↑
                   current       lookahead

    fetched_in_future = {h, i}
    must_touch = {a}

The state shown is reached after the execution of line 5 (computation of the must_touch set). This set represents the blocks in the stack that need to be referenced before behavior[current].fetch. The loop in lines 7-15 will iterate till lookahead reaches the end of the first tight run after position current (that is, till lookahead passes the ⟨b, i⟩ element). For each element that lookahead visits (namely ⟨j, f⟩, ⟨k, d⟩, ⟨l, g⟩, ⟨b, i⟩), the algorithm examines the blocks identified by their evict parts. Extra references are added to out-of-order blocks, so that Observation 1 holds. Note that once a block is referenced, it becomes the most recently touched block. Hence, the check of line 10 will always succeed till the end of the next tight run, causing all other blocks examined in the course of the inner loop (lines 7-15) to also be referenced. In our example, the output would be d, g.

After the end of the inner loop, lines 16-17 will add references to all remaining blocks in the must_touch set. Finally, a reference for the fetch part of the current behavior element is output. This completes the output sequence for this iteration of the outer loop. The complete output and final state after the execution of line 20 are:

    queue = c, e, f, d, g, a, h

    behavior = ... ⟨h, b⟩ ⟨i, c⟩ ⟨j, f⟩ ⟨k, d⟩ ⟨l, g⟩ ⟨b, i⟩ ...
                             ↑                           ↑
                          current                    lookahead

    fetched_in_future = {i, j, k, l, b}
    output = ..., d, g, a, h

Note that the queue structure is not fully ordered according to eviction order (for instance, there is an e between the c and f). The need to add just enough references to preserve some order in the structure is what makes the LRU trace reduction problem challenging. Related problems (e.g., computing the minimum reference trace to convert an LRU stack from one configuration to another) have trivial solutions (which cannot be translated into a solution for the reduction problem).

2.4 Algorithm Correctness

To argue that the algorithm is correct, we must show that the behavior of the produced trace is identical to the algorithm's input and that no shorter trace can have that behavior. Due to its length, the proof can be found in the Appendix. Nevertheless, we will try to give some intuition on the proof development here. Proving that the behavior of the output is identical to the input is straightforward. We will concentrate on the more interesting task of proving that the produced trace is minimal:

- Extra references produced by lines 11 and 17 of the algorithm correspond to non-tight runs. Additionally, in both cases the references satisfy the requirements of Lemma 3: the next constraint on the block referenced has s(e(i+1)) ≥ min{f(e(i)), ..., f(e(i+1)−1)} (and thus cannot be satisfied by the same reference as the previous constraints). This guarantees that the extra references are as few as possible.
- The previous steps are meaningless if we cannot establish a correspondence between non-tight runs in the behavior sequence and in the produced trace. In particular, the difficulty is that the algorithm adds references to a behavior trace to produce the reduced trace. These references can introduce new runs, and these runs entail more constraints (in the sense of Lemma 3). Fortunately, we can prove that the new constraints do not affect the computation of conjunctive constraints (i.e., do not increase the lower bound).

Proving the latter point is quite interesting. Part of the value of the olr algorithm lies in its efficiency: the algorithm only needs to examine at most k references at any given time. Thus, it does not explicitly detect the endpoints of long non-tight runs. This is, for instance, reflected in lines 16-17, where new references are output in any relative order (i.e., without regard to the subsequent eviction order of the blocks). As it turns out, this order does not matter: extra references to the same blocks need to be inserted in the future before the blocks are actually evicted. Lines 10-11 ensure that the order of final references is such that no new non-tight runs are introduced. Note that this freedom in the relative ordering of extra references by line 17 results in multiple traces that are all solutions to the trace reduction problem².

2.5 Time and Space Requirements

We will first examine the running time of olr. Its asymptotic complexity depends on the complexity of operations on our augmented LRU stack (the queue data structure). Assume that h(k) is the complexity of operations touch and more_recent for a stack of k elements. It is then reasonable to assume that the complexity of operation blocks_after is O(h(k) + n), if n is the size of the returned set. For most practical applications, h will be the logarithm function and the queue structure will be implemented as an augmented self-balancing tree. Another factor is the cost of maintaining simple sets (fetched_in_future and must_touch) under additions of a single element and deletions. It is quite reasonable to assume that this cost is also bounded by h(k).

Under the above assumptions, the running time of olr is O(m · h(k) + n), where m is the size of the input (behavior sequence) and n the size of the output (the inequality m ≤ n ≤ m · k holds). This is not hard to see: the loop in lines 7-15 of the algorithm will be executed exactly m times, and the complexity of every operation is at most h(k), for a total running time of O(m · h(k)). The outer loop (lines 4-20) will also be executed m times, and the total cost of the m blocks_after operations is O(m · h(k) + n) (the total size of the sets returned cannot be more than the length of the output, since a reference is produced for each element). Similarly, the running time of the loop in lines 16-17 is O(n), and the cost of lines 18 and 19 is O(m · h(k)). Thus the total running time of olr is O(m · h(k) + n).

Note that the input of olr (the behavior sequence) is produced by applying a reference trace to an LRU stack. If we assume h(k) to also be the cost of maintaining a simple LRU stack of k elements (i.e., one with only the operation touch), then for a reference trace of length l, producing the behavior sequence has a runtime cost of O(l · h(k)). Thus, producing the behavior sequence is generally much more costly than applying olr.

Another interesting aspect of olr is that the size of its working storage is bounded by k (i.e., all structures used, including the portion of the input being examined, have at most k elements). This makes the algorithm ideal for online implementations (i.e., reduction can take place while a trace is being produced).

3 Applications

We implemented the reduction algorithm and applied it to actual memory reference traces. The reduced traces were then used in experiments with adaptive replacement policies based on LRU and proved quite valuable. These results provide a good preliminary indication of the applicability of the algorithm.

Traces in our experiment came from six actual programs³ ("espresso", "gcc", "grobner", "ghostscript", "lindsay", and "p2c"). The traces were collected using the VMTrace utility [Kapl98]. The original traces were already in a "blocked" form (i.e., they contained references to pages and not to individual addresses, and contiguous references to the same page were replaced by a single reference). The sizes of the blocked traces appear in the table below:

    Trace name   Original size (KBytes)      Trace name   Original size (KBytes)
    espresso     2,288,568                   gcc-2.7.2    262,670
    grobner      54,514                      gs3.33       940,603
    lindsay      865,835                     p2c          215,057

²There are, however, solutions that cannot be produced by olr (due to its bounded lookahead).
³These traces are the ones to which we had immediate access for our initial experiments. Many more traces have been reduced since, but the above results are highly typical.


Figure 1: Reduction results

We reduced the traces for four different stack sizes (k = 5, 10, 20, and 40). These sizes are quite conservative, given the memory capacities of modern machines, which are in the thousands of pages. (Also compare to the memory sizes required for accurate simulation using the technique in [GlCa97], which are on the order of hundreds of pages.) The results of the reduction appear in Figure 1. The values depicted are the ratios of the size of the original trace divided by the size of the reduced trace. For reference purposes, we also present the same metric for the length of the behavior sequence.

It is worth noting that for these traces and the stack sizes chosen, the size of the behavior sequence (in KBytes) is very close to the size of the reduced trace. Since the behavior sequence contains two block references per eviction (one for the fetched and one for the evicted block), we see that the ratio of extra references added by olr to references that cause evictions is close to 1. (Strictly speaking, this ratio does get to be as high as 3 in some of our experiments. All these cases, however, were for a stack of size 40 (e.g., see "grobner" in Figure 1), where the input trace has already been dramatically reduced.)

As seen from Figure 1, the reduction results vary widely between programs. This is quite expected, since

reduction depends on the temporal locality of reference for the given program. Even for the smallest stack size (5 pages), the reduced trace was about 15 times smaller than the original (and even 60 times in the case of "espresso"). The worst reduction ratio was still above 6 (i.e., the reduced trace was one sixth the size of the original). For larger stack sizes, almost all traces were reduced by at least a factor of 100, and in some cases dramatically more. As we will discuss in the next section, these results compare very favorably to other reduction techniques in the literature.

Recall that the reduced trace has exactly the same behavior as the original for LRU stacks of size k or larger. Perhaps more importantly, the traces can be used for simulations of replacement policies that are not strictly LRU. In collaboration with Paul Wilson and Scott Kaplan of the University of Texas, we have already experimented with various adaptive replacement policies. All the policies we examined are not LRU but, nevertheless, use an LRU algorithm for the first few positions of the buffer. The benefits in this case were great: the policies we studied are not stack algorithms. Hence, separate simulations would be necessary for different memory sizes. Time and storage constraints would make this virtually impossible given the sizes of the original traces. Simulation was, however, quite feasible using reduced traces.

Note that we did not discuss the effects of standard file compression applied to the reduced traces. This would result in an even smaller representation of the trace. For instance, the six traces above, reduced for a stack of size 5, can be compressed by another factor of 10 (the actual average is 9.52, with a small deviation) using the standard utility gzip. Of course, the traces need to be decompressed before they are used in simulations, so the compression gains translate only into storage savings, not simulation speedup.
In general, since the output of olr is itself a trace, all standard trace reduction techniques can be used to reduce it further. Such techniques could, for instance, exploit the spatial locality of the reference trace (e.g., [AgHu90]) and result in an extra significant factor of reduction.
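Spatial-locality techniques of this kind typically store each address as a delta from its predecessor, which makes local traces highly compressible by a general-purpose compressor. The sketch below illustrates only the general idea, not the record format of any published system, and uses hypothetical addresses:

```python
import struct
import zlib

def delta_encode(addresses):
    """Replace each address by its (signed) difference from the previous one,
    then compress; spatially local traces yield long runs of small deltas."""
    deltas, prev = [], 0
    for a in addresses:
        deltas.append(a - prev)
        prev = a
    raw = struct.pack(f'{len(deltas)}q', *deltas)   # 64-bit signed deltas
    return zlib.compress(raw)

def delta_decode(blob, count):
    """Invert delta_encode: decompress, then prefix-sum the deltas."""
    deltas = struct.unpack(f'{count}q', zlib.decompress(blob))
    out, prev = [], 0
    for d in deltas:
        prev += d
        out.append(prev)
    return out

trace = [0x1000, 0x1004, 0x1008, 0x2000, 0x2004]    # hypothetical addresses
assert delta_decode(delta_encode(trace), len(trace)) == trace   # lossless
```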

4 Related Work

There are numerous pieces of work on various aspects of the LRU replacement policy and LRU-based data structures. Indicatively, we mention Sleator and Tarjan's well-known results [SlTa85] and some more recent treatments of LRU competitive ratio issues [BIRS95, ChNo98]. Nevertheless, to the best of our knowledge, the LRU trace reduction problem has not been studied before in a formal algorithmic setting. Empirical results on trace reduction, on the other hand, abound in the literature. We will examine the approaches we feel are most relevant to our discussion. The recent survey of trace-driven simulation by Uhlig and Mudge [UhMu97] is a good further reference. The need to store reference trace information in a reduced form was recognized early on. One of the earliest ideas essentially amounted to storing the behavior sequence for an LRU stack of size k instead of the complete trace [CoRa70]. As we saw in our experiments, the size of the behavior sequence can be much smaller than that of the original trace, even for small values of k. Using the behavior sequence, exact simulations for memory sizes above k can be performed. The behavior sequence, however, has a different format from traces (fetch-evict pairs instead of references). Thus, the simulator, as well as any other tools (e.g., trace browsers), will need to change to be able to process the behavior format. This is a practical burden to simulator implementors and makes it hard to distribute traces in a compatible form. This is the main reason why this simple technique has not become more widespread. Various differential encoding techniques combined with standard compression have been used to compact reference traces. Most notably, the Mache [Samp89] and PDATS [JoHa94] systems exploit spatial locality in the reference trace to encode it differentially. Subsequently, standard text compression techniques (e.g., Lempel-Ziv) can result in further reduction of its size.
Neither of these techniques achieves the compression ratios (and especially the simulation speedups) that we have indicated in the previous section. On the other hand, they can be used to accurately reconstruct a trace for all purposes. An approach that is conceptually related to ours is that of Smith [Smit77]. Smith's stack deletion method consists of only keeping references that cause evictions from an LRU stack of size k. Using our

terminology, stack deletion results in a reference trace that consists of the fetch parts of all records in a behavior sequence. The assumptions of this technique are similar to ours: the simulated memory size must be greater than k, or the results are completely unreliable. The problem with this trace reduction approach (as well as with several others; see [UhMu97]) is that error is introduced in the simulation even for memory sizes greater than k. In general, our approach is quite beneficial if its preconditions are satisfied. That is, for LRU-based simulations with a stack of size larger than that used in reduction, our technique offers very high reduction ratios. We quote from the survey of Uhlig and Mudge [UhMu97]: "If fast and exact simulation results are required, the best trace-reduction methods are limited to size-reduction factors of about 10." We believe that there are several applications for which the preconditions are true and our technique can offer significantly higher compression factors. Recall that the simulated policy does not have to be strict LRU. It is enough for the first k blocks of the buffer to be managed in an LRU fashion. Many experiments in trace-driven simulation concern policies that exhibit an LRU character for the first part of their buffers (e.g., [FeLW78, GlCa97]). Finally, our work may be quite competitive in a whole different field: that of producing artificial traces that satisfy certain guarantees. There has been work (e.g., [Baba81]) on producing traces that are deemed "realistic enough" for simulations because they have the same LRU behavior as actual traces. Given the high reduction ratios achieved by our algorithm, it may be possible to completely replace artificial traces with drastically reduced actual traces.
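Smith's method is easy to state in code: a reference survives exactly when it misses in an LRU stack of size k (equivalently, when it is the fetch part of a behavior record). A minimal sketch of our own, using the example trace of the introduction:

```python
from collections import OrderedDict

def stack_deletion(trace, k):
    """Smith's stack deletion: keep exactly the references that miss in an
    LRU stack of size k, i.e., the fetch parts of the behavior records."""
    stack = OrderedDict()
    kept = []
    for block in trace:
        if block in stack:
            stack.move_to_end(block)        # hit: the reference is dropped
            continue
        kept.append(block)                  # miss: the reference survives
        if len(stack) == k:
            stack.popitem(last=False)       # evict the least recently used
        stack[block] = True
    return kept

print(stack_deletion("abcbacdabd", 3))      # → ['a', 'b', 'c', 'd', 'b']
```

Note that replaying this shortened trace against an LRU stack does not, in general, reproduce the original evictions, which is exactly the source of error discussed above.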

5 Conclusions and Future Work

We presented an efficient solution to a simple-to-state but challenging algorithmic problem. The result has significant practical applications in the area of trace-driven simulation. At this point, there are many possibilities for further work. From a theoretical standpoint, an interesting problem is to develop reduction techniques for other "stack algorithms". We have begun sketching a reduction algorithm for OPT (the optimal replacement policy) and we conjecture that the final algorithm will have similar characteristics to the LRU reduction algorithm. More interestingly, there is the possibility that properties of stack algorithms can be used to develop a general and efficient reduction scheme for all of them. There is also much applied work remaining. We have already experimented with a simple algorithm that yields reduced traces valid for simulations both with the LRU and with the OPT replacement policy. The reduction factors obtained are, clearly, not as high as those of olr but may still be acceptable in practice. Another interesting question concerns the accuracy of our reduced traces when used in non-LRU simulations or in LRU simulations for stack sizes smaller than k. This seems to open up possibilities for experimentation and, perhaps, theoretical work.

References

[AgHu90] A. Agarwal and M. Huffman, "Blocking: Exploiting spatial locality for trace compaction", Proc. SIGMETRICS '90, pp. 48-57.

[Baba81] O. Babaoglu, "Efficient Generation of Memory Reference Strings Based on the LRU Stack Model of Program Behaviour", Proc. PERFORMANCE '81, pp. 373-383.

[BIRS95] A. Borodin, S. Irani, P. Raghavan, and B. Schieber, "Competitive paging with locality of reference", Journal of Computer and System Sciences, 50(2), pp. 244-258 (1995).

[ChNo98] M. Chrobak and J. Noga, "LRU is better than FIFO", Proc. Symposium on Discrete Algorithms (SODA) '98, pp. 78-81.

[CoRa70] E.G. Coffman and B. Randell, "Performance Predictions for Extended Paged Memories", Acta Informatica 1, pp. 1-13 (1970).

[CoLR90] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, MIT Press, 1990.

[FeLW78] E.B. Fernandez, T. Lang, and C. Wood, "Effect of Replacement Algorithms on a Paged Buffer Database System", IBM Journal of Research and Development, 22(2), pp. 185-196 (1978).

[GlCa97] G. Glass and P. Cao, "Adaptive Page Replacement Based on Memory Reference Behavior", Proc. SIGMETRICS '97.

[JoHa94] E.E. Johnson and J. Ha, "PDATS: Lossless Address Trace Compression for Reducing File Size and Access Time", Proc. IEEE International Conference on Computers and Communications, pp. 213-219 (1994).

[Kapl98] S.F. Kaplan, "Virtual Memory Reference Tracing Using User-Level Access Protection", unpublished manuscript.

[MGST70] R.L. Mattson, J. Gecsei, D.R. Slutz, and I.L. Traiger, "Evaluation Techniques for Storage Hierarchies", IBM Systems Journal 9, pp. 78-117 (1970).

[Samp89] A.D. Samples, "Mache: No-Loss Trace Compaction", Proc. SIGMETRICS '89, pp. 89-97.

[SlTa85] D.D. Sleator and R.E. Tarjan, "Amortized Efficiency of List Update and Paging Rules", Communications of the ACM 28(2), pp. 202-208 (1985).

[Smit77] A.J. Smith, "Two Methods for the Efficient Analysis of Memory Address Trace Data", IEEE Transactions on Software Engineering SE-3(1) (1977).

[UhMu97] R.A. Uhlig and T.N. Mudge, "Trace-Driven Memory Simulation: A Survey", ACM Computing Surveys 29(2), pp. 128-170 (1997).

A Appendix: Algorithm Correctness Proof

We must show that the behavior of the produced trace is identical to the algorithm's input and that no shorter trace can have that behavior. The first part is straightforward:

Theorem 1 The behavior of the output reference trace of olr (for an LRU stack of size k) is identical to its input.

Proof: The outer loop of the algorithm (lines 4-20) causes current to range over all input elements and output a reference to the fetch part of each one (line 18). Consider a single iteration of the loop and the corresponding segment of the output. All other output references (lines 11, 17) are guaranteed to be to blocks that are in the stack when the output up to this point is applied to an LRU stack (recall that queue reflects the state of an LRU stack of size k, to which the output trace is being applied). All blocks in the stack that are less recent than behavior[current].evict belong to the must_touch set (line 5). These get touched (loop in lines 16-17) before the reference to behavior[current].fetch is output. Hence, when the output trace is applied to an LRU stack of size k, the only reference causing an element to be fetched into memory is that to behavior[current].fetch. The corresponding evicted element (least recently touched) is behavior[current].evict. Thus the behavior of the output trace is identical to the input of olr. □

Showing that the output trace is as short as possible is a little more interesting. To do this, we need to express some properties of the prefixes of the output trace. We will name output_p the smallest prefix of the output sequence that contains exactly p extra references. (Recall that extra references are the ones not causing evictions, and, thus, are only produced by lines 11 and 17 of the olr algorithm.) With each such prefix, we can associate a state of olr (the point in the execution right after the last element of output_p was produced). Then we can define current_p and lookahead_p to be the values of current and lookahead for this state.

Consider the sequence partial_p formed by concatenating the complete LRU trace of output_p with the sequence behavior[current_p, ...] (i.e., the contiguous subsequence of behavior, beginning at current_p). This sequence represents the partial result of the algorithm up to a specific point. To make evident the correspondence between the behavior sequence and its part in partial_p, we introduce the function trans_p, with trans_p(i) = i − current_p + #output_p, for i ≥ current_p (i.e., trans_p maps an index in behavior to the corresponding index in partial_p). (Again, we will drop the subscripts when our discussion concerns a single value of p.) A simple observation is the following:

Lemma 4 Every extra reference produced by olr satisfies a constraint resulting from a non-tight run in the partial_p sequence.

Proof: Recall that the queue data structure reflects the state of an LRU stack, to which output_p has been applied. The condition in line 10 of the algorithm is true iff the relative order of the latest references to two blocks does not match their relative eviction order. In this case, the run for the less recently referenced block contains the run for the more recently referenced block. Thus, every time an extra reference is output by the code of line 11, it satisfies a constraint resulting from a non-tight run. Similarly for line 17, a reference to block bl is produced iff bl was referenced less recently than the immediately next evicted element (behavior[current].evict). Thus, the run for the referenced block contains the run for behavior[current].evict and the reference satisfies a non-tight run constraint. □

In fact, something more can be said:

Lemma 5 For the partial_p sequence, consider all constraints on the extra references to a block, (s_0, f_0), ..., (s_n, f_n), ordered by starting indices of their intervals (just like in Lemma 3). If an extra reference produced by olr satisfies a set of constraints S, then let k = max{i : (s_i, f_i) ∈ S}. It is either k = n or s_{k+1} ≥ min{f_i : (s_i, f_i) ∈ S}.

Proof: First note that, by Lemma 1, there can be no non-tight runs in the complete LRU trace of output_p. Hence, any constraint (s, f) will have f ≥ #output. Assume k ≠ n. We consider first the references produced by line 11 of the algorithm. These satisfy a constraint resulting from a run ending at position trans(lookahead), which contains a run ending at position f_l = trans(lookahead − 1). There are three cases for constraint (s_{k+1}, f_{k+1}): s_{k+1} < #output, or s_{k+1} ≥ #output and f_{k+1} < lookahead, or s_{k+1} ≥ #output and f_{k+1} ≥ lookahead. In the first case we have that f_{k+1} ≤ #output and, hence, the extra reference will satisfy this constraint (contradiction). The second case is also impossible, since lookahead advances only till the end of the first tight run after current (and trans(current) = #output). The third case means that s_{k+1} > f_l ≥ min{f_i : (s_i, f_i) ∈ S}. Similarly for line 17, the extra references produced satisfy a constraint ending at position #output. We have two cases for constraint (s_{k+1}, f_{k+1}): s_{k+1} < #output, or s_{k+1} ≥ #output. The first case leads to contradiction in the same way as above. The second case means s_{k+1} ≥ #output ≥ min{f_i : (s_i, f_i) ∈ S}. □

Note that the previous two lemmata constitute steps in the right direction. They are, however, quite weak: they prove that the algorithm is satisfying the conjunctive constraints of sequence partial_p. This is meaningless by itself: partial_0 is obviously the same sequence as behavior, but partial_p changes in time while the algorithm is being executed. We need some way of arguing that new constraints introduced by extra references do not significantly affect the lower bound computed using Lemma 3 for sequence partial_p. The upcoming theorem does exactly that.

Theorem 2 The cardinality of all sets int_{partial_p, bl} (written #int_{partial_p, bl}) is a non-increasing function of p. Additionally, if partial_p(#output_p − 1) = bl, then #int_{partial_p, bl} < #int_{partial_{p−1}, bl}.

The theorem essentially states that producing an extra reference does not increase the lower bound computed in Lemma 3 for other blocks, and decreases it for the referenced block.

Proof: Every time an extra reference to block bl is output, the conjunctive constraint in int_{partial_{p−1}, bl} with the least starting index is satisfied. This is a straightforward consequence of Lemma 5. To prove the theorem we need to show that any extra constraints that are created do not result in new conjunctive constraints (i.e., can be satisfied by the same references as existing ones). Consider the state of the algorithm right after p extra references have been produced. We will distinguish the two cases where the p-th reference was introduced by line 11 or by line 17.

- Extra reference produced by line 11: A reference to bl is output iff behavior[lookahead_p].evict = bl and behavior[lookahead_p − 1].evict was touched more recently than bl. A new run is created in partial_p between point #output_p and trans_p(lookahead_p) (also an old run is removed, but removing a run clearly cannot increase the lower bound computed by Lemma 3). The new run cannot contain any other runs (since lookahead only advances till the end of the next tight run). Hence, no new constraints for bl are introduced. The run could, however, be contained in another run, resulting in a new constraint for some other block. We show the general form of the partial_p sequence schematically below (y is the newly introduced reference, x is the block whose set int_{partial_p, x} may change):

          h              i                    j           j+1            k               l
    ... ⟨x, *⟩ ...    ⟨*, *⟩    ...    ⟨y, none⟩    ⟨*, *⟩  ...    ⟨*, y⟩  ...  ⟨*, x⟩ ...
                 ↑ current_{p−1}       ↑ current_p                 ↑ lookahead_p

(We use the star symbol (*) as a "don't care" value for blocks.) Note that it must be lookahead_{p−1} < lookahead_p = k (whenever line 11 outputs a reference, lookahead advances). Recall that lookahead stops at the ending index of the tight run immediately following current. Hence, there is a tight run in the interval (i+1, k−1). Consider now the construction of the set int_{partial_p, x} of conjunctive constraints, as in Lemma 3. No constraints can exist with a right endpoint before point j+1 in partial_p (due to Lemma 1). Hence, the constraint induced by the newly created run will be grouped together with the constraint for run (i+1, k−1) (and possibly others) to form a conjunctive constraint. Since the upper endpoint of the conjunctive constraint is of the form min{f_0, ..., f_{e(1)−1}}, the newly introduced constraint will not affect it (since its endpoint k is greater than another endpoint k−1). Thus the introduction of a new reference cannot increase the cardinality of set int_{partial_p, x}.

- Extra reference produced by line 17: A reference to bl is output iff behavior[current_p].evict was touched more recently than bl. The new run (s_bl, f_bl) that is created is a sub-interval of the old run on bl (it begins at point #output_p and ends at the old endpoint). Thus, if it contains any runs, these do not introduce new constraints for bl. The run could, however, be contained in another run, resulting in a new constraint for some other block. In that case, lookahead_p < f_bl (if lookahead had reached f_bl, a reference to bl would have been output by line 11 and bl would have been removed from the must_touch set). Hence, there is a tight run (s_tight, f_tight) beginning at or after point current_p and ending at or before point lookahead_p. This run is contained in the newly introduced run (s_bl, f_bl). Consider a constraint introduced because (s_bl, f_bl) is contained in another run. A more restrictive constraint will be introduced for (s_tight, f_tight). Thus the computation of conjunctive constraints cannot be affected by the new run (and no increase in the cardinality of a set int_{partial_p, bl'}, for some bl', can result). □

This theorem concludes the minimality proof for the output of olr. Since partial_0 = behavior, we have that int_{partial_0, bl} = int_{behavior, bl} for every block bl. Producing extra references will then only decrease the cardinality of one of the int sets and leave the others the same. Thus the algorithm will not output more extra references than those prescribed by the lower bound of Lemma 3.
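The minimality result can also be sanity-checked by brute force on the example of the introduction: for k = 3, no trace over {a, b, c, d} with fewer than the six references of sequence (3) reproduces behavior (2). The script below is ours and purely for validation; the simulator is the obvious one-pass LRU:

```python
from collections import OrderedDict
from itertools import product

def lru_behavior(trace, k):
    """Record one (fetch, evict) pair per miss; 'NF' = stack not yet full."""
    stack, behavior = OrderedDict(), []
    for block in trace:
        if block in stack:
            stack.move_to_end(block)        # hit: only the recency changes
            continue
        evicted = 'NF' if len(stack) < k else stack.popitem(last=False)[0]
        stack[block] = True
        behavior.append((block, evicted))
    return behavior

target = lru_behavior("abcbacdabd", 3)      # behavior (2) of the introduction
assert lru_behavior("abcadb", 3) == target  # reduced trace (3) matches it

# Exhaustively check every strictly shorter trace (lengths 1 through 5).
shorter = (seq for n in range(1, 6) for seq in product("abcd", repeat=n))
assert all(lru_behavior(seq, 3) != target for seq in shorter)
print("no trace shorter than 6 references has behavior (2)")
```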
