Optimal Trace Reduction for LRU-based Simulations

Yannis Smaragdakis
University of Texas at Austin

Abstract
LRU is a well-known buffer management policy under which an element buffer of size k always stores the k most recently used elements. Many variants of the policy are widely used in memory systems (e.g., virtual memory subsystems). We study a simple-to-state algorithmic problem called the LRU trace reduction problem: for a sequence of references to elements (a reference trace), compute the shortest sequence with identical LRU behavior for buffers of at least a given size. Despite the straightforward statement of the problem, its solution is non-trivial. We offer an efficient algorithm with important practical applications in performance analysis.
1 Introduction
An LRU stack of size k is a data structure storing the k most recently accessed elements of a larger set. References to set elements may cause the subset stored in an LRU stack to change. A full record of these changes is said to constitute the (LRU) behavior of the stack under the sequence of references. We will discuss the LRU trace reduction problem: given a sequence of references, find the shortest sequence with identical behavior for an LRU stack of (at least) a given size.

As a simple example, consider a set with four elements (a, b, c, and d) and the reference sequence:

    abcbacdabd    (1)

For an LRU buffer of size k = 3, the behavior of the sequence is:

    ⟨a, NF⟩, ⟨b, NF⟩, ⟨c, NF⟩, ⟨d, b⟩, ⟨b, c⟩, Last    (2)

The above behavior consists of pairs of elements entering and leaving the LRU stack. The special value Last signals the end of the sequence, while NF denotes that the LRU stack is not full and, hence, the insertion of one element does not cause the eviction of another. It is easy to confirm that the sequence:

    abcadb    (3)

has identical LRU behavior to (1) for a stack of size k = 3 (or greater). In fact, there is no sequence shorter than (3) with the same LRU behavior. Hence, sequence (3) is a solution to the LRU trace reduction problem for input (1) and a stack size of 3. In this paper we will describe an algorithm for computing such solutions, prove that it is correct, and analyze its running time. The algorithm has already been applied in the area of trace-driven simulation for storage hierarchies and was evaluated experimentally in a previous paper [KSW98].

Storage hierarchies are used in various parts of modern computer systems (CPU, disk, and network caches, for instance). In such hierarchies, one level acts as a fast buffer that stores elements from the next level of storage. When the buffer is full, a decision needs to be made about the buffer element that will get replaced.
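The example above can be checked mechanically. The following is a sketch for illustration only (the function name lru_behavior and the list-based stack are our own, not the paper's); it computes the behavior of a trace as a list of (fetch, evict) pairs, omitting the terminating Last marker:

```python
# Sketch (not from the paper): compute the LRU behavior of a reference
# trace for a stack of size k, as a list of (fetch, evict) pairs.
# "NF" stands for "stack not full"; the terminating Last marker is omitted.
def lru_behavior(trace, k):
    stack = []          # least recently used first, most recently used last
    events = []
    for block in trace:
        if block in stack:
            stack.remove(block)                   # resident: no relevant event
        elif len(stack) < k:
            events.append((block, "NF"))          # stack not yet full
        else:
            events.append((block, stack.pop(0)))  # evict the least recent
        stack.append(block)                       # block is now most recent
    return events

# The original trace (1) and the reduced trace (3) behave identically:
assert lru_behavior("abcbacdabd", 3) == lru_behavior("abcadb", 3)
assert lru_behavior("abcadb", 3) == [
    ("a", "NF"), ("b", "NF"), ("c", "NF"), ("d", "b"), ("b", "c")]
```

The assertions confirm that traces (1) and (3) are indistinguishable to an LRU stack of size 3 (and, by a property of stack algorithms discussed in Section 2, of any larger size).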
The Least Recently Used (LRU) policy (to which the "LRU stack" data type owes its name) is the most extensively studied replacement policy for storage hierarchies. Under LRU, the element replaced is the one used least recently (i.e., furthest in the past); the buffer is essentially an LRU stack. Replacement policies can be studied cost-effectively and without interference from other system components using trace-driven simulation: reference traces from actual programs are collected and re-used in simulations of various policies. The size of such traces is usually enormous. As a result, traces are often hard to store and process (simulation time is
commonly proportional to the length of the trace). Using our algorithm to solve the LRU trace reduction problem, traces can be drastically reduced in size while the results of the simulation remain entirely accurate, provided the simulated LRU stack is of at least a given size. As we discuss in [KSW98], our algorithm is preferable to other trace reduction alternatives in many kinds of virtual memory trace-driven simulations.

Most actual or proposed replacement policies are either variants or approximations of LRU. Variants of LRU (e.g., GLRU [FeLW78], SEQ [GlCa97], FBR [RoDe90], EELRU [SKW98]) handle a small part of the buffer using LRU and the rest using a different policy. Under the guarantees offered by our algorithm, simulation with reduced traces is exact for all such policies (provided that the LRU part of the buffer is larger than the reduction stack size of k elements). Even when the LRU part of the buffer is small (10 to 100 blocks), our algorithm achieves reduction factors of up to several orders of magnitude. Our algorithm is also useful when studying approximations of LRU. The most prominent approximations, used in most actual operating systems, are clock and segq (segmented queue, also known as hybrid FIFO/LRU [BaFe83] or segmented FIFO [TuLe81]). In [KSW98] we showed that the error introduced by our algorithm for approximations of LRU is small: under 2% in the number of faults in most cases. This error is smaller than that of stack deletion [Smit77], a common technique for removing high-frequency reference information from virtual memory traces.

This paper will not be further concerned with the applications of the algorithm, but with its analysis. Subsequent sections are organized as follows: Section 2 presents the background of the problem. Section 3 derives a lower bound on the number of extra references that a solution to the trace reduction problem needs to contain. The insights of this section are the essence of a solution to the trace reduction problem.
In Section 4 we describe an efficient algorithm and prove that the previously derived lower bound is tight. (Much of the proof deals with connecting the main insight with the specifics of the algorithm and is quite low-level. Therefore, that part is included separately in the Appendix.) Section 5 discusses related work, while Section 6 contains our concluding remarks.
2 Background
An LRU stack of size k is a data type defining a single operation touch_k(m), where m identifies a "block" (we assume no properties for blocks other than identity). The semantics of the operation is defined (informally) as follows: if there have been fewer than k distinct blocks "touched" in the past, touch_k(m) returns the special value NF. If m is among the k most recently "touched" blocks, touch_k(m) returns the special value none. Otherwise touch_k(m) returns an identifier of the (k+1)-th most recently "touched" block.

A reference trace is a finite sequence of references to blocks. We say that a reference trace r is applied to an LRU stack by performing the operation touch_k for each block of the trace in increasing order (i.e., touch_k(r(0)), touch_k(r(1)), ..., touch_k(r(#r − 1)), where #r denotes the length of a sequence r). We define the complete (LRU) trace of a reference trace r to be a sequence c_{k,r} with c_{k,r}(i) = ⟨r(i), touch_k(r(i))⟩, for 0 ≤ i < #r. The complete trace is terminated by a special element Last (that is, c_{k,r}(#r) = Last). We also define the relevant event indices sequence ri_{k,r} to consist of all indices i, in increasing order, such that touch_k(r(i)) ≠ none. In the following, we will drop subscripts when they can be deduced from the context (both for the above definitions and for ones still to be introduced). Thus we may often refer to touch_k, c_{k,r} and ri_{k,r} as touch, c and ri, for simplicity.

We define the behavior of a reference trace r relative to an LRU stack to be the subsequence b_{k,r} of the complete sequence c, consisting only of relevant events. That is, b(i) = c(ri(i)) = ⟨r(ri(i)), touch(ri(i))⟩, for 0 ≤ i < #ri. By analogy to the complete trace, b(#ri) = Last. Thus, the behavior of a reference trace contains all references made to blocks that were not among the k most recently touched, and the results of the respective touch operations. The first component r(ri(i)) of a pair ⟨r(ri(i)), e(ri(i))⟩ will be called the fetch part (since it corresponds to a block "fetched" into the LRU stack) and the second component e(ri(i)) will be called the evict part (since it corresponds to a block "evicted" from the stack). We will write b(i).fetch and b(i).evict for the fetch and evict parts of the i-th element of a sequence b.

Data structures conforming to the LRU stack specification are usually studied in the context of storage hierarchies, where they correspond to the Least Recently Used replacement policy. This policy is known to belong to the general class of stack algorithms [MGST70]. In fact, LRU belongs to the subset of stack algorithms defined in [MGST70] that base their replacement decisions on a priority list for elements. A property of these algorithms is that if two reference traces have identical behavior for a buffer of size k, they will also have identical behavior for all buffers larger than k.

We can now define the problem we are trying to solve:

The LRU Trace Reduction Problem: Given a reference trace, compute a trace such that it has the same behavior for an LRU stack of size k and no shorter trace has that behavior.

Due to the aforementioned property of LRU stacks, the resulting trace will also have identical behavior for larger buffers. A trace satisfying the conditions of the LRU trace reduction problem will be called a reduced trace. In general, there will be more than one reduced trace, but all of them will be of the same length.
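The definitions of c, ri, and b can be sketched in Python as follows (our own naming, not the paper's; the none result is modeled as None and the Last marker as the string "Last"):

```python
# Sketch: build the complete LRU trace c of a reference trace r, the
# relevant event indices ri (positions where touch != none), and the
# behavior b, so that b[i] == c[ri[i]].  Names are ours for illustration.
def complete_trace(r, k):
    stack, c = [], []
    for block in r:
        if block in stack:
            stack.remove(block)
            c.append((block, None))          # "none": block was resident
        elif len(stack) < k:
            c.append((block, "NF"))          # stack not yet full
        else:
            c.append((block, stack.pop(0)))  # evict least recently touched
        stack.append(block)                  # block becomes most recent
    return c + ["Last"]

r, k = "abcbacdabd", 3
c = complete_trace(r, k)
ri = [i for i in range(len(r)) if c[i][1] is not None]
b = [c[i] for i in ri] + ["Last"]
assert ri == [0, 1, 2, 6, 8]
assert b == [("a", "NF"), ("b", "NF"), ("c", "NF"),
             ("d", "b"), ("b", "c"), "Last"]
```

The assertions recover behavior (2) of the Introduction from trace (1), together with the indices of its relevant events in the complete trace.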
3 Developing a Lower Bound
Before we present an algorithm for the LRU trace reduction problem, it is useful to examine some of the characteristics of the reduced trace. A simple observation regarding the reduced reference trace is that it depends only on the behavior of the original trace: two input traces with the same behavior will have the same set of reduced traces. Hence, our problem is to compute a reduced trace from a given behavior sequence (derived from the input trace through LRU simulation).

Consider the sequence formed by taking the "fetch" parts (i.e., first elements of pairs) of the elements in a behavior sequence. Clearly, this needs to be a subsequence of any reference trace having this behavior, and, hence, it needs to be a subsequence of the reduced trace. The LRU trace reduction problem is equivalent to adding the smallest possible set of elements to the behavior sequence, so that it becomes a complete LRU trace. We will attempt to develop a lower bound for the contents of this set. Later we will show that this bound is tight: our algorithm produces a trace containing exactly the extra references specified by the lower bound.

Consider a sequence of pairs (of the form found in a complete LRU trace or a behavior sequence). Any consecutive subsequence of a sequence seq is called an interval of seq. Two concepts that are important for our later development are those of a "run" and a "tight run":

Definition 1 (run) A fetch-evict run (or just run) in a sequence of pairs of the form ⟨r(i), touch(r(i))⟩ is an interval such that:
- it begins with a pair ⟨r(s), touch(r(s))⟩ and ends either with a pair ⟨r(f), r(s)⟩ or with Last;
- no other element in the run has block r(s) in its fetch part (i.e., for every i, s < i < f, it is r(i) ≠ r(s)).

Intuitively, a run describes the behavior of an LRU stack between the point where an element is last referenced before it is evicted, and the corresponding eviction point.
We will describe intervals (and, hence, runs) using the starting and finishing indices in the sequence where they occur. A run (s0, f0) (that is, a run starting at index s0 and finishing at index f0) will be said to contain another run (s1, f1) iff s0 < s1 and f1 < f0 (note that this definition prescribes that runs extending till the end of a sequence do not contain one another).
Definition 2 (tight run) A tight fetch-evict run (or just tight run) is a run that contains no other runs.

For illustration purposes, consider the example reference trace presented in the Introduction. Its complete LRU trace appears below (indexed for easier reference):

    0: ⟨a,NF⟩  1: ⟨b,NF⟩  2: ⟨c,NF⟩  3: ⟨b,none⟩  4: ⟨a,none⟩  5: ⟨c,none⟩  6: ⟨d,b⟩  7: ⟨a,none⟩  8: ⟨b,c⟩  9: ⟨d,none⟩  10: Last

There are five runs in this trace: (3, 6), (5, 8), (7, 10), (8, 10), and (9, 10). All of them are tight. According to the following lemma, this is not a coincidence.
Lemma 1 All runs in a complete LRU trace are tight.
Proof: Assume that a run (s0, f0) in c is not tight. Then there exists a run (s1, f1) with s0 < s1 and f1 < f0. The last inequality means that c(f1) ≠ Last (because there is a later element in the trace). Hence, the result of the operation touch(r(f1)) must have been r(s1) (by definition of a run). By definition of touch, this is the (k+1)-th most recently touched element. This is, however, impossible, since r(s0) was last touched less recently than r(s1) (no reference to r(s0) can exist between indices s0 and f1, by definition of a run) and it is still among the k most recently touched elements (since it has not been evicted from the stack by point f1). Hence, every run in c is tight. □
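Lemma 1 can be sanity-checked on the example trace by enumerating runs mechanically (a sketch; the function names and the pair encoding, with None for none, are ours):

```python
# Sketch: enumerate the fetch-evict runs of a complete LRU trace, given as a
# list of (fetch, evict) pairs terminated by "Last", and check Lemma 1:
# no run strictly contains another.
def runs(c):
    out = []
    for s, (fetch, _) in enumerate(c[:-1]):
        for f in range(s + 1, len(c)):
            if c[f] == "Last" or c[f][1] == fetch:  # eviction of r(s), or end
                out.append((s, f))
                break
            if c[f][0] == fetch:                    # r(s) fetched again: no run
                break
    return out

def all_tight(rs):
    # A run (s0, f0) contains (s1, f1) iff s0 < s1 and f1 < f0.
    return not any(s0 < s1 and f1 < f0 for (s0, f0) in rs for (s1, f1) in rs)

# Complete LRU trace of "abcbacdabd" for k = 3 (trace (1) in the text):
c = [("a","NF"), ("b","NF"), ("c","NF"), ("b",None), ("a",None), ("c",None),
     ("d","b"), ("a",None), ("b","c"), ("d",None), "Last"]
assert runs(c) == [(3, 6), (5, 8), (7, 10), (8, 10), (9, 10)]
assert all_tight(runs(c))
```

The output reproduces the five runs listed above and confirms that all of them are tight.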
Continuing our example, consider the behavior sequence (2) from the Introduction (reproduced below with indices):

    0: ⟨a,NF⟩  1: ⟨b,NF⟩  2: ⟨c,NF⟩  3: ⟨d,b⟩  4: ⟨b,c⟩  5: Last    (4)
Not all runs in this sequence are tight: for instance, run (0, 5) contains both runs (1, 3) and (2, 4). To derive a complete LRU trace from a behavior sequence, we need to add "extra" references to ensure that all runs are tight. The following lemma is straightforward:

Lemma 2 If a run (s0, f0) of a behavior sequence b contains another run (s1, f1), then the reduced trace r with behavior b contains a reference to b(s0).fetch between ri⁻¹(s1) and ri⁻¹(f1). (Recall that ri was defined to be monotonically increasing, hence its inverse ri⁻¹ always exists.)

Proof: Consider the complete LRU trace c produced from r. From Lemma 1, we have that all runs in c are tight. Since b is a subsequence of c, either interval (ri⁻¹(s0), ri⁻¹(f0)) or interval (ri⁻¹(s1), ri⁻¹(f1)) is not a run in c. Quick inspection of all possible cases for reference addition shows that the only way the tightness constraint can be preserved is if there exists i such that c(i).fetch = b(s0).fetch = c(ri⁻¹(s0)).fetch, with ri⁻¹(s1) < i < ri⁻¹(f1). □
In the example of sequence (4), run (0, 5) contains run (1, 3) and run (2, 4). Lemma 2 implies that there needs to be a reference to b(0).fetch = a between indices 1 and 3, as well as between 2 and 4. A reference to a after position 2 satisfies both requirements, as seen in sequence (3). The corresponding complete LRU trace is:

    0: ⟨a,NF⟩  1: ⟨b,NF⟩  2: ⟨c,NF⟩  3: ⟨a,none⟩  4: ⟨d,b⟩  5: ⟨b,c⟩  6: Last    (5)
In the general case, we are looking for the smallest number of extra references that, if added to a behavior sequence, will turn it into a complete LRU trace. Lemma 2 can be applied to all pairs of runs contained within one another to give a set of constraints for the extra references. In this way, we obtain for every block a set of constraints represented as intervals. The meaning of every constraint is that a reference to the block must appear within the given interval. (In the following, we will identify a constraint by the interval representing it.) As we saw in our example, several constraints can be satisfied by adding a single reference.

We can recast the problem in a more general form: given intervals (s_0, f_0), ..., (s_n, f_n), compute a minimal set of points S, such that for each (s_i, f_i) there exists p ∈ S with s_i < p < f_i. It is not hard to show that the problem is almost identical to the activity-selection problem (e.g., [CoLR90], p. 330): given activities represented as intervals, compute the maximum-size set of non-overlapping activities. The extra requirement in our case is that we also want to compute the intersection of intervals that a certain point in the above minimal set will satisfy. Because of this requirement, a slightly different algorithm than that presented in [CoLR90], p. 330, is more suitable.

Clearly, a reference satisfies constraints (s_0, f_0), ..., (s_k, f_k) iff it belongs to the interval (max{s_0, ..., s_k}, min{f_0, ..., f_k}). The interval is empty if max{s_0, ..., s_k} ≥ min{f_0, ..., f_k}. Examining the constraints in increasing starting index order (so that max{s_0, ..., s_k} = s_k) leads to the following lemma:

Lemma 3 Consider all constraints for a single block (assuming at least one exists), represented as intervals (s_0, f_0), ..., (s_n, f_n) with s_i < s_j for i < j.
1. The least number of extra references required to satisfy all such constraints is equal to #e − 1, where e is the sequence with e(0) = 0 and e(i+1) is the least index such that s_{e(i+1)} ≥ min{f_{e(i)}, ..., f_{e(i+1)−1}} or, if none exists, e(i+1) = n+1 and e(i+1) is the last element of the sequence e.
2. The corresponding intervals (#e − 1 in total) where the extra references must lie are (s_{e(i+1)−1}, min{f_{e(i)}, ..., f_{e(i+1)−1}}), for 0 ≤ i < #e − 1.

For a sequence seq and a block identifier bl, we call the set of these intervals int_{seq,bl}. Proof by simple induction (omitted); very similar in spirit to Theorem 17.1 in [CoLR90], p. 332.
We will also use the name conjunctive constraints for the elements of int_{seq,bl} (again, we may drop any subscripts whose value is clear from the context). Lemma 3 essentially describes a greedy algorithm for computing a minimal set of intervals where references satisfying all constraints must lie (the same algorithm can be applied to the activity-selection problem). The algorithm gives a lower bound for the set of extra references required to produce a reduced trace. Notice that it is not clear that there is always a trace with no more extra references than those specified by the lower bound. The main problem is that adding extra references may cause new non-tight runs to be introduced. The algorithm described in the next section ensures that the lower bound is tight and provides a very efficient way of computing a reduced trace (in essence, without having to recognize long runs, which would be computationally costly).
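The greedy computation described by Lemma 3 can be sketched as follows (our naming; intervals are assumed sorted by increasing starting index, as in the lemma):

```python
# Sketch of the greedy step in Lemma 3: given the constraint intervals for a
# single block, sorted by increasing start, merge consecutive constraints
# while they still share a common point, and emit one "conjunctive
# constraint" per group.  One extra reference is needed per emitted interval.
def conjunctive_constraints(intervals):
    groups = []
    cur_s, cur_f = intervals[0]
    for s, f in intervals[1:]:
        if s >= cur_f:                  # no point satisfies both: new group
            groups.append((cur_s, cur_f))
            cur_s, cur_f = s, f
        else:                           # intersect with the running group
            cur_s, cur_f = s, min(cur_f, f)
    groups.append((cur_s, cur_f))
    return groups

# Block a in example (4): run (0,5) contains runs (1,3) and (2,4), so a must
# be referenced in (1,3) and in (2,4); one reference in (2,3) satisfies both.
assert conjunctive_constraints([(1, 3), (2, 4)]) == [(2, 3)]
assert conjunctive_constraints([(1, 3), (2, 4), (5, 9), (6, 8)]) == [(2, 3), (6, 8)]
```

In the first assertion, the single conjunctive constraint (2, 3) corresponds to the reference to a added after position 2 in the example.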
4 The Algorithm
We will present an algorithm to solve the LRU trace reduction problem. The algorithm takes a behavior sequence as input (derived from a reference trace through simple LRU simulation) and outputs a reduced trace. Additionally, it uses a data structure queue, which is an LRU stack augmented by two operations:

- blocks_after(block): returns the set of blocks touched less recently than block, but still in the data structure (i.e., within the last k distinct blocks touched). If block has the special value NF, the returned set is empty (this is useful for uniform treatment of the boundary case where the structure is being filled up).
- more_recent(block1, block2): returns a boolean value indicating whether block1 was touched more recently than block2. If block1 has the special value LowerLimit, or block2 has the special value NF, False is returned.
We use the form queue.⟨operation⟩ to designate the invocation of an operation on queue. The input sequence, behavior, is represented as an array for simplicity. The full algorithm (called olr, for optimal LRU reduction) appears below:

olr(behavior, k)
 1  lookahead ← 0; current ← 0
 2  fetched_in_future ← ∅; previous_evict ← LowerLimit
 3  queue ← ∅(k)                         ▷ ∅(k) denotes an empty LRU stack of size k
 4  while behavior[current] ≠ Last do
 5      must_touch ← queue.blocks_after(behavior[current].evict)
 6      lookahead_done ← False
 7      while behavior[lookahead] ≠ Last and not lookahead_done do
 8          if behavior[lookahead].evict ∈ fetched_in_future then
 9              lookahead_done ← True
10          else if queue.more_recent(previous_evict, behavior[lookahead].evict) then
11              produce_reference(behavior[lookahead].evict)
12              must_touch ← must_touch \ {behavior[lookahead].evict}
13          previous_evict ← behavior[lookahead].evict
14          fetched_in_future ← fetched_in_future ∪ {behavior[lookahead].fetch}
15          lookahead ← lookahead + 1
16      for x ∈ must_touch do
17          produce_reference(x)
18      produce_reference(behavior[current].fetch)
19      fetched_in_future ← fetched_in_future \ {behavior[current].fetch}
20      current ← current + 1
produce_reference(block)
 1  queue.touch(block)
 2  output(block)

olr essentially outputs the fetch parts of pairs in the behavior sequence (line 18), interspersed with "extra" references to elements in the stack. It works by advancing two pointers (current, lookahead) through the input (behavior sequence). After each iteration of the loop in lines 7-15, the lookahead pointer has advanced to the end of the first tight run beginning at or after position current. With each iteration of the loop in lines 4-20, the current pointer advances to the next pair in the behavior sequence. The output (reduced) trace is created through successive calls to function produce_reference (lines 11, 17, and 18). Before a reference to a block is output, it gets applied to the queue data structure. This ensures that queue reflects the state of an LRU stack of size k, to which the output trace is being applied.

During the execution of the algorithm, two sets of blocks are maintained: fetched_in_future contains the set {behavior[i].fetch : current ≤ i < lookahead} (this property holds everywhere other than between lines 14 and 15). The must_touch set is initialized to contain the blocks touched less recently than behavior[current].evict. These are blocks that have to be touched before the next touch to a page that is not in the LRU stack (so that behavior[current].evict will be the least recently touched page in the stack). Finally, previous_evict is introduced for convenience: it holds the same block as behavior[lookahead − 1].evict, except in the first iteration, when it holds the value LowerLimit.

An important observation concerns the if statement in lines 10-12 of the algorithm:

Observation 1 After the execution of lines 10-12 the following property holds: if current ≤ i < j ≤ lookahead, then behavior[j].evict was touched more recently than behavior[i].evict (in the current state of queue).

That is, lines 10-11 make sure (by adding references to blocks) that eviction order matches touch order by increasing recency between the current and the lookahead point. Line 12 removes any touched elements from the must_touch set (if the element was in the set).

It is illustrative to follow the execution of the algorithm's outer loop (lines 4-20) in a simple example (for a stack of size k = 7). Consider the following state of execution (the contents of queue appear ordered by recency of touches, with the rightmost element being the most recently touched):

    queue = a, b, c, d, e, f, g
    ... ⟨h,b⟩ ⟨i,c⟩ ⟨j,f⟩ ⟨k,d⟩ ⟨l,g⟩ ⟨b,i⟩ ...
    (current points at ⟨h,b⟩; lookahead points at ⟨j,f⟩)
    fetched_in_future = {h, i}
    must_touch = {a}

The state shown is reached after the execution of line 5, which computes the must_touch set. This set represents the blocks in the stack that need to be referenced before behavior[current].fetch. The loop in lines 7-15 will iterate till lookahead reaches the end of the first tight run after position current (that is, till lookahead points to the ⟨b,i⟩ element). For each element that lookahead visits (namely ⟨j,f⟩, ⟨k,d⟩, ⟨l,g⟩, ⟨b,i⟩) the algorithm examines the blocks identified by their evict parts. Extra references are added to out-of-order blocks, so that Observation 1 holds. Note that once a block is referenced, it becomes the most recently touched block. Hence, the check of line 10 will always succeed till the end of the next tight run, causing all other blocks examined in the course of the inner loop (lines 7-15) to also be referenced. In our example, the output would be d, g. After the end of the inner loop, lines 16-17 will add references to all remaining blocks in the must_touch set. Finally, a reference for the fetch part of the current behavior element is output. This completes the output sequence for this iteration of the outer loop.
The complete output and final state after the execution of line 20 are:

    queue = c, e, f, d, g, a, h
    ... ⟨h,b⟩ ⟨i,c⟩ ⟨j,f⟩ ⟨k,d⟩ ⟨l,g⟩ ⟨b,i⟩ ...
    (current points at ⟨i,c⟩; lookahead points past ⟨b,i⟩)
    fetched_in_future = {i, j, k, l, b}
    output = ... d, g, a, h ...

Note that the queue structure is not fully ordered according to eviction order (for instance, there is an e between the c and f). The need to add just enough references to preserve some order in the structure is what makes the LRU trace reduction problem challenging. Related problems (e.g., minimum reference trace to convert an LRU stack from one configuration to another) have almost trivial solutions.
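For concreteness, the pseudocode above can be transcribed into Python. This is a sketch only: queue is implemented as a plain list kept in touch order (least recent first) rather than the balanced-tree structure suggested in Section 4.2, so each operation costs O(k) instead of O(log k):

```python
# Sketch: direct Python transcription of olr.  Behavior elements are
# (fetch, evict) tuples; NF, Last, and LowerLimit are modeled as sentinels.
NF, LAST, LOWER_LIMIT = "NF", "Last", "LowerLimit"

class Queue:
    """LRU stack of size k with the two extra operations used by olr."""
    def __init__(self, k):
        self.k, self.q = k, []          # least recent first, most recent last
    def touch(self, block):
        if block in self.q:
            self.q.remove(block)
        elif len(self.q) >= self.k:
            self.q.pop(0)               # evict least recently touched
        self.q.append(block)
    def blocks_after(self, block):
        if block == NF:
            return set()
        return set(self.q[:self.q.index(block)])
    def more_recent(self, b1, b2):
        if b1 == LOWER_LIMIT or b2 == NF:
            return False
        return self.q.index(b1) > self.q.index(b2)

def olr(behavior, k):
    out, queue = [], Queue(k)
    def produce_reference(block):       # touch the queue, then emit the block
        queue.touch(block)
        out.append(block)
    lookahead = current = 0
    fetched_in_future, previous_evict = set(), LOWER_LIMIT
    while behavior[current] != LAST:
        must_touch = queue.blocks_after(behavior[current][1])
        lookahead_done = False
        while behavior[lookahead] != LAST and not lookahead_done:
            fetch, evict = behavior[lookahead]
            if evict in fetched_in_future:
                lookahead_done = True   # end of the first tight run
            elif queue.more_recent(previous_evict, evict):
                produce_reference(evict)
                must_touch.discard(evict)
            previous_evict = evict
            fetched_in_future.add(fetch)
            lookahead += 1
        for x in must_touch:
            produce_reference(x)
        produce_reference(behavior[current][0])
        fetched_in_future.discard(behavior[current][0])
        current += 1
    return out

# Behavior sequence (2) from the Introduction reduces to trace (3):
behavior = [("a", NF), ("b", NF), ("c", NF), ("d", "b"), ("b", "c"), LAST]
assert olr(behavior, 3) == list("abcadb")
```

On behavior sequence (2) with k = 3, the transcription reproduces trace (3), as expected.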
4.1 Algorithm Correctness
To argue that the algorithm is correct, we must show that the behavior of the produced trace is identical to the algorithm's input and that no shorter trace can have that behavior. Due to its length and low-level detail, the proof can be found in the Appendix. Nevertheless, we will try to give some intuition on the proof development here.

Proving that the behavior of the output is identical to the input is straightforward. We will concentrate on the more interesting task of proving that the produced trace is minimal. Extra references produced by lines 11 and 17 of the algorithm correspond to non-tight runs. Additionally, in both cases the references satisfy the requirements of Lemma 3: the next constraint on the block referenced has s_{e(i+1)} ≥ min{f_{e(i)}, ..., f_{e(i+1)−1}} (and thus cannot be satisfied by the same reference as the previous constraints). This guarantees that the extra references are as few as possible.

The previous steps are meaningless if we cannot establish a correspondence between non-tight runs in the behavior sequence and in the produced trace. In particular, the difficulty is that the algorithm adds references to a behavior trace to produce the reduced trace. These references can introduce new runs, and these runs entail more constraints (in the sense of Lemma 3). Fortunately, we can prove that the new constraints do not affect the computation of conjunctive constraints (i.e., do not increase the lower bound).
Proving the latter point is quite interesting. Part of the value of the olr algorithm lies in its efficiency: the algorithm only needs to examine at most k references at any given time. Thus, it does not explicitly detect the endpoints of long non-tight runs. This is, for instance, reflected in lines 16-17, where new references are output in any relative order (i.e., without regard to the subsequent eviction order of the blocks). As it turns out, this order does not matter: extra references to the same blocks need to be inserted in the future before the blocks are actually evicted. Lines 10-11 ensure that the order of final references is such that no new non-tight runs are introduced. Note that this freedom in the relative ordering of extra references by line 17 results in multiple traces that are all solutions to the trace reduction problem.²
4.2 Time and Space Requirements
We will first examine the running time of olr. Its asymptotic complexity depends on the complexity of operations on our augmented LRU stack (the queue data structure). Assume that h(k) is the complexity of operations touch and more_recent for a stack of k elements. It is then reasonable to assume that the complexity of operation blocks_after is O(h(k) + n), if n is the size of the returned set. For most practical applications, h(n) will be the function log n and the queue structure will be implemented as an augmented self-balancing tree. Another factor is the cost of maintaining simple sets (fetched_in_future and must_touch) under additions of a single element and deletions. It is quite reasonable to assume that this cost is also bounded by h(k).

Under the above assumptions, the running time of olr is O(m·h(k) + n), where m is the size of the input (behavior sequence) and n the size of the output (the inequality m ≤ n ≤ m·k holds). This is not hard to see: the loop in lines 7-15 will be executed exactly m times and the complexity of every operation is at most h(k), for a total running time of O(m·h(k)). The outer loop (lines 4-20) will also be executed m times, and the total cost of the m blocks_after operations is O(m·h(k) + n) (the total size of the sets returned cannot be more than the length of the output, since a reference is produced for each element). Similarly, the running time of the loop in lines 16-17 is O(n) and the cost of lines 18 and 19 is O(m·h(k)). Thus the total running time of olr is O(m·h(k) + n).

Note that the input of olr (that is, the behavior sequence) is produced by applying a reference trace to an LRU stack. If we assume h(k) to also be the cost of maintaining a simple LRU stack of k elements (i.e., one with only the operation touch), then for a reference trace of length l, producing the behavior sequence has a runtime cost of O(l·h(k)).
Thus, producing the behavior sequence is generally much more costly than applying olr, since typical traces from real programs are orders of magnitude longer than their behavior sequences, even for small values of k. Another interesting aspect of olr is that the size of its working storage is bounded by k (i.e., all structures used, including the portion of the input being examined, have at most k elements). This makes the algorithm ideal for online implementations (i.e., reduction can take place while a trace is being produced).
5 Related Work
There are numerous pieces of work on various aspects of the LRU replacement policy and LRU-based data structures. Indicatively, we mention Sleator and Tarjan's well-known results [SlTa85] and some more recent results on the competitive ratio of LRU replacement [BIRS95, ChNo98]. Nevertheless, to the best of our knowledge, the LRU trace reduction problem has not been studied before in a formal algorithmic setting. On the other hand, empirical results on trace reduction, the area where our algorithm has found application, abound in the literature. We will only examine the approaches that are conceptually related to our algorithm. A more extensive discussion of the trace reduction literature can be found in [KSW98] and in the recent survey of trace-driven simulation by Uhlig and Mudge [UhMu97].

² There are, however, solutions that cannot be produced by olr.
The need to store reference trace information in a reduced form was recognized early on. One of the earliest techniques was that of Coffman and Randell [CoRa70]. This essentially amounted to storing the behavior sequence for an LRU stack of size k instead of the complete trace. The size of the behavior sequence can be much smaller than that of the original trace, even for small values of k. Using the behavior sequence, exact simulations for memory sizes above k can be performed. The biggest drawback of the Coffman and Randell approach, however, is that the product of reduction is not itself a trace. For instance, it is not clear how the LRU behavior sequence of a trace can be used for simulations of non-LRU policies. In the best case, the simulator as well as any other tools (e.g., trace browsers) will need to change to accept the new format. This is a practical burden to the simulator implementors and makes it hard to distribute traces in a compatible form. This is the main reason why this simple technique has not become more widespread. The algorithm presented in this paper is complementary to the approach of Coffman and Randell: it offers an efficient way to turn the behavior sequence format into the shortest possible trace exhibiting this LRU behavior.

Another approach which is conceptually related to ours is that of Smith [Smit77]. Smith's stack deletion method consists of only keeping references that cause evictions from an LRU stack of size k. Using our terminology, stack deletion results in a reference trace that consists of the fetch parts of all records in a behavior sequence. The assumptions of this technique are similar to ours: the simulated memory size must be greater than k, or the results are completely unreliable. The problem with this trace reduction approach (as well as with several others; see [UhMu97]) is that error is introduced in the simulation even for memory sizes greater than k.
In contrast, our algorithm offers accurate LRU simulations and, as we show in [KSW98], introduces less error than stack deletion for LRU approximations (such as clock and segmented FIFO [TuLe81], policies commonly found in real operating systems). Finally, our work can be viewed as trace synthesis from a detailed specification of LRU behavior. There has been work (e.g., [Baba81]) on producing traces that are deemed "realistic enough" for simulations because they have statistically similar LRU behavior to actual traces. Given the high reduction ratios achieved by our algorithm, it may be possible to completely replace "statistically similar" artificial traces with drastically reduced actual traces.
6 Conclusions and Future Work
We presented an efficient solution to a simple-to-state but challenging algorithmic problem. The result has significant practical applications in the area of trace-driven simulation. There are many possibilities for further work. From a theoretical standpoint, an interesting problem is to develop reduction techniques for other "stack algorithms". We have begun sketching a reduction algorithm for OPT (the optimal replacement policy) and we conjecture that the final algorithm will have similar characteristics to the LRU reduction algorithm. More interestingly, there is the possibility that properties of stack algorithms can be used to develop a general and efficient reduction scheme for all of them. There is also much potential for applied work. We have already experimented with a simple algorithm that yields reduced traces valid for simulations with both the LRU and the OPT replacement policies. Although the reduction is not optimal for either policy, the reduction factors obtained are high enough to be useful in practice. Another interesting question concerns the error introduced by our algorithm when used in LRU simulations with stack sizes smaller than k, or with entirely different replacement policies (e.g., LFU). This needs to be evaluated experimentally but may also be addressable by theoretical work.
References

[Baba81] O. Babaoglu, "Efficient Generation of Memory Reference Strings Based on the LRU Stack Model of Program Behaviour", Proc. PERFORMANCE '81, pp. 373-383.

[BaFe83] O. Babaoglu and D. Ferrari, "Two-Level Replacement Decisions in Paging Stores", IEEE Transactions on Computers, 32(12) (1983).

[BIRS95] A. Borodin, S. Irani, P. Raghavan, and B. Schieber, "Competitive Paging with Locality of Reference", Journal of Computer and System Sciences, 50(2), pp. 244-258 (1995).

[ChNo98] M. Chrobak and J. Noga, "LRU is Better than FIFO", Proc. Symposium on Discrete Algorithms (SODA) '98, pp. 78-81.

[CoRa70] E.G. Coffman and B. Randell, "Performance Predictions for Extended Paged Memories", Acta Informatica 1, pp. 1-13 (1970).

[CoLR90] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, MIT Press, 1990.

[FeLW78] E.B. Fernandez, T. Lang, and C. Wood, "Effect of Replacement Algorithms on a Paged Buffer Database System", IBM Journal of Research and Development, 22(2), pp. 185-196 (1978).

[GlCa97] G. Glass and P. Cao, "Adaptive Page Replacement Based on Memory Reference Behavior", Proc. ACM SIGMETRICS '97.

[KSW98] S.F. Kaplan, Y. Smaragdakis, and P. Wilson, "Trace Reduction for Virtual Memory Simulations", to appear at ACM SIGMETRICS '99; available before publication.

[MGST70] R.L. Mattson, J. Gecsei, D.R. Slutz, and I.L. Traiger, "Evaluation Techniques for Storage Hierarchies", IBM Systems Journal 9, pp. 78-117 (1970).

[RoDe90] J.T. Robinson and M.V. Devarakonda, "Data Cache Management Using Frequency-Based Replacement", Proc. ACM SIGMETRICS '90.

[SlTa85] D.D. Sleator and R.E. Tarjan, "Amortized Efficiency of List Update and Paging Rules", Communications of the ACM 28(2), pp. 202-208 (1985).

[SKW98] Y. Smaragdakis, S.F. Kaplan, and P.R. Wilson, "EELRU: Simple and Efficient Adaptive Page Replacement", to appear at ACM SIGMETRICS '99; available before publication.

[Smit77] A.J. Smith, "Two Methods for the Efficient Analysis of Memory Address Trace Data", IEEE Transactions on Software Engineering SE-3(1) (1977).

[TuLe81] R. Turner and H. Levy, "Segmented FIFO Page Replacement", Proc. ACM SIGMETRICS '81.

[UhMu97] R.A. Uhlig and T.N. Mudge, "Trace-Driven Memory Simulation: A Survey", ACM Computing Surveys 29(2), pp. 128-170 (1997).
A Appendix: Algorithm Correctness Proof
We must show that the behavior of the produced trace is identical to the algorithm's input and that no shorter trace can have that behavior. The first part is straightforward:

Theorem 1 The behavior of the output reference trace of olr (for an LRU stack of size k) is identical to its input.

Proof: The outer loop of the algorithm (lines 4-20) causes current to range over all input elements and to output a reference to the fetch part of each one (line 18). Consider a single iteration of the loop and the corresponding segment of the output. All other output references (lines 11, 17) are guaranteed to be to blocks that are in the stack when the output up to this point is applied to an LRU stack (recall that queue reflects the state of an LRU stack of size k, to which the output trace is being applied). All blocks in the stack that are less recent than behavior[current].evict belong to the must touch set (line 5). These get touched (loop in lines 16-17) before the reference to behavior[current].fetch is output. Hence, when the output trace is applied to an LRU stack of size k, the only reference causing an element to be fetched into memory is that to behavior[current].fetch. The corresponding evicted element (least recently touched) is behavior[current].evict. Thus the behavior of the output trace is identical to the input of olr. □

Showing that the output trace is as short as possible is a little more interesting. To do this, we need to express some properties of the prefixes of the output trace. We will name output_p the smallest prefix of the output sequence that contains exactly p extra references. That is, output_p(#output_p − 1) is the block referenced by the p-th extra reference. (Recall that extra references are the ones not causing evictions and, thus, are only produced by lines 11 and 17 of the olr algorithm.) With each such prefix, we can associate a state of olr (the point in the execution right after the last element of output_p was produced). Then we can define current_p and lookahead_p to be the values of current and lookahead for this state. Consider the sequence partial_p formed by concatenating the complete LRU trace of output_p with the sequence behavior[current_p, ...] (i.e., the contiguous subsequence of behavior beginning at current_p). This sequence represents the partial result of the algorithm up to a specific point. To make evident the correspondence between the behavior sequence and its part in partial_p, we introduce the function trans_p, with trans_p(i) = i − current_p + #output_p, for i ≥ current_p (i.e., trans_p maps an index in behavior to the corresponding index in partial_p). (Again, we will drop the subscripts when our discussion concerns a single value of p.) A simple observation is the following:

Lemma 4 Every extra reference produced by olr satisfies a constraint resulting from a non-tight run in the "partial" sequence.

Proof: Recall that the queue data structure reflects the state of an LRU stack to which output_p has been applied. The condition in line 10 of the algorithm is true iff the relative order of the latest references to two blocks does not match their relative eviction order. In this case, the run for the less recently referenced block contains the run for the more recently referenced block. Thus, every time an extra reference is output by the code of line 11, it satisfies a constraint resulting from a non-tight run. Similarly for line 17, a reference to block bl is produced iff bl was referenced less recently than the immediately next evicted element (behavior[current].evict). Thus, the run for the referenced block contains the run for behavior[current].evict and the reference satisfies a non-tight run constraint. □
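Theorem 1's claim can be checked concretely on the introduction's example: the reduced trace abcadb must exhibit the same behavior as the original abcbacdabd for any stack of size at least 3. The following is a self-contained simulation sketch of ours (not the paper's olr code), ignoring the final Last marker:

```python
from collections import OrderedDict

def lru_behavior(trace, k):
    """Record LRU behavior: one (fetch, evict_or_None) pair per miss,
    with None standing in for the paper's NF (stack not full) marker."""
    stack, behavior = OrderedDict(), []
    for block in trace:
        if block in stack:
            stack.move_to_end(block)   # hit: refresh recency only
        else:
            evicted = stack.popitem(last=False)[0] if len(stack) == k else None
            stack[block] = None
            behavior.append((block, evicted))
    return behavior

original, reduced = "abcbacdabd", "abcadb"
# Identical behavior at the reduction size k = 3 ...
assert lru_behavior(original, 3) == lru_behavior(reduced, 3)
# ... and at every larger stack size, as the problem statement requires.
assert all(lru_behavior(original, k) == lru_behavior(reduced, k) for k in (4, 5))
```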
In fact, something more can be said:

Lemma 5 For the "partial_p" sequence, consider all constraints on the extra references to a block, (s_0, f_0), ..., (s_n, f_n), in increasing order of their starting indices (just like in Lemma 3). If an extra reference produced by olr satisfies a set of constraints S, then let t = max{i : (s_i, f_i) ∈ S}. Either t = n or s_{t+1} ≥ min{f_i : (s_i, f_i) ∈ S}.

Proof: First note that, by Lemma 1, there can be no non-tight runs in the complete LRU trace of output_p. Hence, any constraint (s, f) will have f ≥ #output_p. Assume t ≠ n. We consider first the references produced by line 11 of the algorithm. These satisfy a constraint resulting from a run ending at position trans(lookahead) which contains a run ending at position f_l = trans(lookahead − 1). There are three cases for constraint (s_{t+1}, f_{t+1}): s_{t+1} < #output_p; or s_{t+1} ≥ #output_p and f_{t+1} < trans(lookahead); or s_{t+1} ≥ #output_p and f_{t+1} ≥ trans(lookahead). In the first case we have that f_{t+1} ≥ #output_p and, hence, the extra reference will satisfy this constraint (contradiction). The second case is also impossible, since lookahead advances only till the end of the first tight run after current (and trans(current) = #output_p). The third case means that s_{t+1} > f_l ≥ min{f_i : (s_i, f_i) ∈ S}. Similarly for line 17, the extra references produced satisfy a constraint ending at position #output_p. We have two cases for constraint (s_{t+1}, f_{t+1}): s_{t+1} < #output_p, or s_{t+1} ≥ #output_p. The first case leads to a contradiction in the same way as above. The second case means s_{t+1} ≥ #output_p ≥ min{f_i : (s_i, f_i) ∈ S}. □
Note that the previous two lemmata constitute steps in the right direction. They are, however, quite weak: they prove that the algorithm satisfies the conjunctive constraints of sequence partial_p. This is meaningless by itself: partial_0 is obviously the same sequence as behavior, but partial_p changes in time while the algorithm is being executed. We need some way of arguing that new constraints introduced by extra references do not significantly affect the lower bound computed using Lemma 3 for sequence partial_p. The upcoming theorem does exactly that.
Theorem 2 The cardinality of all sets int_{partial_p, bl} (written #int_{partial_p, bl}) is a non-increasing function of p. Additionally, if output_p(#output_p − 1) = bl, then #int_{partial_p, bl} < #int_{partial_{p−1}, bl}.
The theorem essentially states that producing an extra reference does not increase the lower bound computed in Lemma 3 for other blocks, and decreases it for the referenced block.

Proof: Every time an extra reference to block bl is output, the conjunctive constraint in int_{partial_{p−1}, bl} with the least starting index is satisfied. This is a straightforward consequence of Lemma 5. To prove the theorem we need to show that any extra constraints that are created do not result in new conjunctive constraints (i.e., they can be satisfied by the same references as existing ones). Consider the state of the algorithm right after p extra references have been produced. We distinguish the two cases where the p-th reference was introduced by line 11 or by line 17.

Extra reference produced by line 11: A reference to bl is output iff behavior[lookahead_p].evict = bl and behavior[lookahead_p − 1].evict was touched more recently than bl. A new run is created in partial_p between point #output_p and trans_p(lookahead_p) (also an old run is removed, but removing a run clearly cannot increase the lower bound computed by Lemma 3). The new run cannot contain any other runs (since lookahead only advances till the end of the next tight run). Hence, no new constraints for bl are introduced. The run could, however, be contained in another run, resulting in a new constraint for some other block. We show the general form of the partial_p sequence schematically below (y is the newly introduced reference; x is the block whose set int_{partial_p, x} may change):

  h: ⟨x, *⟩  ...  i: ⟨*, *⟩  j: ⟨y, none⟩  j+1: ⟨*, *⟩  ...  k: ⟨*, y⟩  ...  l: ⟨*, x⟩

where position i corresponds to behavior record current_p − 1, position j+1 to record current_p, and position k to record lookahead_p. (We use the star symbol (*) as a "don't care" value for blocks.) Note that it must be lookahead_{p−1} < lookahead_p (whenever line 11 outputs a reference, lookahead advances). Recall that lookahead stops at the ending index of the tight run immediately following current. Hence, there is a tight run in the interval (i + 1, k − 1). Consider now the construction of the set int_{partial_p, x} of conjunctive constraints, as in Lemma 3. No constraints can exist with a right endpoint before point j + 1 in partial_p (due to Lemma 1). Hence, the constraint induced by the newly created run will be grouped together with the constraint for run (i + 1, k − 1) (and possibly others) to form a conjunctive constraint. Since the upper endpoint of the conjunctive constraint is of the form min{f_0, ..., f_{e(1)−1}}, the newly introduced constraint will not affect it (since its endpoint k is greater than another endpoint k − 1). Thus the introduction of a new reference cannot increase the cardinality of the set int_{partial_p, x}.

Extra reference produced by line 17: A reference to bl is output iff behavior[current_p].evict was touched more recently than bl. The new run (s_bl, f_bl) that is created is a sub-interval of the old run on bl (it begins at point #output_p and ends at the old endpoint). Thus, if it contains any runs, these do not introduce new constraints for bl. The run could, however, be contained in another run, resulting in a new constraint for some other block. In that case, lookahead_p < f_bl (if lookahead had reached f_bl, a reference to bl would have been output by line 11 and bl would have been removed from the must touch set). Hence, there is a tight run (s_tight, f_tight) beginning at or after point current_p and ending at or before point lookahead_p. This run is contained in the newly introduced run, (s_bl, f_bl). Consider a constraint introduced because (s_bl, f_bl) is contained in another run. A more restrictive constraint will be introduced for (s_tight, f_tight). Thus the computation of conjunctive constraints cannot be affected by the new run (and no increase in the cardinality of a set int_{partial_p, bl'}, for some bl', can result). □

This theorem concludes the minimality proof for the output of olr. Since partial_0 = behavior, we have that int_{behavior, bl} = int_{partial_0, bl} for every block bl. Producing extra references will then only decrease the cardinality of one of the int_{partial_p, bl} sets and leave the others the same. Thus the algorithm will not output more extra references than those prescribed by the lower bound of Lemma 3.