Buffer Management in Relational Database Systems
GIOVANNI MARIA SACCO, Università di Torino
MARIO SCHKOLNICK, IBM T. J. Watson Research Center
The hot-set model, characterizing the buffer requirements of relational queries, is presented. This model allows the system to determine the optimal buffer space to be allocated to a query; it can also be used by the query optimizer to derive efficient execution plans accounting for the available buffer space, and by a query scheduler to prevent thrashing. The hot-set model is compared with the working-set model. A simulation study is presented.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems-query processing

General Terms: Algorithms, Performance, Theory

Additional Key Words and Phrases: Buffer management, hot points, hot-set model, merging scans, nested scans, scheduling, sequential scans, thrashing, unstable intervals, working set
1. INTRODUCTION

Most database systems use a main-memory area as a cache buffer, to reduce accesses to disks. This area is subdivided into frames, and each frame can contain a page of a secondary storage file. A process requesting a page will cause a fault if the page is not in the buffer: The requested page is then read into an available buffer frame (demand paging). When no available frames exist, a frame is made available by a replacement policy. If required, its contents are copied back to the disk. The most popular replacement policy is LRU (least recently used page), which replaces the page that has not been referenced for the longest time. LRU belongs to the family of stack algorithms [25], having the desirable property that an increase in available buffer space never produces an increase in the fault rate. Moreover, the LRU strategy is simple and can be very efficiently implemented. This is especially important as the buffer manager is one of the most heavily
used system components. Finally, an LRU policy appears to be the best for managing the replacement of shared pages [6]. The problem of managing a buffer has a very close relationship with virtual memory management [11], and in fact it was suggested that files be mapped into the user process addressing space [27]. However, the philosophy used in existing address-mapping hardware is not compatible with database applications, since it is based on the assumption of a relatively small process space, while it is not uncommon for a file to be several megabytes large [35]. Although most of the results available for virtual memory systems apply to buffer management, it will be shown that, at least for relational database systems, it is possible to relax the assumption that the reference string of a process is unknown, and therefore to obtain a more precise model of buffer access. This model is called the hot-set model and characterizes the buffer requirements of a query before its execution. The main advantage over its counterpart in virtual memory systems, the working-set model [13], is that the hot-set model is a static a priori estimator. It can therefore be used in the cost analysis performed by the query optimizer to discriminate among different access plans, and in a low-overhead scheduling strategy to prevent thrashing. In the following, discussions on the hot-set model primarily focus on System R [1], a relational database system implemented at IBM, and reviewed in Section 2. The ideas presented here can be easily extended to other relational DBMSs, as well as to ad hoc applications. Section 3 introduces the hot-set model; scheduling and buffer management strategies to avoid thrashing, based on the hot-set model, are discussed in Section 4. Section 5 discusses the use of the hot-set model by the system query optimizer. The hot-set model is compared with the working-set model in Section 6. Variations on buffering strategies are discussed in Section 7.

The following symbols are used in this paper:

    Ri         relation i
    Pi         number of pages in Ri
    Ki         cardinality of Ri
    NDVi       number of distinct values in the range of the joining attribute of relation Ri
    pi(j)      denotes the jth page of relation Ri
    faults(x)  where x is a buffer size, denotes the number of faults (i.e., disk accesses) when the available buffer contains x frames

2. SYSTEM R
Database management systems based on the relational model [10] achieve a high degree of data independence by providing high-level nonprocedural interfaces. The user specifies a description of the data to be retrieved (i.e., "what" the user needs), and not how to access it. Relational systems rely on a system component, called the query optimizer, to determine an efficient access plan for a given query, using the available access paths to the data. System R is a multiuser relational database system, supporting the SQL nonprocedural query language [7]. The basic structure of the system is outlined in Figure 1.
Fig. 1. System R architectural outline.
UFI (User Friendly Interface) is the interactive user interface with the system; it supports SQL. The optimizer selects an efficient strategy for evaluation, on the basis of system data on the stored relations such as existing access paths, cardinality, number of pages, and selectivity factors [33]. Preoptimized queries are directly passed to the RSS component. The RSS component provides low-level data management primitives, such as "get tuple." Inside the RSS, a paged buffer preserves the most recently used pages. A relation can be accessed either by a sequential scan (i.e., exhaustive reading of the whole relation) or through an index scan. Indices are organized as B-trees [12] and can be clustered or unclustered. An index is clustered if both the index and the data pages are sorted on the same attributes; this property does not hold for unclustered indices. Joins can be evaluated using two different methods: nested scans and merging scans.

(1) Nested scans do not require any particular order on the joining attributes. For a 2-way join, a nested scan strategy amounts to scanning one of the relations (called the outermost), and for each filtered tuple, to locate (via sequential scan or index access) tuples matching the current joining value in the innermost relation. Extensions to n-way joins are straightforward. The result relation is ordered in the same way as the scan on the outermost relation.

(2) Merging scans require both relations to be ordered in the same way according to their joining attributes. The merging scan method uses a placeholder p to reduce the length of the scans over the inner relation, exploiting the ordering. In the case of an equijoin over two relations, R1 and R2, p is initially positioned before the first tuple in R2. When a scan is started on R2 to match the current
value t1 retrieved from R1, p is set to the first tuple in R2 such that t2 = t1. The scan is terminated when a tuple with t2 > t1 is found. The current scan position is retained. On the next scan, initiated for a value t1' of R1, two cases arise: If t1' = t1, the scan initiates at p; otherwise the scan will continue from the current position. By this method, scans are performed only on subsets of R2, so that the cost is generally much smaller than the nested-scan cost. If the relations are not appropriately ordered, the cost of presorting may offset these benefits.
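To make the two evaluation methods concrete, the following minimal Python sketch (not part of the original paper) contrasts a tuple-at-a-time nested scan with a merging scan over inputs already sorted on the join attribute; relation contents and attribute names are illustrative only.

```python
def nested_scan_join(outer, inner, key):
    """Nested scans: for each outer tuple, scan the inner relation."""
    result = []
    for t1 in outer:
        for t2 in inner:                      # full scan of the inner relation
            if t1[key] == t2[key]:
                result.append((t1, t2))
    return result

def merging_scan_join(outer, inner, key):
    """Merging scans: both inputs sorted on `key`; a placeholder p limits rescans."""
    result, p = [], 0
    for t1 in outer:
        # advance p past inner tuples smaller than the current join value
        while p < len(inner) and inner[p][key] < t1[key]:
            p += 1
        i = p                                 # rescan only the run of equal values
        while i < len(inner) and inner[i][key] == t1[key]:
            result.append((t1, inner[i]))
            i += 1
    return result

R1 = [{"a": 1}, {"a": 2}, {"a": 2}]
R2 = [{"a": 1}, {"a": 2}, {"a": 3}]
assert nested_scan_join(R1, R2, "a") == merging_scan_join(R1, R2, "a")
```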
3. THE HOT-SET MODEL
The cache buffer is used to avoid refetching those pages that are reused. Thus, to minimize the number of faults, the buffer space allocated to a query should be sufficient to hold all the pages to be reused. In the case in which the buffer space is insufficient, frames containing pages to be reused will be "stolen" by other pages, either requested by the same process (internal thrashing) or by another process (external thrashing). It is shown in the following that, since relational systems use standard evaluation strategies, the required buffer space can be estimated before query execution. If the system insures that a process is never run with an insufficient number of frames, then internal and external thrashing can be minimized (or altogether avoided).

The key idea of the hot-set model is that the number of query faults as a function of the available buffer space is a curve consisting of a number of stable intervals (within each of which the number of faults is a constant), separated by a small number of discontinuities, called unstable intervals. Figure 2 is an example, showing two stable intervals and one discontinuity. A discontinuity occurs when a set of pages, which are rereferenced, does not fit in the available buffer space. On subsequent rereferencing of a page, the missing page must be read from secondary storage, generating a fault. Depending on the properties of the access pattern, unstable intervals can exhibit sharp or smooth discontinuities. Sharp discontinuities are characterized by a rapid increase in the number of faults, in an interval [B - 1, B] (a single frame increase). Such discontinuities are usually caused by looping reusal (Figure 2). Smooth discontinuities usually occur in connection with indexed access, in which the reusal is "less precise" (Figure 3).

A stable interval can be completely characterized by the value of the fault function for the lower extremum of the interval. This buffer size is called a hot point. The minimum number of frames needed by a query to be run is called the minimum hot point. In most systems it is one frame. Notice that each hot point (with the exception of the minimum one) is both the lower extremum of a stable interval, and the upper extremum of an unstable one. Therefore the fault curve of any query can be completely characterized in terms of its unstable intervals, with the addition of the minimum hot point.

The fault curve in Figure 2 is relative to a join, R1 join R2, executed by nested loops using sequential scans. The fault-rate graph exhibits a sharp increase in the transition between a buffer holding 1 + P2 pages, and a buffer holding P2
Fig. 2. Fault curve for a join computed by nested scans using sequential scans.
Fig. 3. Fault curve for a join computed by nested scans using a sequential scan on the outer relation and a clustered index scan on the inner.
pages. This sharp discontinuity is explained as follows: The access pattern of the join is (1) access the current page of R1, (2) perform a sequential scan on R2, accessing pages p2(1), . . . , p2(P2). If 1 + P2 pages are available, the entire loop on R2 plus the current page of R1 fits in the buffer, thus requiring the minimum number of faults. When this
is no longer possible (e.g., buffer = P2), there is no reusal of pages. In fact, at the end of the first loop instance, the LRU stack will contain pages p2(1) to p2(P2), with page p2(P2) being at the top of the stack. The reference to the current page of R1 will cause the replacement of page p2(1). The subsequent reference to p2(1) will cause the replacement of p2(2), and so on. In fact the fault rate for B < 1 + P2 is stable in the interval [1, P2] and is equal to the fault rate at B = 1, that is, K1 * (1 + P2). A query of this type can then be exactly characterized by the minimum hot point (B = 1) and the unstable interval [P2, 1 + P2]. As a matter of fact, the analysis of fault behavior for a query with only sharp discontinuities can be further restricted to hot points only (in the example, hp1 = 1, hp2 = 1 + P2). The analysis of unstable intervals is useful only in the context of smooth discontinuities.

Figure 3 (a join computed by nested scans using a sequential scan on the outer relation and an index scan on the inner) shows an instance of a smooth discontinuity. In these cases, the fault curve inside an unstable interval must be analyzed. In the following, this analysis is performed by means of interpolation between the extremes of the unstable interval. Thus, in addition to the hot point relative to the smooth discontinuity (upper extremum), the lower extremum, called cold point, is needed. The term cold point is used to indicate that this buffer size is never to be chosen for execution, since it is the upper extremum of a stable interval (and therefore consumes more buffer resources, with no fault benefit, than the lower extremum of the stable interval). Whether discontinuities are smooth or sharp, the basic principle is that buffer sizes inside a stable interval, and different from the hot point relative to that interval, do not produce any benefit in terms of fault reduction, while using more buffer resources.

The following discussion provides tools to identify hot points and unstable intervals for a given query and to estimate the number of faults generated by a query running with a buffer size equal to a given hot point, or inside an unstable interval. For the purpose of discussion, it is useful to distinguish between simple, loop, unclustered, and index reusal. In this section, queries are assumed to run in isolation.

3.1 Simple Reusal

Simple reusal occurs when a page is referenced several times, but once it is left it is never referenced again. An example of such a reusal is the solution of a single relation query by sequential scans. In this case, the relation is exhaustively read, but no looping on different pages occurs. The buffer size required to run this query is exactly one frame, needed to keep the current page in core. A buffer space exceeding one frame does not produce any benefit. Thus,

    hp1 = 1,    faults(x) = P1,    x ≥ 1.
3.2 Loop Reusal

Loop reusal occurs for the pages referenced in a loop. The simplest case was discussed in the example above, which showed one unstable interval [P2, P2 + 1], and two hot points hp1 = 1, hp2 = 1 + P2. The fault curve can be
characterized by

    faults(x) = P1 + P2,          x ≥ 1 + P2
    faults(x) = K1 * (1 + P2),    1 ≤ x < 1 + P2.

The same arguments carry over to n-way nested scans, executed by nested sequential scans. In this case unstable intervals are associated to the relations referenced in the execution plan, in reverse order: The smallest one is associated with the innermost relation, and so on. In fact, when the buffer size decreases, the pages in the innermost loop will steal frames from other loops, starting with the outermost (which is the farthest in the reference string). As an example, consider a 3-way join performed by nested sequential scans among R1, R2, and R3 (in this order). The fault curve of this join is completely characterized by four hot points:

    hp1 = 1
    hp2 = 1 + P3
    hp3 = 2 + P3
    hp4 = 1 + P2 + P3

as follows,

    faults(x) = P1 + P2 + P3,              x ≥ 1 + P2 + P3
    faults(x) = K1 * (1 + P2) + P3,        2 + P3 ≤ x < 1 + P2 + P3
    faults(x) = K1 * (1 + P2 + P3),        x = 1 + P3
    faults(x) = K1 * (1 + K2 * (1 + P3)),  1 ≤ x < 1 + P3.

Hot points hp1 and hp4 are quite obvious. Hot point hp2 is explained in the following way: At the end of a loop on R2 and R3, the LRU stack will contain the following pages (most recently used is leftmost): p3(P3), . . . , p3(1), p2(P2). Now page p1(i) must be accessed: p1(i), p3(P3), . . . , p3(1), followed by the start of the loop on R2: p2(1), p1(i), p3(P3), . . . , p3(2). As a consequence, no reusal is possible on R3 whenever a new tuple of R1 is fetched. It is easy to see that an increase of one frame (hp3) avoids this problem.

Looping reusal is found in merging scans, too. In this case, looping occurs on runs of equal values of R2, matching a run of equal values in R1. By using the uniform distribution and attribute independence assumption (used by the System R optimizer [33]), the average length of a run of pages of P2 (rlen2) can be estimated to be rlen2 = ceiling(P2/NDV2) or, if NDV2 is not known, rlen2 = ceiling(P2/P1).
Since there is no guarantee that a run begins on a physical page boundary, this quantity should be augmented by 1 in order to account for boundary conditions. The unstable interval can then be defined in [rlen2, 1 + rlen2]. Unlike nested loops estimates, merging scans estimates may suffer from errors. First, the buffer requirements of the query change at each loop instance. Thus discontinuities tend to be smoothed down. The above definition of the unstable interval really reflects an average case, since the minimum fault rate is guaranteed when the maximum (and not the average) run plus the current page of the outermost relation fits in the buffer. Second, the definition of the unstable interval assumes that all tuples in R1 have at least a matching tuple in R2, and that looping is always performed. This is a rather crude approximation. Very reliable estimates can be obtained if value data distributions are either known or approximated by histograms [30].
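The loop-reusal characterization of Section 3.2 can be checked with a short LRU simulation. The following Python fragment (not part of the original paper) builds the reference string of a 2-way nested-scan join and reproduces the sharp discontinuity at 1 + P2; the page-naming scheme and the parameter values are purely illustrative.

```python
from collections import OrderedDict

def lru_faults(refs, frames):
    """Count faults for a reference string under LRU with `frames` buffer frames."""
    buf, faults = OrderedDict(), 0
    for page in refs:
        if page in buf:
            buf.move_to_end(page)          # hit: page becomes most recently used
        else:
            faults += 1
            if len(buf) == frames:
                buf.popitem(last=False)    # evict the least recently used page
            buf[page] = True
    return faults

def nested_scan_refs(P1, K1, P2):
    """Reference string of R1 join R2 by nested sequential scans (tuple-at-a-time)."""
    tuples_per_page = K1 // P1
    refs = []
    for i in range(P1):                    # current page of R1
        for _ in range(tuples_per_page):   # one full inner loop per R1 tuple
            refs.append(("R1", i))
            refs.extend(("R2", j) for j in range(P2))
    return refs

P1, K1, P2 = 5, 100, 20
refs = nested_scan_refs(P1, K1, P2)
assert lru_faults(refs, 1 + P2) == P1 + P2      # at hot point hp2 = 1 + P2
assert lru_faults(refs, P2) == K1 * (1 + P2)    # inside the unstable interval
```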
3.3 Unclustered Reusal

The simplest instance of unclustered reusal is given by the processing of a single relation query by an unclustered index. The basic pattern of access follows:

(1) Initial positioning phase: The first index entry containing the required value is located by traversing the index tree.
(2) The index-leaf level is scanned until no index entry satisfying the predicate is found.
(3) For each such entry, the corresponding tuple is accessed.

Since the index is unclustered, different index entries may point to the same page. In order to avoid reaccessing a page, index entries might be sorted by pointer values. This method is known as the TID sorting algorithm [5], and produces result relations which are not sorted by scan order. The current implementation of System R does not use such a method. Thus the number of unique pages to be accessed must be estimated. This estimate is provided by Yao's function [37], Y(n1, n2, P), which estimates the number of pages accessed in a file holding n2 tuples over P pages, for n1 requests. In the case at hand, the buffer space required for a complete reusal is

    hp2 = 1 + Y(KP, K2, P2)

where KP denotes the number of index entries satisfying the index predicate, a quantity estimated by the optimizer. Unlike looping reusal, the fault rate does not exhibit sharp increases, but rather a smooth discontinuity in the interval [2, hp2]. The cold point cp1 = 2 reserves a frame for the current leaf page and the current data page. There is another discontinuity at [1, 1], because the fetch of the current data page replaces the current leaf page. The estimated number of faults at hp2 is

    DI - 1 + PP + Y(KP, K2, P2)

where DI represents the depth of the unclustered index, and PP the number of leaf pages that are accessed. The number of faults between cp1 and hp2 can be estimated by

    DI - 1 + PP + Y(KP, K2, P2) + FP * (KP - Y(KP, K2, P2))

where B is the buffer size and FP = 1 - B/Y(KP, K2, P2) denotes the uniform fault probability.
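Since Yao's function appears throughout these estimates, the following Python sketch (an illustration, not code from the paper) evaluates the exact formula from [37] for Y(n1, n2, P), the expected number of distinct pages touched when n1 of the n2 tuples stored uniformly over P pages are requested; the KP value at the end is an assumed example, and the cheaper closed-form approximations of [4] could be substituted.

```python
def yao(n1, n2, P):
    """Expected distinct pages touched when n1 of n2 tuples, stored n2/P per page
    over P pages, are selected at random without replacement (Yao [37])."""
    if n1 >= n2:
        return P
    per_page = n2 / P
    prod = 1.0
    for i in range(1, n1 + 1):
        prod *= (n2 - per_page - i + 1) / (n2 - i + 1)
    return P * (1.0 - max(prod, 0.0))

# Buffer needed for complete reusal in the unclustered-index case of Section 3.3,
# with an illustrative predicate selectivity (KP) and the R2 statistics of Table I:
KP, K2, P2 = 100, 5888, 99
hp2 = 1 + yao(KP, K2, P2)
print(round(hp2))
```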
3.4 Index Reusal

Index reusal occurs when an index is accessed repeatedly. The basic pattern of access is a tree traversal in which the probability of reusal decreases from the root (1.0 reusal probability) to the leaves. This fact is well known; the use of an LRU buffer manager was initially proposed for B-tree processing [3, 21]. Index reusal usually arises in nested scan evaluations, in which R1 is processed by sequential scans and R2 by a clustered index scan. This type of evaluation requires a scan of R1, and for each filtered tuple in the current page of R1, a leaf page of R2, where the tuples matching the current join value would be, must be located by traversing the index tree of R2. Subsequently, a sequential scan on the leaf level of R2 is initiated, and all tuples in the leaf level, matching the join value, are retrieved.

The number of pages to be accessed in R2 is initially estimated. Let NDV1 and NDV2 represent the number of different values in R1 and R2, respectively. Both of these values are usually available to the optimizer. In order to estimate the number of pages touched by index traversal, Yao's function can be applied to each index level in isolation. Thus, the number of pages touched at level i (T(i)) is given by T(i) = Y(NDV1, NDV2, PI(i)), where PI(i) is the number of pages at level i. T denotes in the following the sum of all T(i)s, leaf level excluded. The number of different pages touched by the leaf-level sequential scan (TS) is estimated as follows: The average length of a run of pages holding records with the same joining value is given, under the uniform value distribution assumption, by rli = PI(leaf_level)/NDV2. TS can then be estimated by T(leaf_level) * rli.

The first instance of index reusal to be analyzed is the case in which the outer relation is ordered on the joining attribute(s) in the same way as the inner one. In this case there is no reusal at nonleaf levels of the index tree because a page is never reaccessed once it is left. There is instead a looping behavior on the leaf level (and a single unstable interval), if K1 > NDV1, since runs of equal values will be repeatedly accessed. The cold point for the unstable interval occurs when there is not sufficient space to hold the current page, plus one frame for each level of the index of R2, plus the length of a run of equal values at the leaf level of R2. The inspection of the access pattern shows that in this case no page can ever be reused. In the particular case in which K2 = NDV2, all values being unique, only one frame is needed to store a run of equal values. Thus,

    cp = DI + rli - 1,
    faults(x) = K1 * (1 + DI),            x ≤ cp, if K2 = NDV2
    faults(x) = K1 * (1 + DI + rli - 1),  x ≤ cp, otherwise
where DI denotes the depth of the index on R2. The hot point occurs when all the required frames are present:

    hp = 1 + DI,     faults(x) = P1 + T + T(leaf_level),   x ≥ hp, if K2 = NDV2
    hp = DI + rli,   faults(x) = P1 + T + TS,              x ≥ hp, otherwise.

Note that hp minimizes the number of faults and is therefore the maximum hot point.

In the case in which the outer relation is not ordered, there are two discontinuities (a sharp one and a smooth one). If K2 = NDV2 (i.e., no duplicates in R2), the query is characterized by

    hp1 = 1,                       faults(x) = K1 * (1 + DI),           1 ≤ x < hp2
    hp2 = cp1 = 1 + DI,            faults(x) = P1 + ACC,                hp2 ≤ x < hp3
    hp3 = 1 + T + T(leaf_level),   faults(x) = P1 + T + T(leaf_level),  x = hp3

where ACC is given by the sum for all levels i of ACC(i) = T(i) + K1 * (1 - 1/T(i)). The number of faults at hp1 and hp3 is easy to derive. The situation at x = hp2 is explained by noting that (1) a page at index level i can only replace a page at index level i; (2) a page of R1 can only replace a page of R1. Thus, only P1 accesses are required to scan R1. Moreover, the number of faults at each index level is given by T(i) plus the probability of a failure in finding the required page in the buffer, times the number of accesses. A smooth discontinuity is found between cp1 and hp3. The number of faults for cp1 < x < hp3 can in practice be interpolated linearly. The case in which K2 > NDV2 can be treated similarly.

3.5 Extensions

Although the previous analysis is by no means exhaustive, its methodology can be used to compute characterizations of other access strategies, and in particular to completely characterize all the strategies implemented in System R. A special case occurs when temporary results are computed and stored. This strategy is used in System R when one or more relations must be sorted prior to a merging-scan join, and also in INGRES [34]. In this case, the query is decomposed into several subevaluation plans, and each of the subplans can be independently characterized by the hot-set model.

3.6 Example

As an example of the computation of hot sets and the accuracy of the estimate, experiments were conducted involving the computation of a join between two relations by nested scans with a sequential scan on the outer relation and an index scan on the inner. Statistics for the relations involved in the experiments are reported in Table I. Semantically, the inner relation (R2) contained a lexicon of 5,888 words,
Table I. Statistics for the Relations in the Experiments

Relation R1:
                Experiment 1    Experiment 2    Experiment 3
    K1    =     272             590             590
    NDV1  =     272             212             272
    P1    =     5               10              11

Relation R2 (all experiments):
    K2 = 5888, NDV2 = 5888, P2 = 99
    Index depth = 3; PI(0) = 1, PI(1) = 2, PI(2) = 97
Fig. 4. Observed and estimated fault curves for experiment 1 (dotted line = observed; solid line = estimated).
while R1 contained a list of unordered unique words from a document (experiment 1), of unordered replicated words (experiment 2), and of ordered replicated words (experiment 3). The reference probability for leaf pages was almost uniform in experiment 1, while it followed Zipf's law of distribution in experiment 2. The graphs of observed and estimated behaviors are shown in Figures 4-6. For experiment 1, the hot points can be calculated in the following way: The smallest hot point is hp1 = 1, at which the fault rate is

    faults(1) = K1 * (1 + DI) = 272 * (1 + 3) = 1088,

which is equal to the observed value. Hp2 is given by

    hp2 = 1 + DI = 4,
    faults(4) = P1 + 1 + (2 + 272 * (1 - 1/2)) + (97 + 272 * (1 - 1/97)) = 510.
Fig. 5. Observed and estimated fault curves for experiment 2 (dotted line = observed; solid line = estimated).
Fig. 6. Observed and estimated fault curves for experiment 3 (dotted line = observed and estimated; solid line = estimated, not knowing that the outer relation is ordered).
In the above formula, Yao’s function was estimated using the approximating formulas proposed by [4]. Other approximating formulas are found in [24] and [361. The observed value is 360 (42 percent error). The maximum hot point is estimated at hp3 = 99, faults(99) = Pl + T + T(leaf-level) ACM Transactions
on Database Systems, Vol. 11, No. 4, December
1986.
= 104.
The observed values were hp3 = 92, faults(92) = 93. These experiments show that the hot-set model is accurate as far as hot-point estimation is concerned. Experiment 2 was devised to show that nonuniform value distributions cause a significant decrease in the fault rates in the interval [cp1, hp3]. In practice, interpolation is conservative (due to uniformity assumptions [9]). As important causes of fault rate decrease, in addition to nonuniform value distributions, there are partial orderings in the values of outer relations. Partial orderings (or, equivalently, nonuniform interreference length probability) can significantly increase locality even for small allocated buffers; in the limit (a completely ordered outer relation), the observed fault curve is the one shown in Figure 6. The graphs show that a linear interpolation of the number of faults in the interval [cp1, hp3] is not completely satisfactory, although it is probably adequate in practice. In order to analyze the number of faults in finer detail, one must account for the considerations shown above, and for nonuniformity of frame usage by different levels of the tree. In fact, higher levels of the tree (with lower interreference times) tend to steal frames from the lower levels (with higher interreference times).
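The hand calculation for experiment 1 can be replayed in a few lines of Python (this fragment is illustrative and not from the paper); it plugs the Table I statistics and the per-level T(i) values used above into the ACC formula of Section 3.4.

```python
# Table I, experiment 1: K1 = NDV1 = 272, P1 = 5; index of depth DI = 3 on R2.
K1, P1, DI = 272, 5, 3

# Pages touched per index level, T(i) = Y(NDV1, NDV2, PI(i)); the calculation
# above uses T = [1, 2, 97] for the three levels (root to leaf).
T = [1, 2, 97]

faults_hp1 = K1 * (1 + DI)                    # hp1 = 1: 1088 faults
ACC = sum(t + K1 * (1 - 1 / t) for t in T)    # Section 3.4 ACC formula
faults_hp2 = P1 + ACC                         # hp2 = 1 + DI = 4: about 510 faults
print(faults_hp1, round(faults_hp2))          # 1088 510
```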
4. THRASHING AVOIDANCE
Under a global LRU replacement policy with unrestricted access to the buffer, users can steal pages from each other. In the (likely) event that the sum of the buffer requirements for each active query exceeds the available buffer space, users will be working with an insufficient number of frames, and competition for buffer space will cause an increase in the overall system fault rate. This phenomenon, which we call external thrashing, was first studied in the context of virtual memory systems: It is well known that under circumstances of severe buffer overcommitment the system can literally collapse under the weight of paging due to page stealing. In this case, the expected trade-off between throughput and response time due to multiprogramming is usually not realized: since a large part of paging is useless, system responsiveness is also adversely affected.

4.1 A Simple Thrashing Avoidance Strategy
The hot-set model can be used to avoid thrashing by scheduling queries for execution if their buffer requirements do not exceed the available buffer space. In order to implement this strategy, a definition of the optimal buffer requirement for a given query is needed. We call this quantity the query hot set. If we regard the problem from a traditional point of view (i.e., the objective is to minimize the number of faults in the execution of the query, and therefore the query response time), then the query hot set is obviously the largest hot point not exceeding the system buffer space. On this basis, we can formulate the following strategy:

(1) Scheduling. Queries are scheduled for execution in such a way that the sum of their hot-set sizes does not exceed the available buffer space. Queries in
the waiting list are ordered by increasing buffer consumption. The buffer consumption of a query is given by its hot-set size, times its expected response time in isolation (i.e., it is the integral of buffer allocation over time). This strategy is chosen to provide fast service to "small" queries.

(2) Buffer management. Each active query has a local LRU stack (i.e., LRU replacement is applied in isolation for each query). (A sketch of this admission rule appears below, after the list of measures used in the simulation.)

This simple strategy was compared with the current implementation of System R (global LRU replacement strategy). The simulation was engineered as a low-cost feasibility study. Several simplifications and approximations were done. Among these were the following:

(1) The scheduler overhead was not considered. This approximation is justified by the fact that this overhead is very low when compared with query execution times.

(2) Service was assumed to be equidistributed among users within a simulation clock tick. Interleaving of CPU and I/O activity was ignored. This simplification biases the experiment in favor of System R.

(3) For queries consisting of several subplans (see Section 3.5), each characterizable by a hot-set size, the maximum hot-set size was chosen to characterize the query. Again, this is conservative, since simulated queries use the buffer longer than in actuality.

(4) The database used in the experiments was small, with relation sizes varying from a few tens to a few hundred pages. In order to approximate the behavior of real databases, which are significantly larger than the sample one and serve far more users, the buffer size used in the experiments (less than 80 kbytes) was significantly smaller than buffer sizes used in practice (ranging from several hundred kilobytes to a few megabytes). This approximation is legitimate if the most important parameters in buffering performance are, as it intuitively appears, the ratio between database size and buffer size, and the ratio between number of users and buffer size.

All time measures are expressed in "timerons," a virtual elapsed-time measure used in System R, which estimates both CPU and I/O time. A timeron is roughly equivalent to 0.02 second on a dedicated IBM 370/158 system. Since timerons are estimates of actual time, they provide indicative measures. I/O activity figures are, on the other hand, actual observed values.

Nine base queries were chosen, and their hot sets determined analytically. The evaluation plans chosen by the current System R optimizer were used. The characteristics of the queries are tabulated in Table II. A random-number generator was used to generate random sequences of queries for n users. For each user, as soon as a query terminated, the next one was immediately submitted. The following measures were chosen to represent the behavior of the simulated and the real systems:

(1) total number of faults,
(2) system throughput per 1000 timerons,
(3) average response time of a query,
(4) system unresponsiveness: given by the maximum ratio between actual response time and response time when the query is run in isolation with a buffer equal to its hot-set size. This measure gives an indication of the worst possible response.
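As referenced above, the following Python fragment is a minimal sketch, not from the paper, of the simple admission rule of Section 4.1: a query is admitted only if its hot-set size fits in the remaining buffer space, and waiting queries are served in order of increasing buffer consumption (hot-set size times expected isolated response time). Class and variable names are illustrative, and the handling of ties or priorities is left out.

```python
import heapq

class HotSetScheduler:
    """Admit a query only when its hot-set size fits in the free buffer space."""

    def __init__(self, buffer_frames):
        self.free = buffer_frames
        self.waiting = []                      # min-heap ordered by buffer consumption

    def submit(self, query_id, hot_set, expected_time):
        consumption = hot_set * expected_time  # integral of buffer allocation over time
        heapq.heappush(self.waiting, (consumption, query_id, hot_set))

    def dispatch(self):
        """Start queries from the head of the wait list while their hot sets fit."""
        started = []
        while self.waiting and self.waiting[0][2] <= self.free:
            _, query_id, hot_set = heapq.heappop(self.waiting)
            self.free -= hot_set               # reserve the query's hot set
            started.append(query_id)
        return started

    def finish(self, hot_set):
        self.free += hot_set                   # release the frames of a completed query

sched = HotSetScheduler(buffer_frames=15)
sched.submit("Q4", hot_set=15, expected_time=2414)
sched.submit("Q1", hot_set=3, expected_time=121)
print(sched.dispatch())                        # ['Q1']: the small query is served first
```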
Table II. Characterization of the Query Load

    Query    Hot-set size    Timerons with hot-set size
    1        3               121
    2        4               102
    3        5               21
    4        15              2,414
    5        5               432
    6        6               1,854
    7        7               2,447
    8        8               11,187
    9        8               11,191

Table III. Simulation Results

Experiment 1: Users: 4; Total number of queries: 20; Buffer size: 15;
Number of queries delayed by scheduling: 9

                        Simulation    System R    Simulation/System R
    Faults              30,242        54,331      0.56
    Throughput          0.19          0.15        1.16
    Response            16,072        24,152      0.66
    Unresponsiveness    6.85          25.33       0.27

Experiment 2: Users: 5; Total number of queries: 220; Buffer size: 19;
Number of queries delayed by scheduling: 15

                        Simulation    System R    Simulation/System R
    Faults              40,295        48,482      0.83
    Throughput          1.75          1.74        1.07
    Response            2,306         2,966       0.78
    Unresponsiveness    167.50        24.38       6.87
The first experiment was conducted with four users and a buffer size of 15 frames. The second experiment was conducted on a buffer size of 19 frames, with five active users. The first four users issued the same query stream of the previous experiment; the fifth user generated a long stream of fast queries (queries 1 and 2 in Table II); results are tabulated in Table III. In interpreting the results, it should be noted that System R allows a process to fix (i.e., insure against LRU replacement) a small number of frames [23]. Page fixing is allowed for performance reasons. As an example, consider a process
locating a tuple in page p. In systems that allow fixing, the process will issue a request for p, fix it, locate the tuple by directly addressing the page, process it, and subsequently unfix the page. In nonfixing systems, a request for page p will cause the page to be copied into the private process space, where it can be processed. If page fixing is allowed, the replacement policy is not a pure LRU policy, although fixed pages tend to be those at or near the top of a pure LRU stack. In the experiments, both the real and the simulated system implement fixing. As a side effect, note that page fixing requires that the buffer be large enough to accommodate all active fixing requests. In terms of the hot-set model, the minimum hot point is equal to the maximum number of frames the process can have fixed at the same time. Page fixing explains why the two experiments were not conducted on the same buffer size: Not enough fixable frames were available for the second experiment, so that the version of System R used in these experiments could not run. These preliminary results show that even a simple-minded strategy based on the hot-set model gives significant benefits. The total number of faults was greatly reduced as a consequence of external thrashing avoidance. The average response time significantly improved, while the system unresponsiveness was better or tolerably worse than the real system. The high unresponsiveness of experiment 2 was caused by some fast queries, which were occasionally queued after some very long ones. The inspection of raw statistical data for experiment 1 shows that more than 3 million system calls (CPU cost) were issued against about 30,000 (hot-set) or 54,000 (System R) faults. Thus, especially if the number of faults is reduced by hot-set scheduling, the CPU tends to become one of the bottlenecks of the database system. Another bottleneck, more evident in experiment 2, is represented by the buffer itself. In practical situations, where many users access very large databases, insufficient buffer resources can severely degrade the potential performance of the system, even if intelligent scheduling is used. 4.2 A More Effective Policy
In this section a more effective policy is presented. In the interests of generality, page fixing is ignored; effective strategies for page-fixing systems are presented in [31] and [32]. For purposes of discussion, the buffer manager is illustrated before the scheduler.

4.2.1 Buffer Manager. The main problems in the simple buffer management strategy seen before are the following:

(1) No page sharing among concurrent queries ever occurs. This is especially undesirable in a large class of practical applications, such as teller banking applications, which are characterized by a large number of small queries sharing the same files. In these systems, the simple strategy will cause an unwarranted number of faults.

(2) There is no provision to minimize the amount of internal thrashing due to errors in the estimation of the hot-set size.
on Database Systems, Vol. 11, No. 4, December
1986.
Buffer Management in Relational Database Systems
489
The buffer manager proposed below solves these problems. Note that the algorithm uses the concept of a “deficient” query, which is a query in execution with an allocated number of frames (called NALL) smaller than the hot-set size of the query (HS). Arriving processes are always deficient, a strategy that ensures that frames possibly needed by other processes are claimed only when actually needed. The basic ideas in the algorithm are as follows: (1) LRU stacks are independently managed.
(2) A freelist region in the buffer contains all the free frames. This region is managed with an independent LRU stack. Processes replacing a page will push the page to be replaced on top of the freelist stack, and obtain the frame at the bottom of the freelist LRU stack.

(3) A request for a page scans the whole buffer to search for the requested page. Fast scan is usually accomplished by hashing for performance reasons. If the requested page belongs to the local LRU stack of the requesting process, the local stack is updated. If it belongs to the freelist, it is absorbed by the process local LRU stack (causing a replacement if the process is not deficient). Otherwise, nothing is done. Thus a shared page will be present in only one local stack.

(4) Deficient processes never cause external thrashing, but can incur internal thrashing.

The algorithm follows:

INITIALIZE:
  Assign to the freelist all the buffer frames, initialized to "empty." Push them on the freelist LRU stack. Set EMPTY to the number of frames in the buffer.

NEW PROCESS P ARRIVES:
  Assign to P an empty local LRU stack. Set HS(P) to the process hot-set size. Set NALL(P) = 0. Note: all new processes are deficient.

NONDEFICIENT PROCESS P REQUESTS A PAGE:
  Search for page in the buffer pool.
  [1] page is found:
    [1.1] page is in local LRU stack: update stack.
    [1.2] page is in freelist LRU stack: put bottom of local stack to the top of freelist stack. Put found frame to the top of local LRU stack. Discard found frame from freelist stack.
    [1.3] page is in the stack of another process: do nothing.
  [2] page is not found: a page fault occurs. Push bottom of local LRU stack to the top of freelist stack. Get bottom frame of the freelist stack and push it to the top of local LRU stack. Read page in the frame at the top of the local LRU stack.

DEFICIENT PROCESS P REQUESTS A PAGE:
  Search for page in the buffer pool.
  [1] page is found:
    [1.1] page is in local LRU stack: update stack.
    [1.2] page is in the stack of another process: do nothing.
    [1.3] page is in freelist LRU stack: put found frame at the top of local LRU stack, discarding it from the freelist. Increment NALL(P), decrement EMPTY.
  [2] page is not found: a page fault occurs. If EMPTY = 0, move the frame at bottom of local LRU stack to the top of the stack. Otherwise (EMPTY > 0): get bottom frame of the freelist stack and push it to the top of local LRU stack; increment NALL(P), decrement EMPTY. In both cases, read page in the frame at the top of the local LRU stack.

PROCESS P TERMINATES:
  Prepend the local LRU stack to the freelist stack. Increment EMPTY by NALL(P).
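A compact executable rendering of this buffer manager is sketched below in Python (an illustration of the algorithm above, not code from the paper); disk reads are stubbed out and only the stack bookkeeping is modeled, with names chosen for clarity rather than taken from any implementation.

```python
from collections import OrderedDict

class BufferManager:
    """Minimal model of the local-LRU-plus-freelist algorithm (bookkeeping only)."""

    def __init__(self, frames):
        # The freelist is itself an LRU stack; front = bottom (LRU), end = top (MRU).
        self.freelist = OrderedDict((("empty", i), None) for i in range(frames))
        self.empty = frames          # number of frames currently in the freelist
        self.local = {}              # process id -> local LRU stack of pages
        self.hs = {}                 # process id -> hot-set size HS(P)
        self.nall = {}               # process id -> allocated frames NALL(P)

    def new_process(self, p, hot_set):
        self.local[p], self.hs[p], self.nall[p] = OrderedDict(), hot_set, 0

    def _held_by_other(self, p, page):
        return any(page in s for q, s in self.local.items() if q != p)

    def request(self, p, page):
        """Return True if the request faults (a disk read would be issued)."""
        stack = self.local[p]
        deficient = self.nall[p] < self.hs[p]
        if page in stack:                           # hit in the local stack
            stack.move_to_end(page)
            return False
        if page in self.freelist:                   # hit in the freelist: absorb the frame
            del self.freelist[page]
            if deficient:
                self.nall[p] += 1
                self.empty -= 1
            else:                                   # swap own LRU page into the freelist
                victim, _ = stack.popitem(last=False)
                self.freelist[victim] = None
            stack[page] = None
            return False
        if self._held_by_other(p, page):            # shared page in another local stack
            return False
        # page fault
        if deficient and self.empty > 0:
            self.freelist.popitem(last=False)       # claim the freelist's LRU frame
            self.nall[p] += 1
            self.empty -= 1
        elif deficient:                             # EMPTY = 0: reuse own LRU frame
            stack.popitem(last=False)
        else:                                       # nondeficient: swap with the freelist
            victim, _ = stack.popitem(last=False)
            self.freelist[victim] = None
            self.freelist.popitem(last=False)
        stack[page] = None                          # "read" the page into the frame
        return True

    def terminate(self, p):
        for page in self.local.pop(p):              # local stack goes on top of the freelist
            self.freelist[page] = None
        self.empty += self.nall.pop(p)
        del self.hs[p]

bm = BufferManager(frames=4)
bm.new_process("P1", hot_set=2)
assert bm.request("P1", ("R1", 0)) is True          # first reference faults
assert bm.request("P1", ("R1", 0)) is False         # now a hit in P1's local stack
```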
This policy allows page sharing, since a request will scan the whole buffer. The fact that a shared page belongs to only one local stack does not cause any problems. In fact, the most important type of sharing occurs for hot pages, such as index roots, file control pages, and so on. These pages will then be high in the local LRU stack. Should they be flushed out (for example at process termination), they will end up at the top of the freelist and be promptly absorbed by another sharing process. In the event of a sporadic sharing (usually involving a data page touched by two concurrent processes), absorption depends on the timing characteristics of the two processes. Provided that the freelist is not empty, this scheme tends to buffer errors in hot-set estimations. In the extreme case in which only one process is active, the buffering strategy is equivalent to a global LRU strategy. As a matter of fact, this scheme is analogous to a two-level hierarchy in which local LRU stacks are at the top of the hierarchy and the freelist is at the bottom. As a consequence, a maximum response time for nondeficient queries (equal to the one required for running in isolation with a buffer space equal to the query hot set) can be guaranteed. In fact, competition for frames arises only in the common freelist portion of the buffer, where some thrashing phenomena can eventually occur. This thrashing is, however, negligible, since it concerns only surplus pages whose reusal might lower the expected response time. The bulk of pages to be reused is preserved in local LRU stacks. This scheme also prevents deficient queries from causing external thrashing, since a deficient query cannot steal pages from other LRU stacks.

4.2.2 Scheduling. Apart from such obvious problems as indefinite waits (which can be solved by well-known techniques, for instance, by priority schemes), the simple scheduler suffers from other problems:

(1) The definition of the hot-set size as the maximum hot point smaller than the system buffer size tends to serialize query executions in the (likely) event of a buffer that is underdimensioned, either because the database exceeds the buffer space by several orders of magnitude or because the multiprogramming level is very high. This is undesirable, since serialization has a severe negative impact both on system throughput and query response time. The basic flaw in the hot-set definition appears to be that the buffer resources needed for the query execution are not charged to it. These resources can be accounted for by using the buffer consumption measure proposed in [28], which is used to order the wait list in the simulation experiment. Figures 7 and 8 show the observed buffer consumption for the queries shown in Figures 2 and 3, respectively. Using this metric, the hot set of a query can be defined as the hot point (no larger than the system buffer space) at which buffer consumption is minimum. This definition is effective in systems where the buffer is underdimensioned,
on Database Systems, Vol. 11, No. 4, December
1986.
Fig. 7. Observed buffer consumption for the query in Figure 2.

Fig. 8. Observed buffer consumption for the query in Figure 3.
although it might not seem appropriate when this does not happen (e.g., single-user systems). A more flexible definition of hot set is the minimum hot point that minimizes the weighted sum of the number of faults and of buffer consumption. A system designer can then set appropriate weights to effectively optimize queries for a given system. It should be noted, however, that the freelist mechanism used in the buffer manager reduces the impact of a raw buffer consumption measure. In fact, if the
on Database Systems, Vol. 11, No. 4, December
1986.
492
9
G. M. Sacco and M. Schkolnick
buffer is not underdimensioned, thrashing is unlikely to occur, and the freelist contains a number of frames that are shared by concurrent processes.

(2) The experiments showed that fast queries can occasionally be queued after long ones, thus significantly increasing their response time. This can be avoided by changing the definition of the size of the system buffer space to a quantity that would not prevent small queries from running concurrently with large ones. That is, the maximum buffer space available to a given query is smaller than the total system space.

(3) Another part of the strategy that negatively affects response time is that incoming queries are not eligible for execution unless their hot-set size is less than the system space, minus the sum of the active hot-set sizes. This may result in a severe buffer underutilization in high-sharing applications (e.g., bank-teller applications). Since the buffer manager allows page sharing and deficient processes, the sum of the active hot-set sizes could exceed the system space; if the excess refers to shared pages, no internal thrashing occurs. It seems advisable therefore that a certain amount of buffer overcommitment be allowed, as a parameter chosen by the database administrator, and depending on the amount of sharing in the specific installation. High-sharing installations may allow a relatively high degree of overcommitment. A more sophisticated approach might allow the specification of allowed overcommitment on a per-transaction class basis (avoiding overcommitment for nonsharing transactions).

(4) Since a number of frames is allocated to a process until termination, buffer allocation must be considered in deadlock detection [26]. As an example, assume that a scheduled process P1 requests a lock, which is held by process P2, and P2 cannot be scheduled until P1 terminates. This causes a deadlock. Another undesirable property of scheduling is that scheduled processes that are waiting, either for operator intervention (i.e., a process displaying results on a terminal) or until a lock is granted, waste significant buffer resources and slow down the system considerably. None of these problems occur under global LRU replacement, since unused pages are gradually flushed out of the buffer. A general solution to these problems may be provided by time-out preemption, that is, scheduled processes that idle for more than a given time interval are preempted and rescheduled at a later time. The process status, with the possible exception of the local LRU stack, must be saved. A softer approach might allow nonwaiting processes to steal frames from waiting processes when the freelist is empty. A waiting process will then become deficient, and eventually disappear from the buffer.
5. HOT-SET MODEL COSTS VERSUS TRADITIONAL I/O COSTS
Traditionally, the I/O cost of an access plan was a single quantity, usually computed assuming that the current page of each accessed file would stay in main memory. If the effects of buffering are considered, these measures must be reformulated, because the I/O cost of a plan becomes a function of the available buffer size. This new approach requires the critical revision of a number of results in the area of database access methods. The hot-set model helps considerably by characterizing continuous cost functions by the cost of a small number of buffer sizes.
The use of traditional cost functions is especially misleading in a query optimizer because fault curves can intersect. As a simple example, consider a nested-scan join by sequential scans, on two relations R1 (P1 = 20, K1 = 5,000) and R2 (P2 = 10, K2 = 300). There are two ways to compute this join (i.e., by taking as outermost either R1 or R2). The traditional I/O costs of the two plans are

    TCOST(R1 outer) = P1 + K1 * P2 = 50,200
    TCOST(R2 outer) = P2 + K2 * P1 = 6,010.

An optimizer using such measures would then choose R2 as outermost. However, using the hot-set model, these costs are

    HSCOST(R1 outer, x) = 30,      if x ≥ 11
                          50,200,  if x < 11
    HSCOST(R2 outer, x) = 30,      if x ≥ 21
                          6,010,   if x < 21.
Assuming an available buffer space B, B ≥ 11, the plan "R1 outer" should be chosen, contrary to what is suggested by traditional cost measures. Note that in the interval [11, 20], the plan chosen using traditional cost measures causes 5,980 more faults in the execution of the query (an increment of two orders of magnitude). Nested loops can be implemented more efficiently on a per-page rather than a per-tuple basis. This means that the loop is repeated for each page (rather than for each tuple) of R1, and that all the tuples in the current page of R1 and in the current page of R2 are immediately joined. Although the cost difference decreases, in the present example it is still almost one order of magnitude greater. Analogous considerations apply to other types of access. As another example, note that nested scans are never worse than merging scans (and considerably better if one or both relations must be presorted) if the smallest relation fits in the buffer. The query optimizer can discriminate among different plans on the basis of their hot-set sizes if the maximum available buffer space is known. The output of the optimizer is then a triple:

    (plan, hot set, expected cost)

Quite obviously, such a modification to the optimizer is warranted if the buffer manager ensures that the query will be run with an available buffer no smaller than its hot set.
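The buffer-dependent plan choice above can be expressed directly in code; the following Python fragment (illustrative, not from the paper) encodes the two HSCOST functions of this example and picks the cheaper plan for a given buffer size.

```python
def hscost_r1_outer(x):
    # Hot point at 1 + P2 = 11 frames: the whole R2 loop plus the current R1 page fits.
    return 30 if x >= 11 else 50200

def hscost_r2_outer(x):
    # Hot point at 1 + P1 = 21 frames.
    return 30 if x >= 21 else 6010

def choose_plan(buffer_size):
    costs = {"R1 outer": hscost_r1_outer(buffer_size),
             "R2 outer": hscost_r2_outer(buffer_size)}
    return min(costs, key=costs.get), costs

print(choose_plan(15))   # ('R1 outer', ...): 30 faults versus 6,010 in the interval [11, 20]
print(choose_plan(8))    # ('R2 outer', ...): below both hot points, R2 outer is cheaper
```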
6. HOT SET VERSUS WORKING SET
The working-set model [13] was the first to describe program behavior in terms of available memory, and is the most effective method to reduce thrashing in multiprogrammed virtual memory systems. Given a window of length T over a process reference string, the process working set at time t, W(t, T), is defined as the set of distinct pages referenced in the interval (t - T, t). The working-set size changes dynamically and may vary from 1 to T. The working set W(t, T) is used to estimate the working set W(t + 1, T), and, in particular, these two sets are assumed to be equal.
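For contrast with the static hot-set estimate, the dynamic quantity W(t, T) can be computed from a reference string as in the following Python sketch (illustrative, not from the paper); the reference string is an arbitrary example.

```python
def working_set(refs, t, T):
    """Distinct pages among the last T references before time t."""
    return set(refs[max(0, t - T):t])

refs = ["a", "b", "a", "c", "a", "b", "d"]
print(working_set(refs, t=6, T=4))   # {'a', 'b', 'c'}
```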
The assumptions underlying the working-set model are very general: the principle of locality and the fast transition between different localities. These assumptions are justified by the empirical observation that processes tend to loop for a long time over a small number of pages, and that references cluster into regions (e.g., modules) of the process space. A scheduler using the working-set model ensures that, while a process is running, a number of frames no smaller than its working-set size is allocated to it. If the working-set size requirements of a running process change, the scheduler will eventually have to back out some processes in order to free the requested number of frames.

The main difference between the working-set and the hot-set models is that the working-set model is a dynamic, run-time estimator of localities, while the hot-set model is a static, a priori estimator. As regards the assumptions, the hot-set model does not assume a general, unknown reference string, but relies on the fact that the processes that generate the references are standardized, so that it is feasible to analyze their reference patterns and estimate the size of localities prior to actual execution. Consequently:

(1) The working-set model cannot be used by the optimizer to compute a query cost. The introduction of the hot-set model is then justified, at least in this context.

(2) The working-set model is expensive in terms of instructions executed, a fact that has discouraged its use in high-performance systems. The hot-set model has no run-time overhead.

(3) Hot-set sizes do not vary during the execution of a query (or of the subplan of a query, see Section 3.5). Thus hot-set scheduling does not require backing out processes because their allocation must be varied.

A known problem with the working-set model is that its performance varies with the size of the window T. If T is too small, then some localities will be lost. On the other hand, if T is too large, useless pages will be kept in memory, unnecessarily inflating the working-set size. In database systems, determining a global value for T is extremely difficult. Assume that T is defined as the window that will capture the "average" locality. This definition requires monitoring the system, and will be useless in systems with a large number of ad hoc queries. We contend that, even if a correct determination of T is possible, hot-set buffer management is better in several important cases:

(1) Consider a process with a simple reusal pattern. The working-set size of this process will in general be T frames because a looping pattern is expected by the working-set model. The single frame granted through the hot-set model generates the same number of faults.

(2) Analogous considerations apply to a looping reusal on a locality larger than T. Again, the working-set model will assign T frames to the process, unnecessarily inflating the process requirements.

(3) When a process moves from one locality to another one (e.g., merging scans), the working set will try to keep both localities in the buffer.
These examples suggest that working-set scheduling might consistently cause a lower buffer utilization than hot-set scheduling. There are algorithms that dynamically adjust the value of T for each active process [17]. These algorithms have a high computational overhead, and their effectiveness seems questionable for short nonlooping queries, such as those locating a single record in a file through random access. A performance evaluation study by Chou and DeWitt [8] confirms the validity of these arguments, and shows that hot-set policies significantly outperform working-set policies, even when an optimal global value for T is used. As a topic for further research, we formulate the hypothesis that an "optimal" window size T can be estimated for a given query by hot-set model techniques. In this way the effectiveness of the working-set policy can be increased at a very low cost. Alternatively, working-set buffer management could be used to correct the errors in the estimation of the hot-set size, by requesting working-set buffer management with a window T' somewhat larger than the "optimal" window T for those queries for which underestimation errors could arise.
7. VARIATIONS ON THE THEME OF BUFFERING
Many variations on buffering are possible. They can roughly be categorized by the following parameters:

(1) The buffer. The internal buffer can be unique and shared by all processes (as assumed in the present paper), or it can be statically partitioned into pools which can be assigned to disk drives, to file partitions, or even to single files. Static partitioning is used by some commercial systems such as IBM VSAM, IMS, and Tandem's Enscribe. The hot-set model can be extended to static partitions as well, by computing hot sets on a per-partition basis. The scheduling algorithm then becomes considerably more complex, because enough buffer space must be available in each partition for a query to be scheduled (a sketch of this admission check follows item (3) below). It must be noted, however, that static partitions do not appear to be a good choice, since any type of static partitioning of resources is known to cause underutilization.

(2) The replacement strategy. A number of different replacement strategies can be used [14]. No single strategy can guarantee a good replacement for all types of queries: Kaplan [19] has shown that the fault rate can be cut 10-15 percent by replacement policies specifically chosen for the given type of evaluation. It can even be envisioned that different replacement policies be used to solve a single query (multistrategy replacement). All these variations cause added complexity, but can still be treated with the hot-set model. Most replacement policies (with the possible exception of random replacement) can easily be described by hot-set techniques by investigating the reference pattern generated by the evaluation method, in connection with the replacement strategy. Even multistrategy replacement fits well in the hot-set model. In fact, one can partition the buffer into several independently managed subbuffers: the query hot-set size is then the sum of the hot-set sizes for all subbuffers. No modification is required to the scheduler proposed here, because queries are still scheduled on the basis of their total hot-set size.
Multistrategy replacement was investigated by Chou and DeWitt [8]: experimental evidence of a typical throughput increase of 10 percent over LRU hot-set scheduling was reported.

(3) Special pages. This is really a combination of (1) and (2) above. Some systems (e.g., System R) manage special pages (e.g., file control pages, address translation tables) differently from data pages. Different management here means that either a special partition or a special replacement policy (or both) is used for special pages. In this paper we chose to ignore special pages, since their definition tends to be implementation-dependent. However, as discussed above, they can easily be incorporated into the model.
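The admission checks mentioned in items (1) and (2) can be summarized in a short sketch. This is a hypothetical illustration rather than an actual scheduler; the per-subbuffer (or per-partition) hot-set sizes are assumed to be supplied by the optimizer.

    # Hypothetical admission checks for hot-set scheduling; names are assumptions.

    def admissible_shared(query_hot_sets, free_frames):
        """Single shared buffer: admit if the total hot-set size fits."""
        return sum(query_hot_sets.values()) <= free_frames

    def admissible_partitioned(query_hot_sets, free_per_partition):
        """Statically partitioned buffer (or independently managed subbuffers):
        every partition must have room for the query's hot set in that partition."""
        return all(free_per_partition.get(part, 0) >= need
                   for part, need in query_hot_sets.items())

    # Example: a query needing 3 frames in an index subbuffer and 12 in a data subbuffer.
    hot_sets = {"index": 3, "data": 12}
    print(admissible_shared(hot_sets, free_frames=20))                 # True: 15 <= 20
    print(admissible_partitioned(hot_sets, {"index": 2, "data": 40}))  # False: index partition too small

Under a single shared buffer only the total matters; under static partitions a query can be blocked by one crowded partition even when ample space is free overall, which is the source of the added scheduling complexity noted in item (1).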
8. CONCLUSIONS

The examples found in this paper provide evidence that much of the performance of a database system depends on effective buffering techniques. Bad strategies can increase the I/O cost of a query by orders of magnitude, and can also produce thrashing in multiuser systems. It is perhaps surprising that such an important topic has received so little attention in the past: [14-16, 19, 22, 35] are the most noteworthy exceptions. The reasons are several. First, until recently, main-memory storage was so expensive that only very small buffers could be used; in this situation, almost all strategies behave badly. Second, the comprehensive study devoted to virtual memory systems, and their affinity to buffering, leads one to think that all, and only, the results of virtual memory systems apply to buffering systems. The most important reason, however, is that the previous observation is true for traditional, procedural systems (e.g., network model systems), in which reference strings are unpredictable (see [15] for a study of network databases). This fact leads us to conjecture that relational systems, originally introduced for productivity and ease of use, might be winners even from the viewpoint of performance. As a clue confirming this conjecture, note that processes written in pointer-based languages (e.g., LISP, the analog of a network system) are known to have a much worse virtual memory behavior than processes written in structured languages (e.g., ALGOL, the analog of a relational system).

The present paper introduces a model for the buffer requirements of relational queries and ad hoc applications, and also an effective buffer management strategy. A number of new problems are also introduced.

(1) Results based on traditional cost measures have to be revised; in particular, the comparison of methods should account for buffering and determine under which buffer sizes one method is better than another.
(2) Access strategies based on buffering must be devised. The key idea in this case is buffer locality (as in virtual memory systems). Examples of promising methods are [20] and [29], which devise divide-and-conquer strategies to artificially increase both buffer locality and parallelism.
(3) Only read-only transactions were considered in this paper. Strategies for effective buffering with write transactions must be devised. The initial foundations in this area are provided in [2] and [18].
REFERENCES
1. ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN, D. D., ESWARAN, K. P., GRAY, J. N., GRIFFITHS, P. P., KING, W. F., LORIE, R. A., MCJONES, P. R., MEHL, J. W., PUTZOLU, G. R., TRAIGER, I. L., WADE, B. W., AND WATSON, V. System R: Relational approach to database management. ACM Trans. Database Syst. 1, 2 (June 1976), 97-137.
2. BAYER, R. Database system design for high performance. In Information Processing 83, Mason, Ed. Elsevier North-Holland, New York, 1983, 147-155.
3. BAYER, R., AND MCCREIGHT, E. Organization and maintenance of large ordered indexes. Acta Inf. 1, 3 (1972), 173-189.
4. BERNSTEIN, P. A., WONG, E., REEVE, C. L., AND ROTHNIE, J. B., JR. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec. 1981), 602-625.
5. BLASGEN, M. W., AND ESWARAN, K. P. Storage access in relational databases. IBM Syst. J. 16 (1977).
6. BORAL, H., AND DEWITT, D. J. A methodology for database system performance evaluation. In Proceedings of the 1984 ACM SIGMOD Conference. ACM, New York, 176-185.
7. CHAMBERLIN, D. D., ET AL. Sequel 2: A unified approach to data definition, manipulation, and control. IBM J. Res. Dev. 20, 6 (1976).
8. CHOU, H., AND DEWITT, D. J. An evaluation of buffer management strategies for relational database systems. In Proceedings of the 11th Conference on Very Large Data Bases (Stockholm, 1985), 127-141.
9. CHRISTODOULAKIS, S. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June 1984), 163-186.
10. CODD, E. F. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970), 377-387.
11. COFFMAN, E. G., AND DENNING, P. J. Operating Systems Theory. Prentice-Hall, Englewood Cliffs, N.J., 1973, 241-312.
12. COMER, D. The ubiquitous B-tree. ACM Comput. Surv. 11, 2 (June 1979), 121-137.
13. DENNING, P. J. The working-set model for program behavior. Commun. ACM 11, 5 (May 1968), 323-333.
14. EFFELSBERG, W., AND HAERDER, T. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec. 1984), 560-595.
15. EFFELSBERG, W., AND LOOMIS, M. E. S. Logical, internal, and physical reference behavior in CODASYL database systems. ACM Trans. Database Syst. 9, 2 (June 1984), 187-213.
16. FERNANDEZ, E. B., ET AL. Effect of replacement algorithms on a paged buffer database system. IBM J. Res. Dev. 22, 2 (1978), 185-196.
17. GHANEM, M. Z. Dynamic partitioning of the main memory using the working set concept. IBM J. Res. Dev. (1975), 445-450.
18. GRAY, J. N. Notes on database operating systems. In Lecture Notes in Computer Science 60. Springer-Verlag, New York, 1978, 393-481.
19. KAPLAN, J. Buffer management policies in a database system. M.S. thesis, Univ. of California, Berkeley, 1980.
20. KIM, W. A new way to compute the product and join of relations. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 1980.
21. KNUTH, D. The Art of Computer Programming. Vol. 3, Sorting and Searching. Addison-Wesley, Reading, Mass., 1973.
22. LANG, T., WOOD, C., AND FERNANDEZ, E. B. Database buffer paging in virtual storage systems. ACM Trans. Database Syst. 2, 4 (Dec. 1977), 339-351.
23. LORIE, R. A. Physical integrity in a large segmented database. ACM Trans. Database Syst. 2, 1 (Mar. 1977), 91-104.
24. LUK, W. S. On estimating block accesses in database organizations. Commun. ACM 26, 11 (Nov. 1983), 945-947.
25. MATTSON, R. L., ET AL. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2 (1970), 78-117.
26. OBERMARCK, R. Global deadlock detection algorithms. IBM Res. Rep. RJ2845, 1980.
27. REDELL, D., DALAL, Y. K., HORSLEY, T. R., LAUER, H. C., LYNCH, W. C., MCJONES, P. R., MURRAY, H. G., AND PURCELL, S. C. Pilot: An operating system for a personal computer. Commun. ACM 23, 2 (Feb. 1980), 81-92.
28. SACCO, G. M. Fragmentation: A technique for efficient query processing. TR 20/11/82, Dip. Informatica, Univ. of Torino, Turin, Nov. 20, 1982 (revised Aug. 26, 1983).
29. SACCO, G. M. Fragmentation: A technique for efficient query processing. ACM Trans. Database Syst. 11, 2 (June 1986), 113-133.
30. SACCO, G. M., AND BALBO, G. On the estimation of join result cardinalities. TR 24/2/83, Dip. Informatica, Univ. of Torino, Turin, Feb. 24, 1983.
31. SACCO, G. M., AND SCHKOLNICK, M. A technique for managing the buffer pool in a relational system using the hot set model. In Proceedings of the 8th Conference on Very Large Data Bases (Mexico City, 1982).
32. SACCO, G. M., AND SCHKOLNICK, M. Thrashing reduction in demand accessing of a data base through an LRU paging buffer pool. U.S. Patent 4,422,145, Dec. 20, 1983.
33. SELINGER, P. G., ET AL. Access path selection in a relational database system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 1979.
34. STONEBRAKER, M. The design and implementation of INGRES. ACM Trans. Database Syst. 1, 3 (Sept. 1976), 189-222.
35. STONEBRAKER, M. Operating system support for database management. Commun. ACM 24, 7 (July 1981), 412-418.
36. WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. Estimating block accesses in database organizations: A closed noniterative formula. Commun. ACM 26, 11 (Nov. 1983), 940-944.
37. YAO, S. B. Approximating block accesses in database organizations. Commun. ACM 20, 4 (Apr. 1977), 260-261.

Received February 1985; revised April 1986; accepted April 1986