On-Line Database Tuning

Karl Schnaitter†   Serge Abiteboul‡   Tova Milo±   Neoklis Polyzotis†

† Univ. of California Santa Cruz, {karlsch,alkis}@cs.ucsc.edu
‡ INRIA and Univ. Paris 11, [email protected]
± University of Tel Aviv, [email protected]

Technical Report UCSC-CRL-06-07

Abstract

This paper introduces Colt (Continuous On-Line Tuning), a novel self-tuning framework that continuously monitors the incoming queries and adjusts the system configuration in order to maximize query performance. The key idea behind Colt is to gather performance statistics at different levels of detail and to carefully allocate profiling resources to the most promising candidate configurations. Moreover, Colt uses effective heuristics to self-regulate its own performance, lowering its overhead when the system is well tuned and being more aggressive when the workload shifts and it becomes necessary to re-tune the system. We detail the design of the generic Colt system, and present its specialization to the important problem of selecting an effective set of indices for a relational query load. We describe an implementation of the proposed framework in the PostgreSQL database system and evaluate its performance experimentally. Our results validate the effectiveness of Colt in self-tuning a relational database, demonstrating its ability to modify the system configuration in response to changes in the query load. Moreover, Colt achieves performance improvements that are comparable to more expensive off-line techniques, thus verifying the potential of the on-line approach in the design of self-tuning systems.

1 Introduction

One of the main tasks of a database administrator is tuning the physical schema of the system, that is, installing physical access structures, such as indices or materialized views, that help the data server evaluate the query load more efficiently. The selection of these access structures is itself an optimization problem: the administrator must maximize system throughput under limited storage resources for the materialized structures. To assist the administrator in this challenging task, recent studies [7, 8, 24] have introduced techniques that analyze an expected workload and automatically generate a recommended physical configuration. To accurately assess the potential gains of candidate configurations, these techniques rely heavily on what-if optimization calls that simulate the materialization of specific structures and evaluate their effect on the generated physical plans. The typically large number of candidate physical structures and the cost of the resulting what-if calls imply that such techniques are better suited for off-line tuning, prior to the normal operation of the data server. This off-line approach presents serious shortcomings when the workload is not stable, as the recommended access structures may soon become inappropriate. Furthermore, ever more data sets are made available on the Web to large audiences, and increasingly there is either no database administrator at all or the person in charge is not a database expert. These observations form the motivation behind the present work.

We introduce the Continuous On-Line Tuning framework (Colt for short) that supports the automatic on-line selection of data access structures. More precisely, Colt builds a model of the current workload based on the incoming flow of queries, estimates the respective gains of different candidate access structures, and selects those that would provide the best performance for the observed workload within the available space constraint. Thus, the system performs continuous profiling and reorganization in order to match the most recent traits of the query workload. Of course, the profiling and reorganization are performed with reasonable overhead, so that the global performance of the system is optimized.

To focus the presentation, we further assume that the data lives in a relational database and that the access structures are indices. This context has been chosen because many readers are familiar with the model and because there is already a long history of work on the self-tuning of relational database systems [7, 8, 17] to compare to. We stress, however, that the technique goes beyond the relational model. Indeed, the present technique was originally designed for the on-line tuning of an XML system taking tag and keyword indices into account. In our presentation, we distinguish the generic structure of Colt and its generic components from other "plug-in" components that need to be tailored for a specific context.

One major difficulty of on-line tuning stems from the possibly large number of queries that need to be considered and the possibly large number of candidate indices. Essentially, it is infeasible to use what-if optimization to obtain benefit estimates for each (query, index) pair. To reduce the query space, Colt uses a clustering of queries based on how they perform with respect to indices, and measures the performance of each index against a sample of queries from each cluster.
To reduce the index space, Colt splits the set of indices into three sets: (a) the materialized indices, (b) the hot indices that are not materialized but are considered promising, and (c) the remaining indices. The intuition is that hot indices have a solid chance of becoming materialized, and hence the system is willing to use what-if optimization to estimate their potential gains. For the less promising indices, on the other hand, Colt uses much coarser (and cheaper) estimates of their performance. Indices move between these classes following changes in gain expectations; hence, Colt may promote indices to the materialized set (index creation), demote some from it (index deletion), and update the set of hot indices based on the accumulated statistics. To make the most effective use of what-if optimization, we dynamically allocate a slice of the profiling resources to each (query cluster, index) pair based on two criteria: (a) the fraction of the query load that the cluster represents, and (b) the variability of gains for the specific index over the queries of the cluster. We provide a statistically sound formalization of this strategy and use it to adaptively drive the invocation of what-if calls. To further control the total overhead of what-if optimization, we determine the amount of profiling resources dynamically, based on the degree of stability of the workload. Intuitively speaking, Colt lowers its overhead if the workload is stable and the system is well-tuned, and starts spending more resources when a shift is detected and the system has to adapt to a new configuration.

We present an experimental validation of the proposed Colt framework based on a prototype implementation inside the Postgres database server. Our results demonstrate that, on a stable query load, Colt quickly discovers an effective set of access structures, rapidly reaching the performance of an ideal off-line tuning technique.
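The two allocation criteria described above can be illustrated with a toy sketch: each (cluster, index) pair is weighted by the cluster's workload fraction times the spread of its observed gains. This is only a heuristic reading of the strategy, not the paper's statistical formalization, and all names are illustrative.

```python
from statistics import pstdev

def allocate_whatif_budget(budget, pairs):
    """Split a budget of what-if calls across (cluster, index) pairs.

    pairs: (cluster, index) -> (fraction, gains), where `fraction` is the
    share of the workload the cluster represents and `gains` holds past
    QueryGain observations. Heavier and noisier pairs get more samples.
    """
    weight = {k: frac * (pstdev(g) if len(g) > 1 else 1.0)
              for k, (frac, g) in pairs.items()}
    total = sum(weight.values()) or 1.0
    return {k: round(budget * w / total) for k, w in weight.items()}

alloc = allocate_whatif_budget(10, {
    ("c1", "idx_a"): (0.8, [5.0, 30.0, 2.0]),    # large cluster, volatile gains
    ("c2", "idx_a"): (0.2, [10.0, 10.0, 10.0]),  # small cluster, stable gains
})
```

Under this heuristic, the stable pair receives no further what-if calls: its gain is already known with near certainty, so the entire budget flows to the volatile pair.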
For evolving query loads, our experiments show that Colt rapidly adapts to shifts of the query distribution, thus keeping the materialized access structures up-to-date with respect to the latest traits. At the same time, the system is shown to be resilient to noise in the workload, as it avoids "thrashing" between different configurations due to short-lived transitions in the query distribution. Finally, our experimental study demonstrates how the optimization overhead is controlled by the system, and in particular how this overhead adapts to changes in the query load.

The paper is organized as follows. Section 2 presents related work, while Section 3 discusses the problem and its key parameters. We present an overview of Colt in Section 4, and detail its specific components in Sections 5–7. Possible extensions to the system are considered in Section 8 and the experiments are described in Section 9.

2 Related Work

Earlier studies have considered self-tuning techniques for several aspects of a relational system's configuration, including the selection of indices and materialized views [8, 2, 17], the creation of optimizer statistics [1, 6], and the maintenance of caches [12, 14]. (As an interesting side note, the first reference that is known to us dates back to 1976, to the early days of relational database development.) For the purpose of our discussion, we focus solely on works on physical database organization, since they are most relevant to our proposed Colt framework. We classify such techniques into two broad categories, namely, on-line and off-line.

The first category, on-line techniques, essentially contains earlier works on semantic data caching and its variants. Similar to our proposed Colt framework, on-line techniques continuously monitor the query load and adapt some aspect of the system's physical organization in order to maximize performance. These earlier studies have a very specific focus, and the proposed techniques are tailored for particular application domains, e.g., object databases [19], web forms [18], or page-caching [12]. Our work, on the other hand, focuses on a generalized tuning framework that can be applied, after some specialization, to different problem domains. Moreover, our approach relies on the use of what-if optimization and addresses key issues such as the effective allocation of profiling resources and the self-regulation of the tuning process. In terms of general principles, our approach shares similarities with the work of Hammer and Chan [17] on the automatic selection of relational indices. The authors, however, employ a simplistic query and cost model, and it is not clear how their techniques can be extended to the complex workloads that we consider in this paper.

As the name suggests, off-line techniques function outside of the continuing operation of the database system.
Typically, such tuning methods employ a representative query load to both generate candidate physical configurations and evaluate their effectiveness. Earlier studies on this topic include techniques for the automatic selection of indices [8, 16] and/or materialized views [2, 3, 5, 7], while several commercial systems include auto-tuning tools [9, 24] that are based on the off-line approach. Off-line methods, however, are not well suited for on-line tuning, since their cost is prohibitive for the continuous monitoring of the query load. Moreover, they do not incorporate any mechanisms for allocating profiling resources or self-regulating the overhead of tuning – these are clearly non-issues when all computation is performed off-line.

As a final note, it is interesting to mention the work of Ghosh et al. [15] on query optimization through query clustering. Similar to our workload model, the authors use clustering to group together queries with similar traits in their optimized plans. Their goal is to generate a single optimized plan for each cluster and then re-use it for all matching queries. The specifics are thus different from our technique, where clustering is used solely for driving the collection of performance statistics.

3 Problem definition

In this paper, we consider a specific aspect of database tuning, namely, the inclusion of access structures in the physical database schema. Our presentation focuses on the relational model and relational indices, but, as we discuss in Section 8, our techniques are readily extensible to other domains. At an abstract level, the tuning (optimization) problem may be described as follows: given an expected query load Q and a storage budget of B units, select the set I of indices that minimizes the expected query evaluation cost and fits in the allotted storage budget. Typically, we expect the query load Q to vary during the operation of the database system. This leads to the on-line version of the tuning problem, where the set I of indices is modified on-line to adapt to changes of the query load. Similar to off-line approaches, an on-line tuning algorithm must rely on what-if calls to the query optimizer to select access structures that match the underlying query execution cost model. Moreover, the on-line algorithm must address the following important issues that are unique to continuous tuning.

– Adaptivity and Resilience to Noise. In order to be effective, the tuning process must adapt relatively fast to changes in the query load. At the same time, it should not overreact to temporary variations of the query distribution but focus instead on real changes of the workload.

– Adaptive Overhead. Since the tuning process runs concurrently with normal query processing, it is important to maintain a controlled overhead in order to avoid penalizing normal query execution. Furthermore, the overhead the system is willing to spend on tuning should depend on the gain that is expected from a modification of the physical schema.

These issues introduce interesting trade-offs that are not trivial to balance. For instance, the tight coupling with the query optimizer hints at additional optimizer invocations, which in turn increase the cost of self-tuning. Another example concerns the speed of adaptivity: if the system adapts too fast, its performance becomes vulnerable to short-lived transitions in the workload; if it adapts too slowly, it may be constantly far from optimal in terms of matching the materialized indices to the current query distribution. Our observations also indicate that adopting existing off-line methods is not a feasible solution to the problem. In particular, off-line methods use a specific instance of the workload and thus behave well only if the actual load is rather stable in time.
Moreover, they do not adequately address the issue of low overhead, as they are executed separately from the running system. Hence, they typically require substantial CPU resources and can cause a serious slow-down if deployed on-line.

We now introduce some notation and terminology for the development of our solution. We use Q to denote the current distribution of queries in the workload. Such a distribution may be seen as a set of pairs (q, p_q), where q is a query and p_q is the probability of observing q as the next query. To keep our presentation simple, we assume that Q only contains pairs (q, p_q) for which p_q > 0, and use the notation q ∈ Q to denote this. Observe that, in general, Q varies in time, that it is possibly infinite, and that the system can only guess it from the queries observed so far. We say that a physical data structure (e.g., an index) I is relevant with respect to a query load Q if I can be used to evaluate a query q ∈ Q. We note that I must match at least one predicate in q, but it does not need to appear in the corresponding optimal plan. Given a database state and a set of relevant access structures I, we use QueryCost(q, I) (with the database understood) to denote the estimated cost of the physical plan chosen by the optimizer, assuming that the access structures in I are materialized. Hence, QueryCost(q, I) depends on the search space and cost model of the system's optimizer.

Scope of our work. We demonstrate our algorithms in the context of a relational database and single-column indices. As shown in a recent study [10], this is an interesting problem in practice, as a set of carefully chosen single-column indices can offer benefits comparable to more complex configurations, e.g., ones comprising materialized views and multi-column indices. Moreover, our work illustrates the significant technical challenges posed by the approach and the non-trivial solutions it requires, even in this simplified setting.
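To make the abstract objective concrete, here is a brute-force sketch of the (off-line) problem statement. The function `toy_cost` is a purely illustrative stand-in for the optimizer's QueryCost estimate, and real tuners never enumerate all index subsets; this only spells out what is being minimized.

```python
from itertools import chain, combinations

def expected_cost(workload, config, query_cost):
    """E[cost] = sum of p_q * QueryCost(q, config) over the distribution Q."""
    return sum(p * query_cost(q, config) for q, p in workload.items())

def best_config(candidates, sizes, budget, workload, query_cost):
    """Exhaustively pick the index subset minimizing expected cost within B."""
    subsets = chain.from_iterable(
        combinations(candidates, r) for r in range(len(candidates) + 1))
    feasible = (frozenset(s) for s in subsets
                if sum(sizes[i] for i in s) <= budget)
    return min(feasible, key=lambda c: expected_cost(workload, c, query_cost))

# Toy model: each query has one index that halves its cost when materialized.
def toy_cost(q, config):
    base = {"q1": 100.0, "q2": 60.0}[q]
    helps = {"q1": "idx_a", "q2": "idx_b"}[q]
    return base / 2 if helps in config else base

workload = {"q1": 0.7, "q2": 0.3}                      # the distribution Q
best = best_config(["idx_a", "idx_b"], sizes={"idx_a": 5, "idx_b": 5},
                   budget=5, workload=workload, query_cost=toy_cost)
```

With room for only one index, the search keeps idx_a, which serves the dominant query; the on-line version of the problem re-runs this trade-off continuously as the distribution Q drifts.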
We will mention later an instantiation of the techniques in a completely different context, namely, the selection of keyword and tag indices to optimize query processing over XML files [11]. The diversity of the two scenarios demonstrates the generality of the approach.

One basic assumption of our work is that the physical access structures are non-overlapping. Two structures I1 and I2 are overlapping if one can substitute for the other in a physical plan without altering the other operators. An example is an index on some attribute R.a and a materialized view that includes the same attribute, or an index on R.a and a composite index on (R.a, R.b). The extension of our techniques to more general access structures is an interesting direction for future work, and we discuss it further in Section 8. It should be noted that this assumption does not limit the practical value of Colt for real-world applications, as a carefully chosen set of non-overlapping indices can still offer substantial performance improvements [10].

Figure 1: Architecture of Colt.

4 Overview of Colt

In this section, we present an overview of the proposed Colt framework for continuous on-line tuning. Colt divides the incoming workload into non-overlapping windows of w queries, called epochs, where w is a system parameter. We use S_i to denote the sequence of queries in the most recent i epochs. During an epoch, Colt profiles each candidate index I on the corresponding queries in order to evaluate its potential benefit on the current query load. Thus, the measurements for I in recent epochs provide a picture of its potential performance as time progresses. At the end of an epoch, Colt initiates a reorganization phase that determines which indices should be materialized, based on the performance statistics gathered while profiling. This continuous alternation between profiling and reorganization enables Colt to track the current workload and adapt the physical configuration accordingly.

Colt forms the set C of candidate indices by mining the selection predicates of queries in the sequence S_h. Here, h is a global parameter that regulates the "depth" of the system's memory and should be large enough to capture the dominant traits of the query workload. Colt continuously profiles the candidate indices and carefully selects a subset M ⊆ C, termed the materialized set, that is materialized and used for query evaluation. To avoid the high cost of extensive profiling for every candidate index, Colt employs a two-level strategy. More precisely, each index in C is profiled with crude performance statistics that are very cheap to compute. These crude statistics are used to rank candidates and identify a small set H ⊆ C of hot indices, that is, indices that have not been materialized but look promising for the current workload. Colt subsequently profiles hot and materialized indices with accurate, and therefore more expensive, methods, and from these measurements derives the new set of materialized indices.
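The alternation of profiling and reorganization over epochs can be sketched as follows. The profiler callbacks and set names here are hypothetical placeholders for Colt's actual components, shown only to fix the control flow.

```python
def run_epoch(queries, C, H, M, profiler, reorganize):
    """One profile/reorganize cycle of the epoch loop (names hypothetical).

    Every candidate in C gets a cheap, crude estimate; indices in the hot
    set H or materialized set M get accurate what-if profiling instead.
    At the end of the epoch, `reorganize` turns the statistics into new sets.
    """
    stats = {}
    for q in queries:                                  # profiling phase
        for I in C:
            accurate = I in H or I in M
            gain = profiler.whatif(q, I) if accurate else profiler.crude(q, I)
            stats.setdefault(I, []).append(gain)
    return reorganize(stats)                           # reorganization phase

class StubProfiler:
    """Stand-in for the real Profiler: fixed gains for illustration."""
    def crude(self, q, I):  return 1.0   # cheap, optimistic estimate
    def whatif(self, q, I): return 2.0   # accurate EQO-based estimate

result = run_epoch(["q1", "q2"], C={"cand1", "hot1", "mat1"},
                   H={"hot1"}, M={"mat1"},
                   profiler=StubProfiler(), reorganize=lambda s: s)
```

The sketch makes the two-level strategy visible: per-query work stays cheap for the bulk of C, and the expensive what-if path is reserved for the small sets H and M.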
By design, Colt first identifies the most important indices to materialize, and then gradually adds other indices that offer less benefit in terms of query performance. Hence, it is essential to carefully select the composition of H so that the profiling resources are focused on those indices that are most likely to be worth materializing. This approach matches the process that would be followed by a human administrator, while allowing Colt to control the overhead of self-tuning by profiling few indices at a time. Clearly, a challenge is the constant evolution of the query workload.

Figure 1 presents an architectural diagram of Colt. As shown, Colt works in parallel to the main processing pipeline. The functionalities of its three main components can be summarized as follows.

Extended Query Optimizer (EQO). The EQO extends a standard query optimizer by providing a what-if optimization interface. A what-if call simulates the optimization of the current query assuming that a particular index is materialized, thus enabling Colt to measure accurately the effect of different indices on query evaluation. Observe that the tight coupling with the query optimizer is an essential element of the approach.

Profiler. This component is responsible for gathering performance statistics for candidate indices. These statistics are updated incrementally after the evaluation of the current query, in order to overlap index profiling with a possible interval between two consecutive queries. (The system may optimize this further by collecting statistics in a batch fashion over several queries, or when the database is idle.) As explained previously, the level of detail of the collected statistics depends on the set (C, H, or M) to which the candidate index belongs. For C, the Profiler maintains very crude performance statistics; for H and M, the indices are profiled through the what-if interface of the EQO.

Self Organizer (SO). This component implements the reorganization phase of Colt and is thus activated only at the end of each epoch. The SO mines the performance statistics gathered by the Profiler and forecasts the expected benefit of each index in H ∪ M on the query workload. For indices in H, the cost of materialization is also taken into account as a negative benefit. The indices with the highest expected benefits are then selected and placed in the to-be-materialized set. We discuss in Section 7 another module in charge of actually scheduling these materializations. The SO is also responsible for selecting the hot indices from C, in order to have them profiled accurately in the coming epochs.
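As an illustration of the Self Organizer's selection step, here is a greedy sketch that charges the materialization cost against the forecast benefit of not-yet-materialized indices and fills a storage budget. The paper does not commit to this exact algorithm, so treat it as one plausible instantiation with illustrative names.

```python
def choose_materialized(forecast, M, budget, size, build_cost):
    """Greedy sketch of the SO's selection step (illustrative only).

    forecast:   index -> forecast benefit per epoch (from the Profiler)
    M:          currently materialized set; other indices are charged
                their one-time build cost as a negative benefit
    Greedily packs indices by net benefit per unit of storage.
    """
    net = {I: b - (0.0 if I in M else build_cost[I]) for I, b in forecast.items()}
    chosen, used = set(), 0.0
    for I in sorted(net, key=lambda J: net[J] / size[J], reverse=True):
        if net[I] > 0 and used + size[I] <= budget:
            chosen.add(I)
            used += size[I]
    return chosen

# "a" is already materialized; "b" must pay a build cost; "c" barely helps.
chosen = choose_materialized(
    forecast={"a": 10.0, "b": 8.0, "c": 1.0}, M={"a"},
    budget=8.0, size={"a": 4.0, "b": 4.0, "c": 4.0},
    build_cost={"a": 5.0, "b": 5.0, "c": 5.0})
```

Charging the build cost only to non-members of M gives materialized indices a natural incumbency advantage, which also damps thrashing between configurations.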
In the following sections, we detail the main components of Colt. As mentioned earlier, we present our algorithms in the context of tuning a relational database with single-column indices. Since our techniques have a more general application, our convention is to frame in boxes the parts of the algorithms that are specific to the given context.

5 Extended Query Optimizer

The Extended Query Optimizer forms part of the normal query processing pipeline, offering the services of a conventional query optimizer. At the same time, the EQO is an integral component of the on-line profiling process, as it supports the what-if interface that enables Colt to measure the performance of candidate indices.

Figure 2 shows the pseudo-code for the what-if interface. The EQO receives as input the current query q and a set P ⊆ (H ∪ M) of indices that need to be profiled against q. For each such index I ∈ P, the EQO computes a benefit metric QueryGain(q, I), defined as the reduction in the (estimated) execution cost of q if I is materialized (line 10).

Procedure WhatIfOptimize(q, P)
Input: The current query q; a set of indices P
Output: The what-if gains QueryGain(q, I) for I ∈ P
begin
 1. plan ← the already computed physical plan of q
 2. cost ← execution cost of plan
 3. for I ∈ P do
 4.   if I ∈ M then
 5.     M′ ← M − {I}
 6.   else if I ∈ H then
 7.     M′ ← M ∪ {I}
 8.   plan_I ← optimized physical plan using M′
 9.   cost_I ← execution cost of plan_I
10.   QueryGain(q, I) ← |cost_I − cost|
11. done
12. return {⟨I, QueryGain(q, I)⟩ | I ∈ P}
end

Figure 2: Algorithm for What-If Optimization.

Clearly, the computation of QueryGain(q, I) involves the generation of two physical plans: one for the case where I is materialized and one where it is not. For all indices, at least one of these two plans is already computed as part of normal query processing: for a materialized index I, the optimal query plan generated by standard query optimization naturally takes I into consideration; for a hot index I, the EQO has already computed the plan for the case where I is not materialized. The key issue, therefore, is computing the second optimal plan for each index.

To compute the second plan efficiently, the EQO uses the semantics of QueryGain in order to reuse partial results from the generation of the first plan. More concretely, each what-if scenario uses a materialized set that only slightly differs from the set that was used in the optimization of the initial plan. Hence, some intermediate solutions that were generated for the initial plan may be re-used in the what-if plan. Consider, for instance, the query σ_{a=10}(R) ⋈ S ⋈ T and a what-if scenario for a hot index I on R.a. To compute the optimal plan assuming that I is materialized, the optimizer can obviously re-use previously computed optimal plans for any sub-query that does not involve R. For instance, the join S ⋈ T does not need to be re-optimized. Only the sub-plans that involve R, such as σ_{a=10}(R) ⋈ S, need to be optimized again. In our prototype implementation of Colt, we have modified the query optimizer to cache all sub-plans from the initial optimization of the current query, in order to speed up the computation of the subsequent what-if plans.
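For concreteness, the logic of Figure 2 can be mirrored in a short Python sketch. Here `optimize(q, config)` is a hypothetical stand-in for a what-if optimizer call that returns the estimated cost of the best plan under a given materialized set, and the toy cost model is purely illustrative.

```python
def what_if_optimize(q, P, H, M, optimize, current_cost):
    """Sketch of Figure 2: compute QueryGain(q, I) for each index I in P.

    current_cost is the cost of the plan already computed for q under the
    actual materialized set M; optimize(q, config) is a hypothetical
    what-if call returning the estimated plan cost under `config`.
    """
    gains = {}
    for I in P:
        if I in M:
            config = M - {I}   # what if this materialized index were dropped?
        elif I in H:
            config = M | {I}   # what if this hot index were built?
        else:
            continue           # P is assumed to be a subset of H and M
        gains[I] = abs(optimize(q, config) - current_cost)
    return gains

# Toy cost model: each index independently shaves a fixed amount off the cost.
def toy_optimize(q, config):
    return 140.0 - (40.0 if "m" in config else 0.0) - (30.0 if "h" in config else 0.0)

gains = what_if_optimize("q", ["h", "m"], H={"h"}, M={"m"},
                         optimize=toy_optimize,
                         current_cost=toy_optimize("q", {"m"}))
```

Note how the gain of a materialized index is measured by hypothetically removing it, mirroring lines 4–7 of the pseudocode.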

6 Profiler

In this section, we detail the design of the Profiler component. As mentioned earlier, the Profiler is closely coupled with the Extended Query Optimizer, and its main function is to measure the performance of candidate indices. Intuitively, the performance of a particular index I may be captured as the reduction in estimated query execution time if I is materialized vs. if it is not. Consider a particular query q and some particular time t. Let QueryGain(q, I) = QueryCost(q, M − {I}) − QueryCost(q, M ∪ {I}) denote the savings in execution time of q when I is part of the materialized set M (with the time t understood). The benefit of I may be defined as the expected reduction in query execution time Σ_q p_q · QueryGain(q, I), which depends both on the current query distribution (for p_q) and on the optimizer's cost model and data statistics (which are used to compute QueryGain). Since it is impractical to evaluate QueryGain for each I, each q, and each time instant t, we use a statistical approach that is based on averaging over an epoch. The benefit of an index I is thus defined as follows:

    Benefit(I) = ( Σ_{q ∈ S_1} QueryGain(q, I) ) / w    (1)

where QueryGain(q, I) for an occurrence of q is computed at the time query q occurs. (So, strictly speaking, the times of the query occurrences should appear in the equation.)

Clearly, the exact computation of this metric would still require a prohibitive number of what-if calls to the query optimizer. To obtain "reasonable" estimates of Benefit(I) at moderate cost, Colt employs a two-level strategy. At the first level, the Profiler computes a crude approximation Benefit_C(I) of Benefit(I) that is used to select the most promising candidate indices and place them in the hot set H. At the second level, the Profiler uses what-if optimization calls to compute much better approximations Benefit_H(I) and Benefit_M(I) of Benefit(I) for hot and materialized indices, respectively. These various approximations of the benefit are computed with Equation 1, using appropriate approximations of QueryGain, namely QueryGain_C, QueryGain_H, and QueryGain_M, respectively. We describe in Section 6.1 the approximations of QueryGain and detail in Section 6.2 the general profiling strategy.
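Equation 1 amounts to a simple per-epoch average of per-query gains. In this hypothetical sketch, `query_gain` stands in for whichever approximation (QueryGain_C, QueryGain_H, or QueryGain_M) applies to the index's class.

```python
def benefit(index, epoch_queries, query_gain):
    """Equation (1): average gain of `index` over the w queries of epoch S_1."""
    w = len(epoch_queries)  # the epoch length, a system parameter
    return sum(query_gain(q, index) for q in epoch_queries) / w

# Illustrative gains observed for one index over a 3-query epoch.
observed = {"q1": 4.0, "q2": 0.0, "q3": 8.0}
b = benefit("idx_a", ["q1", "q2", "q3"], lambda q, I: observed[q])  # (4+0+8)/3
```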

6.1 Gain estimation

We consider in turn QueryGain_C, QueryGain_H, and QueryGain_M.

6.1.1 QueryGain_C

The estimate QueryGain_C is closely tied to the context in which Colt is applied (in this case, a relational database with single-column indices). It is important to note that this estimate does not have to be precise, since it is simply used to quickly focus resources on promising indices. The model for QueryGain_C relies on standard cost formulas to obtain an optimistic approximation of the true query gain. More formally, let q be a query in the current workload and I ∈ C be a relevant candidate index. Let R denote the table on which I is defined and σ be the selection predicate in q that I may help evaluate. The Profiler approximates the gain of I for q as QueryGain_C(q, I) = u_{q,I} · Δcost(R, σ, I), where the binary indicator variable u_{q,I} and the non-negative function Δcost(R, σ, I) are defined as follows.

• The indicator variable u_{q,I} essentially tracks the use of I in the optimized plan of q. For a materialized index I ∈ M, this information is readily derived from the execution plan of q; for a hot index I ∈ H, it may be derived from the answer of a what-if call that indicates whether the optimizer would have used I if it were present; for the remaining indices, the Profiler sets u_{q,I} = 1, assuming (optimistically) that I would be used in the optimized plan.

• The function Δcost(R, σ, I) is a crude estimate of the gain of evaluating σ using I vs. using a sequential scan of R. This provides a crude estimate of the potential of using I for the particular query, based on the assumption that no other index is present. We use standard cost formulas [22] for this computation and set Δcost(R, σ, I) = 0 if using I is more expensive than scanning R.

On purpose, the approximation Benefit_C(I) typically overestimates the performance of a candidate index I, as (a) it ignores the existence of other candidate indices on the same table, and (b) it assumes that a non-materialized index would always be chosen by the optimizer.
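A minimal sketch of this crude estimate, with textbook-style I/O counts standing in for the cost formulas of [22] (the constants are illustrative, not Colt's exact model):

```python
def delta_cost(n_pages, n_rows, selectivity):
    """Crude gain of an index scan over a sequential scan for a predicate.

    Textbook-style page I/O counts: a sequential scan reads every page; an
    unclustered index scan pays roughly one page per matching row plus a
    small constant for the B-tree descent (both numbers illustrative).
    """
    seq_scan = n_pages
    index_scan = selectivity * n_rows + 3
    return max(0.0, seq_scan - index_scan)  # 0 if the index would not help

def crude_query_gain(used_in_plan, n_pages, n_rows, selectivity):
    """QueryGain_C(q, I) = u_{q,I} * delta_cost(R, sigma, I)."""
    u = 1 if used_in_plan else 0  # u is optimistically 1 for unprofiled indices
    return u * delta_cost(n_pages, n_rows, selectivity)
```

For a 1000-page table of 100,000 rows, a 0.1% selective predicate makes the index look very attractive, while at 50% selectivity the sequential scan wins and the crude gain collapses to zero.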
This optimism has the advantage that a beneficial index will always be detected and will eventually become hot. The price to pay, of course, is that a non-effective index I may also be designated as hot. Such a false alarm is handled by the accurate profiling of I at the second level. Observe that the strategy tracks more closely the performance of materialized and hot indices, as u_{q,I} is more precise and thus less optimistic for them. This enables Colt to notice early when such an index is no longer beneficial for the current workload, and to avoid re-designating it as hot.

6.1.2 QueryGain_H

As mentioned earlier, the Profiler relies on what-if optimization calls to obtain detailed performance metrics for indices in H (and similarly for M). To control the overhead of these calls, the Profiler relies on an intuitive assumption: for queries that are "similar", one is likely to observe similar benefits for a given index I. Hence, profiling I against only a sample of queries may provide enough information for the complete query set. As an example of this observation, consider two queries q1 : σ30
