Autonomous Management of Soft Indexes
Martin Lühring, Kai-Uwe Sattler, Department of Computer Science & Automation, TU Ilmenau, Germany
Karsten Schmidt, Department of Computer Science, TU Kaiserslautern, Germany
Eike Schallehn, Department of Computer Science, University of Magdeburg, Germany
Abstract In recent years the support for index tuning as part of physical database design has gained focus in research and product development, which resulted in index and design advisors. Nevertheless, these tools provide a one-off solution for a continuous task and are not deeply integrated with the DBMS functionality by only applying the query optimizer for index recommendation and profit estimation and decoupling the decision about and execution of index configuration changes from the core system functionality. In this paper we propose an approach that continuously collects statistics for recommended indexes and based on this, repetitively solves the Index Selection Problem (ISP). A key novelty is the on-the-fly index generation during query processing implemented by new query plan operators IndexBuildScan and SwitchPlan. Finally, we present the implementation and evaluation of the introduced concepts as part of the PostgreSQL system.
1. Introduction One of the key features and a major reason for the success of DBMS is removing the burden of low-level data management aspects from the user of the system. While system-internal solutions for storage layout, access optimization, etc. were developed early on, database tuning remained an extensive task for system administrators and consultants. The complexity of this issue results from the necessity to thoroughly understand the DBMS internals as well as the given database application to be tuned. One of the key aspects of database tuning is the selection of a physical database design, and most prominently selecting a set of useful indexes, which will be our focus in this paper. The current state-of-the-art to address physical database design are advisors, which are available for all major
DBMS. These tools support administrators by automatically analyzing input workloads and, based on this, recommending indexes, materialized views, and partitioning schemes. Though this eases the tuning process by automating the decision, the solution is still static and has to be carried out manually and to a large extent outside of the system. Furthermore, for most database systems the tuning process has to be done iteratively because of the dynamic changes of the overall system. These changes include
• changing workloads (especially for exploratory usage like OLAP, or generated queries, e.g. mapped XML queries),
• changing data (size and distributions),
• schema changes (updates, additions, removal of tables),
• infrastructure changes (hardware, operating system, etc.), and
• interferences with other tuning measures (materialized views, partitions, memory or buffer management, etc.).
For these reasons, we believe it is worthwhile to investigate truly dynamic, autonomous, and system-integrated solutions for index tuning. In this paper we present an approach for autonomous index tuning which is fully integrated with the DBMS and runs independently of user or administrator interaction, though we explicitly emphasize the need to monitor such an approach and to make information about the current state and decision process available. Furthermore, recent research by some DBMS vendors, described later on, indicates that autonomous management of the physical design of a database is gaining acceptance as a future direction of development. The description follows the general auto-tuning model of Observation, Prediction, and Reaction, which here corresponds to observing workloads and index recommendations,
iterative decisions about index configuration changes, and index creation which is integrated with query processing by exploiting table scans for index materialization. For this purpose, we introduce the concept of soft indexes which are autonomously managed (i.e. created and dropped) by the DBMS in contrast to hard indexes created explicitly by the DBA. On-the-fly index generation is the key novelty compared to our previous research. Finally, we describe the implementation and evaluation of the proposed concepts as an extension to the PostgreSQL system.
2. Foundations and Related Work Autonomous index tuning provides a dynamic solution for the well-studied Index Selection Problem (ISP) [9]. For a static solution, the ISP can be described in a simplified form as a special case [17] of the knapsack problem [12]. Given is a set of queries $Q_1, \dots, Q_m$ as well as a set of index candidates $I_1, \dots, I_n$ with associated index management costs $\mathrm{mcost}(I_i)$ and their sizes $\mathrm{size}(I_i)$. The profit of an index $I_i$ for a query $Q_k$ is the difference between the query execution costs with ($\mathrm{cost}(Q_k, I_i)$) and without ($\mathrm{cost}(Q_k)$) this index.

\[ \mathrm{profit}(Q_k, I_i) = \max\{0,\ \mathrm{cost}(Q_k) - \mathrm{cost}(Q_k, I_i)\} \tag{1} \]

The ISP consists of finding an index configuration $C \subseteq \{I_1, \dots, I_n\}$ of materialized indexes which optimizes the overall execution time by maximizing the profit

\[ \sum_{i=1}^{m} \max\{\mathrm{profit}(Q_i, I_j) : I_j \in C\} \;-\; \sum_{I_j \in C} \mathrm{mcost}(I_j) \tag{2} \]

while the size of the indexes in $C$ is constrained by a space limit $S$.

\[ \sum_{I_j \in C} \mathrm{size}(I_j) \le S \tag{3} \]
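As a minimal sketch of how Eqs. (1) to (3) fit together, the following Python functions evaluate a given configuration. The dictionary layout and the names `configuration_value` and `fits` are assumptions made for illustration only; they are not part of the described system.

```python
from typing import Dict, Iterable, Tuple


def profit(cost_without: float, cost_with: float) -> float:
    """Eq. (1): profit of one index for one query, clamped at zero."""
    return max(0.0, cost_without - cost_with)


def configuration_value(
    config: Iterable[str],
    query_cost: Dict[str, float],
    query_index_cost: Dict[Tuple[str, str], float],
    mcost: Dict[str, float],
) -> float:
    """Eq. (2): best single-index profit per query minus management costs."""
    config = list(config)
    total = 0.0
    for query, base in query_cost.items():
        # Assumption: a missing (query, index) entry means the index
        # does not change the cost of that query.
        total += max(
            (profit(base, query_index_cost.get((query, idx), base)) for idx in config),
            default=0.0,
        )
    return total - sum(mcost[idx] for idx in config)


def fits(config: Iterable[str], size: Dict[str, float], space_limit: float) -> bool:
    """Eq. (3): the summed index sizes must not exceed the space limit S."""
    return sum(size[idx] for idx in config) <= space_limit
```

Solving the ISP then amounts to searching for the configuration with the highest `configuration_value` among those for which `fits` holds.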
Though we apply this widely accepted formulation of the ISP in this paper, we have to point out that it only represents an approximation of the problem. Due to dependencies of indexes used together in queries, the accurate profit can only be defined in terms of index sets $\mathcal{I} \subset C$. Accordingly, the ISP is not identical to the classic knapsack problem, because a value can only be defined on subsets of the input set of possible indexes, which adds complexity due to a combinatorial explosion of possible subsets. Previous research, e.g. [6], as well as the current state-of-the-art index advisor tools [1, 19] provide solutions for the static ISP based on greedy strategies and other approximative approaches. Furthermore, they consider several important aspects not discussed here in detail, like dependencies between indexes and dependencies with other tuning measures (materialized views, partitioning). More recent approaches try to improve on the complexity and accuracy for solving the static ISP, e.g. [3, 7]. The dynamic aspect of autonomous index tuning is here described according to the auto-tuning model consisting of a cycle of Observation, Prediction, and Reaction as introduced in [18], which is similar to the alternative MAPE cycle described in [13]. The phases are iterated continuously to monitor the current system behavior, to derive from these observations a decision about future behavior and the corresponding optimal system properties, and, if changes are required, to apply these changes during the reaction phase. To position approaches along the several directions we investigate, we furthermore consider for each phase the object, term, and strategy, i.e. what, when, and how a self-tuning system monitors the behavior, decides upon changes, and executes them. Currently considered alternatives for continuous index tuning are shown in Table 1. Accordingly, for the approach of autonomous index tuning presented here, we consider the following aspects.
Observation: like the current static approaches, for the observation we monitor incoming queries. Nevertheless, we have to maintain and adjust statistics about recent usage dynamically. These statistics are based on index recommendations for single queries and the hypothetical benefit as the cumulative profit for such index candidates.
Prediction: to derive an optimal index configuration, we maintain a list of all index candidates ordered by a criterion described later on. At frequently repeated decision points (after a number of queries) the top k indexes are chosen for materialization during the prediction phase under some further constraints. This equates to a greedy strategy.
Reaction: contrary to our previous work [14, 15], index creation and deletion is integrated with query processing to minimize the performance impact on the query triggering the index configuration change.
For the purpose of on-the-fly index generation, the IndexBuildScan and SwitchPlan operations are introduced. Recently, an alerter approach was presented in [4], which is dynamic regarding the phases of observation and prediction by alerting the administrator if the current index configuration could be improved. Furthermore, in [2] it is suggested that considering the dynamics of system usage, by treating the workload during observation as a sequence instead of a set of statements, can significantly improve the result of index tuning tools. An approach similar to our previous research [14, 15] is presented in [16], which also applies epochs of observation as used in the approach presented here. Most recently, [5] describes a solution which is deeply integrated with the query optimizer and approximatively deals with index interactions, but does not support on-the-fly index generation as presented here.

Table 1. Alternatives for Autonomous Index Tuning
(1) Observation
  Object: Query workloads, optimizer recommendations, logical or physical accesses
  Term: Manually with frequent repetition; frequently repeated, triggered by time or events; continuously
  Strategy: Information completeness: full or sampled; Storage: full or condensed, sliding time frame window or aging strategy
(2) Prediction
  Object: Changes to the index configuration, restructuring partial or access-balanced index structures
  Term: One-off decision with possible repetition; repeated decisions triggered by time or event (number of queries, environment changes); continuously (triggered by continuous monitoring)
  Strategy: Integration with DBMS: none, optimizer, query processing; Algorithmic approach: ISP with Greedy, Dynamic Programming, Relaxation; Integration with other tuning issues: materialized views, partitioning, memory management, etc.
(3) Reaction
  Object: Creation or deletion of indexes, restructuring partial or access-balanced index structures
  Term: Alerting for required reaction; autonomous and immediate reaction before or after triggering query; on-the-fly during query processing; delayed (user-defined, system downtime)
  Strategy: Independent from or integrated with query processing
3. Concepts of Autonomous Index Tuning Based on the auto-tuning model, below we describe the basic concepts applied for our approach of autonomous index tuning regarding Observation, Prediction, and Reaction.
3.1. Observation: Workload Monitoring and Index Candidate Statistics To derive the local profit, each processed query $Q$ is optimized twice, once without considering any indexes, and then considering all index candidates. Index candidates are derived according to the approach described in [17] from WHERE, GROUP BY, ORDER BY, and SELECT clauses of the query. Then, possible combinations are created and an estimated size is assigned to each one. The local profit of an index set $\mathcal{I}$ used in the second optimizer run, which is a subset of the created index candidates, can be computed based on the execution costs with and without the index set, $\mathrm{cost}(Q, \mathcal{I})$ and $\mathrm{cost}(Q)$, respectively.

\[ \mathrm{profit}(Q, \mathcal{I}) = \mathrm{cost}(Q) - \mathrm{cost}(Q, \mathcal{I}) \tag{4} \]
As all indexes in $\mathcal{I}$ may contribute differently to the overall profit, in [15] we evaluated different approaches to assign this profit of the index set $\mathcal{I}$ to the single indexes $I \in \mathcal{I}$. Based on these considerations, we chose the following rough estimation.

\[ \mathrm{profit}(Q, I) = \mathrm{profit}(Q, \mathcal{I}) \cdot \frac{\mathrm{size}(I)}{\sum_{I_j \in \mathcal{I}} \mathrm{size}(I_j)} \tag{5} \]
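The size-proportional split of Eq. (5) can be illustrated with a short Python sketch. The function names and the dictionary of index sizes are assumptions made for this example.

```python
from typing import Dict


def local_profit(cost_without: float, cost_with_set: float) -> float:
    """Eq. (4): profit of the whole index set used in the second optimizer run."""
    return cost_without - cost_with_set


def apportion_profit(set_profit: float, sizes: Dict[str, float]) -> Dict[str, float]:
    """Eq. (5): split the set profit among the indexes in proportion to size."""
    total_size = sum(sizes.values())
    if total_size == 0:
        # Assumption: with no size information nothing is attributed.
        return {name: 0.0 for name in sizes}
    return {name: set_profit * s / total_size for name, s in sizes.items()}


# A query costing 100 without and 40 with the index set {a, b}:
shares = apportion_profit(local_profit(100.0, 40.0), {"a": 30.0, "b": 10.0})
```

Here the larger index `a` receives three quarters of the set profit of 60, which shows how coarse the estimation is: size stands in for contribution.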
Furthermore, the update costs related to an index can be considered as a negative addition to the profit, based on the number of rows $\mathrm{nrows}$ affected by an update operation $Q_U$, the height of the index tree, and an empirically derived factor $F$ that relates the result to the query execution costs.

\[ \mathrm{profit}(Q_U, I) = -\mathrm{height}(I) \cdot \mathrm{nrows}(Q_U) \cdot F \tag{6} \]
The statistical information for all recommended indexes gathered as described before is managed by the soft index manager, and the set of all soft indexes is denoted as $D$ in the following. When storing statistics for all index candidates $I \in D$, we furthermore have to consider that index definitions can overlap, and that, for instance, a query recommending an index on R(A) could use an index on R(A, B) as well. Accordingly, the profit of R(A) can be assigned to R(A, B), too. Index containment, denoted as $I_1 \sqsubseteq I_2$, is given if the indexes share the same sorting order and the index attributes of $I_1$ are a prefix of those of $I_2$. In this case the profit is equal and can be added to the benefits of both indexes.

\[ \forall I_i \in D : I_r \sqsubseteq I_i \Rightarrow \mathrm{profit}(Q, I_i) = \mathrm{profit}(Q, I_r) \tag{7} \]
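A containment test along the lines of Eq. (7) could look like the following Python sketch. The `IndexDef` record, the single `ascending` flag used to model the sorting order, and the `propagate_profit` helper are hypothetical simplifications for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class IndexDef(NamedTuple):
    table: str
    columns: Tuple[str, ...]
    ascending: bool = True  # assumption: one flag models the sorting order


def is_contained(inner: IndexDef, outer: IndexDef) -> bool:
    """True if inner is contained in outer: same table, same sorting order,
    and the columns of inner are a prefix of the columns of outer."""
    return (
        inner.table == outer.table
        and inner.ascending == outer.ascending
        and outer.columns[: len(inner.columns)] == inner.columns
    )


def propagate_profit(
    recommended: IndexDef,
    profit: float,
    candidates: Iterable[IndexDef],
    stats: Dict[IndexDef, float],
) -> None:
    """Credit the profit of a recommended index to every containing candidate."""
    for cand in candidates:
        if is_contained(recommended, cand):
            stats[cand] = stats.get(cand, 0.0) + profit
```

With this definition an index is contained in itself, so the recommended index receives its own profit through the same loop.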
On a workload level, the previously measured profits of indexes for single queries have to be gathered and condensed. To deal with this temporal aspect, the approach presented here uses the concept of an epoch, which simply marks an observation period of limited length. This represents a simple alternative to the aging of statistics we applied in our previous research [15], where each incoming query might trigger index configuration changes. We investigate epochs as an alternative because they cause a lower overhead and in many scenarios result in a more stable index configuration. The length of an epoch can be defined in terms of time, number of queries, by a maximum overall benefit of one of the index candidates, or, as used in our implementation, by a maximum number of recommendations for an index. Furthermore, we can adjust the benefit to the most recent usage by considering the timestamps $ts_1, ts_2, \dots, ts_k$ of recommendations within an epoch that ends at a time $ts_E$, to decrease the weight of older recommendations.

\[ \mathrm{benefit}(I) = \sum_{j=1}^{k} \frac{\mathrm{profit}(I, ts_j)}{ts_E - ts_j} \tag{8} \]
For this purpose, recommendation profits and timestamps are stored in a fixed length field, where an epoch ends when this field is full.
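The fixed-length field and the time-weighted benefit of Eq. (8) can be sketched as follows. The class name `EpochStats`, its methods, and the handling of a recommendation made exactly at the end of the epoch are assumptions for illustration; the source does not specify them.

```python
from typing import List, Tuple


class EpochStats:
    """Per-index store of (profit, timestamp) pairs with fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[Tuple[float, float]] = []

    def record(self, profit: float, ts: float) -> bool:
        """Store one recommendation; return True when the field is full,
        which marks the end of the epoch."""
        self.entries.append((profit, ts))
        return len(self.entries) >= self.capacity

    def benefit(self, ts_end: float) -> float:
        """Eq. (8): older recommendations get a smaller weight.
        Assumption: entries with ts == ts_end are skipped to avoid a
        division by zero."""
        return sum(p / (ts_end - ts) for p, ts in self.entries if ts_end > ts)


stats = EpochStats(capacity=3)
stats.record(10.0, 1.0)
stats.record(20.0, 3.0)
epoch_over = stats.record(30.0, 4.0)
```

In this example the third recommendation fills the field, so `epoch_over` signals that the decision phase should run.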
3.2. Prediction: Dynamic Soft Index Selection The end of an epoch triggers the decision phase, which solves the ISP using the information gathered during the observation phase. For each index $I$, this information comprises $\mathrm{benefit}(I)$, the estimated $\mathrm{size}(I)$, and the current $\mathrm{state}(I)$, where the latter denotes whether the index is materialized or not. Furthermore, the space constraint for materializing indexes is considered as $plimit$. To solve the ISP, we apply a greedy strategy as outlined before. For this purpose, all index candidates (materialized and soft indexes) are sorted based on their relative benefit.

\[ \mathrm{relative\_benefit}(I) = \frac{\mathrm{benefit}(I)}{\mathrm{size}(I)} \tag{9} \]
Then a new index configuration $C$ is computed as outlined in Algorithm 1. To avoid thrashing, i.e. quickly repeated creation and deletion of the same indexes, a new index configuration $C$ only replaces the current configuration $C_{cur}$ if the new benefit exceeds the benefit of the old configuration by some threshold, which usually is a constant factor > 1.0 times the benefit of the old configuration.

Algorithm 1 Greedy Algorithm for Index Selection
  I[1 . . . n] := sort(D) by relative_benefit;
  C := ∅;
  avail_space := plimit;
  overall_benefit := 0;
  for all k := 1 . . . n do
    if avail_space − size(I[k]) > 0 then
      C := C ∪ {I[k]}
      avail_space := avail_space − size(I[k])
      overall_benefit := overall_benefit + benefit(I[k])
    end if
  end for
  if overall_benefit < threshold then C := C_cur end if
  return C

The materialization of a new index configuration implies creating new indexes as deferred indexes and deleting those that were found less useful. Nevertheless, the entries of deleted indexes remain in the index catalog as virtual indexes and are from then on only used for collecting statistics gathered from the optimizer's recommendations. Note that we ignore the costs for index deletion (negative benefit) because we assume that deletion is rather cheap. The complexity of the above algorithm is O(n log n) for n index candidates due to the effort necessary for sorting the index list. The overall effort is reasonable, as the search space is limited by the optimizer's index selection during the second optimizer run when recommending indexes. This approach does not guarantee an optimal solution, but previous research has shown that the result is sufficiently precise, especially considering that it is based on several estimated input variables and that the result is used as a prediction of future index usage.
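The greedy selection and the anti-thrashing check can be sketched in Python as follows. The `Candidate` record, the function names, and the default factor of 1.1 are assumptions for illustration; the strict space check mirrors the condition printed in Algorithm 1.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    benefit: float
    size: float  # assumed to be > 0


def greedy_select(
    candidates: Sequence[Candidate], space_limit: float
) -> Tuple[List[Candidate], float]:
    """Pick candidates in order of relative benefit (Eq. 9) while space remains."""
    ordered = sorted(candidates, key=lambda c: c.benefit / c.size, reverse=True)
    chosen: List[Candidate] = []
    avail = space_limit
    total = 0.0
    for cand in ordered:
        if avail - cand.size > 0:  # same strict test as in Algorithm 1
            chosen.append(cand)
            avail -= cand.size
            total += cand.benefit
    return chosen, total


def next_configuration(
    candidates: Sequence[Candidate],
    current: List[Candidate],
    space_limit: float,
    factor: float = 1.1,  # hypothetical threshold factor > 1.0
) -> List[Candidate]:
    """Keep the current configuration unless the new one is clearly better."""
    new, new_benefit = greedy_select(candidates, space_limit)
    threshold = factor * sum(c.benefit for c in current)
    return current if new_benefit < threshold else new
```

The sort dominates the running time, which matches the O(n log n) bound stated above; the selection loop itself is linear.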
3.3. Reaction: On-the-fly Generation of Indexes When a necessary change to the index configuration was detected during the previous phase, indexes are created as deferred indexes, which are due for materialization later on. In general, the materialization can take place at any time, and using the downtime or times of low system load are viable alternatives. Nevertheless, the approach presented here investigates decreasing the impact of index creation by integrating it with query processing as proposed in [10]. Though this results in some overhead, which we discuss later on, it can provide an up-to-date index configuration, especially for systems with high requirements regarding availability and permanently high system load. Furthermore, the index created during a table scan operation of a query can be used within the same query if applicable. To achieve this on-the-fly generation of indexes and their immediate usage we introduce the IndexBuildScan and the SwitchPlan operations.

IndexBuildScan. This operation (denoted by $\xi_L(r)$) extends a usual table scan to create one or more deferred indexes. An additional parameter is the list of deferred indexes $L$ on the table, which are to be materialized during the scan as outlined in Algorithm 2.

Algorithm 2 IndexBuildScan Operator ξ_L on a relation r with predicate P
State:
  scan_complete := false
Open:
  for all I ∈ L do
    if deferred(I) then prepare_index(I) end if
  end for
  open_relation(r)
Next:
  tuple := seqscan_next(r)
  if tuple ≠ ⊥ then
    for all I ∈ L do
      if deferred(I) then insert(I, tuple) end if
    end for
    if P(tuple) then return tuple else return next() end if
  else
    scan_complete := true
    return tuple
  end if
Close:
  close_relation(r)
  if scan_complete then
    for all I ∈ L do
      if deferred(I) then finish(I) end if
    end for
  else
    for all I ∈ L do
      if deferred(I) then reset(I) end if
    end for
  end if

An index build scan can replace a full table scan $\sigma^{SEQ}_P(r)$ (full meaning, for instance, that there is no LIMIT clause or exit condition) when all indexes of $L$ are defined on the same base relation $r$.

\[ \sigma^{SEQ}_P(r) \Leftrightarrow \sigma_P(\xi_L(r)) \quad \text{if } \forall I \in L : I \text{ deferred index on } r \tag{10} \]

The cost for an IndexBuildScan consists of the scan cost and, for each index, the cost for building the index, which can be considered equal to sorting the relation and writing the index pages to disk. The number of index pages can be derived from the key and rowid size, the page size, and the fill factor.

\[ \mathit{cost} = \underbrace{\mathrm{pages}(r)}_{\text{scan on } r} + \sum_{I \in L} \Bigl( \underbrace{|r| \log |r|}_{\text{sorting}} + \underbrace{\mathrm{idx\_pages}(I) \cdot \mathit{cost}_{page\_write}}_{\text{write index pages}} \Bigr) \tag{11} \]
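The following Python sketch mimics the control flow of Algorithm 2 with a generator and evaluates Eq. (11). The in-memory `DeferredIndex` class is a hypothetical stand-in for a B+-tree, and the base-2 logarithm in the cost function is an assumption, since the source does not state the base.

```python
import math
from typing import Any, Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

Row = Dict[str, Any]


class DeferredIndex:
    """Hypothetical in-memory stand-in for a deferred index on one column."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.state = "deferred"
        self.entries: List[Tuple[Any, int]] = []

    def prepare(self) -> None:
        self.entries = []

    def insert(self, rowid: int, row: Row) -> None:
        self.entries.append((row[self.key], rowid))

    def finish(self) -> None:
        self.entries.sort()
        self.state = "ready"

    def reset(self) -> None:
        self.entries = []


def index_build_scan(
    rows: Iterable[Row], predicate: Callable[[Row], bool], indexes: Sequence[DeferredIndex]
) -> Iterator[Row]:
    """Sequential scan that feeds every tuple into the deferred indexes."""
    pending = [idx for idx in indexes if idx.state == "deferred"]
    for idx in pending:
        idx.prepare()  # Open
    complete = False
    try:
        for rowid, row in enumerate(rows):  # Next
            for idx in pending:
                idx.insert(rowid, row)  # every tuple is indexed, not only matches
            if predicate(row):
                yield row
        complete = True
    finally:  # Close: keep the index only after a complete scan
        for idx in pending:
            if complete:
                idx.finish()
            else:
                idx.reset()


def index_build_scan_cost(
    pages_r: float, card_r: int, idx_pages: Sequence[float], cost_page_write: float
) -> float:
    """Eq. (11), with log taken to base 2 (an assumption)."""
    sort_cost = card_r * math.log2(card_r) if card_r > 1 else 0.0
    return pages_r + sum(sort_cost + p * cost_page_write for p in idx_pages)
```

If the consumer stops early and the generator is closed, the `finally` branch discards the partial index, which corresponds to the `reset` path of the Close step.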
SwitchPlan. The purpose of the SwitchPlan operator (written SwitchPlan(left, right) in the rules below) is to allow using newly created indexes in the same query that created the index. The SwitchPlan operator is a simplified variant of the ChoosePlan operator introduced in [11, 8]: during the first open-next-close phase the tuples are fetched from the left child operator, but during all following phases the input comes from the right child. In this way, a SwitchPlan operator can switch between an index-building scan and an index scan exploiting the newly created index. This is illustrated in Fig. 1.

[Figure 1. SwitchPlan Example: the plan NLJoin(TableScan(r), TableScan(s)), whose inner scan runs n = |r| times, is rewritten to NLJoin(TableScan(r), SwitchPlan(IndexBuildScan(s)↦i, IndexScan(i,s))), where the IndexBuildScan runs once and the IndexScan n−1 times.]

In this case, a nested loop join is rewritten to create an index during the first inner iteration. For every following iteration the join is processed like an index nested loop join. Obviously, this works only in situations where the SwitchPlan operator is the right child of a nested iteration-based operator, e.g. a nested loop as in the above example, set operations, or nested queries. Thus, the following rules for rewriting hold ($\sigma^{IND}_P(r)$ represents an index scan):

\[ \sigma^{SEQ}_P(r) \Leftrightarrow \mathrm{SwitchPlan}\bigl(\sigma_{\varphi}(\xi_L(r)),\ \sigma^{IND}_P(r)\bigr) \tag{12} \]
\[ \sigma^{IND}_P(r) \Leftrightarrow \mathrm{SwitchPlan}\bigl(\sigma_{\varphi}(\xi_L(r)),\ \sigma^{IND}_P(r)\bigr) \tag{13} \]
\[ r \bowtie_P s \Leftrightarrow r \bowtie_P \mathrm{SwitchPlan}\bigl(\sigma_P(\xi_L(s)),\ \sigma^{IND}_P(s)\bigr) \tag{14} \]

The costs for a (sub-)plan containing a SwitchPlan as the root node are:

\[ \mathit{cost} = \mathit{cost}_{left} + (\mathit{card}_{left} - 1) \cdot \mathit{cost}_{right} \tag{15} \]
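The switching behavior and the cost formula of Eq. (15) can be sketched in a few lines of Python. The class below and its `open` method are a hypothetical simplification: each child is modeled as a callable that returns an iterable of tuples for one open-next-close phase.

```python
from typing import Any, Callable, Iterable, Iterator


class SwitchPlan:
    """Reads from the left child in the first phase, from the right one afterwards."""

    def __init__(
        self, left: Callable[[], Iterable[Any]], right: Callable[[], Iterable[Any]]
    ) -> None:
        self.left = left    # e.g. an index-building scan
        self.right = right  # e.g. an index scan on the new index
        self.phases = 0

    def open(self) -> Iterator[Any]:
        """Start one open-next-close phase and return its tuple stream."""
        source = self.left if self.phases == 0 else self.right
        self.phases += 1
        return iter(source())


def switch_plan_cost(cost_left: float, card_left: float, cost_right: float) -> float:
    """Eq. (15): the left sub-plan is paid once, the right one for every
    remaining phase."""
    return cost_left + (card_left - 1) * cost_right
```

A nested loop join would call `open()` once per outer tuple, so only the first inner iteration pays for the index-building scan.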
In Eq. (15), cost_left and cost_right denote the costs of the left and right sub-plan, respectively, and card_left is the estimated result cardinality of the left sub-plan. Discussion. Building an index during the processing of a query turns a read-only transaction into a write transaction. This could lead to two problems. First, in case of
a transaction abort, the atomicity of index creation has to be guaranteed, i.e. the index should be available only if it was built completely. Second, concurrently running queries could try to build the same index. However, both problems can be solved by a simple state model: initially, a soft index is in the state “deferred”; starting an IndexBuildScan moves it to “under construction”, and after a successful and complete creation it is transferred to the state “ready”. An index is then only used for index scans in the state “ready”, and it is created by an IndexBuildScan only if the state is still “deferred”. Though the SwitchPlan operator is primarily intended for intra-query usage, exploiting newly created indexes in an inter-query manner is also possible: as soon as a soft index is set to state “ready”, it can be used by another running query even if the initial index-building query is not finished yet. Another issue is the cost-based plan selection. In many cases, the optimizer would reject a plan containing an IndexBuildScan, because it decides locally, solely in favor of the current query and not of the overall workload. For this reason, execution and index build costs are treated separately. Then, an index-building plan is chosen if
• soft index management is enabled (by a configuration parameter),
• deferred indexes are available which can be created in the given query,
• the plan has minimal execution costs,
• the overall profit of all indexes created by this plan less the index creation costs exceeds a given limit.
In this way, on-the-fly index building is always a global, workload-driven decision.

4. Implementation and Evaluation Our soft index management extension consists of three new modules: the index advisor for recommending the set of most beneficial indexes for a given query; the soft index manager, which maintains the global view by monitoring the query workload, maintaining the set of all index candidates (materialized or not), and triggering creation and deletion of indexes; and two additional plan operators, i.e. the IndexBuildScan and the SwitchPlan operator. Fig. 2 shows how these pieces are integrated into the query processing pipeline of the PostgreSQL backend. The gray components are the newly introduced modules. The soft index manager is placed between the rewriter and the planner in order to monitor the queries using the index advisor. We have also extended the planner in the following way. During plan construction, all index-based access paths are considered. At this point, deferred indexes also have to be taken into account. However, these access paths have to be combined with IndexBuildScan operators. Finally, if a nested iteration is performed, a SwitchPlan operator combined with an index scan is inserted. In addition, we have extended the database catalog by two tables, one for collecting query-specific index recommendations produced by the index advisor and one for collecting information about all soft indexes. Furthermore, we have added the following properties to the existing B+-tree indexes.
• managed indexes are maintained automatically by the system,
• virtual indexes are hypothetical indexes used only during query planning, i.e. these indexes are registered in the system catalog only together with the corresponding statistics,
• deferred indexes are created as empty indexes first and are populated later on.

[Figure 2. Integration of Soft Index Management in PostgreSQL. Components shown: Client; postgres Backend with Parser, Rewriter, Soft Index Manager, Planner & Optimizer, and Executor; Index Advisor, Index Catalog, Index Pool, Plan Operators; Database.]

All the described components are completely implemented in PostgreSQL 7.4. Thus, the results of the experimental evaluation presented in the following are produced using this system running on SUSE Linux 10, Pentium 4 (3 GHz), 1 GB RAM. Here, we used the TPC-H database with scale factor 1. Due to several problems with the TPC-H queries in PostgreSQL (some queries are not optimizable using the default optimizer, other queries show non-reproducible results in cases where just additional multicolumn indexes are available) we have chosen our own workload of queries based on the TPC-H schema. The goal of this evaluation was not only to study the behavior and the influence of the components but also the practicability of the overall approach in order to answer the question: Is it possible to let the DBMS autonomously decide about maintaining useful indexes? With the first experiment we have investigated the overhead of the IndexBuildScan operator compared to an explicit CREATE INDEX as well as a full table scan (SeqScan) by measuring the runtime on different columns of TPC-H tables (e.g. LINEITEM, denoted by the prefix l_). Fig. 3 shows the times, with the runtime of the full table scan set to 100%. The results show that there is an overhead of 50 to 150% compared to the full table scan. However, an index which was predicted as useful is created as a byproduct. Another observation from this experiment was that building multiple indexes in a single scan (multi) is, at least in PostgreSQL, more expensive than building one after another. Reasons are concurrent index creations as well as the dominance of building costs.
[Figure 3. CREATE INDEX vs. IndexBuildScan. Axes: Index (l_orderkey, l_partkey, l_shipdate, p_partkey, ps_partkey, multi) vs. Time [%]; series: SeqScan, CreateIndex, IndexBuildScan.]
[Figure 4. SwitchPlan in Sub-Selects. Axes: Subquery Iterations vs. Time [secs]; series: SwitchPlan, SeqScan, IndexScan.]
[Figure 5. SwitchPlan in Nested Loops. Axes: Time [secs.] vs. #Tuples; series: SwitchPlan, Indexed, SeqScan.]

The purpose of the second experiment was to investigate the impact of the SwitchPlan operator. Fig. 4 shows the runtime behavior of the iterations in sub-select queries with index-building scans. The plot shows that dynamic index building wins already after a small number of iterations compared to a full table scan (if the query condition benefits from the index). Another issue is the behavior for producing tuples. In the experiment of Fig. 5 a nested loop join on the tables PART and PARTSUPP was performed with an index-based nested loop join (“Indexed”; note that the time for creating the index on the join column partkey was added as an offset), using table scans on both tables (SeqScan), as well as using a SwitchPlan operator combining an IndexBuildScan and an index-based nested loop in the subsequent iterations. As the results show, the SwitchPlan strategy produces tuples earlier but has a later overhead due to the index building. From these experiments we can conclude that queries
with nested iterations profit from on-the-fly index building, especially if we take into account that the indexes created are expected to be beneficial for other queries of the workload. In addition, integrating index building into query processing allows an immediate usage of the indexes. The objective of the second group of experiments was to study the behavior of the soft index manager at the workload level. In this evaluation we used query mixes organized in 10 blocks each of 23 queries. Each mix represents a specific access pattern (certain tables and selections). A sequence of 6 mixes represents the overall workload (60 query blocks). The reason for this was to require an adaptation of the index configuration between the blocks. The whole workload leads to 17 index candidates (8 of them are large indexes on the LINEITEM table). First, we compared the runtime for processing the whole workload with different parameters for the length of the epochs and the size of the index pool. As a reference, the upper bound (no indexes = NONE), and the lower bound (all beneficial indexes recommended by the index advisor and created in advance = ALL) are given.

[Figure 6. Workload processing: epoch length. Axes: Epoch length (NONE, 17, 33, 49, ALL) vs. Time [secs.].]
[Figure 8. Workload processing: static vs. dynamic index creation. Axes: Index creation (NONE, Scan, Create, ALL) vs. Time [secs.].]

The epoch length was given as the number of queries (Fig. 6). For the pool size (Fig. 7) we have chosen values such that all indexes can be materialized (Large = L), such that only the indexes which are optimal for the current query mix can be materialized (all large indexes, i.e. LINEITEM indexes and small indexes: Medium = M), and such that only the small indexes beneficial for the current mix can be materialized (Small = S). Finally, we compared this to the runs where indexes were created explicitly (by CREATE INDEX) as well as by an IndexBuildScan (Fig. 8).
In the second step, we have investigated the adaptation behavior during workload processing. In Fig. 9 the results from processing the workload with epoch length 33, a small index pool, and single-column indexes are shown and compared to the ALL and NONE cases. Note that the times for index creation in the ALL run were not added (approx. 1010 secs in this experiment). First of all, the points of adaptation are clearly visible as peaks. This means that some queries which are processed at the time of a necessary adaptation of the index configuration have to take the burden of index building.
[Figure 7. Workload processing: index pool size. Axes: Index pool size (NONE, L, M, S, ALL) vs. Time [secs.].]
[Figure 9. Adaptation behavior. Axes: Query Block No. vs. Time [secs.]; series: No Indexes, All Indexes, Soft Indexes.]

From the results we can see that both the epoch length and the index pool size have an impact. Thus, the critical issue is to find appropriate values. In case of the pool size this could be done by taking the maintenance costs into account: if index maintenance is too expensive, it is better to drop some of the less beneficial indexes. For determining the epoch length, an approach similar to the alerter proposed in [4] could be helpful. However, the results from Fig. 8 also show that on-the-fly indexing is not always favorable, but only if queries benefit directly from the newly created indexes.
A detailed profiling of the overall soft index management showed that choosing the set of beneficial indexes (i.e. solving the knapsack problem) is rather cheap (< 1%). As expected, most of the effort is spent on materializing the index configuration. However, we found that maintaining virtual indexes (≈ 9%) as well as collecting the required information in the catalog (≈ 28%) also takes some time. Hence, this requires further optimizations in the implementation.
5. Conclusion and Outlook We have presented an approach for integrated and autonomous index tuning realizing the idea of soft indexes. This approach is based on an Observation-Prediction-Reaction feedback loop. Besides the continuous workload monitoring and selection of beneficial index candidates, the novelty of our approach is the on-the-fly index creation and usage. Experiences from our prior work on QUIET and from the PostgreSQL implementation of the approach described here have shown the advantage of a tight integration with query planning and processing. There are some open issues which we plan to address in ongoing work. First, we have in fact introduced some additional tuning knobs, e.g. the length of an epoch, the size of the index pool, and several threshold parameters. Of course, the values of these parameters should be determined automatically, i.e. using appropriate heuristics. A second issue is that the index building costs cannot be eliminated completely. As our experiments have shown, even the “piggy-back” index building approach results in a certain delay in query answering, which is not acceptable if the application requires response time guarantees. One possible solution could be to break down the coarse-grained indexes by building only portions (ranges) of them. In this way, the costs of index creation could be distributed over several queries.
References

[1] S. Agrawal, S. Chaudhuri, L. Kollár, A. P. Marathe, V. R. Narasayya, and M. Syamala. Database Tuning Advisor for Microsoft SQL Server 2005. In Proc. 30th VLDB Conference 2004, pages 1110–1121, 2004.
[2] S. Agrawal, E. Chu, and V. R. Narasayya. Automatic Physical Design Tuning: Workload as a Sequence. In Proc. ACM SIGMOD Conference 2006, pages 683–694, 2006.
[3] N. Bruno and S. Chaudhuri. Automatic Physical Database Tuning: a Relaxation-based Approach. In Proc. ACM SIGMOD Conference 2005, pages 227–238, 2005.
[4] N. Bruno and S. Chaudhuri. To Tune or not to Tune? A Lightweight Physical Design Alerter. In Proc. 32nd VLDB Conference 2006, pages 499–510, 2006.
[5] N. Bruno and S. Chaudhuri. An Online Approach to Physical Design Tuning. In Proc. Int. Conf. on Data Engineering (ICDE 2007), 2007. To appear.
[6] A. Caprara, M. Fischetti, and D. Maio. Exact and Approximate Algorithms for the Index Selection Problem in Physical Database Design. IEEE Transactions on Knowledge and Data Engineering, 7(6):955–967, 1995.
[7] S. Chaudhuri, M. Datar, and V. Narasayya. Index Selection for Databases: A Hardness Study and a Principled Heuristic Solution. IEEE Transactions on Knowledge and Data Engineering, 16(11):1313–1323, 2004.
[8] R. Cole and G. Graefe. Optimization of Dynamic Query Evaluation Plans. In Proc. ACM SIGMOD Conference 1994, pages 150–160, 1994.
[9] D. Comer. The Difficulty of Optimum Index Selection. ACM Transactions on Database Systems, 3(4):440–445, 1978.
[10] G. Graefe. Dynamic Query Evaluation Plans: Some Course Corrections? Bulletin of the Technical Committee on Data Engineering, 23(2):3–6, June 2000.
[11] G. Graefe and K. Ward. Dynamic Query Evaluation Plans. In Proc. ACM SIGMOD Conference 1989, pages 358–366, 1989.
[12] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack Problems. Springer-Verlag, Berlin, Heidelberg, 2004.
[13] J. O. Kephart and D. M. Chess. The Vision of Autonomic Computing. IEEE Computer, 36(1):41–50, 2003.
[14] K. Sattler, I. Geist, and E. Schallehn. QUIET: Continuous Query-driven Index Tuning. In Proc. 29th VLDB Conference 2003, pages 1129–1132, 2003.
[15] K. Sattler, E. Schallehn, and I. Geist. Autonomous Query-driven Index Tuning. In Proc. Int. Database Engineering and Applications Symposium (IDEAS 2004), Coimbra, Portugal, pages 439–448, July 2004.
[16] K. Schnaitter, S. Abiteboul, T. Milo, and N. Polyzotis. COLT: Continuous On-line Tuning. In Proc. ACM SIGMOD Conference 2006, pages 793–795, 2006.
[17] G. Valentin, M. Zuliani, D. Zilio, G. Lohman, and A. Skelley. DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. In Proc. Int. Conference on Data Engineering (ICDE 2000), pages 101–110, 2000.
[18] G. Weikum, C. Hasse, A. Moenkeberg, and P. Zabback. The COMFORT Automatic Tuning Project, Invited Project Review. Information Systems, 19(5):381–432, 1994.
[19] D. C. Zilio, J. Rao, S. Lightstone, G. M. Lohman, A. Storm, C. Garcia-Arellano, and S. Fadden. DB2 Design Advisor: Integrated Automatic Physical Database Design. In Proc. 30th VLDB Conference 2004, pages 1087–1097, 2004.