How to Summarize the Universe: Dynamic Maintenance of Quantiles
Anna C. Gilbert
Yannis Kotidis
S. Muthukrishnan
Martin J. Strauss
AT&T Labs Research, Florham Park, NJ 07032, USA
{agilbert, kotidis, muthu, [email protected]
Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

Abstract

Order statistics, i.e., quantiles, are frequently used in databases both at the database server and at the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building user interfaces, and in characterizing the data distribution of evolving datasets in the process of data mining. We present a new algorithm for dynamically computing quantiles of a relation subject to insert as well as delete operations. The algorithm monitors the operations and maintains a simple, small-space representation (based on random subset sums, or RSSs) of the underlying data distribution. Using these RSSs, we can quickly estimate, without having to access the data, all the quantiles, each guaranteed to be accurate to within user-specified precision. Previously known one-pass quantile estimation algorithms that provide similar quality and performance guarantees cannot handle deletions. Other algorithms that can handle delete operations cannot guarantee performance without rescanning the entire database. We present the algorithm, its theoretical performance analysis, and extensive experimental results with synthetic and real datasets. Independent of the rates of insertions and deletions, our algorithm is remarkably precise at estimating quantiles in small space, as our experiments demonstrate.
1 Introduction
Most database management systems (DBMSs) maintain order statistics, i.e., quantiles, on the contents of their database relations. Medians (half-way points) and quartiles (quarter-way points) are elementary order statistics. In the general case, the φ-quantiles of an ordered sequence of N data items are the values with rank kφN, for k = 1, 2, ..., 1/φ.

Quantiles find multiple uses in databases. Simple statistics such as the mean and variance are both insufficiently descriptive and highly sensitive to data anomalies in real-world data distributions. Quantiles can summarize massive database relations more robustly. Many commercial DBMSs use equi-depth histograms [21, 23], which are in fact quantiles, during query optimization in order to estimate the size of intermediate results and pick competitive query execution plans. Quantiles can also be used for determining association rules in data mining applications [1, 3, 2]. The quantile distribution helps design well-suited user interfaces for visualizing query result sizes. Also, quantiles provide a quick similarity check for coarsely comparing relations, which is useful in data cleaning [16]. Finally, they are used as splitters in parallel database systems that employ value-range data partitioning [22] or for fine-tuning external sorting algorithms [9].

Computing quantiles on demand in many of the above applications is prohibitively expensive, as it involves scanning large relations. Therefore, quantiles are precomputed within DBMSs. The central challenge then is to maintain them, since database relations evolve via transactions. Updates, inserts, and deletes change the data distribution of the values stored in relations. As a result, quantiles have to be updated to faithfully reflect the changes in the underlying data distribution.
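To make the φ-quantile definition above concrete, the following sketch (in Python; the function name and the toy dataset are illustrative, not from the paper) computes exact quantiles by fully sorting a small in-memory multiset. This is precisely the kind of full recomputation that becomes prohibitively expensive on a large, evolving relation.

# Illustrative only: exact phi-quantiles of a small in-memory multiset,
# computed by sorting. For k = 1, ..., 1/phi, the k-th quantile is the
# value whose rank in the sorted order is k * phi * N.
def exact_quantiles(values, phi):
    data = sorted(values)
    n = len(data)
    num_quantiles = round(1 / phi)
    # rank k*phi*n (rounded down), converted to a 0-based index and clamped
    return [data[min(n - 1, max(0, int(k * phi * n) - 1))]
            for k in range(1, num_quantiles + 1)]

# Example: the quartiles of the values 1..12 with phi = 0.25
print(exact_quantiles(range(1, 13), 0.25))   # [3, 6, 9, 12]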
Commercial database systems often hide this problem. Database administrators may periodically (say, every night) force the system to recompute the quantiles accurately. This has two well-known problems. Between recomputations, there are no guarantees on the accuracy of the quantiles: significant updates to the data may leave the quantiles arbitrarily bad, resulting in unwise query plans during query optimization. Also, recomputing the quantiles by scanning the entire relation, even periodically, is both computationally and I/O intensive. In applications such as those described above, it often suffices to provide reasonable approximations to the quantiles; there is no need to obtain precise values. In fact, it suffices to get quantiles to within a few percentage points of the actual values.

We present a new algorithm for dynamically computing quantiles of a relation subject to both insert and delete operations.1 The algorithm monitors the operations and maintains a simple, small-space representation (based on random subset sums, or RSSs) of the underlying data distribution. Using these RSSs, we can estimate, without having to access the data, all the quantiles on demand, each guaranteed a priori to be accurate to within user-specified precision. The algorithm is highly efficient, using space and time significantly sublinear in the size of the relation.

Despite the commercial use of quantiles, their popularity in the database literature, and their obvious fundamental importance in DBMSs, no comparable solutions were previously known for maintaining approximate quantiles efficiently with similar a priori guarantees. Previously known one-pass quantile estimation algorithms that provide similar a priori quality and performance guarantees cannot handle delete operations; they are useful for refreshing statistics on an append-only relation but are unsuitable in the presence of general transactions. Other algorithms that can handle modify or delete operations rely on a small "backing sample" or "distinct sample" of the database and cannot guarantee similar performance without rescanning the relation.

We perform an extensive experimental study of maintaining quantiles in the presence of general transactions. We use synthetic data sets and transactions to study the performance of our algorithm (as well as a prior algorithm we extended to our dynamic setting) with varying mixes of inserts and deletes. We also use a real, massive data set from an AT&T warehouse of active telecommunication transactions. Our experiments show that our algorithm has a small footprint in space, is fast, and performs with remarkable accuracy in all our experiments, even in the presence of rapid inserts and deletes that change the underlying data distribution substantially.

1 Update operations of the form "change an attribute value of a specified record from its current value x to a new value y" can be thought of as a delete followed by an insert, for the purposes of our discussion here. Hence, we do not explicitly consider them hereafter.
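To fix the operational model before the formal definitions, here is a minimal sketch of the interface such a dynamic summary exposes. The class name and its exact-counting internals are illustrative placeholders only: they stand in for the small-space RSS representation developed in Section 3, and update follows the delete-then-insert decomposition of the footnote above.

from collections import Counter

# Sketch of the operational model (names are illustrative; this toy version
# keeps exact counts rather than the small-space RSS representation of
# Section 3).
class DynamicQuantileSummary:
    def __init__(self):
        self.counts = Counter()   # counts[i] = number of tuples with value i
        self.n = 0                # total number of tuples

    def insert(self, i):
        self.counts[i] += 1
        self.n += 1

    def delete(self, i):
        self.counts[i] -= 1
        self.n -= 1

    def update(self, x, y):
        # Per the footnote: an update from x to y is a delete followed by an insert.
        self.delete(x)
        self.insert(y)

    def quantile(self, q):
        # Return the smallest value whose cumulative count reaches q * n,
        # for q in (0, 1], e.g. q = 0.5 for the median.
        target = q * self.n
        running = 0
        for value in sorted(self.counts):
            running += self.counts[value]
            if running >= target:
                return value

The point of the paper is that the exact per-value counts and the scan over the universe in this toy version can be replaced by a sublinear-space sketch that still answers every quantile query with an a priori error guarantee.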
In the rest of this section, we state our problem formally, discuss prior work, and describe our results in more detail, before presenting the specifics. In Section 2, we describe the challenges in dynamically maintaining quantiles and present non-trivial adaptations of prior work. In Section 3, we present our algorithm in detail. In Section 4, we present experimental results. Finally, Section 5 has concluding remarks.
1.1 Problem Definition
We consider a relational database and focus on some numerical attribute. The domain of the attribute is U = {0, ..., |U| - 1}, also called the Universe. In general, the domain may be a different discrete set, or it may be real-valued and have to be appropriately discretized. Our results apply in either setting, but we omit those discussions. At any time, the database relation is a multiset of items drawn from the universe. We can alternately think of this as an array A[0 .. |U| - 1], where A[i] represents the number of tuples in the relation with value i in that attribute.

Transactions consist of inserts and deletes.2 Insert(i) adds a tuple of value i, i.e., A[i] ← A[i] + 1, and delete(i) removes an existing tuple with value i, i.e., A[i] ← A[i] - 1. Let A_t be the array after t transactions and let N_t = Σ_i A_t[i]; we will drop the subscript t whenever it is unambiguous from context.

Our goal is to estimate quantiles on demand. In other words, we need to find the tuples with ranks kφN, for k = 1, ..., 1/φ. We will focus on computing approximate quantiles. That is, we need to find a j_k such that

(kφ - ε)N ≤ Σ_{i ≤ j_k} A[i] ≤ (kφ + ε)N.
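As a concrete reading of this condition, the following sketch (illustrative; it materializes the array A explicitly, which the algorithm itself avoids) checks whether a candidate j_k meets the ε-approximation requirement for the k-th φ-quantile.

# Illustrative check of the approximation guarantee: j_k is acceptable for
# the k-th phi-quantile if its cumulative count lies within eps*N of k*phi*N.
# A[i] is the number of tuples with value i, held explicitly only for
# illustration; the algorithm in the paper never materializes A.
def is_approx_quantile(A, j_k, k, phi, eps):
    N = sum(A)
    cumulative = sum(A[: j_k + 1])        # number of tuples with value <= j_k
    return (k * phi - eps) * N <= cumulative <= (k * phi + eps) * N

# Example over the universe U = {0, ..., 7}: with phi = 0.5 and eps = 0.1,
# the median candidate j_1 = 3 is acceptable for this A (cumulative 10 of 20).
A = [2, 1, 3, 4, 0, 5, 2, 3]
print(is_approx_quantile(A, 3, 1, 0.5, 0.1))   # True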