Implementing and Evaluating Warehouses and Summaries over a Cluster
Pedro Furtado
Dep. Eng. Informática, Univ. Coimbra / CISUC, Portugal
[email protected]
Abstract
Cluster computation power provides a promising way to improve response time in large data warehouses. In addition, the use of sampling summaries on the cluster for approximate answering of OLAP queries yields a very flexible system that can offer response time guarantees. In this paper we explore the cluster computation paradigm for data warehouses and summaries. The use of cluster computation in a network with N computers can speed up query processing about N times, and further speedup can be obtained using samples instead of the full data. Sampling summaries have been proposed before in the context of OLAP queries to avoid query processing times that leave users and applications waiting too long when only exploration analysis over more or less aggregated data is required. But while a typical one-node sampling summary is either too small to answer more detailed queries or too slow to provide almost instant response times, summaries over a cluster are extremely fast and sufficiently large to answer most aggregation query patterns. We explore the implementation and processing of the data warehouse and sampling summaries over a set of nodes for cooperative cluster computing and present experimental results on the subject.
1. Introduction
Business environments running fast-growing data warehouses and analysis on those systems very often face an unpleasant reality: their systems start out with more than sufficient efficiency but, as the data keeps growing at an amazing rate, they end up with intolerable response times. Minor hardware upgrades and specific query tuning efforts often lower the crisis level but, as the data set becomes very large, major hardware changes to fast parallel architectures and reinstallations are required. While these changes might solve some problems, they are very expensive and have an impact on the whole system. A much cheaper, scalable and convenient strategy is possible from the start, by using a set of standard computer nodes in a network to do the work very efficiently in a cluster computation environment. We explore the use of a cluster data warehouse. Such a strategy does not by itself guarantee very fast response times for ad-hoc query patterns revolving around enormous quantities of data. For such queries, sampling summaries [13, 9] have been proposed to answer exploration analysis or intensive computations extremely fast. We show how sampling summaries are implemented in the cluster computation environment to yield extremely fast response times in a more flexible and scalable way. From our presentation, it will be clear that the full data warehouse placed in a "balanced" cluster system with N nodes can achieve speedups close to N. To attain very large speedups, a very large number of nodes would be required, which besides being expensive also increases
data exchange overheads. We now show how this speedup limitation can be overcome using summaries over the cluster. There are two crucial and interrelated variables in sampling summaries: the representation capacity determines the accuracy that the summary can give to estimations or, in the limit, whether the summary is able to estimate the query at all due to a lack of samples; the speedup relative to the base data gives a measure of the expected response time improvement. A summary is obtained by sampling the base data with a given sampling rate SP% (sampling percentage); for simplicity, we are considering uniform sampling. The value 1/(SP%) can be used as a very rough estimate of the speedup that is typically achieved with sampling rate SP%. The typical aggregation query returns computations over groups (group-by attributes). Summaries must be able to estimate each individual result group based on the set of samples that falls within that group. As an illustration, consider two groups in the same query, one with 5,000 elements and the other with 1,000 elements. Which summary size would be needed to estimate each of these groups? Figure 1 shows the samples and speedup expected for alternative summary sizes:

Group with 5,000 elements            Group with 1,000 elements
SP       Nº Samples   Speedup        SP       Nº Samples   Speedup
10%      500          10             10%      100          10
5%       250          20             5%       50           20
2%       100          50             2%       20           50
1%       50           100            1%       10           100
0.50%    25           200            0.50%    5            200
0.20%    10           500            0.20%    2            500
0.10%    5            1000           0.10%    1            1000

Figure 1 – Group Estimation Examples
Some of the combinations in Figure 1 correspond to choices that would not be good ideas: for instance, for the 5,000-element group, a summary with 0.5% of the base data would have only 25 samples (not enough) for the estimation of that group, while a summary with 5% of the base data would only be 20 times faster than the base data. If we consider the smaller 1,000-element group, even a 2% summary has too few samples. The problem exposed by the previous example is the accuracy/speed trade-off: smaller summaries answer faster but cannot achieve the desired accuracy for as many query patterns. For the same representation reason, different aggregation granularities require different minimum summary sizes. For instance, a year granularity might be answered with a 0.1% summary, while the quarter granularity requires a 0.4% summary. The cluster computing environment provides larger sampling rates without hurting performance. We have mentioned that a summary speedup is roughly (at least) 1/(SP%); the speedup of dividing the summary over N nodes is (roughly) N. Figure 2 shows the same 1,000-element estimation problem presented in Figure 1, now including new columns showing the speedup when N nodes cooperate on the summary and the minimum number of nodes needed for a given speedup using the summary or the full data warehouse.
Group with 1,000 elements
SP       Nº Samples   Speedup (1 node)   Speedup (N nodes)   Min N for 100x speedup   Min N for 100x speedup on full DW
30%      300          3.33               3.33N               30                       100
20%      200          5                  5N                  20                       100
10%      100          10                 10N                 10                       100
5%       50           20                 20N                 5                        100
2%       20           50                 50N                 2                        100
1%       10           100                100N                1                        100
0.50%    5            200                200N                1                        100
0.20%    2            500                500N                1                        100
0.10%    1            1000               1000N               1                        100

Figure 2 – Group Estimation Example with N nodes
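The arithmetic behind Figures 1 and 2 is simple enough to script. The following sketch is illustrative only; the group size, sampling rates and target speedup are simply the values used in the figures. It computes the expected number of samples per group, the rough single-node speedup 1/(SP%), the N-node speedup N/(SP%) and the minimum number of nodes needed to reach a target speedup with the summary.

```python
# Sketch of the sample-count / speedup arithmetic behind Figures 1 and 2.
# Values mirror the figures and are illustrative only.
import math

def expected_samples(group_size, sp):
    """Expected number of samples falling in a result group of 'group_size'
    rows when the fact table is sampled at rate sp (a fraction, 0..1)."""
    return group_size * sp

def speedup_single_node(sp):
    """Rough speedup of a summary with sampling rate sp versus the full data."""
    return 1.0 / sp

def speedup_cluster(sp, n_nodes):
    """Rough speedup when the sp summary is spread over n_nodes nodes."""
    return n_nodes / sp

def min_nodes_for_speedup(sp, target):
    """Minimum number of nodes so the cluster summary reaches 'target' speedup."""
    return max(1, math.ceil(target * sp))

if __name__ == "__main__":
    target = 100  # desired speedup, as in Figure 2
    for sp in (0.30, 0.20, 0.10, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001):
        print(f"SP={sp:.2%}: samples(1,000-row group)={expected_samples(1000, sp):.0f}, "
              f"1-node speedup={speedup_single_node(sp):.1f}, "
              f"min nodes for {target}x={min_nodes_for_speedup(sp, target)}")
```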
While the only way to hope for a given speedup (e.g. 100x) with the full data warehouse is to use an equivalent number of nodes, summaries over the cluster provide a much more scalable and flexible way to obtain faster response times and can be coupled with the full data warehouse cluster. Since the distribution and participation of individual nodes in query processing is entirely optional, this cluster computing strategy is efficient and tolerant to node availability. In the rest of this paper we describe the architecture and query processing strategy of this approach and analyze experimental results on the subject. The paper is organized as follows: section 2 introduces the architecture of the cluster data warehouse and section 3 extends it to the summary warehouse. Section 4 discusses the query processing strategy and section 5 studies communication-related overheads. Section 6 analyzes experimental results using the TPC-H decision support benchmark [14], section 7 discusses related work and section 8 contains concluding remarks.
2. The Cluster Data Warehouse System
A distributed data warehouse is an example of a grid DW system in which nodes contribute to query answering. The fact that such a distributed data warehouse is typically divided by functional areas means that query processing can be unbalanced for many queries. A different, totally balanced strategy is to divide the facts of the virtual centralized data warehouse over the nodes randomly, so that the nodes are balanced and each query is expected to place a similar workload on each node (for simplicity we are assuming homogeneous nodes in this discussion). Another important property of the randomly partitioned data is that results can be estimated correctly even without all nodes, because the rows in each node are in fact random samples; even a single node can estimate answers correctly. The basic strategy to build our cluster computing environment from the base data warehouse involves, for each star schema, dividing the fact rows over a set of nodes and duplicating the dimension tables in those nodes. Figure 3 shows the node construction strategy. Queries are computed by submitting a rewritten query to the nodes and collecting the results into a final answer.
[Figure 3 diagram: step 1 moves the DW fact rows to the nodes, so that each node i holds a fact fraction DWi; step 2 copies the dimension (D) tables to every node.]
Figure 3 – Cluster DW Construction
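As a concrete illustration of the construction in Figure 3, the sketch below scatters the fact rows randomly over the node databases and replicates the dimension tables in every node. It is a minimal sketch under assumptions not stated in the paper: `source` and `nodes[i]` are DB-API style connections (with a `%s` parameter style) and the table names are placeholders.

```python
# Minimal sketch of the cluster DW construction of Figure 3.
# Assumes DB-API style connections and illustrative table names.
import random

def build_cluster_dw(source, nodes, fact_table="fact", dim_tables=("dim_a", "dim_b")):
    src = source.cursor()
    node_cursors = [n.cursor() for n in nodes]

    # Step 1: send each fact row to a randomly chosen node, so every node
    # ends up with a random, balanced fraction of the fact table.
    src.execute(f"SELECT * FROM {fact_table}")
    for row in src:
        i = random.randrange(len(nodes))
        placeholders = ",".join(["%s"] * len(row))
        node_cursors[i].execute(f"INSERT INTO {fact_table} VALUES ({placeholders})", row)

    # Step 2: replicate every dimension table in every node.
    for dim in dim_tables:
        src.execute(f"SELECT * FROM {dim}")
        dim_rows = src.fetchall()
        for cur in node_cursors:
            for row in dim_rows:
                placeholders = ",".join(["%s"] * len(row))
                cur.execute(f"INSERT INTO {dim} VALUES ({placeholders})", row)

    for n in nodes:
        n.commit()
```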
A full data warehouse placed in a balanced cluster system with N nodes can achieve speedups close to N for many queries. We have argued before that such a system requires a very large number of nodes to attain very large speedups. However, this speedup limitation can be overcome using summaries over the cluster. We show next the setup of a sampling summary over a cluster.
3. The Sampling Summary Over a Cluster
Figure 4 shows the construction of an ordinary (non-cluster) sampling summary (SS) from the base data warehouse (DW). The star schema structure is first duplicated. Step 1 of the summary loading process samples the base data facts with a sampling rate SP% using a one-pass algorithm, loading the corresponding facts of the sampling summary. Step 2 retrieves the dimension rows that are referenced by the rows of the SS fact tables.
[Figure 4 diagram: step 1 samples SP% of the DW fact rows into the SS fact table; step 2 retrieves the dimension (D) rows corresponding to the sampled facts.]
Figure 4 – Construction of the Summary Warehouse
A sampling summary constructed in this way achieves a large speedup; a rough approximation of that speedup is 1/(SP%). How is a pre-computed sampling summary constructed over a cluster environment? Assuming there are N nodes and an SP% summary is deemed sufficient for the desired accuracy target, step 1 samples the base fact with sampling rate SP% but redirects each sampled row randomly into one of the N nodes. The expected speedup is roughly 1/(SP%/N) = N/(SP%) and the total sampling rate is SP%. Step 2 retrieves the corresponding dimension values into each node individually. If SP% = 100%, the whole data warehouse has been divided into the nodes and exact answers can be returned for every possible query pattern. If, instead of an accuracy target SP%, a response time target is expressed as a maximum sampling rate per node (SPnode%), the summary is obtained by sampling the base fact with N x SPnode% (step 1) and distributing the rows randomly into the nodes. The speedup is roughly 1/(SPnode%) and the total sampling rate is N x SPnode%. These accuracy target and response time target expressions are of course equivalent (SP% = N x SPnode%), but the first emphasizes that the cluster produces a much faster summary than a single-node one (N/(SP%) versus 1/(SP%)), while the second shows that the cluster produces a much more accurate summary than a single-node one (N x SPnode% versus SPnode%). Periodic loading in the cluster environment is handled similarly to the loading process described above, with the incoming stream being sampled at SP%. To ensure the flexibility of the approach, it should be possible to add or remove nodes from the system using any desired sampling rate.
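A minimal sketch of step 1 of the cluster summary loading described above: each base fact row is kept with probability SP% and, if kept, redirected to a randomly chosen node. The connection objects and table names are assumptions, as in the earlier construction sketch; step 2 (retrieving the referenced dimension rows per node) is only indicated.

```python
# Sketch of step 1 of the cluster summary construction (Section 3):
# sample the base fact at rate sp and scatter the sampled rows over the nodes.
# Assumes DB-API style connections, as in the earlier construction sketch.
import random

def load_cluster_summary(source, nodes, sp, fact_table="fact"):
    """sp is the total sampling rate SP% as a fraction (e.g. 0.125 for 12.5%);
    each node ends up with roughly sp/len(nodes) of the base fact rows."""
    src = source.cursor()
    cursors = [n.cursor() for n in nodes]
    src.execute(f"SELECT * FROM {fact_table}")
    for row in src:                              # one-pass scan of the base facts
        if random.random() < sp:                 # keep the row with probability sp
            i = random.randrange(len(nodes))     # and send it to a random node
            placeholders = ",".join(["%s"] * len(row))
            cursors[i].execute(f"INSERT INTO {fact_table} VALUES ({placeholders})", row)
    for n in nodes:
        n.commit()
    # Step 2 (not shown): for each node, copy the dimension rows referenced
    # by its sampled facts, as described in the text.
```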
4. Managing the DW and Summaries Over a Set of Nodes
The system is made of a set of nodes which receive summaries that must be registered. The registry is a simple table maintaining a line per node, with the address of the node and an indication of which summaries are available in that node. Each node that enters the cluster system installs one service waiting for query submission (Query submit) and another service listening for incoming queries (Query Process), which processes the query and returns the partial answer to the submitting node. Figure 5 shows these functionalities. When a query is submitted in one node, that node becomes responsible for controlling the processing of the query, which involves submitting it, collecting the results, computing the answers to the query and presenting those answers to the user. The figure also shows that one node is responsible for query submission, collection and presentation, while all available 1..n nodes are responsible for query processing.
[Figure 5 lists the functionality of each service: Query submit (evaluation/redirect, query rewrite, query distribution to the nodes); Query collect (get alive notices, listen to the partial results arriving from the 1..n available nodes, merge results); Results presentation (compute the answer and CIs from the merged results, publish); and, in each processing node, Query Process (listen to incoming query processing requests, send an alive notice to the requester, run the query against the local data set i, send the results to the submitting node).]
Figure 5 – The Query Submission and Processing Strategy
The next sections detail each of the functionalities shown in Figure 5. The formulas and SQL expressions shown include the sampling percentage value (SP%), which is the sum of possibly varying individual node sampling percentages (SPi%) and can add up to 100%, in which case we have a full data warehouse cluster system.
4.1. Query Submission
The system can be set up as a full node-partitioned data warehouse, a node-partitioned summary, or both. We assume both in the following discussion. By default, a query is submitted to the full cluster data warehouse, but the user may submit the query to the summary cluster warehouse for faster processing. When a query is submitted, the system must parse it and evaluate whether it can be answered by the summary warehouse or must otherwise be redirected to the original DW. Since summaries rely on estimation procedures, the use of the summary is restricted to aggregation queries. This step is similar to the one used in other summary approaches (e.g. [10]). The next objective of query submission is to distribute the query to the nodes for independent processing, which requires a simple query rewriting step. The following discussion is valid both for the summary warehouse and for the full warehouse, in which case the total sampling percentage is SP% = 100%. The system can be viewed as a single virtual data set (summary) made of the union of all component subsets (summaries). The formulas used to estimate values are similar to those of a random sampling summary when applied to the global virtual summary, whose sampling percentage is the sum of the sampling percentages of the component summaries, SP% = SP1% + ... + SPn%. However, the set of samples is not available in the merging node, and the objective is to make each node compute a partial answer independently and the controlling node merge those independent partial answers. This is obtained by computing partial parameters in the individual nodes and using formulas (1) to (6) to arrive at the final estimation:

$COUNT_{estimated} = COUNT_{sample\_set} / SP\% = \sum_{all\ nodes} COUNT_{sample\_set\_node_i} / SP\%$  (1)

$SUM_{estimated} = SUM_{sample\_set} / SP\% = \sum_{all\ nodes} SUM_{sample\_set\_node_i} / SP\%$  (2)

$AVERAGE_{estimated} = AVERAGE_{sample\_set} = \sum_{all\ nodes} SUM_{sample\_set\_node_i} \,/\, \sum_{all\ nodes} COUNT_{sample\_set\_node_i}$  (3)

$STDDEV_{estimated} = STDDEV_{sample\_set} = \sqrt{\dfrac{\sum_{all\ nodes} SUM\_OF\_SQUARES_{sample\_set\_node_i} - \left(\sum_{all\ nodes} SUM_{sample\_set\_node_i}\right)^2 / \sum_{all\ nodes} COUNT_{sample\_set\_node_i}}{\sum_{all\ nodes} COUNT_{sample\_set\_node_i}}}$  (4)

$MAX_{estimated} = MAX_{sample\_set} = MAX(MAX_{sample\_set\_node_i})$  (5)

$MIN_{estimated} = MIN_{sample\_set} = MIN(MIN_{sample\_set\_node_i})$  (6)
This means that the query rewriting step needs to replace each AVERAGE and STDDEV (or variance) expression in the SQL query by a SUM and a COUNT in the first case, and by a SUM, a COUNT and a SUM_OF_SQUARES in the second case, while it can keep the other operators as they are. For instance, the select clause "Select sum(a), count(a), average(a), stddev(a), max(a), min(a)" is rewritten into "Select sum(a), count(a), sum(a x a), max(a), min(a)". For typical aggregation queries, those are the most relevant modifications, as the GROUP BY clause is left as-is and the HAVING clause (if present) can be eliminated from the rewritten query and applied later in the merging phase.
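The select-list rewriting just described can be sketched as a simple transformation. The function below is only an illustration of the rule: it assumes the aggregates appear as plain `average(x)` or `stddev(x)` calls in a comma-separated select list and does not parse full SQL.

```python
# Sketch of the select-list rewriting of Section 4.1: AVERAGE and STDDEV are
# replaced by the partial aggregates (SUM, COUNT, SUM_OF_SQUARES) that the
# merging node needs; SUM, COUNT, MAX and MIN pass through unchanged.
import re

def rewrite_select_list(select_list):
    rewritten, seen = [], set()
    def add(expr):
        if expr.lower() not in seen:            # avoid duplicating sum(a)/count(a)
            seen.add(expr.lower())
            rewritten.append(expr)
    for item in (s.strip() for s in select_list.split(",")):
        m = re.match(r"(?i)(average|avg|stddev)\((.+)\)", item)
        if m is None:
            add(item)                           # sum, count, max, min kept as-is
            continue
        func, arg = m.group(1).lower(), m.group(2)
        add(f"sum({arg})")
        add(f"count({arg})")
        if func == "stddev":
            add(f"sum({arg} * {arg})")          # SUM_OF_SQUARES
    return ", ".join(rewritten)

# Example: reproduces the rewriting shown in the text.
print(rewrite_select_list("sum(a), count(a), average(a), stddev(a), max(a), min(a)"))
# -> sum(a), count(a), sum(a * a), max(a), min(a)
```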
4.2. Query Processing
The query processing service in each node receives incoming requests containing the rewritten query and runs it against the local summary. As soon as it receives the request, it also sends an alive notice to the submitting node to indicate that the node is online. After the processing is done in the local node, the partial results are sent back to the submitting node.
4.3. Query Collection and Results Presentation
The query collector receives partial results and merges them using formulas (1) to (6). Simultaneously, it computes confidence intervals and publishes the results. In practice, this means running a modified version of the initial query against a temporary cached table that contains the partial results returned by the nodes. The next sequence shows exactly what each step does using an example:
1. Query submission: Select sum(a), count(a), average(a), max(a), min(a), stddev(a) From fact, dimensions Group by grouping_attributes;
2. Query rewriting: Select sum(a), count(a), sum(a x a), max(a), min(a) From fact, dimensions Group by grouping_attributes;
3. Results collecting: Create cached table PRqueryX(node, grouping_attributes, suma, counta, ssuma, maxa, mina) as <partial results returned by the nodes>;
4. Results merging: Select sum(suma)/SP%, sum(counta)/SP%, sum(suma)/sum(counta), max(maxa), min(mina), sqrt((sum(ssuma) - sum(suma)*sum(suma)/sum(counta))/sum(counta)) From PRqueryX Group by grouping_attributes;
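The merging step can also be expressed outside SQL. The sketch below applies formulas (1) to (6) to per-node partial rows grouped by the grouping attributes; the record layout mirrors the cached table PRqueryX above and is illustrative only.

```python
# Sketch of the merging step (formulas (1) to (6)): combine per-node partial
# aggregates for each result group. The layout of the partial rows mirrors the
# cached table PRqueryX and is illustrative only.
import math
from collections import defaultdict

def merge_partials(partials, sp):
    """partials: iterable of dicts with keys
       'group', 'suma', 'counta', 'ssuma', 'maxa', 'mina' (one per node per group).
       sp: total sampling percentage SP% as a fraction (1.0 for the full DW)."""
    acc = defaultdict(lambda: {"sum": 0.0, "count": 0, "ssq": 0.0,
                               "max": float("-inf"), "min": float("inf")})
    for p in partials:
        a = acc[p["group"]]
        a["sum"] += p["suma"]
        a["count"] += p["counta"]
        a["ssq"] += p["ssuma"]
        a["max"] = max(a["max"], p["maxa"])
        a["min"] = min(a["min"], p["mina"])

    result = {}
    for group, a in acc.items():
        n = a["count"]
        variance = (a["ssq"] - a["sum"] ** 2 / n) / n     # inner term of formula (4)
        result[group] = {
            "count": n / sp,                              # (1)
            "sum": a["sum"] / sp,                         # (2)
            "avg": a["sum"] / n,                          # (3)
            "stddev": math.sqrt(max(variance, 0.0)),      # (4)
            "max": a["max"],                              # (5)
            "min": a["min"],                              # (6)
        }
    return result
```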
Confidence intervals (CI) are also computed for approximate answers during the results merging phase. Each of the parameters needed to compute CIs is available in the cached table of partial results, so the CI for each item in the select clause is computed directly in the merge query. Confidence interval expressions can be applied using the well-known Central Limit Theorem (CLT), Chebyshev or Hoeffding bounds. CIs for the average, count and sum functions are shown below using the CLT (additional CIs can be obtained for other operators). The number of samples is denoted ns, the standard deviation σ, and zp is the normal distribution coefficient corresponding to the desired confidence level:

CIavg ≈ zp σ / √ns
CIcount ≈ zp √ns / SP%
CIsum ≈ CIavg × ns
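Following the CLT-based approximations above, the confidence intervals can be computed from the same merged parameters. The sketch below simply transcribes those expressions; the value 1.96 is the zp coefficient for a 95% confidence level.

```python
# Sketch of the CLT-based confidence intervals of Section 4.3, computed from
# the merged per-group parameters (sum, count, sum of squares). The expressions
# follow the approximations given in the text.
import math

Z95 = 1.96   # normal coefficient zp for a 95% confidence level

def confidence_intervals(suma, counta, ssuma, sp, zp=Z95):
    ns = counta                                    # number of samples in the group
    variance = max((ssuma - suma ** 2 / ns) / ns, 0.0)
    sigma = math.sqrt(variance)                    # standard deviation of the samples
    ci_avg = zp * sigma / math.sqrt(ns)            # CI_avg   ≈ zp * sigma / sqrt(ns)
    ci_count = zp * math.sqrt(ns) / sp             # CI_count ≈ zp * sqrt(ns) / SP%
    ci_sum = ci_avg * ns                           # CI_sum   ≈ CI_avg * ns
    return {"avg": ci_avg, "count": ci_count, "sum": ci_sum}
```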
Additional confidence intervals must be derived for the estimation of extremes and other non-trivial functions; this is part of our current work. The results are published as soon as possible. If the submitting node is offline (no other nodes are reachable), its local results are published immediately. Otherwise, as soon as the first partial result is available, the system can wait for additional results for a fraction of the time it took to obtain that first result (a configurable parameter) and then merge and post the answer using the available partial results. If additional results arrive later, the system re-merges and shows improved results.
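The publishing policy just described (publish after waiting a configurable fraction of the time taken by the first partial result, then re-publish as late results arrive) can be sketched as follows; the result queue and the merge and publish callbacks are placeholders rather than part of the described system.

```python
# Sketch of the progressive publishing policy of Section 4.3. Partial results
# arrive on a queue; after the first one we wait a configurable fraction of the
# time it took, publish, and re-publish whenever late results arrive.
import time
import queue

def collect_and_publish(results_queue, n_nodes, merge, publish, wait_fraction=0.5):
    partials = []
    start = time.time()
    first = results_queue.get()               # block until the first partial result
    partials.append(first)
    deadline = time.time() + wait_fraction * (time.time() - start)

    while len(partials) < n_nodes and time.time() < deadline:
        try:
            partials.append(results_queue.get(timeout=max(0.0, deadline - time.time())))
        except queue.Empty:
            break
    publish(merge(partials))                  # first (possibly approximate) answer

    # Late results: re-merge and show improved answers as they arrive.
    while len(partials) < n_nodes:
        try:
            partials.append(results_queue.get(timeout=30))   # illustrative timeout
        except queue.Empty:
            break
        publish(merge(partials))
```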
5. Additional Communication Overheads
The previous discussion omitted the overheads incurred by node data exchange and result merging during query processing, because these are typically small. In this section we show why. The query must be sent to all nodes in the cluster and the results from each node must be collected at a merging node. The overhead of sending the query to all nodes can be disregarded because it involves a small amount of data. Considering a result set of r rows with an average size of B bytes and a network bandwidth of nb Mbps, the overhead of all N nodes sending their results to the merging node is not much larger than (N x r x B) / (nb/8 x 1MB) seconds. This result is obtained by considering that the receiving node gets a sequential, non-stop stream of data from the sending nodes. For instance, if the result set has 20K rows (quite large) with an average size of 100B each and the network runs at 100 Mbps, a 10-node cluster will not take much longer than 10 x 20K x 100B / (100/8 x 1MB) ≈ 2 seconds to exchange query results. The temporary storing and merging overhead (merging results from all nodes) can be much larger if the number of nodes is very large and the number of rows in the result set is also large, as the merging node must provide temporary storage and apply a merging query to a possibly large number of rows (N x r). This overhead depends largely on the query being posed.
6. Experimental Evaluation
We conducted an experimental evaluation to assess the efficiency and accuracy of the cluster cooperative processing of queries described in this paper. Our experimental results are based on a 50GB TPC-H data set on a set of standard nodes (Pentium IV 2GHz, 256MB RAM, 120GB IDE hard disk with UDMA enabled), each running an Oracle 9i database engine. The relevant tables and indexes were analyzed for cost-based optimization before the experiments were run. Our first experiment involved simulating different cluster configurations (1, 5, 10, 25 nodes) for queries Q1, Q3, Q4, Q5 and Q6 of TPC-H to measure the speedup. Figure 6 shows the results. The speedup of these queries was about N times for N nodes, as expected. We are engaged in more extensive testing against TPC-H, with preliminary results showing that for some queries the speedup is larger and for others lower than N, but this subject deserves additional experimentation, which is part of our current and future work.
Figure 6 – Response Time of the cluster DW for different Number of Nodes
The next experiments concern the use of the cluster summary warehouse. For these experiments we used query Qa, which is a slightly modified version of a typical TPC-H query, in order to test varied granularities (degrees of detail). Qa computes aggregate quantities per brand, discount and a defined time period (e.g. year, quarter, month or week).
SELECT p_brand, to_char(l_shipdate,'yyyy-mm') year_month, discount, avg(l_quantity), sum(l_quantity), count(*), max(l_quantity)
FROM lineitem, part
WHERE l_partkey=p_partkey
GROUP BY to_char(l_shipdate,'yyyy-mm'), p_brand, discount
ORDER BY 1,2
Our first experimental result compares the response time of the query run against the 50GB data set with that of the same query run over a set of nodes, each containing a fraction of the original data set. The experiments were held in a 5-node cluster, with SPnode% set to 1%, 2.5%, 5% or 10% of the data set, which means that the queries are submitted against summaries containing respectively 5%, 12.5%, 25% and 50% of the data warehouse. Figure 7 shows the response times against the percentage of the data set per node.
Figure 7 – Response Time Results VS SPnode% for Qa over 50GB TPC-H Data Set
The query posed against the base data took 104 minutes. The cluster was able to improve the response time significantly, although the speedup for query Qa was not as large as the theoretical value of 1/(SPnode%). For the same query, with the experiment run on a 100Mbps network, the node data exchange overhead (sending and collecting data from all nodes) was only about 1 second and the result merging overhead was between 1 and 2 seconds, for 1995 and 7700 groups in the result set respectively (the number of groups exchanged in query Qa depends on the aggregation granularity). Although these results were obtained for a 5-node cluster, we would expect only slightly worse response times with a few more nodes (e.g. 10 nodes), as the data exchange and results merging overheads would not increase significantly. In the previous experiment, the only way to have the full data warehouse in the cluster and still obtain a speedup similar to the one obtained with, for instance, SPnode% = 2.5% would be to have at least 40 nodes. This is why summaries are important in this context. The next experiments analyze the major variables involved in the accuracy of those summaries. The following experiment tests the accuracy resulting from the use of summaries over the data warehouse cluster, measured as the 95% confidence interval relative to the returned value (e.g. a confidence interval of 10% means that the probability that
the value is within 10% of the estimation is 95%). The same data sets as in the previous experiments are used. Figure 8 compares accuracy results over a 12.5% summary (2.5% x 5 nodes) for brand sales averages over weekly, monthly, quarterly and yearly aggregations, showing that very aggregated query patterns are almost always answered very accurately.
Figure 8 – Accuracy Comparison – Aggregation Granularity
We can see from the figure that the yearly aggregation had maximum accuracy for all groups. As we detail the query further (quarter, then month, then week) the accuracy degrades and some less well represented groups have a larger relative confidence interval. This means that the size of the summary (or summaries) must depend on the degree of detail that we expect aggregations to have. In the next experiment we consider a fairly detailed aggregation pattern (weekly) to test different summary sizes. Figure 9 shows the distribution of the relative confidence intervals of the weekly average sales aggregation result groups that were returned (about 100K groups).
Figure 9 – Accuracy for week aggregation on Qa
First of all, these accuracy results validate the idea that fairly large summaries are important to answer accurately less aggregated analysis such as these weekly aggregations over brands (patterns such as daily brand sales or monthly product sales would be even more detailed). The larger the summary, the better the accuracy guarantees for the estimations. If we combine these results with the response time results, they also show that the summary over a cluster is very important to guarantee that accurate summaries are also fast. Finally, Figure 10 emphasizes the fact that different aggregation functions have different accuracy results. It compares the AVERAGE and COUNT aggregation functions on query Qa applied over the 5% and 12.5% summaries, showing that a COUNT function requires larger summary sizes for the same degree of accuracy.
Figure 10 – Accuracy comparison AVG to Count over weekly aggregation
The discussion and response time experiments have shown that cluster computation provides a large speedup to the data warehouse, especially if summaries are included. Accuracy results have shown that the accuracy of summaries is typically very good but depends on the aggregation pattern and function.
7. Related Work
Distributed query processing and optimization over distributed data using strategies such as horizontal partitioning is discussed in [8], while strategies for distributed processing of OLAP queries are proposed in [3]. These works include relevant performance analysis on the subject but focus either on generic database distribution or on the "functional areas" distributed data warehouse, as opposed to the randomly partitioned cluster data warehouse studied in this paper, whose main advantage is to achieve an approximately equal workload distribution for every query. The most important new aspects discussed and evaluated in this paper are a peer architecture and the most relevant steps of the query processing strategy for cluster computing of the data and summary warehouse, for flexible and scalable speedup. Response time requirements imposed by data analysis in large data warehouses have been a major driver for recent work on approximate query answering strategies using sampling, which include [13, 12, 9]. [13] proposed a framework for approximate answers to aggregation queries called online aggregation, in which the base data is scanned in random order at query time and the approximate answer is continuously updated as the scan proceeds. A graphical display depicts the answer and a confidence interval as the scan proceeds, so that the user may stop the process at any time. The Approximate Query Answering (AQUA) system [9, 10, 11] provides approximate answers using small, pre-computed synopses of the underlying base data. Although online aggregation is able to refine the query estimation up to exact results, pre-computed synopses are typically much faster. Sampling summaries are also more flexible and ad-hoc than materialized views, although the two can coexist. Other model-based approximate summaries have also been proposed (e.g. wavelets [15]), but while sampling summaries use exactly the same storage and processing structures of the RDBMS, requiring only minor query rewriting, such models require non-trivial mechanisms that cannot take advantage of the RDBMS query processing optimizations. Most work on improving the accuracy of sampling summaries is based on query workload-related biasing strategies [2, 1, 4, 5, 6]. These strategies improve but do not solve the accuracy issue, as they do not cover patterns outside the predefined set or finer aggregation detail. They can
be applied in conjunction with the strategies described in this paper, as there is always the simultaneous quest for faster answers and more representation power in summaries. The accuracy limitations of summaries have been a main driver of our alternative thinking about the problem: regardless of which bias strategies are used, what if the summary is randomly divided into nodes, allowing a larger summary, with more representation power, to be as fast as a smaller one with less representation power? The related issue of determining whether a summary is capable of answering a specific query, or which summaries should be built, is part of our current and future work. From an acceptable maximum confidence interval, it is possible to determine if a summary is appropriate to answer the query a posteriori, by analyzing the CIs, or a priori, using historical statistical selectivity data [7].
8. Conclusions
In this paper we have studied the implementation of warehouses and summaries over a cluster and evaluated the most relevant parameters. We have proposed a peer architecture for the clustered data and summary warehouse for flexible and scalable speedup. We described the most relevant steps of the query processing strategy for cluster computing and discussed the expected speedup and overheads. We included experimental results testing response time and studying the major factors influencing the accuracy of the summary warehouse over a cluster.
9. References
[1] S. Acharya, P.B. Gibbons, and V. Poosala. "Congressional Samples for Approximate Answering of Group-By Queries", ACM SIGMOD Int. Conference on Management of Data, pp. 487-498, June 2000.
[2] S. Acharya, P.B. Gibbons, V. Poosala, and S. Ramaswamy. "Join Synopses for Approximate Query Answering", ACM SIGMOD Int. Conference on Management of Data, pp. 275-286, June 1999.
[3] M. Akinde, M. Böhlen, T. Johnson, L.V.S. Lakshmanan, and D. Srivastava. "Efficient OLAP Query Processing in Distributed Data Warehouses", Int. Conference on Extending Database Technology (EDBT'02), Czech Republic, March 2002.
[4] S. Chaudhuri, G. Das, and V. Narasayya. "A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries", ACM SIGMOD Int. Conference on Management of Data, 2001.
[5] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. "Overcoming Limitations of Sampling for Aggregation Queries", Int. Conference on Data Engineering (ICDE), 2001.
[6] P. Furtado and J.P. Costa. "Time-Interval Sampling for Improved Estimations in Data Warehouses", Int. Conference on Data Warehousing and Knowledge Discovery (DaWaK), pp. 327-338, 2002.
[7] P. Furtado and J.P. Costa. "The BofS Solution to Limitations of Approximate Summaries", Int. Conference on Database Systems for Advanced Applications (DASFAA), 2003.
[8] D. Kossmann. "The State of the Art in Distributed Query Processing", ACM Computing Surveys, 32(4):422-469, 2000.
[9] P.B. Gibbons, Y. Matias, and V. Poosala. "Aqua Project White Paper", Technical Report, Bell Laboratories, Murray Hill, New Jersey, December 1997.
[10] P.B. Gibbons and Y. Matias. "New Sampling-Based Summary Statistics for Improving Approximate Query Answers", ACM SIGMOD Int. Conference on Management of Data, pp. 331-342, June 1998.
[11] P.B. Gibbons and Y. Matias. "AQUA: System and Techniques for Approximate Query Answering", Bell Labs Technical Report, 1998.
[12] P.J. Haas. "Large-Sample and Deterministic Confidence Intervals for Online Aggregation", 9th Int. Conference on Scientific and Statistical Database Management, August 1997.
[13] J.M. Hellerstein, P.J. Haas, and H.J. Wang. "Online Aggregation", ACM SIGMOD Int. Conference on Management of Data, pp. 171-182, May 1997.
[14] TPC Benchmark H, Transaction Processing Council, June 1999. Available at http://www.tpc.org/
[15] J.S. Vitter and M. Wang. "Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets", ACM SIGMOD Int. Conference on Management of Data, pp. 193-204, June 1999.