Adaptive Method for Range Top-k Queries in OLAP Data Cubes

Zheng Xuan Loh, Tok Wang Ling, Chuan Heng Ang, and Sin Yeung Lee
School of Computing, National University of Singapore
{lohzheng, lingtw, angch, jlee}@comp.nus.edu.sg

Abstract. In decision-support systems, the top k values are more informative than the max/min value alone. Unfortunately, existing methods for range-max queries cannot answer range top-k queries efficiently if applied directly. In this paper, we propose an efficient approach to range top-k processing, termed the Adaptive Pre-computed Partition Top (APPT) method. The APPT method pre-computes a set of maximum values for each partitioned sub-block. The number of stored maximum values can be adjusted dynamically at run-time to adapt to the distribution of the queries and the data. We show experimentally that our dynamic adaptation method improves query cost compared to other alternative methods.

1 Introduction

With the rapid growth of enterprise data, On-Line Analytical Processing (OLAP) [1], which uses a multi-dimensional view of aggregate data, has been increasingly recognized as a means of providing quick access to strategic information for decision making. A popular data model for OLAP applications is the multi-dimensional database (MDDB), also known as the data cube. A data cube [2, 3] is constructed from a subset of attributes in the database. Certain attributes are chosen as metrics of interest and are referred to as the measure attributes. The remaining attributes are referred to as dimensions or functional attributes. For instance, consider a database maintained by a supermarket. One may construct a data cube with SALES as a measure attribute, and DATE OF SALE and PRODUCT NAME as dimensions; such a data cube provides aggregated total sales figures for all combinations of product and date. Range queries [4] apply a given aggregation operation over selected cells, where the selection is specified by constraining a contiguous range of interest in the domains of some functional attributes. The most direct approach to answering a range query is the naïve method. However, the naïve method is very expensive, as all the cells involved need to be scanned and compared before the final answer can be found. In order to speed up range query processing in OLAP data cubes, comprehensive research has been done on the aggregation functions sum [4, 5, 6, 7, 8] and max/min [4, 9, 10, 11]. Range-sum and range-max/min queries are very important in decision support systems. However, another aggregation function, top k, has not received the attention it deserves due to

R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 648–657, 2002. © Springer-Verlag Berlin Heidelberg 2002

its increased complexity. The aggregation function top k, which extends the aggregation function max, is able to provide more precise information for decision-makers, where the value of k is specified by the user. A range top-k query finds the top k values in a given range. An example of a range top-k query on the above supermarket data cube is to find the top 10 best-selling products from 1st January 2002 to 7th January 2002. Queries of this form are commonly asked by a supermarket manager to gain an in-depth view of the overall sales performance before a precise decision can be taken. Such a query cannot be answered directly by the aggregation function max, but only by top-k aggregation. In the literature, in order to handle range-max queries, which are the special case of range top-k queries with k = 1, a balanced hierarchical tree structure was proposed in [4] for storing pre-computed maximum values. However, if the cardinalities of the domains differ considerably in size, the height of the tree is determined by the largest domain. The concept of a maximal cover was proposed in [9] to represent data distribution information with respect to range-max processing; consequently, the performance of this approach is highly dependent on the data distribution. In the block-based relative prefix max method of [11], the auxiliary data structures keep only the maximum value of each region. Using these structures, the second maximum value cannot be deduced, so frequent access to the original data cube may be necessary. In another study, the hierarchical compact cube [10] uses a multi-level hierarchical structure which stores the maximum value of all the children sub-cubes together with their locations. However, pre-storing only the maximum value of each disjoint region is insufficient for the aggregation function top k (as explained in Section 2). This results in frequent access to the children sub-cubes and also to the original data cube. Optimizations of top-k queries in other domains were studied in [12, 13].
The probabilistic approach introduced in [12] optimizes top-k queries by finding an optimal cutoff parameter to prune away the unwanted portions of the answer set in a relational database. On the other hand, [13] aims to find the top k tuples in a relational database which are nearest to a queried point, by mapping a suitable range that encapsulates the k best matches for the query. Thus, the focus of both approaches is to find a suitable range in order to reduce the search space. In contrast, our focus in this paper is to explore the special property of the aggregation function top k that distinguishes it from max, and then to find an efficient query method for range top-k queries. For handling range top-k queries in OLAP data cubes, [14] proposed a physical storage structure for a partitioned data cube in which dense and sparse partitions are stored in different data structures. The indices to the dense and sparse structures are stored in an auxiliary data structure together with the maximum value. Based on the maximum values, the query algorithm finds the top k values in the respective partitions. That paper concentrates on the physical storage of a data cube, which is different from our studies. In an interactive exploration of a data cube, it is imperative to have a system with low query cost. Therefore, our focus in this paper is to develop an algorithm that reduces the query cost of range top-k queries.

2 The APPT Method

Despite all the work on range-max queries, existing methods cannot be applied directly to answer range top-k queries. In particular, most of the existing methods exploit the idea that, given two disjoint regions A and B, if it is known that max(A) > max(B), region B can be ignored directly. However, for the case of top k where k > 1, even if it is known that max(A) > max(B), the effort of scanning the original data cube cannot be waived for region A, as the second maximum value may reside in region A. In this connection, pre-storing one value for each region may not be sufficient. The direct extension of these methods, therefore, is to store all the top k values for each region. This, however, poses the following problems:

1. the storage incurred by pre-storing all the top k values is very high;
2. the value of k is unknown prior to the queries, and it may vary from query to query.

Nevertheless, the problems introduced by pre-storing one or k values can be minimized by keeping more than one value for each region (not necessarily k values). For instance, for answering a top-3 query, if it is known that a1, a2 and b1, b2 are the top 2 values of regions A and B, respectively, and the descending order of these values is a1, b1, a2, b2, we may take the top 3 values that are known, a1, b1 and a2, as the result without scanning regions A and B. Thus, pruning is possible and the chance of scanning the data cube can be reduced. Making use of this essential idea, we propose a data structure that stores at most r (r > 1) pre-computed maximum values for each partitioned region of a data cube. This structure is termed the Location Pre-computed Cube (LPC), as the location of each pre-computed maximum value is stored as well. The location is needed to judge whether the maximum value is in the query range. In addition, the smallest pre-computed maximum value in each entry of the LPC is an upper bound on the unprocessed values in the region.
This also helps in pruning the search: when the smallest pre-computed maximum value is smaller than all of the k maximum values found, the region need not be examined further. The range top-k problem also differs from the range-max problem in that its performance is influenced by the distribution of the maximum values over the data cube. For example, if the large values are mostly concentrated in a few regions, then pre-computing only r values for each region is insufficient unless r ≥ k. A simple solution would be to increase r to a large value. However, this forfeits storage efficiency, as the increment of r applies to all the regions in the data cube. In contrast, this is not a restriction for range-max queries, since there k = 1. Thus, to handle non-uniformly distributed maximum values efficiently and cost-effectively, the number of pre-stored values should be adjustable dynamically to adapt to the distribution of the queries and the data. To serve this purpose, a dynamic structure termed the Overflow Array (OA) is introduced to complement the LPC by adaptively storing additional maximum values for regions at run-time. Since the OA is built from actual queries, it is more accurate than any analytical model that tries to predict the query pattern of the users.
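The pruning argument above can be made concrete with a small sketch (in Python; the helper name is ours, not from the paper). Each region contributes a descending list of its r pre-computed maxima, and the last pre-computed value of a region is an upper bound on every value of that region that was not pre-computed:

```python
def top_k_without_scan(regions, k):
    """Return the top-k answer if it can be confirmed from the
    pre-computed lists alone, else None (a raw scan is then needed).

    `regions` is a list of non-empty descending top-r value lists,
    one per sub-block; r[-1] bounds the unseen cells of region r.
    """
    known = sorted((v for r in regions for v in r), reverse=True)
    if len(known) < k:
        return None  # not enough known values to form an answer
    kth = known[k - 1]
    # Safe to answer only if no region could hide a value beating
    # the current k-th candidate.
    if all(kth >= r[-1] for r in regions):
        return known[:k]
    return None
```

For the example in the text, regions A = [a1, a2] and B = [b1, b2] with descending order a1, b1, a2, b2 yield the top-3 answer [a1, b1, a2] with no scan, since a2 is both the third candidate and A's own bound.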

2.1 The Data Structures

Definition 2.1 A data cube DC of d dimensions is a d-dimensional array. For each dimension i, the index ranges from 0 to ni − 1 inclusive, where ni is the size of the ith dimension. Each entry in the data cube is called a cell and is denoted DC[x1, . . . , xd], where 0 ≤ xi < ni.

Fig. 1. A 2-dimensional data cube

Example 2.1 Figure 1 shows a 2-dimensional data cube. The size of the first dimension (n1) is 7 and the second dimension (n2) is of size 9.

Definition 2.2 With respect to a data cube DC of d dimensions, a range query can be specified as {[l1, . . . , ld], [h1, . . . , hd]} such that for each dimension i, 0 ≤ li ≤ hi < ni, where ni is the size of the ith dimension of DC. A range top-k query finds the top k values in the query range.

Example 2.2 The shaded area in Figure 1 represents the range query {[2, 3], [5, 7]}.

Definition 2.3 A data cube of d dimensions, where the size of each dimension is ni (1 ≤ i ≤ d), can be partitioned with d partition factors b1, . . . , bd into ⌈n1/b1⌉ × · · · × ⌈nd/bd⌉ disjoint sub-regions known as sub-blocks.

Example 2.3 Using partition factor 4 for n1 and partition factor 3 for n2, the data cube in Figure 1 is partitioned into ⌈7/4⌉ × ⌈9/3⌉ = 2 × 3 = 6 sub-blocks.

Definition 2.4 Given a data cube DC of d dimensions, d partition factors b1, . . . , bd, and the number of pre-computed values for each sub-block, r, a location pre-computed cube LPC of DC is a cube such that
1. it has the same dimension d,
2. if the size of the ith dimension of DC is ni, i.e., its index ranges from 0 to ni − 1, then dimension i of the LPC ranges from 0 to ⌈ni/bi⌉ − 1, and
3. each entry LPC[x1, . . . , xd] corresponds to a partitioned sub-block of DC and stores two items:
(a) the top r maximum values with their corresponding locations
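The ceiling arithmetic of Definitions 2.3 and 2.4 can be sketched as follows (Python; the function names are ours):

```python
from math import ceil, prod

def num_sub_blocks(sizes, factors):
    """Number of disjoint sub-blocks of a partitioned data cube:
    the product over all dimensions of ceil(n_i / b_i)."""
    return prod(ceil(n / b) for n, b in zip(sizes, factors))

def lpc_entry_of(cell, factors):
    """Index of the LPC entry whose sub-block covers a DC cell:
    integer division of each coordinate by its partition factor."""
    return tuple(x // b for x, b in zip(cell, factors))
```

With the dimensions of Figure 1, num_sub_blocks([7, 9], [4, 3]) gives the 6 sub-blocks of Example 2.3, and cell DC[5, 7] falls in the sub-block of LPC[1, 2].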


(b) an index, denoted LPC[x1, . . . , xd].Overflow, that links the entry LPC[x1, . . . , xd] to a record in the OA. If there is no record in the OA for LPC[x1, . . . , xd], it is set to NULL.

Definition 2.5 An overflow array, OA, is a set of linked lists. Each element of the OA is termed an overflow record, which has the following structure:
1. f maximum values and the locations of these maximum values;
2. an index specifying the next OA record if more values are needed, and NULL otherwise.

Initially, the OA contains no records and all the LPC[x1, . . . , xd].Overflow fields are NULL. When processing queries, if access to the original data cube is needed for a sub-block, additional maximum values are added into the OA for this sub-block. In other words, an overflow record keeps the maximum values which are not pre-stored in the LPC or in other overflow records of the sub-block. The construction of the OA is discussed in Section 2.2.
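Under the definitions above, the two auxiliary structures can be sketched as plain records (a Python rendering of ours; the field names follow the text, but the concrete layout is an assumption):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A pre-computed maximum value together with its cell location in DC.
Entry = Tuple[float, Tuple[int, ...]]

@dataclass
class OverflowRecord:
    values: List[Entry]             # up to f maximum values + locations
    next: Optional[int] = None      # index of the next OA record, or NULL

@dataclass
class LPCEntry:
    top: List[Entry]                # top r maximum values, descending
    overflow: Optional[int] = None  # index into the OA, or NULL

# Initially the OA is empty and every LPCEntry.overflow is NULL (None);
# records are appended only when a query forces a scan of a sub-block.
overflow_array: List[OverflowRecord] = []
```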

Fig. 2. Auxiliary Data Structures of APPT method

Example 2.4 Figure 2 presents the LPC and the OA of the data cube DC shown in Figure 1. The number of values kept for each sub-block (r) and for each overflow record (f) is set to 3. Note that LPC[1, 2] holds the top 3 maximum values, with their corresponding locations, among the cells DC[i, j] where 4 ≤ i ≤ 6 and 6 ≤ j ≤ 8. The value of LPC[1, 2].Overflow is 1, which shows that OA[1] is the first overflow record for LPC[1, 2]. The last field of OA[1] indicates that the next overflow record for LPC[1, 2] is OA[2]. For simplicity and without loss of generality, we assume in this paper that each dimension of the data cube has the same size n and the same partition factor b. In the subsequent sections, we can safely take r = f, as both factors represent the block size of each entry of the LPC and the OA, respectively.

2.2 Queries and OA Construction

Due to the possibility of pruning the search, it is highly beneficial to evaluate the sub-blocks in a suitable order. As mentioned earlier, the values


in the LPC represent the maximum values of each sub-block. Obviously, the LPC needs to be processed before the rest. On the other hand, the smallest pre-computed value is an upper bound on the unprocessed values of a sub-block; thus, a higher priority of being processed is given to a sub-block with a higher upper bound. Hence, our query algorithm first searches all the entries of the LPC covered by the query range; then, based on the smallest pre-computed values of all these entries, the OA and the DC are searched, if necessary. We now present an example to illustrate this idea. Using the data cube and the range query {[2, 3], [5, 7]} of Figure 1, and the LPC and OA of Figure 2, a range top-4 query, i.e., k = 4, is performed.

Fig. 3. Processing of LPC

Firstly, all the maximum values in the LPC which are in the query range are inserted, together with their locations, into a sorted list SLans. SLans stores the candidate answers to the range top-k query, and the number of its nodes is limited to k, as only k answers are needed. As can be seen from Figure 3, SLans contains only three candidate answers: 95 from LPC[0, 1], 93 from LPC[1, 1] and 97 from LPC[1, 2], whilst the other pre-computed values in the LPC are out of range. For all the entries of the LPC covered by the query range, the smallest pre-computed values are inserted into another sorted list, termed SLseq, together with their indices in the LPC and their links to the OA. SLseq determines the order in which the OA and the DC are processed, based on the smallest pre-computed values. The constructed SLseq is shown in Figure 3. SLseq is processed by dequeuing the first node and comparing the maximum value in the dequeued node with the candidate answers found so far. If the maximum value in the dequeued node is larger than any of the candidate answers found, or the number of candidate answers found is less than k, the OA is processed if the link to the OA is not NULL; otherwise, the sub-block is scanned. As can be seen from Figure 4(a), the maximum value of the dequeued node is 96, from LPC[1, 2]. Since the number of candidate answers found so far is less than k, the overflow record linked from LPC[1, 2], that is, OA[1], is processed, and 89 from OA[1] is inserted into SLans, as shown in Figure 4(b). Since the number of candidate answers found is now 4 and the smallest pre-computed value in OA[1] is not larger than the smallest candidate answer found, 89 from OA[1] is not inserted into SLseq. As shown in Figure 4(b), the maximum value in the next dequeued node is 95, which is larger than the smallest value in SLans, i.e., 89. However, there is no overflow record for this entry; thus, the sub-block corresponding to LPC[0, 1] is scanned.
The values which are not pre-stored in LPC but are larger


Fig. 4. Processing of OA and DC

than 89 are 94 and 93. However, only 93 is inserted into SLans, as 94 is out of range. In addition, 94 and 93 are inserted into OA[3], and LPC[0, 1].Overflow is set to 3. As shown in Figure 4(c), the maximum value of the next dequeued node is 93, which is equal to the smallest candidate answer found; thus, the algorithm stops and SLans = {97, 95, 93, 93} is returned as the answer. By making use of the information given by the smallest pre-computed values in the LPC and/or OA, the search of the OA and/or DC for the sub-blocks corresponding to LPC[1, 1] and LPC[1, 2] is waived. This reduces the accesses to the original data cube and thus improves the response time of range top-k queries.
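The walkthrough above can be condensed into a priority-driven loop. The following sketch (Python, with simplified bookkeeping of ours: any OA values are folded into each block's pre-computed list, and SLans/SLseq are modelled as heaps) captures the stopping rule rather than the paper's exact data layout:

```python
import heapq

def range_top_k(blocks, k, in_range):
    """blocks: dict mapping a sub-block id to (pre, scan), where `pre`
    is the descending list of pre-computed (value, location) pairs
    (LPC plus any cached OA records) and `scan` is a zero-argument
    function returning the remaining (value, location) pairs."""
    answers = []  # min-heap playing the role of SLans (size <= k)

    def offer(value, loc):
        if in_range(loc):
            heapq.heappush(answers, value)
            if len(answers) > k:
                heapq.heappop(answers)

    seq = []  # max-heap (negated keys) playing the role of SLseq
    for bid, (pre, _) in blocks.items():
        for value, loc in pre:
            offer(value, loc)
        if pre:
            # smallest pre-computed value bounds the unseen cells
            heapq.heappush(seq, (-pre[-1][0], bid))

    while seq:
        neg_bound, bid = heapq.heappop(seq)
        if len(answers) >= k and -neg_bound <= answers[0]:
            break  # no unprocessed cell can beat the k candidates found
        for value, loc in blocks[bid][1]():  # scan the sub-block
            offer(value, loc)
    return sorted(answers, reverse=True)
```

Matching the example, the loop stops as soon as the best remaining upper bound no longer exceeds the smallest of the k candidates, so most sub-blocks are never scanned.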

2.3 OA Maintenance

The OA is a global but fixed-size structure. If the OA is not full, any sub-block that requires more values during queries can request a new allocation. In contrast, if the OA is full, the allocation must bear a higher "need" than that of some existing blocks in order to replace them. To serve this purpose, we maintain a "directory" structure which stores, for each partitioned block, meta-information about the access statistics of that block. This gives a rough idea of how many values from each block contribute to answering a query on average:

Mean_Usage = α × Prev_Mean_Usage + (1 − α) × Cur_Data

where Cur_Data is the number of values required beyond the LPC for answering the current query, Prev_Mean_Usage is the average number of values used beyond the LPC in previous queries, and α is the "forgetting factor" of the average. To find the most suitable overflow records for replacement, the sub-block with the highest difference between the number of overflow records stored and the average number of overflow records used (Mean_Usage) is chosen. Ideally, all the additional values obtained from queries should be kept in the OA. However, since a fixed size is allocated to the OA, in certain cases not all the desired values can be kept. Hence, a parameter β is introduced to represent the percentage of desired values kept, where 0 ≤ β ≤ 1. For example, when β = 0, the OA is empty, and when β = 1, all the additional values needed for queries are stored in the OA.
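A minimal rendering of this bookkeeping (Python; the scoring follows the Mean_Usage formula above, but the helper names and the α default are our assumptions):

```python
def update_mean_usage(prev_mean, cur_data, alpha=0.9):
    """Exponentially weighted access statistic of a sub-block:
    Mean_Usage = alpha * Prev_Mean_Usage + (1 - alpha) * Cur_Data,
    where cur_data is the number of values needed beyond the LPC
    for the current query and alpha is the forgetting factor."""
    return alpha * prev_mean + (1 - alpha) * cur_data

def replacement_victim(directory):
    """Pick the sub-block whose stored overflow records exceed its
    average usage the most; `directory` maps a block id to a
    (records_stored, mean_usage) pair."""
    return max(directory, key=lambda b: directory[b][0] - directory[b][1])
```

A block that hoards many overflow records but rarely uses them scores highest and is evicted first, which matches the "highest difference" rule in the text.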


Fig. 5. Query Cost Improvement with the Existence of OA

3 Experimental Results and Discussion

In order to obtain data cubes with non-uniformly distributed maximum values, a Zipf distribution is used to determine the locations of the cells holding the larger values. All the range queries are generated randomly, and the query cost is measured in terms of average cell accesses. Figure 5(a) depicts three separate experiments executed on a set of range queries using: i) the LPC only, with r = 5; ii) the LPC only, with r = 10; and iii) both the LPC and the OA, with the size of an overflow record and the number of pre-computed values in the LPC both equal to 5 (r = 5, f = 5). In general, the query cost when both the LPC and the OA are used is lower than the query cost of the LPC only with r = 5, especially when the number of sub-blocks covered by the query (Q) is moderately small. Without the OA, a straightforward way to improve the query cost of the LPC-only method is to increase r, e.g., to r = 10. However, this only reduces the average cell accesses when Q is small, as can be seen from Figure 5. Therefore, a better solution is to maintain an OA, which gives a more stable query cost across different values of Q. As discussed in Section 2.3, due to storage limitations, not all the additional values needed for queries may be kept in the OA in real-world cases. Therefore, the effect of β (the percentage of additional values kept) is studied, as shown in Figure 5(b). When β = 0.5, i.e., when 50% of the additional values required for queries can be kept in the OA, the performance is slightly degraded for moderately small Q compared to β = 1. When Q is very small, the performance of β = 0.5 is better than that of β = 1. This is because fewer overflow records are required when Q is small, and the cost difference is the cost of accessing the OA. As a result, the requirement on OA maintenance can be relaxed (i.e., not all the desired values need be stored in the OA) without sacrificing performance significantly.
The same set of range queries is performed using the naïve method, the general max method, the APPT method using the LPC only, and the APPT method using both the LPC and the OA. Below are some observations made from Figure 6:


Fig. 6. Comparison with Alternative Methods

1. Naïve method: the query cost increases linearly with the size of the query range, as all the cells covered by the query range need to be accessed.
2. General max method: pre-computes only the maximum value of each sub-block. The query cost improvement over the naïve method increases with the query size, as the pre-computed maximum value is able to waive part of the search over the data cube.
3. APPT method (LPC only): with just one more pre-computed value than the max method, i.e., r = 2, the query cost is much lower than that of the general max method.
4. APPT method (LPC + OA): with the OA, the query cost of the APPT method is further reduced when the size of an overflow record and the number of pre-computed values for each sub-block are both set to 2. For instance, for a query size that covers 20% of the data cube, the query cost of the APPT method (LPC + OA) is only about 10% and 0.3% of the query cost of the naïve method for 2-dimensional and 3-dimensional data cubes, respectively. This shows that a higher query cost improvement is gained for higher-dimensional data cubes.

4 Conclusion

In a decision-making environment, a number of top values are usually needed in order to make a precise decision. However, current approaches to range-max queries cannot answer range top-k queries efficiently and cost-effectively if applied directly. In this paper, we have presented an approach, the Adaptive Pre-computed Partition Top (APPT) method, for range top-k query processing. The main idea of the APPT method is to pre-store a number of top values for each sub-block in the Location Pre-computed Cube (LPC). Based on the distribution of the maximum values, additional values required during queries are kept in a dynamic structure termed the Overflow Array (OA). In order to fully utilize the limited space of the OA, a technique is presented for OA maintenance. Through experiments, it is shown that the performance of the APPT


method is further enhanced by the existence of the OA. Although the additional values needed for queries might not be fully stored in the OA due to storage constraints, the OA still provides a high degree of performance improvement. Furthermore, the improvement in query cost gained by the APPT method over the alternative methods increases for higher-dimensional data cubes; e.g., for a query size that covers 20% of the data cube, the APPT method requires only about 10% and 0.3% of the query cost of the naïve method for 2-dimensional and 3-dimensional data cubes, respectively. This is very important, as the data cubes of OLAP applications are multi-dimensional.

References

1. The OLAP Council. MD-API the OLAP Application Program Interface Version 5.0 Specification, 1996.
2. A. Agrawal, A. Gupta, S. Sarawagi. Modeling Multidimensional Databases. In Proc. 13th Int'l Conf. on Data Engineering, 1997.
3. J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tabs and Sub-Totals. In Proc. 12th Int'l Conf. on Data Engineering, pp. 152-159, 1996.
4. C. Ho, R. Agrawal, N. Megiddo and R. Srikant. Range Queries in OLAP Data Cubes. In Proc. ACM SIGMOD Conf. on Management of Data, 1997.
5. C. Y. Chan, Y. E. Ioannidis. Hierarchical Cubes for Range-Sum Queries. In Proc. 25th Int'l Conf. on Very Large Databases, pp. 675-686, 1999.
6. S. Geffner, D. Agrawal, A. E. Abbadi, T. Smith. Relative Prefix Sums: An Efficient Approach for Querying Dynamic OLAP Data Cubes. In Proc. 15th Int'l Conf. on Data Engineering, pp. 328-335, 1999.
7. W. Liang, H. Wang, M. E. Orlowska. Range Queries in Dynamic OLAP Data Cubes. Data and Knowledge Engineering 34, 2000.
8. H. G. Li, T. W. Ling, S. Y. Lee, Z. X. Loh. Range-Sum Queries in Dynamic OLAP Data Cubes. In Proc. 3rd Int'l Symposium on Cooperative Database Systems for Advanced Applications, pp. 74-81, 2001.
9. D. W. Kim, E. J. Lee, M. H. Kim, Y. J. Lee. An Efficient Processing of Range-MIN/MAX Queries over Data Cube. Information Sciences, pp. 223-237, 1998.
10. S. Y. Lee, T. W. Ling, H. G. Li. Hierarchical Compact Cube for Range-Max Queries. In Proc. 26th Int'l Conf. on Very Large Databases, 2000.
11. H. G. Li, T. W. Ling, S. Y. Lee. Range-Max/Min Queries in OLAP Data Cubes. In Proc. 11th Int'l Conf. on Database and Expert Systems Applications, pp. 467-475, 2000.
12. D. Donjerkovic, R. Ramakrishnan. Probabilistic Optimization of Top N Queries. In Proc. 25th Int'l Conf. on Very Large Databases, 1999.
13. S. Chaudhuri, L. Gravano. Evaluating Top-k Selection Queries. In Proc. 25th Int'l Conf. on Very Large Databases, 1999.
14. Z. W. Luo, T. W. Ling, C. H. Ang, S. Y. Lee, H. G. Li. Range Top/Bottom k Queries in OLAP Sparse Data Cubes. In Proc. 12th Int'l Conf. on Database and Expert Systems Applications, pp. 678-687, 2001.