HOBI: Hierarchically Organized Bitmap Index for ...

42 downloads 6293 Views 725KB Size Report
from the biggest East-European Internet auction platform Allegro.pl. The experiments ... Every value from the domain of an indexed attribute A has associated its own bitmap. .... optimized by means of a join index on attribute Products.name.
HOBI: Hierarchically Organized Bitmap Index for Indexing Dimensional Data

Jan Chmiel1 , Tadeusz Morzy2 , Robert Wrembel2 QXL Poland (Allegro.pl) Pozna« University of Technology, Institute of Computing Science Pozna«, Poland 1 [email protected], 2 {tmorzy,rwrembel}@cs.put.poznan.pl 1

2

In this paper we propose a hierarchically organized bitmap index (HOBI) for optimizing star queries that lter data and compute aggregates along a dimension hierarchy. HOBI is created on a dimension hierarchy. The index is composed of hierarchically organized bitmap indexes, one bitmap index for one dimension level. It supports range predicates on dimensional values as well as roll-up operations along a dimension hierarchy. HOBI was implemented on top on Oracle10g and evaluated experimentally. Its performance was compared to a native Oracle bitmap join index. Experiments were run on a real dataset, coming from the biggest East-European Internet auction platform . The experiments show that HOBI oers better star query performance than the native Oracle bitmap join index. Abstract.

Allegro.pl

1

Introduction

A data warehouse architecture has been developed in order to integrate data originally stored in multiple distributed, heterogeneous, and autonomous data sources (DSs) deployed across a company. A core component of this architecture is a huge database, called a data warehouse (DW), that stores current and historical integrated data from DSs. The content of a DW is analyzed by the so-called On-Line Analytical Processing (OLAP) queries (applications), for the purpose of discovering trends, patterns of behavior, anomalies, and dependencies between data. In order to support such kinds of analyses, a DW uses a dimensional model. In this model [6], an elementary information being the subject of analysis is called a fact. It contains numerical features, called measures that quantify the fact and allow to compare dierent facts. Values of measures depend on a context set up by dimensions that often have a hierarchical structure composed of levels. Following a dimension hierarchy from bottom to top, one may obtain more aggregated data in an upper level, based on data computed in a lower level. The dimensional model is often implemented in relational OLAP servers (ROLAP) 0

This work was supported from the Polish Ministry of Science and Higher Education grant No. N N516 365834

[3], where fact data are stored in a fact table, and dimension data are stored in dimension level tables. In the paper we will focus on a ROLAP implementation of a DW. OLAP queries access large volumes of data by means of the so-called star queries. Such queries frequently join fact tables with multiple dimension level tables, lter data by means of query predicates, and compute aggregates along multiple dimension hierarchies. Reducing execution time of star queries is crucial to a DW performance. Dierent techniques have been developed for reducing execution time of such queries. The techniques include among others: materialized views and query rewriting, partitioning of tables and indexes, parallel processing, and multiple index structures discussed below. 1.1

Related Work

The index structures applied to optimizing access to large volumes of data can be categorized as: (1) multi-dimensional indexes, (2) B-tree like indexes, (3) join indexes, (4) bitmap indexes, and (5) multi-level indexes. For indexing data in multiple dimensions some multi-dimensional indexes have been proposed in the research literature, like for example the family of R-trees [5, 15], Quad-tree [4], and K-d-b-tree [13]. Due to their ineciency in indexing many dimensions and large sizes they have not been widely applied in data warehouses. The most common indexes applied in traditional relational databases are indexes from the B-tree family [7]. They are ecient in indexing data of high cardinalities (i.e., wide domains) and they well support queries of high selectivities (i.e., when few records fulll query criteria). For searching values of low cardinalities (i.e., narrow dimensions) B-tree indexes are inecient. OLAP queries are often expressed on attributes of low cardinalities and, as a consequence, B-tree indexes may even deteriorate performance of such queries. Based on a B-tree index, the so-called join index was developed [21]. This index stores in its leaves a precomputed join of a fact and a dimension level table. It is applicable to the optimization of star queries that access both fact and dimension level tables. As mentioned before, OLAP queries are often of low selectivities and B-tree indexes may not be applicable to optimize them. For this reason, for indexing data of low cardinalities, for ecient ltering large data volumes, and for supporting OLAP queries of low cardinalities the so-called bitmap indexes have been developed [11, 12]. A bitmap index is composed of bitmaps, each of which is the vector of bits. Every value from the domain of an indexed attribute A has associated its own bitmap. The number of bits in each bitmap is equal to the number of rows in table T that stores A. A bitmap created for value v of indexed attribute A describes these rows in T whose value of A is v. In this bitmap, bit number n is set to '1' if the value of A of the n-th row equals v. Queries whose predicates involve attributes indexed by bitmap indexes can be answered fast by performing bitwise AND, or OR, or NOT operations on

bitmaps. The size of a bitmap index increases when the cardinality of an indexed attribute increases. As a consequence, bitmap indexes dened on attributes of high cardinalities become very large or too large to be eciently processed in main memory [23]. In order to improve the eciency of accessing data with the support of bitmap indexes dened on attributes of high cardinalities, either dierent kinds of bitmap encodings have been proposed, e.g. [2, 8, 14, 20, 22] or compression techniques have been developed, e.g. [1, 18, 19]. In [9, 10, 16, 17] indexes of multi-level structures have been proposed. The socalled multi-resolution bitmap index was presented in [16, 17] for the purpose of indexing scientic data. The index is composed of multiple levels. Lower levels, are implemented as standard bitmap indexes oering exact data look-ups. Upper levels, are implemented as binned bitmaps, oering data look-ups with false positives. An upper level index (the binned one) is used for retrieving a dataset that totally fullls query search criteria. A lower level index is used for fetching data from boundary ranges in the case when only some data from bins fulll query criteria. In [9, 10], the so-called hierarchical bitmap index was proposed for set-valued attributes for the purpose of optimizing subset, superset, and similarity queries. The index dened on a given attribute consists of the set of index keys, where every key represents the single set of values. Every index key comprises signature S . The length of the signature, i.e. the number of bits, equals the size of the domain of the indexed attribute. S is divided into n-bit chunks (called index key leaves) and the set of inner nodes. Index key leaves and inner nodes are organized into a tree structure. Every element from the indexed set is represented once in the signature by assigning value '1' to an appropriate bit in an appropriate index key leaf. The next level of the index key stores information only about these index key leaves that contain '1' on at least one position. A single bit in an inner node represents a single index key leaf. If the bit is set to '1' then the corresponding index key leaf contains at least one bit set to '1'. The i-th index key leaf is represented by j-th position in the k-th inner node, where k = di/le and j = i − (di/le − 1) ∗ l. Every upper level of the inner nodes represents the lower level in an analogous way. This procedure repeats recursively up to the root of the tree. From the index structures discussed above none was proposed for indexing hierarchical dimensional data in a data warehouse. Moreover, none of the above index structures reects the hierarchy of dimensions. Such a feature may be useful for computing aggregates in an upper level of a dimension based on data computed for a lower level. 1.2

Paper Contribution and Organization

In this paper we propose a hierarchically organized bitmap index (HOBI) for indexing data in a dimension hierarchy. HOBI is composed of hierarchically organized bitmap indexes. One bitmap index is created for one dimension level. Thus, an upper level bitmap index can be perceived as the aggregation of lover level bitmap indexes. HOBI supports: (1) query range predicates expressed on

attributes from dimension level tables at any level of a dimension hierarchy and (2) roll-up operations along a dimension hierarchy. HOBI was implemented as a software layer located on top of Oracle10g DBMS. The performance of HOBI has been evaluated experimentally and compared to the performance of a native Oracle bitmap join index that is built in the DBMS. Experiments were run on a real dataset, coming from the biggest East-European Internet auction platform Allegro.pl. As our experimental results demonstrate, HOBI oers better star query performance than the native Oracle bitmap join index, regardless the data volume fetched by test queries. This paper is organized as follows. Section 2 presents examples that motivate the need for an index along a dimension hierarchy. Section 3 presents the hierarchically organized bitmap index that we developed. Section 4 discussed the experimental results evaluating the performance of HOBI. Finally, Section 5 summarizes and concludes the paper.

2

Motivating Examples

2.1

Example Data Warehouse

Let us consider as an example a simplied DW schema, cf. Figure 1, built for the purpose of analyzing Internet auction data. The schema includes the Auctions fact table storing data about nished Internet auctions. The schema allows to analyze the number of nished auctions, and aggregate purchase costs with respect to time, location of a customer, and product sold. To this end, Auctions is connected to three dimensions via foreign keys, namely Product, Location, and Time, all of them composed of hierarchically connected levels 1 . For example, dimension Location is composed of three levels, namely Cities, Regions, and Countries, such that Cities belong to Regions that, in turn, belong to Countries. In the example schema multiple analytical range, set, roll-up, and drill-down queries are executed. Some of them are discussed below. 2.2

Range Query

Let us consider the below query counting the number of auctions that started between April 13th and July 31st, 2007. Typically, the query can be optimized rst, by fetching from Days rows that fulll the query criteria, and second, by joining the fetched rows with the content of Auctions. The query could be optimized by means of a join index on attribute Days.d_date. For queries of low selectivities (wide range of dates) a query optimizer will not prot from this index. Moreover, for queries specifying ranges of months or years, the index will not prevent from joining upper level tables, resulting in query performance deterioration. 1

For simplicity reasons table attributes are not shown in Figure 1.

Fig. 1.

An example data warehouse schema

SELECT COUNT(1) FROM auctions, days WHERE auctions.start_day = days.d_date AND days.d_date >= 13042007 AND days.d_date < 31072007; 2.3

Set Query

Let us consider the below query counting the number of auctions concerning all products from category 'Handheld', namely 'Mio DigiWalker', 'HP iPaq', 'Palm Treo', 'Asus P320', and 'Toshiba Portege', as well as only one product from category 'Mini Notebooks', namely 'Asus Eee PC'. Typically, the query can be optimized by means of a join index on attribute Products.name. Notice that all products from category 'Handheld' have been selected by the query. Therefore, if an additional data structure existed at level Categories that could point to appropriate auctions, then the query could be executed more eciently. SELECT COUNT(1) FROM auctions, products WHERE auctions.product = products.id AND products.name IN ('Mio DigiWalker','HP iPaq', 'Palm Treo', 'Asus P320', 'Toshiba Portege','Asus Eee PC'); 2.4

Roll-up/Drill-down Query

Let us consider the below query computing the sum of sales prices of products, per cities where customers are located.

SELECT cities.name, SUM(auctions.price) FROM auctions, cities WHERE auctions.city = cities.id GROUP BY cities.name;

By rolling up this result along the hierarchy of dimension Location, one can compute sum of product sales prices per regions (cf. the below query) and, further, per countries. In order to perform these roll-ups, a query optimizer has to join level tables Cities and Regions as well as fact table Auctions. Having an additional data structure at level Regions, pointing to appropriate regional auctions, the query could be better optimized. SELECT regions.name, SUM(auctions.price) FROM auctions, cities, regions WHERE auctions.city = cities.id AND cities.region = region.id group by regions.name;

3

Hierarchically Organized Bitmap Index

In order to optimize queries joining fact and dimension level tables along a dimension hierarchy, we propose the so-called hierarchically organized bitmap index (HOBI). HOBI is composed of bitmap indexes created for every level of a dimension hierarchy. The bitmap indexes are also organized in a hierarchy that reects the dimension hierarchy, such that a bitmap index at an upper level aggregates bitmap indexes from a lower level. In order to present the concept of HOBI let us assume that dimension D is composed of level tables LT1 , . . . , LTn . We assume that every level table includes a key attribute. Level tables form a hierarchy in D. The hierarchy is denoted as LT1 → LT2 → . . . → LTn , where symbol → represents a relationship between a lower level table and its direct upper level table, cf. Figure 2. The relationship also exists between level data (table rows), such that multiple rows from a lower level LTi are related to one row from a direct upper level LTi+1 . Rows from level LTi are represented by the values of the key attribute of LTi and they are denoted as keyjLTi , where j = {1, 2, . . .}. Let LTi → LTi+1 . Let LTi store rows denoted as keyjLTi (j = {1, 2, . . . , m, m+ LT LT 1, . . . , q}) and let LTi+1 store rows denoted as keyo i+1 and keyp i+1 . Let furLTi+1 LTi+1 LTi LTi ther keyj → keyo and keyk → keyp , where j = {1, 2, . . . , m} and k = {m + 1, . . . , q}. In the hierarchy of D, every key attribute of LTi (i = {1, 2, . . .}) has dened its bitmap index that is denoted as BIi . BIi is composed of the set of bitmaps {bi1 , bi2 , . . . , bim , bim+1 , . . . , biq }, where q represents the cardinality of the indexed key attribute of LTi . In a direct upper level LTi+1 , bitmap bi+1 for key value o LT i i i i+1 keyo i+1 is computed as follows: bi+1 = b OR b OR . . . OR b . o m Bitmap bp 1 2

Fig. 2.

The concept of HOBI

i+1 for key value keypLTi+1 is computed as follows: bi+1 = bi+1 p m+1 OR bm+2 OR . . . OR bi+1 q . This procedure repeats recursively to the root of the hierarchy. In order to illustrate the concept of HOBI let us consider fact table Auctions storing 10 auction sales rows, cf. Figure 3. Since Auctions is connected to dimension Product composed of two levels, namely Products and Categories, HOBI consists of two levels. At the level of Products there are 8 bitmaps, each of which describes auction sales of one product. At the upper level Categories there are 2 bitmaps, one bitmap for one category of sold products. The 'Handheld' bitmap describes auction sales of all products from this category, i.e., 'Mio DigiWalker', 'Toshiba Portege', 'Asus P320', 'HP iPaq', and 'Palm Treo'. The 'Handheld' bitmap is computed by OR-ing the bitmaps from level Products. Similarly, the 'Mini notebook' bitmap describes auction sales of products from this category and it is constructed by OR-ing bitmaps for 'MSI Wind', 'Asus Eee PC', and 'Macbook Air'. In order to illustrate the usage of HOBI for range queries (cf. Section 2.3) let us consider a query computing the sum of sales of products 'Mio DigiWalker', 'Toshiba Portege', 'Asus P320', 'HP iPaq', and 'Palm Treo', as well as 'Macbook Air'. Since all products from category 'Handheld' are selected by the query, the 'Handheld' bitmap index, dened for level Category, is used for retrieving all auction sales concerning products from this category, rather than using individual bitmaps for products. For retrieving auction sales of 'Macbook Air' the bitmap index on level Products is be used. In order to illustrate the usage of HOBI for roll-up queries (cf. Section 2.4) let us consider a query computing the sum of sales concerning products in category 'Mini notebook'. In order to answer this query, a query optimizer uses bitmap 'Mini notebook' (dened for level Categories ) for nding appropriate auction sales rows. In this case, fact table Auctions need not be joined with its Product dimension level tables.

Fig. 3.

An example HOBI

4

Experimental Evaluation

The performance of HOBI was evaluated experimentally and compared to the performance of a native Oracle bitmap join index that is built in the DBMS. In the experiments we focused on the performance of range queries (cf. Section 2.2) and roll-up queries (cf. Section 2.4). The test queries were executed in a simplied auctions DW, shown in Figure 1. The experiments were run on a real dataset acquired from the Allegro Group, which is the leader of on-line auctions market in Eastern Europe. The test dataset contained 100,000,000 of fact rows describing nished auctions from the period of time between April 2007 and March 2008. Data were stored in an Oracle 10g DBMS. HOBI was implemented as an application on top of the DBMS. The experiments were run on a PC (Intel Dual Core 2GHz, 2GB RAM) under Ubuntu Linux 8.04). In order to eliminate the inuence of caching, database buer cache and shared pool were cleared before executing every query. The same query was run 10 times in order to check the repeatability of obtained results. In the experiments we were interested in measuring times of nding rows fullling query selection criteria, and not for fetching data rows from the database, since for both compared indexes identical number of rows is fetched. For this purpose, in the test queries we used count as an aggregate function. 4.1

Range Scan

This experiment aimed at measuring the eciency of HOBI for a range scan query. The query used for the experiment computed the number of auctions created by sellers in a given time period dened on the Days level, as shown below. For this experiment, the bitmap join index was created on attribute Days.d_date and HOBI was created for the Time dimension. Time period was parameterized and equaled to 1 month, 3, 6, and 9 months. SELECT COUNT(1) FROM auctions, days WHERE auctions.start_day = days.d_date AND days.d_date >= date-A AND days.d_date < date-B;

The obtained average query execution times are shown in Figure 4. As we can observe from the chart, HOBI oers better performance than the bitmap join index for every tested time range. For the data visualized in Figure 4 we computed ratio tBJI /tHOBI , where tBJI and tHOBI are average query response times using the bitmap join index and HOBI, respectively. The ratio ranges from 1.16 (for the 3-months query) to 1.48 (for the 1-month and 9-months queries). 4.2

Roll-up/Drill-down

This experiment aimed at measuring the eciency of HOBI for roll-up operations. The query used for the experiment computed the number of auctions

Fig. 4.

Dierent time ranges selected by a test query

created by sellers in a given time period dened on the upper level Months. The bitmap join index was created on attribute Days.d_date and HOBI was created for the Time dimension, as in the previous experiment. Time period was parameterized and encompassed from 10% to 100% of days stored in the database. SELECT COUNT(1) FROM auctions, days, months WHERE auctions.start_day = days.d_date and days.month = months.id AND months.month >= date-A AND months.month < date-B;

The obtained average query execution times are shown in Figure 5. As we can observe from the chart, HOBI oers better performance than the bitmap join index regardless the selectivity of the test query. For the data visualized in Figure 5 we computed ratio tBJI /tHOBI , similarly as in Section 4.1. The ratio ranges from 1.22 (for the selectivity 10%-20%) to 1.50 (for the selectivity of 75%-85%). 4.3

Observation

In both of the above experiments HOBI oered better performance than the native Oracle bitmap join index. HOBI proted from the fact that days are grouped to months. For those time periods where queried days formed full months, a bitmap index on Months was used instead of an index on Days. This way fewer bitmaps were read. Note that HOBI was implemented as an application on top of the DBMS resulting in some additional time overheads. One may expect yet

Fig. 5.

Variable selectivity of a test query

better performance of HOBI while building it in the DBMS so that it is eciently managed by a system and eciently used by a query optimizer.

5

Conclusions

In this paper we proposed the HOBI index for indexing hierarchical dimensions. The index is composed of hierarchically organized bitmap indexes, where one bitmap index is created for one dimension level. HOBI supports range predicates expressed on dierent levels of a dimension hierarchy as well as roll-up operations along a dimension hierarchy. Experimental evaluation of HOBI on a real auction dataset shows its better performance as compared to a native Oracle bitmap join index. In future we will extend the experiments to include also synthetic data from the TPC-H benchmark. Current implementation of HOBI does not apply any bitmap compression technique. In future we will extend the index in order to store and process compressed bitmaps.

References 1. G. Antoshenkov and M. Ziauddin. Query processing and optimization in Oracle RDB. , 5(4):229237, 1996. 2. C. Chan and Y. Ioannidis. Bitmap index design and evaluation. In , pages 355366, 1998. 3. S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. , 26(1):6574, 1997.

VLDB Journal SIGMOD Int. Conference on Management of Data SIGMOD Record

Proc. of ACM

4. R. A. Finkel and J. L. Bentley. Quad trees: A data structure for retrieval on composite keys. , 4:19, 1974. 5. A. Guttman. R-trees: a dynamic index structure for spatial searching. In , pages 4757. ACM, 1984. 6. M. Jarke, M. Lenzerini, Y. Vassiliou, and P. Vassiliadis. . Springer-Verlag, 2003. 7. T. Johnson and D. Sasha. The performance of current b-tree algorithms. , 18(1):51101, 1993. 8. N. Koudas. Space ecient bitmap indexing. In , pages 194201, 2000. 9. M. Morzy. . PhD thesis, Pozna« University of Technology, Institute of Computing Science, 2004. 10. M. Morzy, T. Morzy, A. Nanopoulos, and Y. Manolopoulos. Hierarchical bitmap index: An ecient and scalable indexing technique for set-valued attributes. In , pages 236252. LNCS 2798, 2003. 11. P. O'Neil. Model 204 architecture and performance. In , pages 4059. LNCS 359, 1987. 12. P. O'Neil and D. Quass. Improved query performance with variant indexes. In , pages 3849, 1997. 13. J. T. Robinson. The k-d-b-tree: a search structure for large multidimensional dynamic indexes. In , pages 1018. ACM, 1981. 14. D. Rotem, K. Stockinger, and K. Wu. Optimizing candidate check costs for bitmap indices. In , pages 648655, 2005. 15. T. K. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-Tree: A dynamic index for multi-dimensional objects. In , pages 507518, 1987. 16. R. R. Sinha, , S. Mitra, and M. Winslett. Bitmap indexes for large scientic data sets: A case study. In . IEEE, 2006. 17. R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes for scientic data. , 32(3):138, 2007. 18. M. Stabno and R. Wrembel. RLH: Bitmap compression technique based on run-length and Human encoding. , 2009. 19. K. Stockinger and K. Wu. Bitmap indices for data warehouses. In R. Wrembel and C. Koncilia, editors, , pages 157178. Idea Group Inc., 2007. ISBN 1-59904-364-5. 20. K. Stockinger, K. Wu, and A. Shoshani. Evaluation strategies for bitmap indices with binning. In , pages 120129. LNCS 3180, 2004. 21. P. Valduriez. Join indices. , 12(2):218246, 1987. 22. K. Wu and P. Yu. Range-based bitmap indexing for high cardinality attributes with skew. In , pages 6167, 1998. 23. M. Wu and A. Buchmann. Encoded bitmap indexing for data warehouses. In , pages 220230, 1998.

Acta Informatica

SIGMOD Fundamentals of Data Warehouses ACM Transactions on Database Systems (TODS) Proc. of ACM Conference on Information and Knowledge Management (CIKM) Advanced database structure for ecient association rule mining ADBIS Int. Workshop on High Performance Transactions Systems Proc. of ACM SIGMOD Int. Conference on Management of Data

(CIKM)

SIGMOD Proc. of ACM Conference on Information and Knowledge Management

VLDB Parallel and Distributed Processing Symposium ACM Transactions on Database Systems (TODS) Information Systems, in press, doi:10.1016/j.is.2008.11.002 Data Warehouses and OLAP: Concepts, Architectures and Solutions Proc. of Int. Conference on Database and Expert Systems Applications (DEXA) ACM Transactions on Database Systems (TODS) Int. Computer Software and Applications Conference (COMPSAC) Proc. of Int. Conference on Data Engineering (ICDE)

Suggest Documents