2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
ACO based Approach and Integrating Information Retrieval Technologies in Selecting Bitmap Join Indexes
Habiba Drias, Ibtissem Frihi
Department of Computer Science, USTHB, LRI, Algiers, Algeria
colony optimization (ACO) in addressing a broad spectrum of industrial applications [6], [7], [9], [10], and knowing that only a few heuristic search techniques have been used to investigate the bitmap join index selection problem, we have tackled this problem with this well-recognized evolutionary approach.
Abstract: Unlike existing studies dealing with the selection of bitmap join indexes for star join query optimization, this paper presents three original features. The first is to address the problem with an ant-based approach, which is more robust than the simple heuristic algorithms usually used in related works. The second novelty resides in the metric used to prune the search space: the fitness function designed for the ant approach is borrowed from information retrieval technologies and is more refined than the frequency measure usually used. Finally, the third aspect is the data structure used to manage the storage dynamically in order to select the most promising indexes.
Keywords: Relational data warehouse, Bitmap Join Indexes, star join queries, Information Retrieval Technologies, ACO approach.
In this paper, an ACO algorithm, namely ACS-BJIS, has been designed to explore the problem of selecting bitmap join indexes. The algorithm was tested on a data warehouse benchmark, and a comparison of its performance with results from the literature is reported.
2. Bitmap Join Index Selection
In most cases, data warehouse schemas are simple. The star architecture, for instance, includes one or several fact tables that occupy a central position and several dimension tables that are connected to the fact tables. The complexity of the data warehouse is due to three main factors. The first is the tremendous volume of fact data, while the second is the large number of dimension tables, which contain thousands and, in some cases, millions of rows. The third crucial parameter is the complexity of queries and their prohibitive number. Indeed, to scan a part of this large-scale data, queries require an extremely long processing time. Bitmap join indexes are designed mainly to reduce this time drastically.
1. Introduction and Motivation
Data warehousing has already shown its great capabilities in maximizing user access and analysis, essentially thanks to the tremendous amount of organizational data that it centralizes. In this study, we are interested in optimizing user queries in a data warehousing environment, given the importance of this task in data mining systems. The star data warehouse architecture is considered because it is the most commonly used schema for efficiency reasons. In addition, we focus on the selection of bitmap join indexes because of their ability to speed up the processing of large-scale queries.
The drawback of bitmap join indexes is their exponential number, and selecting useful indexes in order to optimize query processing time raises serious difficulties. The BJIS problem is thus an optimization problem and can be described by the following components:
The problem of selecting bitmap join indexes (BJIS) is NP-hard [4] because of the exponential number of indexes that may be built for a data warehouse. Several heuristic search methods have been developed for this problem [1], [3], [4]. Nowadays, more powerful artificial intelligence tools represent a large part of the methods used to cope with NP-hard problems. Motivated by the success and the power of the ant
978-0-7695-4191-4/10 $26.00 © 2010 IEEE DOI 10.1109/WI-IAT.2010.180
1) The input, including:
- A set of dimension tables T = {T1, T2, …, Tp} and a fact table F
- A workload of queries Q = {q1, q2, …, qq}. We assume that the queries contain at most m indexable attributes
space pruning and index selection are done separately within two phases and this way of processing is not in accordance with the heuristic search methodology where these actions are handled simultaneously and dynamically all along the search process.
- The storage capacity S that limits the target BJIs.
2) The question, formulated as follows: Determine the BJIs that speed up the query processing time.
The bitmap join indexes can be created by joining dimension tables with the fact table. BJIs correspond to subsets of Boolean-valued attributes derived from the interpretation of the logical expressions appearing in the where clauses. This Boolean representation makes it possible to identify very rapidly the rows that satisfy the attribute predicates. Furthermore, BJIs handle the AND, OR, NOT and COUNT operations of where clauses efficiently.

TABLE I. EXAMPLE OF DATA WAREHOUSE ENVIRONMENT

Table        Nature of table    Number of instances
Sales        Fact               16 260 336
Customers    Dimension          50 000
Products     Dimension          10 000
Times        Dimension          1 461

3. The Ant Colony Optimization
Each artificial ant of the colony builds a solution by repeatedly applying a pseudo-random-proportional rule to choose the elements that will constitute the solution. While constructing a solution, the ant also modifies the amount of pheromone on the chosen elements by applying the step-by-step updating rule. Once all ants have completed their solutions, the amount of pheromone is modified again by applying the global offline updating rule.
Ants are guided in building their solutions by both heuristic and pheromone information, which together define the transition rule. The pheromone updating rules are designed so as to assign more pheromone to the elements that should be chosen by ants. The ACO methodology is outlined as follows:

Pheromone initialization
Repeat
    For each ant of the current generation do
        - Build a solution by choosing elements according to the transition rule and by applying the step-by-step updating rule
        - Determine the best solution of this generation
    Apply the offline updating rule
Until termination criterion is met.

Table I shows an example of a star data warehouse where sales is the fact table and customers, products, and times are dimension tables. An example of a workload query is:

    select sales.cust_id, sum(amount_sold)
    from sales, customers, times
    where sales.cust_id = customers.cust_id
      and sales.time_id = times.time_id
      and times.fiscal_year = 3
    group by sales.cust_id;

sales.cust_id is an attribute extracted from the where clause. A BJI is a set of such attributes, excluding sets that contain:
- only primary keys of dimension tables or foreign keys of the fact table
- more key attributes than non-key attributes
- only non-key attributes.

Recent related works have addressed the problem with approaches such as linear programming [4] and greedy algorithms [1], [3]. The linear programming method tackles the problem complexity in a judicious manner. However, our approach has the advantages of addressing the problem with a meta-heuristic that has shown very interesting outcomes for many real-life applications, and of proposing original pruning functions borrowed from information retrieval technologies.
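The three exclusion rules for BJI attribute sets stated above can be read as a validity test on a candidate set. The following is a minimal sketch, assuming each attribute is tagged as key (primary key of a dimension table or foreign key of the fact table) or non-key. The names `Attribute` and `is_valid_bji` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str      # e.g. "times.fiscal_year"
    is_key: bool   # True for a dimension primary key or a fact foreign key


def is_valid_bji(attributes):
    """Return True if the attribute set passes the three exclusion rules."""
    keys = sum(1 for a in attributes if a.is_key)
    non_keys = len(attributes) - keys
    if non_keys == 0:      # only key attributes (also rejects the empty set)
        return False
    if keys > non_keys:    # more key attributes than non-key attributes
        return False
    if keys == 0:          # only non-key attributes
        return False
    return True


# Attributes taken from the workload query shown earlier.
cust_fk = Attribute("sales.cust_id", is_key=True)
time_fk = Attribute("sales.time_id", is_key=True)
fiscal_year = Attribute("times.fiscal_year", is_key=False)

print(is_valid_bji([cust_fk, time_fk]))               # only keys
print(is_valid_bji([cust_fk, time_fk, fiscal_year]))  # more keys than non-keys
print(is_valid_bji([fiscal_year]))                    # only non-keys
print(is_valid_bji([time_fk, fiscal_year]))           # one key, one non-key
```

Under this reading, a candidate is acceptable only when it mixes key and non-key attributes and the key attributes do not outnumber the non-key ones.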
4. ACS-BJIS algorithm
The input to the ant algorithm includes, in addition to the tables of the data warehouse and their sizes, all possible indexable attributes extracted from the where clauses of the query workload. They are stored in a table with additional information concerning their role in the tables they index: primary key, foreign key or join attribute. During the construction of an index by the ants, the process verifies the BJI conditions defined in Section 2 before accepting it as a solution. A solution to the problem is a set of bitmap join indexes with good characteristics. The main idea of the approach is to build BJIs from all possible indexes by means of heuristics and probabilistic rules specified by the ant system behavior. The number of BJIs retained at the end of the search process is limited by the storage capacity. The fitness function we propose is inspired by information retrieval (IR) technology. In fact, in this
The drawback of the greedy techniques is that they lack scalability, since they assume that the input, constituted by the candidate indexes, can be browsed completely. Besides, search
context, documents are retrieved from data collections when they are similar to the user query, that is, when they share some of the same keywords with the query. An analogy exists between both paradigms: a term, a document and a collection of documents in IR correspond respectively to an attribute, a set of BJI candidates and a set of attributes in the BJIS problem. Note that in IR only one query is handled at a time, whereas in data mining a workload of queries must be considered at the same time. We therefore propose the following representations for queries and BJI candidates:

$$q = (\mathit{waq}_1, \mathit{waq}_2, \mathit{waq}_3, \ldots, \mathit{waq}_m)$$
$$bji = (\mathit{wabji}_1, \mathit{wabji}_2, \mathit{wabji}_3, \ldots, \mathit{wabji}_m)$$

waq_i corresponds to the weight of attribute a_i in q and is expressed by the product aif*iqf [2], where aif denotes the attribute frequency in q and iqf the inverted query frequency. Usually, the following formulas are used to compute aif and iqf respectively:

$$\mathit{aif} = \mathit{freq}(a_i, q), \qquad \mathit{iqf} = \log(1/k)$$

where freq(a_i, q) is the number of occurrences of attribute a_i in q and k represents the number of all BJI candidates. The component aif indicates the importance of an attribute for a query, while iqf expresses the discriminating power of this attribute. In this way, an attribute with a high value of aif*iqf is at the same time important in the query and less frequent in the others. Since a query and a bitmap join index are modeled identically as a set of attributes, all the above definitions hold for both concepts. In other words, replacing q by bji gives for a BJI the definition stated for a query. The similarity f(bji, q) of a BJI bji and a query q can be computed as:

$$f(bji, q) = \frac{\sum_i \mathit{waq}_i \cdot \mathit{wabji}_i}{\left(\sum_i \mathit{waq}_i^{2} \cdot \sum_i \mathit{wabji}_i^{2}\right)^{1/2}} \qquad \text{(Cosine)}$$

During the search process, each attribute of a considered BJI is assigned an amount of pheromone that represents the importance of its previous contribution to constructing good solutions.
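The cosine measure above can be sketched in a few lines. This is an illustrative sketch only: it assumes the weight vectors of the query and of the BJI candidate have already been computed and are stored as dictionaries mapping attribute names to weights, with absent attributes treated as weight 0. The function name `cosine_fitness` and the sample weights are made up for the example.

```python
import math


def cosine_fitness(w_bji, w_q):
    """Cosine similarity between a BJI candidate and a query.

    Both arguments map attribute names to weights; a missing
    attribute is treated as having weight 0.
    """
    dot = sum(w * w_q.get(attr, 0.0) for attr, w in w_bji.items())
    norm = math.sqrt(sum(w * w for w in w_bji.values())
                     * sum(w * w for w in w_q.values()))
    # Guard against an all-zero vector, for which the cosine is undefined.
    return dot / norm if norm > 0 else 0.0


# Hypothetical weights for the attributes of the example query.
query = {"sales.cust_id": 2.0, "sales.time_id": 1.0, "times.fiscal_year": 1.0}
candidate = {"sales.time_id": 1.0, "times.fiscal_year": 1.0}
print(cosine_fitness(candidate, query))
```

Because the cosine is unchanged when either vector is multiplied by a positive constant, only the relative weights of the attributes influence the fitness.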
The pheromone information is saved in an m×2 table called phero, where phero[i,1] and phero[i,0] represent respectively the pheromone amounts when ai is present in the BJI and when it is not. The table phero is initialized with a small value equal to 0.1, to simulate the fact that real ants initially deposit a very small amount of pheromone on the ground when starting their exploration. The algorithm ACS-BJIS can be outlined as follows:
for i = 1 to MaxIter do
begin
    for k = 1 to NbAnts do
    begin
        generate a random initial solution s;
        build a solution s';
        s'' = improve(s');
        update the online pheromone for s'';
        if BJI conditions(s'') are verified then
            if f(s'') > f(best) then best := s'';
    end;
    apply offline-update of pheromone;
    if f(best) > f(worst) then (* worst appears at the end of queue *)
    begin
        if (S - size(queue) < size(best)) then
        begin
            remove worst from queue;
            size(queue) = size(queue) - size(worst);
        end;
        insert best in queue;
        size(queue) = size(queue) + size(best);
    end;
end;

In order to capture the best BJIs determined so far, a sorted dynamic data structure managed as a FIFO (First In First Out) queue is used to keep the most interesting BJIs built during the ant process. The best BJI found is put at the head of the queue, the worst one appears at its end, and the BJIs of intermediate solution quality are kept in between. The size of the queue depends on the storage limit S. Therefore, when the queue is full, the current BJI to insert is compared with the worst BJI located at the end of the structure. If it has better quality, the worst BJI is removed when the available space in S is not sufficient to insert the current BJI. The latter is then inserted at the appropriate position, so that the queue remains sorted by solution quality. The procedure that builds a solution is described as follows:

procedure build(var s: bji)
begin
    for i = 1 to MaxChanges do
    begin
        generate a random number r ∈ [0,1];
        if (r […] r then flip(ai);
        put ai in the taboo list;
        end;
    end;
end;

$$P(a_i) = \begin{cases} 1 & \text{if } a_i \text{ has } \arg\max\ \mathit{phero}[i,j]^{\alpha}\,\mathit{heur}[i,j]^{\beta} \\ 0 & \text{else} \end{cases} \qquad (1)$$
$$P(a_i) = \frac{\mathit{phero}[i,j]^{\alpha}\,\mathit{heur}[i,j]^{\beta}}{\sum_{k} \mathit{phero}[l,k]^{\alpha}\,\mathit{heur}[l,k]^{\beta}} \qquad (2)$$

6. Conclusions
In this paper, an original ACO algorithm namely ACS-BJIS has been designed for the selection of bitmap join indexes to take up the scalability challenge. Through the performed experiments, we have observed that ACS-BJIS is suited to large scale datasets and has achieved a performance level exceeding those of the previous works.
The probability of selecting the attribute ai is computed using pheromone and heuristic values weighted respectively by the parameters α and β. They are set to 0.8 and 0.2 respectively, to give more importance to the pheromone. The heuristic is computed by rule (3).
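Rules (1) and (2) can be illustrated as follows. This is a sketch under assumptions rather than the authors' implementation: `phero` and `heur` are taken to be m×2 nested lists indexed as in the text, the denominator of rule (2) is assumed to run over every entry (l, k) of the tables, and the function names and sample values are invented for the example.

```python
ALPHA = 0.8  # weight of the pheromone, as set in the text
BETA = 0.2   # weight of the heuristic, as set in the text


def score(phero, heur, i, j):
    """Numerator shared by rules (1) and (2)."""
    return (phero[i][j] ** ALPHA) * (heur[i][j] ** BETA)


def greedy_choice(phero, heur):
    """Rule (1): the (attribute, form) pair with the highest score."""
    pairs = [(i, j) for i in range(len(phero)) for j in (0, 1)]
    return max(pairs, key=lambda p: score(phero, heur, p[0], p[1]))


def probability(phero, heur, i, j):
    """Rule (2): score of (i, j) normalised over all entries (assumption)."""
    total = sum(score(phero, heur, l, k)
                for l in range(len(phero)) for k in (0, 1))
    return score(phero, heur, i, j) / total if total > 0 else 0.0


# Made-up tables for m = 3 attributes; column 1 means "attribute present".
phero = [[0.1, 0.1], [0.1, 0.4], [0.1, 0.1]]
heur = [[0.2, 0.3], [0.1, 0.5], [0.4, 0.2]]
print(greedy_choice(phero, heur))
print(probability(phero, heur, 1, 1))
```

With α larger than β, an entry that has accumulated more pheromone dominates the choice even when its heuristic value is only moderately higher than the others.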
$$\mathit{heur}[i,j] = \frac{\sum_{bji \in BJIs(a_i, j)} f(bji)}{\text{total } f(bji)} \qquad (3)$$
BJIs(ai, j) is the set of BJIs in which ai has the Boolean form j (with or without a negative connector). The pheromone updating strategies simulate the evaporation of the pheromone followed by a pheromone production mechanism. The evaporation phenomenon gives rise to rule (4) appearing below. The parameter ρ (0