2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Bees Swarm Optimization Based Approach for Web Information Retrieval

Habiba Drias, Hadia Mosteghanemi
Department of Computer Science, USTHB, LRIA, Algiers, Algeria
[email protected]

Abstract. This paper deals with large scale information retrieval, aiming at contributing to web search. The collections of documents considered are huge and hard to tackle with classical approaches: the greater the number of documents in the collection, the more powerful the approach required. A Bees Swarm Optimization algorithm called BSO-IR is designed to explore the prohibitive number of documents and find the information needed by the user. Extensive experiments were performed on the CACM and RCV1 collections and on larger corpora in order to show the benefit gained from using such an approach instead of the classic one. Performances in terms of solution quality and runtime are compared between BSO and exact algorithms. Numerical results exhibit the superiority of BSO-IR over previous works in terms of scalability while yielding comparable quality.

Keywords: web information retrieval; very large collections of documents; scalability; evolutionary algorithms; swarm intelligence; BSO; classic approach

1. Introduction

With the exponentially growing amount of information on the web, the classic search process lacks efficiency. Innovative tools for information retrieval (IR) have become necessary to cope with the complexity induced by this tremendous volume of information. Many research directions contribute to handling the complexity of the problem; distributed information retrieval and personalized information source selection are two examples. Recent works consider user and source profiles in order to restrict the search to the sources whose profile matches the user's [6,7]. In this manner, a lot of information is pruned and the response time of such systems becomes short.

In this study, artificial intelligence approaches, and more precisely bee swarm optimization (BSO) algorithms, are designed for this purpose. We show through this work that evolutionary approaches may help to mitigate the complexity issue. The original BSO meta-heuristic was introduced in [4] and applied successfully to the satisfiability problem. The same principles and framework are adapted here to the problem of interest. The idea behind addressing web information retrieval with a BSO-based approach is to prune the prohibitive search space so as to browse only interesting documents and therefore obtain results in a reasonable amount of time.

This meta-heuristic belongs to the vast and well recognized domain of swarm intelligence. Many works have been undertaken in this area and applied to numerous public and industrial sectors. The most widely used methodology is particle swarm optimization, known as PSO. The present article develops a BSO approach, which differs from PSO and is inspired by the collective behavior of bees. BSO is the fruit of an aggregation of individual behaviors dictated by very simple rules. It presents a self-organized working model, based on a decentralized logic and founded on the cooperation of units having only local information. Real bees communicate with one another by means of a dance: a bee performs an active dance in order to draw the attention of its congeners when, while exploring a region, it finds a rich food source. The discovered area is then exploited by the bees to the maximum, and they repeat this way of feeding until their needs are satisfied.

Motivated by the success and power of this meta-heuristic, and knowing that few heuristic search techniques have been studied for the information retrieval problem, we have designed a BSO algorithm, namely BSO-IR, to explore this useful domain. Three kinds of collections have been tested: CACM with 3204 documents, RCV1 with 804 414 documents, and larger collections generated by our own process. A comparison with the classical IR method is performed.

978-0-7695-4191-4/10 $26.00 © 2010 IEEE DOI 10.1109/WI-IAT.2010.179


2. Information retrieval problem

An information retrieval system handles and manages a collection of documents structured in an internal representation built by an indexing process. Information retrieval consists in finding the set of documents containing the information expressed in a query that specifies the user's needs. The process involves a matching mechanism between the query and the documents of the collection. Four components are therefore central to such a process:
- The document, which can be a text, a web page, an image or a video. A document is usually represented by a set of terms or keywords extracted from its source.
- The query, which represents a need expressed by a user and specified in a formalism adopted by the system.
- The similarity function, which measures the similarity between a document and a query.
- Two system evaluations, which are widely used: precision, the fraction of retrieved documents that are relevant, and recall, the fraction of relevant documents that are retrieved.

An important step in an IR system is the indexing process, in which an internal organization of the documents, the terms and the queries is determined so that these components can be accessed efficiently. Besides the indexing process, the documents and the queries must be described according to a model. Many IR models exist in the literature, such as the Boolean model, the vector space model and the probabilistic model. The most widely used, which is also appropriate for meta-heuristics, is the vector space model. In this model, documents as well as queries are represented as vectors of weights, each weight denoting the importance of the corresponding term in the document or in the query. The vector space is built during the indexing process and contains all the terms that the system encounters. Consider the vector space (t1, t2, t3, …, tn), where ti is a term or keyword for i = 1 to n. For each term, we maintain a structure that contains all the documents including the term; the weight of the term in each document is stored alongside the document in the list. The whole collection is represented by a vector containing all the documents, indexed by document number: C = (d1, d2, d3, …, dm). Each element of C points to a list containing all the terms of the document with their respective weights, sorted by term number. The query is modelled exactly as a document.

The weight of a term in a document is computed using the expression tf*idf, where tf is the term frequency in the document and idf is the inverse document frequency, usually computed as idf = log(m/df), where m is the number of documents and df is the number of documents that contain the term. The component tf indicates the importance of the term for the document, while idf expresses the discriminative power of the term. In this way, a term with a high tf*idf value is at the same time important in the document and infrequent in the others. The weights for a query are computed in the same manner. The similarity of a document d and a query q is then computed using the Cosine formula:

f(d, q) = Σi (ai * bi) / ( (Σi ai²)^(1/2) * (Σi bi²)^(1/2) )   (Cosine)

where ai and bi are the weights of term ti in the document and in the query, respectively.
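The tf*idf weighting and Cosine similarity above can be sketched in a few lines of Python. This is a minimal illustration on a toy in-memory collection, not the authors' C# implementation; documents are simply lists of terms.

```python
import math

def tfidf_vectors(docs):
    """Build tf*idf weight vectors for a small toy collection.
    docs: list of documents, each given as a list of terms."""
    m = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # df: number of documents containing each term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        vec = []
        for t in vocab:
            tf = d.count(t)                      # term frequency in the document
            idf = math.log(m / df[t])            # inverse document frequency
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors

def cosine(a, b):
    """Cosine similarity f(d, q) between two weight vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0
```

A query is weighted with the same vocabulary and passed to `cosine` against each document vector.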

2.1. Complexity Issues

The classic approach described in [11] yields an exact algorithm. The principle is, for each term of the query, to search the inverted file for all documents containing the term and to compute the similarity between each such document and the query. The consulted documents are then sorted according to their similarity to the query. This process has a complexity in O(n*m). When n and m are reasonable, the technique is very efficient. However, in an environment like the web, the number of documents and the number of keywords are prohibitive, and the parameters n and m reach such magnitudes that the computation becomes intractable. This is why it is important to find an alternative for addressing information retrieval in this context. Meta-heuristics make it possible to obtain a polynomial response time, but at the expense of solution quality. To limit this loss, every component of the search method must be carefully designed and implemented, and therein lies the whole difficulty.
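The classic exact process can be sketched as follows. The dictionary-based index layout and function names here are illustrative assumptions, not the implementation of [11]; for brevity the ranking uses the dot product, i.e. the numerator of the Cosine formula of section 2.

```python
def exact_search(inverted, query_weights, top=10):
    """Exhaustive scoring over an inverted file, as in the classic exact
    approach: every document containing a query term is consulted, then
    the consulted documents are sorted by similarity to the query.
    inverted: term -> {doc_id: weight}; query_weights: term -> weight."""
    scores = {}
    for term, qw in query_weights.items():
        # traverse the posting list of each query term
        for doc, w in inverted.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + qw * w
    # sort the consulted documents by decreasing similarity
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]
```

Every posting of every query term is visited, which is exactly the O(n*m) behaviour that becomes prohibitive at web scale.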

2.2. Discussion

In information retrieval, the landscape of the search space is not homogeneous: similar documents may sit in a region that is hard to access because it is surrounded by barren regions. In other words, the promising region can be isolated like an oasis. This situation does not arise in all instances, but it can happen especially when the number of terms greatly exceeds the number of documents. Since in the real world the terms we use are well defined and stored in a dictionary, while documents can be created in unlimited quantity, the number of documents determines the shape of the search space. Two kinds of situations therefore appear. When the number of documents is small, the landscape of the search space cannot be managed easily by heuristic search techniques, and the exact approach is better suited. On the contrary, when the number of documents is huge, the search space is more compact and heuristic search can perform a great job. Heuristic search methods can even help distributed information retrieval systems, where user and source profiles are used to direct the search, to speed up their response time. This observation led us to construct corpora from CACM, which is a small collection, by adding a considerable number of documents in order to test our approach.

2.3. Related works

Few works have attempted to address information retrieval with meta-heuristics, mostly with genetic algorithms [1,10,13]. The only work that deals with large scale information retrieval is [8], where the authors proposed a hybrid search technique based on genetic algorithms and ant colony optimization. The reason for this scarcity may be the lack of availability of large scale corpora; some very recent works are interested in building such corpora.

3. BSO meta-heuristic

In 1946, Karl von Frisch, while decoding the language of bees, observed that it is through a dance that a bee communicates to its fellows, upon its return to the hive, the distance, the direction and the wealth of a food source. Bees of a same colony visit more than a dozen potential exploitation areas, but the colony concentrates its harvesting effort on a small number among them, the richest and the easiest to access. In addition, numerous observations show that a colony can quickly shift its exploitation from one source to another. In their experiment conducted in 1991, Seeley, Camazine and Sneyd showed that when a colony of bees has the choice between two food sources whose sugar concentrations are very unequal, situated diametrically opposite each other with respect to the hive, one to the north and the other to the south, the colony moves towards the richest one to concentrate its harvesting effort. In this phenomenon, the swarm follows the bee that performs the most vigorous dance, hence the one that indicates the location of the richest food source [3].

The meta-heuristic "Bees Swarm Optimization" is inspired by the above description of collective bee behavior. It handles artificial bees that imitate the feeding and working style of real bees in solving problems. First, a bee named BeeInit settles down to find a solution presenting good features, which we call Sref, and from which the other solutions of the search space are determined via a certain strategy. The set of these solutions is called SearchArea. Then, every bee takes a solution from SearchArea as its starting point in the search. After accomplishing its search, every bee communicates the best solution it visited to its fellows through a structure named Dance. One of the solutions of this list becomes the new reference solution for the next iteration of the process. In order to avoid cycles, the reference solution is stored each time in a taboo list. The choice of the reference solution is made first according to a quality criterion. However, if after a period of time the swarm notes that the solution does not progress in quality, it resorts to a second criterion, diversity, which allows it to escape from the region where it is possibly trapped. The BSO algorithm is outlined as follows:

begin
    let Sref be the solution found by BeeInit;
    while (MaxIter not reached) do
        insert Sref in the taboo list;
        determine SearchArea from Sref;
        assign a solution of SearchArea to each bee;
        for each Bee k do
            search starting from the assigned solution;
            store the result in Dance;
        endfor;
        compute the new reference solution Sref;
    endwhile;
end;
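The BSO loop described above can be sketched generically in Python. The helper names `determine_search_area` and `local_search` are hypothetical placeholders for the problem-specific components that the following sections instantiate for IR; this is a sketch of the framework, not the authors' implementation.

```python
def bso(fitness, init_solution, determine_search_area, local_search,
        max_iter=30, n_bees=10):
    """Generic BSO skeleton following the outline in the text.
    `determine_search_area(sref, n_bees)` returns one starting solution per
    bee; `local_search(s)` is the search performed by each bee."""
    sref = init_solution()          # solution found by BeeInit
    taboo = []                      # taboo list of past reference solutions
    best = sref
    for _ in range(max_iter):
        taboo.append(sref)
        area = determine_search_area(sref, n_bees)
        dance = []                  # best solution found by each bee
        for start in area:          # one bee per starting point
            dance.append(local_search(start))
        # new reference solution: best quality not already in the taboo list
        candidates = [s for s in dance if s not in taboo] or dance
        sref = max(candidates, key=fitness)
        best = max(best, sref, key=fitness)
    return best
```

The diversity criterion of the full algorithm (section 4.7) is omitted here; this skeleton only shows the quality-driven iteration.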

4. BSO-IR Algorithm

In this section, we present the bee swarm optimization algorithm designed for information retrieval, called BSO-IR. Adapting the meta-heuristic to IR requires the design of the following components: the artificial world where the bees live, the fitness function that evaluates solutions, the initial solution Sref, the strategies to determine the set of solutions SearchArea from Sref, the search procedure performed by each artificial bee, the quality and diversity strategies, and the choice

rules that select the reference solution Sref and allow the process to iterate. Let us first start with the description of the problem modelling.

4.1. Solutions coding

On the basis of the CACM collection, which contains more than 3000 documents and more than 6000 terms in total, we adopt the data structures described in section 2. The whole collection is a table indexed by document number, containing pointers to lists of the terms of each document; next to each term we store its weight in the document. Each list is sorted by term number. This dynamic structure allows terms to be inserted and removed, which implements the move from one state to another in the search space of the BSO algorithm. Similarly, the terms are stored in a table of pointers to lists of the documents in which they appear; each list is sorted by document number, again allowing insertions and removals. Sorting the lists speeds up element lookup: the search stops when the element is found or when the current pointer reaches an element greater than the one sought. The search space of the algorithm is restricted to the vector of terms since, as in the classic approaches, the search is directed by the terms; in this way the approach gains in efficacy. However, when the collection of documents is huge, building the inverted file is too time consuming and cannot be completed. In that situation, the set of documents is considered instead.

4.2. The initial solution

Initially, BeeInit builds the reference solution Sref via a heuristic. The heuristic we use exploits the information contained in the query, because the documents the process searches for must share as many terms as possible with the query.

4.3. The determination of SearchArea

The exploitation region SearchArea includes k solutions, k being the number of bees constituting the swarm. Each of these solutions is computed from Sref by introducing some terms and removing others at positions separated from each other by a gap equal to a multiple of the constant Flip. The choice of the parameter Flip determines the number of terms to change in Sref. Indeed, a too small value of this number implies that Sref is probably close to the local optimum of the new exploitation region, so the probability of an improvement is very weak. On the other hand, if this value is too large, the swarm moves away from the region containing Sref, at the risk of losing good solutions. To perform the changes, we propose two strategies ensuring that the obtained solutions are as distinct as possible. If the number of generated solutions proves insufficient, a random technique is used to complete the rest.

The first strategy: a solution s is generated by flipping terms ti of Sref as follows:

begin
    h = 0;
    while size of SearchArea not reached and h < Flip do
        s = Sref; p = 0;
        repeat
            if the term Flip*p+h exists in s then remove it from s
            else insert it in s;
            p = p+1;
        until Flip*p+h ≥ k;
        SearchArea = SearchArea ∪ {s};
        h = h+1;
    endwhile;
end;

The second strategy: here Sref is viewed as a set of contiguous packets of terms, and a solution s is generated by changing the terms ti of one packet of Sref, that is:

begin
    h = 0;
    while size of SearchArea not reached and h < Flip do
        s = Sref; p = 0;
        repeat
            if the term (k/Flip)*h+p exists in s then remove it from s
            else insert it in s;
            p = p+1;
        until p ≥ k/Flip;
        SearchArea = SearchArea ∪ {s};
        h = h+1;
    endwhile;
end;

As an example, let n = 20 be the number of terms and Flip = 5. If the terms are numbered from 1 to 20, the first strategy flips the terms (1,6,11,16), (2,7,12,17), (3,8,13,18), (4,9,14,19) and (5,10,15,20), while the second strategy inverts the terms (1,2,3,4), (5,6,7,8), (9,10,11,12), (13,14,15,16) and (17,18,19,20). Flipping or inverting a term of the solution s consists in removing it from s if it is present, or adding it to s otherwise.
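The two strategies can be sketched as follows, treating a solution as a set of term indices drawn from range(n) (0-indexed, unlike the 1-indexed example above). This is an illustrative reconstruction of the pseudocode, not the authors' code.

```python
def flip_strategy_1(sref, n, flip):
    """Strategy 1: for each offset h, toggle terms h, h+flip, h+2*flip, ...
    sref is a set of term indices drawn from range(n)."""
    area = []
    for h in range(flip):
        s = set(sref)
        for t in range(h, n, flip):               # terms separated by a gap of `flip`
            s.symmetric_difference_update({t})    # flip: remove if present, else add
        area.append(s)
    return area

def flip_strategy_2(sref, n, flip):
    """Strategy 2: split the n terms into `flip` contiguous packets and
    toggle one whole packet per generated solution."""
    area = []
    size = n // flip
    for h in range(flip):
        s = set(sref)
        for t in range(h * size, (h + 1) * size): # one contiguous packet
            s.symmetric_difference_update({t})
        area.append(s)
    return area
```

With n = 20 and flip = 5, the first strategy toggles {0,5,10,15}, {1,6,11,16}, … and the second toggles {0,1,2,3}, {4,5,6,7}, …, matching the example in the text.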

4.4. The solution quality

The quality of a solution is measured by the fitness function f expressing the similarity between the solution and the query. f is computed as described in section 2.

4.5. The degree of diversity

The degree of diversity offered by a solution s is measured by the minimum distance between s and the elements of the taboo list Taboo:

diversity(s) = min {distance(s, x), x ∈ Taboo},

where distance(s, x) = |s ∩ x|, that is, the number of terms shared by s and x.

4.6. The bee search process

The bee search process is iterative; the number of iterations, MaxIter, is an empirical parameter. The process consists of two phases:
- a simple local search;
- an improvement technique that flips as many terms as possible of the solution found in the first phase, provided the fitness of the new solution remains superior or equal to it.

4.7. The choice of the reference solution Sref

Let Sref(t) be the reference solution chosen at iteration t. The choice of Sref(t+1) depends on the quantity Δf computed as Δf = f(Sbest) - f(Sref(t)), Sbest being the best solution at iteration t+1. The algorithm is as follows:

begin
    if Δf > 0 then
        Sref = the best solution in quality;
        if NbChances < MaxChances then NbChances = MaxChances; endif
    else
        NbChances = NbChances - 1;
        if NbChances > 0 then
            Sref = the best solution in quality
        else
            Sref = the best solution in diversity;
            NbChances = MaxChances;
        endif
    endif
end;

Remarks. "Sref is the best solution in quality" means f(Sref) = max f(s) over the solutions s belonging to Dance and not to the taboo list. "Sref is the best solution in diversity" means diversity(Sref) = max diversity(s) over s in Dance. If two solutions s1 and s2 are equal in quality, that is, if they have the same value of the objective function, the one with the larger degree of diversity is preferred; likewise, if they present the same degree of diversity, the one with the better fitness is chosen. It can happen, although very rarely, that all solutions of Dance are in the taboo list; to overcome this problem, the reference solution is then generated at random. MaxChances is an empirical parameter designating the maximum number of chances granted before creating a new search region SearchArea.

5. Experimental Results

In order to test the performance of the designed algorithm, we have carried out a series of extensive experiments. The first consists in setting the empirical parameters that yield high solution quality, such as the bee colony size, the maximum number of iterations and the maximum number of changes in the improvement procedure. A second step tests the performance of the algorithm itself. The algorithms were implemented in C# on a personal computer. The tested collections are the well known CACM and RCV1 and large scale collections generated from CACM by the following process. CACM has 3204 documents and 6468 terms, whereas RCV1 possesses 804 414 documents and 47 236 terms.

5.1. A large scale benchmarking process

In order to test our algorithm on a larger scale collection, we have built a benchmark from the real CACM collection. The idea is to increase the size of the collection by including in it all possible concatenations of documents that are very similar to one another. This way, the size of the collection grows exponentially. The queries are the same as those of the original collection and the relevance judgments are extended to the created documents that include the initially judged documents. The approach comprises two phases. The first consists in classifying the documents into clusters of very close similarity. In the second phase, new documents are generated by concatenating documents of each cluster in all possible ways, that is by 2, by 3 and so on, until all the documents of the cluster are concatenated together. The concatenation of documents is performed by merely merging their respective indexes.
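The selection rule of section 4.7 can be sketched as follows. The `state` dictionary and function names are illustrative assumptions; solutions are any comparable objects with a fitness `f` and a `diversity` measure.

```python
def choose_sref(dance, taboo, f, diversity, state):
    """Choice of the reference solution (section 4.7), as a sketch.
    `state` carries nb_chances, max_chances and the previous f(Sref).
    Returns None when all dancers are taboo, in which case the caller
    generates a random reference solution."""
    fresh = [s for s in dance if s not in taboo]
    if not fresh:
        return None
    # best in quality; ties broken by the larger degree of diversity
    best_quality = max(fresh, key=lambda s: (f(s), diversity(s)))
    delta = f(best_quality) - state["f_sref"]
    if delta > 0:
        state["nb_chances"] = state["max_chances"]
        sref = best_quality
    else:
        state["nb_chances"] -= 1
        if state["nb_chances"] > 0:
            sref = best_quality
        else:
            # quality stagnates: switch to the diversity criterion
            sref = max(dance, key=lambda s: (diversity(s), f(s)))
            state["nb_chances"] = state["max_chances"]
    state["f_sref"] = f(sref)
    return sref
```

As long as Δf > 0, quality drives the choice; after max_chances iterations without improvement, the most diverse dancer is promoted to escape the current region.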

5.1.1. Clustering of documents

The algorithm that clusters the documents of a collection on the basis of their similarity is itself a BSO based algorithm, designed within the same framework as the one described previously for information retrieval. The main differences are threefold: first, the classes are initially created by a diversification generator of documents; second, BSO is called for each class to fill it with documents; third, each time an artificial bee finds a good solution, it inserts it in a dynamic list representing the cluster. This list, managed as a FIFO (First In First Out) queue, is sorted according to the similarity of the documents and its size is restricted to r, the number of documents per cluster. With this constraint, only a document whose similarity is greater than that of the element located at the end of the queue is inserted in the cluster. The algorithm is outlined as follows:

Input: a collection C of m documents
Output: p clusters of r documents each
begin
    C' = C;
    initialize p queues to empty;
    determine p diversified documents from Sref;
    (* same process as the SearchArea determination *)
    assign a document to each group;
    for each group p do
        create SearchArea of size equal to k;
        assign a solution to each bee;
        for each Bee k do
            search starting from the assigned solution;
            let s be the result of the search;
            store s in the Dance list;
        endfor;
        if f(s, p) > f(queue.end, p) then
            if queue full then remove queue.end; endif;
            insert s in queue;
        endif;
    endfor;
end;

5.1.2. Corpus generation

The second phase deals with the construction of the benchmark. The idea is to create new documents from similar documents belonging to the same class. The algorithm is as follows:

Input: p clusters of r documents each
Output: a collection of generated documents
begin
    for each cluster p do
        for each document d of the cluster do
            merge d with the other documents of cluster p, 2 by 2, then 3 by 3, and so on;
            suppress d from the cluster p;
        endfor;
    endfor;
end;

The CACM collection, called CACM1 in Table I, was thus transformed into four collections called CACM2, CACM3, CACM4 and CACM5, as shown in Table I.

TABLE I. GENERATED COLLECTIONS

collection   #clusters   cluster size   #doc
CACM1        3204        1              3 204
CACM2        640         5              19 844
CACM3        320         10             327 364
CACM4        213         15             6 979 380
CACM5        160         20             167 772 004

Figure 1 shows how the size of the generated corpora grows exponentially with the cluster size. The aim of producing large scale collections is thus achieved. In the following we describe the experimental study undertaken to demonstrate the efficiency of BSO on such tremendous corpora.
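The #doc column of Table I follows directly from the construction: each cluster of r documents yields all 2^r - 1 nonempty concatenations of its members, and the few leftover documents that fill no complete cluster are assumed to be kept unchanged (a reconstruction of the counting, not stated explicitly in the text).

```python
def generated_size(n_docs, cluster_size):
    """Size of a corpus generated from n_docs documents grouped into
    clusters of `cluster_size` documents: each cluster of r documents
    yields 2**r - 1 nonempty concatenations; leftover documents that
    fill no complete cluster are kept as is."""
    n_clusters, rest = divmod(n_docs, cluster_size)
    return n_clusters * (2 ** cluster_size - 1) + rest

# Reproducing the #doc column of Table I from CACM's 3204 documents:
for r in (1, 5, 10, 15, 20):
    print(r, generated_size(3204, r))
```

This reproduces the exponential growth of Figure 1: 3 204 documents for r = 1 up to 167 772 004 for r = 20.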

Figure 1. The exponential growth of the size of the generated collections


The advantage of the evolutionary approach lies in its response time, where it shows its superiority over the exact algorithm. The time factor is very important since information retrieval is performed online and thus requires fast reactivity from the system.

5.2. Setting the parameters

Preliminary tests have been carried out in order to fix the key parameters of BSO-IR. Figure 2 shows an example of the results of the tests done on CACM1 for setting the number of iterations of the algorithm. Identical tests have been performed for all the other parameters. Table II summarizes the parameter values obtained after these extensive experiments for CACM1 and CACM5.

Figure 2. Setting the maximum number of iterations for CACM1

Figure 3. Comparison of BSO-IR and the exact algorithm performances for CACM1

TABLE II. EMPIRICAL PARAMETER VALUES

parameter    CACM1   CACM5
SearchArea   30      50
MaxChances   5       8
Flip         40      70
MaxIter      30      55

5.3. Comparison of BSO-IR and the classical IR algorithm

An exact algorithm based on a classical IR approach [11] was developed for comparison purposes. It computes the solution by an exhaustive search. The following series of experiments were performed in order to analyze the behavior of BSO-IR relative to the exact algorithm. We vary the collection of documents and compare the performance of the algorithms in terms of precision/recall and runtime. Figures 3 and 4 show the comparison between BSO-IR and the exact algorithm for the collections CACM1 and CACM5, respectively. We observe a slight superiority of the classic approach in the first case, that is for the small collection, and the same performance for both algorithms on the huge collection.

Figure 4. Comparison of BSO-IR and the exact algorithm performances for CACM5

Table III exhibits similarity and running time values of the exact and BSO algorithms on the RCV1 collection for four queries. Note that both algorithms reach almost the same similarity while BSO is faster than the exact algorithm. Another experiment shows this gap to be less significant for CACM, because of its smaller size. Figures 5 and 6 show the runtime of both algorithms when processing CACM1 and

CACM5, respectively. Note that the runtime increases exponentially for the classic algorithm, whereas it grows almost linearly for BSO-IR.

TABLE III. COMPARISON BETWEEN EXACT AND BSO RUNTIME FOR THE RCV1 COLLECTION

algorithm   query   similarity   Time (s)
exact       1       0.31         206
exact       2       0.62         216
exact       3       0.79         215
exact       4       0.56         207
BSO         1       0.31         6
BSO         2       0.67         81
BSO         3       0.75         99
BSO         4       0.56         87

Figure 5. Comparison of runtimes for BSO-IR and the exact algorithm

6. Conclusions

In this paper, a bee swarm optimization algorithm named BSO-IR has been designed for information retrieval. The aim of this study is the adaptation of heuristic search techniques to large scale IR and their comparison with classical approaches. Experimental tests have been conducted on the well known CACM and RCV1 collections and also on very large benchmarks generated from CACM for test purposes. The approach designed to construct the large collections is original and makes it possible to increase the scale of any collection. Through the undertaken experiments we have observed that, concerning solution quality, the exact algorithm achieves slightly better results than BSO-IR for the small collection, while for large scale corpora both algorithms have comparable efficacy. In terms of running time, however, BSO-IR reaches a performance level exceeding that of the exact algorithm. We can therefore conclude that BSO-IR is better suited to large scale information retrieval than the classical IR method.

As future work, we plan to hybridize meta-heuristics with distributed information retrieval approaches to better address web information retrieval. Another aim is to design and develop a meta-optimization generator that automatically produces the empirical parameters for any meta-heuristic in general, and for BSO-IR in particular, in order to improve its performance. Indeed, the manual tuning of parameters is a hard and tedious task and may not reach optimality in spite of the extensive experiments one can perform.

References

[1] A. A. R. Ahmed, A. A. L. Bahgat, A. A. Abdel Mgeid and A. S. Osman, "Using Genetic Algorithm to Improve Information Retrieval Systems," World Academy of Science, Engineering and Technology, WASET'06, vol. 17, pp. 6-12, 2006.
[2] R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval," Addison Wesley Longman Publishing Co. Inc., 1999.
[3] E. Bonabeau, M. Dorigo and G. Theraulaz, "Swarm Intelligence: From Natural to Artificial Systems," Oxford University Press, 1999.
[4] H. Drias, S. Sadeg and S. Yahi, "Cooperative Bees Swarm for Solving the Maximum Weighted Satisfiability Problem," IWANN 2005, pp. 318-325, 2005.
[5] H. Chen, "Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning and Genetic Algorithms," Journal of the American Society for Information Science, 46(3), pp. 194-216, 1995.
[6] S. Kechid and H. Drias, "Personalizing the Source Selection and the Result Merging Process," International Journal on Artificial Intelligence Tools, 18(2), pp. 331-354, 2009.
[7] S. Kechid and H. Drias, "Multi-agent System for Personalizing Information Source Selection," Web Intelligence 2009, pp. 588-595, 2009.
[8] P. Kromer, V. Snasel, J. Platos and A. Abraham, "Implicit User Modelling Using Hybrid Meta-Heuristics," IEEE HIS, pp. 42-47, 2008.
[9] C. D. Manning, P. Raghavan and H. Schutze, "Introduction to Information Retrieval," Cambridge University Press, 2008.
[10] P. Pathak, M. Gordon and W. Fan, "Effective Information Retrieval using Genetic Algorithms based Matching Functions Adaptation," 33rd IEEE HICSS, 2000.
[11] G. Salton and C. Buckley, "Term-weighting Approaches in Automatic Text Retrieval," Information Processing and Management, 24(5), pp. 513-523, 1988.
[12] C. J. van Rijsbergen, "Information Retrieval," Information Retrieval Group, University of Glasgow, 1979.
[13] D. Vrajitoru, "Crossover Improvement for the Genetic Algorithm in Information Retrieval," Information Processing and Management, 34(4), pp. 405-415, 1998.