2015 1st International Conference on Next Generation Computing Technologies (NGCT-2015) Dehradun, India, 4-5 September 2015

Rank Aggregation Using Multi Objective Genetic Algorithm

Manjeet Kaur∗, Parneet Kaur† and Manpreet Singh‡

∗ CSE Department, ACE and AR, Devsthali, Ambala, Haryana, India, [email protected]
† Assistant Professor, CSE Department, ACE and AR, Devsthali, Ambala, Haryana, India, [email protected]
‡ Assistant Professor, IT and CSE Department, GNDEC, Ludhiana, Punjab, India, [email protected]

Abstract: Rank aggregation combines many different rank orderings of the same set of alternatives, or candidates, in order to obtain a better ordering. The aim is to merge a number of ranked lists into a single superior ranked list. Various methods exist for the rank aggregation problem. In this paper, rank aggregation is implemented using a genetic approach. Because multiple objectives are pursued, the approach is called a Multi-Objective Genetic Algorithm. The results of the genetic approach are compared with those of the Stuart method and the Mean method. The experiments show that the performance of GA lies between that of the Stuart and Mean methods: in most cases, the Stuart method gives better results than GA, and GA gives better results than the Mean method.

Keywords: Rank Aggregation, GA, Stuart, Mean

978-1-4673-6808-7/15/$31.00 ©2015 IEEE

I. INTRODUCTION

Today the Internet is becoming an increasingly important part of daily life. The web is now used not only to find information on particular topics but also to carry out different tasks. As users become more sophisticated, their queries become more challenging for web search engines. When a user enters a query, the search engine returns a set of results containing thousands of links to different websites. It is impossible for the user to go over all of them, so the user opens only a few of them to find the answer to the query; the rest are of no use to the user. Search engines must therefore place the best results at the top so that the user can find the answer within those top results. Every search engine uses some criteria to rank pages according to their importance with respect to the user's query. It is well known that search engines overlap very little in terms of data coverage, so users get a different ranked list from each search engine for the same query. This creates the need for rank aggregation to find the best ranked list.

A. Rank Aggregation

Rank aggregation is a method used to combine many different rank orderings of the same set of candidates, or alternatives, in order to get a better rank ordering. It is basically used in the field of voting [11, 13]. The aim of rank aggregation is to merge a number of ranked lists into a single superior ranked list. The concept is mainly required in situations where the task of ranking a list of several alternatives based on one or more criteria is encountered. In such situations, one of the goals of rank aggregation is to identify the best alternative. Rank aggregation is needed to provide users with the best search results for their queries by combining the ranked lists of different search engines, because web users depend heavily on the top results provided by the search engines. A number of meta search engines based on the idea of rank aggregation are in use nowadays. A meta search engine fetches the top results related to a query from the databases of different search engines and thus provides the user with a result combined from those search engines.

B. Rank Aggregation as an Optimization Problem

The rank aggregation problem is concerned with finding a consensus ranking that represents and reflects the combined rankings of many different search engines and is as close as possible to all the individual ranked lists simultaneously [1]. Finding such a consensus ranking is an NP-hard problem, so in general it cannot be guaranteed that the aggregated list obtained is the best possible list; it is the best list found by the optimization at that time. Optimization techniques are used for solving NP-hard problems, and they can therefore also be used for solving the rank aggregation problem [1, 3]. The genetic algorithm approach is one of the optimization techniques that play an important role in solving NP-hard problems and is routinely used to generate useful solutions to optimization and search problems. In this paper, the genetic algorithm approach is used to implement rank aggregation.
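To make the optimization view concrete, a consensus list can be scored by its total distance to the input lists. The following is a minimal sketch in Python under stated assumptions: the paper does not name its distance measure, so the Kendall tau distance is assumed here, all input lists are assumed to rank the same items, and the function names are hypothetical. The exhaustive search is only feasible for very short lists, which is why heuristic optimizers such as a GA are of interest.

```python
from itertools import combinations, permutations


def kendall_tau_distance(a, b):
    """Count the item pairs that rankings a and b place in opposite order."""
    pos_b = {item: i for i, item in enumerate(b)}
    # combinations(a, 2) yields pairs (x, y) with x ranked before y in a
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])


def total_distance(candidate, ranked_lists):
    """Sum of the distances from a candidate consensus to every input list."""
    return sum(kendall_tau_distance(candidate, r) for r in ranked_lists)


def brute_force_consensus(ranked_lists):
    """Try every permutation; the cost grows factorially with list length."""
    items = ranked_lists[0]
    return min(permutations(items),
               key=lambda c: total_distance(c, ranked_lists))


# Hypothetical toy input: three rankings of the same four items.
lists = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["b", "a", "c", "d"],
]
best = brute_force_consensus(lists)
print(list(best), total_distance(best, lists))
```

With four items there are only 24 candidate lists, but with the top 20 results of a search engine there would be more than 10^18, so exhaustive search is not an option in practice.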

II. RELATED STUDY

A lot of research has been carried out in this field in order to provide the user with a genuine, filtered and relevant list for a query. Some of the major contributions are given below.

The paper [3] addressed the combination of ranking results from different sources. The authors developed a set of techniques for rank aggregation and compared their performance with that of other well-known methods. Their aim was to design a rank aggregation technique that reduces spam, a very serious problem in web searches, and they studied the rank aggregation problem in the context of the web.

The paper [7] described a polynomial-time algorithm to compute the footrule-optimal aggregation for full lists and applied a parallel genetic algorithm to rank aggregation. Two approaches can be used for this. In the first approach, each processor operates independently on an isolated subset of the population and periodically shares the "fittest" chromosomes through migration. In the second approach, each step of the GA, i.e. selection, mutation and crossover, is performed in parallel among the processors. Of these two approaches, the first is used in this paper.

The paper [18] proposed a Genetic PageRank Algorithm (GPRA), which is based on the PageRank algorithm used by the well-known search engine Google. Google classifies web pages according to the pertinence scores given by PageRank, which are computed from the graph structure of the web. Besides preserving the advantages of PageRank, GPRA takes advantage of the genetic algorithm to address web search. The experimental results of that paper show that the performance of GPRA is superior to that of both the genetic algorithm and the PageRank algorithm.

The authors of [20] presented a novel robust rank aggregation (RRA) method. This method detects genes that are ranked consistently better than expected under the null hypothesis of uncorrelated inputs and assigns a significance score to each gene. The probabilistic model used in this method makes the algorithm parameter-free and robust to outliers, noise and errors. The authors compared their method with the Average Rank method and the Stuart method and reported that the properties of RRA make it better than these two methods.

TABLE I. SEARCH RESULTS FOR QUERY 1

TABLE II. SEARCH RESULTS FOR QUERY 2

III. PROPOSED METHODOLOGY

Genetic Algorithm is a multi-objective method that can be used to aggregate a number of ranked lists in an unbiased way [8]. Two objectives are considered in this work. The first is to minimize the total distance of the input rankings from the reference ranking. The second is to minimize the standard deviation among those distances in order to avoid bias toward a particular input ranking. In this work, rank aggregation for a set of queries has been implemented using the multi-objective Genetic Algorithm approach. The results of the GA approach are compared with the results of the Stuart and Mean methods, and the computation time of the three methods is also compared.

IV. IMPLEMENTATION

The Genetic Algorithm technique has been implemented for the rank aggregation problem using MATLAB.

A. Inputs

For this work, five different queries (Architecture, Information Retrieval, Chocolate, Cheese, and Data Mining) were considered. These queries were chosen randomly and fed to five different search engines: www.google.com, www.yahoo.com, www.bing.com, www.info.com, and www.aol.com. Experiments were carried out on result sets of different sizes. The results for the queries Architecture and Information Retrieval are shown in Tables I and II, respectively. Similarly, the top 4-5 URLs were obtained for the other three queries.
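The two objectives of the proposed methodology (total distance to the input rankings, and the standard deviation of those distances) can be evaluated on input lists of this kind as sketched below. This is an illustration in Python rather than the authors' MATLAB code. The Spearman footrule distance, the use of the population standard deviation, the function names and the placeholder URLs u1..u4 are all assumptions, and every input list is assumed to rank the same set of URLs.

```python
from statistics import pstdev


def footrule_distance(candidate, ranking):
    """Spearman footrule: sum of absolute position differences per URL."""
    pos = {url: i for i, url in enumerate(ranking)}
    return sum(abs(i - pos[url]) for i, url in enumerate(candidate))


def objectives(candidate, input_rankings):
    """Return (total distance, standard deviation of the distances)."""
    distances = [footrule_distance(candidate, r) for r in input_rankings]
    return sum(distances), pstdev(distances)


# Hypothetical ranked lists from three engines over placeholder URLs.
engines = [
    ["u1", "u2", "u3", "u4"],
    ["u2", "u1", "u3", "u4"],
    ["u1", "u3", "u2", "u4"],
]
total, spread = objectives(["u1", "u2", "u3", "u4"], engines)
print(total, round(spread, 4))
```

A candidate with a small total distance but a large spread sits very close to one engine and far from the others; the second objective penalizes exactly that kind of bias.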

B. Implementation steps for GA

The steps followed by GA during the implementation of rank aggregation are described below.

Initialization: The population can be initialized either randomly or as specified by the user. In this work, the population is specified by the user rather than initialized randomly. The population is set from the ordered lists of size k (the number of elements in each list), which form the initial population of possible solutions to this optimization problem. The population size is important: the larger the population, the better the chance that it contains, at some point, the optimal solution.

Encoding: Encoding is the process of representing the individuals of the population in a form suitable for the GA. The encoding scheme used in this work is value encoding, in which every chromosome is a string of values connected with the problem. Here, URLs are taken as chromosomes.
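A minimal sketch of the initialization and value-encoding steps, written in Python for illustration only: each chromosome is represented as a tuple of URLs in rank order, and the initial population is taken directly from the user-supplied ranked lists. The function names and the example.org URLs are hypothetical, and the check that all lists rank the same k URLs is an assumption made here so that every chromosome is a permutation of the same set.

```python
def encode(ranked_list):
    """Value encoding: the chromosome is the sequence of URLs in rank order."""
    return tuple(ranked_list)


def initial_population(ranked_lists):
    """Build the initial population from the user-specified ordered lists."""
    reference = set(ranked_lists[0])
    k = len(ranked_lists[0])
    for ranking in ranked_lists:
        # Assumption of this sketch: every list ranks the same k URLs.
        if len(ranking) != k or set(ranking) != reference:
            raise ValueError("all ranked lists must rank the same k URLs")
    return [encode(ranking) for ranking in ranked_lists]


# Hypothetical top-3 results from two engines.
population = initial_population([
    ["example.org/a", "example.org/b", "example.org/c"],
    ["example.org/b", "example.org/a", "example.org/c"],
])
print(population)
```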

TABLE III. AGGREGATED LIST AND SCORES OBTAINED FROM THREE METHODS FOR QUERY 1 (ARCHITECTURE)

Selection: A fitness function is used to compute the fitness value of each member of the population. In this work, the fitness function maps the rank values of individuals to a scale of 0 to 5. The best, average and worst fitness are then calculated. A proportionate selection scheme is used, which selects the chromosomes for crossover based on their fitness values.

Cross-over: The selected members are then crossed over with the crossover probability. If a randomly generated number is greater than the crossover probability, the parents are copied to the offspring solutions (i.e. no crossover is performed); otherwise single-point crossover is performed.

Mutation: Crossover only allows the mixing of ordered lists, but a more drastic event is required to bring radically new solutions into the population pool. These are introduced by mutations, which happen with the mutation probability. If a randomly generated number is greater than the mutation probability, the solutions are copied unchanged (i.e. no mutation is performed); otherwise the swapping method of mutation is used to mutate the offspring.

Convergence: The algorithm stops if the "optimal" list remains optimal for a number of consecutive generations. To ensure that the algorithm eventually stops, a maximum number of generations is set in advance, which terminates the execution regardless of whether the first condition holds. If neither the maximum number of generations has been reached nor the "optimal" list has stayed unchanged during the last few generations, the algorithm returns to the Selection step.

V. RESULTS AND DISCUSSIONS
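For reference when reading the results, the GA steps of Section IV-B (proportionate selection, single-point crossover, swap mutation and the two stopping conditions) can be sketched in Python as follows. This is not the authors' MATLAB implementation. Several details are assumptions made for the sketch: the footrule distance, the cost that simply adds the two objectives, the inverse-cost selection weights (the paper instead maps ranks to a 0 to 5 fitness scale), the repair step that fills the crossover tail from the other parent so that each child stays a valid permutation, and all parameter values and names.

```python
import random
from statistics import pstdev


def footrule(a, b):
    """Spearman footrule: summed position differences between two rankings."""
    pos = {url: i for i, url in enumerate(b)}
    return sum(abs(i - pos[url]) for i, url in enumerate(a))


def cost(chrom, inputs):
    """Assumed scalarization: total distance plus spread of the distances."""
    dists = [footrule(chrom, r) for r in inputs]
    return sum(dists) + pstdev(dists)


def select(pop, inputs, rng):
    """Proportionate (roulette-wheel) selection; lower cost, larger weight."""
    weights = [1.0 / (1.0 + cost(c, inputs)) for c in pop]
    return rng.choices(pop, weights=weights, k=len(pop))


def crossover(p1, p2, p_cross, rng):
    """Single cut point; the tail is filled from the other parent in order."""
    if rng.random() > p_cross:      # no crossover: parents are copied
        return p1, p2
    cut = rng.randrange(1, len(p1))

    def child(a, b):
        head = a[:cut]
        return head + tuple(u for u in b if u not in head)

    return child(p1, p2), child(p2, p1)


def mutate(chrom, p_mut, rng):
    """Swap mutation: exchange two randomly chosen positions."""
    if rng.random() > p_mut:        # no mutation: copied unchanged
        return chrom
    i, j = rng.sample(range(len(chrom)), 2)
    genes = list(chrom)
    genes[i], genes[j] = genes[j], genes[i]
    return tuple(genes)


def run_ga(inputs, pop_size=20, p_cross=0.8, p_mut=0.2,
           max_gen=200, patience=20, seed=0):
    """Return the best list found; pop_size is assumed to be even."""
    rng = random.Random(seed)
    # User-specified start: the input lists, repeated to fill the population.
    pop = [tuple(inputs[i % len(inputs)]) for i in range(pop_size)]
    best = min(pop, key=lambda c: cost(c, inputs))
    stale = 0
    for _ in range(max_gen):        # hard cap on the number of generations
        parents = select(pop, inputs, rng)
        pop = []
        for a, b in zip(parents[::2], parents[1::2]):
            c1, c2 = crossover(a, b, p_cross, rng)
            pop += [mutate(c1, p_mut, rng), mutate(c2, p_mut, rng)]
        gen_best = min(pop, key=lambda c: cost(c, inputs))
        if cost(gen_best, inputs) < cost(best, inputs):
            best, stale = gen_best, 0
        else:
            stale += 1
        if stale >= patience:       # "optimal" list unchanged long enough
            break
    return list(best)


# Hypothetical ranked lists over placeholder URLs u1..u5.
inputs = [
    ["u1", "u2", "u3", "u4", "u5"],
    ["u2", "u1", "u3", "u5", "u4"],
    ["u1", "u3", "u2", "u4", "u5"],
]
consensus = run_ga(inputs)
print(consensus, cost(tuple(consensus), inputs))
```

The fixed seed makes a run repeatable. The many cost evaluations per generation in a loop of this kind are also consistent with the longer running time reported for GA later in this section.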

In this work, the Genetic Algorithm approach is implemented for aggregating a number of ranked lists retrieved from five different web search engines. For each element in the list, the algorithm looks at how the element is positioned in the ranked lists and compares this to the baseline case in which all the preference lists are randomly shuffled. As a result, a P-value is assigned to every item, showing how much better it is positioned in the ranked lists than expected. This P-value is used both for re-ranking the elements and for deciding their significance. When the inputs are given to GA, it provides significance scores for the final rankings that show how much higher an element is placed in the input lists than expected. The lower the significance score, the better the rank of the element in the list. The results of the genetic approach are compared with those of two other methods: 1) the Stuart method and 2) the Mean method.

Stuart Method: Stuart et al. [5] were the first to utilize order statistics in rank aggregation. The computational scheme for their method was further optimized by Aerts. This algorithm compares the actual rankings with the expected behavior of uncorrelated rankings, re-ranks the items and assigns significance scores. While being robust to noise, this method requires simulations to define significance thresholds.

Mean Method: Mean is a built-in function of MATLAB. This method works by considering the positions of the elements in the input lists. Based on their positions, it calculates the average position of each element. The general equation for calculating the mean is given in Eq. (1).

$$M(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (1)$$
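As an illustration of the Mean method, the sketch below averages each URL's positions over the input lists and sorts by that average. It is a plain Python stand-in, not the MATLAB function used in the paper. The function names are hypothetical, ranks are assumed to start at 1, and a URL missing from a list is simply skipped, which mirrors treating missing values as absent rather than as zero.

```python
def mean_ranks(ranked_lists):
    """Average 1-based position of every URL over the lists containing it."""
    positions = {}
    for ranking in ranked_lists:
        for rank, url in enumerate(ranking, start=1):
            positions.setdefault(url, []).append(rank)
    return {url: sum(r) / len(r) for url, r in positions.items()}


def aggregate_by_mean(ranked_lists):
    """Order URLs by mean rank; ties are broken alphabetically."""
    means = mean_ranks(ranked_lists)
    return sorted(means, key=lambda url: (means[url], url))


# Hypothetical rankings of three items by three engines.
lists = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["a", "c", "b"],
]
print(mean_ranks(lists))
print(aggregate_by_mean(lists))
```

Here item a has ranks 1, 2 and 1, so its mean rank is 4/3, and it is placed first in the aggregated list.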

In Eq. (1), M is the vector containing the mean value of the variables x1, x2, ..., xn, and n is the number of variables. N = nanmean(Y, DIM) returns the sample mean of a series object Y along the dimension DIM of Y, treating NaNs as missing values. N is a row vector containing the mean value of the non-NaN elements in each series. The Mean method is very similar to the Borda count method: the Borda method calculates the sum of the ranks assigned to the elements of the ranked lists, whereas the Mean method calculates the average of the ranks by dividing that sum by the number of elements in the ranked list.

A. Results

Query1: Architecture

When the results obtained for Query1 (Architecture) are given as input to the GA, Stuart and Mean methods, the score values and aggregated lists obtained from the three methods are as shown in Table III.

Explanation: Fig. 1(a) shows the graphical representation of the significance scores assigned by the three methods (GA, Stuart, and Mean) to the elements (URLs) of the ranked list for Query1 (Architecture).

GA vs. Stuart: The graph in Fig. 1(a) shows that GA gives lower scores than Stuart for two URLs (3rd and 4th) out of four. For the 1st and 2nd URLs, Stuart gives better results. The overall result, i.e. the aggregated list, obtained from the two methods is the same.

GA vs. Mean: The graph in Fig. 1(a) shows that GA gives better results than Mean for three URLs (2nd, 3rd and 4th) out of four. For the 1st URL, the results of GA and Mean are the same. The overall result, i.e. the aggregated list, obtained from the two methods is also the same.

Query2: Information Retrieval

Fig. 1. (a) Comparison graph of the three methods for Query1. (b) Graph of best, average and worst fitness vs. generations. (c) Graph of best individuals vs. generations.

TABLE IV. AGGREGATED LIST AND SCORES OBTAINED FROM THREE METHODS FOR QUERY 2 (INFORMATION RETRIEVAL).

Fig. 2. (a) Comparison graph of the three methods for Query2. (b) Graph of best, average and worst fitness vs. generations. (c) Graph of best individuals vs. generations.

TABLE V. COMPUTATION TIME TAKEN BY STUART AND GA.

When the results obtained for Query2 (Information Retrieval) are given as input to the GA, Stuart and Mean methods, the score values and aggregated lists obtained from the three methods are as shown in Table IV.

Explanation: Fig. 2(a) shows the graphical representation of the significance scores assigned by the three methods (GA, Stuart, and Mean) to the elements (URLs) of the ranked list for Query2 (Information Retrieval).

GA vs. Stuart: The graph in Fig. 2(a) shows that GA gives lower scores than Stuart for two URLs (4th and 5th) out of five. For the 1st, 2nd and 3rd URLs, Stuart gives better results. The overall result, i.e. the aggregated list, obtained from the two methods is the same, but in this case Stuart performs better than GA.

GA vs. Mean: The graph in Fig. 2(a) shows that GA gives better results than Mean for three URLs (3rd, 4th and 5th) out of five. For the 1st and 2nd URLs, the results of GA and Mean are the same. The overall result, i.e. the aggregated list, obtained from the two methods is also the same, but GA performs better than Mean in this case.

Similarly, results were obtained for the other three queries; the tables showing their score values and aggregated lists are given in the appendix.

B. Computation time

The computation time of the GA, Stuart and Mean methods was measured, and the comparison graph is shown in Fig. 3.

Fig. 3. Comparison graph of computation time for Stuart, Mean and GA.

The graph in Fig. 3 shows that the time taken by the GA method is much greater than that of the Stuart and Mean methods. This is the main disadvantage of the genetic approach: it takes a long time to execute because of the large number of iterations involved.

VI. CONCLUSION AND FUTURE SCOPE

In this work, GA is used to implement rank aggregation, and its results are compared with those of the Stuart and Mean methods. The results show that the final aggregated lists given by the three methods are similar; the difference lies in the significance scores that the three methods assign to the URLs. Compared with the Stuart method, GA gives better significance scores for less than 50% of the elements in the list, so overall the Stuart method performs better than the Genetic Algorithm approach for rank aggregation. Compared with the Mean method, on the other hand, GA gives better results for about 70% of the elements. When rank aggregation is implemented using the GA approach, a filtered, optimized and rectified aggregated list is obtained, which can be used in many applications such as the design of meta search engines and spam reduction. In future, other soft computing techniques such as fuzzy technology and neural networks can be used to solve the problems of this area. Researchers can also use core rank aggregation methods such as the Markov chain method and the simulated game data method to work in this field.

ACKNOWLEDGEMENTS

The authors are grateful to all who supported and contributed to the completion of this research work. The authors would also like to thank Ambala College of Engineering and Applied Research for providing the MATLAB software used for the simulation work.

REFERENCES

[1] Amy N. Langville and Carl D. Meyer, Rank Aggregation-Part 1, in Who's #1? The Science of Rating and Ranking, 2012, pp. 187-212.
[2] Merijn Van Erp and Lambert Schomaker, Variants of Borda count method for combining ranked classifier hypothesis, L.R.B. Schomaker and L.G. Vuurpijl (Eds.), Proceedings of the Seventh International Workshop on Frontiers in Handwriting Recognition, 2000, pp. 443-452.
[3] Dwork C., Kumar R., Naor M., and Sivakumar D., Rank Aggregation Methods for the Web, Proceedings of the 10th World Wide Web Conference, Hong Kong, 2001, pp. 613-622.
[4] Javed A. Aslam and Mark Montague, Models for metasearch, Proceedings of the 24th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, USA, ACM/Springer, September 2001, pp. 276-284.
[5] Stuart J.M., Segal E., Koller D., and Kim S.K., A gene-coexpression network for global discovery of conserved genetic modules, Science, Volume 302, 2003, pp. 249-255.
[6] Franz Rothlauf and David E. Goldberg, Redundant representation in evolutionary computation, vol. 11, No. 4, 2003, pp. 381-416.
[7] M. M. Sufyan Beg, Parallel rank aggregation for the World Wide Web, Proceedings of Intelligent Sensing and Information Processing, IEEE International Conference, 2004, pp. 385-390.
[8] Maheswara Prasad Kasinadhuni, Michael L. Gargano, Joseph DeCicco, and William Edelson, Self-Adaptation in genetic algorithms using multiple genomic redundant representations, Congressus Numerantium, 167, 2004, pp. 183-192.
[9] Judit Bar-Ilan, Comparing rankings of search results on the web, Information Processing & Management 41 (6), 2005, pp. 1511-1519.
[10] Michael L. Gargano, Maheswara Prasad Kasinadhuni, Rank aggregation for meta-search engines using Self-Adaptation in genetic algorithms using multiple genomic redundant representations, Congressus Numerantium, No. 176, 2006, pp. 25-31.
[11] Yao Yu, Chen Xinmeng and Zhu Shanfeng, Rank Aggregation Algorithms Based on Voting Model for Metasearch, Proceedings of IEEE International Conference, 2006, pp. 1-4.
[12] Richard Harvey and Michael L. Gargano, Minimal Edge-Ordering Spanning Trees using a Self-Adaptating Genetic Algorithm with Multiple Genomic Representations, Congressus Numerantium, No. 239, Volume 180, 2007, pp. 21-31.
[13] Yu-Ting Liu, Tie-Yan Liu, Tao Qin, Zhi-Ming Ma, and Hang Li, Supervised Rank Aggregation, Search Quality and Precision, 2007, pp. 481-489.
[14] Nir Ailon, Aggregation of Partial Rankings, p-Ratings and Top-m Lists, Algorithmica, Springer-Verlag, Volume 57, 2008, pp. 284-300.
[15] Mohamed A. Soliman, Ihab F. Ilyas, Ranking with Uncertain Scores, Proceedings of IEEE International Conference on Data Engineering, 2009, pp. 317-328.
[16] Felipe Bravo-Marquez, Gaston L'Huillier, Sebastian A. Rios, Juan D. Velasquez, and Luis A. Guerrero, DOCODE-Lite: A Meta-Search Engine for Document Similarity Retrieval, Proceedings of 14th international conference on Knowledge-based and intelligent information and engineering systems, Springer-Verlag, 2010, pp. 93-102.
[17] Arijit De, Elizabeth Diaz, Vijay V. Raghavan, Search Engine Result Aggregation Using Analytical Hierarchy Process, Proceedings of Web Intelligence and Intelligent Agent Technology (WI-IAT), IEEE/WIC/ACM International Conference, 2010, pp. 300-303.
[18] Lili Yan, Zhanji Gui, Wencai Du, Qingju Guo, An Improved PageRank Method based on Genetic Algorithm for Web Search, Proceedings of Advanced in Control Engineering and Information Science, Elsevier, Volume 15, 2011, pp. 2983-2987.
[19] Vasyl Pihur, Somnath Datta, Susmita Datta, RankAggreg, an R package for weighted rank aggregation, BMC Bioinformatics, 2012, pp. 1-20.
[20] Raivo Kolde, Sven Laur, Priit Adler, Jaak Vilo, Robust rank aggregation for gene list integration and meta-analysis, Bioinformatics, Volume 28, Issue 4, 2012, pp. 573-580.
[21] David Houcque, Introduction to MATLAB for Engineering Students, version 1.2, 2005.
