Metasearching Using Modified Rough Set Based Rank Aggregation
Iram Naim (1) and Rashid Ali (2)
(1) Department of Computer Engineering & Information Technology, M.J.P. Rohilkhand University, Bareilly, U.P., India, [email protected]
(2) Department of Computer Engineering, A.M.U., Aligarh, U.P., India, [email protected]
Abstract— A metasearch engine is a search engine that uses the results of several search engines to produce a collated list of search results. The success of a metasearch engine depends directly on the rank aggregation technique underlying it. User satisfaction is the most important factor in measuring the quality of search results. We therefore propose a metasearch system that models user feedback based metasearching. Our system uses modified rough set based rank aggregation. In the modified rough set based rank aggregation technique, we incorporate the confidence of the rules in predicting a class for a given record: a score variable is associated with the predicted class of the record, and the value of the variable equals the confidence measure of the rule. For each query in the training set, we mine ranking rules using rough set theory and select the best rule set by performing a cross-validation test. Once the system is trained, the best rule set can be used to obtain an overall ranking of the results returned by the different search systems in response to new queries. The modified rough set based rank aggregation technique has previously been used for the performance evaluation of Web search systems; here, we apply it to metasearching. We also evaluate the proposed system with the help of three independent evaluators. We compare our method with metasearching based on rough set based rank aggregation and find that our method performs better.

Keywords— rank aggregation, supervised learning, metasearching, rough set.
I. INTRODUCTION
Metasearching is the process of retrieving and combining information from multiple sources that effectively index a common data set, the Web. In the context of the World Wide Web, rank aggregation is a frequently used method in metasearch applications. For Web information retrieval, data in the form of individual rankers is abundant (for example, Google, Yahoo and MSN), and these rankers are generally based upon ranking algorithms that combine information retrieval methods, link based algorithms and other techniques to compute the relevance of Web pages to a given query. Unfortunately, the query results of different rankers differ from each other due to differences in ranking criteria and in the specific algorithms and databases employed by the individual rankers. It is therefore desirable to develop a metasearch system that satisfies a user's information need. For this, user feedback should be taken into account. The user feedback may be explicit or implicit in nature. In
this paper, we present a method for user feedback based metasearching that learns ranking rules using rough set theory and uses them to estimate an aggregated ranking from the rankings obtained from the participating search systems. Our system learns the rules on the basis of the user's ranking, which is available for a given set of rankings in the training set. This paper is organized as follows. In Section II, we briefly look at the background and related work. In Section III, we discuss the proposed method for metasearching, which models user feedback based metasearching using modified rough set based rank aggregation. We show our results in Section IV. Finally, we conclude in Section V.

II. BACKGROUND AND RELATED WORK
Let us begin with some useful definitions and notations used in rank aggregation. Then, we discuss previous work done in the area of rank aggregation.
A. Useful Definitions
[1]. Full List. Given a set of entities S, let V be a subset of S and assume that there is a total order among the entities in V. τ is called a ranking list with respect to S if τ is a list of the entities in V maintaining the same total order relation, i.e., τ = [d1, d2, ..., dm] if d1 > d2 > ... > dm, di ∈ V, i = 1, ..., m, where > denotes the order relation and m denotes the size of V. If V equals S, τ is called a full list [1],[2],[3],[4],[5],[6],[7].
[2]. Partial List. If V does not equal S, τ is called a partial list. A special case of a partial list is a top-t list, in which only the first t entities are ordered in the list [1],[2],[3],[4],[5],[6],[7].
[3]. Rank Aggregation. Given a set of n candidates, say C = {1, 2, 3, ..., n}, a set of m voters, say V = {1, 2, 3, ..., m}, and a ranked list li on C for each voter i, where li(j) < li(k) indicates that voter i prefers candidate j to candidate k, rank aggregation is the process of combining the ranked lists l1, l2, l3, ..., lm into a single list of candidates, say l, that represents the collective choice of the voters. The function used to obtain l from the ranked lists l1, l2, l3, ..., lm (i.e., f(l1, l2, l3, ..., lm) = l) is known as the rank aggregation function [2].
[4]. Spearman Correlation Coefficient. Let the full lists [u1, u2, ..., un] and [v1, v2, ..., vn] be two rankings for some query Q. The Spearman rank order correlation coefficient (rs) [7] between these two rankings is defined by equation (1) as follows:
r_s = 1 - \frac{6 \sum_{i=1}^{n} (u_i - v_i)^2}{n(n^2 - 1)}          (1)

[5]. Modified Spearman Correlation Coefficient. Without loss of generality, assume that the full list is given as [1, 2, ..., n]. Let the partial list be given as [v1, v2, ..., vm]. The modified Spearman rank order correlation coefficient (r's) [7] between these two rankings is defined by equation (2) as follows:

(2)
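For reference, the following is a minimal Python sketch of the full-list coefficient in Eq. (1) (an illustration, not the authors' code); the two input lists hold the rank positions that two rankers assign to the same n entities.

```python
def spearman(u, v):
    """Spearman coefficient r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    for two full rankings u and v of the same n entities."""
    assert len(u) == len(v)
    n = len(u)
    d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return 1.0 - (6.0 * d2) / (n * (n ** 2 - 1))

# Example: two rankers ordering five documents.
print(spearman([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))  # 0.8
```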
B. Related Work
From earlier studies, rank aggregation techniques can be classified into two categories: unsupervised rank aggregation and supervised rank aggregation.
[6] Unsupervised rank aggregation. Unsupervised rank aggregation refers to a function that takes the input ranking lists, performs the aggregation and produces an output ranking list without any constraint on a target output, i.e., the output ranking list is not produced on the basis of previous examples or training data. The Borda count method [1],[2],[6], Markov chain based rank aggregation [1], median based rank aggregation [8], genetic algorithm based rank aggregation [4],[5], fuzzy logic based rank aggregation [3] and classification algorithm based rank aggregation [9] are some of the known unsupervised rank aggregation techniques.
[7] Supervised rank aggregation. Supervised rank aggregation also takes the input ranking lists, performs the aggregation and produces an output ranking list, but here the output is constrained by a target value (the user feedback, in the case of metasearching), i.e., the output ranking list should be mapped to the target output. Rough set based rank aggregation [2], modified rough set based rank aggregation [7] and the supervised MC2 method for rank aggregation [10] are some of the known supervised rank aggregation techniques.
C. Rough Set Based Rank Aggregation
Rough set based rank aggregation [2] is a user feedback based technique for rank aggregation, which learns ranking rules using rough set theory. For learning the ranking rules, user feedback is obtained on the search results returned by the search engines in response to a set of queries, and the ranking rules are mined using rough set theory. In the rough set approach, it is assumed that any vague concept is replaced by a pair of precise concepts, called the lower and the upper approximation of the vague concept. Therefore, with each rough set, two crisp sets, called the lower and the upper approximation of the rough set, are associated. The lower approximation consists of all objects which surely belong to the set, and the upper approximation contains all objects which possibly belong to the set. The difference between the upper and the lower approximation constitutes the boundary region of the rough set. For each equivalence class present in the lower approximation, a certain rule can be drawn. A possible rule can also be drawn from each equivalence class present in the upper approximation.
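To illustrate these two approximations, the following small Python sketch (a toy example, not the paper's implementation) computes the lower and upper approximations of a target set from the equivalence classes induced by identical condition-attribute values.

```python
from collections import defaultdict

def approximations(objects, target):
    """objects: dict object-id -> tuple of condition attribute values;
    target: set of object ids forming the concept X to be approximated."""
    classes = defaultdict(set)
    for oid, values in objects.items():
        classes[values].add(oid)              # equivalence classes of E_A
    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= target:                # wholly inside X -> certain rules
            lower |= eq_class
        if eq_class & target:                 # overlaps X -> possible rules
            upper |= eq_class
    return lower, upper

# Objects 1 and 2 are indiscernible, so only object 3 surely belongs to X = {1, 3}.
objs = {1: (0, 1), 2: (0, 1), 3: (1, 0), 4: (1, 1)}
print(approximations(objs, {1, 3}))           # ({3}, {1, 2, 3})
```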
III. METASEARCHING USING MODIFIED ROUGH SET BASED RANK AGGREGATION
In this section, we discuss the details of metasearching using modified rough set based rank aggregation. A limitation of rough set based aggregation is that a record does not necessarily belong to a particular class according to a particular rule. So, we build an improved version of rough set based rank aggregation by incorporating the confidence measure of a rule. The whole process of metasearching using modified rough set based rank aggregation can be divided into two phases: (i) the learning phase (training phase) and (ii) the running phase (testing phase).
A. Learning Stage/Training Stage
In the learning stage of the proposed metasearch system, the user gives his query to the search engines and obtains the search results arranged by each search engine's ranking. We collect the top few results (a fixed number) from every participating search engine, and these search results are combined and presented before the user. Let the cardinality of the union U of all the lists from the different search engines be n. The user feedback on the set U is obtained implicitly [2],[7],[11]. We now have the search results from the participating search engines and the user feedback ranking RUF on the union U of the ranking lists. From these, we build a ranked information table using the ranked lists from the participating search engines and the user feedback based ranking RUF. If the number of participating search engines is m, we have a total of m+1 rankings, and the ranked information table has m+1 columns corresponding to these rankings: m for the participating search engines and one for the user feedback based ranking. We assign a real valued score to each document by placing the value -k in cell (i, j) if document i ∈ U is present at the k-th position in the j-th ranking. If document i ∈ U is not present in the j-th ranking at all, which is possible in the case of external metasearching, we place the value -(n+1) in cell (i, j). After the formation of the ranked information table, we build the binary information table. In the binary information table, an equivalence relation EA can be defined for any subset of attributes A ⊆ At. The attribute corresponding to the overall ranking partitions all pairs of objects into two disjoint classes. In our case, RUF, the ranking of documents obtained implicitly from the user, is the overall ranking of the documents. The lower and upper approximations of each class can then be obtained based on the attributes in A. Reducts and the core of the attributes in A can also be found to eliminate the redundant ones. Then, for each equivalence class present in the lower approximation, a certain rule can be drawn. A possible rule can also be drawn from each equivalence class present in the upper approximation. These possible rules are useful in the case of larger data sets, where inconsistencies may reduce the lower approximation and hence hinder the mining of strong rules.
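To make the table construction concrete, the following Python sketch builds a ranked information table and a pairwise binary information table for one query. It reflects our reading of [2] and [7] (in particular, treating the rows of the binary table as ordered document pairs), not the authors' implementation.

```python
from itertools import permutations

def ranked_table(rankings, universe):
    """rankings: list of m ranked lists of document ids; universe: the union U.
    A document at position k in ranking j gets score -k; a document missing
    from ranking j gets -(n + 1), where n = |U| (external metasearching)."""
    n = len(universe)
    table = {}
    for doc in universe:
        table[doc] = [-(r.index(doc) + 1) if doc in r else -(n + 1) for r in rankings]
    return table

def binary_table(table, user_feedback):
    """For every ordered pair (di, dj), condition attribute j is 1 if ranker j
    scores di above dj, else 0; the decision attribute (from the user-feedback
    ranking R_UF) is defined in the same way."""
    rows = {}
    for di, dj in permutations(table, 2):
        conditions = [1 if a > b else 0 for a, b in zip(table[di], table[dj])]
        decision = 1 if user_feedback.index(di) < user_feedback.index(dj) else 0
        rows[(di, dj)] = conditions + [decision]
    return rows

# Toy example: three documents, two participating search engines, one feedback list.
rt = ranked_table([["d1", "d2", "d3"], ["d2", "d1"]], ["d1", "d2", "d3"])
bt = binary_table(rt, ["d2", "d1", "d3"])
print(rt["d3"])   # [-3, -4]: d3 is absent from the second ranking
```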
Rosetta [12], a rough set toolkit for analyzing data, may be used to obtain a minimal set of ranking rules from the binary information table. We repeat the whole process for a good number of queries in the training set. Then, we select the best set of ranking rules, with their confidence measures, by performing cross-validation.
B. Running Stage/Testing Stage
The running stage (test stage) of the proposed rank aggregation system uses the ranking rules obtained from the learning phase to categorize new sets of examples. In the running phase, for any new query we take the m rankings, construct a ranked information table and convert it into a binary information table, following the same procedure as in the learning phase. The selected set of ranking rules, along with their confidence values, is then used for the aggregation: the rules predict the column corresponding to the user feedback, i.e., they estimate the discrete class 0 or 1 for each object. A score variable is associated with the predicted class of each record, where the value of the variable equals the confidence measure of the rule. On the basis of these predictions, a score is computed for each document, as in the rank aggregation algorithm, and the score is then used to obtain the overall ranking of the candidates (a sketch of this scoring step is given after Table 1). For validation of the mined ranking rules, the predicted user feedback based ranking and the actual user feedback based ranking are compared.
IV. EXPERIMENTS AND RESULTS
For training of our metasearch system, we submit a query to the participating search engines and get the results. We first issue the same query to all seven public search engines: AltaVista, Ask, Excite, Google, HotBot, Lycos and Yahoo. This gives us seven different result lists, one from each of the seven search engines. We then collect the top few results (say 10) from every participating search engine, and these search results are combined and presented before the user for user feedback.
A. Training Set
For experimentation, we have taken the 15 queries of the training set discussed in [2]. These queries are: measuring search quality, mining access patterns from web logs, pattern discovery from web transactions, distributed associations rule mining, document categorization query generation, term vector database, client-directory-server model, similarity measure for resource discovery, hyper-textual web search, IP routing in satellite networks, focused web crawling, concept based relevance feedback for information retrieval, parallel sorting neural network, spearman rank order correlation coefficient, and web search query benchmark. We performed all the steps discussed in Section III for each query in the training set and performed 5-fold cross-validation.
After cross-validation, we obtained the ranking rules that are used to categorize new examples. The obtained ranking rules, together with their confidence measures, are shown in Table 1.

Table 1: List of ranking rules from the learning phase
Ranking Rule | Confidence Measures
SE2(0) AND SE3(0) AND SE4(0) AND SE6(0) => SE8(0) OR SE8(1) | 0.728869, 0.271131
SE2(1) AND SE3(1) AND SE4(0) AND SE6(1) => SE8(1) OR SE8(0) | 0.852632, 0.147368
SE2(1) AND SE3(1) AND SE4(0) AND SE6(0) => SE8(1) | 1.0, 0.0
SE2(1) AND SE3(0) AND SE4(0) AND SE6(1) => SE8(1) OR SE8(0) | 0.017621, 0.982379
SE2(1) AND SE3(0) AND SE4(0) AND SE6(0) => SE8(1) OR SE8(0) | 0.010989, 0.989011
SE2(0) AND SE3(0) AND SE4(0) AND SE6(1) => SE8(0) OR SE8(1) | 0.663158, 0.336842
SE2(1) AND SE3(1) AND SE4(1) AND SE6(1) => SE8(0) OR SE8(1) | 0.27027, 0.72973
SE2(0) AND SE3(1) AND SE4(1) AND SE6(0) => SE8(0) OR SE8(1) | 0.147186, 0.852814
SE2(0) AND SE3(1) AND SE4(1) AND SE6(1) => SE8(0) | 0.0, 1.0
SE2(1) AND SE3(1) AND SE4(1) AND SE6(0) => SE8(1) | 1.0, 0.0
SE2(0) AND SE3(0) AND SE4(1) AND SE6(0) => SE8(1) OR SE8(0) | 0.043243, 0.956757
SE2(0) AND SE3(1) AND SE4(0) AND SE6(0) => SE8(0) | 0.0, 1.0
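As noted above, the following sketch illustrates how rules of this form, together with their confidence measures, could be applied in the running stage. The rule encoding and the confidence-weighted score update are our assumptions about the procedure of Section III-B, not the authors' code.

```python
# Each rule maps an antecedent over the condition attributes SE2, SE3, SE4, SE6
# to candidate decision classes for SE8 (the user-feedback column), each with a
# confidence value taken from Table 1.
RULES = [
    # (antecedent, [(predicted class, confidence), ...]) -- two rules from Table 1
    ({"SE2": 0, "SE3": 0, "SE4": 0, "SE6": 0}, [(0, 0.728869), (1, 0.271131)]),
    ({"SE2": 1, "SE3": 1, "SE4": 0, "SE6": 0}, [(1, 1.0), (0, 0.0)]),
    # ... remaining rules of Table 1 would be listed here
]

def predict(pair_attrs):
    """Return (class, confidence) for one (di, dj) pair using the first matching
    rule; the class with the larger confidence is chosen."""
    for antecedent, outcomes in RULES:
        if all(pair_attrs.get(attr) == val for attr, val in antecedent.items()):
            return max(outcomes, key=lambda oc: oc[1])
    return None  # no rule fires for this pair

def aggregate(pairs):
    """pairs: dict (di, dj) -> condition attribute values for that ordered pair.
    A pair predicted as class 1 (read here as 'di should precede dj') adds the
    rule's confidence to di's score; class 0 credits dj instead."""
    scores = {}
    for (di, dj), attrs in pairs.items():
        prediction = predict(attrs)
        if prediction is None:
            continue
        cls, confidence = prediction
        winner = di if cls == 1 else dj
        scores[winner] = scores.get(winner, 0.0) + confidence
    # Documents are ranked by decreasing aggregated score.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical pair table for two documents (attribute values are illustrative).
example = {("d1", "d2"): {"SE2": 1, "SE3": 1, "SE4": 0, "SE6": 0},
           ("d2", "d1"): {"SE2": 0, "SE3": 0, "SE4": 0, "SE6": 0}}
print(aggregate(example))  # -> ['d1'] (d2 never wins a pair in this toy input)
```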
B. Test Set
As a set of test examples, we take 10 new queries. These queries are: Data Visualization, Learning Algorithm, Correlation Coefficient, Search Engine, Optimization Tool, Lower Bound, Rank Aggregation, Meta Search, Computer Graphics, and Neural Network. We input each of the 10 queries, one by one, into the 7 search engines and collect the top 10 results. On the basis of these results, we build the ranked information table (RIT) and the binary information table (BIT).
C. Evaluation with User Feedback
For the evaluation of the proposed work, we take the union of the result lists for each query in the test set and remove redundant documents. We provide this union list to three judges and obtain three different user feedback lists according to the choices of the judges. These ranking lists are called user1, user2 and user3. We also implement rough set based rank aggregation for comparison purposes. We then calculate the aggregated list produced by our proposed rank aggregation method for each query in the test set, and compute the modified Spearman correlation coefficient between the aggregated lists from modified rough set based rank aggregation and the independent user rankings. Here, we present the results for one of the queries from the test set: the top-10 search results for the query "Learning Algorithm" obtained by the modified rough set based rank aggregation method are shown in Table 2.
Table 2: URLs for the query "Learning Algorithm"
1. http://en.wikipedia.org/wiki/Machine_learning
2. http://en.wikipedia.org/wiki/Machine_learning
3. http://ask.com/wiki/Supervised_learning
4. http://people.revoledu.com/kardi/tutorial/Learning/index.html
5. http://www.huomah.com/Search-Engines/Algorithm-Matters/SEO-Hig
6. http://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html
7. http://dli.iiit.ac.in/ijcai/IJCAI-2003/PDF/085.pdf
8. http://www.sensenetworks.com/mve_algorithm.php
9. http://www.cis.hut.fi/ahonkela/dippa/node36.html
10. http://www.cse.iitb.ac.in/saketh/phdthesis.pdf
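The combined coefficients reported below are obtained by comparing the aggregated list for each test query against the three judges' rankings. One possible way to compute such a combined value is shown in the following minimal sketch; it uses scipy's standard Spearman coefficient as a stand-in for the modified coefficient of Eq. (2), and the simple average over the three judges is our assumption, not necessarily the authors' exact procedure.

```python
from scipy.stats import spearmanr

def combined_correlation(aggregated_ranks, judge_rank_lists):
    """aggregated_ranks and each judge list give rank positions for the same
    documents in the same order; the per-judge coefficients are averaged."""
    coeffs = [spearmanr(aggregated_ranks, judge).correlation
              for judge in judge_rank_lists]
    return sum(coeffs) / len(coeffs)

# Hypothetical rank vectors for one test query and the three judges.
user1 = [1, 3, 2, 4, 5]
user2 = [2, 1, 3, 4, 5]
user3 = [1, 2, 4, 3, 5]
print(combined_correlation([1, 2, 3, 4, 5], [user1, user2, user3]))
```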
The combined modified Spearman correlation for the user rankings (i.e., the user1, user2 and user3 rankings) for rough set based and modified rough set based rank aggregation is shown in Table 3.

Table 3: Combined modified Spearman correlation coefficient
Test Query | Rough Set | Modified Rough Set
1  | 0.790956 | 0.850609
2  | 0.729493 | 0.737663
3  | 0.817949 | 0.826819
4  | 0.792994 | 0.776923
5  | 0.862597 | 0.903810
6  | 0.766667 | 0.798977
7  | 0.834892 | 0.853853
8  | 0.808690 | 0.786250
9  | 0.781301 | 0.760225
10 | 0.698788 | 0.731602
AVG | 0.7884327 | 0.8026731

The comparison of the two implemented methods, i.e., modified rough set based rank aggregation and rough set based rank aggregation, is shown pictorially in Fig. 1.
Fig. 1: Performance of different metasearching techniques (combined modified Spearman correlation coefficient: Modified Rough Set 0.80267, Rough Set 0.78843).
V. CONCLUSION
In this paper, we presented a system for implicit user feedback based metasearching that uses modified rough set based rank aggregation. The main contribution of our paper is in analyzing the performance of modified rough set based rank aggregation in the field of metasearching with the help of three independent evaluators. Our system merges the results of different search engines using ranking rules that are learned from the user's feedback. We compared our method with the rough set based rank aggregation method and observed that our method performs better.
REFERENCES
[1] Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. "Rank Aggregation Methods for the Web", In Proceedings of the 10th International World Wide Web Conference, pages 613-622, 2001.
[2] Ali, R. and Beg, M. M. S. "User Feedback based Meta-searching using Rough Set Theory", International Journal of Fuzzy Systems and Rough Systems (IJFSRS), 2008.
[3] Ahmad, N. and Beg, M. M. S. "Fuzzy Logic Based Rank Aggregation Methods for the World Wide Web", In Proceedings of the International Conference on Artificial Intelligence in Engineering and Technology, Malaysia, pages 363-368, 2002.
[4] Ahmad, N. and Beg, M. M. S. "Soft Computing Techniques for Rank Aggregation on the World Wide Web", World Wide Web: Internet and Web Information Systems, 2003.
[5] Beg, M. M. S. "Parallel Rank Aggregation for the World Wide Web", World Wide Web, Kluwer Academic Publishers, vol. 6, issue 1, pages 522, March 2004.
[6] Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. "Rank Aggregation Revisited", In Proceedings of the 10th International World Wide Web Conference, pages 554-559, 2003.
[7] Ali, R. and Beg, M. M. S. "Modified Rough Set Based Aggregation for Effective Evaluation of Web Search Systems", The 28th North American Fuzzy Information Processing Society Annual Conference (NAFIPS 2009), Cincinnati, Ohio, USA, 2009.
[8] Fagin, R., Kumar, R., and Sivakumar, D. "Efficient Similarity Search and Classification via Rank Aggregation", In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, pages 301-312, 2003.
[9] Malik, S. A., Ismail, M. and Marshall, B. "A classification algorithm for finding the optimal rank aggregation", In Proceedings of the International Workshop on the Challenges in Web Information Retrieval and Integration, pages 10-19, 2006.
[10] Liu, Y.-T., Liu, T.-Y., Qin, T., Ma, Z.-M., and Li, H. "Supervised Rank Aggregation", In Proceedings of the International World Wide Web Conference, 2007.
[11] Beg, M. M. S. "On Measurement and Enhancement of Web Search Quality", Ph.D. thesis, Department of Electrical Engineering, I.I.T. Delhi, 2002.
[12] Rosetta, a rough set toolkit for analyzing data, http://www.idi.ntnu.no/aleks/rosetta/
[13] Nuray, R. and Can, F. "Automatic ranking of information retrieval systems using data fusion", Information Processing and Management, vol. 42, issue 3, pages 595-614, 2006.
[14] Renda, M. E. and Straccia, U. "Web metasearch: Rank vs. score based rank aggregation methods", In Proceedings of the 18th Annual ACM Symposium on Applied Computing, pages 841-846, 2003.
[15] Hull, D. A., Pedersen, J. O., and Schütze, H. "Method Combination for Document Filtering", In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, pages 279-287, 1996.
[16] Jarvelin, K. and Kekalainen, J. "IR Evaluation Methods for Retrieving Highly Relevant Documents", In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, pages 41-48, 2000.
[17] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. "Comparing and aggregating rankings with ties", In Proceedings of ACM PODS, pages 47-58, 2004.