Optimizing Web Search using Spreading Activation on the Clickthrough Data

Gui-Rong Xue1, Shen Huang1, Yong Yu1, Hua-Jun Zeng2, Zheng Chen2, Wei-Ying Ma2

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University, 1954 Huashan Ave., 200030 Shanghai, P.R.China
[email protected], {shuang, yyu}@cs.sjtu.edu.cn
2 Microsoft Research Asia, 5F, Sigma Center, 49 Zhichun Road, Beijing 100080, P.R.China
{hjzeng, zhengc, wyma}@microsoft.com
Abstract. In this paper, we propose a mining algorithm that uses user click-through data to improve search performance. The algorithm first explores the relationship between queries and Web pages and mines co-visiting relationships as virtual links among the Web pages; a Spreading Activation mechanism is then used to perform query-dependent search. Our approach mitigates the incompleteness of click-through data, and experimental results on a large set of MSN click-through log data show a significant improvement in search performance over the DirectHit algorithm as well as the baseline search engine.
1 Introduction

Keyword-based retrieval in existing Web search engines often works well when users' queries are clear and specific. However, search performance often deteriorates because queries are short [1] and ambiguous, and because Web pages contain a lot of diverse and noisy information [3][6]. This problem can be partially solved by using external evidence to enrich the content of existing Web pages, the so-called surrogate document approach. One example is to use anchor texts as additional descriptions of the target Web pages, because anchor texts represent the view of a Web page held by other Web editors rather than by its own author.

Another solution is to introduce additional descriptions from click-through data, which has not been extensively studied. User click-through data can be extracted from the logs accumulated by Web search engines. These logs typically contain user-submitted search queries and the URLs of the Web pages that users clicked in the corresponding search results. Many valuable applications have been proposed along this direction, such as term suggestion [2][12], query expansion [3], and query clustering [4][8].

Building on the co-citation and bibliographic coupling methods [7][10] for finding similar papers, we propose an analogous method, co-visiting, which exploits the relationship between Web pages and queries in the click-through data to find associations among the Web pages. If two Web pages are clicked for many of the same queries, they are considered similar. We take such a co-visiting relationship as a virtual link between the Web pages, and associate with each link a weight that represents the degree of similarity between the two pages. Finally, we propose a spreading activation approach that uses the co-visiting relationships among the Web pages to re-rank the search results.
2 Spreading Activation on the Clickthrough Data
2.1 Problem Description

We define click-through data as a set of sessions, each of which is a pair consisting of a query and a Web page that the user clicked on. We assume that the Web page d in each session is relevant to the query q, because users are usually likely to click on a relevant result.
Fig.1. Interrelations between queries and Web pages

By merging identical queries and Web pages in the above sessions, click-through data can be modeled as a weighted directed bipartite graph G=(V, E), where the nodes in V represent Web pages and queries and the edges in E represent the click-throughs from a query to a clicked Web page. We divide V into two subsets Q={q1, q2, …, qm} and D={d1, d2, …, dn}, where Q represents the queries and D represents the Web pages. The problem is then to efficiently find the relationships between the nodes in D by mining the bipartite graph G. Here we propose a co-visiting mining algorithm to solve the problem.

2.2 Co-Visiting Mining (CVM)

It is easy to see that the DirectHit method could achieve good performance if the click-through data were complete, i.e. if each query were associated with all of its related documents. Unfortunately, in the real world each query is associated with only a few individual documents instead of the whole list. This data incompleteness problem makes the performance of the DirectHit algorithm drop significantly.

Drawing on co-citation analysis in the scientific literature [7][10][15], we develop an analogous approach to find similar Web pages. As shown in Fig.1, if two Web pages are clicked for mostly the same queries, the two Web pages are likely to be similar. We call Web pages with such a relationship co-visiting Web pages. Next we describe how to measure the similarity of two co-visiting Web pages using the click-through information.

Precisely, the number of visits of a Web page di, denoted visiting(di), is the number of sessions containing the Web page di. The number of co-visits of a pair of Web pages (di, dj), denoted visiting(di, dj), is the number of times the two pages are visited for the same query. The similarity S between two Web pages di and dj based on the co-visiting relationship is then computed as:

$$S(d_i, d_j) = \frac{\mathrm{visiting}(d_i, d_j)}{\mathrm{visiting}(d_i) + \mathrm{visiting}(d_j) - \mathrm{visiting}(d_i, d_j)} \qquad (1)$$
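The co-visiting similarity of Eq. (1) can be sketched in Python as follows. This is only an illustration under stated assumptions: sessions are (query, page) pairs with duplicates merged, visiting(d) is counted as the number of distinct queries that clicked d, and visiting(di, dj) as the number of distinct queries that clicked both pages. The paper does not pin the counting down this precisely, and the function name and toy sessions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def covisiting_similarity(sessions):
    """Return {(d_i, d_j): S} for every page pair that shares at least one query.

    Assumption: counts are taken over distinct queries, which makes
    Eq. (1) a Jaccard-style ratio bounded by [0, 1].
    """
    queries_of = defaultdict(set)  # page  -> distinct queries that clicked it
    pages_of = defaultdict(set)    # query -> distinct pages it clicked
    for query, page in sessions:
        queries_of[page].add(query)
        pages_of[query].add(page)

    # visiting(d_i, d_j): number of distinct queries that clicked both pages
    co_visits = defaultdict(int)
    for pages in pages_of.values():
        for d_i, d_j in combinations(sorted(pages), 2):
            co_visits[(d_i, d_j)] += 1

    similarity = {}
    for (d_i, d_j), both in co_visits.items():
        union = len(queries_of[d_i]) + len(queries_of[d_j]) - both
        similarity[(d_i, d_j)] = both / union
    return similarity


# Toy click-through log (hypothetical data).
sessions = [
    ("jaguar", "d1"), ("jaguar", "d2"),
    ("big cats", "d1"), ("big cats", "d2"),
    ("car dealer", "d2"), ("car dealer", "d3"),
]
sim = covisiting_similarity(sessions)
for pair, value in sorted(sim.items()):
    print(pair, round(value, 3))
```

In the toy log, d1 and d2 share two of three distinct queries, so their similarity is 2/3, while d1 and d3 share none and receive no virtual link at all.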
The similarity measure in Eq. (1) is scaled to [0, 1]. If the co-visiting similarity between two Web pages is greater than a minimum threshold σ, the two Web pages are treated as similar. Later experiments will show that the precision of queries associated with a given page is highest when σ is equal to 0.3.

2.3 Spreading Activation on Web Search

The technique of spreading activation is based on a model of facilitated retrieval [5] from human memory [1][9] and has been applied to the analysis of hypertext network structure in [13]. The model assumes that human memory is encoded as an associative network in which the most similar memory items have the strongest connections [16]. Retrieval by spreading activation is initiated by activating a set of cue nodes that associatively express the meaning of the nodes to be retrieved. Activation energy then spreads out from the cue nodes to all other related nodes, modulated by the network connection weights.

Following this definition, we propose to use spreading activation to re-rank Web search results by utilizing the co-visiting information among the Web pages. First, the user submits the query Q to the search engine, and the system returns the result set D of pages matching the query terms. The degree of match between a Web page di in D and Q is computed by the retrieval system (in this paper, we take BM2500 as the relevance measure between the query and Web pages). We denote the similarity between di and Q as sim(di, Q). Then, we use the spreading activation approach to propagate the similarity between di and Q to the co-visiting Web pages of di through a certain number of cycles using a propagation factor. To simplify the problem, we use a simplified version with only one cycle. In that case, the final retrieval status value of a Web page di that is co-visited with m Web pages is computed according to the following equation:

$$sim(d_i, Q) = sim(d_i, Q) + \lambda \sum_{j=1}^{m} sim(d_j, Q) \qquad (2)$$

where λ is the propagation factor and the left-hand side denotes the updated (final) value.
Finally, the search result is re-ranked according to the final similarity values between the Web pages and query.
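A minimal Python sketch of the one-cycle re-ranking in Eq. (2) is given here for illustration. Several points are assumptions rather than details stated in the paper: co-visiting neighbours are the pages whose similarity exceeds σ, only neighbours that appear in the result set contribute (pages without a score add nothing), and the value of λ used in the example is arbitrary. The function and variable names are hypothetical.

```python
def rerank(base_scores, similarity, sigma=0.3, lam=0.5):
    """One-cycle spreading activation over co-visiting links.

    base_scores: {page: sim(page, Q)} from the retrieval system
    similarity:  {(d_i, d_j): S(d_i, d_j)} co-visiting similarities
    sigma:       minimum similarity for a virtual link (0.3 in the paper)
    lam:         propagation factor (illustrative value, an assumption)
    """
    # Build the co-visiting neighbourhood, restricted to pages in the result set.
    neighbours = {page: [] for page in base_scores}
    for (d_i, d_j), s in similarity.items():
        if s > sigma and d_i in base_scores and d_j in base_scores:
            neighbours[d_i].append(d_j)
            neighbours[d_j].append(d_i)

    # Eq. (2): original score plus lam times the sum of the neighbours' scores.
    final = {
        page: score + lam * sum(base_scores[n] for n in neighbours[page])
        for page, score in base_scores.items()
    }
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


# Hypothetical retrieval scores and similarities.
base_scores = {"d1": 1.0, "d2": 0.5, "d3": 0.8}
similarity = {("d1", "d2"): 0.6, ("d2", "d3"): 0.2}
ranked = rerank(base_scores, similarity)
for page, score in ranked:
    print(page, score)
```

In this example d2 starts below d3 but is linked to the top-scoring page d1, so the propagated score moves it ahead of d3; the weak d2-d3 link falls below σ and is ignored.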
3 Experiments
3.1 Data Set

Our experiments are conducted on real click-through data extracted from the log of the MSN search engine [11] in August 2003. It contains about 1.2 million query requests recorded over three hours. Before the experiments, all queries are converted to lower case and stemmed with the Porter algorithm, and stop words are removed. Query sessions sharing the same query are merged into one large query session, with the frequencies summed up. After preprocessing, the log contains 13,894,155 sessions, 507,041 pages and 862,464 queries. We use a crawler to download the content of all Web pages contained in this log. After downloading the pages, the Okapi system [14] is used to index the full text using the BM25 formula.

3.2 Evaluation Criteria

Precision, as used in IR, is applied to measure the performance of our proposed algorithm. Given a query Q, let R be the set of pages relevant to the query and |R| be the size of that set; let A be the set of top 20 results returned by our system. Precision is defined as:

$$\mathrm{Precision} = \frac{|R \cap A|}{|A|} \qquad (3)$$
To evaluate our method more fully, we also propose a new evaluation metric, Authority. Given a query, we ask ten volunteers to identify the top 10 authoritative pages according to their own judgment. The set of 10 authoritative Web pages is denoted by M, and the set of top 10 results returned by the search engine is denoted by N. Authority is defined as:

$$\mathrm{Authority} = \frac{|M \cap N|}{|M|} \qquad (4)$$
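Both metrics are set-overlap ratios and can be sketched in Python as below. The function names and the toy data are hypothetical, the ranked list is assumed to contain no duplicate pages, and how the ten volunteers' judgments are merged into a single set M is not specified in the paper, so M is simply taken as given.

```python
def precision_at_k(relevant, ranked_results, k=20):
    """Eq. (3): |R ∩ A| / |A|, where A is the set of top-k returned pages."""
    top = set(ranked_results[:k])
    if not top:
        return 0.0
    return len(set(relevant) & top) / len(top)


def authority_at_k(authoritative, ranked_results, k=10):
    """Eq. (4): |M ∩ N| / |M|, where N is the set of top-k returned pages."""
    judged = set(authoritative)
    if not judged:
        return 0.0
    return len(judged & set(ranked_results[:k])) / len(judged)


# Hypothetical ranking and judgments, with small k for readability.
ranked_results = ["p1", "p2", "p3", "p4"]
precision = precision_at_k({"p1", "p3", "p9"}, ranked_results, k=4)
authority = authority_at_k({"p2", "p9"}, ranked_results, k=2)
print(precision, authority)
```

The two functions differ only in the denominator: Precision divides by the number of returned pages, whereas Authority divides by the number of pages judged authoritative.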
3.3 Performance

We fixed several parameters for the remaining experiments: the minimum similarity threshold is set to 0.3 and the weight of the original similarity to 0.4, both determined by extensive experiments. First, the volunteers were asked to evaluate the Precision and Authority of the search results for 20 queries. Fig.2 shows the comparison of our approach with content-based search (Content) and DirectHit (DH).
[Figure: Precision (y-axis, 0.3 to 0.8) versus Data Size (x-axis, 0 to 100%) for Content, DH, and CVM]

Fig.2. The precision on different data sizes
[Figure: Authority (y-axis, 0.35 to 0.7) versus Data Size (x-axis, 0 to 100%) for Content, DH, and CVM]
Fig.3. The authority on different data sizes

From Fig.2 and Fig.3, we find that the performance of full-text search is poor, which demonstrates the gap between the document space and the query space. When click-through data is introduced, search performance improves, and the more click-through data is introduced, the higher the performance. The co-visiting method achieves the highest performance of all the algorithms; it outperforms the DirectHit method by fully mining the implicit link relationships among the Web pages.
4 Conclusions

In this paper, we propose a novel mining algorithm that utilizes click-through data. The algorithm explores the interrelations between heterogeneous data objects and effectively finds the virtual links between Web pages, and thus deals with the issues discussed above. Experimental results on a large set of MSN click-through data show a significant improvement in search performance.
5 References

[1] A. M. Collins and E. F. Loftus. A spreading-activation theory of semantic processing. Psychological Review, 82:407-428, 1975.
[2] Chien-Kang Huang, Lee-Feng Chien, and Yen-Jen Oyang. Relevant term suggestion in interactive Web search based on contextual information in query session logs. JASIST, 54(7):638-649, 2003.
[3] H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma. Query expansion by mining user logs. IEEE Transactions on Knowledge and Data Engineering, 15(4), July/August 2003.
[4] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 407-415, 2000.
[5] D. E. Meyer and R. W. Schvaneveldt. Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90:227-234, 1971.
[6] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964-971, Nov. 1987.
[7] H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24:265-269, 1973.
[8] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang. Clustering user queries of a search engine. In Proceedings of the Tenth International World Wide Web Conference, Hong Kong, May 2001.
[9] John R. Anderson. A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22:261-295, 1983.
[10] M. M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14:10-25, 1963.
[11] MSN Search Engine, http://www.msn.com.
[12] Nicholas J. Belkin. Helping people find what they don't know. Communications of the ACM, 43(8):58-61, Aug. 2000.
[13] Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structure from the Web. In Proc. of CHI'96 (ACM), Human Factors in Computing Systems, Vancouver, Canada, April 1996.
[14] S. E. Robertson et al. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pages 109-126, 1995.
[15] R. R. Larson. Bibliometrics of the World-Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Proceedings of the Annual Meeting of the American Society for Information Science, Baltimore, Maryland, October 1996.
[16] W. Klimesch. The Structure of Long-Term Memory: A Connectivity Model of Semantic Processing. Lawrence Erlbaum Associates, Hillsdale, 1994.