PERSONALIZED SEARCH BASED ON LEARNING USER CLICK HISTORY

Cheqian Chen, Kequan Lin, Heshan Li, Shoubin Dong*
School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong, China
[email protected], [email protected], [email protected], [email protected]
*Corresponding author: Shoubin Dong

Abstract—Nowadays, Web search engines have become an indispensable tool for finding Internet resources. However, current Web search engines still have many drawbacks: they serve all users in the same way, regardless of each user's individual needs, and so cannot satisfy most users. Personalized search has been proposed to solve this problem and to improve retrieval quality. This paper investigates approaches to personalized search in depth and proposes a practical and effective method.

Keywords-Personalization; clickthrough data; search engine; user preferences

I. INTRODUCTION

Most current search engines return the same results to all users who issue the same query. This is inadequate when users have different search goals, tasks and interests. Personalization of the search engine is what users look for, and it is an important way to improve service quality.

In this paper, we tackle the adaptation problem of search engines by considering two main research issues. The first is preference mining, which discovers a user's preferences among search results from clickthrough data. The second is ranking function optimization, which optimizes the ranking (retrieval) function of the search engine according to those preferences. Clickthrough data consists of the search results the user clicked, and the ranking function presents the search results in an order that reflects the user's preferences.

This paper proposes a new method for personalized search that uses clickthrough data as the personal data. First, it extracts query expansion terms with a word-frequency-based semantic statistics method and recommends them to the user. Second, it improves the Naive Bayes classifier and combines it with an SVM to learn a personalized user model, then re-ranks the results according to that model. Experimental evaluation shows that the method is effective: it provides meaningful query expansion terms and significantly improves the ranking of results.

II. RELATED WORK

So far, personalization techniques have developed in diverse ways. In a nutshell, they can be classified into three categories: content-based personalization, link-based personalization, and function-based personalization. The approach of this paper falls into the third category.

Content-based personalization mainly uses a set of categories to represent the user's interests, extracted by mining the user's clickthrough data, and then uses these interests to improve the search results or the query expansion. Representative work includes: [1] matching the user's query to a category, which then serves as the query context for providing personalized Web results; [2] gathering query words and page summaries and assigning them to matched ODP categories to build the user's personal model.

Link-based personalization mainly modifies the importance of Web pages by incorporating the user's personal information. Representative work includes: [3] computing topic-sensitive PageRank based on the topic of the query word; [4] reducing the computational cost of personalized PageRank by sharing partial vectors.

Function-based personalization mainly designs new ranking algorithms, or transforms existing ones, to achieve personalized service. Representative work includes: [5] using the SVM algorithm to rank result pages; [6] designing an algorithm called CubeSVD to compute the relationships within the three-dimensional (user, query, page) data, obtaining a weight for each page and re-ranking accordingly.

III. ARCHITECTURE

A. The Structure of the System

The system architecture is shown in Fig. 1. It consists of a presentation layer, a data layer and an analysis layer. The presentation layer interacts with the user, the data layer records user data, and the analysis layer is the core of the system, where all the algorithms are implemented.

B. Presentation Layer

The presentation layer is a Web page through which users log in and interact with the system. Users can import their favorite bookmarks, which are recorded in the data layer. Users can also enter a query to search; the system then returns the top S (here S = 50) result pages and records the pages the user clicks. The search results and the clicked results serve as a training set.

C. Data Layer

The user's bookmark data, the clickthrough data and the results the search engine returns are stored in the data layer. The analysis layer gets training data from the data layer, builds the index and trains the user model, and then stores the index file and the user profile back in the data layer.

D. Analysis Layer

The analysis layer is the core of the system; all the algorithms are implemented here. Three algorithms are realized: the query expansion recommendation algorithm, the Naive Bayes classification algorithm and the SVM algorithm.
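To make the division of responsibilities concrete, here is a minimal Python sketch of the three layers as described above; all class and method names are our own illustrative choices, not the system's actual interfaces.

```python
# A minimal, hypothetical sketch of the three-layer architecture described
# above; names are illustrative, not the system's actual interfaces.

class DataLayer:
    """Stores bookmarks, clickthrough data, per-user index files and profiles."""
    def record_click(self, user, query, url): ...
    def training_data(self, user): ...
    def save_profile(self, user, profile): ...

class AnalysisLayer:
    """Core algorithms: query expansion, Naive Bayes-based Spy, ranking SVM."""
    def __init__(self, data):
        self.data = data
    def recommend_expansions(self, user, query): ...
    def train_user_model(self, user): ...
    def rerank(self, user, query, results): ...

class PresentationLayer:
    """Web UI: login, bookmark import, search box and click tracking."""
    def __init__(self, analysis, data):
        self.analysis, self.data = analysis, data
    def search(self, user, query, engine_results):
        # engine_results: the top S (= 50) pages the search engine returned;
        # the pages the user clicks are recorded as training data.
        return self.analysis.rerank(user, query, engine_results)
```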
Figure 1. Architecture of the personalized search engine. (The presentation layer contains the action-tracking UI, the page re-ranking view and the bookmark loader; the analysis layer contains the pre-processing module, the query expansion module, the Bayesian classification module and the SVM learning module; the data layer holds the pages table, the pages index table and the user profile table. A user query flows from the presentation layer through the analysis layer to the data layer and back.)

IV. ALGORITHM
A. Query Expansion Algorithm

1) Why query expansion: Because of the inherent ambiguity of natural language, and because most query words are short, search engines often cannot understand the user's real intention, which leads to redundant results that do not meet the user's need. Adding terms to the query makes the user's intention more explicit and the search results more accurate. This is why we need query expansion. Query expansion has become an active branch of information retrieval research with many published results [7][8][9][10]. The query expansion in our system differs from the traditional ones. Traditional query expansion is implicit: the expanded terms are submitted to the search engine in a way that is completely transparent to the user. Our system instead shows a number of expansion words to the user, who explicitly selects one. The underlying realization, however, remains the same.

2) Algorithm: Our system records the title and summary of each clicked page as the text content of that page and uses Lucene to index them. If the user imports bookmarks, the system takes the page title, keywords and description as the text content of the page and indexes them with Lucene as well. The system maintains an index file for each user.

After the user enters a query term, we use the query word to search the index file and find the top N (here N = 10) page documents associated with the current query. Each page document carries a Lucene relevance value, docScore. For each retrieved document, we apply Chinese word segmentation to its text content and then use the following formula to compute the weight of each word in the corresponding page:

$$\mathit{TermScore} = \left[\frac{1}{2} + \frac{nWords - s}{2 \cdot nWords}\right] \times \log(1 + TF),$$

where $nWords$ is the total number of words in the document, $s$ is the position where the word first appears, and $TF$ is the frequency of the word in the document. The next formula combines the scores of a word over all the retrieved documents:

$$\mathit{FinalTermScore}_i = \sum_{j=1}^{N} \mathit{TermScore}_{i,j} \times docScore_j,$$

where $\mathit{TermScore}_{i,j}$ is the weight of word $i$ in document $j$, and $docScore_j$ is the relevance value between the query and document $j$.

Finally, we sort the words by their final scores in descending order and recommend the top M (here M = 5) words to the user.
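As a concrete illustration, the following Python sketch implements the two scoring formulas above on pre-segmented text. The function names and the in-memory representation of the Lucene results are our own assumptions, not the system's actual code.

```python
import math
from collections import defaultdict

def term_scores(words):
    """Score every distinct word of one document (a list of segmented words):
    TermScore = [1/2 + (nWords - s) / (2 * nWords)] * log(1 + TF),
    where s is the first position of the word and TF its frequency."""
    n = len(words)
    first, freq = {}, defaultdict(int)
    for pos, w in enumerate(words, start=1):
        first.setdefault(w, pos)
        freq[w] += 1
    return {w: (0.5 + (n - first[w]) / (2.0 * n)) * math.log(1 + freq[w])
            for w in freq}

def recommend_expansions(retrieved, m=5):
    """retrieved: list of (segmented_words, doc_score) pairs for the top
    N (= 10) documents Lucene returns for the query. Combines per-document
    scores: FinalTermScore_i = sum_j TermScore_ij * docScore_j."""
    final = defaultdict(float)
    for words, doc_score in retrieved:
        for w, ts in term_scores(words).items():
            final[w] += ts * doc_score
    # Sort descending and recommend the top M (= 5) words to the user.
    return sorted(final, key=final.get, reverse=True)[:m]
```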
B. Naive Bayes Algorithm

The system learns the user's interests from the query words and the clickthrough data. However, the clicked pages are sparse relative to all returned pages, so we need to extract the user-related pages from the full set of returned pages. The Bayesian classifier is a good data mining algorithm for this purpose. In this system we improve Naive Bayes to fit our situation, using a Naive Bayes-based Spy. Naive Bayes and the Naive Bayes-based Spy are introduced next.

1) Naive Bayes: Bayesian classification is an important data mining technique. In theory it has the smallest error rate among classification algorithms, which gives it broad application prospects in practice. The Naive Bayes classifier is based on the prior and conditional probabilities of Bayes' formula: it combines the prior probability of an event with the known information to determine the posterior probability of a new sample. The goal is to compute the posterior probability of a sample for each class and to assign the sample to the class with the maximum posterior probability. In this system the task is to classify unlabeled pages as positive or negative. Following [9], we estimate the word probabilities as

$$\Pr(w_j \mid +) = \frac{\lambda + \sum_{i=1}^{N} \mathit{Num}(w_j, l_i)\,\delta(+ \mid l_i)}{\lambda M + \sum_{k=1}^{M} \sum_{i=1}^{N} \mathit{Num}(w_k, l_i)\,\delta(+ \mid l_i)},$$

where $\delta(+ \mid l_i)$ indicates the class label of link $l_i$: its value is 1 if $l_i$ is positive and 0 otherwise. $\mathit{Num}(w_j, l_i)$ is a function counting the number of times $w_j$ appears in $l_i$. $\lambda$ is a smoothing factor; we set $\lambda = 1$ to make Naive Bayes more robust. When predicting unlabeled links, Naive Bayes calculates the posterior probability of a link $l$ using Bayes' rule:

$$\Pr(+ \mid l) = \frac{\Pr(l \mid +)\,\Pr(+)}{\Pr(l)},$$

where $\Pr(l \mid +) = \prod_{j=1}^{M} \Pr(w_j \mid +)$ is the product of the likelihoods of the keywords in link $l$. Link $l$ is predicted to belong to class "+" (positive) if $\Pr(+ \mid l)$ is larger than $\Pr(- \mid l)$, and to class "-" (negative) otherwise.
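A minimal sketch of this classifier follows, assuming pages are represented as word-count dictionaries; it is our own illustrative implementation of the smoothed estimates above, not the system's code.

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Naive Bayes over link texts, with smoothing factor lambda (= 1)."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def train(self, links, labels):
        # links: list of {word: count}; labels: "+" or "-".
        # Assumes both classes occur in the training data.
        self.prior = {c: labels.count(c) / len(labels) for c in ("+", "-")}
        counts = {"+": defaultdict(float), "-": defaultdict(float)}
        for link, c in zip(links, labels):
            for w, n in link.items():
                counts[c][w] += n
        self.vocab = {w for link in links for w in link}
        m = len(self.vocab)
        self.pr_w = {}
        for c in ("+", "-"):
            total = sum(counts[c].values())
            # Pr(w|c) = (lambda + count of w in class c) / (lambda*M + total counts in c)
            self.pr_w[c] = {w: (self.lam + counts[c][w]) / (self.lam * m + total)
                            for w in self.vocab}

    def posterior_positive(self, link):
        # Pr(+|l) from Bayes' rule with Pr(l|c) = prod_j Pr(w_j|c),
        # computed in log space for numerical stability.
        def log_joint(c):
            lp = math.log(self.prior[c])
            for w, n in link.items():
                if w in self.vocab:   # words unseen in training are skipped
                    lp += n * math.log(self.pr_w[c][w])
            return lp
        lp, ln = log_joint("+"), log_joint("-")
        return 1.0 / (1.0 + math.exp(ln - lp))
```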
2) Naive Bayes-based Spy: Naive Bayes classification requires an initial positive set and an initial negative set of related pages. In this system, however, the initial collection contains only the pages the user clicked (i.e., positively related pages) and no negative pages, so we must divide the unlabeled pages into a positive set and a negative set according to the clicked pages. For this situation we developed an improved Naive Bayes, the Naive Bayes-based Spy.

The Naive Bayes-based Spy puts each positive page in turn into the unlabeled collection, each positive page acting as a spy used to separate further positive pages from the unlabeled collection. A voting mechanism is then applied: for each spy, every unlabeled page is classified into one collection (positive or negative), and if a page's positive votes exceed a certain value it is placed in the positive collection, otherwise in the negative collection. The idea of the spying and voting procedure is depicted in Fig. 2 and explained as follows.

First, the algorithm runs the spying technique $n$ times, where $n = |P|$ is the number of positive examples. Each time, a positive example $P_i$ in $P$ is selected to act as a spy and is put into the unlabeled set $U$ to train the Naive Bayes classifier. The probability $\Pr(+ \mid P_i)$ assigned to the spy $P_i$ is used as the threshold $T_s$ to select a candidate predicted-negative set $PN_i$: any unlabeled example $U_j$ with a smaller probability of being positive than the spy ($\Pr(+ \mid U_j) < T_s$) is selected into $PN_i$. As a result, $n$ candidate predicted-negative sets $PN_1, PN_2, \ldots, PN_n$ are identified. Finally, a voting procedure combines all the $PN_i$ into the final $PN$: an unlabeled example is included in the final $PN$ if and only if it appears in at least a certain number $T_v$ of the $PN_i$. $T_v$ is called the voting threshold. Because the voting procedure selects $PN$ based on the opinions of all the spies, it minimizes the bias of the spy selection.

Notice that the identified $PN$ can be influenced by the selection of spies, and clickthrough data typically contains very few positive examples (recall that these are the clicked links). By making full use of all the potential spies through the voting procedure, we reduce this influence and strengthen the spying technique further.

Figure 2. The procedure of the Naive Bayes-based Spy. (A query yields returned pages; the clicked pages form the initial positive data and the unclicked pages the initial unlabeled data. Each positive URL acts in turn as a spy for the Naive Bayes classifier, producing candidate negative sets Negative1, Negative2, ..., NegativeN, which the voting step combines into the final negative set.)
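The spying-and-voting procedure can be sketched as follows, reusing the NaiveBayes class above; the voting threshold Tv and the data representation are assumptions made for illustration.

```python
def spy_negatives(positives, unlabeled, tv):
    """Naive Bayes-based Spy: returns the final predicted-negative set PN.
    positives, unlabeled: lists of {word: count} link representations
    (assumes at least two clicked pages). tv: voting threshold, the minimum
    number of spies that must place a page in their candidate negative set."""
    votes = [0] * len(unlabeled)
    for i, spy in enumerate(positives):
        rest = positives[:i] + positives[i + 1:]   # spy moves into U
        mixed = unlabeled + [spy]
        nb = NaiveBayes()
        nb.train(rest + mixed, ["+"] * len(rest) + ["-"] * len(mixed))
        ts = nb.posterior_positive(spy)            # spy's score = threshold Ts
        for j, u in enumerate(unlabeled):
            if nb.posterior_positive(u) < ts:      # u enters candidate set PN_i
                votes[j] += 1
    return [u for j, u in enumerate(unlabeled) if votes[j] >= tv]
```

Pages voted into PN serve as the negative class, and the remaining unlabeled pages join the positives, giving the labeled training set that the ranking SVM below learns from.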
C. The Ranking SVM Algorithm

The core of the system is a personalized ranking algorithm that can handle high-dimensional data without overfitting. The Support Vector Machine (SVM), based on the VC dimension and the structural risk minimization principle, is a good choice: given limited sample information, it seeks the best compromise between model complexity (the accuracy achieved on the training samples) and learning capacity (the ability to classify new samples without error) in order to obtain the best generalization ability. The SVM approach has several major advantages:

1. It is designed for the situation of limited samples. The goal is to obtain the optimal solution under the available information, not just the asymptotically optimal value as the number of samples tends to infinity.

2. The algorithm is ultimately transformed into a quadratic optimization problem, so in theory it finds the global optimum, which solves the problem of local extrema that neural network methods cannot avoid.

3. The algorithm maps the practical problem through a non-linear transformation into a high-dimensional feature space, where a linear discriminant function is constructed to realize the non-linear discriminant function of the original space. This guarantees good generalization ability while cleverly solving the dimensionality problem: the complexity of the algorithm is independent of the dimension of the samples.

We designed an SVM method based on the RBF kernel, whose goal is to learn from the output of the Naive Bayes classifier and obtain a ranking function $f(\vec{q}, d)$, the inner product of a weight vector and the query-document feature vector, that is:

$$f(q, l) = \vec{w} \cdot \phi(q, l).$$
$\phi(q, d)$ denotes the correlation between a document $d$ in the search results and the query $q$. Its features (12 in total) are defined as follows:

1) PageRank: the PR value of the page.
2) urlLength: the length of the URL.
3) pageSize: the size of the page document.
4) isDomain: whether the URL is a root domain (0 or 1).
5) Date: the last-modified date of the page.
6) Rank_E: $\mathit{Rank}_E = (11 - x)/10$ if document $d$ ranks at position $x$ with $x \le 10$, and 0 otherwise.
7) Top_T: $E_T = 1$ if document $d$ is ranked in the top $T$, and 0 otherwise, for $T \in \{1, 3, 5, 10\}$.
8) SimT(q, t): the similarity between the query $q$ and the title $t$, computed over the words $w_i$ as

$$\mathit{SimT}(q, t) = \begin{cases} \log N, & \text{if } w_i \in t \text{ and } w_i \in q;\\ -\log N, & \text{if } w_i \in t \text{ and } w_i \notin q;\\ \frac{1}{2}\log\frac{P_+(1 - P_-)}{P_-(1 - P_+)}, & \text{otherwise.} \end{cases}$$

9) SimC(q, a): $\mathit{SimC}(q, a) = |q \cap a| / |q|$, where $|q \cap a|$ is the number of words in the query $q$ that also appear in the abstract $a$.
10) SimG(q, a): $\mathit{SimG}(q, a) = \dfrac{n_q \cdot \mathit{count}(q \text{ in } a)}{\sum_{w_i \in q} \mathit{count}(w_i \text{ in } a)}$, where $n_q$ is the number of words in $q$.
11) containsHome: whether the URL contains "home" (0 or 1).
12) containsTilde: whether the URL contains "~" (0 or 1).

$\vec{w}$ assigns a weight to each feature. Once the user's weight vector $\vec{w}$ has been computed, the query results returned by the search engine are re-sorted using $f(q, l) = \vec{w} \cdot \phi(q, l)$. The principle of the re-sort is shown in Fig. 3.

Figure 3. The principle of SVM re-sorting.

For example, if the user's weight vector is $W_1$, the results are sorted into (1, 2, 3, 4); if it is $W_2$, they are sorted into (2, 3, 1, 4).

In the implementation, we use the SVM-Light software [10] to compute the user's profile, i.e., to learn the weight vector from the outcomes of the Naive Bayes classifier and to realize the SVM re-sorting algorithm.
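A sketch of the re-ranking step: once SVM-Light has produced the weight vector w, re-sorting is just a dot product with each document's feature vector. The extraction below covers a representative subset of the 12 features (Top_T expanded into four indicators; Date, SimT and SimG omitted), and the dictionary keys are our own illustrative names.

```python
def sim_c(query, abstract):
    """SimC(q, a) = |q ∩ a| / |q| on whitespace-tokenized text."""
    q, a = set(query.split()), set(abstract.split())
    return len(q & a) / len(q)

def features(doc, query):
    """phi(q, d): a representative subset of the 12 features above."""
    rank = doc["engine_rank"]                        # original rank x of d
    return [
        doc["pagerank"],                             # 1) PageRank
        len(doc["url"]),                             # 2) urlLength
        doc["page_size"],                            # 3) pageSize
        1.0 if doc["is_domain_root"] else 0.0,       # 4) isDomain
        (11 - rank) / 10.0 if rank <= 10 else 0.0,   # 6) Rank_E
        *[1.0 if rank <= t else 0.0 for t in (1, 3, 5, 10)],  # 7) Top_T
        sim_c(query, doc["abstract"]),               # 9) SimC
        1.0 if "home" in doc["url"] else 0.0,        # 11) containsHome
        1.0 if "~" in doc["url"] else 0.0,           # 12) containsTilde
    ]

def rerank(docs, query, w):
    """f(q, l) = w . phi(q, l); present results by descending score."""
    score = lambda d: sum(wi * xi for wi, xi in zip(w, features(d, query)))
    return sorted(docs, key=score, reverse=True)
```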
V. EXPERIMENTS

A. Experiment Setup

Because the test data we need is large and is tied to particular users, almost no ready-made data set could be used, so we designed a manual evaluation procedure. We recruited nine volunteers; each tester chose "computer" or "sports" as a search topic, with 20 query words per topic. Each tester registered in our system and searched with the 20 query words; for each query term, the tester clicked three or more Web pages he or she considered relevant. By recording the search results and the click history, we collected 692 clicked pages and 9145 search records as the training set. To assess the effectiveness of the system, we then gave the testers another 5 ambiguously phrased queries as test queries.

1) The assessment of query expansion: Our system collected a total of 110 expansion words evaluated by the 9 users. A score of positive effect (+1) was given to 51 of them (46.36%); a score of no effect (0) to 38 (34.55%); and a score of negative effect (-1) to 21 (19.09%). Thus the expansion terms are in general meaningful for users, and 80.9% of them do not degrade the results the search engine returns. The query expansion terms our system provides therefore have a positive effect, making the user's search intention clearer and the returned results more accurate.

2) The assessment of the re-sort effect: After the training data was collected, we ran the Naive Bayes classifier and the SVM learning module to learn each user's model. For each test query word, the users scored the returned pages as relevant, less relevant or irrelevant. To keep the ordering from influencing the marks, we presented the initial results in random order.

To evaluate the effectiveness of the system we use the DCG (Discounted Cumulative Gain) measure [11]. This method accounts for the influence of a relevant page's rank position on the user's perception of the results, and it can also distinguish between different relevance grades. We valued the pages according to the user tags, assigning 2, 1 and 0 to relevant, less relevant and irrelevant, respectively. The DCG values are computed as

$$\mathit{DCG}[i] = \begin{cases} G[1], & \text{if } i = 1;\\ \mathit{DCG}[i-1] + \dfrac{G[i]}{\log_b i}, & \text{otherwise,} \end{cases}$$

where $b$ is the number of relevant pages; if $b < 2$ we set $b = 2$. We compute the DCG values of the initial results and of the results re-sorted according to the user model; comparing the two DCG values gives the percentage effect of the re-sorting algorithm.
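The DCG comparison can be computed as below; the gains are the 2/1/0 relevance grades assigned by the user, the log base follows the rule b = max(2, number of relevant pages) given above, and the example grade lists are invented for illustration.

```python
import math

def dcg(gains, b):
    """DCG[i] = G[1] if i == 1, else DCG[i-1] + G[i] / log_b(i).
    gains: user-assigned grades in rank order (2 relevant, 1 less
    relevant, 0 irrelevant). b: log base, at least 2."""
    b = max(b, 2)
    total = gains[0]
    for i, g in enumerate(gains[1:], start=2):
        total += g / math.log(i, b)
    return total

# Example: compare the engine's original order with the personalized re-sort.
baseline = [0, 2, 1, 0, 2]   # grades in the engine's original order
resorted = [2, 2, 1, 0, 0]   # grades after re-ranking by f(q, l)
b = sum(1 for g in baseline if g == 2)   # number of relevant pages
improvement = (dcg(resorted, b) - dcg(baseline, b)) / dcg(baseline, b)
```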
B. Experiment Results

The test results on the sports topic are shown in Fig. 4.

Figure 4. The test results on the sports topic. (For each of five users, the chart plots, for each of the user's five test query words, the percentage change in DCG of the re-sorted results relative to Baidu's original results.)

Fig. 4 shows the effect of re-sorting on the sports topic. The horizontal axis is the percentage improvement obtained by comparing the DCG of the re-sorted results with the DCG of Baidu's results; the vertical axis lists the users' query words, every five entries from the bottom up belonging to one of the five users. For four query words (Italy, Li Na, Li Ning, Florence) the rank of sports-related pages was generally improved for all users, by up to 33.4%. For the query word "Jordan", four results were worse.

The test results on the computer topic are shown in Fig. 5.

Figure 5. The test results on the computer topic. (The same measure for the four users of the computer topic.)

Fig. 5 shows the effect of re-sorting on the computer topic, for a total of 4 users. For the query word "rabbit", the re-sorted pages of all users show a clear drop in effect; a few other query words show a slight decrease, while the re-sort effect on the relevant pages otherwise increased considerably.

From these results, the most significant advantage of our system is that it quickly provides good re-sorted results to a new user. Other personalized ranking algorithms generally require collecting large amounts of data over a long time to achieve a good effect; with our system, a user only needs to search 20 terms to improve the ranking of the relevant pages.

However, the results also show clear negative effects for two query terms after re-sorting, which reveals some deficiencies in our system. Our analysis finds three main reasons. (1) Baidu already ranks some query words very well: the DCG of Baidu's results is 0.762 for "Jordan" and 0.783 for "rabbit", showing that users are highly satisfied with the top 30 pages; our system then promotes pages of similar interest that carry some noise, which lowers the DCG. (2) Some query words were poorly designed: for the query "rabbit", the only computer-related results returned concern the "Magic Rabbit" software, a range too limited for a learning system, so a peculiar query term can cause a bad effect. (3) Training was insufficient: as mentioned above, each user trained the system with only 20 query words, which is not enough to eliminate the noise. Even so, we are pleased with the current effect, and we believe that after more training and parameter tuning the system will perform better still.

VI. SUMMARY AND OUTLOOK

Through exploration and practice, we have completed a search engine system that not only implicitly collects users' personal information and provides query expansion, but also personally re-sorts the results. We designed and implemented the query expansion recommendation algorithm based on semantic statistics, the Naive Bayes classification algorithm, and the SVM re-sorting algorithm.

The test results show that our search engine can improve the quality of results for a user within a short time. The system is not yet perfect, however: a small portion of the search terms may produce negative effects. Our next steps are to improve the system as follows: first, improve the feature weighting of the support vector machine; second, incorporate a meta search engine; third, adjust the document feature selection so that it expresses the users' interests better; and finally, extend the algorithm with adaptive parameter adjustment, so that the system can automatically tune its classification and ranking algorithms to the characteristics of each user.

ACKNOWLEDGMENT

This research was supported by the China Next Generation Internet Project "IPv6 oriented Large Scale Distributed Search Engine" (CNGI 2008-122) and the National Undergraduate Innovation Programs.

REFERENCES

[1] F. Liu, C. Yu, and W. Meng, "Personalized Web search by mapping user queries to categories," in Proc. CIKM, 2002.
[2] M. Speretta and S. Gauch, "Personalizing search based on user search histories," 2004.
[3] T. H. Haveliwala, "Topic-sensitive PageRank," in Proc. WWW, 2002.
[4] G. Jeh and J. Widom, "Scaling personalized Web search," in Proc. WWW, 2003.
[5] T. Joachims, "Optimizing search engines using clickthrough data," in Proc. SIGKDD, 2002.
[6] J.-T. Sun, H.-J. Zeng, and H. Liu, "CubeSVD: a novel approach to personalized Web search," in Proc. WWW, 2005.
[7] P. Chirita, C. S. Firan, and W. Nejdl, "Personalized query expansion for the Web," in Proc. SIGIR, 2007.
[8] L. Deng, X. Chai, Q. Tan, W. Ng, and D. L. Lee, "Spying out real user preferences for metasearch engine personalization," in Proc. WebKDD, 2004.
[9] Q. Tan, X. Chai, W. Ng, and D. L. Lee, "Applying co-training to clickthrough data for search engine adaptation," in Lecture Notes in Computer Science, 2004.
[10] T. Joachims, "Making large-scale SVM learning practical," in B. Schölkopf et al. (eds.), Advances in Kernel Methods – Support Vector Learning, MIT Press, 1999. http://svmlight.joachims.org/
[11] K. Järvelin and J. Kekäläinen, "IR evaluation methods for retrieving highly relevant documents," in Proc. SIGIR, 2000.