Exploiting Concept Network-based User Profile for Personalized Web Search: A Re-ranking Approach

Jun-ho Roh and Han-joon Kim

School of Electrical and Computer Engineering, University of Seoul, 90 Jeonnong-dong, Dongdaemun-gu, Seoul 130-743, Korea E-mail: {loece, khj}@uos.ac.kr

Abstract

This paper proposes a novel approach to personalized web search that re-ranks search results using user profiles with a concept-network structure. Personalized search systems rely on user profiles that capture users' search patterns, which they use either to expand initial queries or to re-rank search results. The proposed method is a re-ranking personalized search method integrated with a query expansion facility. It identifies documents that occur in common among the different search results retrieved by the expanded queries, and re-ranks the search results by their degree of co-occurrence. Through empirical web searches performed by a number of actual users with diverse information needs and query intents, we show that the proposed method outperforms conventional ones.

Key Words: Web search, Personalization, Ranking, Concept network, User profile, Query expansion

1. Introduction

Web search is the task of finding preferred information (i.e., web pages) on the World Wide Web. With the rapid growth of the information available on the Web, it has become difficult for web search engines to satisfy users’ information needs with only a sequence of query words. Web search techniques will continue to evolve in functionality and performance, and one promising direction is personalized search, which presents custom-tailored search results to each user [1]. Personalization is also increasingly recognized as highly important for mobile web search [2-6]. Some personalized search engines, such as Rollyo (http://www.rollyo.com) and Ness (http://likeness.com), have recently been introduced; however, more effective and practical methods are still required. In general, personalized search engines are based upon user profiles, a special type of data structure containing users’ implicit search patterns or explicitly given details, and their performance depends on the quality of those profiles.

In this paper, we propose a personalized search technique that re-ranks search results based upon concept network-based user profiles. In our previous work, we showed that a user profiling technique based upon a concept network structure can achieve precise personalized search by expanding an initial query [7]. The key idea of the proposed method is to identify top-ranked documents that occur in common among the different search results generated from expanded queries, and then to re-rank the search results by their degree of co-occurrence. The criteria for estimating the degree of co-occurrence include domain names as well as URLs (Uniform Resource Locators), and under these two criteria we introduce a simple function that evaluates the degree of co-occurrence among a set of search results.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed method. Section 4 presents experimental results for personalized queries. The last section summarizes our work and introduces future work.

2. Background Study

2.1 Personalized search

Since personalization techniques are known to be highly important in information retrieval, many studies have been carried out. A number of personalized search techniques using user profiles are described in [6], and they are broadly classified into two approaches. The first presents the search results according to a user profile, typically by re-ranking the delivered documents based upon the similarity between those documents and the user’s preferences [2, 4] (see Fig. 1). The second enhances a submitted query through query expansion (or recommendation) that exploits a user profile [3] (see Fig. 2). Our proposed method is a combination of these two approaches.

Fig. 1. Personalized search by re-ranking

Fig. 2. Personalized search by query expansion

2.2 Concept network-based user profile

In our work, the user profile is represented as a so-called ‘concept network’, a network (or graph) structure whose vertices are concepts and whose edges represent relevance between vertices. Here, a concept is called a ‘user interest concept’ (simply ‘concept’) and is approximately defined as a formal concept according to formal concept analysis (FCA) theory [8]. In FCA theory, a concept is a unit of thought defined as a pair of an ‘extent’ and an ‘intent’; the extent covers all objects belonging to the concept, and the intent consists of all features (or attributes) valid for all of those objects. For a pair of extent and intent to be a formal ‘concept’, the intent features must be common to all of the extent objects, and the extent objects must possess all of the features in the intent.

Our method assumes that whenever a user submits a query, a user interest concept (i.e., a unit of the concept network) is generated and merged into his/her current concept network. A user concept is defined as a pair of ‘extent’ and ‘intent’, where the extent covers the set of documents visited among the search results and the intent covers the set of keyword features extracted from the extent documents. To generate an unambiguous concept, it is important to extract high-quality keywords. Fig. 3 shows an example of a concept network-based user profile, in which each concept is represented as a set of intent features. To extract better keywords, we have used our previous keyword extraction method [7], which is based on the conventional TF-IDF weighting scheme. As each user query is processed, its corresponding user concept is generated, and the user’s concept network profile evolves incrementally.

Fig. 3. An example of concept network-based user profile
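The profile structure of Section 2.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names `Concept` and `ConceptNetwork`, the Jaccard-overlap edge weight, and the `associated_keywords` lookup are hypothetical, since the paper does not specify how the relevance between concepts is computed or how the network is stored.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A user interest concept: visited documents (extent) and their keywords (intent)."""
    extent: frozenset  # URLs of the documents the user visited for one query
    intent: frozenset  # keyword features extracted from those documents


@dataclass
class ConceptNetwork:
    """User profile that grows by one concept per submitted query."""
    concepts: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)  # (i, j) -> relevance weight

    def add_concept(self, concept):
        """Merge a new concept into the network and return its index."""
        new_index = len(self.concepts)
        for index, other in enumerate(self.concepts):
            shared = concept.intent & other.intent
            if shared:
                # Assumed relevance measure (not from the paper):
                # Jaccard overlap of the two intent keyword sets.
                union = concept.intent | other.intent
                self.edges[(index, new_index)] = len(shared) / len(union)
        self.concepts.append(concept)
        return new_index

    def associated_keywords(self, keyword):
        """Keywords that share at least one concept with `keyword`, excluding itself."""
        related = set()
        for concept in self.concepts:
            if keyword in concept.intent:
                related |= concept.intent
        related.discard(keyword)
        return related
```

Each call to `add_concept` corresponds to one processed query, so the profile evolves incrementally; a lookup such as `associated_keywords("ruby")` is one plausible way to obtain candidate terms for query expansion.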

3. Analyzing Search Results for Re-ranking

Re-ranking sorts the web pages delivered as search results again, taking a user’s preferences into account. To this end, we identify documents that occur in common among the different search results retrieved by the expanded queries, and then evaluate their degree of co-occurrence in order to re-rank the search results.

Fig. 4. Representation of documents sets retrieved by expanded queries

As seen in Fig. 4, when a set of documents $D_q$ is retrieved by the query $q$, different sets of documents $D_{q_1}$, $D_{q_2}$, and $D_{q_3}$ are retrieved by its expanded queries $q_1$, $q_2$, and $q_3$; that is, $D_{q_i}$ is the set of documents retrieved by the expanded query “$q$ & $q_i$”. For example, suppose a user submits the query ‘ruby’ to search for information about the Ruby programming language. Assuming that the user’s profile is given as seen in Fig. 3, we can get a set of keywords {‘program’, ‘language’, ‘tutorial’, ‘linux’, ‘java’, ‘test’} associated with the query keyword ‘ruby’. The initial query can be expanded with an appropriate number of keywords selected from those associated keywords. If the expanded queries are “ruby & program”, “ruby & language”, and “ruby & tutorial”, we obtain three sets of more personalized documents from these expanded queries.

Here, we focus upon the fact that documents occurring in common among a set of different search results can be closer to the user’s query intent. This is because such documents are likely to contain more of the keywords in the user profile. In general, search results are delivered to users in order of the degree of relevance between the user’s query and the documents. In our work, we focus upon reordering the top-N documents for personalized search. For re-ranking, we estimate the degree of co-occurrence among the different search result sets retrieved by the expanded queries. The estimated value is called the ‘re-rank score’ in this paper. The re-rank score function for each document contained in the result set retrieved by an expanded query is defined as follows.

$$S\left(d_{q_i}^{(r)}\right) = \frac{1}{\log(r + 1)} \qquad (1)$$

where $d_{q_i}^{(r)}$ denotes the document with the $r$th rank in the document set $D_{q_i}$ retrieved by the expanded query $q_i$. In our work, the value of $r$ ranges from 1 to 20. The score function should allow documents with lower rankings to be promoted to higher ranks when they occur in common among different search result sets. It is therefore desirable to keep the difference between the re-rank scores of adjacent ranks small, which is why the score function uses a logarithm in the denominator of Equation 1. For example, suppose that a particular document ranks 1st, 3rd, and 6th for three expanded queries; its re-rank scores within the three search results are then 1/log(1+1)=3.32, 1/log(3+1)=1.66, and 1/log(6+1)=1.18, respectively (these values correspond to a base-10 logarithm). The final re-rank score of a particular document $d_q$ sums its re-rank scores over the result sets of the expanded queries, and is calculated as follows.

$$S(d_q) = \sum_{i=1}^{k} S\left(d_{q_i}^{(r)}\right) \qquad (2)$$

where $k$ denotes the number of expanded queries for a query. That is, the final re-rank score of a document $d_q$ is the sum of the score values $S(d_{q_i}^{(r)})$ taken over each of the $k$ search results $D_{q_i}$ in which the document occurs.

So far, we have assumed that the URL (Uniform Resource Locator) is the criterion for estimating the degree of co-occurrence. However, in many cases, web pages with different URLs have similar contents within the same domain. This implies that the domain name can serve as another criterion for estimating the degree of co-occurrence. In fact, the domain name-based approach was found to outperform the URL-based one in our empirical results.
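The scoring of Equations 1 and 2, under both the URL criterion and the domain-name criterion, can be sketched in Python as follows. The function names, the `key` parameter, and two behaviors are assumptions made for illustration rather than details stated in the paper: each domain is counted once per result list at its best rank, and the final ordering is applied to the top-N results of the original query, with ties keeping the original order. A base-10 logarithm is used because it reproduces the worked values 3.32, 1.66, and 1.18.

```python
import math
from collections import defaultdict
from urllib.parse import urlsplit


def rank_score(rank):
    """Equation 1: 1 / log10(rank + 1), for a 1-based rank."""
    return 1.0 / math.log10(rank + 1)


def domain_of(url):
    """Domain-name criterion: keep only the host part of the URL."""
    return urlsplit(url).netloc.lower()


def rerank_scores(result_lists, key=lambda url: url, top_n=20):
    """Equation 2: sum the rank scores of each unit over all expanded queries.

    result_lists holds one ranked URL list per expanded query; key maps a URL
    to the unit whose co-occurrence is counted (the URL itself or its domain).
    """
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results[:top_n], start=1):
            unit = key(url)
            if unit in seen:
                continue  # assumption: count each unit once per list, at its best rank
            seen.add(unit)
            scores[unit] += rank_score(rank)
    return dict(scores)


def rerank(original_results, result_lists, key=lambda url: url, top_n=20):
    """Reorder the original top-N results by descending re-rank score."""
    scores = rerank_scores(result_lists, key, top_n)
    top = original_results[:top_n]
    # sorted() is stable, so ties keep the search engine's original order.
    return sorted(top, key=lambda url: -scores.get(key(url), 0.0))
```

Calling `rerank(results, expanded_results)` applies the URL criterion, while `rerank(results, expanded_results, key=domain_of)` applies the domain-name criterion. For the document in the worked example, which ranks 1st, 3rd, and 6th, the summed score is approximately 3.32 + 1.66 + 1.18 = 6.16.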

4. Empirical Results

4.1 Experimental setup

As test documents for performance analysis, we used more than 100,000 documents from the Fox News site at www.foxnews.com. We prepared diverse query words for each news subject, such as entertainment, health, scitech, travel, leisure, world, and sports.

As a performance metric, we adopted nDCG (Normalized Discounted Cumulative Gain) [9], a recent measure of the effectiveness of a web search algorithm. This measure compares the current query results with the perfect ranking for different types of queries, and is defined for a query $q$ as follows.

$$\mathrm{nDCG}_q = M_q \sum_{i=1}^{K} \frac{2^{rel(i)} - 1}{\log(1 + i)} \qquad (3)$$

where $rel(i)$ is the graded relevance of the query result at position $i$, evaluated on four discrete values, i.e., 0, 1, 2, 3. The more the results near the top of the list are judged to have high $rel(i)$ values, the higher the nDCG value becomes. $M_q$ is a normalization constant. Consequently, the nDCG value ranges between 0 and 1, and it takes the value 1 when we obtain the best result.

4.2 Performance analysis

As a baseline method for performance comparison, we used a simple query expansion method that is very similar to conventional query expansion methods; specifically, the baseline method expands a given initial query with all of its related keywords in the user profile. As mentioned before, the proposed re-ranking method has two variants: URL-based and domain name-based. Fig. 5 shows the nDCG values for diverse queries, in which we observe that the proposed method performs better by 3~5% on average. Note that even a 2% improvement is difficult to achieve owing to the characteristics of the nDCG equation. The empirical results show that our personalization technique is simple yet highly promising. In particular, the query words with higher performance are found to be keywords that have multiple meanings in the scitech and sports domains.

Fig. 6 gives the average nDCG values for query words with relatively more (or fewer) meanings. In this figure, the nDCG total average is the average of the nDCG values over all of the given queries, and the nDCG partial average is the average of the nDCG values over only the queries with relatively more meanings. In terms of the nDCG partial average, the proposed method outperforms the baseline method by 7~9%. We have thus found that the proposed method can give more than double the personalization effect for query words with multiple meanings. In addition, domain name-based re-ranking outperforms URL-based re-ranking by 1.1% on average and by 2.2% at the maximum. This empirical result is very encouraging, since similar or related web pages with the same domain name can be delivered to users clustered together.
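The nDCG metric of Equation 3 can be sketched in Python as follows. The sketch assumes that the normalization constant $M_q$ is the reciprocal of the DCG of the ideal (descending-relevance) ordering of the same judged results, which is the usual convention but is not spelled out in the text; the function names are illustrative. The natural logarithm is used, and the base does not affect the result because it cancels in the ratio.

```python
import math


def dcg(relevances, k=None):
    """Sum of (2^rel - 1) / log(1 + i) over positions i = 1..K."""
    top = relevances[:k] if k is not None else relevances
    return sum((2 ** rel - 1) / math.log(1 + i) for i, rel in enumerate(top, start=1))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    # A list with no relevant result has no meaningful ideal ranking.
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the graded judgments (0 to 3) in ranked order, so a list already sorted in descending relevance yields 1, and moving highly relevant results down the list lowers the value.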

Fig. 5. Performance comparison for queries with various topics (vertical axis: nDCG; series: Baseline, Proposed-URL, Proposed-DomainName)

Fig. 6. Comparison of average nDCG by two types of query groups (vertical axis: nDCG; series: total average, partial average; methods: Baseline, Proposed-URL, Proposed-DomainName)

5. Conclusions

This paper describes a novel re-ranking approach to personalized search that utilizes a concept network-based user profile. The proposed method combines the re-ranking and query expansion approaches to personalized search. Query expansion is performed based on the concept network user profile, and re-ranking is then performed by calculating the degree of co-occurrence among the different search results retrieved by the expanded queries. Through empirical web searches performed by a number of actual users with diverse information needs and query intents, we show that the proposed method outperforms conventional ones. As future work, we plan to extend the proposed method to consider collective preferences as well as personal preferences.

6. Acknowledgments

This work was supported by the Mid-career Researcher Program through a National Research Foundation of Korea (NRF) grant funded by the Korea government (MISP) (grant number: NRF-2013R1A2A2A01017030), and was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (grant number: NRF-2010-0025212).

References

[1] A. Ramos, and S. Cota, Search Engine Marketing, McGraw-Hill, 2008.
[2] P-A. Chirita, C. S. Firan, and W. Nejdl, Personalized Query Expansion for the Web, Proceedings of Annual ACM Conference on Research and Development in Information Retrieval, 2007, pp. 7-14.
[3] S. Gauch, J. Chaffee, and A. Pretschner, Ontology-based personalized search and browsing, Web Intelligence and Agent Systems, Vol. 1, No. 3-4, 2003, pp. 219-234.
[4] J. Hu, and P.K. Chan, Personalized Web Search by Using Learned User Profiles in Reranking, Proceedings of the Internal Workshop on Web Mining and Web Usage Analysis, 2008, pp. 78-83.
[5] F. Qiu, and J. Cho, Automatic identification of user interest for personalized search, Proceedings of International World Wide Web Conference, 2006, pp. 727-736.
[6] X. Shen, B. Tan, and C. Zhai, Implicit user modeling for personalized search, Proceedings of the International Conference on Information and Knowledge Management, 2005, pp. 825-831.
[7] H. Kim, H. Yune, J. Lee, and B. Lee, Concept Network-based Personalized Web Search, Information: International Interdisciplinary Journal, Vol. 15, No. 8, pp. 3531-3542.
[8] U. Priss, Formal Concept Analysis in Information Science, Annual Review of Information Science and Technology, Vol. 40, No. 1, 2006, pp. 521-543.
[9] H. Valizadegan, R. Jin, R. Zhang, and J. Mao, Learning to Rank by Optimizing NDCG Measure, Proceeding of Neural Information Processing Systems, 2010, pp. 41-48.

* Corresponding author: Han-joon Kim, Ph.D., School of Electrical and Computer Engineering, University of Seoul, 90 Jeonnong-dong, Dongdaemun-gu, Seoul 130-743, Korea. E-mail: khj@uos.ac.kr
