Social Annotation in Query Expansion: A Machine Learning Approach
Yuan Lin, Hongfei Lin, Song Jin, Zheng Ye
School of Computer Science and Technology, Dalian University of Technology
No. 2 LingGong Road, GanJingZi District, Dalian, China
[email protected],
[email protected],
[email protected],
[email protected] ABSTRACT
Keywords
Automatic query expansion technologies have been proven to be effective in many information retrieval tasks. Most existing approaches are based on the assumption that the most informative terms in top-retrieved documents can be viewed as context of the query and thus can be used for query expansion. One problem with these approaches is that some of the expansion terms extracted from feedback documents are irrelevant to the query, and thus may hurt the retrieval performance. In social annotations, users provide different keywords describing the respective Web pages from various aspects. These features may be used to boost IR performance. However, to date, the potential of social annotation for this task has been largely unexplored. In this paper, we explore the possibility and potential of social annotation as a new resource for extracting useful expansion terms. In particular, we propose a term ranking approach based on social annotation resource. The proposed approach consists of two phases: (1) in the first phase, we propose a term-dependency method to choose the most likely expansion terms; (2) in the second phase, we develop a machine learning method for term ranking, which is learnt from the statistics of the candidate expansion terms, using ListNet. Experimental results on three TREC test collections show that the retrieval performance can be improved when the term ranking method is used. In addition, we also demonstrate that terms selected by the term-dependency method from social annotation resources are beneficial to improve the retrieval performance.
Query Expansion, Social Annotation, Learning to Rank
1. INTRODUCTION

Queries submitted to search engines usually contain very few keywords or phrases, which are generally insufficient to fully describe a user's information need. To address this problem, query expansion (QE) has been widely used [3, 27, 8, 18]. Among these methods, pseudo-relevance feedback (PRF) via query expansion has proven effective in many information retrieval (IR) tasks [25, 14]. The basic assumption of PRF is that the top-ranked documents in the initial retrieval result contain many useful terms that can help describe the information need better. However, this assumption is often invalid [4], which can have a negative impact on retrieval performance. With the rapid development of social communities, a large amount of manually generated annotations, so-called tags, has emerged, and these tags are a potential resource for query expansion.

In recent years, with the rise of Web 2.0 technologies, social annotations have become a popular way for users to contribute descriptive metadata for Web information, such as Web pages and photos. A series of studies has explored social annotations for folksonomy [16], recommendation [22], the semantic Web [26], Web search [11, 2, 28], etc., and positive impact has been found in these studies. However, it is not clear whether social annotations can help as a resource for query expansion. As described in [26], social tags, which link to each other through Web resources, often share similar topics. Thus, a set of tags attached to the same resource (e.g. a Web page, photo, or product) can help us understand it better.

In this paper, we propose a novel query expansion method that uses social annotations as the resource of expansion terms. Given a query, a large number of candidate expansion terms (words or phrases) are chosen to convey the user's information need. We propose a term-dependency method to select the candidate expansion terms based on two term-dependence assumptions over query terms: 1) full independence, and 2) sequential dependence. Our initial results with this method show that some of the candidate expansion terms are indeed unrelated to the original query. To solve this problem, we develop a novel learning to rank method to rank the candidate terms according to their potential impact on retrieval effectiveness. Once the ranking list is obtained, the top-ranked terms are selected to expand the original query. The contributions of this paper are as follows:
1) we conduct extensive experiments to evaluate the potential of social annotations as a resource of expansion terms for query expansion; 2) in the expansion term selection process, we investigate two term-dependence assumptions for selecting useful expansion terms from social annotations; 3) based on the impact of the expansion terms selected under these two assumptions, we develop a learning to rank approach for expansion term ranking: the list of expansion terms pertaining to a query is regarded as an instance, and the statistical properties of terms in social annotations are used as features for learning a ranking model.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 explores the potential of social annotations as a new resource of expansion terms. Section 4 proposes the expansion term ranking method based on the social annotation sample. Sections 5 and 6 report the experimental results and discuss our findings, and Section 7 concludes the paper and outlines future work.
2. RELATED WORK

Automatic query expansion techniques have been widely used in IR. Among these approaches, pseudo-relevance feedback (PRF), which reformulates the original query using expansion terms from pseudo-relevant documents, has been shown to be effective. Traditional PRF has been implemented in different retrieval models: the vector space model [21], the probabilistic model [20], the relevance model [12], the mixture model [32], and so on. Meanwhile, a large amount of research has been conducted to improve traditional PRF by using passages instead of documents [31], a local context analysis method [27], a query-regularized estimation method [25], latent concepts [18], and a cluster-based re-sampling method for generating pseudo-relevant documents [14]. These methods follow the basic assumption that the top-ranked documents from an initial search contain many useful terms that can help discriminate relevant documents from irrelevant ones. Despite the large number of studies, a crucial problem is that the expansion terms determined in traditional ways from the pseudo-relevant documents are not all useful [4].

Some studies focus on using an external resource for query expansion. They found that one reason for query expansion failure is the lack of relevant documents in the local collection; therefore, the performance of query expansion can be improved by using a large external collection. Several collection enrichment approaches have been proposed, based on resources such as search engine query logs [9], thesauri (e.g. WordNet) [7], and Wikipedia [29]. Our work follows this strategy and uses an external collection as the resource of query expansion terms.

Recently, much research has focused on social annotations, in large part motivated by their increasing availability across many Web-based applications. Mika [19] proposed a tripartite model of actors, concepts and instances for semantic emergence. Heymann et al. [10] found that social annotation data has good coverage of interesting pages on the Web. Wu et al. [26] explored machine-understandable semantics from social annotations in a statistical way and applied the derived emergent semantics to discover and search shared Web bookmarks. Hotho et al. [11] proposed Adapted PageRank and FolkRank to find communities within the folksonomy, but did not apply them to Web search. Bao et al. [2] proposed to measure the similarity and popularity of Web pages from the Web users' perspective by calculating SocialSimRank and SocialPageRank. Xu et al. [28] proposed a personalized search framework that utilizes folksonomy for personalized search. Carman et al. [6] explored how useful tag data might be for improving search results, but focused mainly on data analysis rather than retrieval experiments. Different from the above work, we investigate the capability of social annotations to improve retrieval performance as a promising resource for query expansion.
3. SOCIAL ANNOTATION COLLECTION
In this section, we briefly introduce the advantages of social annotations for our study, and then investigate the potential of social annotations for filtering out irrelevant terms in query expansion.
3.1 Exploiting Social Annotation
Social annotation services such as Delicious [26] allow users to annotate and categorize Web resources with keywords (called tags). These tags are freely chosen by the users without a pre-defined taxonomy or ontology, and a single resource can be annotated with several tags from many different users. For example, tags such as "conference", "research", "acm", "2011", and "sigir" are used by many users to annotate the SIGIR 2011 homepage. As we can see, tags can be good keywords that describe the respective Web page from various aspects. Moreover, different tags describing the same Web resource are semantically related to some extent. An annotation typically consists of at least three parts: the URL of the resource (e.g. a Web page), one or more tags, and the user who created the annotation. Thus we abstract it as a triple:
\langle user_i, tag_j, url_k \rangle   (1)

which means that user i has annotated URL k with tag j. In this paper, we focus on which resources are annotated with which tags, and do not care much about who annotated the resource.
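For concreteness, the following minimal sketch (our illustration, not part of the original system; the type and function names are assumptions) shows how such triples can be held and grouped by URL, discarding the user dimension as described above.

```python
from collections import defaultdict, namedtuple

# One social annotation: <user_i, tag_j, url_k>  (Eq. 1)
Annotation = namedtuple("Annotation", ["user", "tag", "url"])

def group_by_url(annotations):
    """Collect all tags attached to each URL, ignoring who created them."""
    tags_per_url = defaultdict(list)
    for a in annotations:
        tags_per_url[a.url].append(a.tag)
    return tags_per_url
```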
3.2 Social Annotation Data Set
A social annotation service such as Delicious has the potential to give us a great deal of data about Web pages. We collected a sample of Delicious (http://delicious.com) data by crawling its website during March 2009. The data sample consists of 7,063,028 tag assignments on 2,137,776 different URLs, with 280,672 distinct tags. In our study, we mainly utilize the Web pages, the tags, and the relationship between them. Thus, the social annotation sample is organized as one article per annotation by filtering out the user information; each article summarizes the most important information of the annotation. Based on our analysis of the social annotation structure, we divide each social annotation article into four fields, as shown in Table 1. Before the experiments, we perform two preprocessing steps: 1) some tags only reflect personal requirements, such as "toread" and "@read"; we remove some of them manually. 2) Some users concatenate several correlated terms to form one tag, e.g. "java/programming", "Iraq war", "news-business-finance"; we split such tags with the help of delimiters.
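The two preprocessing steps can be sketched as below; the personal-tag blacklist and the delimiter set are illustrative assumptions, since the paper does not enumerate them.

```python
import re

# Step 1: tags that only reflect personal to-do markers are dropped
# (the exact blacklist used in the paper is not specified; this one is illustrative).
PERSONAL_TAGS = {"toread", "@read", "todo", "later"}

# Step 2: concatenated tags such as "java/programming" or
# "news-business-finance" are split on common delimiters.
DELIMITERS = re.compile(r"[/\-_+.]")

def clean_tags(raw_tags):
    """Remove personal tags and split concatenated ones."""
    cleaned = []
    for tag in raw_tags:
        tag = tag.strip().lower()
        if tag in PERSONAL_TAGS:
            continue
        cleaned.extend(t for t in DELIMITERS.split(tag) if t)
    return cleaned

# Example: clean_tags(["java/programming", "news-business-finance", "toread"])
#   -> ["java", "programming", "news", "business", "finance"]
```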
Table 1: Fields of Social Annotation Article
Field     | Description
URL       | Unique identifier for the Web page
Title     | Summary of the Web page
Frequency | Annotation frequency of the Web page
Tags      | Tags annotated to the Web page
Table 2: Proportions of each group of expansion terms selected from feedback documents
Collection  | Good Terms | Neutral Terms | Bad Terms
AP          | 17.55%     | 64.01%        | 18.44%
WSJ         | 15.83%     | 66.93%        | 17.24%
Robust 2004 | 16.57%     | 66.05%        | 17.38%
Table 3: Proportions of each group of expansion terms selected from the social annotation sample
Collection  | Good Terms | Neutral Terms | Bad Terms
AP          | 19.05%     | 59.98%        | 20.97%
WSJ         | 18.33%     | 59.21%        | 22.46%
Robust 2004 | 18.47%     | 60.05%        | 21.48%
3.3 Evaluation of Social Annotation Collection

After analyzing a large number of social annotations, we find that tags are usually semantically related to each other if they are frequently used to annotate the same or related resources. Intuitively, the social annotation collection can be seen as a manually edited thesaurus which provides a wealth of information about related terms. In order to evaluate the potential usefulness of the social annotation sample as an expansion term resource, we consider all the terms extracted from the social annotation sample using a term co-occurrence method. Term co-occurrence is usually used to measure how often terms appear together in a text window; in our experiment, the text window is defined as one annotation field (e.g. tags, title). For a query term q_j and a candidate term t_i in the social annotation sample S, the co-occurrence value is defined as follows:

cooc(t_i, q_j) = \sum_{f \in S} \frac{\log(tf(t_i \mid f) + 1.0) \times \log(tf(q_j \mid f) + 1.0)}{\log(N)}   (2)

where N is the number of articles in the social annotation sample S, and tf(\cdot \mid f) is the frequency of a term in field f. Based on this equation, the terms with the highest cooc(t_i, q_j) are chosen as candidate expansion terms for the original query term q_j. The final expansion terms can then be selected as follows:

coof_{single}(t_i, Q) = \sum_{q_j \in Q} idf(q_j)\, idf(t_i) \log(cooc(t_i, q_j) + 1.0)   (3)

where idf is computed as log(N/df), N is the number of articles in the social annotation sample, and df is the number of articles containing term t. Inspired by the work of Cao et al. [4], we test each of these terms to see its impact on retrieval effectiveness. To make the test simpler, we make the following simplifications: 1) each expansion term is assumed to act on the query independently of the other expansion terms; 2) each expansion term is added into the query with an equal weight λ (set to 0.01 or -0.01). Based on these simplifications, we measure the performance change due to an expansion term e by the ratio:

chg(e) = \frac{MAP(Q \cup e) - MAP(Q)}{MAP(Q)}   (4)

where MAP(Q) and MAP(Q ∪ e) are respectively the MAP of the original query and of the expanded query (expanded with e). In our previous work [30], we conducted some preliminary experiments; in this experiment, we set the threshold to 0.005. That is, a good (or bad) expansion term, which improves (or hurts) the effectiveness, should produce a performance change such that |chg(e)| > 0.005.

We now examine whether the candidate expansion terms from the social annotation sample are good terms. Our verification is made on three TREC collections: AP, WSJ and Robust 2004. For each collection, we consider 150 queries, and for each query the 100 expansion terms with the largest probabilities from the pseudo-feedback documents and the 100 terms with the largest co-occurrence values from the social annotation sample. Tables 2 and 3 respectively show the proportions of good, bad and neutral expansion terms from the feedback documents and from the social annotation collection over all queries. Comparing Tables 2 and 3, we can see that the proportion of good terms extracted from the social annotation sample is higher than that extracted from the pseudo-feedback documents. This means that social annotations have the potential to provide more good terms as a resource for query expansion. From Tables 2 and 3, the proportions of bad terms on all three collections are also higher than those of the good terms. This means that, while the proportion of good terms increases, the social annotation sample indeed introduces more bad terms into the expansion process. The research of Cao et al. [4] shows that retrieval effectiveness can be much improved if more good expansion terms are added to the original queries. The challenge now is to develop an effective method to correctly select the good terms in the expansion process.
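To make Eqs. (2)-(4) concrete, here is a minimal sketch under our reading of the notation; the articles structure (one dict of field-name to token list per annotation article) and the function names are our own assumptions, not the authors' implementation.

```python
import math

def cooc(term, query_term, articles, log_N):
    """Eq. (2): field-level co-occurrence of a candidate term and a query term."""
    score = 0.0
    for article in articles:              # each article = {field_name: [tokens]}
        for tokens in article.values():   # a field is the text window
            tf_t = tokens.count(term)
            tf_q = tokens.count(query_term)
            if tf_t and tf_q:
                score += math.log(tf_t + 1.0) * math.log(tf_q + 1.0) / log_N
    return score

def coof_single(term, query_terms, articles, idf):
    """Eq. (3): aggregate co-occurrence evidence over all query terms.
    idf is assumed to be a precomputed dict term -> log(N/df)."""
    log_N = math.log(len(articles))
    return sum(idf.get(q, 0.0) * idf.get(term, 0.0)
               * math.log(cooc(term, q, articles, log_N) + 1.0)
               for q in query_terms)

def chg(map_expanded, map_original):
    """Eq. (4): relative MAP change caused by adding one expansion term."""
    return (map_expanded - map_original) / map_original
```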
4. EXPANSION TERM RANKING

The challenge of selecting good terms for query expansion consists of two parts: (1) how to select the candidate expansion terms from the whole social annotation sample, and (2) how to distinguish the different importance of these terms. For the first part, we propose a term-dependency method to select the candidate expansion terms. Once the candidate expansion terms are selected, we use a novel learning to rank method to rank them. Once the ranking list is obtained, the top-ranked terms in the list are selected to expand the original query.

4.1 Term-Dependency Method for Candidate Term Selection
It is well known that dependencies between terms exist in most queries. For example, in Topic 63 ("machine translation"), occurrences of certain pairs of terms are correlated: the fact that either one occurs provides strong evidence that the other is also likely to occur. Eq. (3) assumes that the selection of each query term is determined independently, without considering latent term relations. Most previous work on modeling term dependencies has analyzed three underlying dependence assumptions: full independence, sequential dependence [23], and full dependence [17]. The full independence variant, which assumes query terms are independent of each other, underlies many retrieval models; Eq. (3) is based on this assumption. The second variant we consider is sequential dependence, which assumes dependence between neighboring query terms. Based on this assumption, we generalize Eq. (3) as follows:

coof_{bigram}(t_i, Q) = \sum_{j=1}^{n-1} idf(q_j)\, idf(q_{j+1})\, idf(t_i) \log(cooc(t_i, q_j) + 1.0) \log(cooc(t_i, q_{j+1}) + 1.0)   (5)

where n is the number of query terms in the original query Q. The full dependence variant assumes all query terms are in some way dependent on each other, so every combination of query terms would have to be considered. In our experiments, the results using this variant are not satisfactory; one possible reason is that irrelevant term groups may be introduced into the term selection procedure. Therefore, we only consider the first two assumptions and combine them by linear interpolation, weighted by a parameter λ:

coof(t_i, Q) = (1 - \lambda)\, coof_{single}(t_i, Q) + \lambda\, coof_{bigram}(t_i, Q)   (6)

Using this term-dependency method, we select the candidate terms with the highest coof scores. The effectiveness of the term-dependency method for expansion term selection will be shown in Section 6.
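Continuing the sketch above (and reusing its cooc and coof_single helpers), the sequential-dependence score and the interpolated selection of Eqs. (5)-(6) could look as follows; λ = 0.2 follows the setting reported in Section 6.2, while the candidate list size M = 150 matches the size of the training term lists and is otherwise an assumption.

```python
import math

def coof_bigram(term, query_terms, articles, idf):
    """Eq. (5): evidence from neighbouring query-term pairs (sequential dependence)."""
    log_N = math.log(len(articles))
    score = 0.0
    for q1, q2 in zip(query_terms, query_terms[1:]):   # neighbouring pairs
        score += (idf.get(q1, 0.0) * idf.get(q2, 0.0) * idf.get(term, 0.0)
                  * math.log(cooc(term, q1, articles, log_N) + 1.0)
                  * math.log(cooc(term, q2, articles, log_N) + 1.0))
    return score

def coof(term, query_terms, articles, idf, lam=0.2):
    """Eq. (6): linear interpolation of the two dependence assumptions."""
    return ((1.0 - lam) * coof_single(term, query_terms, articles, idf)
            + lam * coof_bigram(term, query_terms, articles, idf))

def select_candidates(vocabulary, query_terms, articles, idf, m=150):
    """Keep the M candidate terms with the highest coof scores."""
    scored = [(coof(t, query_terms, articles, idf), t) for t in vocabulary]
    return [t for _, t in sorted(scored, reverse=True)[:m]]
```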
4.2 Term Quality Evaluation

A key idea of our term ranking approach is that one can generalize knowledge about past candidate expansion terms to predict effective expansion terms for new queries. This requires a way to decide, for each training query, whether a candidate term is relevant or irrelevant. Intuitively, for query expansion, the relevant expansion terms are those that are beneficial to retrieval performance. Inspired by the work of Cao et al. [4], the chg(e) of an expansion term e described in Section 3.3 reflects its potential impact on retrieval effectiveness. Suppose that query q_i has k expansion terms; the relevance label of expansion term e_j (1 ≤ j ≤ k) is defined as follows:

label(e_j) = \begin{cases} 0, & chg(e_j) < 0 \\ 1, & chg(e_j) \ge 0 \end{cases}   (7)

where label(e_j) = 1 means that term e_j is relevant to query q_i, and label(e_j) = 0 means that it is irrelevant. In our experiment, we use three TREC collections (see Table 4 in Section 5.1), with 150 queries for each collection, and divide these queries into three groups of 50 queries. In the training dataset, each query has a term list that ranks the candidate expansion terms according to chg(e). To generate the development dataset, we label each term with a binary relevance judgment (relevant or irrelevant) according to Eq. (7). For each term in the dataset (training, development or test), we represent it with the features described in the next section.
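A small sketch of the labeling rule in Eq. (7); chg_scores is assumed to be a precomputed mapping from each candidate term to its chg(e) value.

```python
def label_terms(chg_scores, threshold=0.0):
    """Eq. (7): a term is labeled relevant (1) iff adding it does not hurt MAP."""
    return {term: 1 if change >= threshold else 0
            for term, change in chg_scores.items()}

# The training lists keep the graded chg(e) values as targets, while the
# development set only keeps these binary labels, as described above.
```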
4.3 Features for Term Ranking

This section describes our current feature set for expansion terms. We utilize statistical features of single terms and also of term pairs, and each expansion term is represented by a feature vector. Useful statistical features include those already used in traditional methods, such as term frequency (TF) and document frequency (DF). In order to capture the relationship between the candidate expansion terms and the original query terms, we also consider their co-occurrences. The social annotation article described in Section 3.2 consists of four fields (url, title, frequency, tags); we do not use the information in the url field in the experiments. Obviously, the importance of a term appearing in the title may differ from that of its appearance in the tags, so we calculate the statistical features separately in the title, in the tags, and in the whole article. Each expansion term is represented by a feature vector [f_1(e), f_2(e), \cdots, f_N(e)]^T. We only describe the features in the title field (f_{title}); the others are defined similarly.

• Term frequency (TF). The first features are the classic term frequency statistics:

f_1(e) = tf(e \mid f_{title})   (8)

f_4(e) = \frac{tf(e \mid f_{title})}{\max_{e' \in f_{title}} tf(e' \mid f_{title})}   (9)

f_7(e) = \frac{tf(e \mid f_{title})}{\sum_{e' \in f_{title}} tf(e' \mid f_{title})}   (10)

• Document frequency (DF). In the experiment, a social annotation article described in Section 3.2 is defined as a document, and df of the expansion term e is the number of documents containing e in the specific field. The df features are defined as:

f_{10}(e) = df(e \mid f_{title})   (11)

f_{13}(e) = \frac{df(e \mid f_{title})}{\max_{e' \in f_{title}} df(e' \mid f_{title})}   (12)

f_{16}(e) = \frac{df(e \mid f_{title})}{\sum_{e' \in f_{title}} df(e' \mid f_{title})}   (13)

• Co-occurrence with a single query term. The co-occurrence features are used to estimate the relationship with high-frequency terms more reliably. We define the co-occurrence feature as follows:

f_{19}(e) = \log \frac{\sum_{i=1}^{n} C(q_i, e \mid f_{title})}{n}   (14)

where C(q_i, e \mid f_{title}) is the frequency of co-occurrences of query term q_i and the expansion term e in the title field.

• Co-occurrence with query term pairs. A stronger co-occurrence relation for an expansion term is with two query terms together. [1] has shown that this type of co-occurrence relation is much better because it can take some query context into account:

f_{22}(e) = \log \frac{\sum_{(q_i, q_j) \in \Omega} C(q_i, q_j, e \mid f_{title})}{|\Omega|}   (15)

where Ω is the set of possible query term pairs.

• Term popularity. The frequency field denotes the annotation frequency of a Web page, and it also reflects the popularity of the page. A popular Web page will introduce many popular tags, which may be good descriptors. Note that we only compute this distinctive feature for terms in the tags field:

f_{25}(e) = \sum_{f_{tags} \in S} frequency(e \mid f_{tags})   (16)

where frequency(e \mid f_{tags}) denotes the annotation frequency of the Web pages whose tags field contains the term e.

Our feature space is thus constructed from statistical features and the feature unique to the social annotation sample. Given examples of target term weights paired with the corresponding features, the remaining task is to predict the relevance score of an expansion term given its features. We accomplish this with ListNet, a learning to rank approach.
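The sketch below assembles part of such a feature vector in the spirit of Eqs. (8)-(16). It reflects our own reading of the definitions: the TF/DF normalizers are taken over the whole collection of title fields, the co-occurrence count is approximated per article, and only the title-field variants plus the popularity feature are shown.

```python
import math
from collections import Counter

def title_features(term, query_terms, articles):
    """Features in the spirit of f1, f4, f7 (TF), f10, f13, f16 (DF) and f19
    (query co-occurrence) on the title field; the tag-field and whole-article
    variants are built analogously."""
    tf_counts = Counter()   # term frequency aggregated over all title fields
    df_counts = Counter()   # number of titles containing each term
    cooc_total = 0          # co-occurrences of `term` with the query terms
    for article in articles:
        counts = Counter(article["title"])        # article["title"] = token list
        tf_counts.update(counts)
        df_counts.update(counts.keys())
        if term in counts:
            cooc_total += sum(counts[q] for q in query_terms if q in counts)
    tf, df = tf_counts[term], df_counts[term]
    return [
        tf,                                   # f1:  raw TF
        tf / max(tf_counts.values()),         # f4:  TF normalized by the maximum TF
        tf / sum(tf_counts.values()),         # f7:  TF normalized by the total TF
        df,                                   # f10: raw DF
        df / max(df_counts.values()),         # f13: DF normalized by the maximum DF
        df / sum(df_counts.values()),         # f16: DF normalized by the total DF
        math.log(cooc_total / len(query_terms) + 1e-9),  # f19 (small constant avoids log 0)
    ]

def popularity(term, articles):
    """f25: total annotation frequency of the pages whose tags contain the term."""
    return sum(article["frequency"] for article in articles if term in article["tags"])
```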
4.4 ListNet

ListNet [5] is a feature-based learning to rank approach which minimizes a listwise ranking loss function based on a probability distribution over permutations. It uses result lists as instances in the learning procedure. Based on a neural network model [13], the gradient descent approach [15] is used to minimize the K-L divergence loss. ListNet outputs a linear ranking model that is used to predict the relevance score of a new object. In this paper we use this approach to learn a ranking model for term ranking.

4.5 Term Ranking Model via ListNet

We now give a general description of learning to rank a set of expansion terms according to their effectiveness. Assume that a set of queries Q = {q_1, q_2, \cdots, q_m} is given. Each query q_i is associated with a set of possibly usable terms T_i = {t_{i,1}, t_{i,2}, \cdots, t_{i,n_i}}, where t_{i,j} denotes the j-th term associated with the i-th query and n_i denotes the size of T_i. Each list of terms T_i is associated with a list of relevance judgments y_i = {y_{i,1}, y_{i,2}, \cdots, y_{i,n_i}}, where y_{i,j} denotes the extent to which the candidate term t_{i,j} is relevant to the original query q_i. The score measuring this degree is defined as follows:

y_{i,j} = \frac{\phi(q_i \cup t_{i,j}) - \phi(q_i)}{\phi(q_i)}   (17)

where φ(·) is a performance measure function; in our experiment, we use MAP as the measure. The assumption is that the higher the performance improvement observed for the combination of t_{i,j} and q_i, the stronger the relevance between them. In training, the training set can be denoted as Γ = {(x_i, y_i)}_{i=1}^{m}, where x_i = {x_{i,1}, x_{i,2}, \cdots, x_{i,n_i}} is the list of feature vectors and y_i = {y_{i,1}, y_{i,2}, \cdots, y_{i,n_i}} is the corresponding list of scores. The feature vector x_{i,j} = ψ(q_i, t_{i,j}) is created from each query-term pair (q_i, t_{i,j}), i = 1, 2, \cdots, m and j = 1, 2, \cdots, n_i. For each feature vector x_{i,j}, the ranking function f outputs a score f(x_{i,j}), so we obtain a list of scores z_i = (f(x_{i,1}), \cdots, f(x_{i,n_i})) for the list of feature vectors x_i. The objective of learning is formalized as the minimization of the total loss with respect to the training data:

\sum_{i=1}^{m} L(y_i, z_i)   (18)

where L is a loss function. In term ranking, when a new query q' and its associated terms t' are given, we construct feature vectors x' from them and use the trained ranking function to assign scores to the terms t'. Finally we select the top k terms to expand the original query. We use a learning method that optimizes a loss function based on the top-γ probability, with a neural network as the model and gradient descent as the optimization algorithm [5]. Based on the neural network model ω, we denote the ranking function as f_ω; for a feature vector x_{i,j}, f_ω(x_{i,j}) assigns a score to it. When we use cross entropy as the metric, the loss function becomes:

L(y_i, z_i(f_\omega)) = - \sum_{\forall g \in \xi_\gamma} P_{y_i}(g) \log P_{z_i(f_\omega)}(g)   (19)

P_s(\xi_\gamma(j_1, j_2, \cdots, j_\gamma)) = \prod_{z=1}^{\gamma} \frac{\exp(s_{j_z})}{\sum_{l=z}^{n_i} \exp(s_{j_l})}   (20)

P_{z_i(f_\omega)}(\xi_\gamma(j_1, j_2, \cdots, j_\gamma)) = \prod_{z=1}^{\gamma} \frac{\exp(f_\omega(x_{i,j_z}))}{\sum_{l=z}^{n_i} \exp(f_\omega(x_{i,j_l}))}   (21)

In our experiments, we implemented ListNet with γ = 1. With some derivation, when γ = 1 we have:

\Delta\omega = \frac{\partial L(y_i, z_i(f_\omega))}{\partial \omega} = - \sum_{j=1}^{n_i} P_{y_i}(x_{i,j}) \frac{\partial f_\omega(x_{i,j})}{\partial \omega} + \frac{1}{\sum_{j=1}^{n_i} \exp(f_\omega(x_{i,j}))} \sum_{j=1}^{n_i} \exp(f_\omega(x_{i,j})) \frac{\partial f_\omega(x_{i,j})}{\partial \omega}   (22)

For simplicity, we use a linear neural network model in our experiments.
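A compact sketch of the γ = 1 training loop for a linear scoring function, following Eqs. (19)-(22); the learning rate, epoch count and data layout are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def softmax(s):
    s = s - s.max()                      # numerical stability
    e = np.exp(s)
    return e / e.sum()

def train_listnet_top1(term_lists, n_features, lr=0.01, epochs=100):
    """term_lists: list of (X, y) pairs, one per query, where X is an
    (n_terms x n_features) matrix and y holds the chg-based target scores.
    Returns the weight vector of a linear ranking function f_w(x) = w . x."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for X, y in term_lists:
            p_target = softmax(np.asarray(y, dtype=float))   # top-1 prob. from labels
            p_model = softmax(X @ w)                          # top-1 prob. from scores
            # Gradient of the cross-entropy loss of Eq. (19) w.r.t. w (cf. Eq. (22)):
            grad = X.T @ (p_model - p_target)
            w -= lr * grad
        # (a real run would monitor the loss on held-out queries here)
    return w

def rank_terms(w, X, terms, k=50):
    """Score candidate terms and return the top-k for query expansion."""
    scores = X @ w
    order = np.argsort(-scores)
    return [terms[i] for i in order[:k]]
```

With γ = 1 the permutation probabilities reduce to a softmax over the list, so the update is simply the difference between the model's and the target's top-one distributions, weighted by the feature vectors.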
5. EXPERIMENTS

5.1 Experimental Settings

We used three standard TREC collections in our experiments: AP88-90 (Associated Press), WSJ87-90 (Wall Street Journal), and Robust2004 (the dataset of the TREC Robust Track started in 2003). Table 4 shows the details of these collections. For each dataset, we split the topics into three parts: training data for the rank learner, development data for estimating the parameters, and test data. Retrieval effectiveness is measured in terms of Mean Average Precision (MAP) over the top 1000 documents. When an original query Q is given, a set of M candidate expansion terms is selected from the social annotation sample by the term-dependency method described in Section 4.1. According to their potential impact on retrieval effectiveness, the ranking model re-ranks the M terms to form a new term ranking list. Once the ranking list is obtained, the top k terms are selected to form an expansion query Qexp. In the experiments, the Indri 2.6 search engine [24] is used as our basic retrieval system.

Table 4: Statistics of Evaluation Datasets
Collection  | #Docs   | Train Topics | Dev. Topics | Test Topics
AP          | 242,198 | 51-100       | 101-150     | 151-200
WSJ         | 173,252 | 51-100       | 101-150     | 151-200
Robust 2004 | 528,155 | 301-350      | 351-400     | 401-450
5.2 Baseline Models

In the experiments we select three baseline models: the query-likelihood language model (QL), Lavrenko's relevance model (RM) [12] implemented in Indri, and the query expanded by the relevance model using oracle expansion terms selected from the documents (RM+Oracle). The relevance model RM retrieves a set of N documents and forms an expanded query Qexp by adding the top k most likely terms. Note that in the RM+Oracle model, we select the top k expansion terms with high chg(e) according to Eq. (4) for each query. The expanded query is formed with the following structure:

#weight( \lambda_{fb} Q_{ori} (1.0 - \lambda_{fb}) Q_{exp} )   (23)

For the RM and RM+Oracle methods, we fixed the parameters as follows: N = 10, k = 50 and λ_fb = 0.5, since this setting gives relatively good performance in general. Tables 5 and 6 respectively show the performance (MAP) for all the topics and for the test topics on the three TREC collections. Comparing Tables 5 and 6, the improvements over the QL model for all the topics and for the test topics are at the same level; to facilitate comparison with the term ranking model, we only use the test topics in the retrieval experiments. As can be seen from Table 6, the relevance model significantly improves retrieval performance over the query-likelihood language model. Comparing the results of the RM and RM+Oracle methods, the retrieval effectiveness can be much improved if the oracle expansion terms are added. This shows the usefulness of correctly selecting the expansion terms with high potential of improving retrieval effectiveness. The MAP of the RM+Oracle method represents the upper bound of the retrieval effectiveness we can expect to obtain using expansion terms extracted from pseudo-relevant documents. These three models serve as the baselines for the following experiments.

Table 5: Performance comparisons of baseline models for all the topics on AP, WSJ, Robust2004 collections
Method | AP     | WSJ    | Robust 2004
QL     | 0.2201 | 0.2270 | 0.2214
RM     | 0.2763 | 0.3272 | 0.2423

Table 6: Performance comparisons of baseline models for all the test topics on AP, WSJ, Robust2004 collections
Method    | AP     | WSJ    | Robust 2004
QL        | 0.2111 | 0.3364 | 0.2462
RM        | 0.2771 | 0.3906 | 0.2705
RM+Oracle | 0.3201 | 0.4196 | 0.3153
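As an illustration of how the expanded query of Eq. (23) could be issued to Indri, the helper below emits a nested #weight query; wrapping Qori in #combine and Qexp in an inner #weight is our assumption about the concrete query form, and the example terms are made up.

```python
def build_expanded_query(original_terms, weighted_expansion_terms, lam_fb=0.5):
    """Eq. (23): interpolate the original query and the expansion terms.
    weighted_expansion_terms: list of (term, normalized_weight) pairs."""
    q_ori = "#combine( " + " ".join(original_terms) + " )"
    q_exp = "#weight( " + " ".join(f"{w:.4f} {t}" for t, w in weighted_expansion_terms) + " )"
    return f"#weight( {lam_fb} {q_ori} {1.0 - lam_fb} {q_exp} )"

# Example (illustrative terms only):
# build_expanded_query(["electric", "car", "development"],
#                      [("ev", 0.4), ("auto", 0.35), ("energy", 0.25)])
```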
6. EVALUATION OF QUERY EXPANSION METHODS BASED ON SOCIAL ANNOTATION COLLECTION

We now turn our attention to our proposed methods, which use the social annotation collection (SA) as the resource for query expansion. In order to extract effective expansion terms from the social annotation collection, we propose an unsupervised method and a supervised learning model. For the unsupervised method, we investigate two term dependence assumptions: full independence and sequential dependence. In Sections 6.1 and 6.2, we evaluate the performance of the unsupervised method and also demonstrate that the social annotation collection can be a good resource for extracting useful expansion terms. For the supervised learning model, we examine the quality of the term ranking model, and then investigate its performance for the query expansion task.
6.1 Performance of the Method based on Full Independence

We first evaluate the performance of the method based on full independence (SA+FI). Note that the SA+FI method selects the top k terms with the highest co-occurrence according to Eq. (3). From Table 7, we can see that the SA+FI method enhances retrieval performance over the QL model on all collections, which indicates that the expansion terms provided by social annotations are closely related to the query. However, the SA+FI method does not work as well as the RM model. Although social annotations may produce more good terms with the SA+FI method, more bad terms are also introduced into the query, which is the most likely reason for the weaker performance of the SA+FI method. The SA+FI method assumes the query terms are independent of each other; under this assumption, some irrelevant terms may be selected. For example, on AP, Topic 159 ("electric car development") scores a mean average precision (MAP) of 0.3100 with the QL model, and with the SA+FI method the MAP of Topic 159 increases to 0.3657. Using this method, the top-ranked expansion terms of Topic 159 are "car", "electric", "development", "program", "energy", "auto", "tool", "design", "green" and "ev". Although the MAP increases after adding the expansion terms, some irrelevant terms (e.g. "program", "tool", and "design") may also hurt retrieval performance. To examine the possible impact of the irrelevant terms, we remove them from the set of expansion terms manually and add the same number of terms to Topic 159. After this processing, the MAP increases to 0.4101. We can see that retrieval performance can be much improved if fewer irrelevant terms are added. Our problem now is to develop an effective method to extract good terms from the social annotation collection.

Table 7: Performance comparisons of SA+FI, SA+TD, SA+TDW and SA+Oracle for all the test topics on AP, WSJ, Robust2004 collections
Method    | AP     | WSJ    | Robust 2004
SA+FI     | 0.2380 | 0.3564 | 0.2515
SA+TD     | 0.2562 | 0.3767 | 0.2685
SA+TDW    | 0.2707 | 0.3885 | 0.2776
SA+Oracle | 0.3537 | 0.4470 | 0.3324
6.2 Performance of the Method based on Sequential Dependence

In Section 4.1, we take into account the dependencies between query terms and propose a term-dependency method (SA+TD) to select more relevant terms for the original queries. The parameter λ in Eq. (6) controls the impact of term dependency in the term selection phase; when λ = 0, the term selection method reduces to the SA+FI method. Figure 1 shows the results of assigning different weights on the three TREC collections. As can be seen in Figure 1, the performance is improved when λ ≤ 0.6; on the other hand, when λ > 0.5, the MAP of the SA+TD method is lower than that of the SA+FI method. This means that dependencies between query terms are useful in the term selection process, but performance is hurt if too much term-dependency information is used. Finally, we set the parameter λ = 0.2.

In the SA+FI and SA+TD methods, the expansion terms are added into the original query with the same weight; that is, all the expansion terms are treated as equally important. In fact, the co-occurrence score of an expansion term calculated by Eq. (6) may reflect its importance to the query Q. In the SA+TDW method, we therefore obtain the top k terms for query expansion and normalize their weights before adding them to the original queries. In addition, similar to the RM+Oracle method, we select the top k terms with high chg(e) extracted from the social annotation collection to add to the original query; the results of this method (SA+Oracle) represent the upper bound of the retrieval effectiveness we can expect using expansion terms extracted from the social annotation collection. As we can see from Table 7, both the SA+TD and SA+TDW methods achieve better performance than the SA+FI method: the term-dependency method can effectively select relevant terms for the original query. Of the two term-dependency methods, the SA+TDW method performs much better; for this method, the top k terms with high co-occurrence scores are selected to expand the query, which shows that the co-occurrence score can reflect the importance of an expansion term. Using the oracle expansion terms extracted from the social annotation collection, the performance is also much better than with the other methods. Comparing Tables 6 and 7, we can see that the SA+FI and SA+TD methods do not work as well as the RM model. On AP and WSJ, the SA+TDW method achieves lower performance than the RM model, but on Robust2004 it improves over the RM model. In a well-tuned relevance model (RM), the expansion terms are weighted according to the distribution from which they are sampled in the collection; that is one reason the SA+TD method performs weakly. Comparing the results of the SA+Oracle and RM+Oracle methods, the SA+Oracle method achieves better performance, which indicates that the social annotation collection is a good resource for query expansion. Based on these results, refining the expansion term selection process could be expected to further improve retrieval performance. Our goal now is to develop an effective method to select more relevant expansion terms.

Figure 1: Performance on different weights for SA+TD on AP, WSJ and Robust2004 collections. (MAP vs. weight for term dependency λ.)

6.3 Term Ranking Experiments

Let us now examine the quality of our term ranking model. For ranking performance evaluation, there are many IR evaluation measures, such as Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP). In our experiments, we pay more attention to the relevance of the top k terms; therefore, we adopt P@n as the main measure for evaluating the ranking model. P@n is defined as:

P@n = \frac{|\text{relevant terms}|}{n}   (24)

i.e., the proportion of relevant terms in the top-n result list. P@n is sensitive to the number of relevant terms returned, and thus is helpful for measuring the precision gain of the ranking model. In our experiment, we use three TREC collections. For each collection, we use 50 queries to train the ranking learner. In the training set, each query has a term list that ranks 150 expansion terms according to chg(e), which reflects each term's impact on retrieval effectiveness. To evaluate the performance of the ranking learner, we label the expansion terms in the development dataset with two levels of relevance judgments (relevant or irrelevant) according to the method described in Section 4.2. In the test dataset, we rank the expansion terms by the co-occurrence score calculated by Eq. (6); this term list is named TermList-TD. The TermList-TD list is then re-ranked by the ranking learner to form a new term list (TermList-Learning). Finally, we represent each expansion term with the features described in Section 4.3.

Table 8 shows the performance of our ranking learner on the different collections. As mentioned before, an expansion term list (TermList-TD) is obtained according to the co-occurrence score using the term-dependency method; the SA+TD method selects the top k terms from this list to expand the original query. Our ranking model learns to re-rank the term list produced by the term-dependency method to form a new term list (TermList-Learning). From Table 8, the ranking learner outperforms the term-dependency method for effective expansion term selection. On the AP and Robust2004 datasets, the ranking learner works much better than it does on the WSJ dataset. Table 3 shows that the proportion of good terms on WSJ is lower than on the other two datasets, which suggests that the number of relevant terms in the term list produced by the term-dependency method is also lower; this is one reason the performance on WSJ is not improved as much as on the other two datasets. In the next section, we investigate the usefulness of the expansion terms selected from the new re-ranked lists (TermList-Learning) for query expansion.

Table 8: Term ranking accuracies in terms of P@n on AP, WSJ and Robust2004 collections
Collection | Setting            | P@10   | P@20   | P@30
AP         | TermList-TD        | 0.3796 | 0.3724 | 0.3653
AP         | TermList-Learning  | 0.4041 | 0.3888 | 0.3952
WSJ        | TermList-TD        | 0.3880 | 0.3560 | 0.3500
WSJ        | TermList-Learning  | 0.3920 | 0.3600 | 0.3647
Robust2004 | TermList-TD        | 0.3612 | 0.3306 | 0.3211
Robust2004 | TermList-Learning  | 0.4061 | 0.3571 | 0.3524
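A minimal sketch of the P@n computation in Eq. (24), where relevant is assumed to be the set of development-set terms labeled 1:

```python
def precision_at_n(ranked_terms, relevant, n):
    """Eq. (24): fraction of the top-n ranked terms that are labeled relevant."""
    top = ranked_terms[:n]
    return sum(1 for t in top if t in relevant) / float(n)

# Example comparison for one query (as summarized in Table 8):
# precision_at_n(termlist_td, relevant, 10) vs. precision_at_n(termlist_learning, relevant, 10)
```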
6.4 Performance of Term Ranking Model for Query Expansion

As described in Section 4.5, given a test query, a list of possible expansion terms is obtained using the term-dependency method, and the ranking learner then re-ranks the term list according to the terms' impact on retrieval effectiveness. The term ranking process thus performs a further selection of the expansion terms proposed by the term-dependency method. After re-ranking, we use the top-ranked terms in the new re-ranked list as the final expansion terms. We consider two possible ways to use the term ranking results. The first method (SA+Learning) adds the top-ranked 50 terms to the original queries with the same weight. In our term ranking experiments, we are interested in knowing not only whether a term is good, but also the extent to which it is good. The ranking learner assigns a score to every expansion term, and this score is useful for setting the weight of an expansion term in the final query model. Therefore, the second method (SA+LearningW) adds the top 50 terms with different weights; note that the weights are normalized before being interpolated with the original query model. Table 9 shows the results obtained with both methods.

From this table, we can see that both methods improve the effectiveness. Although on WSJ and Robust2004 the improvements with the SA+LearningW method are smaller, they are steady across all three collections. Comparing Tables 7 and 9, the improvements with the SA+Learning and SA+LearningW methods are statistically significant. Our explanation is that, since the term ranking learner performs better (as shown in Table 8), some top-ranked expansion terms can improve the performance significantly; in other words, the relevant expansion terms are re-ranked to higher positions in the list, and the learner is able to recommend the most likely terms for the original queries. Comparing Tables 6 and 9, the SA+Learning and SA+LearningW methods outperform the RM model, which shows that the expansion terms selected from the re-ranked term lists are more relevant than those extracted from pseudo-relevant documents. Although the SA+LearningW method does not work as well as the RM+Oracle method on the AP and WSJ datasets, the gap between them is small; on the Robust2004 dataset, the SA+LearningW method even performs better than the RM+Oracle method. Our explanation is that, for the RM and RM+Oracle methods, the improvement over the QL model on the Robust2004 dataset is not as significant as on the AP and WSJ datasets, which means the pseudo-relevant documents produce more noisy expansion terms for the RM model. This influence can be reduced by using the social annotation collection as the resource of expansion terms, though it remains a difficult task to design an effective method for mining relevant terms from this new resource. These experiments show that the term ranking model is able to select more relevant terms for the original queries.

Table 9: Performance comparisons of SA+Learning and SA+LearningW for all the test topics on AP, WSJ, Robust2004 collections
Method       | AP     | WSJ    | Robust 2004
SA+Learning  | 0.2935 | 0.4103 | 0.3033
SA+LearningW | 0.3325 | 0.4145 | 0.3163
6.5 Parameter Selection

Both the RM and the SA methods have the parameters k and λ_fb. We tested these methods with 10 different values of k (the number of expansion terms): 10, 20, ..., 100. For λ_fb (the weight of the original query), we tested 9 different values: 0.1, 0.2, ..., 0.9. Figure 2 shows the performance with different weights for the original query on the AP, WSJ and Robust2004 collections. When λ_fb is smaller than 0.6, the performance remains steady; however, as λ_fb continues to increase, the performance drops sharply. We conjecture that the main reason is that the impact of the expansion terms on retrieval effectiveness decreases as λ_fb increases; from another perspective, this also confirms the usefulness of the expansion terms. Figure 3 shows the sensitivity of SA+TD to k. When k is around 50, we achieve the best performance on all collections, and using more expansion terms (k > 50) does not change the performance markedly. For efficiency, we suggest choosing the top 50 terms for query expansion in our proposed method. The results for the other methods are similar, and setting k = 50 and λ_fb = 0.5 works best among all the values tested.

Figure 2: Performance on different weights for SA+TD on AP, WSJ and Robust2004 collections, k = 50. (MAP vs. weight for the original query λ_fb.)

Figure 3: Sensitivity of SA+TD to parameter k on AP, WSJ and Robust2004 collections, λ_fb = 0.5. (MAP vs. number of expansion terms.)
7. CONCLUSION

In this paper, we have explored the potential of social annotation as a new resource for query expansion. For TREC topics, we measure the importance of expansion terms selected from the social annotation sample according to their impact on retrieval performance. Our experiments show that, even with a simple statistical method, the expansion terms extracted from social annotations are significantly better than those from the feedback documents. In the expansion term selection phase, in addition to the full independence assumption, the sequential dependence of query terms is taken into account, which captures more relevant terms that are beneficial to retrieval performance. Another contribution of this paper is a machine learning approach for query expansion, which learns to rank a set of expansion terms according to their impact on retrieval performance. We also show that our ranking approach works satisfactorily on the different TREC collections.

This study suggests several interesting avenues for future investigation. For the term selection method, the full dependence variant of query terms is not used because of its weak performance in our experiments; in fact, query context factors could bring significant improvements in retrieval effectiveness, which means that there is still much room to improve retrieval performance. For the term ranking model, we plan to explore more distinctive features of the social annotation sample for the learning process. We leave these limitations as future work.

8. ACKNOWLEDGMENTS

This work is supported by grants from the Natural Science Foundation of China (No. 60673039 and 60973068), the National High Tech Research and Development Plan of China (No. 2006AA01Z151), the National Social Science Foundation of China (No. 08BTQ025), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, and the Research Fund for the Doctoral Program of Higher Education (No. 20090041110002).

9. REFERENCES
[1] J. Bai, J. Y. Nie, H. Bouchard, and G. Cao. Using query contexts in information retrieval. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 15–22. ACM, 2007.
[2] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su. Optimizing web search using social annotations. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 501–510. ACM, 2007.
[3] C. Buckley, G. Salton, J. Allen, and A. Singhal. Automatic query expansion using SMART: TREC 3. In the 3rd Text REtrieval Conference (TREC 1994), 1994.
[4] G. Cao, J. Y. Nie, and S. Robertson. Selecting good expansion terms for pseudo-relevance feedback. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 243–250. ACM, 2008.
[5] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007.
[6] M. Carman, M. Baillie, R. Gwadera, and F. Crestani. A statistical comparison of tag and query logs. In SIGIR '09: Proceedings of the 32nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 123–130. ACM, 2009.
[7] K. Collins-Thompson and J. Callan. Query expansion using random walk models. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 704–711. ACM, 2005.
[8] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. A framework for selective query expansion. In CIKM '04: Proceedings of the 13th ACM conference on Information and knowledge management, pages 236–237. ACM, 2004.
[9] H. Cui, J. R. Wen, J. Y. Nie, and W. Y. Ma. Query expansion by mining user logs. IEEE Transactions on Knowledge and Data Engineering, 15(4):829–839, 2003.
[10] P. Heymann, G. Koutrika, and H. Garcia-Molina. Can social bookmarks improve web search? In WSDM '08: Proceedings of the First ACM International Conference on Web Search and Data Mining, pages 195–206. ACM, 2008.
[11] A. Hotho, R. Jaschke, C. Schmitz, and G. Stumme. Information retrieval in folksonomies: Search and ranking. In ESWC '06: Proceedings of the 3rd European Semantic Web Conference, pages 411–426, 2006.
[12] V. Lavrenko and W. B. Croft. Relevance based language models. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 120–127. ACM, 2001.
[13] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Müller. Efficient backprop. Neural Networks: Tricks of the Trade, pages 9–50, 1998.
[14] K. S. Lee, W. B. Croft, and J. Allan. A cluster-based resampling method for pseudo-relevance feedback. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 235–242. ACM, 2008.
[15] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In NIPS '00: Proceedings of the 14th Neural Information Processing Systems, pages 512–518, 2000.
[16] A. Mathes. Folksonomies - cooperative classification and communication through shared metadata. 2004.
[17] D. Metzler and W. Croft. A markov random field model for term dependencies. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 472–479. ACM, 2005.
[18] D. Metzler and W. B. Croft. Latent concept expansion using markov random fields. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 311–318. ACM, 2007.
[19] P. Mika. Ontologies are us: A unified model of social networks and semantics. In ISWC '05: Proceedings of the 4th International Semantic Web Conference, pages 522–536, 2005.
[20] S. E. Robertson, S. Walker, M. Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In the 4th Text REtrieval Conference (TREC 1996), 1996.
[21] J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313–323, 1971.
[22] Y. Song, Z. Zhuang, H. Li, Q. Zhao, J. Li, W. C. Lee, and C. L. Giles. Real-time automatic tag recommendation. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 515–522. ACM, 2008.
[23] M. Srikanth and R. Srihari. Biterm language models for document retrieval. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 425–426. ACM, 2002.
[24] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language model-based search engine for complex queries. In International Conference on Intelligence Analysis, 2004.
[25] T. Tao and C. Zhai. Regularized estimation of mixture models for robust pseudo-relevance feedback. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 162–169. ACM, 2006.
[26] X. Wu, L. Zhang, and Y. Yu. Exploring social annotations for the semantic web. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 417–426. ACM, 2006.
[27] J. Xu and W. Croft. Query expansion using local and global document analysis. In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 4–11. ACM, 1996.
[28] S. Xu, S. Bao, B. Fei, Z. Su, and Y. Yu. Exploring folksonomy for personalized search. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 155–162. ACM, 2008.
[29] Y. Xu, J. F. Jones, and B. Wang. Query dependent pseudo-relevance feedback based on wikipedia. In SIGIR '09: Proceedings of the 32nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 59–66. ACM, 2009.
[30] Z. Ye, J. Huang, S. Jin, and H. F. Lin. Exploring social annotation tags to enhance information retrieval performance. In AMT '10: Proceedings of the 6th International Conference on Active Media Technology, pages 255–266, 2010.
[31] D. Yeung, C. Clarke, G. Cormack, T. Lynam, and E. Terra. Model-based feedback in the language modeling approach to information retrieval. In the 12th Text REtrieval Conference (TREC 2003), pages 810–819, 2003.
[32] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM '01: Proceedings of the 10th ACM conference on Information and knowledge management, pages 403–410. ACM, 2001.