User Biased Document Language Modelling ∗
L. Azzopardi
M. Girolami
C. J. van Rijsbergen
School of Computing University of Paisley
Dept. of Computing Science University of Glasgow
Dept. of Computing Science University of Glasgow
[email protected]
[email protected]
[email protected]
ABSTRACT Capitalizing on the intuitive underlying assumptions of Language Modelling for Ad-Hoc Retrieval, we present a novel approach that is capable of injecting the user's context of the document collection into the retrieval process. The preliminary findings from the evaluation undertaken suggest that improved IR performance is possible under certain circumstances. This motivates further investigation to determine the extent and significance of this improvement.
Categories and Subject Descriptors I.2.7 [Artificial Intelligence]: Natural Language Processing—Language Models; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval —Retrieval Models
General Terms Experimentation
Keywords Information Retrieval, Context, User Biased, Language Model
1. INTRODUCTION
A key challenge in Information Retrieval (IR) is the principled utilization of contextual evidence within the ad-hoc retrieval process. We investigate the Language Modelling (LM) framework for this purpose, as it provides an intuitive means of encoding contextual evidence in a manner thoroughly consistent with its underlying assumptions. In Language Modelling for IR, ad-hoc retrieval is viewed as the problem of predicting the likelihood of a user's query given the estimated document language model [2, 4]. This can be expressed mathematically as p(q|θd) = ∏t∈q p(t|θd)^n(t,d), where p(t|θd) is the probability of a term given the document model θd and n(t, d) is the number of times term t occurs in document d. The approach relies on two main assumptions [2]: (1) when a user formulates a query, they will choose terms that
are good discriminators, and (2) the terms they choose actually are good discriminators. That is, the terms are good discriminators because they are the terms most prevalent in relevant documents. Implicitly, this entails that a user has an understanding of the distribution of terms within documents in the collection. It is therefore paramount that the problem of estimating document models is taken seriously. Typically, estimation of the document model relies on 'backing off' to the collection probabilities [2, 4]. We posit that a user biased topical structure imposed on the collection can provide richer a priori knowledge with which to estimate the document models. This is consistent with the assumptions of the model and should result in superior IR performance, because the document models then reflect the user's understanding of the collection and of the distribution of terms within documents. Note that when we speak of user bias, we refer to the shared understanding of the document collection among a group of users; this could be manifested as a predefined ontology or created through a process of interaction.
Our proposal for creating user biased (UB) document models for ad-hoc retrieval within the LM framework is as follows. The documents in the collection are grouped according to the user group's context. These groupings are used to create a multinomial term distribution associated with each group. The association of a document with a group allows the creation of a UB term prior for each document, which can be used in the construction of the document model. In this preliminary evaluation, we provide a novel implementation for generating UB document models and determine whether improved IR performance is achievable.
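The construction just described (domain groupings, a multinomial term distribution per domain, and a per-document UB prior mixed into a smoothed document model) can be sketched as follows. This is a minimal illustration only: the toy documents and the hand-picked domain memberships p(k|d) are hypothetical, whereas in our approach p(k|d) comes from a classifier over the collection.

```python
from collections import Counter

# Toy collection; in practice this would be the indexed WSJ/AP documents.
docs = {
    "d1": "oil prices rise as markets react to supply fears".split(),
    "d2": "parliament debates new trade law amid economic policy row".split(),
    "d3": "markets rally after economic policy shift lifts oil stocks".split(),
}

# p(k|d): distribution over domains for each document.
# Hypothetical values; the paper derives these from a BIM classifier via Bayes' theorem.
p_k_given_d = {
    "d1": {"economics": 0.8, "politics": 0.2},
    "d2": {"economics": 0.3, "politics": 0.7},
    "d3": {"economics": 0.9, "politics": 0.1},
}

def p_t_given_d(term, doc):
    # Maximum-likelihood term probability: p(t|d) = n(t,d) / sum_t n(t,d).
    counts = Counter(docs[doc])
    return counts[term] / sum(counts.values())

def p_t_given_k(term, k, n_domains=2):
    # p(t|k) = sum_d p(t|d) p(k|d) p(d) / p(k),
    # with uniform p(d) = 1/|D| and p(k) = 1/|K|, as in the estimation described later.
    p_d = 1.0 / len(docs)
    p_k = 1.0 / n_domains
    return sum(p_t_given_d(term, d) * p_k_given_d[d].get(k, 0.0) * p_d
               for d in docs) / p_k

def p_ub(term, doc):
    # User Biased document prior: p_ub(t|d) = sum_k p(k|d) p(t|k).
    return sum(w * p_t_given_k(term, k) for k, w in p_k_given_d[doc].items())

def score(query, doc, lam=0.5):
    # Jelinek-Mercer smoothing with the UB prior in place of the collection prior:
    # p(t|theta_d) = (1 - lam) p(t|d) + lam p_ub(t|d), multiplied over query terms.
    s = 1.0
    for t in query:
        s *= (1 - lam) * p_t_given_d(t, doc) + lam * p_ub(t, doc)
    return s

query = "oil markets".split()
ranking = sorted(docs, key=lambda d: score(query, d), reverse=True)
```

Note that a document containing no query terms still receives a non-zero score through the UB prior, since terms prevalent in its domains leak into its model; this is exactly the mechanism by which the imposed topical structure biases retrieval.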
2. EMPIRICAL EVALUATION
The Wall Street Journal (WSJ) and Associated Press (AP) collections were indexed by taking approximately 40,000 documents from each collection, removing standard stop words, applying stemming, and removing infrequent terms. We simulated the imposed user biased structure on the collection as follows. The filtering tracks from TREC 1 and 2 (Topics 1–100) were compressed into nine domains according to the 'Domain:' tag; examples of such domains include US Economics, Military, International Relations, Law and Enforcement, and Politics. The relevance judgements from each of the topics associated with a domain were pooled and used to classify whether a document belonged to that domain using a Binary Independence Model (BIM) classifier [3]. Any documents that were not assigned to a domain were placed in a miscellaneous domain. As the number of relevant documents for each domain was limited, we performed a number of classifications in which we restricted the number of domains from k = 10 down to k = 2, removing those with the fewest relevant documents at each step.
To create a multinomial term distribution for each domain k, we first estimated the probability of a domain given a document, p(k|d), by applying Bayes' theorem to the p(d|k) obtained from the BIM classifier. This enabled the estimation of the probability of a term given a domain:

    p(t|k) = Σd p(t|d) p(k|d) p(d) / p(k),

where p(t|d) = n(t,d) / Σt n(t,d), p(d) = 1/|D| with |D| the number of documents, and p(k) = 1/|K| with |K| the number of domains. A document is therefore characterized as a distribution over domains, p(k|d), and each domain is characterized as a distribution over terms, p(t|k). This is effectively an Aspect Model [1], where the User Biased document prior is equivalent to pub(t|d) = Σk p(k|d) p(t|k). This estimate encodes the user's understanding of the terms that they would expect to see in documents that discuss the various domains of which the document is comprised. Jelinek-Mercer Smoothing (JM) and Bayes Smoothing (BS) were employed as standard, using the probability of a term given the collection, pc(t), as in [4]. To encode our contextual evidence within the LM framework, we replaced the standard prior pc(t) with the user biased prior pub(t|d). For each smoothing method, the standard versus the user biased prior was compared across a set of hyper-parameters, a number of domains, and two types of queries (LONG and SHORT).

[Figure 1: Percentage change of mean Average Precision over the standard prior across the hyper-parameter β (0 to 5000) for Bayes Smoothing on the AP collection, for k = 2, 3, 5; one panel shows short queries (positive change) and the other long queries (negative change).]

Table 1: mean Average Precision

Method | k   | WSJ-S | WSJ-L | AP-S  | AP-L
JM     | Std | 25.43 | 39.03 | 21.19 | 32.34
JM     | 2   | 26.20 | 38.33 | 22.26 | 31.34
JM     | 3   | 26.25 | 38.44 | 22.28 | 30.94
JM     | 6   | 26.19 | 36.08 | 22.31 | 30.79
BS     | Std | 29.21 | 38.95 | 28.22 | 39.47
BS     | 2   | 29.83 | 37.95 | 29.00 | 37.97
BS     | 3   | 29.81 | 38.31 | 29.20 | 37.26
BS     | 6   | 29.26 | 35.06 | 29.11 | 36.89

3. DISCUSSION AND CONCLUSION
Using the UB prior consistently improved IR performance for short queries, but resulted in poorer IR performance for long queries relative to the standard models. This held regardless of the smoothing method (JM or BS) and the collection (WSJ or AP); see Table 1. The prior pc(t), which provides a weighting scheme akin to inverse document frequency [4], seems better suited to the long queries, as they contain many common terms. The UB prior appears to introduce too much bias with respect to these terms, so that an over- or under-representation of these common terms in a domain will dramatically influence the ranking of a document. This concern was noted in [2]; it may, however, be remedied by using a two stage language model [5]. Figure 1 provides a representative example of our findings: as more smoothing is employed, the relative performance of the UB model for long queries decreases dramatically, confirming this intuition. For short queries, the relative performance is positive, but it also decreases as more smoothing is employed.
To determine whether there was any significant benefit from the UB model, we performed a query-wise comparison using a Wilcoxon Rank Sum test (α = 0.05). This indicated that the difference between the UB models and the standard models was not significant, regardless of the type of smoothing (JM or BS), the type of query (SHORT or LONG), the number of user defined topics, or the collection considered. This is quite a startling result. Further analysis showed that the difference in Average Precision between the standard and UB models ranged from +40 to −40 percent on a query-by-query basis, suggesting that for particular queries the UB model outperforms the standard model, and vice versa. Further work is required to determine whether there are properties of a query that would enable the appropriate prior to be selected so as to maximize IR performance.
This study has provided a novel implementation for encoding user knowledge within a Language Modelling framework. However, the results are far from conclusive, and further investigation is required to determine whether consistently improved IR performance is achievable.

4. REFERENCES
[1] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd ACM SIGIR Conference, pages 50–57, Berkeley, CA, 1999.
[2] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st ACM SIGIR Conference, pages 275–281, Melbourne, Australia, 1998.
[3] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management, 36:779–808, 2000.
[4] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th ACM SIGIR Conference, pages 334–342, New Orleans, LA, 2001.
[5] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Proceedings of the 25th ACM SIGIR Conference, pages 49–56, Tampere, Finland, 2002.

∗ This author is partially supported by Memex Technology Ltd. (www.memex.com). Copyright is held by the author/owner. SIGIR'04, July 25–29, 2004, Sheffield, South Yorkshire, UK. ACM 1-58113-881-4/04/0007.