Comparison of Information Retrieval Models for Question Answering

Jasmina Armenska
University Sts. Cyril and Methodius
Faculty of Computer Science and Engineering
Rudjer Boshkovikj 16, 1000 Skopje
[email protected]

Katerina Zdravkova
University Sts. Cyril and Methodius
Faculty of Computer Science and Engineering
Rudjer Boshkovikj 16, 1000 Skopje
[email protected]

ABSTRACT
Question Answering Systems (QAS) are an emerging research topic, triggered and at the same time stimulated by the immense amount of text available in digital form. As the quantity of natural language information increases, new methods that precisely retrieve the exact information from massive textual databases become inevitable. Although QAS have already been well explored, there are still many aspects to be solved, particularly those which are language specific. The main goal of the research presented in this paper was to compare three proven information retrieval (IR) methods in order to accurately determine the relevant document containing the correct answer to a question. This was accomplished using a real-life corpus of lectures and related questions existing in our e-testing system. In order to compare the results, a small system capable of learning the correct answer was produced. It revealed that the modified vector space model was the most suitable for our collection. The results we obtained were promising and they encouraged us to either adapt existing models to our goal, or even to invent new, more appropriate IR models.

1. Introduction
Impressive data storage capabilities demand new tools and techniques to access the required information. Information retrieval develops methods for automatically searching enormous quantities of unstructured data for the documents relevant to user requirements. In 1945, Bush [5] presented the idea of automatic access to huge amounts of stored data and thereby initiated the idea of information retrieval. One of the most influential methods so far was offered by Luhn [14] in 1957. He proposed the extraction of words for indexing and their matching as the search criterion. The effectiveness of natural-language retrieval models developed in the 1970s and 1980s was first confirmed at the Text REtrieval Conferences (TREC) [24]. Evaluation with TREC collections modified old and proposed new IR techniques. The first IR algorithms presented at these conferences were used to initially search the World Wide Web, which itself increased the awareness that search engines should be optimized for all languages. At the same time, user demands became more and more specific, usually expressed as questions rather than as strings of keywords. This fact implied the need for new techniques enabling easier and more focused access to information, bringing up the first question answering systems [15] capable of “understanding” the question and offering the answer within a list of corresponding sources. By the end of the 1990s, QA problems became an integrated topic within TREC and supported the development of QAS [3]. In 2003, Monz claimed that the development of document retrieval

and question answering was a result of the incredibly fast expansion of the World Wide Web together with users' demands to access information quickly and straightforwardly [18]. Nowadays, QAS are strongly supported by natural language processing techniques. Undoubtedly, the most fascinating success in the area is the famous IBM Watson [11], a computer system developed during the DeepQA project [12]. Even though the system uses millions of pages of structured and unstructured content from different sources, it processes them almost instantaneously. There are several textual question answering systems that cover a wide range of different techniques and architectures. According to Monz [18], most of them share common features: question analysis; retrieval of relevant documents; analysis of the selected documents; and finally, selection of the answer the system finds most appropriate. Each of these components has an important influence on the final accomplishment of the IR task. For example, too many retrieved documents increase the effort of further document analysis, while the selection of too few documents can make it impossible to find the correct answer, even though it exists in the collection of other documents. Therefore, it is crucial to identify exactly those documents likely enough to contain the answer to a given question. The goal of the research presented in this paper was to judge the usefulness of three well-known models for information retrieval in the process of selecting the document that contains the correct answer. In the next section, we present these models. The third section is dedicated to the description of the test collection used to evaluate them. The results of discovering the relevant document are presented in the fourth section. At the end, we present further research directed towards model implementation in the process of determining the correct answer, as well as the final goal, which is to propose new metrics corresponding to our collection of lectures and our real-life pool of questions and answers.

2. Information Retrieval Models
Information retrieval models (IR models) have a long history, and many approaches and methodologies have been developed. Implemented systems strive to retrieve the most relevant documents for an arbitrary query the user defines in order to express the need for certain information. Most IR systems use techniques to index terms, which can be words, phrases, or other units that identify the document contents [16]. In order to estimate the similarity of a document with the query, i.e. to determine how closely the document content matches the query, it is essential to calculate statistical information concerning term distribution within a document and within the whole document collection.

IR systems usually assign a numerical value to each document for each query and rank the documents accordingly. Documents with higher values are considered more relevant for the corresponding query.

There are many models proposed for the retrieval process. The most important are: the vector space model [9], the probabilistic model [13], and the language model [25]. In order to present them consistently in the following subsections, we define these variables for a given collection of documents and the queries related to them:

q - a query
t - a query term
d - a document
N - the number of documents in the collection
tf_{t,d} - the frequency of the term t in document d
df_t - the number of documents in the collection containing the term t
tf_{t,q} - the frequency of the term t in query q
w_{d,t} - the weight of term t in document d
w_{q,t} - the weight of term t in query q
M - the maximum dimension of the vector space, containing the unique terms extracted from the document collection

2.1 Vector Space Model

Each document in the vector space model is identified by one or several index terms derived from the documents in the collection [22]. Documents are represented as vectors in a vector space of dimension M, where M corresponds to the maximum number of unique index terms in the document collection. Term extraction is model independent, and each index term occupies one dimension of the vector space. Accordingly, every document is represented by one vector comprising its index terms, disregarding their order. Such a vector space is usually high-dimensional. The importance of each term is expressed by two weights. The first weight, w_{d,t}, represents the importance of the term t in the document d, and the document itself is represented by the vector

$$\vec{V}_d = (w_{d,1}, w_{d,2}, \ldots, w_{d,M}) \qquad (1)$$

Supposing that the query itself can be considered a short document, the importance of the term t in the query q is the weight w_{q,t}, while the query is represented by the vector

$$\vec{V}_q = (w_{q,1}, w_{q,2}, \ldots, w_{q,M}) \qquad (2)$$

The weight of a term in a document vector is usually determined by its two components, the term frequency tf and the inverse document frequency idf [9]. In that case, the weight w_{d,t} is calculated as:

$$w_{d,t} = tf_{t,d} \cdot idf_t = tf_{t,d} \cdot \log\frac{N}{df_t} \qquad (3)$$
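For a quick numeric illustration of equation (3) (the numbers are ours, not taken from the corpus described later): with N = 4 documents, a term occurring in df_t = 2 of them and tf_{t,d} = 3 times in document d, and a base-10 logarithm, the weight is

$$w_{d,t} = 3 \cdot \log_{10}\frac{4}{2} \approx 3 \cdot 0.301 \approx 0.903$$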

Similarity of the query q with the document d is model independent too. The recommended similarity measure is the cosine measure [22]:

$$sim(d, q) = \frac{\vec{V}_d \cdot \vec{V}_q}{|\vec{V}_d| \cdot |\vec{V}_q|} \qquad (4)$$

In equation (4), the denominator is the product of the Euclidean lengths of both vectors, and it represents their cosine document normalization factor [22]. Normalization is necessary to decrease the advantage of longer documents over shorter ones, which originates from multiple occurrences of some terms and from the greater number of distinct terms in longer documents.

Experimental results obtained with TREC benchmark documents and queries have shown that cosine document length normalization tends to retrieve shorter documents with higher probability, independently of their real likelihood of relevance [23]. In order to stimulate the retrieval of longer documents without affecting the retrieval of shorter ones, Singhal, Buckley and Mitra suggested pivoted document length normalization.

2.1.1 Pivoted Document Length Normalization

The main idea behind pivoted normalization is that the probability of document retrieval is inversely related to the normalization factor. If the curves of both probabilities are plotted against the document length, they intersect at a point called the pivot. In order to minimize the tendency of favouring shorter documents, the retrieval probability curve is rotated towards the relevance probability curve. The simplest implementation uses a normalization factor which is linear in the Euclidean length of the document vector [23], i.e.

$$u = (1 - S) + S \cdot \frac{|\vec{V}_d|}{V_{avg}} \qquad (5)$$

In equation (5), S ∈ [0, 1] is the slope, while V_avg is the average length of all the documents in the collection. Initial tests with TREC documents and queries have shown that pivoted cosine normalization reduces the deviation between the probability of retrieval and the probability of relevance. Moreover, Singhal et al. [23] identified an optimal slope value which can be applied to different collections. However, more recent research [7] shows that the parameter S should be carefully calibrated according to the document collection.
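To make the vector space scoring of equations (3)–(5) concrete, the following C# sketch ranks a few toy documents against a query using tf-idf weights, with either cosine normalization or the pivoted factor u as the denominator. This is only a minimal illustration under our own assumptions (the sample texts, the simplistic tokenizer and the slope value are ours); it is not the system described later in the paper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class VectorSpaceDemo
{
    // Simplistic lower-case word tokenizer (our assumption, not the paper's preprocessing).
    static string[] Tokenize(string text) =>
        text.ToLower().Split(new[] { ' ', ',', '.', ';', '?', '!' },
                             StringSplitOptions.RemoveEmptyEntries);

    static void Main()
    {
        // Tiny illustrative collection (not the lecture corpus used in the paper).
        string[] docs =
        {
            "the processor executes instructions stored in main memory",
            "main memory stores data and instructions for the processor",
            "a compiler translates source code into machine instructions"
        };
        string query = "where are instructions stored";

        var tokenized = docs.Select(Tokenize).ToArray();
        int N = docs.Length;

        // Document frequency df_t of every term in the collection.
        var df = new Dictionary<string, int>();
        foreach (var terms in tokenized)
            foreach (var t in terms.Distinct())
                df[t] = df.GetValueOrDefault(t) + 1;

        // tf-idf weights of equation (3) and Euclidean vector lengths |V_d|.
        var weights = tokenized
            .Select(terms => terms.GroupBy(t => t).ToDictionary(
                g => g.Key,
                g => g.Count() * Math.Log((double)N / df[g.Key])))
            .ToArray();
        var lengths = weights.Select(w => Math.Sqrt(w.Values.Sum(x => x * x))).ToArray();
        double avgLength = lengths.Average();

        const double slope = 0.30;   // S in equation (5); an illustrative choice
        bool usePivoted = true;      // toggle between cosine and pivoted normalization

        var queryTerms = Tokenize(query).Where(df.ContainsKey).ToList();

        for (int d = 0; d < N; d++)
        {
            // Numerator of equation (4): dot product with a binary query vector.
            double dot = queryTerms.Sum(t => weights[d].GetValueOrDefault(t));

            // Denominator: Euclidean length |V_d| (cosine) or the pivoted factor u of equation (5).
            double norm = usePivoted
                ? (1 - slope) + slope * (lengths[d] / avgLength)
                : lengths[d];

            Console.WriteLine($"doc {d}: score = {dot / norm:F4}");
        }
    }
}
```

Setting usePivoted to false reproduces the plain cosine measure of equation (4), so both normalization schemes can be compared on the same collection.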

2.2 Probabilistic Model

Probabilistic indexing and information retrieval were introduced in 1960 [17] and have been explored in detail since. The research resulted in many new retrieval methods based on probability theory, with particular attention paid to methods responsible for the weighting of terms. Probability theory, which is the basis of this approach, seeks the answer to the “Basic Question” [13]: “What is the probability that this document is relevant to this query?” For each query, documents are divided into relevant and non-relevant, and two models are estimated for them. Afterwards, documents are ranked according to the posterior probability of relevance; this ranking is directly connected with the Basic Question. Although probabilistic models for information retrieval offered many interesting ideas, they never reached the same success as other IR models. The main reason is the need to determine reasonable approximations of the probabilities, which itself imposes additional assumptions. Put into practice, these assumptions prevented better performance. The situation changed significantly with the introduction of the BM25 weighting model in the 1990s [20]. Although probabilistically based, the model demonstrated superior performance and practical value, and its weighting scheme has been accepted as an etalon for term weighting by other researchers.

The BM25 weighting model is also known as the Okapi weighting model, after the Okapi system [21] implemented at City University of London for ranking documents matching a given user need. Robertson and his associates state: “The Okapi system has been used in a series of experiments on the TREC collections, investigating probabilistic models, relevance feedback and query expansion, and interaction issues.” The model is sensitive to term frequency and document length, without introducing too many parameters [13]. Its most employed form is the Retrieval Status Value (RSV), used for document ranking. It is calculated as:

$$RSV_d = \sum_{t \in q} \log\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1) \cdot tf_{t,d}}{k_1 \left((1 - b) + b \cdot (L_d / L_{avg})\right) + tf_{t,d}} \cdot \frac{(k_3 + 1) \cdot tf_{t,q}}{k_3 + tf_{t,q}} \qquad (6)$$

where RSV_d is the ranking value of document d during retrieval, L_d is the length of document d, and L_avg is the average length of the documents in the collection. The parameters k_1 and k_3 are tuning parameters which determine the scaling of the term frequency in the document and in the query respectively, while b (0 ≤ b ≤ 1) determines the scaling with document length.
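A compact C# sketch of the RSV computation of equation (6) is given below. It is only an illustration over toy data: the documents, the tokenizer and the parameter values are our own choices (k1 = 1.2 and b = 0.75 are the commonly recommended values also used later in section 4.2.2; the value of k3 is arbitrary), and the code is not the implementation used in the experiments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Bm25Demo
{
    static string[] Tokenize(string text) =>
        text.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Retrieval Status Value of equation (6) for one document.
    static double Rsv(string[] doc, string[] query, Dictionary<string, int> df,
                      int N, double avgLen, double k1, double b, double k3)
    {
        double score = 0.0;
        foreach (var group in query.GroupBy(t => t))
        {
            string t = group.Key;
            if (!df.ContainsKey(t)) continue;        // term unseen in the collection

            double tfD = doc.Count(w => w == t);     // tf_{t,d}
            double tfQ = group.Count();              // tf_{t,q}
            double idf = Math.Log((double)N / df[t]);

            double docPart = (k1 + 1) * tfD /
                             (k1 * ((1 - b) + b * doc.Length / avgLen) + tfD);
            double queryPart = (k3 + 1) * tfQ / (k3 + tfQ);

            score += idf * docPart * queryPart;
        }
        return score;
    }

    static void Main()
    {
        string[][] docs =
        {
            Tokenize("main memory stores data and instructions"),
            Tokenize("the compiler translates source code into machine code")
        };
        string[] query = Tokenize("which component stores instructions");

        var df = new Dictionary<string, int>();
        foreach (var d in docs)
            foreach (var t in d.Distinct())
                df[t] = df.GetValueOrDefault(t) + 1;

        double avgLen = docs.Average(d => (double)d.Length);

        for (int i = 0; i < docs.Length; i++)
            Console.WriteLine(
                $"doc {i}: RSV = {Rsv(docs[i], query, df, docs.Length, avgLen, 1.2, 0.75, 1000):F4}");
    }
}
```

Since the toy query is short, the k3 factor hardly changes the ranking; it matters for long queries with repeated terms.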

2.3 Language Model
Recently, a new approach to information retrieval based on language models has proved very successful [4, 10, 19]. The basic idea of the approach is to build a probabilistic language model M_d for each document d in the collection and to rank the documents according to the probability of generating the query q, i.e. according to P(q | M_d) [8]. The probabilities P(q | M_d) are estimated using a multinomial unigram language model which excludes any dependency between terms (words), thus deriving the probability of a sequence of terms under the assumption of their independence [16]. In this model, the probability of producing the query q in the language model M_d of document d using maximum likelihood estimation is:

$$\hat{P}(q \mid M_d) = \prod_{i=1}^{n} \hat{P}(t_i \mid M_d) = \prod_{i=1}^{n} \frac{tf_{t_i,d}}{L_d} \qquad (7)$$

where q = t_1 t_2 … t_n, tf_{t_i,d} is the frequency of the term t_i in the document d, and L_d is the number of terms in the document d. The crucial problem when using the language model is the estimation of terms which appear very sparsely in documents. Namely, some words may never appear in a certain document, although they are possible words in the query. In that case, the probability of a term t missing from document d is zero, and a document gives the query a non-zero probability only if all the terms of the query q appear in the document. The second obvious problem with the model is its poor estimation of words which occur sparsely in documents, particularly those occurring only once. They are usually overestimated, “since their one occurrence was partly by chance” [16]. In order to resolve these problems, smoothing is the appropriate solution.

2.3.1 Smoothing Techniques

Using the relative frequency of terms in a document to estimate the probability of producing the query is a model without smoothing, and such a relative distribution underestimates unseen words. The main task of the proposed approach is therefore to smooth the probability distribution by assigning some probability mass to unseen words, so that they obtain a non-zero probability, and to generally improve the estimation of word probabilities.

There are several techniques for smoothing a probability distribution, proposed by Chen and Goodman [6]. They usually decrease the probability of seen words and increase the probability of unseen words. A popular and at the same time easily implemented method is Bayesian smoothing. It uses a language model built from the whole collection as a prior distribution, according to equation (8):

$$\hat{P}_{\mu}(t \mid d) = \frac{tf_{t,d} + \mu \cdot \hat{P}(t \mid M_c)}{L_d + \mu} \qquad (8)$$

where µ is the smoothing parameter and M_c is the language model of the whole collection. Equation (8) estimates the probability of a word appearing in the document by combining its discounted maximum likelihood estimate (MLE) with a fraction of the estimate of its prevalence in the whole collection. For words which are not present in the document, the MLE component vanishes and only the collection estimate remains.
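The following C# sketch ranks toy documents by the smoothed query likelihood of equations (7) and (8), summing log-probabilities to avoid numerical underflow. The sample texts and tokenizer are our own, and µ = 10 is used purely as an illustrative value; this is our reading of the model, not the authors' code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LanguageModelDemo
{
    static string[] Tokenize(string text) =>
        text.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    static void Main()
    {
        string[][] docs =
        {
            Tokenize("main memory stores data and instructions"),
            Tokenize("the compiler translates source code into machine code")
        };
        string[] query = Tokenize("memory stores instructions");

        const double mu = 10.0;   // smoothing parameter of equation (8), illustrative value

        // Collection model P(t | M_c): relative term frequency over all documents together.
        var collection = docs.SelectMany(d => d).ToArray();
        var collectionTf = collection.GroupBy(t => t)
                                     .ToDictionary(g => g.Key, g => (double)g.Count());
        double collectionLen = collection.Length;

        for (int i = 0; i < docs.Length; i++)
        {
            var doc = docs[i];
            var tf = doc.GroupBy(t => t).ToDictionary(g => g.Key, g => (double)g.Count());

            // log P(q | M_d): sum of log P_mu(t | d), equation (8), over the query terms.
            // Terms absent from the whole collection would still get zero probability,
            // so they are skipped here.
            double logProb = 0.0;
            foreach (var t in query.Where(collectionTf.ContainsKey))
            {
                double pCollection = collectionTf[t] / collectionLen;
                double pSmoothed = (tf.GetValueOrDefault(t) + mu * pCollection)
                                   / (doc.Length + mu);
                logProb += Math.Log(pSmoothed);
            }
            Console.WriteLine($"doc {i}: log P(q|M_d) = {logProb:F4}");
        }
    }
}
```

Larger values of µ pull the document model closer to the collection model, which is exactly the trade-off explored experimentally in section 4.2.3.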

3. Test Collection
As previously mentioned, question answering systems are usually evaluated with TREC test collections. They give valuable feedback on known and potential new models for information retrieval, and enable the exchange of useful information and experience not only in information retrieval, but also for question answering systems. The evaluation and comparison of all the models presented in sections 2.1, 2.2 and 2.3 was done over a test collection extracted from our lectures. The basic corpora were four different and mutually related lectures presented in textual form. They cover: the history of computing and computer hardware; basic concepts of information technology; hardware essentials; and software essentials. Each text contains on average barely 7000 words, including references, which were filtered out during information retrieval. Such small corpora were a limiting factor, but they gave us valuable feedback not only on the usefulness of the models, but also on the learning system which was built to evaluate them. In parallel, we had a pool of 164 multiple-choice questions used in our e-testing system. For each question, four answers are proposed, only one of them correct. It is important to stress that all answers are extracted from the lectures, making the information retrieval process very ambiguous. However, each question originates from one and only one of the four documents. In our learning system, the questions from the test collection were divided into two sets: a training set and a testing set. The division was random, extracting 84 questions to train the system, i.e. to determine the parameter values of all models described in the paper. The remaining 80 questions were used for cross-validation of the calculated parameters. In the absence of a real testing set, in this paper we consider cross-validation as testing.

3.1 Training and Testing Set
Most of the questions in our question pool start with nine question words: who, what, when, why, what kind, how big, how, how many, and where. Eight questions are open questions in which the student should provide a numerical value. Table 1 presents the distribution of all questions according to question words in both the training and the testing set. It can be noticed that the questions in our test collection predominantly start with the question words “what” (47.54% of the whole collection, or 42.86% of the training and 52.50% of the testing set) and “who” (24.39% of the whole collection, or 28.57% of the training and 20.00% of the testing set).

Table 1. Distribution of questions in the training and testing set

Question     Training   Testing   Collection
Who              24        16         40
What             36        42         78
When              3         3          6
Why               2         0          2
What kind         4         2          6
How big           1         1          2
How               4         6         10
How many          3         3          6
Where             4         2          6
Numerical         3         5          8
Total            84        80        164

The random selection of the training set was repeated several times in order to smooth the inconsistent distribution of questions in both sets, without success. In most cases, questions starting with “who” appeared more frequently in the training set (in the presented case, even 50% more), an imbalance which was neutralized by the more frequent appearance of questions starting with “what”.

Table 2. Distribution according to question category

Category       Training   Testing   Collection
Factoid            32        20         52
Descriptive        45        46         91
Enumeration         7        14         21
Total              84        80        164

The second distribution, shown in Table 2, deals with the basic question category. The most prevalent in our test collection were descriptive questions, which represented 55.49% of the entire collection. They were evenly distributed over the training and testing sets. However, factoid questions were again inconsistent: they appeared in the training set 60% more frequently than in the testing set, making both sets of unequal importance. The noticed imbalances could be reduced by repeating the experiment with alternative samples or with an alternative division algorithm, but initial experiments showed an insignificant difference and did not affect the success of the models at all. Therefore, in this paper we present the basic experiment.

4. The Experiment
This section addresses the information retrieval system built to compare the models. The results of the experiment for the three models, including the adjustment of the training parameters over our test collection, are then presented in detail.

4.1 Information Retrieval Software
The application built to analyze the documents is capable of improving the retrieval of information from the query of questions. It is a small desktop object-oriented application. Its interface was developed in Visual Studio 2008, supported by the built-in programming language C# (Figure 1).

Figure 1. Desktop application for analyzing documents

The application itself consists of three windows. The first one (Select Docs) presents the Word documents selected for analysis. The second (Select Questions) presents the selected textual files with questions. Finally, after pressing the button Analyze!, the results of the analysis appear in the right-hand window, which is in fact a monitoring window. After the selection is initiated, pre-processing starts by filtering out a predetermined list of stop words from the selected documents. For our system, the stop words are all question words (Table 1), relative pronouns, and prepositions. In order to avoid exhaustive morphological analysis, lemmatization of unique words was simulated by grouping all the words with the same beginning and extracting them as query strings, assuming that deficiencies due to wrong derivation would appear equally in the training and testing sets. Prior to this pseudo-lemmatization, or more precisely pseudo-stemming, the frequencies of all terms and the number of documents containing them were calculated. In the further process, terms were replaced with the extracted pseudo-terms. The system temporarily stores the calculated parameters which are fixed during the analysis, including constant coefficients and weights. These temporary values avoid recalculating the same parameters, thus significantly shortening the retrieval process.
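As a rough illustration of the pseudo-stemming step described above, the sketch below groups vocabulary words that share a common beginning and maps every word to the shortest member of its group. The paper does not state how the "same beginning" was determined, so the fixed prefix length of five characters and the choice of representative are purely our assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PseudoStemmingDemo
{
    // Map every word to a pseudo-term: the shortest word among those sharing its prefix.
    // The prefix length is an assumption for illustration; the paper does not specify one.
    static Dictionary<string, string> BuildPseudoTerms(IEnumerable<string> vocabulary,
                                                       int prefixLen = 5)
    {
        var groups = vocabulary
            .Distinct()
            .GroupBy(w => w.Length <= prefixLen ? w : w.Substring(0, prefixLen));

        var mapping = new Dictionary<string, string>();
        foreach (var g in groups)
        {
            string representative = g.OrderBy(w => w.Length).ThenBy(w => w).First();
            foreach (var word in g)
                mapping[word] = representative;
        }
        return mapping;
    }

    static void Main()
    {
        var vocabulary = new[]
        {
            "computer", "computers", "computing", "computation",
            "memory", "memories", "processor", "processors"
        };

        foreach (var pair in BuildPseudoTerms(vocabulary))
            Console.WriteLine($"{pair.Key} -> {pair.Value}");
        // "computers", "computing" and "computation" map to the same pseudo-term as "computer".
    }
}
```

In the described system, the original terms are then replaced by such pseudo-terms in the further retrieval process.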

4.2 Experimental Results In this section, we present the results of implementing three basic IR models presented in section 2 of this paper in order to determine the exact document in the test collection which was the source for the question. Since valuable information can be obtained not only from the question, but also from offered answers, it was decided to construct queries consisting of the terms existing jointly in the questions and in the answers. This made the ranking of the queries time consuming, but the results were far more accurate. After the ranking, ranks were sorted in descending order, considering the first document as a document which is its most relevant source. In each of three basic models, we initially present the optimum results obtained during the training process over queries in the training set of test collection, after carefully adjusting all the variable parameters for each model. Afterwards, we present the experimental results with these values over queries in the testing set. Surprisingly, second results are better, confirming uniformity of training and testing sets after random division of the whole set of questions in our question pool.
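As described above, each query is built from the terms of the question together with all offered answers, and the top-ranked document is taken as the source. The C# sketch below shows only that selection step; the scoring function is deliberately left abstract, so any of the three models could be plugged in, and the example strings are our own.

```csharp
using System;
using System.Linq;

class DocumentSelectionDemo
{
    // Select the index of the document that best matches the combined question + answers query.
    // The score delegate stands in for any of the three models (tf-idf, BM25, language model).
    static int SelectSourceDocument(string[] documents, string question, string[] answers,
                                    Func<string, string, double> score)
    {
        string query = question + " " + string.Join(" ", answers);
        return Enumerable.Range(0, documents.Length)
                         .OrderByDescending(i => score(documents[i], query))
                         .First();
    }

    static void Main()
    {
        string[] documents =
        {
            "lecture on the history of computing and computer hardware",
            "lecture on basic concepts of information technology",
            "lecture on hardware essentials",
            "lecture on software essentials"
        };
        string question = "Who designed the first mechanical calculating machine?";
        string[] answers = { "Pascal", "Babbage", "Turing", "von Neumann" };

        // Toy scorer: counts shared lower-cased words (a placeholder, not one of the real models).
        Func<string, string, double> overlap = (doc, query) =>
            (double)doc.ToLower().Split(' ').Intersect(query.ToLower().Split(' ')).Count();

        Console.WriteLine(
            $"Most relevant document: {SelectSourceDocument(documents, question, answers, overlap)}");
    }
}
```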

4.2.1 Vector Space Model with Cosine and Pivoted Document Length Normalization
The implementation of the vector space model with cosine normalization managed to correctly classify 87% of all queries into the appropriate documents. In order to improve the classification, pivoted document length normalization was tested with twenty values of the parameter S, varying from 0 to 1 with a constant step of 0.05 (Figure 2). It increased the correct classification to 92%, obtained for S = 0.25 and S = 0.30.

Figure 2. Percentage of correctly classified documents in the training set according to variable values of parameter S

The parameters learned from the training set were temporarily stored, and the same information retrieval process was performed with the questions from the testing set. The classification with the basic model was slightly better (87.5% compared to 87% in the training phase). Pivoted document length normalization was applied directly with the optimum value S = 0.30, and as much as 95% of the questions from the testing set were correctly classified. Better classification results in the testing phase than in the training phase are usually very exceptional. However, in our approach, testing was done after the learning had completely settled, enabling testing in a stationary regime.

4.2.2 Okapi BM25 Probability Model
We decided to switch directly to the Okapi BM25 model and to test our system with the empirically best parameter values over the TREC test collections [20, 21]. Thus, we conducted the experiment with the recommended values k1 = 1.20, b = 0.75, and k3 = 109. The results were inferior, but still satisfactory (Figure 3). Similarly to the vector space model, the correctness of classification over the training set, 85.7%, was inferior to the correctness of classification over the testing set, which reached a high 90%.

Figure 3. Percentage of correctly classified documents during training and testing using the Okapi BM25 probability model

4.2.3 Language Model
In our small system, we implemented the multinomial unigram model with Bayesian smoothing. Experiments were done by altering the value of the smoothing parameter. We tested thirteen different values of the parameter µ, ranging from 1 to 5000. It was concluded that higher values of the parameter could not improve the correctness obtained for much smaller values (Figure 4). Additional experiments within the range between 1 and 50 showed that the best value for our system was µ = 10, generating a correct classification of 88.1%. The same value was then used for the testing set, reaching a satisfactory 92.5%, but still inferior to the classification done with pivoted document length normalization.

Figure 4. Percentage of correctly classified documents in the training set according to variable values of parameter µ

5. Discussion of Results from the Three Models
Although general experience reported in practice [2] shows that the language model is superior to the other models, our corpus was either too small to benefit from its advantages, or the joint queries comprising questions and answers were too demanding, diminishing the value of the term frequency tf and inverse document frequency idf in the classical vector space model, and of the BM25 weights in the probabilistic model. There are several reasons for the superiority of pivoted document length normalization in our system. Namely, this approach enables a better understanding of the design and its empirical approximations, and at the same time, it clearly explains different heuristic settings defined in the past. However, smoothing methods have already been exploited and have shown extraordinary results in other disciplines, and we expect that their examination might be beneficial in combination with the vector space model. Although the results of all three models were rather satisfactory, it was striking that for all of them the results obtained during testing were superior, a fact which is unusual, and thus very surprising. As noted in section 3.1, the random selection of questions for the training set was imbalanced with respect to the three question categories. In order to evaluate its influence, we decided to examine in more detail those questions which were misclassified by the models (Table 3). In all three models, enumeration questions were classified almost ideally. Descriptive questions were classified worst (12.59% misclassified in the training and 9.42% in the testing set). However, factoid questions, which are predominant in the training set, seem to be the greatest problem, suggesting that we should pay much more attention to the division of both sets in future experiments.

Table 3. Distribution of the percentage of questions with incorrect document retrieval in the training and testing set

                Vector space      Okapi BM25       Language model
Question        Train    Test    Train    Test     Train    Test
Factoid          9.38    5.00    12.50    5.00     12.50    5.00
Descriptive      8.89    6.52    15.56   13.04     13.33    8.70
Enumeration      0.00    0.00    14.29    7.14      0.00    7.14

6. Conclusion and Future Work
The creation of question answering systems is a very demanding and exhausting process, facing many obstacles [2]. Most of them are directly connected with language-specific elements. The retrieval of the relevant documents intended for further analysis and processing is the first important step, which significantly influences good organization and efficiency, facilitating quick and accurate answers. That was the main reason to examine in detail three proven models for information retrieval and to investigate their potential to correctly retrieve the document containing the answer to questions from our test collection. Our research has shown that the vector space model based on pivoted document length normalization gave superior results, where the document length considerably contributed to the final classification. Furthermore, we discovered that the question category also influenced the final retrieval to a great extent. Expectedly, enumeration questions gave almost impeccable results. Our future activities in the project will be directed toward achieving these goals: 1) enhancement of the learning system, in order to achieve a retrieval accuracy higher than 97%; 2) combination of several retrieval strategies within documents (for example, paragraph retrieval) in order to enable the exact selection between several offered answers; 3) definition of new metrics over sequences of words, as well as over the distance between word forms.

7. References
[1] Allan, J. et al., Challenges in information retrieval and language modeling. SIGIR Forum, 37, 2003.
[2] Aunimo, L., Methods for Answer Extraction in Textual Question Answering. University of Helsinki, Finland, 2007, http://www.doria.fi/bitstream/handle/10024/5780/methodsf.pdf?sequence=2
[3] Belkin, N.J. and Vickery, A., Interaction in information systems. The British Library, 1985.
[4] Berger, A. and Lafferty, J., Information retrieval as statistical translation. In Proceedings of the 1999 ACM-SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, pp. 222–229, 1999.
[5] Bush, V., As We May Think. Atlantic Monthly, 176:101–108, July 1945.
[6] Chen, S.F. and Goodman, J., An empirical study of smoothing techniques for language modeling. Tech. Rep. TR10-98, Harvard University, 1998.
[7] Chowdhury, A., McCabe, M.C., Grossman, D. and Frieder, O., Document normalization revisited. In Proceedings of the ACM-SIGIR, pp. 381–382, Tampere, Finland, 2002.
[8] Croft, W.B. and Lafferty, J., Language Modeling for Information Retrieval. Kluwer Academic Publishers, pp. 275–81, 2003.
[9] Salton, G., Wong, A. and Yang, C.S., A vector space model for automatic indexing. Communications of the ACM, 18(11), pp. 613–620, 1975.
[10] Hiemstra, D., A linguistically motivated probabilistic model of information retrieval. In Proc. ECDL, Volume 1513 of LNCS, pp. 569–584, 1998.
[11] IBM Watson, http://www03.ibm.com/innovation/us/watson/what-is-watson/index.html
[12] LINGUIST 180, From Languages to Information, http://www.stanford.edu/class/cs124/AIMagzineDeepQA.pdf
[13] Jones, K.S., Walker, S. and Robertson, S.E., A probabilistic model of information retrieval: Development and comparative experiments. Parts 1 and 2. Information Processing and Management, 36(6), pp. 779–840, 2000.
[14] Luhn, H.P., A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1957.
[15] Magnini, B., Negri, M., Prevete, R. and Tanev, H., Mining the Web to validate answers to natural language questions. In Proceedings of Data Mining 2002, Bologna, Italy, 2002.
[16] Manning, C., Raghavan, P. and Schütze, H., Introduction to Information Retrieval. Cambridge University Press, 2008.
[17] Maron, M.E. and Kuhns, J.L., On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7, pp. 216–244, 1960.
[18] Monz, C., From Document Retrieval to Question Answering. University of Amsterdam, 2003, http://www.eecs.qmul.ac.uk/~christof/html/publications/thesis.pdf
[19] Ponte, J.M. and Croft, W.B., A language modeling approach to information retrieval. In Proc. SIGIR, pp. 275–281, ACM Press, 1998.
[20] Robertson, S.E., Walker, S. and Hancock-Beaulieu, M., Experimentation as a way of life: Okapi at TREC. Information Processing and Management, 36(1), pp. 95–108, 2000.
[21] Robertson, S.E. and Walker, S., Okapi at TREC-8. In Proceedings of the Eighth Text REtrieval Conference, pp. 151–162, 1999.
[22] Salton, G., Wong, A. and Yang, C.S., A vector space model for information retrieval. Communications of the ACM, 18(11):613–620, November 1975.
[23] Singhal, A., Buckley, C. and Mitra, M., Pivoted document length normalization. In Proceedings of the ACM-SIGIR International Conference on Research and Development in Information Retrieval, pp. 21–29, Zurich, Switzerland, 1996.
[24] Text REtrieval Conference, http://trec.nist.gov/
[25] Zhai, C. and Lafferty, J., A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), pp. 179–214, 2004.
