2018 Fourth International Conference on Information Retrieval and Knowledge Management
Query Expansion in Information Retrieval for Urdu Language

Haider Banka
Department of CSE, Indian Institute of Technology (ISM), Dhanbad, India
[email protected]

Imran Rasheed
Department of CSE, Indian Institute of Technology (ISM), Dhanbad, India
[email protected]
Abstract—Information retrieval systems need to be upgraded constantly to meet the challenges posed by advanced user queries, as search systems become more sophisticated over time. These problems have been addressed extensively in recent years across several research communities in order to achieve quick and relevant results. One such approach is to augment the query: automatic query expansion increases precision in information retrieval, even though it can reduce the number of results returned for some queries. Here, this approach is tested on the present Urdu data collection using different expansion models, namely KL, Bo1 and Bo2. The current collection is quite large compared with other existing Urdu datasets; it comprises 85,304 documents in the TREC scheme and 52 topics with their relevance assessments. In this paper we focus on enhancing the retrieval model using query expansion, which has not been attempted before on Urdu text. We also show that a deeper analysis of the initial and expanded queries yields insights that could benefit future research in this domain.

Keywords—Urdu Information Retrieval; Query Expansion; Relevance Feedback; Local Analysis; Urdu Corpus; Relevance Judgment.

I. INTRODUCTION

In the last fifty years, massive progress has been reported in the Information Retrieval (IR) area, but most of it has been carried out for English [1]. Besides this, a number of IR communities (such as TREC, CLEF, NTCIR and FIRE) have taken noticeable initiatives in several East-Asian, European and South-Asian languages. However, there are still many languages for which little progress has been reported compared with the aforementioned ones, and Urdu is one of them; this poor coverage is mainly attributed to the lack of available linguistic resources. It is remarkable that in the last decade ample work has been reported for different South Asian languages, mostly in the areas of machine learning and natural language processing, yet Urdu is still quite distant from such initiatives [2]. Additionally, for any linguistic study, a benchmark dataset of the given language is a basic necessity for performing advanced research experiments. A few large news-genre datasets of general interest have been constructed for English and other European and Asian languages and are publicly available [3, 4]. On the other hand, the standard data available for Urdu is scarce and limited to specific domains. For example, the Becker-Riaz corpus contains about 7,000 short news articles gathered from BBC news [5], the EMILLE corpus includes 1,640,000 words of written Urdu text and 512,000 words of spoken Urdu [6], the IJCNLP-2008 NE tagged corpus has 40,000 words for named entity recognition [7], and the CLE datasets, with 100,000 words used for POS tagging and named entity recognition [8], are the only Urdu corpora freely available to the public. All these datasets are comparatively small and are not suitable for accomplishing various Urdu information retrieval tasks such as text summarization, text clustering, ad-hoc information retrieval, question answering, query expansion and categorization. Recently, the authors of [9] developed an NE tagged dataset used only for NER analysis. The present work is therefore undertaken to build a substantial Urdu collection that can readily be used for different linguistic studies. In [10], the authors built an efficient QESBIRM platform that combines QE and proximity-based SBIRM approaches to significantly enhance retrieval effectiveness; the SBIRM method is implemented using either the DWT, KLD or P-WNET method. In [11], the authors give an overview of information retrieval models based on query expansion, including some explanation of the practical work undertaken and its methods of implementation. The authors of [12] investigated a combined approach of association-based and distribution-based term selection to improve overall retrieval effectiveness. Elsewhere, the authors of [13] reported positive results in text retrieval using WordNet-based query expansion. More recently, the authors of [11] carried out three types of spectral analysis based on semantic segmentation, namely sentence-based, paragraph-based and fixed-length segmentation, to improve the ranking score as well as the run-time efficiency of query resolution while maintaining a reasonable index size. Our Urdu text collection consists of 85,304 general-interest newswire documents built to the Text Retrieval Conference (TREC) standard and also includes 52 topics from different categories together with their relevance judgments. Here, we examine the efficiency of different query expansion (QE) models, namely KL, Bo1 and Bo2, on this Urdu text collection.

The paper is organized as follows: Section II describes some challenges of the Urdu language. Section III outlines related work and state-of-the-art approaches to query expansion. Section IV gives a brief summary of the collection statistics, and Section V reviews the query expansion models. Sections VI and VII describe the experiments undertaken to investigate retrieval effectiveness using query expansion. The retrieval results are discussed in Section VIII. Section IX concludes the paper and suggests some avenues for future work.
II. THE URDU LANGUAGE

Urdu, the national language of Pakistan, is spoken by well over 300 million people worldwide [7, 14], including a large share of speakers from the Indian subcontinent. It has been widely adopted by Bollywood, the famous Indian cinema, which introduced it to large non-Urdu-speaking populations in the Middle East, Africa and several European and Latin American regions. Urdu is very close to Hindi but distinct in its script; both languages share the common vocabulary and grammar of daily speech [15, 16]. Urdu morphology, orthography, word tokenization, word spacing and poor resources are among the challenges that make it particularly demanding even among the languages that follow the Arabic script. It therefore offers a unique and significant attraction to the linguistic research community.

A. Orthography
The Urdu script is typically written from right to left, like the Arabic script, in the Nastaliq style [4]. Most characters acquire different shapes depending on their position in the ligature, i.e., a letter may appear differently depending on whether it occurs at the beginning, end or centre of a word, or in complete isolation [17].
B. Morphology
Morphology is the study of word formation [18]. Urdu is morphologically rich, which means that multiple surface forms can be derived from a single root [19]. Table I shows a few of the variants that exist for a single Urdu word.

TABLE I. DIFFERENT VARIANTS OF THE WORD "PLAY"
The root کھیل (khel, "play") yields inflected forms such as کھیلنا, کھیلوں, کھیلیں and کھیلتے, which combine with pronouns such as تو, تم and آپ to express, for example, "you play".

C. Tokenization and Word Segmentation
Tokenization is the process of breaking a text at the word (or root) level. Unlike English text, where two distinct words are generally separated by a space, Urdu suffers from tokenization and word-spacing problems [20]. Moreover, the use of the space character in Urdu is quite troublesome from an information retrieval perspective: a space is not inserted after every single word; instead, it often appears only after a word has been joined with an appropriate suffix. Queries can therefore produce unwanted results, which leaves ample room for researchers to address the tokenization issues of the Urdu language.
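To make the tokenization issue concrete, the following is a minimal illustrative sketch (not part of the system described in this paper) of a simple tokenizer for Arabic-script text. The Unicode ranges and the zero-width non-joiner heuristic are assumptions for illustration; handling omitted spaces and suffix joining, as discussed above, would require language-specific resources.

```python
import re
import unicodedata

# Characters from the Arabic Unicode blocks (U+0600-U+06FF, U+0750-U+077F)
# plus the Arabic presentation forms are treated as word characters.
URDU_WORD = re.compile(r"[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]+")

def tokenize_urdu(text):
    """Return a crude list of Urdu tokens extracted from raw text."""
    # Normalize to a canonical Unicode form so visually identical
    # characters (e.g. presentation forms) compare equal.
    text = unicodedata.normalize("NFC", text)
    # The zero-width non-joiner is sometimes used instead of a space;
    # treating it as a separator is one simple heuristic (assumption).
    text = text.replace("\u200c", " ")
    return URDU_WORD.findall(text)

# Example on an ordinary Urdu sentence ("this is an example"):
print(tokenize_urdu("یہ ایک مثال ہے"))  # -> ['یہ', 'ایک', 'مثال', 'ہے']
```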
D. Diction Problem
Diction-related issues arise when several different words carry the same meaning. Urdu is full of such diction problems owing to its computationally complex nature. For example, Urdu words that differ in spelling but share the same meaning include the two spellings of "ground" (گرونڈ، گرؤنڈ).

E. Loan Words
The Urdu language contains many loan words from several foreign languages, including English, Arabic, Persian and Turkish. Urdu is a relatively young language compared with historically rich languages such as Arabic, Persian and Turkish; it emerged mainly to bridge the gap between speakers of those languages and speakers of native North Indian languages, chiefly Hindi. Over the course of time, Urdu embraced many words from other languages, particularly English, and these are routinely used not only in spoken but also in written Urdu. Some English borrowings in Urdu are Promotion (پروموشن), Police (پولیس) and Company (کمپنی).
III. RELATED WORK

Queries given by users to an information retrieval system are usually short, ambiguous and imperfect [21], and they often return inaccurate responses. Many essential terms may be absent from a given query, which can lead the search engine or information retrieval system to respond poorly or ineffectively, returning documents of little relevance to the query. This problem was first addressed by Rocchio [22], who proposed a relevance feedback scheme that expands the original query by adding extra terms such as synonyms, plurals and modifiers according to user feedback or query reformulation [23, 24]. Query expansion is a widely practised methodology for significantly enhancing the user's experience of the retrieved results [25-28]. It is essentially the process of adding significant and contextually related words to the seed query in order to improve retrieval performance. Several approaches have been proposed for query expansion, but among them Pseudo-Relevance Feedback (PRF), automatic query expansion and blind relevance feedback are observed to be the most effective and useful in data retrieval [27]. In this approach, the original query is first issued to a typical information retrieval system [29-31]. Related terms are then extracted to expand the seed query: the top K ranked documents returned in the first retrieval pass may contain important terms that help to separate relevant documents from irrelevant ones. In general, the expansion terms are chosen either as the most frequent terms in the feedback documents or as the terms that are most specific to the feedback documents with respect to the entire collection. Fig. 1 describes a typical workflow of a PRF-based information retrieval system, and a minimal sketch of this loop is given after the figure. Several methods have been proposed for QE; in this work, three QE models, namely Bo1, Bo2 and KL, are adopted for the analysis. Furthermore, a comparative study of the retrieval effectiveness of state-of-the-art retrieval models on Urdu is presented, and the effectiveness of applying QE to improve retrieval accuracy is explored. Terrier is used as the information retrieval framework for all the experiments undertaken in this work [32].

Fig. 1. Architecture of a Pseudo-Relevance Feedback based system.
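The following is a minimal sketch of the PRF loop described above. It is illustrative only and is not the paper's Terrier configuration: the retrieve() callable and the scoring of candidate terms by raw feedback frequency are placeholders for the actual first-pass retrieval model (e.g. BM25) and term-weighting model (e.g. Bo1, Bo2 or KL).

```python
from collections import Counter

def expand_query(query_terms, retrieve, k_docs=10, n_terms=15):
    """One round of pseudo-relevance feedback (illustrative sketch).

    query_terms : list of tokens in the original query
    retrieve    : callable(query_terms, k) -> list of top-k documents,
                  each document given as a list of tokens (placeholder
                  for the real first-pass retrieval, e.g. a BM25 run)
    """
    # 1) First-pass retrieval with the original query.
    feedback_docs = retrieve(query_terms, k_docs)

    # 2) Collect candidate terms from the top-ranked documents.
    counts = Counter()
    for doc in feedback_docs:
        counts.update(doc)

    # 3) Keep the highest-scoring candidates not already in the query
    #    (scored here by raw frequency for simplicity; a DFR model such
    #    as Bo1, Bo2 or KL would be used in practice).
    candidates = [t for t, _ in counts.most_common()
                  if t not in set(query_terms)]
    expansion = candidates[:n_terms]

    # 4) The expanded query is re-submitted for the final retrieval.
    return query_terms + expansion
```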
IV. DOCUMENT COLLECTION

A. Documents, Topics, and Relevance Assessments
The efficiency of the different query expansion models described in this paper is measured on data from our own developed Urdu dataset. The collection consists of 85,304 documents prepared according to the TREC specifications, and all documents are encoded in UTF-8. A sample document (document ID 26_July_2012_Sportz4, a sports news item about Test cricket rankings) is shown in Fig. 2.

Fig. 2. A sample file in TREC format.
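For illustration, a TREC-style document in this kind of collection typically wraps each article in DOC/DOCNO/TEXT tags. The skeleton below is a generic example with placeholder content, not a verbatim file from the collection, and the exact tag set used is an assumption.

```python
# A generic TREC-style document wrapper (placeholder body text; only the
# document identifier is taken from the sample shown in Fig. 2).
sample_doc = """<DOC>
<DOCNO>26_July_2012_Sportz4</DOCNO>
<TEXT>
... Urdu article body in UTF-8 ...
</TEXT>
</DOC>"""

# Documents are typically concatenated into large files; a simple way to
# pull out the document identifiers:
import re
doc_ids = re.findall(r"<DOCNO>(.*?)</DOCNO>", sample_doc)
print(doc_ids)  # ['26_July_2012_Sportz4']
```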
A topic follows a standard common to other retrieval initiatives such as TREC. An ordinary topic has three sections, i.e., the title, the description and the narrative. A unique identification number is assigned to each topic to distinguish it from the other topics. One such example is given in Fig. 3. In addition, a set of 52 topics with their relevance assessments has been used for the analysis of query expansion.

Fig. 3. A sample topic (Topic 2, "Aadarsh Housing Scams").

V. QUERY EXPANSION MODELS

In the given analysis, the DFR-based term weighting models Bo1 [33], Bo2 [33] and KL are employed through the Terrier search engine. Terrier implements a Divergence From Randomness (DFR) based QE mechanism, which is a generalization of Rocchio's method [26]. First, the DFR model measures the weight of the terms in the top-ranked documents. The most informative terms are then collected from the returned results and added to the original query to generate an expanded query. The weighting schemes are described in Sections V-A, V-B and V-C [34].
A. Kullback-Leibler (KL) Model
The Kullback-Leibler divergence measures the divergence between the probability distribution of terms in the whole collection and that in the top-ranked documents returned by the first-pass retrieval with the original user query [35]. For a term t, this weight is given by

w(t) = P_n(t) \cdot \log_2 ( P_n(t) / P_c(t) )    (1)

P_n(t) = \sum_{d \in D_n} tf(t,d) / \sum_{t'} \sum_{d \in D_n} tf(t',d)    (2)

P_c(t) = \sum_{d \in C} tf(t,d) / \sum_{t'} \sum_{d \in C} tf(t',d)    (3)

where
• P_n(t) is the probability of the term t in the n top-ranked documents D_n;
• P_c(t) is the probability of the term t in the whole collection C.
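A small sketch of the KL weighting written directly from equations (1)-(3); the term-frequency dictionaries here stand in for the index statistics that Terrier would normally supply, so this is a minimal illustration rather than the actual implementation.

```python
import math
from collections import Counter

def kl_weights(feedback_docs, collection_tf, collection_tokens):
    """KL expansion weights, following equations (1)-(3).

    feedback_docs     : list of token lists (the top-ranked documents)
    collection_tf     : dict term -> frequency in the whole collection
    collection_tokens : total number of tokens in the collection
    """
    # P_n(t): term distribution over the feedback documents, eq. (2).
    fb_tf = Counter()
    for doc in feedback_docs:
        fb_tf.update(doc)
    fb_tokens = sum(fb_tf.values())

    weights = {}
    for t, tf_n in fb_tf.items():
        p_n = tf_n / fb_tokens                              # eq. (2)
        p_c = collection_tf.get(t, 0) / collection_tokens   # eq. (3)
        if p_c > 0:
            # Terms with p_n <= p_c get a non-positive weight and would
            # simply never be selected for expansion.
            weights[t] = p_n * math.log2(p_n / p_c)          # eq. (1)
    return weights
```

The highest-weighted terms would then be appended to the original query, as in the PRF sketch shown in Section III.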
B. Bose-Einstein 1 (Bo1) Model
This model is based on Bose-Einstein statistics, and the weight of the term t in the top-ranked documents (rank ranging from 3 to 10) is given by [36]

w(t) = \sum_{d \in D_n} tf(t,d) \cdot \log_2 ( (1 + P_n) / P_n ) + \log_2 (1 + P_n)    (4)

P_n = \sum_{d \in C} tf(t,d) / N    (5)

where equation (5) denotes the average term frequency of t in the collection (N is the number of documents in the collection).
C. Bose-Einstein 2 (Bo2) Model
The scoring formula of Bo2 is given by

w(t) = tf_x \cdot \log_2 ( (1 + P_c) / P_c ) + \log_2 (1 + P_c)    (6)

where, in equations (4)-(6):
• tf_x is the frequency of the term in the top-returned documents;
• P_n is given by F/N, where F is the frequency of the query term in the whole collection and N is the number of documents in the collection;
• P_c = (F / token_c) \cdot l_x is the probability of the term t in the whole collection, scaled to the feedback set;
• l_x is the size, in tokens, of the exp_doc top-ranked documents, where exp_doc is a parameter of the query expansion methodology;
• token_c is the total number of tokens in the collection.
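The following sketch implements the Bo1 and Bo2 weights as written in equations (4)-(6) above; the statistics (tf_x, F, N, l_x, token_c) are passed in explicitly, whereas Terrier would read them from its index, so this is only a minimal illustration of the formulas.

```python
import math

def bo1_weight(tf_x, F, N):
    """Bo1 weight of a term, following equations (4)-(5).

    tf_x : frequency of the term in the top-returned documents
    F    : frequency of the term in the whole collection
    N    : number of documents in the collection
    """
    p_n = F / N                                                     # eq. (5)
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)   # eq. (4)

def bo2_weight(tf_x, F, l_x, token_c):
    """Bo2 weight of a term, following equation (6).

    l_x     : total length (in tokens) of the exp_doc top-ranked documents
    token_c : total number of tokens in the collection
    """
    p_c = (F / token_c) * l_x
    return tf_x * math.log2((1 + p_c) / p_c) + math.log2(1 + p_c)   # eq. (6)
```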
VI. EXPERIMENT ARCHITECTURE

In this paper, the performance of the Urdu collection under different query expansion methods is measured. First, retrieval was carried out with the Okapi (BM25) model [22] as the weighting scheme, using only the title field of the original queries. This run is chosen as the benchmark, which is then improved by expanding each topic with related concepts; the baseline results are shown in Table II. The expanded terms assist in matching relevant documents to the associated query and help to reduce the vocabulary mismatch between the documents and the queries [37]. Several experiments were performed using the set of 52 queries with the Okapi (BM25) weighting scheme. Here, Rocchio's beta = 0.4 and the retrieval parameter b = 0.4 gave the best results for query expansion. The evaluation is performed with the Terrier information retrieval framework, which was found to be quite effective for indexing, retrieval and evaluation of English as well as non-English documents. To evaluate the results of the retrieval process, the trec_eval program of the TREC conference is used. trec_eval reports measures such as the total number of retrieved, relevant and relevant-retrieved (rel_ret) documents over all queries, as well as MAP, R-precision and interpolated recall-precision averages. For evaluating retrieval performance in this experiment, the mean average precision (MAP) is chosen, computed over (at most) the top 100 retrieved documents per query.

TABLE II. MAP, R-PREC, P@K OF BM25

Okapi BM25 (baseline, without query expansion)
Mean Average Precision (MAP) | 0.3162
R-precision                  | 0.2990
P@10                         | 0.3308
P@20                         | 0.2702
P@100                        | 0.1360
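As an illustration of the evaluation step, trec_eval is typically invoked with a relevance-judgment (qrels) file and a run file in the standard six-column TREC format. The file names below are placeholders, and the example assumes a trec_eval build that supports per-measure selection with the -m flag (running it without -m prints the default set of measures).

```python
import subprocess

# Placeholder file names; a TREC run file has lines of the form
#   <topic-id> Q0 <doc-id> <rank> <score> <run-tag>
# and the qrels file has lines of the form
#   <topic-id> 0 <doc-id> <relevance>.
result = subprocess.run(
    ["trec_eval", "-m", "map", "-m", "P.10", "urdu.qrels", "bm25_bo1.run"],
    capture_output=True, text=True, check=True)
print(result.stdout)
```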
VII. EXPERIMENTS FOR QUERY EXPANSION WITH BO1, BO2 AND KL MODELS

In these experiments, Rocchio's approach is adopted to enhance the Okapi (BM25) retrieval model. First, the top 5, 10 or 15 retrieved documents were chosen, and from each set of feedback documents 5, 10, 15, 30 or 50 terms were extracted. These terms were then added to the original queries in order to examine whether the results differ significantly. The top 100 documents are retrieved initially using the baseline retrieval model; the model is then combined with the Bo1, Bo2 and KL expansion models and analysed using MAP to measure any further improvement. Table III shows the results obtained after query expansion.
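A compact sketch of the parameter sweep just described; expand_query, run_retrieval and mean_average_precision are assumed helper functions (for example, a PRF expansion routine like the sketch in Section III, a BM25 run, and a MAP implementation or a call out to trec_eval), so this is an outline of the procedure rather than the actual experimental code.

```python
# Grid of feedback settings explored in the experiments.
doc_settings = [5, 10, 15]
term_settings = [5, 10, 15, 30, 50]

def sweep(queries, qrels, expand_query, run_retrieval, mean_average_precision):
    """Return MAP for every (feedback documents, expansion terms) setting."""
    results = {}
    for k_docs in doc_settings:
        for n_terms in term_settings:
            # Expand every topic with the chosen feedback setting.
            expanded = {qid: expand_query(q, k_docs=k_docs, n_terms=n_terms)
                        for qid, q in queries.items()}
            # Retrieve (at most) the top 100 documents per expanded query.
            run = run_retrieval(expanded, top_k=100)
            results[(k_docs, n_terms)] = mean_average_precision(run, qrels)
    return results
```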
VIII. RESULTS AND DISCUSSIONS

In this work, the baseline text retrieval is compared with the expansion variants discussed in Section V. Table III presents the results obtained on the Urdu dataset. The highest MAP for BM25 (b=1.0) enhanced by the Bo1 model shows an overall improvement of +25.55% over the baseline retrieval, obtained with 10 feedback documents and 15 selected terms. Likewise, the highest MAP for BM25 (b=1.0) enhanced with the KL expansion model shows a +25.68% improvement over the baseline, obtained with 10 feedback documents and 5 selected terms. In all the cases examined, the Bo1 and KL models performed almost identically, whereas the results obtained with Bo2 are not appreciable: fewer relevant documents are retrieved because of term-mismatch issues between the original query terms and the candidate expansion terms.
IX. CONCLUSION AND FUTURE WORK

The focus of the present work is to observe the effect of query expansion on the given Urdu dataset, which has not been addressed before on such a scale. The results show that the KL model performed extremely well in comparison with the other expansion models, Bo1 and Bo2, on the present Urdu data collection. In addition, the Kullback-Leibler model was found to significantly enhance the MAP of the retrieved results, by roughly 23-26% in this study. In future work, external resources such as a WordNet-based approach combined with the Bo2 model will be investigated to further improve the effectiveness of Urdu information retrieval.
ACKNOWLEDGMENT
We sincerely thank Mr. Zahoor Ahmad Shora, chief editor of `Daily Roshni', for generously sharing the raw data for the collection, and we also thank Mr. Hamaid Mehmood for his guidance and kind support.
REFERENCES
[1] A. Singhal, "Modern information retrieval: A brief overview," IEEE Data Eng. Bull., vol. 24, no. 4, pp. 35–43, 2001.
[2] P. Majumdar, M. Mitra, S. K. Parui, and P. Bhattacharya, "Initiative for Indian language IR evaluation," 2007.
[3] E. Darrudi, M. R. Hejazi, and F. Oroumchian, "Assessment of a modern Farsi corpus," in Proceedings of the 2nd Workshop on Information Technology & its Disciplines (WITID), 2004.
[4] A. Daud, W. Khan, and D. Che, "Urdu language processing: a survey," Artificial Intelligence Review, pp. 1–33, 2016.
[5] D. Becker and K. Riaz, "A study in Urdu corpus construction," in Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization, vol. 12, Association for Computational Linguistics, 2002, pp. 1–5.
[6] A. Hardie, "Developing a tagset for automated part-of-speech tagging in Urdu," in Corpus Linguistics 2003, 2003.
[7] S. Hussain, "Resources for Urdu language processing," in IJCNLP, 2008, pp. 99–100.
[8] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen, and R. Parveen, "CLE Urdu digest corpus," Language & Technology, vol. 47, 2012.
[9] W. Khan, A. Daud, J. A. Nasir, and T. Amjad, "Named entity dataset for Urdu named entity recognition task," Organization, vol. 48, p. 282.
[10] S. Alnofaie, M. Dahab, and M. Kamal, "A novel information retrieval approach using query expansion and spectral-based," Information Retrieval, vol. 7, no. 9, 2016.
[11] M. Y. Dahab, M. Kamel, and S. Alnofaie, "Further investigations for documents information retrieval based on DWT," in International Conference on Advanced Intelligent Systems and Informatics, Springer, 2016, pp. 3–11.
[12] D. Pal, M. Mitra, and K. Datta, "Query expansion using term distribution and term association," arXiv preprint arXiv:1303.0667, 2013.
[13] D. Pal, M. Mitra, and K. Datta, "Improving query expansion using WordNet," Journal of the Association for Information Science and Technology, vol. 65, no. 12, pp. 2469–2478, 2014.
[14] K. Riaz, "Baseline for Urdu IR evaluation," in Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching, ACM, 2008, pp. 97–100.
[15] K. Riaz, "Urdu is not Hindi for information access," in Workshop on Multilingual Information Access, SIGIR, 2009.
[16] K. Riaz, "Comparison of Hindi and Urdu in computational context," Int. J. Comput. Linguist. Nat. Lang. Process., vol. 1, no. 3, pp. 92–97, 2012.
[17] M. I. Razzak, "Online Urdu character recognition in unconstrained environment," Ph.D. dissertation, International Islamic University, Islamabad, 2011.
[18] V. Gupta, N. Joshi, and I. Mathur, "Design & development of rule based inflectional and derivational Urdu stemmer Usal," in Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), 2015 International Conference on, IEEE, 2015, pp. 7–12.
[19] S. Iqbal, M. W. Anwar, U. I. Bajwa, and Z. Rehman, "Urdu spell checking: Reverse edit distance approach," in Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing, 2013, pp. 58–65.
[20] S. Stymne, "Spell checking techniques for replacement of unknown words and data cleaning for Haitian Creole SMS translation," in Proceedings of the Sixth Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2011, pp. 470–477.
[21] A. Spink, D. Wolfram, M. B. Jansen, and T. Saracevic, "Searching the web: The public and their queries," Journal of the Association for Information Science and Technology, vol. 52, no. 3, pp. 226–234, 2001.
[22] S. E. Robertson, "The probability ranking principle in IR," Journal of Documentation, vol. 33, no. 4, pp. 294–304, 1977.
[23] F. Diaz, "Pseudo-query reformulation," in European Conference on Information Retrieval, Springer, 2016, pp. 521–532.
[24] G. Salton and C. Buckley, "Improving retrieval performance by relevance feedback," Readings in Information Retrieval, vol. 24, no. 5, pp. 355–363, 1997.
[25] D. Metzler and W. B. Croft, "Latent concept expansion using Markov random fields," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2007, pp. 311–318.
[26] J. J. Rocchio, "Relevance feedback in information retrieval," in The SMART Retrieval System: Experiments in Automatic Document Processing, 1971.
[27] J. Xu and W. B. Croft, "Query expansion using local and global document analysis," in ACM SIGIR Forum, vol. 51, no. 2, ACM, 2017, pp. 168–175.
[28] C. Zhai and J. Lafferty, "Model-based feedback in the language modeling approach to information retrieval," in Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM, 2001, pp. 403–410.
[29] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the web," ACM Transactions on Internet Technology (TOIT), vol. 1, no. 1, pp. 2–43, 2001.
[30] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley Longman, Reading, MA, 1999.
[31] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.
[32] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma, "Terrier: A high performance and scalable information retrieval platform," in Proceedings of the OSIR Workshop, 2006, pp. 18–25.
[33] G. Amati, "Probability models for information retrieval based on divergence from randomness," Ph.D. dissertation, University of Glasgow, 2003.
[34] V. Plachouras, B. He, and I. Ounis, "University of Glasgow at TREC 2004: Experiments in web, robust, and terabyte tracks with Terrier," in TREC, 2004.
[35] T. Cover and J. Thomas, Elements of Information Theory. Wiley, New York, NY, 1991.
[36] C. Macdonald, B. He, V. Plachouras, and I. Ounis, "University of Glasgow at TREC 2005: Experiments in terabyte and enterprise tracks with Terrier," in TREC, 2005.
[37] M. Shokouhi and J. Zobel, "Robust result merging using sample-based score estimates," ACM Transactions on Information Systems (TOIS), vol. 27, no. 3, p. 14, 2009.
TABLE III. MAP OF DIFFERENT EXPANSION MODELS BASED ON ROCCHIO PSEUDO-RELEVANCE FEEDBACK
(Okapi BM25, a probabilistic retrieval model; MAP without PRF = 0.3162)

No. of documents | No. of terms | BM25_Bo1          | BM25_Bo2          | BM25_KL
5 documents      | 5 terms      | 0.3867 (+22.30%)  | 0.2103 (-33.49%)  | 0.3884 (+22.83%)
5 documents      | 10 terms     | 0.3866 (+22.26%)  | 0.2090 (-33.90%)  | 0.3899 (+23.31%)
5 documents      | 15 terms     | 0.3899 (+23.31%)  | 0.2180 (-31.06%)  | 0.3902 (+23.40%)
5 documents      | 30 terms     | 0.3905 (+23.50%)  | 0.2453 (-22.42%)  | 0.3936 (+24.48%)
5 documents      | 50 terms     | 0.3893 (+23.12%)  | 0.2694 (-14.80%)  | 0.3937 (+24.51%)
10 documents     | 5 terms      | 0.3935 (+24.45%)  | 0.1772 (-43.96%)  | 0.3974 (+25.68%)
10 documents     | 10 terms     | 0.3941 (+24.64%)  | 0.1913 (-39.50%)  | 0.3951 (+24.95%)
10 documents     | 15 terms     | 0.3970 (+25.55%)  | 0.1939 (-38.68%)  | 0.3965 (+25.40%)
10 documents     | 30 terms     | 0.3931 (+24.32%)  | 0.2339 (-26.03%)  | 0.3967 (+25.46%)
10 documents     | 50 terms     | 0.3927 (+24.19%)  | 0.2642 (-16.45%)  | 0.3965 (+25.40%)
15 documents     | 5 terms      | 0.3958 (+25.17%)  | 0.1787 (-43.49%)  | 0.3915 (+23.81%)
15 documents     | 10 terms     | 0.3924 (+24.10%)  | 0.1826 (-42.25%)  | 0.3903 (+23.43%)
15 documents     | 15 terms     | 0.3966 (+25.43%)  | 0.1887 (-40.32%)  | 0.3965 (+25.40%)
15 documents     | 30 terms     | 0.3966 (+25.43%)  | 0.2263 (-28.43%)  | 0.3966 (+25.43%)
15 documents     | 50 terms     | 0.3914 (+23.78%)  | 0.2623 (-17.05%)  | 0.3923 (+24.07%)