UniNE at FIRE 2008: Hindi, Bengali, and Marathi IR
Ljiljana Dolamic, Jacques Savoy
Computer Science Department, University of Neuchatel, Rue Emile Argand 11, 2009 Neuchatel, Switzerland
{Ljiljana.Dolamic, Jacques.Savoy}@unine.ch
ABSTRACT
In participating in this first FIRE evaluation campaign, we design and evaluate stopword lists and light stemming strategies for the Hindi, Bengali and Marathi languages. As members of the Indo-European language family, these languages tend to have similar syntax and morphology, and also related writing systems. Our second objective is to obtain a better picture of the relative merit of various search engines in exploring Hindi, Bengali and Marathi documents. To evaluate these solutions we use various IR models, including Okapi, Divergence from Randomness (DFR) and a statistical language model (LM), together with the classical tf · idf vector-processing approach. Our experiments with these three languages tend to demonstrate that the I(n_e)C2 or the PB2 model derived from the DFR paradigm produces the best overall retrieval performance. Moreover, increasing the query size (e.g., from T to TD, or from TD to TDN) clearly improves the retrieval effectiveness measured by the MAP. The retrieval performance differences are usually statistically significant. Applying the Z-score data fusion operator after a blind-query expansion also tends to improve the MAP of the merged run over the best single IR system.

For the Hindi language, our experiments demonstrate that our light stemming procedure produces better retrieval results than an indexing strategy ignoring this word normalization procedure. When comparing a word-based (with a light stemmer) and a 4-gram indexing strategy, our evaluations tend to show relatively better performance for the word-based scheme. For the Bengali language, as for Hindi, a word-based approach (with a light stemmer) tends to produce better retrieval effectiveness than a 4-gram indexing scheme. The performance difference with a non-stemming approach clearly favors our light stemmer. For the Bengali language we also developed a more aggressive stemmer, producing a retrieval performance comparable to that achieved by the light stemmer. For the Marathi language, which presents a more complex inflectional morphology, the performance difference between a word-based and a 4-gram indexing strategy tends to favor the n-gram scheme. The suggested light stemmer nevertheless produces better results than an IR scheme without a stemming stage.
1. INTRODUCTION
The main objective of the IR group at the University of Neuchatel is to design, implement and evaluate IR strategies and models for various natural languages, including popular European [1] and Far-East languages (Chinese, Japanese, and Korean) [2]. This objective also includes bilingual IR (queries written in one language, documents retrieved in another) and multilingual IR systems (targeted information items are written in different languages). In our participation in the first FIRE campaign (www.isical.ac.in/~fire/), our main motivation is to develop tools for monolingual IR for various Indian languages. The rest of this paper is organized as follows: Section 2 presents an overview of the corpora used in the FIRE-2008 ad hoc track. Section 3 outlines the main aspects of the various IR models used with the FIRE test collections, together with the stopword lists and stemming strategies we developed for these languages. Section 4 presents the evaluation carried out on the various probabilistic IR models for the Hindi, Bengali and Marathi corpora. Finally, Section 5 describes our official runs and their evaluation.
2. OVERVIEW OF THE CORPORA
The test collections were based on various news sources covering the period from September 2004 to September 2007, for the three languages Hindi (HI), Bengali (BN) and Marathi (MR). For example, the sources used for Bengali were CRI and Anandabazar Patrika (a newspaper edited by ABP Ltd). The encoding system used for both documents and queries is UTF-8. Table 1 lists statistics on the three corpora, showing that the Hindi and Bengali collections are similar in size (in MB) while the Marathi collection is smaller. As for document length, the Hindi corpus has the largest mean length, whether based on the mean number of distinct indexing terms or on the mean number of indexing terms per article (about 356 terms/document). Based on the latter measure, the Bengali and Marathi corpora have similar mean document lengths (about 275 indexing terms/article).

Based on the TREC model [3], each topic record consists of three logical sections, namely a brief title (denoted T), a one-sentence description (D), and a narrative part (N) indicating the relevance assessment criteria. As shown in Table 17 (see Appendix), the available topics cover various subjects (e.g., Topic #028: "Iran's Nuclear Programme," Topic #034: "Jessica Lall Murder"), including cultural questions (Topic #050: "Kolkata Book Fair 2007" or Topic #070: "Remake in Bollywood"), scientific problems (Topic #045: "Global Warming") and sports (Topic #073: "Zinedine Zidane's headbutting incident at the World Cup"). Certain topics seem to be more national in coverage (Topic #041: "New Labour Laws in France," Topic #058: "Thailand Coup"). The real topic subject matter is sometimes difficult to determine, at least based on the title section (Topic #049: "World wide natural calamities," Topic #052: "Budget 2006-2007"). We were surprised to see that topic descriptions contained many proper names (e.g., geographical names such as "Singur," "China," "Kolkata", personal names such as "Bush," "Sania Mirza," or products such as "Prince" and "Bofors") as well as acronyms ("ULFA," "CBI," "HIV," "LOC").

Table 1 (bottom part) also compares the number of relevant documents per request, with the mean always being greater than the median (e.g., for the Marathi collection, the average number of relevant documents per query is 22.35, with the corresponding median being 18). This indicates that each collection contains numerous queries for which only a rather small number of relevant items could be found. For each collection, 50 queries were created (numbered from #26 to #75) and, if necessary, manually translated into the other languages. Relevant documents could not, however, be found for every request in every language. For the Hindi language, five topics (#40, #43, #47, #48, and #50) do not have any relevant items in the collection. For the Marathi collection, Topic #70 ("Remake in Bollywood") did not have any relevant items. In the Bengali collection, the largest number of relevant articles is 149, for Topic #32 ("Relations between Congress and its allies"). On the other hand, in the Marathi corpus, Topic #47 ("Nobel Prize missing"), Topic #50 ("Kolkata Book Fair 2007") and Topic #72 ("Stamp paper scam") have the smallest number of relevant documents (one each).

Table 1: FIRE test-collection statistics

                 Hindi          Bengali        Marathi
Size             718 MB         732 MB         487 MB
# doc            95,215         123,047        99,357
# terms          127,658        249,215        511,550
Number of distinct indexing terms per document
Mean             231.1          181.59         171.5
Std              199.48         99.04          102
Median           179            166            149
Maximum          2,284          1,481          2,166
Minimum          0              0              16
Number of indexing terms per document
Mean             356.2          291.88         264.6
Std              400.43         180.62         188.96
Median           256            265            222
Maximum          6,998          2,928          5,077
Minimum          0              0              28
For the queries
Number           45             50             49
Rel. items       3,436          1,893          1,095
Mean             76.36          37.26          22.35
Std              55.12          27.47          20.69
Median           67             29.5           18
Maximum          194 (T #60)    149 (T #32)    84 (T #45)
Minimum          1 (T #59)      5 (T #71)      1 (T #47)
3. IR MODELS AND STEMMING STRATEGIES

3.1 IR Models
In order to obtain higher MAP values, we considered adopting different weighting schemes for the terms included in documents or queries. This would allow us to account for term occurrence frequencies (denoted $tf_{ij}$ for indexing term $t_j$ in document $D_i$), as well as their inverse document frequency (denoted $idf_j$). Moreover, we considered normalizing each indexing weight using the cosine to obtain the classical tf · idf formulation. In addition to this classical vector-space approach, we also considered probabilistic models such as Okapi (or BM25) [4], which also take document length into account. As a second probabilistic approach we implemented several variants of the DFR (Divergence from Randomness) family of models suggested by Amati & van Rijsbergen [5]. In this framework, the indexing weight $w_{ij}$ attached to term $t_j$ in document $D_i$ combines two information measures as follows:

$$w_{ij} = Inf^1_{ij} \cdot Inf^2_{ij} = -\log_2\left[Prob^1_{ij}(tf)\right] \cdot \left(1 - Prob^2_{ij}\right) \tag{1}$$

As a first model, we implemented the PB2 scheme, as defined in the following equations:

$$Prob^1_{ij} = \frac{e^{-\lambda_j} \cdot \lambda_j^{tf_{ij}}}{tf_{ij}!} \quad \text{with } \lambda_j = \frac{tc_j}{n} \tag{2}$$

$$Prob^2_{ij} = 1 - \frac{tc_j + 1}{df_j \cdot (tfn_{ij} + 1)} \tag{3}$$

$$\text{with } tfn_{ij} = tf_{ij} \cdot \log_2\left(1 + \frac{c \cdot \mathit{mean\ dl}}{l_i}\right) \tag{4}$$

where $tc_j$ indicates the number of occurrences of term $t_j$ in the collection, $l_i$ the length (number of indexing terms) of document $D_i$, $\mathit{mean\ dl}$ the average document length, $n$ the number of documents in the corpus, and $c$ a constant (see Table 16 in the Appendix for the corresponding values). For the second model, GL2, the implementation of $Prob^1_{ij}$ is shown in Equation 5, and $Prob^2_{ij}$ in Equation 6, as follows:

$$Prob^1_{ij} = \frac{1}{1 + \lambda_j} \cdot \left[\frac{\lambda_j}{1 + \lambda_j}\right]^{tfn_{ij}} \tag{5}$$

$$Prob^2_{ij} = \frac{tfn_{ij}}{tfn_{ij} + 1} \tag{6}$$

where $\lambda_j$ and $tfn_{ij}$ were defined previously. For the third model, PL2, the implementation was carried out using Equation 2 for $Prob^1_{ij}$ and Equation 6 for $Prob^2_{ij}$. For the fourth model, I(n_e)B2, the implementation was carried out using the following two equations:

$$Inf^1_{ij} = tfn_{ij} \cdot \log_2\left[\frac{n + 1}{n_e + 0.5}\right] \quad \text{with } n_e = n \cdot \left[1 - \left(\frac{n - 1}{n}\right)^{tc_j}\right] \tag{7}$$

$$Prob^2_{ij} = 1 - \frac{tc_j + 1}{df_j \cdot (tfn_{ij} + 1)} \tag{8}$$
where $n$, $tc_j$ and $tfn_{ij}$ were defined previously and $df_j$ indicates the number of documents in which the term $t_j$ occurs. If we replace the $\log_2()$ in Equation 4 by $\ln()$, the natural logarithm, we obtain the I(n_e)C2 model.

Finally we also considered an approach based on a statistical language model (LM) [6], [7], known as a non-parametric probabilistic model (the Okapi and DFR are viewed as parametric models). Probability estimates would thus not be based on any known distribution (e.g., as in Equation 2 or 5), but rather estimated directly based on the term occurrence frequencies in document $D_i$ or corpus $C$. Within this language model paradigm, various implementations and smoothing methods might be considered, although in this study we adopted a model proposed by Hiemstra [7], as described in Equation 9, combining an estimate based on the document ($Prob[t_j|D_i]$) and on the corpus ($Prob[t_j|C]$), corresponding to the Jelinek-Mercer smoothing approach.

$$Prob[D_i|Q] = Prob[D_i] \cdot \prod_{t_j \in Q} \left[\lambda_j \cdot Prob[t_j|D_i] + (1 - \lambda_j) \cdot Prob[t_j|C]\right] \tag{9}$$

$$Prob[t_j|D_i] = \frac{tf_{ij}}{l_i}, \qquad Prob[t_j|C] = \frac{df_j}{lc} \quad \text{with } lc = \sum_k df_k$$

where $\lambda_j$ is a smoothing factor (constant for all indexing terms $t_j$, and usually fixed at 0.35) and $lc$ an estimate of the size of the corpus $C$.

Table 2: MAP of various IR models with light stemmer (Hindi collection)

Light Stemming   Mean Average Precision (Hindi)
Model            T         TD        TDN
Okapi            0.2601    0.3195    0.3653
DFR-PB2          0.2818    0.3436    0.3495
DFR-GL2          0.2457    0.2926    0.3278
DFR-PL2          0.2746    0.3271    0.3592
DFR-I(n_e)B2     0.2616    0.3301    0.3668
DFR-I(n_e)C2     0.2692    0.3357    0.3750
LM               0.2369    0.3023    0.3432
tf idf           0.1756    0.2060    0.2151
Average          0.2614    0.3216    0.3553
% over T         base      +23.0%    +35.9%
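As an illustration of how Equations 1 to 4 and Equation 9 could be evaluated, the following Python sketch computes a PB2 indexing weight and a language-model score. It is a minimal sketch and not the authors' implementation: the function and parameter names (`tfn`, `pb2_weight`, `lm_score`, `prior`) are hypothetical, the Poisson term is computed in log space to avoid underflow, and the document prior $Prob[D_i]$ is assumed to be a constant supplied by the caller.

```python
import math

def tfn(tf, doc_len, mean_dl, c=1.5):
    """Normalized term frequency of Equation 4."""
    return tf * math.log2(1 + c * mean_dl / doc_len)

def pb2_weight(tf, tc, df, n, doc_len, mean_dl, c=1.5):
    """PB2 weight w_ij = Inf1 * Inf2 (Equations 1 to 4); assumes tc > 0 and df > 0."""
    lam = tc / n                                   # lambda_j = tc_j / n
    # ln Prob1 = -lambda + tf * ln(lambda) - ln(tf!), then converted to base 2
    log2_prob1 = (-lam + tf * math.log(lam) - math.lgamma(tf + 1)) / math.log(2)
    inf1 = -log2_prob1
    inf2 = (tc + 1) / (df * (tfn(tf, doc_len, mean_dl, c) + 1))   # 1 - Prob2
    return inf1 * inf2

def lm_score(query_terms, doc_tf, doc_len, df, lc, lam=0.35, prior=1.0):
    """Logarithm of Equation 9 with Jelinek-Mercer smoothing.

    doc_tf maps term -> tf_ij, df maps term -> df_j, lc is the sum of all df_k.
    """
    score = math.log(prior)
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len
        p_corpus = df.get(term, 0) / lc
        p = lam * p_doc + (1 - lam) * p_corpus
        if p == 0.0:
            return float("-inf")                   # term unseen in the whole corpus
        score += math.log(p)
    return score
```

Working with logarithms in `lm_score` leaves the document ranking unchanged while avoiding numerical underflow for long queries.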
3.2 Stopword Lists and Stemmers
In defining our indexing strategies, we used a stopword list to denote very frequently occurring forms having no important impact on sense matching between topic and document representatives (e.g., "the," "in," "or," "has," etc.). Following the guidelines described in [8], we established a general stopword list for these Indian languages. Firstly, we sorted all word forms appearing in our corpora according to their occurrence frequency, and extracted the 200 most frequently occurring words. Secondly, we inspected these lists in order to remove all numbers (e.g., "2004", "1"), plus all nouns and adjectives more or less directly related to the main subjects of the underlying collections. Thirdly, we included certain words that carried no information, even though they did not appear among the 200 most frequent words. For example, we added various personal or possessive pronouns (such as "my"), postpositions (e.g., "near") and conjunctions (e.g., "where"). In our experiments, our final stopword lists contained 165 Hindi and 114 Bengali terms (due to time constraints, we were not able to establish a stopword list for the Marathi language). We may mention that, compared to other Indo-European languages, these lists are rather short (e.g., we used a list of 471 words for the English language).

In order to develop a light stemmer for Hindi, Bengali and Marathi, we continued applying our main strategy of developing a stemmer that removes only those inflectional suffixes attached to nouns and adjectives (see for example [9] for the English language). We ignored verbal suffixes because we considered matches based on verbal forms between the query and document expressions to be less important. Moreover, verbal suffixes tended to be more numerous than noun and adjective suffixes. Since our stemmer neither considered parts of speech nor involved complex morphological analyses, we thought these numerous verbal suffixes might hurt the mean average precision when included in a fast suffix-stripping approach. Finally, light stemmers tended to result in better retrieval effectiveness than more aggressive stemmers (e.g., Porter's stemmer [10] for the English language) that also remove derivational suffixes [11].
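The automatic part of the stopword procedure and the general shape of a light suffix-stripping stemmer can be sketched as follows. This is only an illustrative sketch: the whitespace tokenization, the function names and the `min_stem_len` safeguard are assumptions, the manual inspection steps described above are not modeled, and the suffix list must be supplied by the caller since the actual suffixes used for Hindi, Bengali and Marathi are not listed in this paper.

```python
import re
from collections import Counter

def stopword_candidates(documents, top_k=200):
    """Return the top_k most frequent word forms, then drop pure numbers.

    The remaining manual steps (removing topical nouns and adjectives,
    adding pronouns, postpositions and conjunctions) are left to a human.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.split())          # naive whitespace tokenization
    frequent = [word for word, _ in counts.most_common(top_k)]
    return [word for word in frequent if not re.fullmatch(r"\d+", word)]

def light_stem(word, suffixes, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters.

    `suffixes` is a hypothetical, caller-supplied list of noun and
    adjective inflectional suffixes for the target language.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word
```

For instance, with the placeholder English suffix list `["s"]`, `light_stem("books", ["s"])` returns `"book"`, while a word shorter than the minimum stem length plus the suffix is left untouched.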
4. OVERALL EVALUATION
To measure retrieval performance, we adopted MAP values computed on the basis of 1,000 retrieved items per request, as calculated with the TREC_EVAL program [12]. The values we report may differ from those computed according to the official measure: the latter always takes 50 queries into account, while in our presentation we did not account for queries having no relevant items (the Hindi and Marathi collections have respectively 45 and 49 queries with at least one relevant document). In the following tables, best performances under given conditions (same indexing scheme and same collection) are listed in bold type.
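The evaluation measure described above can be sketched in a few lines of Python. This is a simplified illustration of non-interpolated average precision, not a replacement for the TREC_EVAL program; the function names and the input structures (`runs`, `qrels`) are hypothetical. As in our presentation, queries without any relevant item are skipped when averaging.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average precision of one ranked list, truncated at `cutoff` items."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("average precision is undefined without relevant items")
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank     # precision at this relevant item
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels, cutoff=1000):
    """MAP over the queries having at least one relevant item.

    runs  : dict query_id -> ranked list of document ids
    qrels : dict query_id -> set of relevant document ids
    """
    scores = [
        average_precision(runs.get(query_id, []), relevant, cutoff)
        for query_id, relevant in qrels.items()
        if relevant
    ]
    return sum(scores) / len(scores)
```

Dividing instead by the full number of topics (50) would reproduce the behavior attributed to the official measure, where queries without relevant items still count in the denominator.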
4.1 Evaluation of the Various Probabilistic IR Models
Using the Hindi collection, Table 2 shows the MAP obtained by various probabilistic models with three different query formulations (T, TD and TDN) and a word-based indexing scheme with light stemming, while Table 3 depicts the same information using a character 4-gram [13] indexing scheme. In this scheme, each word is decomposed into overlapping sequences of n characters. For example, when analyzing the term "system", the following 4-grams are extracted: "syst", "yste", and "stem". With this type of indexing approach, stopword lists and stemmers adapted for the corresponding language are not required, since during indexing the n-grams appearing in all documents (e.g., "with", "have" or very frequent suffixes like "-ment") will be assigned null or at least insignificant weights. In the bottom part of the following tables, the label "Average" gives the average MAP across the 7 (or 6 for the Bengali language) probabilistic IR models (thus ignoring the tf · idf approach). The last line shows the percentage variation obtained when compared to the short (T) query formulation.

Table 3: MAP of various IR models with 4-gram indexing strategy (Hindi collection)

4-gram           Mean Average Precision (Hindi)
Model            T         TD        TDN
Okapi            0.2495    0.2918    0.3273
DFR-PB2          0.2531    0.3082    0.3193
DFR-GL2          0.2259    0.2710    0.3009
DFR-PL2          0.2537    0.2967    0.3369
DFR-I(n_e)B2     0.2601    0.2967    0.3280
DFR-I(n_e)C2     0.2629    0.3020    0.3390
LM               0.2199    0.2735    0.3313
tf idf           0.1750    0.2069    0.2249
Average          0.2464    0.2914    0.3261
% over T         base      +18.2%    +32.3%

Table 4: MAP of various IR models without stemmer (Hindi collection)

No Stemming      Mean Average Precision (Hindi)
Model            T         TD        TDN
Okapi            0.2179    0.2792    0.3388
DFR-PB2          0.2424    0.3018    0.3395
DFR-GL2          0.1960    0.2445    0.3005
DFR-PL2          0.2220    0.2833    0.3350
DFR-I(n_e)B2     0.2255    0.2848    0.3411
DFR-I(n_e)C2     0.2311    0.2954    0.3518
LM               0.1872    0.2581    0.3128
tf idf           0.1548    0.1895    0.2066
Average          0.        0.        0.
% over T         base      +%        +%

For the Bengali language, Table 5 depicts the MAP obtained by various probabilistic models with three different query formulations (T, TD and TDN) and a word-based indexing scheme with light stemming. Table 6 shows the same information using a language-independent approach, namely a character 4-gram [13] indexing scheme. The MAP obtained when we ignore the stemming stage is given in Table 7. This retrieval performance is on average lower than that of the light stemming approach (given in Table 5). Finally, we designed and implemented a more aggressive stemmer that removes some frequent derivational suffixes. The corresponding MAP values are depicted in Table 8, showing a retrieval performance similar to that obtained with our light stemmer (see Table 5).

Table 5: MAP of various IR models with light stemmer (Bengali collection)

Light Stemming   Mean Average Precision (Bengali)
Model            T         TD        TDN
Okapi            0.2958    0.3372    0.4124
DFR-PB2          0.3104    0.3487    0.4054
DFR-GL2          0.2733    0.3078    0.3883
DFR-I(n_e)B2     0.2997    0.3480    0.4131
DFR-I(n_e)C2     0.3069    0.3554    0.4130
LM               0.2671    0.3032    0.3849
tf idf           0.1850    0.2081    0.2422
Average          0.2922    0.3334    0.4029
% over T         base      +14.1%    +37.9%

Table 6: MAP of various IR models using 4-gram indexing strategy (Bengali collection)

4-gram           Mean Average Precision (Bengali)
Model            T         TD        TDN
Okapi            0.2519    0.2979    0.3787
DFR-PB2          0.2606    0.3076    0.3873
DFR-GL2          0.2341    0.2735    0.3616
DFR-I(n_e)B2     0.2654    0.3174    0.4005
DFR-I(n_e)C2     0.2694    0.3214    0.4074
LM               0.2226    0.2649    0.3537
tf idf           0.1724    0.2003    0.2486
Average          0.2507    0.2971    0.3815
% over T         base      +18.5%    +52.2%

Table 7: MAP of various IR models without stemming (Bengali collection)

No Stemming      Mean Average Precision (Bengali)
Model            T         TD        TDN
Okapi            0.2595    0.3025    0.3732
DFR-PB2          0.2641    0.3056    0.3715
DFR-GL2          0.2453    0.2811    0.3638
DFR-I(n_e)B2     0.2644    0.3095    0.3808
DFR-I(n_e)C2     0.2680    0.3118    0.3786
LM               0.2324    0.2659    0.3511
tf idf           0.1810    0.2073    0.2402
Average          0.2556    0.2961    0.3698
% over T         base      +15.8%    +44.7%

Table 8: MAP of various IR models with an aggressive stemmer (Bengali collection)

Aggressive Stem. Mean Average Precision (Bengali)
Model            T         TD        TDN
Okapi            0.2952    0.3322    0.4179
DFR-PB2          0.3025    0.3497    0.4113
DFR-GL2          0.2720    0.3056    0.3896
DFR-I(n_e)B2     0.2950    0.3469    0.4161
DFR-I(n_e)C2     0.3025    0.3526    0.4174
LM               0.2626    0.3009    0.3851
tf idf           0.1773    0.1988    0.2354
Average          0.2883    0.3312    0.3896
% over T         base      +14.9%    +35.1%

Finally, for the Marathi language, Table 9 depicts the retrieval performance achieved by various probabilistic models with three different query formulations (T, TD and TDN) and a word-based indexing scheme with light stemming. Table 10 shows the same information using a character 4-gram [13] indexing scheme. This latter indexing scheme tends to produce, on average, better retrieval effectiveness than the word-based indexing strategy. This could be interpreted as evidence that the light stemming approach does not work well for this more inflectional language. The MAP obtained when we ignore the stemming stage is given in Table 11. This retrieval performance is on average lower than that of the light stemming approach (given in Table 9).

Table 9: MAP of various IR models with light stemmer (Marathi collection)

Light Stemming   Mean Average Precision (Marathi)
Model            T         TD        TDN
Okapi            0.2598    0.3133    0.3961
DFR-PB2          0.2612    0.3086    0.3807
DFR-GL2          0.2527    0.3144    0.3751
DFR-PL2          0.2554    0.3104    0.3860
DFR-I(n_e)B2     0.2655    0.3174    0.3925
DFR-I(n_e)C2     0.2779    0.3161    0.3997
LM               0.2436    0.3036    0.3851
tf idf           0.1604    0.2363    0.3138
Average          0.2602    0.3120    0.3879
% over T         base      +19.9%    +49.1%

Table 10: MAP of various IR models using 4-gram indexing scheme (Marathi collection)

4-gram           Mean Average Precision (Marathi)
Model            T         TD        TDN
Okapi            0.2945    0.3554    0.4117
DFR-PB2          0.2810    0.3396    0.3339
DFR-GL2          0.2817    0.3378    0.3574
DFR-PL2          0.2933    0.3433    0.3835
DFR-I(n_e)B2     0.3268    0.3864    0.4056
DFR-I(n_e)C2     0.3253    0.3860    0.4263
LM               0.2754    0.3401    0.4124
tf idf           0.2459    0.2774    0.3555
Average          0.2903    0.3555    0.3901
% over T         base      +22.5%    +34.4%

Table 11: MAP of various IR models without stemmer (Marathi collection)

No Stemming      Mean Average Precision (Marathi)
Model            T         TD        TDN
Okapi            0.2268    0.2514    0.3410
DFR-PB2          0.2256    0.2472    0.3378
DFR-GL2          0.2129    0.2517    0.3276
DFR-PL2          0.2204    0.2492    0.3353
DFR-I(n_e)B2     0.2313    0.2542    0.3525
DFR-I(n_e)C2     0.2402    0.2474    0.3484
LM               0.2136    0.2607    0.3563
tf idf           0.1414    0.1933    0.2612
Average          0.2244    0.2517    0.3427
% over T         base      +12.2%    +52.7%

An analysis showed that pseudo-relevance feedback (PRF, or blind-query expansion) seemed to be a useful technique for enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach [14] with α = 0.75, β = 0.75, whereby the system was allowed to add m terms extracted from the k best-ranked documents retrieved for the original query. From our previous experiments we learned that this type of blind query expansion strategy does not always work well. More particularly, we believe that including terms occurring frequently in the corpus (because they also appear in the top-ranked documents) may introduce more noise, and thus be an ineffective means of discriminating between relevant and non-relevant items [15]. Consequently we also chose to apply our idf-based query expansion model [16]. To evaluate these propositions, we applied certain probabilistic models and enlarged the query by adding 20 to 100 terms extracted from the 3 to 10 best-ranked articles contained in the Hindi (see Table 13), Bengali (see Table 14) or Marathi (see Table 15) collection.
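The Rocchio-style expansion step described above can be sketched as follows. This is a generic textbook formulation given under stated assumptions, not the exact procedure used for the official runs: queries and documents are represented as hypothetical term-to-weight dictionaries, the feedback component is the centroid of the k top-ranked documents, no negative feedback is used, and the idf-based variant [16] is not modeled.

```python
from collections import Counter

def rocchio_expand(query, top_docs, m, alpha=0.75, beta=0.75):
    """Expand `query` with up to m new terms taken from the top-ranked documents.

    query    : dict term -> weight (original query vector)
    top_docs : list of dicts term -> weight (the k best-ranked documents)
    """
    k = len(top_docs)
    centroid = Counter()
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / k            # mean weight over the k documents

    # Candidate expansion terms: highest centroid weights not already in the query
    new_terms = [t for t, _ in centroid.most_common() if t not in query][:m]

    # Original terms are re-weighted, new terms receive the feedback part only
    expanded = {t: alpha * w + beta * centroid.get(t, 0.0) for t, w in query.items()}
    for term in new_terms:
        expanded[term] = beta * centroid[term]
    return expanded
```

In the notation of Tables 13 to 15, a setting such as "5 / 50" would correspond to calling this function with the 5 best-ranked documents and `m=50`.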
4.2 Data Fusion

It is usually assumed that combining different search models would improve retrieval effectiveness [17], for three main reasons. First, there is a skimming process in which only the k top-ranked retrieved items from each ranked list are considered. In this case, we would combine the best answers obtained from various document representations (which would retrieve various pertinent items). Second, we would count on the chorus effect, by which different retrieval schemes would retrieve the same item, and as such provide stronger evidence that the corresponding document was indeed relevant. Third, an opposite or dark horse effect might also play a role, whereby a given retrieval model may provide unusually high (low) and accurate estimates regarding a document's relevance. Thus, a combined system could possibly return more pertinent items by accounting for documents having a relatively high (low) score, or when a relatively short (long) result list occurred. Such a data fusion approach however requires more storage space and processing time. Following a comparison of advantages and disadvantages, it is unclear whether such approaches might be of any real commercial interest.

In this current study we combined three probabilistic models representing both the parametric (Okapi and DFR) and non-parametric (language model or LM) approaches. To produce this combination we evaluated various fusion operators (see Table 12 for a detailed list of their descriptions). The "Sum RSV" operator for example indicates that the combined document score (or the final retrieval status value) is simply the sum of the retrieval status values ($RSV_k$) for the corresponding document $D_k$ computed by each single indexing scheme [18]. Table 12 also illustrates how both the "Norm Max" and "Norm RSV" operators apply a normalization procedure when combining document scores. When combining the retrieval status values ($RSV_k$) for various indexing schemes, and in order to favor certain more effective retrieval schemes, we could multiply the document score by a constant $\alpha_i$ reflecting retrieval performance differences (fixed at 1 in all our experiments).

In addition to using these data fusion operators, we also considered the round-robin approach, wherein each document was taken in turn from each individual list, removing any duplicates and retaining only the highest-ranking occurrence. Finally we suggested merging the retrieved documents according to the Z-score, computed for each result list. More details can be found in [1]. In Table 12, $Min_i$ ($Max_i$) denotes the minimal (maximal) RSV value in the ith result list.

Table 12: Data fusion combination operators used in this study

Name        Merging strategy
Sum RSV     $\sum_{i=1}^{m} \alpha_i \cdot RSV_k$
Norm Max    $\sum_{i=1}^{m} \alpha_i \cdot \frac{RSV_k}{Max_i}$
Norm RSV    $\sum_{i=1}^{m} \alpha_i \cdot \frac{RSV_k - Min_i}{Max_i - Min_i}$
Z-score     $\alpha_i \cdot \left[\frac{RSV_k - Mean_i}{Std_i} + \delta_i\right]$
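To make the Z-score operator of Table 12 more concrete, the following Python sketch merges several result lists. It rests on several assumptions that are not stated in this section: $\delta_i$ is taken to be $(Mean_i - Min_i) / Std_i$ (the definition we recall from [1]), the population standard deviation is used, the normalized scores of a document are summed over the lists in which it appears, and each list is assumed to be non-empty. The function name and data layout are hypothetical.

```python
from statistics import mean, pstdev

def z_score_fusion(result_lists, alphas=None):
    """Merge result lists (each a dict doc_id -> RSV) with the Z-score operator.

    Returns (doc_id, fused score) pairs sorted by decreasing fused score.
    """
    if alphas is None:
        alphas = [1.0] * len(result_lists)          # alpha_i fixed at 1 by default
    fused = {}
    for alpha, scores in zip(alphas, result_lists):
        values = list(scores.values())
        mu, sigma, lowest = mean(values), pstdev(values), min(values)
        if sigma == 0:
            continue                                # identical scores carry no ranking information
        delta = (mu - lowest) / sigma               # assumed definition of delta_i
        for doc_id, rsv in scores.items():
            z = (rsv - mu) / sigma + delta
            fused[doc_id] = fused.get(doc_id, 0.0) + alpha * z
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

With this choice of $\delta_i$ every normalized score is non-negative, so a document retrieved by several systems can only gain from the additional evidence, which is in line with the chorus effect described above.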
5. OFFICIAL RESULTS

Table 13 shows the exact specifications of our 4 official monolingual runs for the Hindi monolingual evaluation task, based mainly on the probabilistic models (Okapi, DFR and statistical language model (LM)). Table 14 lists the same information for Bengali, showing the 6 official submissions. Table 15 reports our 4 official runs for the Marathi language. For the Hindi and Marathi languages we submitted three runs with the TD query formulation and one with the TDN. For the Bengali corpus, we submitted four runs with the TD formulation, and two additional runs with the longest TDN query formulation. All runs were fully automated using our stopword lists and light stemming strategies. In all cases the same Z-score data fusion approach was applied.

Table 13: Description and mean average precision (MAP) for our official Hindi monolingual runs

Run / Query:              1 TD, 2 TD, 3 TD, 4 TDN
Index:                    4-gram, light, light, 4-gram, light, light, light, 4-gram, light, 4-gram
Model:                    Okapi, I(n_e)B2, PB2, I(n_e)B2, LM, Okapi, I(n_e)B2, PB2, I(n_e)B2, Okapi
PRF (# docs / # terms):   5 / 100, 5 / 70, 5 / 20, 5 / 50, 3 / 20, 5 / 30, 5 / 50, 5 / 70, 5 / 50
MAP (single systems):     0.2935, 0.2971, 0.2317, 0.2668, 0.2931, 0.3357, 0.2953, 0.1556, 0.3097, 0.3029
MAP (Z-score, per run):   0.3507, 0.3529, 0.3711, 0.3630
Table 14: Description and mean average precision (MAP) for our official Bengali monolingual runs

Run / Query:              1 TD, 2 TD, 3 TD, 4 TD, 5 TDN, 6 TDN
Index:                    4-gram, aggres., light, 4-gram, aggres., light, 4-gram, light, 4-gram, aggres., light, aggres., 4-gram, light, 4-gram, aggres.
Model:                    PB2, LM, I(n_e)C2, I(n_e)C2, LM, I(n_e)C2, LM, I(n_e)C2, PB2, LM, Okapi, Okapi, LM, I(n_e)B2, I(n_e)C2, Okapi
PRF (# docs / # terms):   5 / 30, 5 / 50, 5 / 20, 3 / 50, 5 / 20, 5 / 40, 5 / 20, 5 / 30, 5 / 20, 5 / 20, 3 / 90, 5 / 70, 3 / 50, 5 / 40
MAP (single systems):     0.3539, 0.3695, 0.4103, 0.3214, 0.3701, 0.4103, 0.3410, 0.4103, 0.3382, 0.3547, 0.3976, 0.4179, 0.4071, 0.4347, 0.4189, 0.4029
MAP (Z-score, per run):   0.4132, 0.4132, 0.4099, 0.4134, 0.4719, 0.4549

Table 15: Description and mean average precision (MAP) of our official Marathi monolingual runs

Run / Query:              1 TD, 2 TD, 3 TD, 4 TDN
Index:                    light, 4-gram, light, 4-gram, light, 4-gram, 4-gram, light, light, 4-gram, light
Model:                    I(n_e)C2, GL2, Okapi, Okapi, LM, PL2, I(n_e)B2, Okapi, GL2, Okapi, I(n_e)C2
PRF (# docs / # terms):   5 / 20, 3 / 20, 10 / 50, 5 / 20, 3 / 50, 5 / 50, 10 / 50, 5 / 70, 3 / 50
MAP (single systems):     0.3161, 0.3481, 0.3817, 0.3554, 0.3692, 0.3584, 0.4067, 0.3301, 0.3763, 0.4230, 0.4116
MAP (Z-score, per run):   0.4128, 0.4225, 0.4238, 0.4575

6. CONCLUSION

In this first FIRE campaign we evaluated various probabilistic IR models using three different Indian languages, namely Hindi, Bengali and Marathi. The test collections comprise news articles and the available topic descriptions contain numerous proper names, a fact that may have had an impact on the retrieval effectiveness of the stemming strategy. Usually in such circumstances, the best stemming strategy for the English language is to ignore this term normalization. For the three Indian languages we could however mention that the suffix attached to a noun, and in some instances to the related adjectives, may denote the grammatical case, and thus a light stemming strategy would be useful for removing such case markers.

In participating in this evaluation campaign, we have designed and evaluated stopword lists and light stemming strategies for the Hindi, Bengali and Marathi languages. Based on these IR tools, the results of our various experiments demonstrate that the I(n_e)B2 or PB2 models derived from the Divergence from Randomness (DFR) paradigm seem to provide the best overall retrieval performance (see Tables 2 and 3 (Hindi), Tables 5 and 6 (Bengali), and Tables 9 and 10 (Marathi)). The Okapi model used in our experiments usually results in retrieval performance inferior to that obtained with the best DFR approaches. For the classical tf · idf or the LM models, however, the performance differences were statistically significant and in favor of the DFR models. For these three languages, giving more search terms in the topic description significantly improves the retrieval performance, whether comparing T to TD or TD to TDN query formulations (and of course T to TDN).

Additional analyses are needed to verify the retrieval effectiveness, and hopefully the improvement, related to pseudo-relevance feedback techniques. Applying the Z-score data fusion operator after a blind-query expansion tends to improve the MAP of the merged run over the best single IR system. From our past experience, we know that such an IR technique may improve the MAP in some cases and hurt retrieval quality when used with other corpora and languages. Within the FIRE test collections, the improvement obtained after applying the Z-score data fusion operator was clearly important.

For each language and based on our experiments, we have reached the following conclusions. For the Hindi language, our experiments demonstrate that our light stemming procedure produces better retrieval results than an indexing strategy ignoring this word normalization procedure (mean relative difference from 13.5% (TD) to 16.8% (T), see Tables 2 and 4). When comparing a word-based (with a light stemmer) and a 4-gram indexing strategy, our evaluations tend to show relatively better performance for the word-based scheme (relative improvement of 5.7% (T) to 9.4% (TD), see Tables 2 and 3).

For the Bengali language, as for Hindi, a word-based approach (with a light stemmer) tends to produce better retrieval effectiveness than a 4-gram indexing scheme (mean relative improvement of 10.9% (TD) to 14.2% (T), see Tables 5 and 6). The performance difference with a non-stemming approach clearly favors our light stemmer and the difference is usually statistically significant (mean relative enhancement of 11.2% (TD) to 12.5% (T), see Tables 5 and 7). For the Bengali language only, we also developed a more aggressive stemmer producing a retrieval performance comparable to that achieved by the light stemmer (see Tables 5 and 8).

For the Marathi language, which presents a more complex inflectional morphology, the performance difference between a word-based and a 4-gram indexing strategy tends to favor the n-gram scheme (mean relative difference of 11.6% (T) to 14% (TD), see Tables 9 and 10). The suggested light stemming approach nevertheless produces better results than an IR scheme without a stemming stage (relative improvement of 13.7% (T) to 19.3% (TD), see Tables 9 and 11).
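The relative differences quoted in this section can be recomputed from the "Average" rows of the tables. The short Python sketch below does so for the Hindi collection. The choice of the light-stemming MAP as the denominator is an inference on our part: it reproduces the figures quoted above, but the formula itself is not spelled out in the text, and the function names are hypothetical.

```python
def relative_difference(light_map, other_map):
    """Absolute MAP difference expressed relative to the light-stemming MAP."""
    return abs(light_map - other_map) / light_map

def percent_over_base(map_value, base_map):
    """Percentage variation of a MAP value over the short (T) query formulation."""
    return (map_value / base_map - 1.0) * 100.0

# "Average" rows of Table 2 (light stemming) and Table 3 (4-gram), Hindi collection
light_t, light_td = 0.2614, 0.3216
fourgram_t, fourgram_td = 0.2464, 0.2914

print(f"{relative_difference(light_t, fourgram_t):.1%}")    # prints 5.7%
print(f"{relative_difference(light_td, fourgram_td):.1%}")  # prints 9.4%
print(f"{percent_over_base(light_td, light_t):+.1f}%")      # prints +23.0%
```

The last line corresponds to the "% over T" row of Table 2 for the TD formulation.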
Acknowledgments. This research was supported in part by the Swiss National Science Foundation under Grant #200021-113273.
7.
REFERENCES
[1] Savoy, J.: Combining Multiple Strategies for Effective Monolingual and Cross-Lingual Retrieval. IR Journal, 7, 121–148 (2004).
[2] Savoy, J.: Comparative Study of Monolingual and Multilingual Search Models for Use with Asian Languages. ACM Transactions on Asian Language Information Processing, 4, 163–189 (2005).
[3] Voorhees, E.M., Harman, D.K. (Eds): TREC. Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge (MA) (2005).
[4] Robertson, S.E., Walker, S., Beaulieu, M.: Experimentation as a Way of Life: Okapi at TREC. Information Processing & Management, 36, 95–108 (2000).
[5] Amati, G., van Rijsbergen, C.J.: Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness. ACM Transactions on Information Systems, 20, 357–389 (2002).
[6] Hiemstra, D.: Using Language Models for Information Retrieval. PhD Thesis (2000).
[7] Hiemstra, D.: Term-specific Smoothing for the Language Modeling Approach to Information Retrieval. In Proceedings of the ACM-SIGIR, 35–41. The ACM Press (2002).
[8] Fox, C.: A Stop List for General Text. ACM-SIGIR Forum, 24, 19–35 (1990).
[9] Harman, D.K.: How Effective is Suffixing? Journal of the American Society for Information Science, 42, 7–15 (1991).
[10] Porter, M.F.: An Algorithm for Suffix Stripping. Program, 14, 130–137 (1980).
[11] Savoy, J.: Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages. In Proceedings ACM-SAC, 1031–1035. The ACM Press, New York (2006).
[12] Buckley, C., Voorhees, E.M.: Retrieval System Evaluation. In E.M. Voorhees, D.K. Harman (Eds): TREC. Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge (MA), 53–75 (2005).
[13] McNamee, P., Mayfield, J.: Character N-gram Tokenization for European Language Text Retrieval. IR Journal, 7, 73–97 (2004).
[14] Buckley, C., Singhal, A., Mitra, M., Salton, G.: New Retrieval Approaches Using SMART. In Proceedings TREC-4, 25–48. Gaithersburg (1996).
[15] Peat, H.J., Willett, P.: The Limitations of Term Co-Occurrence Data for Query Expansion in Document Retrieval Systems. Journal of the American Society for Information Science, 42, 378–383 (1991).
[16] Abdou, S., Savoy, J.: Searching in Medline: Stemming, Query Expansion, and Manual Indexing Evaluation. Information Processing & Management, 44, 781–789 (2008).
[17] Vogt, C.C., Cottrell, G.W.: Fusion via a Linear Combination of Scores. IR Journal, 1, 151–173 (1999).
[18] Fox, E.A., Shaw, J.A.: Combination of Multiple Searches. In Proceedings TREC-2, 243–249. NIST Publication #500-215 (1994).

8. APPENDIX

Table 16: Parameter settings for the various test-collections

Language         Hindi   Bengali   Marathi
Okapi  b         0.55    0.55      0.75
       k1        1.2     1.2       1.2
       advl      356     292       265
DFR    c         1.5     1.5       1.5
       mean dl   356     292       265
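
As an illustration of how the Okapi settings in Table 16 might enter a term weight, the following minimal sketch computes the length-normalized term-frequency component commonly associated with Okapi (BM25). It is not taken from our implementation: the names OKAPI_PARAMS and okapi_tf_weight are hypothetical, the idf and query-side factors are omitted, and advl is read here as the mean document length of the collection (it carries the same values as the DFR mean dl row).

```python
# Okapi parameter values copied from Table 16, keyed by language.
# The dictionary and function names are hypothetical, for illustration only.
OKAPI_PARAMS = {
    "Hindi":   {"b": 0.55, "k1": 1.2, "advl": 356},
    "Bengali": {"b": 0.55, "k1": 1.2, "advl": 292},
    "Marathi": {"b": 0.75, "k1": 1.2, "advl": 265},
}


def okapi_tf_weight(tf, doc_len, language):
    """Term-frequency component of a BM25-style weight (idf factor omitted).

    tf       -- occurrences of the term in the document
    doc_len  -- length of the document, in indexing terms
    language -- key into OKAPI_PARAMS
    """
    p = OKAPI_PARAMS[language]
    # K grows with the document length relative to the collection mean (advl),
    # so longer documents need more occurrences to reach the same weight.
    K = p["k1"] * ((1.0 - p["b"]) + p["b"] * doc_len / p["advl"])
    return (p["k1"] + 1.0) * tf / (K + tf)
```

Under these assumptions, a document of exactly mean length yields K = k1, and the larger b used for Marathi penalizes long documents more strongly than the b = 0.55 used for Hindi and Bengali.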
Table 17: Query titles for FIRE-2008 test-collections

Number  Title
C026    Singur land dispute
C027    Relations between India and China
C028    Iran’s Nuclear Programme
C029    Post Tsunami relief measures
C030    Laloo Parasad Yadav as the Railway Minister
C031    Kashmir under terrorist attacks
C032    Relations between Congress and its allies
C033    President Bush visits India
C034    Jessica Lall Murder
C035    ULFA attacks in India
C036    Protest against Narmada Dam construction
C037    Changing political scenario in Nepal
C038    Uneasy truce between Greg Chappell and Sourav Ganguly
C039    Attacks on American soldiers in Iraq
C040    Corruption in Pakistan’s judicial system
C041    New Labour Laws in France
C042    North Korea’s Nuclear Policy - worldviews
C043    Imposing dress-code in Educational Institutions
C044    Terrorist attacks in Britain
C045    Global Warming
C046    "Prince" rescued after 50 hours in black hole
C047    Nobel Prize missing
C048    Nithari murder case
C049    World wide natural calamities
C050    Kolkata Book Fair 2007
C051    Corruption in the educational system
C052    Budget 2006-2007
C053    India-U.S Nuclear Deal
C054    HIV and AIDS epidemic
C055    Sania Mirza’s tennis career
C056    The increase of mobile phone users
C057    Salman Khan’s killing of rare antelope
C058    Thailand Coup
C059    Protests by American citizens against Iraq War
C060    Terrorist activities of Al Qaida
C061    Harry Potter mania
C062    Corruption in the Central Government
C063    Netaji’s death unraveled
C064    Sabharwal Murder Case
C065    CBI in search of Dawood Ibrahim
C066    Khadim owner abduction case
C067    Revival of Bofors Scandal
C068    Amarnath Yatra
C069    Indian Railway Accidents
C070    Remake in Bollywood
C071    Haj Pilgrims from India
C072    Stamp paper scam
C073    Zinedine Zidane’s headbutting incident at the World Cup
C074    Indo-Pak LOC problem
C075    Britain’s new Prime Minister