The Effect of Named Entities on Effectiveness in Cross-Language Information Retrieval Evaluation

Thomas Mandl
Information Science, University of Hildesheim
Marienburger Platz 22, D-31141 Hildesheim, Germany
++49 5121 883 837
[email protected]

Christa Womser-Hacker
Information Science, University of Hildesheim
Marienburger Platz 22, D-31141 Hildesheim, Germany
++49 5121 883 833
[email protected]

ABSTRACT

The large number of experiments carried out within evaluation initiatives for information retrieval has become an invaluable source for further research and meta-analysis. In this study, an analysis of the results of the Cross Language Evaluation Forum (CLEF) campaigns for the years 2000 to 2003 is presented. The study considers the performance of the systems for each individual topic and is dedicated to the influence of named entities on retrieval performance. Named entities in topics lead to a significant improvement of retrieval quality in general and for most systems and tasks. The performance of systems varies between topics with no named entities, with one or two, and with three or more named entities. This knowledge, gained by data mining on the evaluation results, can be exploited for the improvement of retrieval systems as well as for the design of topics for future CLEF campaigns.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval; H.3.4 [Information Storage and Retrieval]: Systems and Software – performance evaluation

General Terms
Measurement

Keywords
Cross Language Evaluation Forum (CLEF), evaluation issues, named entities, test collections, query features

1. INTRODUCTION

Topics are considered an essential component of experiments for information retrieval evaluation [12]. In most evaluation initiatives, the variation in the results between the topics is larger than the variation between the systems. The topic creation process for a multilingual test environment requires special care in order to avoid cultural or linguistic bias influencing the semantics of the topic formulations. It must be assured that each topic provides equal conditions as a starting point for the systems, which may choose any topic language. The question remains whether linguistic aspects appearing more or less randomly within the topics have any influence on retrieval performance. This is especially important because, as we have observed, leaving out a single topic from the CLEF campaign changes the ranking of the retrieval systems in many cases.

Most analyses of the data generated in CLEF consider the average performance of the systems. Our study looks at the retrieval quality of systems for individual topics. By doing that, we identify one feature of the topics which has potential for future system improvement. Named entities are a feature which can be easily identified within queries.

We consider the systems at CLEF as black boxes and have so far not undertaken any effort to analyze how these systems treat named entities and why that treatment may result in the effects we have observed. This data is not provided by CLEF. The systems use very different approaches, tools and linguistic resources. Each system may treat the same named entity quite differently, and successful retrieval may be due to a large number of factors, such as appropriate treatment as n-grams, proper translation by a translation service or an entry in a linguistic resource. An analysis of the treatment of the named entities would lead merely to case studies. As a consequence, we consider a statistical analysis of the overall effect to be an appropriate research approach.

In this study we focused on the impact of named entities in topics and found a significant correlation with the average precision. The goal of this study is twofold: (a) to measure the effect of named entities on retrieval performance in CLEF and (b) to optimize retrieval systems based on these results. This work is based on a previous, preliminary analysis [9].

The remainder of this paper is organized as follows. The next section provides a brief overview of the research on evaluation results and their validity. Section three describes the CLEF data used in our study. In section four, the influence of named entities on the overall retrieval results is analyzed. Section five explores the relation between named entities and the performance of individual systems. In section six, we show how the performance variation of systems due to named entities could be exploited for system optimization. Finally, we discuss directions for future research.

2. ANALYSIS OF INFORMATION RETRIEVAL EVALUATION RESULTS

The validity of large-scale information retrieval experiments has been the subject of a considerable amount of research. Zobel concluded that the TREC (Text REtrieval Conference) experiments are reliable as far as the ranking of the systems is concerned [15]. Buckley & Voorhees have analyzed the reliability of experiments for different sizes of the topic set [3]. They concluded that the typical size of some 50 topics in TREC is sufficient for a satisfactory level of reliability.

An important aspect in evaluation studies is pooling. Not all submitted runs can be judged manually by jurors, and relevant documents may remain undiscovered. Therefore, a pool of documents is built to which the systems contribute differently. In order to measure the potential effect of pooling, a study was conducted which recalculated the final rankings of the systems while leaving out one run at a time [2]. It shows that the effect is negligible and that the rankings remain stable. However, our analysis shows that leaving out one topic during the result calculation changes the system ranking in most cases.

It has also been noted that the differences between topics are larger than the differences between systems. This effect has been observed in TREC [7] and also in CLEF [8]. For example, the run EIT01M3N in the CLEF 2001 campaign has a fairly good average precision of 0.341. However, for topic 44, which had an average difficulty, this run performs far below (0.07) the average for that topic (0.27). An intellectual analysis of the topics revealed that two of the most difficult topics contained no proper names and that both topics were about the domain of sports (topics 51 and 54).

The assumption that named entities have an effect on retrieval effectiveness is backed by experimental results. The influence of named entities on retrieval performance is considerable. In one experiment, the removal of named entities from the query formulations led to a large decrease in quality, whereas using only the named entities as the query led to a much smaller decrease [4]: the full query achieved a mean average precision of 0.3, named entities only still achieved 0.15, and eliminating named entities caused a drop to 0.06. A topic analysis by the retrieval group at the University of Amsterdam revealed that, for their participation, the five best scoring topics contained proper names, whereas the five worst or most difficult topics contained no proper names but general terms with a high frequency in the collection [8]. However, Hollink et al. point out the effects of the multilingual setting: the positive effect of proper names can be lost through spelling variants and transliteration problems in some languages. For the systems tested [8], the overall effect of proper names in multilingual settings was somewhat smaller than the effect which can be observed in monolingual retrieval [e.g. 7]. Especially the transliteration of named entities between languages with different alphabets has led to intensive research [13]. The idea of performing deeper linguistic analysis beyond stemming on the query formulation has also been pursued, e.g. for ambiguity resolution [1].

3. NAMED ENTITIES IN THE MULTILINGUAL TOPIC SET

The data for this study was extracted from the Cross Language Evaluation Forum (CLEF) [10, 11]. CLEF is a large evaluation initiative which is dedicated to cross-language retrieval for European languages. The setup is similar to the Text Retrieval Conference (TREC) [7, 14]. The main tasks for multilingual retrieval are the following:



- The core and most important track is the multilingual task. The participants choose one topic language and need to retrieve documents in all main languages. The final result set needs to integrate documents from all languages, ordered according to relevance regardless of their language.



- The bilingual task requires the retrieval of documents in a language different from the chosen topic language.

Figure 1 shows a typical CLEF topic:

    C088 Mad Cow in Europe
    Find documents that cite cases of Bovine Spongiform Encephalopathy (the mad cow disease) in Europe.
    Relevant documents will report statistics and/or figures on cases of animals infected with Bovine Spongiform Encephalopathy (BSE), commonly known as the mad cow disease, in Europe. Documents that only discuss the possible transmission of the disease to humans are not considered relevant.

Figure 1. Example of a topic

The topic language is the language which the system designers choose to construct their queries. The retrieval performance of the runs for the topics was extracted from the appendix of the annual CLEF proceedings [e.g. 10, 11]. From this source, the average precision of each run for each topic can be extracted.

An intellectual analysis of the results and the properties of the topics had identified named entities as a potential indicator for good retrieval performance. They are also often mentioned as an important factor in information retrieval [6]. Because of that, named entities in the CLEF topic set were analyzed in more detail. The analysis included all topics from the campaigns in the years 2001 through 2003. The number of named entities in the topics was assessed intellectually. We focused on the number of types of named entities in the topics. Table 1 shows the overall number of named entities found in the topic sets. The large number of named entities in the topic set already shows their importance. We focused on English and German as topic languages and looked at the bilingual and multilingual tasks.

For the analysis in section five, we divided the topics into three classes: (a) without named entities, (b) with one or two named entities and (c) with three or more named entities. The distribution of topics over these three classes is shown in table 2. It can be seen that the three classes are best balanced in CLEF 2002, whereas topics in the second class dominate in CLEF 2003.


Table 1. Named entities (NE) in the CLEF topics

CLEF year                              2000   2001   2002   2003
Number of topics                         40     50     50     60
Total number of NE                       52     60     86     97
Average number of NE in topics         1.30   1.20   1.72   1.62
Standard deviation of NE in topics     1.14   1.06   1.54   1.18

Table 2. Overview of named entities in CLEF tasks

CLEF year                                    2000   2001   2002   2003
Topics with no named entities                   8     17     14      9
Topics with one or two named entities          26     26     21     40
Topics with three or more named entities        4      7     15     10

4. NAMED ENTITIES AND GENERAL RETRIEVAL PERFORMANCE

Our first goal was to measure whether named entities had any influence on the overall quality of the retrieval results. In order to measure this effect, we first calculated the correlation between the overall retrieval quality achieved for a topic and the number of named entities encountered in this topic. In the second subsection, this analysis is refined to individual tasks and specific topic languages.

4.1 Correlation between average precision and number of proper names

First, we show the overall performance in relation to the number of named entities in a topic. For each number of named entities, we determine the overall performance by two methods: (a) take the best run for each topic, (b) take the average of all runs for a topic. For both methods, we obtain a set of values for n named entities. Within each set we can determine the maximum, the average and the minimum. For example, for method (a) we determine the following values: best run for n named entities, average of all best runs for n named entities and worst run among all best runs for n named entities. The last value gives the performance for the most difficult topic within the set of topics containing n named entities. The maximum of the best runs is in most cases 1.0 and is therefore omitted. Tables 3 and 4 and figure 2 show these values. The CLEF campaign contains relatively few topics with four or more named entities.

It can be observed that topics with more named entities are generally solved better by the systems. This impression is confirmed by a statistical analysis. The average performance correlates with the number of named entities with a value of 0.43 and the best performance with a value of 0.26. Disregarding the topics from the campaign in 2003 leads to a correlation coefficient of 0.35. These relations are statistically significant at a level of 95%.
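To make the two aggregation methods concrete, the following Python sketch shows how such correlations can be computed. It is not the code used in this study; the per-run average precision values and the named-entity counts are assumed to be available as plain dictionaries with toy values, and SciPy's pearsonr is used for the coefficient.

```python
# Illustrative sketch of methods (a) and (b): correlate per-topic retrieval
# quality with the number of named entities in the topic.
# The input dictionaries are assumed (toy values), not taken from CLEF data.
from statistics import mean
from scipy.stats import pearsonr

# ap[run_id][topic_id] = average precision of that run on that topic
ap = {
    "runA": {41: 0.45, 42: 0.10, 43: 0.62, 44: 0.07},
    "runB": {41: 0.30, 42: 0.22, 43: 0.55, 44: 0.31},
}
# ne_count[topic_id] = number of named-entity types in the topic
ne_count = {41: 0, 42: 2, 43: 4, 44: 1}

topics = sorted(ne_count)
nes = [ne_count[t] for t in topics]
best_per_topic = [max(ap[r][t] for r in ap) for t in topics]   # method (a)
mean_per_topic = [mean(ap[r][t] for r in ap) for t in topics]  # method (b)

r_best, _ = pearsonr(nes, best_per_topic)
r_mean, _ = pearsonr(nes, mean_per_topic)
print(f"best run per topic vs. #NE:    r = {r_best:.2f}")
print(f"average over all runs vs. #NE: r = {r_mean:.2f}")
```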

Table 3. Method a: Best run for each topic in relation to the number of named entities in the topic (topics 41 to 200)

Number of named entities                        0      1      2      3      4      5
Number of topics                               42     43     40     20      9      4
Average of best system per topic             0.62   0.67   0.76   0.83   0.79   0.73
Minimum of best system per topic             0.09   0.12   0.04   0.28   0.48   0.40
Standard deviation of best system per topic  0.24   0.24   0.24   0.18   0.19   0.29

Table 4. Method b: Average precision of runs in relation to the number of named entities in the topic (topics 41 to 200)

Number of named entities                        0      1      2      3      4      5
Number of topics                               42     43     40     20      9      4
Minimum of average performance per topic     0.02   0.04   0.01   0.10   0.17   0.20
Average of average performance per topic     0.20   0.25   0.36   0.40   0.31   0.40
Maximum of average performance per topic     0.54   0.61   0.78   0.76   0.58   0.60
Standard deviation of average performance    0.14   0.15   0.18   0.17   0.14   0.19

[Figure 2: average precision (minimum, average and maximum curves) plotted against the number of named entities, 0 to 6.]
Figure 2. Method b: Relation between system performance and the number of named entities

4.2 Correlation for Individual Tasks and Topic Languages

The correlation analysis was also carried out for the individual retrieval tasks or tracks. This can be done by (a) calculating the average precision for each topic achieved within a task, by (b) taking the maximum performance for each topic (the maximum average precision that one run achieved for that topic) and by (c) calculating the correlation between named entities and average precision for each run individually and taking the average over all runs within a task. All three measures are presented in table 5. Except for one task (multilingual with topic language English in 2001), all correlations are positive. Thus, the effect observed overall also occurs within most tasks and even within most single runs.

Table 5. Correlation of system performance and number of named entities for different tasks based on three methods: (a) correlation of the average precision per topic with the number of named entities, (b) correlation of the maximum precision per topic with the number of named entities, (c) average correlation between individual run results and the number of named entities. Statistical significance of the correlations was assessed with a t-distribution.

CLEF year   Run type   Topic language   Number of runs   (a) Avg prec   (b) Max prec   (c) Avg per run
2000        multi      English                21              0.26           0.17           0.20
2001        bi         German                  9              0.44           0.32           0.26
2001        multi      German                  5              0.19           0.24           0.18
2001        bi         English                 3              0.20           0.13           0.16
2001        multi      English                17             -0.34          -0.36          -0.27
2002        bi         German                  4              0.33           0.25           0.27
2002        multi      German                  4              0.43           0.41           0.42
2002        bi         English                51              0.40           0.36           0.25
2002        multi      English                32              0.29           0.37           0.19
2003        bi         German                 24              0.21           0.10           0.12
2003        bi         English                 8              0.41           0.47           0.31
2003        multi      English                74              0.31           0.27           0.25

There is no notable difference in the average strength of the correlation for German (0.27) and English (0.28) as topic language. The average for each language in the last column shows a larger difference: the correlation is stronger for German (0.19) than for English (0.15) as topic language. Furthermore, there is a considerable difference between the average correlation for the bilingual (0.35) and multilingual run types (0.22). This could be a hint that the observed positive effect of named entities on retrieval quality is smaller for multilingual retrieval.
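Method (c) from table 5 can be sketched in the same style: compute the correlation for each run separately and average the coefficients over all runs of a task. Again, the data structures are illustrative assumptions, not the original evaluation scripts.

```python
# Sketch of method (c): average of the per-run correlations between a run's
# per-topic average precision and the number of named entities in the topic.
from statistics import mean
from scipy.stats import pearsonr

def avg_run_correlation(ap_by_run, ne_count):
    """ap_by_run: {run_id: {topic_id: average precision}} for one task (assumed format)."""
    coefficients = []
    for per_topic in ap_by_run.values():
        topics = sorted(per_topic)
        r, _ = pearsonr([ne_count[t] for t in topics],
                        [per_topic[t] for t in topics])
        coefficients.append(r)
    return mean(coefficients)
```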

5. PERFORMANCE VARIATION OF SYSTEMS FOR NAMED ENTITIES

In this section, we show that the systems tested at CLEF perform differently for topics with different numbers of named entities. Although proper names make topics easier in general and for almost all runs, the performance of systems varies within the three classes of topics based on the number of named entities. As already mentioned, we distinguished three classes of topics: (a) a first class with no proper names, called none, (b) a second class with one or two named entities, called few, and (c) a class with three or more named entities, called lots. This categorization is similar to the one proposed in other experiments where topics were grouped according to their difficulty [5]. However, our approach is suited for an implementation and allows the categorization before the experiments and the relevance assessment. It requires no intellectual intervention but solely a named entity recognition system.
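Such a class assignment can be automated with any off-the-shelf recognizer. The sketch below uses spaCy and counts distinct entity strings; both the choice of tool and the exact counting rule are our assumptions, since the counts in this study were assessed intellectually.

```python
# Sketch: assign a topic to the class none / few / lots by counting the
# distinct named-entity strings found by an off-the-shelf recognizer.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pretrained English model is installed

def ne_class(topic_text: str) -> str:
    doc = nlp(topic_text)
    n = len({ent.text.lower() for ent in doc.ents})  # entity types, not tokens
    if n == 0:
        return "none"
    if n <= 2:
        return "few"
    return "lots"

print(ne_class("Find documents that cite cases of Bovine Spongiform "
               "Encephalopathy (the mad cow disease) in Europe."))
```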

5.1 Variation of System Performance

As we can see in table 2, the three categories are well balanced for the CLEF campaign in 2002. For 2003, there are only few topics in the first and third categories. Therefore, the average ranking is extremely similar to the ranking for the second class, few. Figure 3 shows that the correlation between average precision and the number of named entities differs considerably between the runs of one task. The runs in figure 3 are ordered according to the original ranking in the task. We observe a slightly decreasing sensitivity to named entities with higher system performance. However, the correlation is still substantial and sometimes still high for top runs.

A look at the individual runs shows large differences between the three categories. As an example, we show the values for one task in figure 4. The curve for many named entities lies mostly above the average curve, whereas the average precision for the class none, with no named entities, in most cases remains below the overall average. Sometimes even the best runs perform quite differently for the three categories. Other runs perform similarly for all three categories.

5.2 Correlation of System Rankings

The performance variation within the classes leads to different system rankings for the classes. An evaluation campaign including, for example, only topics without named entities might lead to different rankings. To analyze this effect, we determined the rankings of all runs within each named entity class, none, few and lots. The resulting system rankings can be quite different for the three classes. The difference was measured with the Pearson correlation coefficient between the system rankings. For most tracks, the original average system ranking is most similar to the ranking based only on the topics with one or two named entities. For the classes none and lots, the rankings are more dissimilar. The ranking of the top ten systems in the classes usually differs more from the original ranking. This is due to the small performance differences between top runs. These findings are not statistically significant in all cases because each category contains only few topics. As stated by Buckley and Voorhees, some 50 topics are necessary to create a statistically reliable ranking among systems in an evaluation study [3]. However, this result hints that special treatment of named entities in information retrieval systems may improve the quality of the results.
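The comparison of rankings can be sketched as follows: rank the runs by mean average precision within each class and correlate the rank vectors. The data layout is an assumption for illustration; the study reports Pearson coefficients on the system rankings.

```python
# Sketch: compare the system ranking obtained on one class of topics with the
# ranking obtained on another class, using the Pearson coefficient on ranks.
from statistics import mean
from scipy.stats import pearsonr

def ranking(ap_by_run, topics):
    """Return {run_id: rank} by mean average precision over the given topics (1 = best)."""
    score = {run: mean(per_topic[t] for t in topics)
             for run, per_topic in ap_by_run.items()}
    ordered = sorted(score, key=score.get, reverse=True)
    return {run: pos + 1 for pos, run in enumerate(ordered)}

def ranking_correlation(ap_by_run, topics_a, topics_b):
    ranks_a, ranks_b = ranking(ap_by_run, topics_a), ranking(ap_by_run, topics_b)
    runs = sorted(ranks_a)
    r, _ = pearsonr([ranks_a[x] for x in runs], [ranks_b[x] for x in runs])
    return r
```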

6. OPTIMIZATION BY FUSION BASED ON NAMED ENTITIES

The patterns of the systems are strikingly different for the three classes. As a consequence, there seems to be potential for the combination or fusion of systems. We propose the following simple fusion rule. For each topic, the number of named entities is determined. Subsequently, this topic is channeled to the system with the best performance for this named entity class.

Because named entity recognition systems are available, this method could easily be implemented within an information retrieval system.
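A minimal sketch of this fusion rule, under the assumptions that training data with known per-topic average precision is available and that every topic already carries a precomputed class label, could look as follows. It illustrates the idea and is not the implementation evaluated in table 6.

```python
# Sketch of the proposed fusion: pick, for every named-entity class, the run
# with the best mean average precision on training topics, then answer each
# new topic with the run selected for its class.
from statistics import mean

def best_run_per_class(ap_by_run, topic_class):
    """topic_class: {topic_id: 'none' | 'few' | 'lots'}, e.g. produced by a NE recognizer."""
    chosen = {}
    for cls in ("none", "few", "lots"):
        topics = [t for t, c in topic_class.items() if c == cls]
        chosen[cls] = max(ap_by_run,
                          key=lambda run: mean(ap_by_run[run][t] for t in topics))
    return chosen

def fused_mean_ap(ap_by_run, topic_class, chosen):
    """Mean average precision of the fused system over all classified topics."""
    return mean(ap_by_run[chosen[topic_class[t]]][t] for t in topic_class)
```

The practical upper bound discussed below, which picks the best run for every single topic, can be computed in the same way by replacing the class lookup with a per-topic maximum.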


The fused system is a combination of at most three runs: each category of topics is answered by the optimal system for that number of named entities. By simply choosing the best performing system for each topic, we can also determine a practical upper bound for the performance of the retrieval systems. This upper bound gives a hint of how much of the potential for improvement is exploited by an approach. Table 6 shows the optimal performance and the improvement achieved by the fusion based on the optimal selection of a system for each category of topics.


Figure 3. Correlation between named entities and performance of runs in CLEF 2002 (task bilingual, topic language English)


Figure 4. Performance variation for runs in CLEF 2003 (task bilingual, topic language German)

Table 6. Improvement by fusion based on named entities for several tasks

CLEF year   Run type   Topic language   Average precision best run   Optimal average precision name fusion   Improvement over best run
2001        bi         German                     0.509                            0.518                              2%
2001        multi      English                    0.405                            0.406                              0%
2002        bi         English                    0.4935                           0.543                             10%
2002        multi      English                    0.378                            0.403                            6.5%
2003        bi         German                     0.460                            0.460                              0%
2003        bi         English                    0.348                            0.369                            6.1%
2003        multi      English                    0.438                            0.443                            1.2%

7. RESUME

In our study, we analyzed the named entities in all CLEF topics and their effect on the retrieval performance of all individual systems on all topics. The occurrence of named entities in topics makes them "easier" for retrieval systems. These results show the potential of further research on the retrieval results of evaluation campaigns. During the design of the retrieval environment, it is necessary to carefully monitor the number of named entities in topics. In addition, the analysis of named entities has been identified as a promising approach to improve retrieval results.

8. ACKNOWLEDGMENTS

The first author was supported by a grant from the German Research Foundation (DFG, grant nr. MA 2411/3-1). We also acknowledge the work of several students from the University of Hildesheim who contributed to this analysis.

9. REFERENCES

[1] Allan, J. and Raghavan, H. Using Part-of-speech Patterns to Reduce Query Ambiguity. Proceedings of the Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '02) (Tampere, Finland, Aug. 11-15, 2002). ACM Press, 2002, 307-314.
[2] Braschler, M. CLEF 2002 - Overview of Results. Evaluation of Cross-Language Information Retrieval Systems. Third Workshop of the Cross Language Evaluation Forum 2003 (Trondheim, Norway, Aug. 21-22). Springer [LNCS], 2004, to appear. Preprint: http://www.clef-campaign.org.
[3] Buckley, C. and Voorhees, E. The Effect of Topic Set Size on Retrieval Experiment Error. Proceedings of the Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '02) (Tampere, Finland, Aug. 11-15, 2002). ACM Press, 2002, 316-323.
[4] Demner-Fushman, D. and Oard, D. W. The Effect of Bilingual Term List Size on Dictionary-Based Cross-Language Information Retrieval. Thirty-Sixth Hawaii International Conference on System Sciences (HICSS) (Hawaii, Jan. 6-9, 2003).
[5] Eguchi, K., Kando, N. and Kuriyama, K. Sensitivity of IR Systems Evaluation to Topic Difficulty. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002) (Las Palmas de Gran Canaria, Spain, May 29-31, 2002). ELRA, Paris, 585-589.
[6] Gey, F. C. Research to Improve Cross-Language Retrieval. Position Paper for CLEF. Cross-Language Information Retrieval and Evaluation. Workshop of the Cross-Language Evaluation Forum (CLEF 2000) (Lisbon, Portugal, Sept. 21-22, 2000). Springer [LNCS], 83-88.
[7] Harman, D. and Voorhees, E. Overview of the Sixth Text REtrieval Conference. The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication, National Institute of Standards and Technology, Gaithersburg, Maryland, 1997. http://trec.nist.gov/pubs/
[8] Hollink, V., Kamps, J., Monz, C. and de Rijke, M. Monolingual Document Retrieval for European Languages. Information Retrieval 7, 1-2 (Jan.-April 2004), 33-52.
[9] Mandl, T. and Womser-Hacker, C. Proper Names in the Multilingual CLEF Topic Set. Evaluation of Cross-Language Information Retrieval Systems. Proceedings of the CLEF 2003 Workshop (Trondheim, Norway, Aug. 21-22). Springer [LNCS], 2004, to appear.
[10] Peters, C., Braschler, M., Gonzalo, J. and Kluck, M. (eds.) Evaluation of Cross-Language Information Retrieval Systems. Third Workshop of the Cross Language Evaluation Forum 2002 (Rome). Springer [LNCS], 2003.
[11] Peters, C., Braschler, M., Gonzalo, J. and Kluck, M. (eds.) Evaluation of Cross-Language Information Retrieval Systems. Workshop of the Cross Language Evaluation Forum 2003 (Trondheim, Norway, Aug. 21-22). Springer [LNCS], 2004, to appear.
[12] Sparck Jones, K. Reflections on TREC. Information Processing & Management 31, 3 (May/June 1995), 291-314.
[13] Virga, P. and Khudanpur, S. Transliteration of Proper Names in Cross-Language Applications. Proceedings of the Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '03). ACM Press, 2003, 365-366.
[14] Voorhees, E. and Buckland, L. (eds.) The Eleventh Text Retrieval Conference (TREC 2002). NIST Special Publication SP 500-251. http://trec.nist.gov/pubs/trec11/t11_proceedings.html.
[15] Zobel, J. How Reliable are the Results of Large-Scale Information Retrieval Experiments? Proceedings of the 21st Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '98) (Melbourne, Australia, Aug. 24-28, 1998). ACM Press, New York, 1998, 307-314.
