Cross-Lingual Information Retrieval Problems - Semantic Scholar

Cross-Lingual Information Retrieval Problems: Methods and findings for three language pairs

Ari Pirkola & Turid Hedlund & Heikki Keskustalo & Kalervo Järvelin
University of Tampere, Department of Information Studies, Finland

Abstract

In this paper we discuss dictionary-based cross-language information retrieval (CLIR) methods, and report recent findings and problems. We consider three language pairs for CLIR: Finnish to English, English to Finnish, and Swedish to English. We show that Finnish and Swedish have special features, e.g., the frequency of homographs and a high frequency of compound words, that affect retrieval effectiveness. Correct word form normalization and compound splitting are especially essential. We report findings concerning the effectiveness of various query translation methods, query structures and linguistic tools used for CLIR. We also point out some problems and deficiencies in such tools.

1 Introduction

There is an increasing amount of full text material in various languages available through the Internet and other information suppliers. Therefore cross-language information retrieval (CLIR) has become an important new research area (Oard & Dorr, 1996; Pirkola, 1999). It is a process of selecting and ranking documents in a language different from the query language. One of the main approaches to CLIR is based on bilingual translation dictionaries. For an overview of the approaches, see (Hull & Grefenstette, 1996; Oard & Dorr, 1996; Pirkola, 1999).

The main problems associated with dictionary-based CLIR are (1) phrase identification and translation, (2) source language ambiguity, (3) translation ambiguity, (4) the coverage of dictionaries, (5) the processing of inflected words, and (6) untranslatable keys, in particular proper names spelled differently in different languages. Translation ambiguity refers to the proportional increase of bad keys due to translation.

Research has developed many effective methods to handle the problems. These involve the use of special dictionaries for the dictionary coverage problem (Pirkola, 1998, 1999), POS tagging for phrase translation (Ballesteros and Croft, 1997) and for removing bad translation equivalents (Ballesteros and Croft, 1998; Davis, 1997), stemming and morphological analysis to handle inflected words (Hull, 1996; Krovetz, 1993; Porter, 1980), corpus-based query expansion (Ballesteros and Croft, 1998; Nie et al., 1999; Sheridan et al., 1997), and query structuring for the ambiguity problem (Pirkola, 1998, 1999; Sperer and Oard, 2000).

Because English has been the main language for IR system development, much research on IR involves English. However, IR systems for small languages like Finnish and Swedish, and for other languages differing from English in morphology (inflection, derivation, gender and compound words) or in semantic features (e.g., the frequency of homonymy, polysemy and hyponymy), cannot be developed properly without studying their special features. Although Spanish and Chinese have had special tracks in the TREC Conferences (see footnote 1), the results cannot be applied to linguistically quite different languages.

1 The fourth, fifth and sixth Text REtrieval Conferences, 1998-1998. URL: http://trec.nist.gov/


In this paper we will discuss appropriate CLIR methods, report recent findings, and report some problems to be solved in CLIR. We concentrate on three different CLIR tasks, namely Finnish to English, English to Finnish, and Swedish to English query translation. We will report on the use of natural language processing (NLP) and query structuring (the Pirkola Method; Pirkola, 1998) for CLIR. The structuring of queries refers to the grouping of search keys, and the use of proper query operators. Publicly available NLP tools have some pitfalls for CLIR, which are discussed. Some 8 - 9 million people, mainly in Sweden, speak Swedish as a native language. There is also a Swedish-speaking minority population in Finland. However, due to close relationships between the Nordic countries and the other Scandinavian languages the number of people who speak Swedish and can understand it is much larger (Teleman, Hellberg & Andersson 1999). Approximately 20 million people have a basic knowledge of Swedish. Thus, a careful generalization of the results in this study can be made to some other languages. Some 5 million people speak Finnish. Both Finnish and Swedish have characteristics quite different from English, e.g., the frequency of compound words, and share these features with some other languages, e.g., German. Finnish is exceptional due to its rich inflectional morphology. The differences between the languages, and the appropriate techniques, may be useful in broadening the scope of CLIR to novel languages, not resembling English in their features. The rest of this paper is organized as follows. Section 2 considers natural language processing for CLIR. Sections 3 to 5 report findings in CLIR with language pairs Finnish to English (Pirkola, 1998; Pirkola, Keskustalo & Järvelin, 1999), English to Finnish (Puolamäki, Pirkola & Järvelin, 2000), Swedish to English (Hedlund, Pirkola & Järvelin, 2000). 
We report findings concerning the effectiveness of various query translation methods (e.g., the use of dictionaries), query structures (in particular structured queries based on the Pirkola Method) and discuss how linguistic tools (e.g., dictionaries, word form normalizers) should be used for CLIR. Section 6 presents concluding remarks.

2 Natural language processing for CLIR

Natural language processing involves linguistic methods, and the analysis can take place on different levels of a language, i.e., the morphological, syntactic and semantic levels. On the morphological level the structure of words is analyzed. Recognizing different word forms as variants of the same basic word affects both indexing (word weights) and retrieval (matching). Syntactic analysis determines the structure of phrases and sentences, while semantic analysis investigates the meaning or sense of words and sentences.

Commonly used methods in document indexing are word form normalization (Koskenniemi 1983; Pirkola 1999) and stemming (Harman 1991). A stemmer removes affixes from the word forms and the output is a common root, not necessarily a real word. Normalization is a similar process, but its output is the base form, a real word. Stemming and normalization may yield three kinds of benefits (Harman 1991; Alkula & Honkela 1992). 1) A user does not need to worry about truncation and inflection, because different forms of the key are automatically conflated into the same form. 2) Stemming and normalization result in storage savings. 3) Stemming and normalization may improve retrieval performance, especially recall, since a larger number of potentially relevant documents are retrieved. However, Harman (1991) found no significant improvement in performance in her experiment with simple stemmers for the English language. For inflectionally more complex languages the results are not necessarily the same.

Finnish and Swedish are rich in compound words and are thus confronted with the problem of embedded search keys. Splitting the compounds into their components allows the use of the component words as separate search keys. For instance, the decomposition of the Swedish compound hustak (roof of the house) gives the expansion keys hus (house) and tak (roof).
If the compound hustak were merely truncated and used as a key, documents including the word tak would not be found.

Finnish has a particularly rich inflectional morphology. Each noun may have, theoretically, 2200 forms (Karlsson, 1987), adjectives and verbs even more. Therefore word form normalization appears very important for Finnish: it is a prerequisite for query translation, as translation dictionaries contain their entry words in basic word forms.

In IR part-of-speech (POS) tagging may be used to identify central words (word classes, especially nouns) and phrases of a sentence. In CLIR part-of-speech tagging is useful in matching the source language keys with correct translation dictionary entry words. A syntactic parser (program) determines the structure of a sentence according to a particular grammar (Grishman 1986). The parsing procedure may involve the assignment of a tree structure to the input sentence. Linguistic transformation, such as transforming active sentences into passive, can be of potential value for IR. Syntactic analysis can be used as a basis for further analysis, e.g., anaphor resolution. In CLIR, syntactic parsing may also be useful for matching source language keys with correct translation dictionary entry words.

Word sense disambiguation is an NLP method which aims at finding correct senses for word occurrences. It has been studied intensively in IR and other fields. The methods used in the studies include dictionaries (Dagan et al. 1991; Guthrie et al. 1991; Krovetz and Croft 1989), knowledge bases (Hirst 1987), statistical methods (Brown et al. 1991; Schütze & Pedersen 1995), multiple knowledge sources (McRoy 1992), thesauri (Voorhees 1993) and pseudo-words (Sanderson 1994; Sanderson 1997). Most IR studies have reported no or only slight improvements in retrieval performance due to word sense disambiguation (Krovetz & Croft 1992; Sanderson 1994; Sanderson 1997; Voorhees 1993). In CLIR, word sense disambiguation helps in matching the source language keys with the correct sense among translation dictionary entry words.
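The compound splitting discussed in this section can be illustrated with a minimal sketch. It is not FINTWOL, SWETWOL or any tool used in the studies: the function names and the four-word lexicon are hypothetical, and the sketch assumes a compound is a plain concatenation of base forms found in a lexicon.

```python
def split_compound(word, lexicon, min_len=2):
    """Split `word` into consecutive lexicon entries; return [] if no split exists."""
    if word in lexicon:
        return [word]
    for cut in range(min_len, len(word) - min_len + 1):
        head, tail = word[:cut], word[cut:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return []


def expansion_keys(word, lexicon):
    """Keep the compound itself and add its components as extra search keys."""
    parts = split_compound(word, lexicon)
    return [word] + [part for part in parts if part != word]


# Toy lexicon of base forms (hypothetical, for illustration only).
LEXICON = {"hus", "tak", "skog", "industri"}

print(expansion_keys("hustak", LEXICON))  # ['hustak', 'hus', 'tak']
```

With this toy lexicon, skogsindustri is left unsplit, because the joining s means that neither "skogs" nor "sindustri" is a lexicon entry. Plain concatenation matching is therefore only a starting point for languages whose compound components are inflected or joined by extra morphemes.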

3 Findings on Finnish to English CLIR

Methods and data

The test collection was a subset of the TREC collection, consisting of AP Newswire, DOE Abstract, and Federal Register documents. The test collection contained 514,825 English documents. As test requests we used 34 health related TREC topics.

We used the FINTWOL morphological analyzer for word form normalization and compound splitting. Inflected word forms of Finnish natural language/sentence queries were turned to base forms, because the dictionary entry words are in base forms. Finnish compounds were split, because sometimes they are found in dictionaries only as their components. Both compounds and their components were translated.

As test dictionaries we used two Finnish - English - (Finnish) translation dictionaries, a general dictionary and a medical dictionary. The general dictionary contained 65,000 Finnish and 100,000 English entry words. The medical dictionary contained 67,000 Finnish and English entry words. The commercial versions of the dictionaries were converted automatically to CLIR versions by removing from them all material except the actual dictionary words.

The retrieval system was the InQuery information retrieval system, which is a probabilistic system based on the Bayesian inference net model (Broglio et al., 1994). Queries can be formulated as bag-of-words queries or can be structured by a variety of operators provided by the system. The Kstem morphological stemmer, which produces real English words as its stemming output, is part of InQuery. It was used for stemming the words of the documents. Thus, the database index included stemmed words.

Constructing and translating queries

Figure 1 gives an overall picture of the basic test processes. To get test queries that are comparable to the original English queries, the English queries were translated into Finnish by a human translator (the author), and the Finnish queries were retranslated back to English by means of dictionaries. This approach is often used in dictionary-based CLIR studies.

[Figure missing]
Figure 1. The basic test processes

In TREC topics, important words are found in the title, description, and narrative fields (some topics do not have narrative fields). Test requests were constructed on the basis of these fields. The test requests were shortened versions of the TREC topics, consisting of 1-2 natural English sentences. Hence, the test requests represented requests that could be used by real users.

There were two main query types. The requests as such were the first type. This type is called natural language/sentence, and is abbreviated to NL/S. The second type was formulated on the basis of the requests by selecting from them the most important words and phrases. It is called natural language/word and phrase, and is abbreviated to NL/WP. The English NL/S and NL/WP queries were translated into Finnish by the author. As a translation aid the author used printed dictionaries. The test dictionaries were not used in this phase. The English NL/S and NL/WP queries that provided the basis for the Finnish queries were also used as baselines for the CLIR queries (see Figure 1). The term CLIR queries refers to final queries, i.e., queries translated by means of dictionaries.

Both NL/S and NL/WP queries were divided into two subtypes, structured and unstructured queries. The structured queries had dictionary-based facets, i.e., the words that were derived from the same Finnish word were grouped together by the syn-operator of InQuery. Figure 1 illustrates the structuring method applied in the study, showing how the original English query osteoporosis prevent reduce research is transformed into a structured CLIR query. It should be noted that unstructured and structured NL/WP queries, as well as unstructured and structured NL/S queries, are comparable, as they are derived from the same Finnish queries. They also have the same baseline. The NL/WP and NL/S queries are not comparable, since they do not have identical search key sets.
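The difference between unstructured and structured CLIR queries can be sketched as follows. The translation equivalents below are invented for illustration and are not taken from the test dictionaries or from Figure 1. The `#syn(...)` notation follows the query examples printed elsewhere in the paper, while writing the sum-operator as `#sum(...)` is an assumption about InQuery's syntax.

```python
def unstructured_query(translations):
    """Flat query: every translation equivalent is an independent key."""
    keys = [key for equivalents in translations.values() for key in equivalents]
    return "#sum(" + " ".join(keys) + ")"


def structured_query(translations):
    """Group the equivalents of each source word into one synonym facet."""
    facets = []
    for equivalents in translations.values():
        if len(equivalents) == 1:
            facets.append(equivalents[0])
        else:
            facets.append("#syn(" + " ".join(equivalents) + ")")
    return "#sum(" + " ".join(facets) + ")"


# Finnish source key -> English equivalents (hypothetical dictionary output).
TRANSLATIONS = {
    "osteoporoosi": ["osteoporosis"],
    "ehkäistä": ["prevent", "avert", "obviate"],
    "vähentää": ["reduce", "decrease", "lessen"],
    "tutkimus": ["research", "study", "investigation"],
}

print(unstructured_query(TRANSLATIONS))
print(structured_query(TRANSLATIONS))
```

In the flat form a source word with many equivalents contributes many keys and so dominates the query. In the structured form each source word contributes exactly one facet, whatever the number of its translations.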
Compound words are common in Finnish, whereas noun phrases, except for proper name phrases, are relatively rare. A Finnish compound word is often translated as a noun phrase in English (like compound word and yhdyssana). This is the main reason why phrases were identified in NL/WP queries (both in the original English queries, or the baseline, and in the Finnish queries). In this way a precise correspondence was obtained between the baseline queries with their many phrases and the Finnish queries with their many compound words. Phrase identification probably favored the baseline queries. The effect on CLIR queries was small, as the Finnish queries did not have many phrases.

The query operators for NL/S and NL/WP queries were the sum-, syn-, and uwn-operators. Search keys contained in the sum-operator have equal influence on search results. The syn-operator was used in structured CLIR queries (see Figure 1). The syn-operator treats its operand search keys as instances of the same word. The uwn-operator (unordered window n) is a proximity operator. It was used, with n=3, to combine phrase components and the English equivalent words derived from the same Finnish compound (see Section 4).

The translation methods were the following:

• gd translation: Finnish search keys were translated by means of the general dictionary.

• sd -> gd translation: Finnish search keys were translated by means of the medical dictionary and the general dictionary, in this order. General dictionary translation was applied after medical dictionary translation only if the latter did not translate a word.

• sd and gd translation: Finnish search keys were translated by means of the medical dictionary and the general dictionary. Duplicate words were removed.
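The three translation methods listed above can be sketched as simple lookups. The two toy dictionaries and the function names are hypothetical, and the entries are invented for illustration. The fallback of passing an unknown word unchanged to the final query follows the paper's description.

```python
# Toy dictionaries: Finnish base form -> English equivalents (invented entries).
GENERAL = {
    "murtuma": ["fracture", "break", "crack"],
    "tutkimus": ["research", "study"],
}
MEDICAL = {
    "murtuma": ["fracture"],
}


def translate_gd(word):
    """gd: general dictionary only."""
    return GENERAL.get(word, [word])


def translate_sd_then_gd(word):
    """sd -> gd: medical dictionary first, general dictionary only as a fallback."""
    if word in MEDICAL:
        return MEDICAL[word]
    return GENERAL.get(word, [word])


def translate_sd_and_gd(word):
    """sd and gd: union of both dictionaries with duplicates removed."""
    merged = MEDICAL.get(word, []) + GENERAL.get(word, [])
    deduplicated = list(dict.fromkeys(merged))  # keeps first-seen order
    return deduplicated or [word]


for method in (translate_gd, translate_sd_then_gd, translate_sd_and_gd):
    print(method.__name__, [method(w) for w in ("murtuma", "tutkimus", "Helsinki")])
```

The sequential method yields the narrowest key set for words covered by the special dictionary, while the parallel method keeps every equivalent from both sources.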


If a word or a phrase was not found as an entry word in the dictionaries, it was sent unchanged to the final query. These kinds of expressions were English proper names, acronyms, and Finnish words not found in the dictionaries.

Findings

The performance of test queries was evaluated as precision at 10% recall, as average precision over the 10%-100% recall levels, and as precision-recall graphs. The results are presented in Tables 1-2 and Figures 2-5.

As shown, there is a significant gap between the baseline (BL) and the unstructured NL/S queries (Table 1 and Figure 2). At 10% recall, the precision of the baseline is 37.9%, but only 15.4% for gd queries. The special dictionary effect is clear, but baseline queries still perform much better than sd -> gd and sd and gd queries; at 10% recall the precision of sd -> gd and sd and gd queries is, roughly, only half of the precision of the baseline. When average precision is considered, the gap in performance is greater in favor of the baseline.

Structure put in NL/S queries through dictionaries results in a significant improvement in performance (Table 1 and Figure 3). At 10% recall the best CLIR queries, sd and gd, give the precision figure 35.9%, which is only 2.0 percentage points below the precision of the baseline queries (37.9%). The average precision of sd and gd queries is 12.9% and that of the baseline queries 16.8%. The translation method used is of great importance. At the high precision level (10% recall - 50% recall), gd and sd -> gd queries perform much worse than sd and gd queries.

As shown in Table 2 and Figure 4, the performance of unstructured NL/WP queries is significantly below that of the baseline. Figure 4 is very much like Figure 2, which demonstrates the behavior of the unstructured NL/S queries. As in NL/S queries, structuring improves the performance of NL/WP queries significantly (Table 2 and Figure 5). The best structured cross-language queries, sd and gd, do almost as well as the baseline. For the former, precision at 10% recall is 31.1%, and for the latter 31.8%. The average precision is practically the same, 12.4% for sd and gd queries and 12.5% for the baseline. At three recall levels, 50%, 60%, and 70%, sd and gd queries give better precision figures than the baseline. The figures are, respectively, 13.2% and 12.8%, 9.5% and 8.8%, and 6.3% and 6.1% (these figures are not given in the tables of the paper).
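The evaluation measures named above can be sketched for a single query as follows. The paper does not state how precision at a given recall level was computed, so the interpolation rule used here (the highest precision at any rank where recall has reached the level) is an assumption, and the function names are hypothetical.

```python
def precision_at_recall_levels(ranking, relevant, levels=None):
    """Interpolated precision at each recall level for one ranked result list."""
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]  # 10%, 20%, ..., 100%
    points = []  # (recall, precision) observed at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    result = []
    for level in levels:
        reached = [p for r, p in points if r >= level - 1e-9]
        result.append(max(reached) if reached else 0.0)
    return result


def average_precision_over_levels(ranking, relevant):
    """Mean of the precision values over the ten recall levels."""
    values = precision_at_recall_levels(ranking, relevant)
    return sum(values) / len(values)


# Toy run: five retrieved documents, two of which are relevant.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4"}
print(precision_at_recall_levels(ranking, relevant)[0])  # precision at 10% recall
print(average_precision_over_levels(ranking, relevant))
```

Figures such as those in Tables 1-2 would then be obtained by averaging these per-query values over all test queries.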

Table 1. The performance of NL/S queries (precision, %)

Query type / Translation type             10%-recall P   Average P
Structured, dictionary-based facets
  GD                                      30.9           10.5
  SD --> GD                               30.4           11.3
  SD and GD                               35.9           12.9
Unstructured
  GD                                      15.4            5.1
  SD --> GD                               19.2            5.8
  SD and GD                               20.4            6.3
Structured and unstructured, baseline     37.9           16.8

[Figure missing; axes: Precision (0-40) against Recall (10-100); curves: GD, SD --> GD, SD and GD, BL]
Figure 2. Precision-recall curves for unstructured NL/S queries

[Figure missing; axes: Precision (0-40) against Recall (10-100); curves: GD, SD --> GD, SD and GD, BL]
Figure 3. Precision-recall curves for structured NL/S queries

Table 2. The performance of NL/WP queries (precision, %)

Query type / Translation type             10%-recall P   Average P
Structured, dictionary-based facets
  GD                                      24.9            9.8
  SD --> GD                               26.1           10.5
  SD and GD                               31.1           12.4
Unstructured
  GD                                      16.5            5.7
  SD --> GD                               14.6            5.0
  SD and GD                               19.3            6.5
Structured and unstructured, baseline     31.8           12.5

[Figure missing; axes: Precision (0-35) against Recall (10-100); curves: GD, SD --> GD, SD and GD, BL]
Figure 4. Precision-recall curves for unstructured NL/WP queries

[Figure missing; axes: Precision (0-35) against Recall (10-100); curves: GD, SD --> GD, SD and GD, BL]
Figure 5. Precision-recall curves for structured NL/WP queries

4 Findings on English to Finnish CLIR

It is possible that the specific linguistic features of Finnish as a source language or English as a target language contributed to the good performance of the structured queries in Fin-Eng CLIR (Section 3). Thus, the effectiveness of the query structuring method may depend on the languages of a CLIR system (or the direction of translation). We explored whether the method is also useful in Eng-Fin text retrieval, i.e., the case where translations are done in the opposite direction to those in the experiments presented in Section 3.

English and Finnish are different types of languages, particularly in morphology. In English grammatical relations are indicated mainly by prepositions, while Finnish typically uses a grammatical case. In Finnish, there are 14 features in the category of case. Therefore, the number of word forms that a given Finnish lexeme may take is very high, theoretically 2200 forms for nouns (Karlsson, 1987). Inflection has a depressing effect on CLIR effectiveness, but it is hard to estimate in which case, Fin-Eng or Eng-Fin, the problem is more severe. In Fin-Eng retrieval, inflection causes difficulties especially in query processing, whereas in Eng-Fin retrieval troubles occur in indexing.


In Finnish multiword expressions are typically compound words, in English they are often phrases. From the IR and CLIR perspectives, a compound word is a more convenient type of expression than a phrase, because compound decomposition is easier than phrase identification. In this respect Fin-Eng retrieval is easier than Eng-Fin retrieval. Finnish compounds can be split effectively into component words by a dictionary-based morphological analyzer. The English equivalents of the components can be combined by a proximity operator in CLIR queries (Section 3). The application of the technique in Eng-Fin retrieval requires that phrases are identified. Correct phrase identification is difficult, however. We also studied phrase identification and a structuring technique utilizing a proximity operator.

If phrases are not identified in CLIR, phrase components instead of full phrases are translated, and the senses of multi-word keys may be lost. This causes loss of retrieval effectiveness (Hull & Grefenstette, 1996). Automatic phrase identification methods involve the use of collocation statistics (Buckley et al., 1996), part-of-speech tagging (Ballesteros & Croft, 1997), and shallow syntactic analysis (Strzalkowski, 1995; Zhai et al., 1997). In cross-language retrieval where the target language is a compound language, i.e., a language where multiword expressions are compounds rather than phrases, it would be possible to recognize as phrases the adjacent request words that correspond to a compound word in the target language. Compound languages include such languages as German, Dutch, Swedish, and Finnish. A phrase identification system could be based on the translation dictionary or the database index of a retrieval system, or it may be constructed as an independent system.

In the present study, we marked as phrases in the English requests the adjacent words, as well as the words separated by the preposition "of", that corresponded to compound words in the Finnish requests. The Finnish equivalents of an English phrase were combined by a proximity operator (uw3) in CLIR queries. The effectiveness of phrase-based queries was compared to that of word-based queries.

Methods and data

The test collection contained around 55,000 articles published in three Finnish newspapers in 1988-1992. The average article length was 233 words. Our test environment provides 35 test requests (in Finnish) for which the relevance of 16,000 articles is known (Kekäläinen, 1999; Kekäläinen & Järvelin, 1999). 20 of the 35 requests were used as test requests in this study. The requests were natural sentences. For this study, the inflected words of the requests were normalized into their base forms, and compound words were decomposed into their component words by the FINTWOL morphological analyzer. The normalized Finnish words were (1) used as search keys in the baseline queries, and (2) translated into English by the author. The translations were checked by a colleague whose native language is English. Human translation was done to get test queries that are comparable to the original Finnish queries. The English words were translated back to Finnish by an English - Finnish electronic dictionary (Section 3). As a test system we used the InQuery retrieval system.

The query structuring technique was the same as in Fin-Eng CLIR, i.e., the translation equivalents of a source language word were combined by the syn-operator of the InQuery retrieval system. In addition, in phrase-based structured queries the uw3-operator was applied to the Finnish equivalents of the English phrase components; those equivalents that corresponded to the first part of the phrase were joined by the operator to those equivalents that corresponded to the second part. All the combinations were generated. For example, the English equivalent of the Finnish compound tuotantomäärä is volume of production.
This was translated back to Finnish using the electronic dictionary, and the components were combined by the uw3-operator in a phrase-based query:

#syn(#uw3(esitys erä) #uw3(esitys joukko) #uw3(esitys kvantiteetti) #uw3(esitys määrä) #uw3(esitys paljous) #uw3(esitys suure)
#uw3(produktio erä) #uw3(produktio joukko) #uw3(produktio kvantiteetti) #uw3(produktio määrä) #uw3(produktio paljous) #uw3(produktio suure)
#uw3(tuotanto erä) #uw3(tuotanto joukko) #uw3(tuotanto kvantiteetti) #uw3(tuotanto määrä) #uw3(tuotanto paljous) #uw3(tuotanto suure))

The query types were as follows:

1. Finnish word-based queries, i.e., the baseline for the CLIR queries of types 2-4
2. Word-based unstructured CLIR queries
3. Word-based structured CLIR queries
4. Phrase-based structured CLIR queries

Findings

The results were evaluated as (1) average precision over ten recall points (10-100%), and as (2) precision-recall graphs. The results are presented in Table 3 and Figure 6.

As shown in Table 3, the average precision of word-based structured queries is 27.4%, while unstructured queries give the precision figure of 18.8%. The relative improvement due to structuring is 45.7% (column 3). Phrase-based structured queries perform slightly better than word-based structured queries, with the relative improvement due to structuring and phrase identification being 54.3%. As shown in column 4 in Table 3, the relative performance percentages of CLIR queries with respect to baseline queries are 77.0% (word-based structured queries), 81.5% (phrase-based structured queries), and 52.8% (word-based unstructured queries).

Figure 6 shows precision-recall curves for CLIR and baseline queries. As can be seen, structured queries perform markedly better than unstructured queries but fall below baseline queries. Phrase-based queries perform well particularly at the 10%-recall level.

Table 3. The performance of CLIR and baseline queries

Query Type                   Avg. Precision (%)   % Change Str. vs. Unstr.   Precision in relation to baseline (%)
Word-based structured        27.4                 45.7                       77.0
Word-based unstructured      18.8                 -                          52.8
Phrase-based structured      29.0                 54.3                       81.5
Baseline (Finnish) queries   35.6                 -                          100.0
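The derived columns of Table 3 follow from the average precision column by simple arithmetic. The short check below recomputes them; the variable and function names are chosen here for illustration.

```python
# Average precision values (%) from Table 3.
AVG_PRECISION = {
    "word-based structured": 27.4,
    "word-based unstructured": 18.8,
    "phrase-based structured": 29.0,
    "baseline": 35.6,
}


def percent_change(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return round((new - old) / old * 100, 1)


def percent_of(value, reference):
    """`value` expressed as a percentage of `reference`."""
    return round(value / reference * 100, 1)


unstructured = AVG_PRECISION["word-based unstructured"]
baseline = AVG_PRECISION["baseline"]
for name, value in AVG_PRECISION.items():
    print(name, percent_change(value, unstructured), percent_of(value, baseline))
```

The recomputed values agree with the table: 45.7 and 54.3 for the change against unstructured queries, and 77.0, 52.8 and 81.5 for precision in relation to the baseline.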

[Figure missing; axes: Precision (0-80) against Recall (10-100); curves: Baseline, Structured, Unstructured, Phrase]
Figure 6. Precision-recall curves for CLIR and baseline queries
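The idea of recognizing English phrases through target language compounds, described earlier in this section, can be sketched as follows. In the study the phrases were marked by hand. The automatic variant here is only an illustration of the suggestion that a phrase identifier could be based on the translation dictionary or the database index. The function names are hypothetical, the equivalents are the ones visible in the tuotantomäärä example query, and the sketch assumes that a Finnish compound is a plain concatenation of base forms, which ignores inflected components.

```python
def compound_for(modifier, head, translate, compounds):
    """Return a known target compound built from equivalents of the two words."""
    for m in translate.get(modifier, []):
        for h in translate.get(head, []):
            if m + h in compounds:
                return m + h
    return None


def find_phrases(words, translate, compounds):
    """Mark 'w1 w2' and 'w1 of w2' as phrases when they map to a target compound."""
    phrases = []
    i = 0
    while i < len(words):
        match, span = None, 0
        if i + 1 < len(words):  # adjacent words: modifier comes first
            match, span = compound_for(words[i], words[i + 1], translate, compounds), 2
        if match is None and i + 2 < len(words) and words[i + 1] == "of":
            # 'head of modifier': the order is reversed in the compound
            match, span = compound_for(words[i + 2], words[i], translate, compounds), 3
        if match:
            phrases.append((" ".join(words[i:i + span]), match))
            i += span
        else:
            i += 1
    return phrases


# English word -> Finnish equivalents, taken from the example query in this section.
TRANSLATE = {
    "production": ["esitys", "produktio", "tuotanto"],
    "volume": ["erä", "joukko", "kvantiteetti", "määrä", "paljous", "suure"],
}
COMPOUNDS = {"tuotantomäärä"}  # stand-in for a database index of compounds

print(find_phrases("the volume of production fell".split(), TRANSLATE, COMPOUNDS))
```

Once a phrase has been marked, its components' equivalents can be combined with the uw3-operator as in the example query given earlier in this section.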

5 Findings on Swedish to English CLIR

Swedish has linguistic features, for example, the use of “fogemorphemes” in compound words and a high frequency of homographs, that affect IR performance (Hedlund et al. 2000). When decomposing compound words, morphological analysis programs have pitfalls that affect retrieval results and query translation. Here we shall focus on the morphological decomposition of compounds and on homographs.

Compound splitting. The constituents of a compound are needed for dictionary translation, in particular when the whole compound is not in the dictionary. On the other hand, there are common compounds that are lexicalized, so that the meaning can no longer be determined on the basis of the constituents, for example jordgubbe (strawberry). Swedish, Finnish and German compounds are spelled as one word. In Swedish the components are often joined by a joining morpheme, a “fogemorpheme”. This is not the case in, for example, English or French. In Finnish the non-last components are often in the genitive case. In German the non-last component is also often in an inflected form.

Table 4 presents an example of compound splitting by SWETWOL. Here the components skog and industrin are joined with the fogemorpheme “s”. There are other fogemorphemes in Swedish as well. The latter component industrin (the industry) is normalized to the base form industri, which is a hyperonym of skogsindustri and therefore often a valuable search key. The former component skogs has retained the “s”, and is not normalized to the base form skog.

Table 4. Morphological analysis and compound splitting

Input word           skogsindustrin (the forest industry)
Compound splitting   skogs#industri (forest # industry)
SWETWOL analysis     # N UTR DEF SG NOM (noun # noun, uter, definite, sg. nominative)

We have developed an algorithm for recognizing and handling fogemorphemes. It seeks to recognize the base forms of all constituents, and thereby allows the translation of all constituents through a bilingual dictionary. The algorithm for handling fogemorphemes appears to work well in the query formulation process and essentially reduces the number of non-translated words in several topics. However, since we deal with constituents of compounds, the actual effect on the search result also depends on other factors, such as the extent to which the constituents carry important search keys. Nevertheless, the lesson is that (CL)IR requires morphological processing for compounds which yields correct basic word forms. Mere component separation is not sufficient. Compound words and inflected-form components may be a frequent feature among natural languages.

Homographs. Swedish is rich in homographs with many senses. Frequent words in a language usually tend to have many senses, and they also tend to appear as constituents in compound words. When automating the query formulation process in CLIR we have to deal with compound words that can be morphologically separated into three or four constituents. In a translation process every constituent is translated separately and then combined as a phrase with the other translated constituents. Thus the number of alternative combinations may grow very rapidly.

The following example shows the features in compound splitting, homographs and the fogemorpheme algorithm. The Swedish word flygplansolycka (aeroplane accident) is handled in an automated query formulation process like this:

• the word plan is a homograph having the two senses “plan” and “plane”

• the morphological analysis: flyg#plans#olycka

• the fogemorpheme algorithm output: flyg#plan#olycka

• the translation process output, where every combination of constituents is a phrase and all combinations are treated as translation alternatives and synonyms, is as follows:

#SYN( #OD4(aviation plane accident) #OD4(aviation plane disaster) #OD4(aviation plane misfortune) #OD4(aviation plane calamity)
#OD4(aviation flat accident) #OD4(aviation flat disaster) #OD4(aviation flat misfortune) #OD4(aviation flat calamity)
#OD4(aviation level accident) #OD4(aviation level disaster) #OD4(aviation level misfortune) #OD4(aviation level calamity)
#OD4(aviation ground accident) #OD4(aviation ground disaster) #OD4(aviation ground misfortune) #OD4(aviation ground calamity)
#OD4(aviation plan accident) #OD4(aviation plan disaster) #OD4(aviation plan misfortune) #OD4(aviation plan calamity)
#OD4(plane plane accident) #OD4(plane plane disaster) #OD4(plane plane misfortune) #OD4(plane plane calamity)
#OD4(plane flat accident) #OD4(plane flat disaster) #OD4(plane flat misfortune) #OD4(plane flat calamity)
#OD4(plane level accident) #OD4(plane level disaster) #OD4(plane level misfortune) #OD4(plane level calamity)
#OD4(plane ground accident) #OD4(plane ground disaster) #OD4(plane ground misfortune) #OD4(plane ground calamity)
#OD4(plane plan accident) #OD4(plane plan disaster) #OD4(plane plan misfortune) #OD4(plane plan calamity))

Without the fogemorpheme algorithm we would not have been able to translate the word plans, since it is not in base form, and the translation would be like this:

#SYN( #OD4(aviation plans accident) #OD4(aviation plans disaster) #OD4(aviation plans misfortune) #OD4(aviation plans calamity)
#OD4(plane plans accident) #OD4(plane plans disaster) #OD4(plane plans misfortune) #OD4(plane plans calamity))

The non-translated constituent thus would ruin the translation of the compound. We are currently evaluating how such NLP processing affects search effectiveness in large test collections.
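The step from flyg#plans#olycka to flyg#plan#olycka can be sketched as below. This is a minimal illustration under stated assumptions, not the algorithm described in this paper: the lexicon is a hypothetical toy dictionary, only the joining element -s- from the example is handled, and the sketch assumes that joining elements occur only at the end of non-final constituents.

```python
# Hypothetical toy lexicon of base forms known to the bilingual dictionary.
LEXICON = {"flyg", "plan", "olycka"}

# Assumption: only the joining element seen in the example is covered.
JOINING_ELEMENTS = ("s",)


def base_form(constituent, lexicon, joiners=JOINING_ELEMENTS):
    """Return a dictionary base form for a constituent, or None if none is found."""
    if constituent in lexicon:
        return constituent
    for joiner in joiners:
        if constituent.endswith(joiner):
            candidate = constituent[: -len(joiner)]
            if candidate in lexicon:
                return candidate
    return None


def normalize_compound(analysis, lexicon=LEXICON):
    """Normalize the non-final constituents of a '#'-separated compound analysis."""
    parts = analysis.split("#")
    normalized = []
    for position, part in enumerate(parts):
        is_final = position == len(parts) - 1
        # The final constituent carries no joining element, so it is kept as is.
        found = None if is_final else base_form(part, lexicon)
        normalized.append(found if found is not None else part)
    return "#".join(normalized)


result = normalize_compound("flyg#plans#olycka")
```

A constituent that cannot be reduced to a known base form is left unchanged, so it still shows up as an untranslated key, as in the second query above.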

6. Discussion and Conclusions

We have discussed dictionary-based cross-language information retrieval (CLIR) methods, and reported recent findings and the problems encountered. We considered three language pairs for CLIR: Finnish to English, English to Finnish, and Swedish to English. Finnish and Swedish are rather different from English: both are rich in compounds, Finnish has a rich inflectional morphology, and Swedish text very frequently contains homographs. In summary, our findings suggest that:

• query structuring through synonym sets is a simple and essential tool for dictionary-based CLIR effectiveness; query structuring performs disambiguation indirectly;

• the parallel use of general and special dictionaries improves effectiveness; in other types of collections, e.g., domain-specific collections, the sequential sd -> gd application of dictionaries (special dictionary first, then general dictionary) may perform better;

• word-by-word translation of natural language request sentences yields performance comparable to (or better than) that obtained by selecting source keys and phrases; languages rich in compounds have the additional advantage that source language compounds trivially suggest target language phrases;

• proper names are generally not translatable and may pose matching problems due to differing spelling (transliteration) and inflection; proper names thus require other techniques, such as matching based on n-grams;

• when compound word components may inflect or are joined together by special morphemes (fogemorphemes), compound splitting must recognize the correct base form of the components; otherwise component translation is endangered.
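As an illustration of the n-gram matching mentioned for proper names, the sketch below compares an untranslatable source key with target index terms by the overlap of their character digrams, using the Dice coefficient. The choice of digrams and of the Dice coefficient is an assumption made for this example, and the index terms are hypothetical; the paper does not specify a particular n-gram method.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lower-cased word."""
    lowered = word.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_match(source_key, index_terms, n=2):
    """Pick the index term whose n-grams overlap most with the source key."""
    return max(index_terms, key=lambda term: dice(source_key, term, n))


# Hypothetical target-language index terms; "Jeltsin" is the Finnish
# spelling of the name written "Yeltsin" in English.
index_terms = ["Yeltsin", "Helsinki", "Clinton"]
match = best_match("Jeltsin", index_terms)
```

In practice a similarity threshold would also be needed, so that a key with no close counterpart in the index is not forced onto an unrelated term.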

Linguistic features of both the source language (for retrieval) and the target language (for both indexing and retrieval) must be observed for successful CLIR. Given the variety of natural languages that may be used for CLIR, many problems remain to be studied in this research area.

References Alkula, R., Honkela, T. 1992. Tekstin tallennus- ja hakumenetelmien kehittäminen suomen kielen tulkintaohjelmien avulla. FULLTEXT-projektin loppuraportti. [Linguistic processing and retrieval techniques in Finnish fulltext databases. Final report of the FULLTEXT project.] VTT julkaisuja publikationer 765. Espoo: VTT.

12

Ballesteros, L. & Croft, W. B. 1997. Phrasal translation and query expansion techniques for crosslanguage information retrieval. In: Proceedings of the 20th ACM SIGIR Conference: 84-91. Ballesteros, L., Croft, W. B. 1998. Resolving ambiguity for cross-language retrieval. In: Proceedings of the 21stAnnual International ACM SIGIR Conference: 64-71. Broglio, J., Callan, J. & Croft, W.B. 1994. Inquery system overview. In: Proceedings of the TIPSTER Text Program (Phase I): 47-67. Brown, P.F., Della Pietra, S.A., Della Pietra, V.J. & Mercer, R.L. 1991. Word-sense disambiguation using statistical methods. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics: 264-270. Buckley, C., Singhal, A., Mitra, M. & Salton, G. 1996. New retrieval approaches using SMART: TREC4. In: The Fourth Text REtrieval Conference (TREC-4), Gaithesburg, MD. Available at: http://trec.nist.gov/pubs/trec4/t4_proceedings.html Dagan, I., Itai, A. & Schwall, U. 1991. Two languages are more informative than one. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics: 130-137. Davis, M. 1997. New experiments in cross-language text retrieval at NMSU's Computing Research Lab. In: The Fifth Text REtrieval Conference (TREC-5), Gaithesburg, MD. Available at: http://trec.nist.gov/pubs/trec5/t5_proceedings.htm Grishman, R. 1986. Computational linguistics: an introduction. Cambridge: Cambridge University Press. Guthrie, J. A., Guthrie, L., Wilks, Y. & Aidinejad, H. 1991. Subject-dependent co-occurrence and word sense disambiguation. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics: 146-152. Harman, D. 1991. How effective is suffixing? Journal of the American Society for Information Science 42(1): 7-15. Hedlund, T., Pirkola, A. and Järvelin, K. 2000. Aspects of Swedish morphology and semantics from the perspective of mono- and cross-language information retrieval. 
Information Processing & Management 37, to appear. Hirst, G. 1987. Semantic interpretation and the resolution of ambiguity. Cambridge: Cambridge University Press.

13

Hull, D. 1996. Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science 47(1): 70-84. Hull, D. & Grefenstette, G. 1996. Querying across languages: A dictionary-based approach to multilingual information retrieval. In: Proceedings of the 19th ACM SIGIR Conference: 49-57. Karlsson, F. 1987. A Finnish grammar. Porvoo: WSOY. Kekäläinen, J. 1999. The effects of query complexity, expansion and structure on retrieval performance in probabilistic text retrieval. Ph.D. Thesis, University of Tampere. Kekäläinen, J. & Järvelin, K. 1999. The co-effects of query structure and expansion on retrieval performance in probabilistic text retrieval. Information Retrieval 1(4): 329-344. Koskenniemi K. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. Thesis. University of Helsinki. Krovetz, R. 1993. Viewing morphology as an inference process. In: Proceedings of the 16th ACM SIGIR Conference: 191-202. Krovetz, R. & Croft, W.B. 1989. Word sense disambiguation using machine-readable dictionaries. In Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 127-136. Krovetz, R. & Croft, W.B. 1992. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems 10(2): 115-141. McRoy, S.W. 1992. Using multiple knowledge sources for word sense disambiguation. Computational Linguistics 18(1): 1-30. Nie J-Y, Simard M, Isabelle P & Durand R. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In: Proceedings of the 22nd ACM Sigir Conference: 74-81. Oard, D. & Dorr, B. 1996. A survey of multilingual text retrieval. Technical Report UMIACS-TR-96-19. University of Maryland, Institute for Advanced Computer Studies. Pirkola, A. 1998. 
The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In: Proceedings of the 21st Annual International ACM SIGIR Conference: 55-63.

14

Pirkola, A. 1999. Studies on linguistic problems and methods in text retrieval. Ph.D. Thesis, University of Tampere. Pirkola, A. 1999. Homonymy in cross-language retrieval. University of Tampere, Department of Information Studies. Unpublished manuscript. Pirkola, A. & Keskustalo, H. & Järvelin, K. 1999. The effects of translation method, conjunction, and facet structure on concept-based cross-language retrieval. Information Retrieval 1: 217 - 250. Porter M.F. 1980. An algorithm for suffix stripping. Program 14, 130-137. Puolamäki, D., Pirkola, A. & Järvelin, K. 2000. Applying Query Structuring in Cross-Language Retrieval. Manuscript. Sanderson, M. 1994. Word sense disambiguation and information retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 142-151. Sanderson, M. 1997. Word sense disambiguation and information retrieval. Ph.D. Thesis University of Glasgow, Department of Computing Science. Sheridan, P., Braschler, M. & Schäuble, P. 1997. Cross-language information retrieval in a multilingual legal domain. In Peters, C. & Thanos, C., ed. Research and Advanced Technology for Digital Libraries. First European Conference, ECDL '97. Lecture Notes in Computer Science, 1324: 253 268. Schütze, H. & Pedersen, J.O. 1995. Information retrieval based on word senses. In: Proceedings of the Symposium on Document Analysis and Information Retrieval: 161-175. Sperer, R. & Oard, D.W. 2000. Structured translation for cross-language IR. In: Proceedings of the 23rd Annual International ACM SIGIR Conference: . Strzalkowski, T. 1995. Natural language information retrieval. Information Processing & Management 31(3): 397-417. Teleman, Hellberg, & Andersson, E. 1999. Svenska Akademiens grammatik 1-4 [Grammar of the Swedish Academy 1-4]. Stockholm: Svenska Akademien. Vorhees, E.M. 1993. Using WordNet to disambiguate word senses for text retrieval. 
In: Proceedings of the 16th Annual International ACM SIGIR Conference: 171-180.

15

Zhai, C., Tong, X., Milic-Frayling, N. & Evans, D.A. 1997. Evaluation of syntactic phrase indexing CLARIT NLP track report. In: The Fifth Text REtrieval Conference (TREC-5), Gaithesburg, MD. Available at: http://trec.nist.gov/pubs/trec5/t5_proceedings.html
