Using Multilingual Resources for Building SloWNet Faster

Darja Fišer
Department of Translation, Faculty of Arts, University of Ljubljana
Aškerčeva 2, 1000 Ljubljana, Slovenia
[email protected]

Abstract. This project report presents the results of an approach in which synsets for the Slovene wordnet were induced automatically from parallel corpora and already existing wordnets. First, multilingual lexicons were obtained from word-aligned corpora and compared to the wordnets in various languages in order to disambiguate the lexicon entries. Then the appropriate synset ids were attached to the Slovene entries in the lexicon. Finally, Slovene lexicon entries sharing the same synset id were organized into a synset. The results were evaluated against a goldstandard and checked by hand.

Keywords: multilingual lexica, parallel corpora, word senses, word-alignment.

1 Introduction

Automated approaches to wordnet construction, extension and enrichment all aim to facilitate faster, cheaper and easier development, but they vary according to the resources that are available for a particular language. These range from the Princeton WordNet (PWN) [7], the backbone of a number of wordnets [19, 14], to machine-readable bilingual and monolingual dictionaries, which are used to disambiguate and structure the lexicon [11], and taxonomies and ontologies, which usually provide a more detailed and formalized description of a domain [6].

For the construction of the Slovene wordnet we have leveraged the resources at our disposal, which are mainly corpora. Based on the assumption that the translation relation is a plausible source of semantics, we have used multilingual parallel corpora to extract semantically relevant information. The idea that senses of ambiguous words in the source language (SL) are often translated into distinct words in the target language (TL), and that all SL words translated into the same TL word share some element of meaning, has already been explored by e.g. [15] and [10]. Our work is also closely related to what has been reported by [1], [3] and [20].

The paper is organized as follows: the methodology used in the experiment is explained in the next section, Sections 3 and 4 present and evaluate the results, and the last section gives conclusions and future work.

2 Methodology

2.1 Parallel Corpora

The experiment was conducted on two very different corpora, the Multext-East corpus [2] and the JRC-Acquis corpus [17]. The former is relatively small (100,000 words per language) and contains a single text, the novel "1984" by George Orwell. Although it is a literary text, it is written in a plain, contemporary style and contains general vocabulary. Because it had already been sentence-aligned and tagged, as many as five languages could be used (English, Czech, Romanian, Bulgarian and Slovene).

The latter, by contrast, contains EU legislation and is very domain-specific. It is also the largest parallel corpus of its kind, covering 21 languages (about 10 million words per language). However, the JRC-Acquis is paragraph-aligned with HunAlign [21] but is not tagged, lemmatized, sentence-aligned or word-aligned, which made the pre-processing stage much more demanding than with the 1984 corpus. We were therefore forced to initially limit the languages involved to English, Czech and Slovene, with the aim of extending the approach to Bulgarian and Romanian as soon as tagging information becomes available for these languages. The English and Slovene parts of the JRC-Acquis corpus were first tokenized, tagged and lemmatised with totale [4], while the Czech part was kindly tagged for us with Ajka [16] by the team from the Faculty of Informatics at Masaryk University in Brno. We included the first 2,000 documents from the corpus in the dataset and filtered out all function words.

Both corpora were sentence- and word-aligned with Uplug [18], for which the slowest but best performing 'advanced setting' was used. It first creates basic clues for word alignments, then runs GIZA++ [13] with standard settings and aligns words with the existing clues. Alignments with the highest confidence measure are learned, and the last two steps are repeated three times. The output of the alignment process is a file containing word links, with information on word-link certainty between the aligned pair of words and their unique ids.

2.2 Extracting Translations of One-Word Literals

Word-alignments were used to create bilingual lexicons. In order to reduce the noise in the lexicon as much as possible, only 1:1 links between words of the same part of speech were taken into account, and all alignments occurring only once were discarded. In this experiment, synonym identification and sense disambiguation were performed by observing the semantic properties of words in several languages, which is why the information from the bilingual word-alignments was combined into a multilingual lexicon. The lexicon is based on English lemmas and their word ids, and contains all their translation variants found in the other languages. The obtained multilingual lexicon was then compared to the already existing wordnets in the corresponding languages.
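The lexicon-building step can be sketched in a few lines. The following is a minimal illustration, assuming the Uplug word links have already been parsed into (lemma, part-of-speech) tuples; the data shapes and the frequency threshold are assumptions made for the example, not the original implementation.

```python
from collections import Counter

def build_bilingual_lexicon(word_links, min_freq=2):
    """Build a bilingual lexicon from 1:1 word-alignment links.

    word_links: iterable of (src_lemma, src_pos, tgt_lemma, tgt_pos)
    tuples, assumed to be parsed from the Uplug alignment output.
    Only 1:1 links between words of the same part of speech are kept,
    and alignments occurring only once are discarded.
    """
    counts = Counter()
    for src_lemma, src_pos, tgt_lemma, tgt_pos in word_links:
        if src_pos == tgt_pos:                    # same part of speech only
            counts[(src_lemma, src_pos, tgt_lemma)] += 1

    lexicon = {}
    for (src_lemma, src_pos, tgt_lemma), freq in counts.items():
        if freq >= min_freq:                      # drop alignments seen once
            lexicon.setdefault((src_lemma, src_pos), set()).add(tgt_lemma)
    return lexicon
```

Several such bilingual lexicons, one per language pair and keyed on the same English lemmas, can then be merged into the multilingual lexicon described above.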

For English, PWN was used, while for Czech, Romanian and Bulgarian the wordnets from the BalkaNet project [19] were used. There were two reasons for using the BalkaNet wordnets: (1) the languages included in the project correspond to the multilingual corpus we had available; and (2) the wordnets were developed in parallel, cover a common sense inventory and are aligned to one another as well as to PWN, making the intersection easier.

If a match was found between a lexicon entry and a literal of the same part of speech in the corresponding wordnet, the synset id was remembered for that language. If, after examining all the existing wordnets, there was an overlap of synset ids across all the languages for the same lexicon entry, it was assumed that the words in question all describe the concept marked with this id. Finally, the concept was extended to the Slovene part of the multilingual lexicon entry, and the synset id common to all the languages was assigned to it. All the Slovene words sharing the same synset id were treated as synonyms and were grouped into synsets.
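A minimal sketch of this intersection step is given below. The entry layout and the wordnet lookup interface are assumptions made for illustration, not the original code.

```python
def assign_synset_ids(entry, wordnets):
    """Disambiguate one multilingual lexicon entry by intersection.

    entry: e.g. {'pos': 'n', 'literals': {'en': 'school',
           'cs': 'škola', 'sl': 'šola'}} (hypothetical layout).
    wordnets: maps a language code to a lookup function returning the
    set of synset ids in which a (literal, pos) pair occurs.
    Returns the synset ids shared by all languages with a wordnet;
    these are then assigned to the Slovene literal.
    """
    common_ids = None
    for lang, literal in entry['literals'].items():
        if lang == 'sl':                  # Slovene has no wordnet to consult
            continue
        ids = wordnets[lang](literal, entry['pos'])
        common_ids = ids if common_ids is None else common_ids & ids
        if not common_ids:                # early exit: no shared concept
            return set()
    return common_ids or set()
```

All Slovene literals that end up sharing a synset id are then collected into one synset, exactly as described above.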

2.3 Extracting Translations of Multi-Word Literals

The automatic word-alignment used in this experiment only provides links between individual words, not phrases. However, simply ignoring all the expressions that extend beyond word boundaries would be a serious limitation of the proposed approach, especially because so much effort has been invested in the preparation of the resources. The second part of the experiment is therefore dedicated to harvesting multi-word expressions from the parallel corpora.

The starting point was a list of multi-word literals extracted from PWN. It contains almost 67,000 unique expressions, the great majority of which (almost 61,000) are from nominal synsets. Another interesting observation is that most of the expressions (more than 60,000) appear in only one synset and are therefore monosemous. Again, most nouns are monosemous (almost 57,000), and there are only about 150 nouns with more than three senses. The highest number of senses for nouns is 6, much lower than for verbs, which can have up to 19 senses. We therefore concluded that sense disambiguation of multi-word expressions would not be a serious problem, and limited the approach to English and Slovene only. Bearing in mind the differences between the two languages, we also assumed that we would not be very successful in finding accurate translations of e.g. phrasal verbs automatically, which is why we decided to first look for two- and three-word nominal expressions only.

First, the Orwell corpus was searched for the nominal multi-word expressions from the list. If an expression was found, the id and part of speech of each constituent word was remembered. This information was then used to look for possible Slovene translations of each constituent word in the file with word alignments. In order to increase the accuracy of the target multi-word expressions, translation candidates had to meet several constraints (see the sketch after this list):

(1) a Det-Noun phrase could only be translated by a single Noun (example: 'a people' – 'narod');
(2) a Det-Adj phrase could only be translated by a single Noun or by a single Adj_Pl (example: 'the young' – 'mladina' or 'mladi');
(3) an (Adj-)Adj-Noun phrase could only be translated by an (Adj-)Adj-Noun phrase (example: 'blind spot' – 'slepa pega');
(4) an (Adj-)Noun-Noun phrase could be translated either by an (Adj-)Adj-Noun or by a Noun-Noun_gen phrase (examples: 'swing door' – 'nihajna vrata [Adj-N]', 'death rate' – 'stopnja umrljivosti [N-N_gen]'; exceptions: 'cloth cap', which is translated into Slovene as 'pokrivalo iz blaga [a cap made of cloth]', and 'chestnut tree' – 'kostanj');
(5) a Noun-Prep-Noun phrase could be translated by a Noun-Noun_gen or by an Adj-Noun phrase (examples: 'loaf of bread' – 'hlebec kruha', 'state of war' – 'vojno stanje'; exception: 'Republic of Slovenia' – 'Republika Slovenija [N-N_nom]');
(6) a Noun-Noun-Noun phrase could only be translated by a Noun-Noun_gen-Noun_gen phrase (example: 'infant mortality rate' – 'stopnja umrljivosti otrok'; exception: 'corn gluten feed' – 'krma iz koruznega glutena [feed made of corn gluten]').

Because word-alignment is far from perfect, alignment errors were avoided by checking whether the translation candidates actually appear as a phrase in the corresponding sentence in the corpus. If a translation was not found for all the parts of a multi-word expression in the file with alignments, an attempt was made to recover the missing translations by first locating the known translated word in the corpus and then using the above-mentioned criteria to guess the missing word from its context. In the end, the canonical word forms of the successfully translated expressions were extracted from the corpus, and all phrases sharing the same synset id were joined into a single synset.
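Taken together, the constraints amount to a small table of permissible source-to-target part-of-speech patterns. The sketch below is illustrative only: the pattern names and the simplification of dropping the optional leading adjective in patterns (3) and (4) are ours, not the original implementation's.

```python
# Permissible translation patterns for nominal multi-word expressions.
# Keys are English POS sequences, values list the Slovene POS sequences
# a translation candidate may take (constraint numbers as in the text).
ALLOWED_PATTERNS = {
    ('Det', 'Noun'):          [('Noun',)],                          # (1)
    ('Det', 'Adj'):           [('Noun',), ('Adj_Pl',)],             # (2)
    ('Adj', 'Noun'):          [('Adj', 'Noun')],                    # (3)
    ('Adj', 'Adj', 'Noun'):   [('Adj', 'Adj', 'Noun')],             # (3)
    ('Noun', 'Noun'):         [('Adj', 'Noun'),
                               ('Noun', 'Noun_gen')],               # (4)
    ('Noun', 'Prep', 'Noun'): [('Noun', 'Noun_gen'),
                               ('Adj', 'Noun')],                    # (5)
    ('Noun', 'Noun', 'Noun'): [('Noun', 'Noun_gen', 'Noun_gen')],   # (6)
}

def candidate_ok(src_pos_seq, tgt_pos_seq):
    """True if the candidate's POS sequence matches a permitted pattern."""
    return tuple(tgt_pos_seq) in ALLOWED_PATTERNS.get(tuple(src_pos_seq), [])
```

Even a candidate that passes this check was still verified against the corresponding sentence in the corpus, as described above, before being accepted.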

3 Results

3.1 Word-Based Approach

The first version of the Slovene wordnet (SLOWN0) was created by translating Serbian synsets [12] into Slovene with a Serbian-Slovene dictionary [5]. The main disadvantage of that approach was the inadequate disambiguation of polysemous words, which required extensive manual editing of the results. In the current approach we tried to use multilingual information to improve the disambiguation stage and generate more accurate synsets.

In the experiment with the Orwell corpus, four different settings were tested, each of them using one more language [8]. Table 1 shows the number of nominal one-word synsets generated from the Orwell corpus, depending on the number of languages involved. Recall drops significantly each time a new language is added; on the other hand, the average number of literals per synset is not affected.

The same approach was also tested on the JRC-Acquis corpus, which is from an entirely different domain and is much larger [9]. It is interesting to observe the change in synset coverage and quality resulting from the different dataset.

However, because the corpus is not annotated with the linguistic information needed in this experiment, we could only apply the approach to English, Czech and Slovene in this setting. Note that although the corpus used was much larger, the number of generated synsets is only slightly higher. This can be explained by the high degree of repetition and the domain-specificity of the texts in the dataset.

Table 1. Nominal synsets generated by leveraging existing multilingual resources (one-word literals only).

           SLOWN0   SLOWN1   SLOWN2   SLOWN3   SLOWN4   SLOWN-JRC
nouns       3,210    2,964      870      671      291       3,528
max l/s        40       10        7        6        4           9
avg l/s       4.8   1.4362      1.4      1.4      1.7         2.6

3.2 Phrase-Based Approach

Nominal multi-word literals were extracted from PWN and then translated into Slovene based on the word-alignments. In order to avoid alignment errors, some restrictions on the translation patterns were introduced, and the phrase candidates were checked in the Slovene corpus as well. This simple approach to matching phrases in word-aligned parallel corpora yielded more synsets than initially expected. If it were extended to other patterns, even more multi-word literals could be obtained. Another approach would be to use statistical co-occurrence measures, such as the one sketched after Table 2, to check the validity of more elusive patterns.

Table 2. Nominal synsets generated from parallel corpora (two-word and three-word literals only).

                   ORWELL      JRC
mwe's found           163    5,652
mwe's translated  121 (73%)  1,984 (34%)
max l/s                 4        2
avg l/s              1.29     1.13
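One co-occurrence measure that could serve the statistical validation mentioned above is the Dice coefficient over aligned-sentence counts. This is a hedged sketch under our own assumptions; neither the function nor its inputs are part of the original experiment.

```python
def dice_score(pair_freq, src_freq, tgt_freq):
    """Dice coefficient for a candidate phrase pair.

    pair_freq: how often the two expressions co-occur in aligned
    sentences; src_freq / tgt_freq: their individual frequencies.
    Scores close to 1 mean the expressions rarely occur without
    each other, so low-scoring candidates could be filtered out.
    """
    return 2.0 * pair_freq / (src_freq + tgt_freq)

# A candidate pair seen together 8 times, with parts occurring
# 10 and 12 times, scores 2*8 / (10+12) = 0.73.
```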

4 Evaluation

4.1 Synset Quality

Automatic evaluation was performed against a manually created goldstandard. Its literals were compared to the literals in the automatically induced wordnets with regard to the synsets they appear in. This information was used to calculate precision, recall and f-measure.

Precision gives the proportion of retrieved and relevant synset ids for a literal out of all synset ids retrieved for that literal. Recall is the proportion of relevant synset ids retrieved for a literal out of all relevant synset ids available for that literal. Finally, precision and recall were combined in the traditional f-measure: (2 * P * R) / (P + R). This seems a fairer alternative to simply evaluating whole synsets because of the restricted input vocabulary.
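A minimal sketch of this per-literal evaluation, assuming the gold and induced synset ids for a literal are available as sets (the function is illustrative, not the original evaluation script):

```python
def evaluate_literal(gold_ids, induced_ids):
    """Precision, recall and f-measure for a single literal.

    gold_ids: synset ids of the literal in the goldstandard.
    induced_ids: synset ids assigned to it automatically.
    """
    correct = len(gold_ids & induced_ids)        # retrieved and relevant
    precision = correct / len(induced_ids) if induced_ids else 0.0
    recall = correct / len(gold_ids) if gold_ids else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```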

Fig. 1. A comparison of precision, recall and f-measure for nominal synsets according to the number of languages used in the disambiguation stage of automatic synset induction from the Orwell corpus.

Figure 1 shows the drop in recall and the increase in precision and f-measure each time a new language is added to the disambiguation stage, with precision peaking at 77.37%, recall at 75.88% and f-measure at 76.62%. The results for the JRC-Acquis corpus are worse due to the smaller number of languages involved and the less accurate word-alignment (precision: 67.0%, recall: 72.0%, f-measure: 69.4%).

4.2 Multi-Word Expressions

Because there was virtually no overlap between the goldstandard and the synsets containing automatically translated multi-word expressions, all the synsets obtained from the Orwell corpus were checked by hand. As can be seen in Table 3, about a third of the generated literals were completely wrong.

The errors were analyzed and grouped into categories. Most errors (17 synsets) occurred because an English multi-word expression should be translated into Slovene with a single word (e.g. 'top hat' – 'cilinder'). The next category (12 synsets) contains alignment errors in which one of the constituent words is mistranslated or a translation is missing (e.g. 'mortality rate' – 'umrljivost otrok', should be 'stopnja smrtnosti'). The next category contains 8 expressions that were translated correctly but cannot be included in the synset because the senses of the translation and of the original synset are not the same (e.g. 'white knight' – 'beli tekač' as in chess, should be 'beli vitez' as in business takeovers). Finally, there are 12 borderline cases that contain a correct translation but also an error (e.g. 'black hole' – 'črna odprtina [wrong]' and 'črna luknja [correct]').

Table 3. Manual evaluation of multi-word expressions obtained from the Orwell corpus.

                      ORWELL
completely wrong      39 (32%)
contain some errors   12 (10%)
fully correct         70 (58%)
total no. of synsets  121

A larger-scale evaluation of the multi-word expressions harvested from the JRC-Acquis has not been carried out yet but is planned for the near future. A quick overview of the results suggests that the quality of the generated synsets is comparable to those obtained from the Orwell corpus.

5 Conclusions

In this paper we have presented an approach to automatically generating wordnet synsets from two parallel corpora. The method works best on nouns, which are disambiguated against several languages. The limitation of the word-alignment-based approach was successfully overcome by using the alignment information to form multi-word expressions. However, the issue of adding multi-word units to wordnet is far from exhausted. More sophisticated statistics-based methods could be used to find more reliable translations of multi-word units. Another possibility to get even more added value from the parallel corpora would be to identify (domain-specific) multi-word expressions that are not part of PWN and add them to the Slovene wordnet.

References

1. Diab, Mona (2004): The Feasibility of Bootstrapping an Arabic WordNet Leveraging Parallel Corpora and an English WordNet. In: Proceedings of the Arabic Language Technologies and Resources, NEMLAR, Cairo.
2. Dimitrova, L.; Erjavec, T.; Ide, N.; Kaalep, H.; Petkevic, V.; Tufis, D. (1998): Multext-East: Parallel and Comparable Corpora for Six Central and Eastern European Languages. In: Proceedings of ACL/COLING 98, pp. 315-319, Montreal, Canada.
3. Dyvik, Helge (2002): Translations as Semantic Mirrors: From Parallel Corpus to Wordnet. Revised version of a paper presented at the ICAME 2002 Conference in Gothenburg.
4. Erjavec, Tomaž; Ignat, C.; Pouliquen, B.; Steinberger, R. (2005): Massive Multilingual Corpus Compilation: ACQUIS Communautaire and totale. In: Proceedings of the Second Language Technology Conference, Poznan, Poland.
5. Erjavec, Tomaž; Fišer, Darja (2006): Building Slovene WordNet. In: Proceedings of the 5th International Conference on Language Resources and Evaluation LREC'06, 24-26 May 2006, Genoa, Italy.
6. Farreres, Xavier; Gibert, Karina; Rodriguez, Horacio (2004): Towards Binding Spanish Senses to Wordnet Senses through Taxonomy Alignment. In: Proceedings of the Second Global WordNet Conference, pp. 259-264, Brno, Czech Republic, January 20-23, 2004.
7. Fellbaum, C. (ed.) (1998): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts.
8. Fišer, Darja (2007a): Leveraging Parallel Corpora and Existing Wordnets for Automatic Construction of the Slovene Wordnet. In: Proceedings of the 3rd Language and Technology Conference L&TC'07, 5-7 October 2007, Poznan, Poland.
9. Fišer, Darja (2007b): A Multilingual Approach to Building Slovene Wordnet. In: Proceedings of the Workshop on A Common Natural Language Processing Paradigm for Balkan Languages, held within the Recent Advances in Natural Language Processing Conference RANLP'07, 26 September 2007, Borovets, Bulgaria.
10. Ide, Nancy; Erjavec, T.; Tufis, D. (2002): Sense Discrimination with Parallel Corpora. In: Proceedings of the ACL'02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pp. 54-60, Philadelphia.
11. Knight, K.; Luk, S. (1994): Building a Large-Scale Knowledge Base for Machine Translation. In: Proceedings of the American Association of Artificial Intelligence AAAI-94, Seattle, WA.
12. Krstev, Cvetana; Pavlović-Lažetić, G.; Vitas, D.; Obradović, I. (2004): Using Textual Resources in Developing Serbian Wordnet. In: Romanian Journal of Information Science and Technology, 7(1-2), pp. 147-161.
13. Och, Franz Josef; Ney, Hermann (2003): A Systematic Comparison of Various Statistical Alignment Models. In: Computational Linguistics, 29(1), pp. 19-51.
14. Pianta, Emanuele; Bentivogli, L.; Girardi, C. (2002): MultiWordNet: Developing an Aligned Multilingual Database. In: Proceedings of the First International Conference on Global WordNet, Mysore, India, January 21-25, 2002.
15. Resnik, Philip; Yarowsky, David (1997): A Perspective on Word Sense Disambiguation Methods and Their Evaluation. In: Proceedings of the ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How?, April 4-5, 1997, Washington, D.C., pp. 79-86.
16. Sedlacek, R.; Smrz, P. (2001): A New Czech Morphological Analyser ajka. In: Proceedings of the 4th International Conference on Text, Speech and Dialogue, Zelezna Ruda, Czech Republic.
17. Steinberger, Ralf; Pouliquen, Bruno; Widiger, Anna; Ignat, Camelia; Erjavec, Tomaž; Tufiş, Dan; Varga, Dániel (2006): The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 24-26 May 2006.
18. Tiedemann, Jörg (2003): Recycling Translations: Extraction of Lexical Data from Parallel Corpora and Their Application in Natural Language Processing. Doctoral Thesis, Studia Linguistica Upsaliensia 1.
19. Tufis, D.; Cristea, D.; Stamou, S. (2000): BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. In: Romanian Journal of Information Science and Technology, Special Issue, 7(1-2), pp. 9-43.
20. van der Plas, Lonneke; Tiedemann, Jörg (2006): Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity. In: Proceedings of ACL/COLING 2006.
21. Varga, D.; Halacsy, P.; Kornai, A.; Nagy, V.; Nemeth, L.; Tron, V. (2005): Parallel Corpora for Medium Density Languages. In: Proceedings of RANLP'2005, pp. 590-596, Borovets, Bulgaria.

Acknowledgements

I would like to thank Aleš Horák from the Faculty of Informatics, Masaryk University in Brno, for POS-tagging and lemmatizing the Czech part of the JRC-Acquis corpus.
