Leveraging parallel corpora and existing wordnets for automatic construction of the Slovene wordnet

Darja Fišer

Abstract

The paper reports on a series of experiments conducted in order to test the feasibility of automatically generating synsets for Slovene wordnet. The resources used were the multilingual parallel corpus of George Orwell’s Nineteen Eighty-Four and wordnets for several languages. First, the corpus was word-aligned to obtain multilingual lexicons and then these lexicons were compared to the wordnets in various languages in order to disambiguate the entries and attach appropriate synset ids to Slovene entries in the lexicon. Slovene lexicon entries sharing the same attached synset id were then organized into a synset. The results obtained by the different settings in the experiment are evaluated against a manually created goldstandard and also checked by hand.

1 Introduction

Research teams have approached the construction of their wordnets in different ways, depending on the lexical resources they had at their disposal. There is no doubt that manual construction of an independent wordnet by experts is the most reliable technique, as it yields the best results in terms of both linguistic soundness and accuracy of the created database. But such an endeavor is an extremely labor-intensive and time-consuming process, which is why alternative, fully automated or semi-automatic approaches have been proposed that try to leverage existing resources in order to facilitate faster and easier development of wordnets. The lexical resources that have proved useful for such a task fall into several categories.

(1) Princeton WordNet (PWN) is an indispensable resource and serves as the backbone of new wordnets in approaches following the expand model (Vossen 1998). This model takes a fixed set of synsets from PWN which are then translated into the target language, preserving the structure of the original wordnet. The cost of the expand model is that the resulting wordnets are heavily biased by the PWN, which can be problematic when the source and target linguistic systems differ significantly. Nevertheless, due to its greater simplicity, the expand model has been adopted in a number of projects, such as BalkaNet (Tufis 2000) and MultiWordNet (Pianta, Bentivogli and Girardi 2002).

(2) A very popular approach is to link English entries from machine-readable bilingual dictionaries to PWN synsets under the assumption that their counterparts in the target language correspond to the same synset (see Knight and Luk 1994). A well-known problem with this approach is that bilingual dictionaries are generally not concept-based but follow traditional lexicographic principles; the disambiguation of dictionary entries is therefore the biggest obstacle to overcome if this technique is to be useful.
(3) Machine-readable monolingual dictionaries have been used as a source for extracting taxonomy trees by parsing definitions to obtain genus terms and then disambiguating the genus word, resulting in a hyponymy tree (see Farreres et al. 1998). Problems that are inherent to the monolingual dictionary (circularity, inconsistencies, genus omission) as well as limitations with genus term disambiguation must be borne in mind when this approach is considered.

(4) Once taxonomies in the target language are available, they can be mapped to the target wordnet, enriching it with valuable information and semantic links at relatively low cost, or they can be mapped to ontologies in other languages in order to create multilingual ontologies (Farreres et al. 2004).

Our attempt in the construction of the Slovene wordnet will be to benefit from the resources we have available, which are mainly corpora. Based on the assumption that translations are a plausible source of semantics, we will use a multilingual parallel corpus to extract semantically relevant information. The idea that semantic insights can be derived from the translational relation has already been explored by e.g. Resnik and Yarowsky (1997), Diab (2002) and Ide et al. (2002). It is our hypothesis that senses of ambiguous words in one language are often translated into distinct words in another language. We further believe that if two or more words are translated into the same word in another language, then they often share some element of meaning. This is why we assume that the multilingual-alignment based approach will convey sense distinctions of a polysemous source word or yield synonym sets.

The paper is organized as follows: the next section gives a brief overview of the related work, after which the methodology for our experiment is explained in detail. Sections 2.1 and 2.2 present and evaluate the results obtained in the experiment, and the last section gives conclusions and work to be done in the future.

1.1 Related work

The following approaches are similar to ours in that they rely on word-aligned parallel corpora. Dyvik (2002) identified the different senses of a word based on corpus evidence and then grouped senses in semantic fields based on overlapping translations which indicate semantic relatedness of expressions. The fields and semantic features of their members were used to construct semilattices which were then linked to PWN. Diab (2004) took a word-aligned English-Arabic parallel corpus as input and clustered source words that were translated with the same target word. Then the appropriate sense for the words in clusters was identified on the basis of word sense proximity in PWN. Finally, the selected sense tags were propagated to the respective contexts in the parallel texts.

Sense discrimination with parallel corpora has also been investigated by Ide et al. (2002) who used the same corpus as input data as we did for this experiment but then used the extracted lexicon to cluster words into senses. Finding synonyms with word-aligned corpora was also at the core of work by van der Plas and Tiedemann (2006) whose approach differs from ours in the definition of what synonyms are, which in their case is a lot more permissive than in ours.

2 Methodology

In the experiment we used the parallel corpus of George Orwell’s Nineteen Eighty-Four (Erjavec and Ide 1998) in five languages: English, Czech, Romanian, Bulgarian and Slovene. Although the corpus is a literary text, the style of writing is ordinary, modern and not overly domain-specific. Furthermore, because the 100,000-word corpus is already sentence-aligned and tagged, the preprocessing phase was not demanding. First, all but the content words (nouns, main verbs, adjectives and adverbs) were discarded to facilitate the word-alignment process. A few formatting and encoding modifications were performed in order to conform to the input format required by the alignment tool.

The corpus was word-aligned with Uplug, a modular tool for automated corpus alignment (Tiedemann 2003). Uplug converts text files into XML, aligns the files at sentence level, then aligns them at word level and generates a bilingual lexicon. Because our files had already been formatted and sentence-aligned, the first two stages were skipped and we proceeded directly to word alignment. The advanced setting was used, which first creates basic clues for word alignments, then runs GIZA++ (Och and Ney 2003) with standard settings and aligns words with the existing clues. The alignments with the highest reliability scores are learned and the last two steps are repeated three times. This is Uplug’s slowest standard setting but, considering the relatively small size of the corpus, it was used because it yielded the best results. The output of the alignment process is a file with information on word-link certainty between the aligned pair of words and their unique ids (see Figure 1).
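The content-word filtering step can be sketched as follows. This is a minimal illustration, assuming the tagged corpus is available as a stream of (word, lemma, pos) triples; the single-letter tag prefixes (N, V, A, R) are illustrative and not the actual tagset used in the corpus.

```python
# Hypothetical sketch of the preprocessing step: keep only content words
# (nouns, verbs, adjectives, adverbs) before word alignment.

CONTENT_POS = ("N", "V", "A", "R")  # nouns, verbs, adjectives, adverbs

def keep_content_words(tokens):
    """Discard all but the content words to ease word alignment."""
    return [tok for tok in tokens if tok[2].startswith(CONTENT_POS)]

sentence = [("the", "the", "D"), ("clocks", "clock", "N"),
            ("were", "be", "V"), ("striking", "strike", "V"),
            ("thirteen", "thirteen", "M")]
print(keep_content_words(sentence))
# → [('clocks', 'clock', 'N'), ('were', 'be', 'V'), ('striking', 'strike', 'V')]
```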

Figure 1: An example of word links

Word ids were used to extract the lemmas from the corpus and to create bilingual lexicons. In order to reduce the noise in the lexicon as much as possible, only 1:1 links between words of the same part of speech were taken into account, and all alignments occurring only once were discarded. The generated lexicons contain all the translations of an English word in the corpus, together with the alignment frequency, part of speech and the corresponding word ids (see Figure 2). The size of each lexicon is about 1,500 entries.
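The link-filtering step just described can be sketched as follows, assuming the aligner outputs 1:1 links as (source lemma, source POS, target lemma, target POS) tuples; all names and data here are illustrative, not the actual Uplug output format.

```python
from collections import Counter

def build_lexicon(links):
    """Keep only links between words of the same part of speech that
    occur more than once; return {(src, pos, tgt): frequency}."""
    counts = Counter(links)
    return {(src, pos, tgt): freq
            for (src, pos, tgt, tgt_pos), freq in counts.items()
            if pos == tgt_pos and freq > 1}

links = ([("age", "n", "doba", "n")] * 3        # kept: same POS, frequency 3
         + [("age", "n", "obdobje", "n")]        # dropped: occurs only once
         + [("old", "a", "starost", "n")] * 2)   # dropped: POS mismatch
print(build_lexicon(links))
# → {('age', 'n', 'doba'): 3}
```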

3 0.075 age,n,doba,n oen.2.9.14.16.1.2;oen.2.9.14.16.3.2;oen.2.9.26.3.5.2;
2 0.154 age,n,obdobje,n oen.2.9.26.3.10.3;oen.2.9.26.4.6.5;
4 0.075 age,n,starost,n oen.1.4.24.1.1;oen.1.8.39.4.2;oen.2.4.55.4.3;
5 0.118 age,n,vek,n oen.1.2.38.2.1.8;oen.1.2.38.2.1.3;oen.1.2.38.2.1.1;
4 0.078 aged,a,star,a oen.1.6.4.3.12;oen.1.8.94.4.5;oen.2.1.53.12.4;
2 0.068 agree,v,strinjati,v oen.1.8.50.4.1;oen.3.4.7.1.2;
9 0.104 aim,n,cilj,n oen.1.4.24.10.6;oen.1.5.27.1.3;oen.2.8.55.6.2;
2 0.051 air,n,izraz,n oen.2.8.19.1.9;oen.2.8.52.2.3;
6 0.080 air,n,videz,n oen.2.5.2.7.12;oen.2.5.6.15.16;oen.3.1.19.6.7;
14 0.065 air,n,zrak,n oen.1.1.6.7.7;oen.1.1.7.2.12;oen.1.2.29.2.7;oen.1.3.3.7.3;

Figure 2: An example of bilingual lexicon entries

The bilingual lexicons were used to create multilingual lexicons. English lemmas and their word ids were used as a cross-over, and all the translations of an English word occurring more than twice were included. If an English word was translated by a single word in one language and by several words in another language, all the variants were included in the lexicon, because it is assumed that the difference in translation either signifies a different sense of the English word or that the variant is a synonym of another translation variant (see the translations for the English word “army” in Figure 3, which is translated by the same words in all the languages except Slovene). Whether the variants are synonymous or belong to different senses of a polysemous expression is to be determined in the next stage. In this way, three multilingual lexicons were created: En-Cs-Si (1,703 entries), En-Cs-Ro-Si (1,226 entries) and En-Cs-Ro-Bg-Si (803 entries).

English    freq (pos)  Slovene     Czech      Romanian  Bulgarian
answer     2 (v)       odgovoriti  odpovědět  răspunde  odgovoriti
argument   2 (n)       argument    argument   argument  argument
arm        3 (n)       roka        paže       braţ      ruka
arm        3 (n)       roka        ruka       braţ      ruka
army       2 (n)       armada      armáda     armată    armija
army       2 (n)       vojska      armáda     armată    armija

Figure 3: An example of multilingual lexicon entries

The obtained multilingual lexicons were then compared against the already existing wordnets in the corresponding languages. For English, the Princeton WordNet (Fellbaum 1998) was used, while for Czech, Romanian and Bulgarian the wordnets from the BalkaNet project (Tufis 2000) were used. The decision to use the BalkaNet wordnets is twofold: first, the languages included in the project correspond to the multilingual corpus we had available, and second, the wordnets were developed in parallel, cover a common sense inventory and are aligned to one another as well as to PWN, making the intersection easier. If a match was found between a lexicon entry and a literal of the same part of speech in the corresponding wordnet, the synset id was remembered for that language. If, after examining all the existing wordnets, there was an overlap of synset ids across all the languages (except Slovene, of course) for the same lexicon entry, it was assumed that the words in question all describe the concept marked with this id. Finally, the concept was extended to the Slovene part of the multilingual lexicon entry and the synset id common to all the languages was assigned to it. All the Slovene words sharing the same synset id were treated as synonyms and were grouped into synsets (see Figure 4).

ENG20-03500773-n luč svetilka
ENG20-04210535-n miza mizica
ENG20-05291564-n kraj mesto prostor
ENG20-05597519-n doktrina nauk
ENG20-05903215-n beseda vrsta
ENG20-06069783-n dokument papir košček
ENG20-06672930-n glas zvok
ENG20-07484626-n družba svet
ENG20-07686671-n armada vojska
ENG20-07859631-n generacija rod
ENG20-07995813-n kraj mesto prostor
ENG20-08095650-n kraj mesto prostor
ENG20-08692715-n zemlja tla svet
ENG20-09620847-n deček fant
ENG20-09793944-n jetnik ujetnik
ENG20-10733600-n luč svetloba

Figure 4: An example of the created synsets for Slovene

Other language-independent information (e.g. part of speech, domain, semantic relations) was inherited from the Princeton WordNet and an XML file was created. The automatically generated Slovene wordnet was loaded into VisDic, a graphical application for viewing and editing dictionary databases stored in XML format (Horak and Smrž 2000). In the experiment, four different settings were tested: SLOWN1 was created from an English-Slovene lexicon compared against the English wordnet; for SLOWN2, Czech was added; SLOWN3 was created using these as well as Romanian; and SLOWN4 was obtained by including Bulgarian as well.
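The intersection-and-grouping step described above can be sketched as follows, under the assumption that each wordnet can be queried as a dictionary mapping a (literal, POS) pair to the set of synset ids containing it; all names and data here are illustrative.

```python
from collections import defaultdict

def slovene_synsets(entries, wordnets):
    """entries: list of {language: (literal, pos)} lexicon rows, with key 'sl'
    for Slovene; wordnets: {language: {(literal, pos): set of synset ids}}.
    Returns {synset id: set of Slovene literals}."""
    synsets = defaultdict(set)
    for entry in entries:
        common = None  # synset ids shared by all non-Slovene languages so far
        for lang, lit_pos in entry.items():
            if lang == "sl":
                continue
            ids = wordnets[lang].get(lit_pos, set())
            common = ids if common is None else common & ids
        # extend the concept to the Slovene word only if all languages agree
        for sid in common or ():
            synsets[sid].add(entry["sl"][0])
    return dict(synsets)

wordnets = {"en": {("army", "n"): {"ENG20-07686671-n"}},
            "cs": {("armáda", "n"): {"ENG20-07686671-n"}}}
entries = [{"en": ("army", "n"), "cs": ("armáda", "n"), "sl": ("armada", "n")},
           {"en": ("army", "n"), "cs": ("armáda", "n"), "sl": ("vojska", "n")}]
# both Slovene variants end up in the same "army" synset ENG20-07686671-n
print(slovene_synsets(entries, wordnets))
```

Slovene words assigned the same id this way are grouped into one synset, mirroring the “armada / vojska” example in Figures 3 and 4.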

2.1 Results

In our previous work, a version of the Slovene wordnet (SLOWN0) was created by translating the Serbian wordnet (Krstev et al. 2004) into Slovene with a Serbian-Slovene dictionary (see Erjavec and Fišer 2006). The main disadvantage of that approach was the inadequate disambiguation of polysemous words, which required extensive manual editing of the results. SLOWN0 contains 4,688 synsets, all from Base Concept Sets 1 and 2. Nouns prevail (3,210), followed by verbs (1,442) and adjectives (36). There are no adverbs in BCS1 and 2, which is why none were translated into Slovene. The average synset length (number of literals per synset) is 5.9 and the synsets cover 119 domains (see Table 2).

In the latest approach, aimed at further extending the existing core wordnet for Slovene, we attempted to take advantage of several available resources (parallel corpora and wordnets for other languages). At the same time we were interested in an approach that would yield more reliable synset candidates. This is why we experimented with four different settings to induce Slovene synsets, each using more resources and including more languages, in order to establish which one is the most efficient from a cost-benefit point of view.

SLOWN1: This approach, using resources for two languages, is the most similar to our previous experiment, except that this time an automatically generated lexicon was used instead of a bilingual glossary and the much larger English wordnet was used instead of the Serbian one (see Table 1). SLOWN1 contains 6,746 synsets belonging to all three Base Concept Sets and beyond. Many more verb (2,310) and adjective synsets (1,132) were obtained, complemented by adverbs (340) as well.

The average synset length is much lower (2.0), which could be a sign of a much higher precision than in the previous wordnet, while the domain space has grown (126).

         synsets  avg l/s  bcs1   bcs2   bcs3   other    domains
PWN      115,424  1.7      1,218  3,471  3,827  106,908  164
CSWN      28,405  1.5      1,218  3,471  3,823   19,893  156
ROWN      18,560  1.6      1,189  3,362  3,593   10,416  150
BGWN      21,105  2.1      1,218  3,471  3,827   12,589  155
GOLDST     1,179  2.1        320    804     55        0   85

         nouns   max/avg l/s   verbs   max/avg l/s   adjectives  max/avg l/s   adverbs  max/avg l/s
PWN      79,689  28 / 1.8      13,508  24 / 1.8      18,563      25 / 1.7      3,664    11 / 1.6
CSWN     20,773  12 / 1.4       5,126  10 / 1.8       2,128       8 / 1.4        164     4 / 1.5
ROWN     12,382  17 / 1.7       4,603  12 / 2.1         833      11 / 1.5        742    10 / 1.6
BGWN     13,924  10 / 1.8       4,163  18 / 3.6       3,009       9 / 1.7          9     2 / 1.2
GOLDST      828   8 / 1.9         343  10 / 2.4           8       4 / 1.6          0     0 / 0

Table 1: Statistics for the existing wordnets

SLOWN2: Starting from the assumption that polysemous words tend to be realized differently in different languages, and that translations themselves could therefore be used to discriminate between the different senses of words as well as to determine relations among lexemes, Czech was added to the lexicon as well as to the lexicon-wordnet comparison stage. As a result, the number of obtained synsets is much lower (1,501), as is the number of domains represented by the wordnet (87). The average synset length is 1.8 and nouns still represent the largest portion of the wordnet (870).

         synsets  avg l/s  bcs1   bcs2   bcs3  other  domains
SLOWN0     4,688  5.9      1,219  3,469     0      0  119
SLOWN1     6,746  2.0        588  1,063   663  4,432  126
SLOWN2     1,501  1.8        324    393   230    554   87
SLOWN3     1,372  2.4        293    359    22    496   83
SLOWN4       549  1.8        166    172    99    112   60

         nouns   max/avg l/s   verbs   max/avg l/s   adjectives  max/avg l/s   adverbs  max/avg l/s
SLOWN0    3,210  40 / 4.8       1,442  96 / 8.4          36      30 / 8.2          0     0 / 0
SLOWN1    2,964  10 / 1.4       2,310  76 / 3.3       1,132       4 / 1.2        340    20 / 2.1
SLOWN2      870   7 / 1.4         483  15 / 2.7         118       4 / 1.1         30     5 / 1.6
SLOWN3      671   6 / 1.4         639  30 / 3.7          32       2 / 1.0         30     3 / 1.4
SLOWN4      291   4 / 1.7         249  26 / 2.6           9       2 / 1.1          0     0 / 0
Table 2: Results obtained in the experiment

SLOWN3: It was assumed that by adding another language recall would fall but precision would increase, and we wished to test this hypothesis step by step. In this setting, the multilingual lexicon was therefore extended with Romanian translations and the Romanian wordnet was used to obtain an intersection between lexicon entries and wordnet synsets. The number of synsets falls slightly in SLOWN3 but the average number of literals per synset increases.

The domain space is virtually the same, as is the proportion of nouns, verbs and adverbs, with adjectives falling the most drastically.

SLOWN4: Finally, Bulgarian was added both in the lexicon and in the wordnet lookup stage. The last wordnet created within this experiment contains only 549 synsets, with an average synset length of 1.8. The number of noun and verb synsets is almost the same, and in this case no adverbial synsets were obtained. The number of domains represented in this wordnet is 60 (see Table 2).

2.2 Evaluation and discussion of the results

In order to evaluate the results obtained in the experiment and to decide which setting performs best, the generated synsets were evaluated against a goldstandard that was created by hand. The goldstandard contains 1,179 synsets from all three Base Concept Sets. This is why the evaluation only takes into account the automatically generated synsets that belong to these three categories. The average synset length in the goldstandard is 2.1, which is comparable to the Bulgarian wordnet and to the Slovene wordnets created within this experiment. The synsets belong to 80 different domains; most of them are nouns (828), but there are no adverbs in the goldstandard, which is why adverbs will not be evaluated. Also, since only three adjective synsets overlapped between the goldstandard and the generated wordnets, adjectives were excluded from the evaluation as well. Table 3 shows the results of the automatic evaluation of the wordnets obtained in the experiment. The goldstandard also contains multi-word literals; the automatic method presented in this paper is limited to one-word translation candidates, which is why multi-word literals were disregarded in the evaluation.

The most straightforward approach to evaluating the quality of the obtained wordnets would be to compare the generated synsets with the corresponding synsets from the goldstandard. But in this way we would be penalizing the automatically induced wordnets for missing literals which are not part of the vocabulary of the corpus that was used to generate the lexicons in the first place. Instead, we opted for a somewhat different approach: comparing literals in the goldstandard and in the automatically induced wordnets with regard to which synsets they appear in. This information was used to calculate precision, recall and f-measure. This seems a fairer approach because of the restricted input vocabulary.
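The literal-based scoring can be sketched as follows. This is an illustration of the idea rather than the authors' exact scoring code: both wordnets are represented as {synset id: set of literals}, literals are compared with regard to which synsets they appear in, and only literals present in both resources are scored, so missing corpus vocabulary is not penalized.

```python
def evaluate(generated, gold):
    """Literal-based precision, recall and f-measure."""
    def by_literal(wn):
        # invert {synset id: literals} to {literal: set of synset ids}
        index = {}
        for sid, literals in wn.items():
            for lit in literals:
                index.setdefault(lit, set()).add(sid)
        return index

    gen, ref = by_literal(generated), by_literal(gold)
    shared = set(gen) & set(ref)                    # literals in both resources
    tp = sum(len(gen[l] & ref[l]) for l in shared)  # correct assignments
    n_pred = sum(len(gen[l]) for l in shared)
    n_true = sum(len(ref[l]) for l in shared)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f

generated = {"s1": {"luč"}, "s3": {"luč"}}  # "luč" assigned to two synsets
gold = {"s1": {"luč", "svetilka"}}          # only s1 is correct
p, r, f = evaluate(generated, gold)
print(round(p, 2), round(r, 2), round(f, 2))
# → 0.5 1.0 0.67
```

Note how "svetilka", which never occurs in the corpus-derived lexicon, does not hurt recall here, matching the restricted-vocabulary rationale above.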
                 nouns                               verbs                               total
         n    precision  recall  f-meas.     n    precision  recall  f-meas.     n    precision  recall  f-meas.
SLOWN0   261  50.6%      89.3%   64.6%       174  43.8%      74.7%   55.3%       445  48.0%      83.5%   61.0%
SLOWN1   322  70.2%      87.3%   77.8%       127  35.8%      70.3%   47.4%       449  60.6%      82.6%   69.9%
SLOWN2   223  78.4%      81.7%   80.0%        79  54.2%      66.2%   59.6%       302  72.3%      77.6%   74.9%
SLOWN3   179  73.0%      77.4%   75.1%        69  37.5%      72.5%   49.4%       248  63.2%      76.2%   69.1%
SLOWN4   103  84.1%      78.2%   81.1%        53  46.0%      59.1%   51.7%       156  71.3%      71.5%   71.4%

Table 3: Automatic evaluation of the results

As can be seen from Table 3, the approach taken in this experiment outperforms the previous attempt, in which a bilingual dictionary was used, as far as precision is concerned. Also, the method works much better for nouns than for verbs, regardless of the setting used. In general, there is a steady growth in precision and a corresponding fall in recall, with a gradual growth in f-measure (see Table 3). The best results are obtained by merging resources for English, Czech and Slovene (74.89%); when Romanian is added the results fall (69.10%), and they rise again when Bulgarian is added, reaching almost the same level as setting 2 (71.34%). It is interesting to see the drop in quality when Romanian is added; this might occur either because the word alignment is worse for English-Romanian (because of a freer translation and consequently poorer quality of the sentence alignment of the corpus), or it might be due to the properties of the Romanian wordnet version we used for the experiment, which is smaller than the other wordnets.

In order to gain insight into what actually goes on with the synsets in the different settings, we generated an intersection of all the induced wordnets and checked them manually. Automatic evaluation shows that the method works best for nouns, which is why we focus on them in the rest of this section. The sample we used for manual evaluation contains 165 synsets which are the same in all the generated wordnets and can therefore be directly compared. In the manual evaluation we checked whether the generated synset contains a correct literal at all. We classified errors into several categories: the wrong literal is a hypernym of the concept in question (more general), the wrong literal is a hyponym of the concept (more specific), the wrong literal is semantically related to the concept (meronym, holonym, antonym), or the literal is simply wrong. The results of the manual evaluation of the small sample confirm the results obtained by automatic evaluation.
However, it was interesting to see that even though SLOWN2 contains slightly more errors than SLOWN4, there is more variation (more synset members) in SLOWN2, which makes it a more useful resource and is therefore the preferred setting (see Table 4). Unsurprisingly, the least problematic synsets are those lexicalizing specific concepts (such as “rat”, “army”, “kitchen”) and the most difficult ones were those containing highly polysemous words and describing vague concepts (e.g. “face”, which as a noun has 13 different senses in PWN, or “place”, which as a noun has 16 senses).

         total       fully correct  no correct  at least 1    hypernym  hyponym    sem. related  more error
         synsets     synsets        literal     correct lit.                       literal       types
SLOWN1   165 (100%)   96 (58.2%)     6 (3.6%)   43 (26.0%)    3 (1.8%)   6 (3.6%)  2 (1.2%)      8 (4.8%)
SLOWN2   165 (100%)  103 (62.4%)     5 (3.0%)   37 (22.4%)    3 (1.8%)   6 (3.6%)  4 (2.4%)      5 (3.0%)
SLOWN3   165 (100%)  119 (72.1%)    10 (6.0%)   20 (12.1%)    3 (1.8%)  10 (6.0%)  2 (1.2%)      0 (0.0%)
SLOWN4   165 (100%)  134 (81.2%)     9 (5.4%)   14 (8.4%)     0 (0.0%)   6 (3.6%)  1 (0.6%)      0 (0.0%)

Table 4: Manual evaluation of the results

3 Conclusions and future work

Lexical semantics is far from trivial for computers as well as for humans. This can be seen from the relatively low inter-annotator agreement scores obtained when humans are asked to assign a sense to a word from the same parallel corpus as was used in this study (approx. 75% for WordNet senses, as reported by Ide et al. 2002). Considering the size of the lexicons, the results obtained in the presented set of experiments lie within these figures, showing that the method is promising for nouns (much less so for other parts of speech) and should be investigated further on other, larger and more varied corpora.

The next step will therefore be the application of the second setting, which yielded the best results in this experiment, to the multilingual and much larger ACQUIS corpus (Steinberger et al. 2006). Attempts have already been made to word-align the ACQUIS corpus (e.g. Giguet and Luquet 2006), but their alignments are not useful for our method as they vary greatly in length and also include function words. This is why the word-alignment phase will have to be done from scratch, a non-trivial task because a lot of preprocessing (tagging, lemmatization and sentence alignment) is required. The results could further be improved by using the latest versions of the wordnets. The ones used in this experiment date from 2004, when the BalkaNet project ended, but the teams have continued developing their wordnets, which are now much larger and better resources.

4 References

Diab, Mona (2004): The Feasibility of Bootstrapping an Arabic WordNet Leveraging Parallel Corpora and an English WordNet. In: Proceedings of the Arabic Language Technologies and Resources, NEMLAR, Cairo, 2004.

Dyvik, Helge (2002): Translations as Semantic Mirrors: From Parallel Corpus to Wordnet. Revised version of a paper presented at the ICAME 2002 conference in Gothenburg.

Erjavec, Tomaž, Darja Fišer (2006): Building Slovene WordNet. In: Proceedings of the 5th International Conference on Language Resources and Evaluation LREC'06, 24-26 May 2006, Genoa, Italy.

Farreres, Xavier, G. Rigau, H. Rodríguez (1998): Using WordNet for Building WordNets. In: Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, Canada.

Farreres, Xavier, Karina Gibert, Horacio Rodríguez (2004): Towards Binding Spanish Senses to Wordnet Senses through Taxonomy Alignment. In: Proceedings of the Second Global WordNet Conference, pp. 259-264, Brno, Czech Republic, January 20-23, 2004.

Fellbaum, Christiane (1998): WordNet: An Electronic Lexical Database. MIT Press.

Giguet, Emmanuel, Pierre-Sylvain Luquet (2006): Multilingual Lexical Database Generation from Parallel Texts in 20 European Languages with Endogenous Resources. In: Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions.

Horak, Ales, Pavel Smrz (2000): New Features of Wordnet Editor VisDic. In: Romanian Journal of Information Science and Technology Special Issue (Volume 7, No. 1-2).

Ide, Nancy, Tomaž Erjavec, Dan Tufis (2002): Sense Discrimination with Parallel Corpora. In: Proceedings of the ACL'02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, pp. 54-60.

Knight, K., S. Luk (1994): Building a Large-Scale Knowledge Base for Machine Translation. In: Proceedings of the American Association of Artificial Intelligence AAAI-94, Seattle, WA.

Krstev, Cvetana, G. Pavlović-Lažetić, D. Vitas, I. Obradović (2004): Using Textual Resources in Developing Serbian Wordnet. In: Romanian Journal of Information Science and Technology (Volume 7, No. 1-2), pp. 147-161.

Och, Franz Josef, Hermann Ney (2003): A Systematic Comparison of Various Statistical Alignment Models. In: Computational Linguistics (Volume 29, No. 1).

Pianta, Emanuele, L. Bentivogli, C. Girardi (2002): MultiWordNet: Developing an Aligned Multilingual Database. In: Proceedings of the First International Conference on Global WordNet, Mysore, India, January 21-25, 2002.

Resnik, Philip, David Yarowsky (1997): A Perspective on Word Sense Disambiguation Methods and Their Evaluation. In: ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How?, April 4-5, 1997, Washington, D.C., pp. 79-86.

Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006): The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 24-26 May 2006.

Tiedemann, Jörg (2003): Recycling Translations: Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing. Doctoral thesis, Studia Linguistica Upsaliensia 1.

Tufis, Dan (2000): BalkaNet: Design and Development of a Multilingual Balkan WordNet. In: Romanian Journal of Information Science and Technology Special Issue (Volume 7, No. 1-2).

van der Plas, Lonneke, Jörg Tiedemann (2006): Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity. In: Proceedings of ACL/COLING 2006.

Vossen, Piek (ed.) (1998): EuroWordNet: A Multilingual Database with Lexical Semantic Networks for European Languages. Kluwer, Dordrecht.
