A Multilingual Approach to Building Slovene Wordnet Darja Fišer Department of Translation, Faculty of Arts, University of Ljubljana Aškerčeva 2 1000 Ljubljana, Slovenia
[email protected]
Abstract The paper presents an experiment in which synsets for Slovene wordnet were induced automatically from several multilingual resources. Our research is based on the assumption that translations are a plausible source of semantically relevant information. More specifically, we argue that the translational relation on the one hand reduces the ambiguity of a source word and on the other conveys the semantic relatedness of a set of target words. We tried to identify sense distinctions of polysemous words and obtain sets of synonyms by first extracting multilingual lexicons from the word-aligned JRC-Acquis parallel corpus and then comparing them with already existing wordnets in various languages. At this stage, lexicon entries were disambiguated and appropriate synset ids were assigned to their Slovene translation equivalents. Finally, the Slovene lexicon entries sharing the same assigned synset id were organized into a synset.
Keywords wordnet, word senses, semantic relations, parallel corpora, automatic word alignment, multilingual lexicon extraction
1. Introduction
WordNet [7] is an extensive lexical database in which words are divided by part of speech and organized into a hierarchy of nodes. Each node represents a concept, and words denoting the same concept are grouped into a synset with a unique id (e.g. ENG20-02853224-n: {car, auto, automobile, machine, motorcar}). Concepts are defined by a short gloss (e.g. 4-wheeled motor vehicle; usually propelled by an internal combustion engine) and are also linked to other relevant synsets in the database (e.g. hypernym: {motor vehicle, automotive vehicle}, hyponym: {cab, hack, taxi, taxicab}). Over time, WordNet has become one of the most valuable resources for a wide range of NLP applications, which has initiated the development of wordnets for many other languages as well.¹ One such enterprise is the construction of Slovene WordNet [5]. While manual construction of a wordnet is the most reliable approach and produces the best results as far as linguistic soundness and accuracy are concerned, such an endeavor is too time-consuming and too expensive to be feasible for most research teams. This is why alternative, semi- or fully automatic approaches have been proposed.
¹ http://www.globalwordnet.org/gwa/wordnet_table.htm
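To make this structure concrete, the snippet below sketches one way a synset record like the example above could be represented in code. It is an illustrative data structure only, not the format actually used by WordNet or the tools mentioned later, and the relation targets are shown as literals from the example rather than real synset ids.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    """Illustrative representation of one wordnet node (not the actual WordNet format)."""
    synset_id: str                                  # e.g. "ENG20-02853224-n"
    literals: list                                  # synonyms denoting the concept
    gloss: str = ""                                 # short definition
    relations: dict = field(default_factory=dict)   # relation name -> related concepts

car = Synset(
    synset_id="ENG20-02853224-n",
    literals=["car", "auto", "automobile", "machine", "motorcar"],
    gloss="4-wheeled motor vehicle; usually propelled by an internal combustion engine",
    relations={
        "hypernym": ["motor vehicle", "automotive vehicle"],  # shown as literals for readability
        "hyponym": ["cab", "hack", "taxi", "taxicab"],
    },
)
```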
By taking advantage of existing resources, these approaches facilitate faster and easier development of a wordnet. The approaches to building, extending and enriching wordnets vary according to the resources available for a particular language, ranging from Princeton WordNet [15, 20] and bilingual and monolingual dictionaries [12] to taxonomies and ontologies [6]. In the process of constructing the Slovene wordnet we try to leverage the resources we have available, which are mainly corpora. Based on the assumption that translations are a plausible source of semantics, we have used a multilingual parallel corpus to extract semantically relevant information. The idea that semantic insights can be derived from the translational relation has already been explored by e.g. [1, 3, 11]. It is our hypothesis that senses of ambiguous words in one language are often translated into distinct words in another language (e.g. the Slovene equivalent of the English word ‘school’ is ‘šola’ when it means an educational institution and ‘jata’ when it means a large group of fish). We further believe that if two or more words are translated into the same word in another language, they often share some element of meaning (e.g. both ‘fant’ and ‘deček’ translate the English word ‘boy’, a young male person, and are therefore likely synonyms). This is why we assume that the multilingual alignment-based approach will convey sense distinctions of a polysemous source word (as in the first example) or yield synonym sets (as in the second example).

The paper is organized as follows: a brief overview of related work is given in the next section, after which the methodology of our experiment is described. Sections 3 and 4 present and evaluate the results obtained in the experiment, and the last section gives conclusions and future work.
1.1 Related work
Below we outline some approaches that are similar to ours in that they extract semantically relevant information from word-aligned parallel corpora. Dyvik [3] identified the different senses of a word based on corpus evidence and then grouped senses into semantic fields based on overlapping translations, which indicate semantic relatedness of expressions. The fields and the semantic features of their members were used to construct semilattices, which were then linked to PWN.
Diab [1] took a word-aligned English-Arabic parallel corpus as input and clustered source words that were translated with the same target word. Then the appropriate sense for the words in the clusters was identified on the basis of word sense proximity in PWN. Finally, the selected sense tags were propagated to the respective contexts in the parallel texts. Sense discrimination with parallel corpora has also been investigated by Ide et al. [11], who used the extracted lexicon to cluster words into senses. Finding synonyms with word-aligned corpora was also at the core of the work by van der Plas and Tiedemann [21], whose approach differs from ours in its definition of synonymy, which is considerably more permissive than ours.

In a previous related experiment [8] we used the Multext-East parallel corpus [2], Princeton WordNet [7] and the BalkaNet wordnets [20]. The corpus was word-aligned and a multilingual lexicon was extracted. Lexicon entries were then mapped onto the existing wordnets and Slovene lexicon entries were assigned appropriate synset ids. These were then used to generate Slovene synsets. The evaluation of the results showed that the approach is promising. However, the corpus used was relatively small (100,000 words per language) and it contained only a single text, the novel “1984” by George Orwell. The number of generated synsets was therefore rather low, especially when more languages were added at the disambiguation stage. The aim of the present experiment is to take the best-performing settings from the previous experiment and apply them to a much larger dataset, the JRC-Acquis corpus [18]. This corpus is much larger (10 million words per language) but also domain-specific, as it contains EU legal acts only. The differences in size and genre compared to the “1984” corpus make it interesting to observe both the range and the quality of the created synsets.
2. Description of the approach

2.1 Parallel corpus
The JRC-Acquis corpus contains EU legislation, declarations and resolutions, and international agreements in all 20 official languages of the EU plus Romanian, and is currently the only parallel corpus of its size in so many languages. It is available in TEI-compliant XML format and includes marked-up texts and bilingual alignment information for all of the 190+ language pairs [18]. The JRC-Acquis is paragraph-aligned with HunAlign [22] but is not tagged, lemmatized, sentence- or word-aligned. Attempts have already been made to word-align the ACQUIS corpus [9], but these alignments are not useful for our method as they vary greatly in length, include function words and do not contain POS information. This means that the pre-processing stage in this experiment was much more demanding than with the 1984 corpus.
We therefore decided to initially limit the languages involved to English, Czech and Slovene, because this combination performed well in the previous experiment, with the aim of extending the setup to Bulgarian and Romanian in the near future. The English and Slovene parts of the corpus were first tokenized, after which word-forms were tagged with their context-disambiguated part of speech (morphosyntactic description) and finally lemmatized with totale [4] trained on the 1984 corpus, while the Czech part was kindly tagged for us with Ajka [17] by the team from the Faculty of Informatics of Masaryk University in Brno.

For the dataset used in the experiment, we chose the first 2,000 documents from the JRC-Acquis corpus. All function words were discarded, after which the corpus was sentence- and word-aligned with Uplug [19], for which the slowest but best-performing ‘advanced’ setting was used. This setting first creates basic clues for word alignments, then runs GIZA++ [14] with standard settings and aligns words with the existing clues. Alignments with the highest confidence measure are learned and the last two steps are repeated three times. The output of the alignment process is a file of word links with information on the certainty of each link between the aligned pair of words and their unique ids.
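As an illustration of this last step, the sketch below reads such a word-link file and keeps only sufficiently confident one-to-one links. It assumes an XCES-style output in which each wordLink element carries a certainty attribute and an xtargets attribute with the source and target token ids separated by a semicolon; the exact element and attribute names may differ between Uplug versions, so this is a sketch under those assumptions rather than a drop-in parser.

```python
import xml.etree.ElementTree as ET

def read_word_links(path, min_certainty=0.0):
    """Yield (source_token_id, target_token_id, certainty) triples from an
    XCES-style word alignment file, keeping only confident 1:1 links.
    Assumes <wordLink certainty="..." xtargets="srcId;trgId"/> elements;
    attribute names may differ between Uplug versions."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.endswith("wordLink"):
            certainty = float(elem.get("certainty", 0.0))
            xtargets = elem.get("xtargets", "")
            if certainty >= min_certainty and ";" in xtargets:
                src, trg = xtargets.split(";", 1)
                # 1:1 links only: exactly one token id on each side
                if " " not in src.strip() and " " not in trg.strip():
                    yield src.strip(), trg.strip(), certainty
            elem.clear()  # free memory while streaming through a large file
```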
2.2 Lexicon extraction
Word ids from the file with word alignments were used to extract the lemmas from the corpus and to create bilingual lexicons (English-Czech and English-Slovene). In order to reduce noise in the lexicon as much as possible, only 1:1 links between words of the same part of speech were taken into account, and all alignments occurring only once were discarded. Tokens containing non-letter strings were also filtered out because these are usually document reference numbers and are not relevant for our task. The generated lexicons contain all the translations of an English word in the corpus, along with the alignment frequency, part of speech and the corresponding word ids. Each lexicon contains almost 10,000 entries.

The extracted bilingual lexicons were used to create a multilingual lexicon. English lemmas and their word ids were used as a pivot, and all the translations of an English word occurring more than once were included. If an English word was translated by a single word in one language and by several words in another, all the variants were included in the lexicon, because it is assumed that the difference in translation either signals a different sense of the English word or that the variants are synonyms of one another. The next step determines whether the variants are synonymous or belong to different senses of a polysemous expression. The generated English-Czech-Slovene lexicon contains 8,400 entries.
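A minimal sketch of this lexicon-building step, assuming the word links have already been resolved into (English lemma, English POS, target lemma, target POS) tuples via the token ids; the filters mirror those described above (1:1 links between words of the same part of speech, purely alphabetic tokens, alignment frequency above one), and the function and variable names are illustrative.

```python
from collections import Counter, defaultdict

def build_bilingual_lexicon(aligned_lemmas):
    """aligned_lemmas: iterable of (en_lemma, en_pos, trg_lemma, trg_pos) tuples
    obtained by resolving word-link ids against the lemmatized corpus.
    Returns {(en_lemma, pos): {trg_lemma: frequency}}."""
    counts = Counter()
    for en, en_pos, trg, trg_pos in aligned_lemmas:
        if en_pos != trg_pos:
            continue                      # keep same-POS links only
        if not (en.isalpha() and trg.isalpha()):
            continue                      # drop reference numbers and similar tokens
        counts[(en, en_pos, trg)] += 1
    lexicon = defaultdict(dict)
    for (en, pos, trg), freq in counts.items():
        if freq > 1:                      # discard alignments occurring only once
            lexicon[(en, pos)][trg] = freq
    return lexicon

def merge_lexicons(en_sl, en_cs):
    """Use the English side as a pivot to build English-Czech-Slovene entries."""
    return {key: (en_sl[key], en_cs[key]) for key in en_sl if key in en_cs}
```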
2.3 Sense assignment and synset induction
In the next phase, English entries from the multilingual lexicon were mapped onto Princeton WordNet [7] and Czech entries onto the Czech wordnet developed within the BalkaNet project [20]. If a match was found between a lexicon entry and a literal of the same part of speech in the corresponding wordnet, the synset id was remembered for that language. If, after examining the existing wordnets, the same synset id was found in both languages for a given lexicon entry, it was assumed that the words in question describe the concept denoted by that id. Finally, the concept was extended to the Slovene part of the multilingual lexicon entry, and the synset id common to both languages was assigned to it. All the Slovene words sharing the same synset id were treated as synonyms and grouped into synsets. Other language-independent information (e.g. part of speech, domain, semantic relations) was inherited from the Princeton WordNet, and an XML file was created. The automatically generated Slovene wordnet was loaded into VisDic, a graphical application for viewing and editing dictionary databases stored in XML format [10].
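The mapping and synset-induction step can be summarized as follows, assuming simple lookup tables from (literal, POS) to sets of synset ids, built beforehand from the Princeton and Czech wordnet files; the table and function names are hypothetical.

```python
from collections import defaultdict

def induce_slovene_synsets(multilingual_lexicon, pwn_index, cswn_index):
    """multilingual_lexicon: {(en_lemma, pos): (slovene_translations, czech_translations)}
    pwn_index / cswn_index: {(literal, pos): set_of_synset_ids} from the English
    and Czech wordnets. Returns {synset_id: set_of_slovene_literals}."""
    slovene_synsets = defaultdict(set)
    for (en, pos), (sl_translations, cs_translations) in multilingual_lexicon.items():
        en_ids = pwn_index.get((en, pos), set())
        for cs in cs_translations:
            cs_ids = cswn_index.get((cs, pos), set())
            # keep only synset ids licensed by both the English and the Czech member
            for synset_id in en_ids & cs_ids:
                for sl in sl_translations:
                    slovene_synsets[synset_id].add(sl)
    return slovene_synsets
```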
3. Results
In our previous experiment, Slovene wordnet was created from the Multext-East parallel corpus. The results were encouraging, which is why we wished to test the best-performing approach on the JRC-Acquis corpus, which comes from an entirely different domain and is much larger. Due to the different characteristics of the resource used, it is interesting to observe the change in synset coverage and quality. However, because the corpus is not annotated with the linguistic information needed, a lot of pre-processing was required beforehand, which is why the present experiment had to be limited to English, Czech and Slovene, although the previous study shows that the quality of the generated synsets improves if more languages are included.

Table 1 shows the statistics of the Princeton (PWN) and Czech (CSWN) wordnets that were used to generate Slovene synsets, as well as the statistics of the manually created goldstandard that is used for automatic evaluation of the results of this experiment (see Section 4). The goldstandard was created from the Serbian wordnet [13], which was translated into Slovene with a Serbian-Slovene dictionary [5]. It contains 3,169 synsets from Base Concept Sets 1-3. There are no adverbs in the goldstandard, which is why they are not evaluated or discussed in this paper.

Table 2 gives the statistics for the two versions of Slovene wordnet generated in the previous experiment and for the version obtained from the JRC-Acquis corpus. SLOWN1 was created by mapping Orwell's English-Slovene lexicon onto Princeton WordNet. It contains 6,746 synsets belonging to all three Base Concept Sets and beyond. Average synset length is 2.0 and the synsets belong to 126
different domains. The much smaller SLOWN2 was created from Orwell's English-Czech-Slovene lexicon and contains 1,501 synsets from 87 domains. There are 1.8 literals per synset on average, most of which are nominal.

Table 1. Statistics for the existing wordnets

                          PWN        CSWN      GOLDST
synsets                   115,424    28,405    3,169
avg. l/s                  1.7        1.5       1.9
bcs1                      1,218      1,218     1,188
bcs2                      3,471      3,471     1,857
bcs3                      3,827      3,823     123
other                     106,908    19,893    0
domains                   164        156       108
nouns                     79,689     20,773    2,422
  max l/s                 28         12        10
  avg. l/s                1.8        1.4       1.8
verbs                     13,508     5,126     712
  max l/s                 24         10        16
  avg. l/s                1.8        1.8       2.3
adjectives                18,563     2,128     35
  max l/s                 25         8         5
  avg. l/s                1.7        1.4       2.1
adverbs                   3,664      164       0
  max l/s                 11         4         0
  avg. l/s                1.6        1.5       0.0
SLOWNJRC is the wordnet generated in this experiment. It is larger than both older versions (7,168 synsets), though not as large as was expected considering the much bigger size of the corpus used. Another interesting fact is that, although this version contains more synsets than the older ones, they belong to fewer domains. However, these observations are less surprising when it is borne in mind that the corpus and the vocabulary used in it are much more domain-specific. Average synset length is 2.5, which suggests lower precision of the generated synsets.

Table 2. Statistics for the versions of Slovene wordnet

                          SLOWN1    SLOWN2    SLOWNJRC
synsets                   6,746     1,501     4,768
avg. l/s                  2.0       1.8       2.5
bcs1                      588       324       283
bcs2                      1,063     393       400
bcs3                      663       230       229
other                     4,432     554       3,753
domains                   126       87        103
nouns                     2,964     870       3,528
  max l/s                 10        7         9
  avg. l/s                1.4       1.4       2.6
verbs                     2,310     483       619
  max l/s                 76        15        9
  avg. l/s                3.3       2.7       2.1
adjectives                1,132     118       572
  max l/s                 4         4         8
  avg. l/s                1.2       1.2       2.1
4. Evaluation and discussion
4.1 Precision, recall and f-measure
The results obtained in the experiment were first evaluated against the manually created goldstandard, which also contains multi-word literals. The automatic method presented in this paper is limited to one-word translation candidates, which is why multi-word literals were not included in the evaluation. The most straightforward approach to evaluating the quality of the obtained wordnets would be to compare the generated synsets with the corresponding synsets from the goldstandard. But in this way we would be penalizing the automatically induced wordnets for missing literals which are not part of the vocabulary of the corpus that was used to generate the lexicons in the first place. Instead we opted for a somewhat different approach: literals in the goldstandard and in the automatically induced wordnets are compared with regard to which synsets they appear in. This information was used to calculate precision, recall and f-measure. Precision is the proportion of the synset ids assigned to a literal that are relevant, i.e. also assigned to that literal in the goldstandard. Recall is the proportion of the relevant synset ids available for a literal that were actually retrieved. Finally, precision and recall were combined in the traditional f-measure: (2 * P * R) / (P + R). This seems a fairer approach because of the restricted input vocabulary.

Table 3. Automatic evaluation of the results

                           SLOWN1    SLOWN2    SLOWNJRC
nouns       precision      70.2%     78.4%     67.0%
            recall         87.3%     81.7%     72.0%
            f-measure      77.8%     80.0%     69.4%
verbs       precision      35.8%     54.2%     44.0%
            recall         70.3%     66.2%     48.1%
            f-measure      47.4%     59.6%     46.0%
adjectives  precision      86.7%     100.0%    85.0%
            recall         100.0%    77.8%     93.0%
            f-measure      92.9%     87.5%     88.8%
total       precision      60.7%     72.3%     53.4%
            recall         82.6%     77.6%     81.4%
            f-measure      69.9%     74.9%     64.5%
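A sketch of this per-literal evaluation, assuming both the induced wordnet and the goldstandard have been reduced to simple mappings from literals to the sets of synset ids they appear in; the data structures and function name are illustrative.

```python
def evaluate(induced, gold):
    """induced, gold: {literal: set_of_synset_ids}. Precision is the share of
    assigned ids that also appear in the goldstandard; recall is the share of
    goldstandard ids that were retrieved; both are computed only over literals
    present in both resources, so the restricted corpus vocabulary is not penalized."""
    retrieved = relevant = correct = 0
    for literal, gold_ids in gold.items():
        if literal not in induced:
            continue                       # literal outside the corpus vocabulary
        induced_ids = induced[literal]
        correct += len(induced_ids & gold_ids)
        retrieved += len(induced_ids)
        relevant += len(gold_ids)
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```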
Table 3 shows the results of the automatic evaluation of the wordnets obtained in the experiment. In line with our previous findings, the method works much better for adjectives and nouns than for verbs on this dataset as well. Compared to the results obtained in the initial experiment, the overall quality of the results obtained from the JRC-Acquis corpus is lower (the f-measure is 64.6% compared to 74.9% in the previous experiment involving the same number of languages).
4.2 Evaluation of the lexicon
In order to gain more insight into what caused the drop in quality, we first examined the bilingual lexicons extracted from the word-aligned corpus. Because these are used to map the entries onto the English and Czech wordnets and to provide Slovene translations of wordnet literals, the quality of the lexicons no doubt influences the entire approach. For this reason, 500 randomly selected entries from the English-Slovene lexicon were examined. Errors in the lexicon were classified into four categories: wrong alignment (align), wrong tokenization (tok), wrong lemmatization (lem) and wrong alignment because of a multi-word expression (mwe). The errors found in the lexicon are summarized in Table 4.

Table 4. Evaluation of the English-Slovene lexicon

wrong alignment              46    15.33%
wrong lemmatization          27     9.00%
wrong lemmatization (slo)    20     6.67%
wrong lemmatization (eng)     5     1.67%
wrong tokenization            5     1.67%
multi-word expression         9     3.00%
Most alignment mistakes appear with words which are neither English nor Slovene (citations from other languages in the corpus), which means that the accuracy of the alignment could be improved by filtering out the sections of the corpus in which they appear. Again, the best alignments are those of nouns, followed by verbs and adjectives. Below is an analysis of the mistakes for each part of speech.

Nouns: Common alignment mistakes in nouns involve 3- and 4-letter abbreviations, which are either misaligned or very often lemmatized in the wrong way. There are also quite a lot of problems with English noun1-noun2 compounds which are translated into Slovene with an adj-noun or noun2-noun1 phrase (e.g. ‘durum’/‘pšenica’, where ‘durum wheat’ is translated with ‘pšenica durum’ and thus misaligned, or ‘flare’/‘salo’, where ‘flare fat’ is translated with ‘trebušno salo’ but alignments across parts of speech are not made, so the noun ‘flare’ cannot be aligned with the adjective ‘trebušno’). The lemmatization goes wrong in cases where word forms have more than one possible lemma (e.g. ‘drug’/‘drog’, where ‘droga’ would be correct and ‘drog’ means ‘pole’, or ‘deadline’/‘roka’, where ‘rok’ would be correct and ‘roka’ means ‘hand’). Other common lemmatization mistakes are those of unknown words (e.g. ‘veto’/‘vet’ instead of ‘veto’). The last group of well-known errors involves multi-word expressions: a concept expressed with more than one word in one language corresponds to a single compound word in the other, which is why the alignment captures only part of the expression (e.g. ‘grandmother’/‘mati’ instead of ‘stara mati’, or ‘oxygenation’/‘dovajanje’ instead of ‘dovajanje kisika’).
Verbs: Most of the errors are due to wrong POS assignment (nouns are mistaken for verbs, and many foreign words are treated as nouns). With verbs, lemmatization turns out to be more problematic for English, where participles of irregular verbs are left as lemmas, which is correct if the same word functions as an adjective but not if it is a verb.

Adjectives: With adjectives there are only a few POS and lemmatization problems; the biggest problem seems to be semantically wrong alignments for which there is mostly no obvious reason (e.g. ‘eventual’/‘titanov’). Sometimes the problem seems to lie in phrases (multiple modifiers of nouns) which are expressed in one order in one language and in another order in the other language, so that the resulting alignments are wrong (e.g. ‘deboned’/‘puranji’ clearly come from the same phrase but were not aligned correctly).

This analysis shows that more effort should be made to improve the quality of the word alignment and the extracted lexicons, for example by extracting only 1:1 paragraph alignments, setting a lower threshold for the allowed sentence length and filtering out alignments with a confidence measure below 0.05. It is believed that the improved lexicons would then yield better results.
4.3 Evaluation of domains
Finally, a set of automatically generated synsets was selected and checked manually. Automatic evaluation shows that the method works best for nouns, which is why we focus on them in the rest of this section. The sample we used for manual evaluation contains 200 synsets from 8 different wordnet domains: 100 of them were selected from domains similar to the texts in the corpus (administration, chemistry, economy, law; D1 in Table 5) and 100 from domains entirely different from the texts in the corpus (linguistics, literature, publishing, mathematics; D2). These two groups of synsets are compared in order to see whether the results are better for domain-specific vocabulary.

In the manual evaluation we first checked whether the generated synset contains a correct literal at all. We then classified the errors into several categories: the wrong literal is the result of an alignment error (see the previous section), the wrong literal is semantically related to the concept (a hypernym or hyponym), or the literal is not correctly disambiguated. As can be seen from Table 5, the quality of the generated synsets varies significantly from domain to domain. While both samples contain roughly the same number of alignment errors inherited from the lexicons, there is a big difference in the number of disambiguation errors (wrong sense assignments to polysemous literals): 4 compared to 19. The analysis suggests that synset quality could be improved by using domain information as a criterion for sense assignment at the synset generation stage.
Table 5. Manual evaluation of the generated synsets

                                   D1    D2
fully correct                      65    33
completely wrong                   14    21
contains an alignment error        23    22
contains a related expression       9     4
contains a disambiguation error     4    19
5. Conclusions and future work
In this paper we have presented an approach to automatically generating wordnet synsets from the JRC-Acquis parallel corpus. The results are lower than in our previous experiment, where the Multext-East corpus was used in a similar way, but this is understandable because a major part of the linguistic annotation of the Multext-East corpus was done manually and is therefore much more accurate than the automatic tagging and lemmatization of the JRC-Acquis. Also, high-precision sentence alignment of a corpus containing a single novel was much easier than the alignment of the JRC-Acquis corpus, which consists of 8,000 documents per language. Nevertheless, we still believe the approach is promising, especially because the error analysis has indicated several possibilities for improving the technique, above all as far as the bilingual lexicons are concerned. The accuracy of the generated synsets could also be improved significantly by adding more languages.

Evaluation of the results has shown that the method works best for nouns. But because the usefulness of the wordnet as a whole also depends on the density of the hierarchy the nominal synsets are organized into, it is necessary to investigate how complete the resulting wordnet is in this respect and to come up with a technique for filling the identified gaps in the hierarchy of the automatically generated wordnet. The manually checked sample of synsets has also shown that there is a strong correlation between the quality of the generated synsets and the overlap of domains between the corpus used and the synsets. This means that the wordnet could be refined by restricting the domain space at the lexicon-to-wordnet mapping stage according to the corpus used.

Apart from that, complementary methods to extend the wordnet even further are already being looked into. For example, bilingual resources could be used to add monosemous synsets from the Princeton WordNet, because literals that appear in only one sense do not need to be disambiguated and therefore do not require extensive multilingual resources (see the sketch below). Also, to complement the word-alignment method, which is limited to producing one-word literals, domain-specific multi-word vocabulary could be obtained from specialized glossaries and lexicons.
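A minimal sketch of this future extension, under the assumption that a PWN index from (literal, POS) to synset ids and a plain English-Slovene dictionary are available; all names are hypothetical and the snippet only illustrates the idea that monosemous literals can be transferred without disambiguation.

```python
def add_monosemous_synsets(pwn_index, en_sl_dictionary, slovene_synsets):
    """pwn_index: {(literal, pos): set_of_synset_ids};
    en_sl_dictionary: {english_literal: list_of_slovene_translations};
    slovene_synsets: {synset_id: set_of_slovene_literals} (as induced above).
    Literals with exactly one sense in PWN need no disambiguation, so their
    dictionary translations can be added to that single synset directly."""
    for (literal, pos), synset_ids in pwn_index.items():
        if len(synset_ids) == 1 and literal in en_sl_dictionary:
            (synset_id,) = synset_ids
            slovene_synsets.setdefault(synset_id, set()).update(en_sl_dictionary[literal])
    return slovene_synsets
```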
Acknowledgements I would like to thank Aleš Horák from the Faculty of Informatics, Brno Masaryk University, for POS-tagging and lemmatizing the Czech part of the JRC-Acquis corpus.
References
[1] Diab, Mona (2004): The Feasibility of Bootstrapping an Arabic WordNet leveraging Parallel Corpora and an English WordNet. In: Proceedings of the Arabic Language Technologies and Resources, NEMLAR, Cairo.
[2] Dimitrova, L.; Erjavec, T.; Ide, N.; Kaalep, H.; Petkevic, V.; Tufis, D. (1998): Multext-East: Parallel and Comparable Corpora for Six Central and Eastern European Languages. In: Proceedings of ACL/COLING 98, pp. 315-319, Montreal, Canada.
[3] Dyvik, Helge (2002): Translations as semantic mirrors: from parallel corpus to wordnet. Revised version of paper presented at the ICAME 2002 Conference in Gothenburg.
[4] Erjavec, Tomaž; Ignat, C.; Pouliquen, B.; Steinberger, R. (2005): Massive multilingual corpus compilation: ACQUIS Communautaire and totale. In: Proceedings of the Second Language Technology Conference. Poznan, Poland.
[5] Erjavec, Tomaž; Fišer, Darja (2006): Building Slovene WordNet. In: Proceedings of the 5th International Conference on Language Resources and Evaluation LREC'06, 24-26 May 2006, Genoa, Italy.
[6] Farreres, Xavier; Gibert, Karina; Rodriguez, Horacio (2004): Towards Binding Spanish Senses to Wordnet Senses through Taxonomy Alignment. In: Proceedings of the Second Global WordNet Conference, pp. 259-264, Brno, Czech Republic, January 20-23, 2004.
[7] Fellbaum, C. (ed.) (1998): WordNet. An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts.
[8] Fišer, D. (to appear): Leveraging parallel corpora and existing wordnets for automatic construction of the Slovene wordnet.
[9] Giguet, Emmanuel; Luquet, Pierre-Sylvain (2006): Multilingual Lexical Database Generation from Parallel Texts in 20 European Languages with Endogenous Resources. In: Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions.
[10] Horak, Ales; Smrz, Pavel (2000): New Features of Wordnet Editor VisDic. In: Romanian Journal of Information Science and Technology Special Issue (Volume 7, No. 1-2).
[11] Ide, Nancy; Erjavec, T.; Tufis, D. (2002): Sense Discrimination with Parallel Corpora. In: Proceedings of the ACL'02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pp. 54-60, Philadelphia.
[12] Knight, K.; Luk, S. (1994): Building a Large-Scale Knowledge Base for Machine Translation. In: Proceedings of the American Association of Artificial Intelligence AAAI-94. Seattle, WA.
[13] Krstev, Cvetana; Pavlović-Lažetić, G.; Vitas, D.; Obradović, I. (2004): Using textual resources in developing Serbian wordnet. In: Romanian Journal of Information Science and Technology (Volume 7, No. 1-2), pp. 147-161.
[14] Och, Franz Josef; Ney, Hermann (2003): A Systematic Comparison of Various Statistical Alignment Models. In: Computational Linguistics, 29(1): 19-51.
[15] Pianta, Emanuele; Bentivogli, L.; Girardi, C. (2002): MultiWordNet: developing an aligned multilingual database. In: Proceedings of the First International Conference on Global WordNet, Mysore, India, January 21-25, 2002.
[16] Resnik, Philip; Yarowsky, David (1997): A perspective on word sense disambiguation methods and their evaluation. In: ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How? April 4-5, 1997, Washington, D.C., pp. 79-86.
[17] Sedlacek, R.; Smrz, P. (2001): A New Czech Morphological Analyser ajka. In: Proceedings of the 4th International Conference Text, Speech and Dialogue. Zelezna Ruda, Czech Republic.
[18] Steinberger, Ralf; Pouliquen, Bruno; Widiger, Anna; Ignat, Camelia; Erjavec, Tomaž; Tufiş, Dan; Varga, Dániel (2006): The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation. Genoa, Italy, 24-26 May 2006.
[19] Tiedemann, Jörg (2003): Recycling Translations - Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing. Doctoral Thesis, Studia Linguistica Upsaliensia 1.
[20] Tufis, D.; Cristea, D.; Stamou, S. (2000): BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. In: Dascalu, Dan (ed.): Romanian Journal of Information Science and Technology Special Issue, 7/1-2, 9-43.
[21] van der Plas, Lonneke; Tiedemann, Jörg (2006): Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity. In: Proceedings of ACL/COLING 2006.
[22] Varga, D.; Halacsy, P.; Kornai, A.; Nagy, V.; Nemeth, L.; Tron, V. (2005): Parallel corpora for medium density languages. In: Proceedings of RANLP 2005, pp. 590-596, Borovets, Bulgaria.
[23] Vossen, Piek (ed.) (1998): EuroWordNet: a multilingual database with lexical semantic networks for European Languages. Kluwer, Dordrecht.