Exploiting the Translation Context for Multilingual WSD

Lucia Specia and Maria das Graças Volpe Nunes

Departamento de Ciências de Computação e Estatística, ICMC - Universidade de São Paulo, Caixa Postal 668, 13560-970 São Carlos, Brazil
{lspecia, gracan}@icmc.usp.br
Abstract. We propose a strategy to support Word Sense Disambiguation (WSD) which is designed specifically for multilingual applications, such as Machine Translation. Co-occurrence information extracted from the translation context, i.e., the set of words which have already been translated, is used to define the order in which disambiguation rules produced by a machine learning algorithm are applied. Experiments on the English-Portuguese translation of seven verbs yielded a significant improvement in the accuracy of a rule-based model: from 0.75 to 0.79.
1 Introduction
Word Sense Disambiguation (WSD) in multilingual applications is concerned with the choice of the most appropriate translation of an ambiguous word given its context. Although it has been agreed that multilingual WSD differs from monolingual WSD [4] and that WSD is more relevant in the context of a specific application [13], little has been done towards developing WSD modules for particular applications. WSD approaches generally focus on application-independent monolingual contexts, particularly English. In the case of Machine Translation (MT), the application we are dealing with, WSD approaches usually apply traditional sense repositories (e.g., WordNet [7]) to identify the monolingual senses, which are then mapped into the target language translations. However, mapping senses between languages is a very complex issue. One of the reasons for this complexity is the difference in the sense inventories of the languages. We recently provided evidence of this for English-Portuguese [11], showing that many English senses can be translated into a single Portuguese word, while some English senses need to be split into different translations, conveying sense distinctions that only exist in Portuguese.
In addition to the differences in the sense inventory, the disambiguation process can vary according to the application. For instance, in monolingual WSD the main information source is the context of the ambiguous word, that is, the surrounding words in a sentence or text. For MT purposes, however, the context may also include the translation in the target language, i.e., words in the text which have already been translated. This strategy has not been explored specifically for WSD, although some related approaches have used similar techniques for other purposes. For example, a number of approaches to MT, especially the statistics-based approaches, make use of the words which have already been translated as context, implicitly accomplishing basic WSD during the translation process [1]. Some approaches to monolingual WSD use techniques similar to ours to gather co-occurrence evidence from corpora, either to carry out WSD [6] or to create monolingual sense-tagged corpora [2]. Other related monolingual approaches exploit the already disambiguated or unambiguous words by taking their senses into account in order to disambiguate a given word [5]. In this paper we investigate the use of the translation context, that is, the surrounding words which have already been translated, as a knowledge source for WSD. We present experiments on the disambiguation of seven verbs in English-Portuguese translation. The target language contextual information is based on the co-occurrence of the possible translations of the ambiguous word with a number of translated words, queried as n-grams on the web, using Google. The resulting ranking of translations is then used to reorder the set of rules produced by a machine learning approach for WSD. In this approach, multiple rules, pointing to different translations, can be applied to disambiguate certain cases, and thus the order in which the rules are applied plays an important role. The original approach relies on the order given by the machine learning algorithm, but we show that additional information to support the selection of the most appropriate rule for each case can improve the disambiguation accuracy. In what follows we first briefly describe our approach for WSD (Section 2). We then present our strategy to gather information about the translation context from the web (Section 3) and the way this strategy is used in our WSD approach, with some experiments and the results achieved (Section 4).
2 A Hybrid Relational Approach for WSD in MT
In [9] we present an approach for WSD which exploits knowledge-based and corpus-based techniques and employs a relational formalism to represent both examples and linguistic knowledge. The relational representation is considerably more expressive than the attribute-value format used by all the algorithms traditionally applied to generate WSD models. The major advantages of this formalism are that it avoids sparseness in the data and that it allows both types of evidence (examples and linguistic knowledge) to be used during the learning process. In order to explore relational machine learning in our WSD approach we use Inductive Logic Programming (ILP) [8]. ILP employs techniques of both Machine
Learning and Logic Programming to build first-order logic theories from examples and background knowledge, which are also represented by means of first-order logic clauses. We implemented our approach using Aleph [12], an ILP system which provides a complete relational learning inference engine and various customization options. We use seven groups of syntactic, semantic, and pragmatic knowledge sources (KSs) as background knowledge: KS1: bag-of-words of the 5 lemmas surrounding the verb; KS2: part-of-speech (POS) tags of content words in a 5-word window surrounding the verb; KS3: subject and object syntactic relations with respect to the verb under consideration; KS4: context words represented by 11 collocations with respect to the verb; KS5: selectional restrictions of verbs and semantic features of their arguments; KS6: idioms and phrasal verbs; KS7: a count of the overlapping words in dictionary definitions for the possible translations of the verb and the words surrounding it in the sentence. Some of these KSs were extracted from our sample corpus, while others were automatically extracted from lexical resources (the Michaelis and Password English-Portuguese dictionaries, LDOCE, and WordNet). Most of these KSs have traditionally been employed in monolingual WSD, while others (KS6 and KS7) are specific to MT. Additionally, our repository of senses is also specific to MT: instead of disambiguating the English senses and then mapping them into the corresponding Portuguese translations, we disambiguate directly among the Portuguese translations. Thus, our sample corpus is tagged with the translations of the ambiguous verbs under consideration, while the set of possible translations for each verb is given by bilingual dictionaries: the Michaelis and Password English-Portuguese dictionaries. Based on the background knowledge and on examples of disambiguation, Aleph's inference engine induces a set of symbolic rules. The resulting set of rules depends strongly on the order in which the training examples are given to the inference engine. Usually, to classify (disambiguate) new cases, the rules are applied in the order they are produced, and all the cases covered by a certain rule are assessed as correctly or incorrectly classified and removed from the test set. Although other rules might also classify the same case, they are ignored. In what follows we first present our experiments on acquiring co-occurrence information about the translation context, which is then used to reorder the set of rules.
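To make the rule-application step concrete, the following sketch mimics in Python how an ordered list of induced rules might be applied to test cases: each rule is tried in the order it was produced, and every case it covers is classified and removed. The rule representation and feature names are illustrative only and do not correspond to Aleph's actual output format.

```python
# Minimal sketch of ordered rule application (illustrative, not Aleph's API).
# A "rule" is a (condition, translation) pair; conditions inspect the
# features extracted for each test example.

def apply_rules_in_order(rules, examples):
    """Classify each example with the first covering rule, as in the
    original (non-reordered) strategy described above."""
    predictions = {}
    remaining = list(examples)
    for condition, translation in rules:      # rules in induction order
        still_remaining = []
        for example in remaining:
            if condition(example):            # rule covers this example
                predictions[example["id"]] = translation
            else:
                still_remaining.append(example)
        remaining = still_remaining
    return predictions

# Hypothetical rules for "to take", loosely mirroring the rule style in Fig. 2:
rules = [
    (lambda e: "in" in e.get("collocations", []), "tomar"),
    (lambda e: "him" in e.get("bag_of_words", []), "levar"),
]
examples = [{"id": 1, "bag_of_words": ["him", "everywhere"], "collocations": []}]
print(apply_rules_in_order(rules, examples))   # {1: 'levar'}
```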
3 Acquiring the Translation Context Information
We explore the translation context information through the analysis of the co-occurrence frequencies of sets of Portuguese words which have already been translated together with each possible translation of the verb under consideration. These frequencies are obtained by querying the web (via Google's API, www.google.com/apis/) with these sets of words. In principle, any subset of the words already translated into the target language could be used as context. However, words in certain positions are much more important than others.
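The hit-count collection step can be sketched as below. The `hit_count` function is a placeholder for whatever web search backend is used to obtain the number of hits; only the query construction reflects the strategy described in this section, and the concrete query formats are discussed in Section 3.2.

```python
# Sketch of the hit-count collection step: each possible translation of the
# verb is combined with a set of already-translated context words and the
# number of web hits is recorded. `hit_count` is a placeholder for an actual
# search backend.

def hit_count(query: str) -> int:
    raise NotImplementedError("plug in a web search backend here")

def rank_translations(context_words, candidate_translations):
    """Return (translation, hits) pairs sorted by decreasing hit count."""
    scores = {}
    for translation in candidate_translations:
        query = " ".join([translation] + list(context_words))
        scores[translation] = hit_count(query)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# e.g. rank_translations(["meus", "remédios"], ["tomar", "pegar", "levar"])
# would score three candidate translations of "to take" against the context
# of the sentence in Fig. 1.
```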
tomar - I will take my medicines tonight, as prescribed by the doctor.
Vou tomar meus remédios hoje à noite, conforme indicado pelo médico.
Fig. 1. Example of an English-Portuguese parallel sentence in our corpus

Table 1. Our set of verbs and their number of possible translations in the corpus

Verb  Translations
come  226
get   242
give  128
go    197
look  63
make  239
take  331
In order to identify a relevant contextual structure, we first ran a series of experiments with different structures, which we briefly describe in Section 3.2, after presenting our experimental setting.

3.1 Experimental Setting
In our experiments we use the same set of words as in our WSD approach: seven highly frequent and ambiguous verbs previously identified as problematic for English-Portuguese MT systems: "to come", "to get", "to give", "to go", "to look", "to make", and "to take". Since we do not have at our disposal a good-quality English-Portuguese MT system, we use a sentence-aligned parallel corpus, produced by human translators, to provide the translation context. In order to experiment with different query structures, we selected 100 English sentences for each of the seven verbs (a total of 700 sentences) from the parallel corpus Compara [3], which comprises fiction books. We used a version of this corpus in which the translations of the ambiguous verbs have already been (automatically) annotated. For each occurrence of a verb, this corpus contains the English sentence, annotated with that translation, and the corresponding Portuguese sentence, as in the example shown in Fig. 1. Our translation context strategy requires a list of all the possible translations of each verb. This list was extracted from bilingual dictionaries (e.g., DIC Prático Michaelis), amounting to the numbers shown in Table 1. They include translations of both phrasal and non-phrasal usages of the English verbs. In order to assess the effectiveness of our strategy when the resulting contextual information is embedded into our WSD approach (Section 4), we consider the set of test examples already used in our previous experiments with that approach: around 50 sentences for each ambiguous verb, also extracted from Compara. It is worth mentioning that these sentences are not part of the training example set: a rule-based model for each verb was produced by Aleph based on a training corpus of around 150 examples and evaluated on an independent test corpus of around 50 examples.
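For concreteness, the data assumed in the sketches throughout this paper can be pictured roughly as follows; the field names are hypothetical and do not correspond to the actual format of the Compara-derived files.

```python
# Hypothetical record layout for the annotated parallel corpus and the
# dictionary-derived translation lists (field names are illustrative).

example = {
    "verb": "take",
    "translation": "tomar",   # gold annotation: the verb's translation
    "source": "I will take my medicines tonight, as prescribed by the doctor.",
    "target": "Vou tomar meus remédios hoje à noite, conforme indicado pelo médico.",
}

# Possible translations per verb, extracted from bilingual dictionaries;
# only a few are listed here (Table 1 gives the full counts per verb).
translations = {
    "take": ["tomar", "pegar", "levar"],
}
```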
Table 2. Example of queries and their respective number of hits in Google

Type  Queries                                                                        Hits
(a)   "tomar meus remédios"                                                          228
      "pegar meus remédios"                                                          17
      "levar meus remédios"                                                          8
(b)   (tomar AND (remédios OR hoje OR noite OR conforme OR indicado OR médico))      3,500,000
      (pegar AND (remédios OR hoje OR noite OR conforme OR indicado OR médico))      2,290,000
      (levar AND (remédios OR hoje OR noite OR conforme OR indicado OR médico))      1,770,000
3.2 Discovering the Most Useful Kind of Contextual Information
In our experiments we consider the use of the translation context in a hypothetical rule-based transfer MT system which first translates all the unambiguous words in the sentence and then each ambiguous word, using the WSD module. We assume that all the other words in the sentence will have already been translated, with only the verb remaining to be disambiguated. In fact, since we are using the parallel corpus to provide the translation context, any combination of words in the target sentence, except the ambiguous one, could be used as context. In order to find a suitable set of words, we experimented with different n-grams and bags-of-words, all including each of the possible translations of the ambiguous verb. As described in [10], after experimenting with six types of query for a small number of words, we chose the two presenting the most promising results (examples are given in Table 2): (a) trigrams with the first two words to the right of the verb; (b) bags-of-words with all the content words already translated in the sentence, requiring any subset of the words to be in the results. Given the parallel corpus as exemplified in Fig. 1, for each sentence we created queries with the contextual words and each of the possible translations of the verb under consideration, and then submitted each query to Google. For example, assuming that the verb "to take" has only three possible translations, "tomar" (consume, ingest), "pegar" (buy, select), and "levar" (take someone to some place), the queries that would be built for the example sentence are shown in Table 2 (each query starts with the translation it represents). The relevant information returned by Google is the number of hits for each query.
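The two query types can be sketched as below, assuming a tokenised target sentence and a known position for the verb's translation in it; tokenisation and content-word filtering are deliberately simplified.

```python
# Sketch of the two query types described above, for one candidate
# translation. Tokenisation and content-word filtering are simplified.

def type_a_query(translation, target_tokens, verb_index):
    """Type (a): trigram with the translation and the first two words
    to the right of the verb position."""
    right = target_tokens[verb_index + 1 : verb_index + 3]
    return '"{}"'.format(" ".join([translation] + right))

def type_b_query(translation, target_tokens, verb_index, stopwords=frozenset()):
    """Type (b): the translation AND-ed with an OR over the content
    words already translated in the sentence."""
    content = [t for i, t in enumerate(target_tokens)
               if i != verb_index and t.lower() not in stopwords]
    return "({} AND ({}))".format(translation, " OR ".join(content))

tokens = "Vou tomar meus remédios hoje à noite".split()
print(type_a_query("tomar", tokens, 1))                      # "tomar meus remédios"
print(type_b_query("tomar", tokens, 1, stopwords={"vou", "à"}))
```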
3.3 Results
In Table 3 we show the accuracies that would be achieved by our strategy based on the translation context if it were used as the only information source for WSD. We also show the potential of this strategy to be used as an additional information source supporting a WSD approach. We first show the accuracy of the baseline, which uses the most frequent translation in the set of 100 sentences per verb: 0.43, on average.
Table 3. Accuracies: baseline, queries of types (a) and (b)

Verb   Baseline  Qr.(a) 1st  Qr.(a) 1st-3rd  Qr.(a) 1st-10th  Qr.(b) 1st  Qr.(b) 1st-3rd  Qr.(b) 1st-10th
       .         Choice      Choice          Choice           Choice      Choice          Choice
come   0.4       0.4         0.4             0.8              0.2         0.2             0.2
get    0         0.4         0.7             0.7              0           0               0.2
give   0.8       0.5         0.7             0.8              0.4         0.8             1
go     0.4       0.5         0.9             1                0           0               0
look   0.4       0.6         0.9             1                0.4         0.4             0.4
make   0.8       0.3         0.6             1                0           0               1
take   0.2       0.8         1               1                0           0.2             0.4
Aver.  0.43      0.50        0.74            0.90             0.14        0.23            0.46
We assumed the most frequent translation to be the one given first by the dictionary Password. In the third to fifth columns of Table 3 we present the accuracies obtained in our experiments using queries of type (a). The third column shows the percentage of sentences for which the query with the maximum number of hits contained the actual translation of the verb in that sentence. The other two columns show the percentages of sentences for which the query with the actual translation was ranked among the top 3 and top 10 positions, respectively, according to the number of hits. The sixth to eighth columns show the corresponding accuracies for queries of type (b). In general, the results were considerably better for queries of type (a). The problem with queries of type (b) is that long sentences produce queries consisting of many words, and these are too general to accurately identify the sense of the verb. On average, the accuracy of the strategy for the first choice with queries of type (a) (0.50) outperforms the baseline based on the most frequent sense. However, as previously mentioned, our goal is to use the information provided by this strategy as additional evidence in our WSD approach, which already uses many other knowledge sources. In what follows we present a proposal to use this information in our WSD approach: given the set of rules that can be applied to classify a certain example, we choose the most appropriate rule according to the co-occurrence information.
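The top-1, top-3 and top-10 figures in Table 3 correspond to the kind of evaluation sketched below, assuming the hit counts per candidate translation have already been collected; the data layout is illustrative.

```python
# Sketch of the top-k evaluation behind Table 3: for each test sentence,
# candidates are ranked by hit count and we check whether the gold
# translation appears among the top k. Data layout is illustrative.

def top_k_accuracy(cases, k):
    """cases: list of (gold_translation, {candidate: hit_count}) pairs."""
    correct = 0
    for gold, hits in cases:
        ranked = sorted(hits, key=hits.get, reverse=True)
        if gold in ranked[:k]:
            correct += 1
    return correct / len(cases)

cases = [("tomar", {"tomar": 228, "pegar": 17, "levar": 8}),
         ("levar", {"ir": 216, "levar": 129, "dar": 87, "fazer": 56})]
print(top_k_accuracy(cases, 1))   # 0.5
print(top_k_accuracy(cases, 3))   # 1.0
```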
4 Using Contextual Information to Reorder WSD Rules
As previously mentioned, in the rule-based models produced by ILP systems, multiple rules can cover the same example(s). We propose a strategy to employ the contextual information acquired as explained in Section 3 to choose the most appropriate rule. Essentially, instead of applying each rule in the order it appears in the theory and using it to classify all the examples it covers, we take each test example individually and, given all the rules covering that example, select the most appropriate rule according to the score of the translation pointed out by the rule in the rank provided by Google for the translation context of that example (for queries of type (a)).
Example: Quincas Borba took him everywhere. They slept in the same room.
Translation: Quincas Borba levava-o para toda parte, dormiam no mesmo quarto.
Rules:
[Rule 19] = has_sense(A, tomar) IF has_collocation(A, col_11, in).
[Rule 33] = has_sense(A, levar) IF has_bag(A, him).
[Rule 48] = has_sense(A, puxar) IF has_collocation(A, col_2, propernoun).
Google queries: "<translation> para toda"
Google rank: ir 216 - levar 129 - dar 87 - fazer 56 ...
Rule chosen: [Rule 33], translation = levar -> correct
Fig. 2. Example of the application of our strategy to reorder the disambiguation rules
For example, in Fig. 2 we show the three rules covering one example of the verb "to take", which was correctly disambiguated as "levar". In this case, the correct translation was not the best scored in the rank provided by Google: it came second, but since there was no rule for the top-ranked translation ("ir"), "levar" was chosen. We experimented with this strategy on the set of rules produced by our WSD approach for each of the verbs, based on 150 training examples and tested on around 50 examples per verb (cf. Section 3.1). This set of rules was produced by Aleph after adjusting several parameters and trying out different search, induction and evaluation methods (as described in [9]), and was considered a satisfactory disambiguation model. In Table 4 we first show the accuracy of our WSD approach without the translation context information, i.e., applying the rules in their original order. We then show the accuracy of the WSD model with the rule reordering based on the translation context information. In the last column we present the percentage of examples for which multiple rules could be applied. The fact that a high percentage of examples is covered by multiple rules is very common in ILP-based models, and thus a strategy for choosing the best rule can make a crucial difference. As we can see, the strategy based on the translation context significantly improved the average accuracy of the WSD model.
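The rule-selection step illustrated in Fig. 2 can be sketched as follows: among the rules that cover a test example, we pick the one whose predicted translation is best placed in the hit-count ranking for that example's type (a) query. The rule and ranking representations are illustrative only.

```python
# Sketch of the reordering strategy: among the rules covering an example,
# choose the one whose translation is best placed in the Google rank for
# that example's translation context (illustrative representations).

def choose_rule(covering_rules, google_rank):
    """covering_rules: (rule_id, translation) pairs that cover the example;
    google_rank: translations ordered by decreasing hit count."""
    position = {t: i for i, t in enumerate(google_rank)}
    return min(covering_rules,
               key=lambda rule: position.get(rule[1], len(google_rank)))

# Fig. 2: three rules cover the example; "ir" leads the rank but no rule
# predicts it, so the rule for "levar" (ranked second) is chosen.
covering = [(19, "tomar"), (33, "levar"), (48, "puxar")]
rank = ["ir", "levar", "dar", "fazer"]
print(choose_rule(covering, rank))   # (33, 'levar')
```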
5 Conclusions
We described a strategy to support WSD in MT which is specific to this application: it uses co-occurrence information about the translation context. To gather this information, we experimented with 700 sentences containing seven highly ambiguous verbs, searching Google for all the possible translations of the ambiguous verbs with different queries formulated as bags-of-words and n-grams, and chose the most promising type of query: trigrams containing the translation and the two words to the right of the verb. We then embedded the co-occurrence ranking provided by Google in our WSD approach by using the scores to reorder multiple rules covering the same example.
Table 4. Accuracy of the WSD approach with and without the translation context information

Verb   Without translation context  With translation context - query (a)  % Examples with multiple rules
come   0.82                         0.82                                  0.33
get    0.51                         0.61                                  0.33
give   0.96                         0.96                                  0.51
go     0.73                         0.82                                  0.69
look   0.83                         0.80                                  0.34
make   0.74                         0.79                                  0.33
take   0.66                         0.69                                  0.38
Aver.  0.75                         0.79
The use of this very simple strategy yielded a significant improvement in the average accuracy of the WSD model. More complex strategies, also based on the translation context information, could yield even better results and will be investigated in future work.
References

1. Dagan, I. and Itai, A.: Word Sense Disambiguation Using a Second Language Monolingual Corpus. Computational Linguistics, 20 (1994) 563–596
2. Fernández, J., Castilho, M., Rigau, G., Atserias, J., and Turmo, J.: Automatic Acquisition of Sense Examples using ExRetriever. Proceedings of LREC, Lisbon (2004) 25–28
3. Frankenberg-Garcia, A. and Santos, D.: Introducing COMPARA: the Portuguese-English Parallel Corpus. Corpora in Translator Education, Manchester (2003) 71–87
4. Hutchins, W.J. and Somers, H.L.: An Introduction to Machine Translation. Academic Press, Great Britain (1992)
5. Lesk, M.: Automated Sense Disambiguation Using Machine-readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. Proceedings of SIGDOC, Toronto (1986) 24–26
6. Mihalcea, R. and Moldovan, D.I.: A Method for Word Sense Disambiguation of Unrestricted Text. Proceedings of the 37th Meeting of the ACL, Maryland (1999)
7. Miller, G.A., Beckwith, R.T., Fellbaum, C.D., Gross, D., and Miller, K.: WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4) (1990) 235–244
8. Muggleton, S.: Inductive Logic Programming. New Generation Computing, 8(4) (1991) 295–318
9. Specia, L.: A Hybrid Relational Approach for WSD - First Results. To appear in the COLING/ACL 06 Student Research Workshop, Sydney (2006)
10. Specia, L., Nunes, M.G.V., and Stevenson, M.: Translation Context Sensitive WSD. To appear in the 11th Annual Conference of the European Association for Machine Translation, Oslo (2006)
11. Specia, L., Castelo-Branco, G., Nunes, M.G.V., and Stevenson, M.: Multilingual versus Monolingual WSD. Workshop Making Sense of Sense, EACL, Trento (2006)
12. Srinivasan, A.: The Aleph Manual. Technical Report, Computing Laboratory, Oxford University (2000)
13. Wilks, Y. and Stevenson, M.: The Grammar of Sense: Using Part-of-speech Tags as a First Step in Semantic Disambiguation. Natural Language Engineering, 4(1) (1998) 1–9