Computers and the Humanities 34: 1–13, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Introduction to the Special Issue on SENSEVAL

A. KILGARRIFF1 and M. PALMER2
1 ITRI, University of Brighton; 2 University of Pennsylvania

Abstract. SENSEVAL was the first open, community-based evaluation exercise for Word Sense Disambiguation programs. It took place in the summer of 1998, with tasks for English, French and Italian. There were participating systems from 23 research groups. This special issue is an account of the exercise. In addition to describing the contents of the volume, this introduction considers how the exercise has shed light on some general questions about word senses and evaluation.

Key words: word sense disambiguation, evaluation, SENSEVAL

1. Introduction

SENSEVAL was the first open, community-based evaluation exercise for Word Sense Disambiguation programs. It took place in the summer of 1998 under the auspices of ACL SIGLEX (the Association for Computational Linguistics Special Interest Group on the Lexicon), EURALEX (European Association for Lexicography), ELSNET, and EU Projects SPARKLE and ECRAN. This special issue is an account of the exercise. In this introduction, we first describe the problem and the historical context; then the papers; then we address some criticisms of the evaluation paradigm; and finally, we look forward to future SENSEVALs.

2. SENSEVAL: The Context

2.1. The problem

As dictionaries tell us, most common words have more than one meaning. When a word is used in a book or in conversation, generally speaking, just one of those meanings will apply. This is not an issue for people. We are very rarely slowed down in our comprehension by the need to determine which meaning of a word applies. But it is a very difficult task for computers. The clearest case is in Machine Translation. If the English word drug translates into French as either drogue (‘bad’ drugs) or médicament (‘good’ drugs), then an English-French MT system needs to disambiguate drug if it is to make the correct translation. Similarly, information retrieval systems may retrieve documents about a drogue when the item of interest is a médicament; information extraction systems may make wrong assertions; text-to-speech systems will make errors where there are multiple pronunciations for the same spelling, as in violin bows and ships’ bows. For virtually all Natural Language Processing applications, word sense ambiguity is a potential source of error.

For forty years now, people have been writing computer programs to do Word Sense Disambiguation (WSD). The field is surveyed, from earliest times to recent work, in (Ide and Véronis, 1998), and the reader is directed to that paper for historical background and the kinds of methods that have been used.

2.2. What are word senses?

Before a WSD problem is well-defined, a set of word senses to disambiguate between is required. This raises a number of issues.

First, which dictionary? People often refer to ‘the dictionary’ as if there were just one, definitive one. But dictionaries differ and, for very many words, any two will give different analyses. Readings treated as distinct in one dictionary will be merged in the other. Bigger dictionaries will give more senses than smaller ones. Lexicographic policies regarding grammar, phraseology and metaphor all affect what a particular dictionary treats as a sense or subsense. Also, some dictionary entries are better than others. Sometimes the lexicographer will not have arrived at a clear image of what the distinction between two putative senses is before writing the entry, and sometimes, even though the distinction was clear to him or her, he or she will not have succeeded in making it clear in the entry.

Second, homonymy and polysemy. In homonymy, there are two or more distinct ‘words’ which happen to have the same form. In polysemy, a single word has multiple meanings. Distinctions between homonyms are clear, and disambiguating between them is, for people, straightforward. For polysemous words, it may not be so, either in the abstract or in relation to particular contexts. When a drug is stolen from the pharmacy, it is indeterminate between drogue and médicament. It might appear appealing to distinguish homonymy resolution from polysemy resolution, but in practice there are no general, systematic methods for making the distinction, and experts frequently disagree.

While relations between homonyms are arbitrary, relations between polysemes are riddled with regularities. Thus rabbit is like chicken, turkey and lamb in having both an ‘animal’ sense and a ‘meat of that animal’ sense. Kangaroo and emu also appear to participate in the pattern; certainly, one might find either on a restaurant menu with a ‘meat’ reading required. Where a regularity could be applied to a word, but the derived sense is neither particularly common nor in any way unpredictable, it will not generally be listed in a dictionary, and we may say it is not ‘lexicalised’. Yet clearly, words are used in such ways, and a disambiguation program will need to do something with them. Also, the regularities are rarely fully predictive. Pig does not have the meat sense.


In sum, there are various reasons why people who do not have any trouble understanding a word in context might nonetheless have difficulty assigning it to a sense from a dictionary. In some cases, towards the homonymy end of the spectrum, the word sense disambiguation problem does appear to map straightforwardly to something that people do when they understand a sentence with an ambiguous word in it. As we move towards senses that are closely related, the task seems more artificial, and people may disagree. We return to the causes and implications of such disagreements at various points in this introduction and elsewhere in the special issue.

2.3. Evaluation

There are now many working WSD programs. An obvious question is, which is best? Evaluation has excited a great deal of interest across the Language Engineering world of late. Not only do we want to know which programs perform best, but also, the developers of a program want to know when modifications improve performance, by how much, and what combinations of modifications are optimal. US experience in DARPA competitive evaluations for speech recognition, dialogue systems, information retrieval and information extraction has been that the focus provided by an evaluation brings research communities together, forces consensus on what is critical about the field, and leads to the development of common resources, all of which then stimulates further rapid progress (see, e.g., Gaizauskas, 1998).

Reaping these benefits involves overcoming two major hurdles. The first is agreeing on an explicit and detailed definition of the task. The second is producing a “gold standard” corpus of correct answers, so it is possible to say how much of the time a program gets it right. In relation to WSD, defining the task includes identifying the set of senses between which a program is to disambiguate: the “sense inventory” problem. Producing a gold standard corpus for WSD is both expensive, as it requires many person-months of annotator effort, and hard because, as earlier evidence has shown, if the exercise is not set up with due care, different individuals will often assign different senses to the same word-in-context.

2.4. History of WSD evaluation

People producing WSD systems have always needed to evaluate them. A system developer needs a test set of some sort to determine when the system is working at all, and whether a change has improved matters or made them worse. So system developers have frequently worked through a number of sentences containing the words of interest, assigning to each a sense-tag from whatever dictionary they were using. They have then, on some occasions, stated the percentage correct for their system in the write-up.


Gale, Church and Yarowsky (1992) review, exhaustively and somewhat bleakly, the state of affairs as of 1992. They open with:

    We have recently reported on two new word-sense disambiguation systems . . . [and] have convinced ourselves that the performance is remarkably good. Nevertheless, we would really like to be able to make a stronger statement, and therefore, we decided to try to develop some more objective evaluation measures.

First they compare the performance of one of their systems (Yarowsky, 1992) with that of other WSD systems for which accuracy figures are available (taking each word addressed by each other system in turn). While the comparison of numbers suggests in most cases that their system does better, they note

    one feels uncomfortable about comparing results across experiments, since there are many potentially important differences including different corpora, different words, different judges, differences in precision and recall, and differences in the use of tools such as parsers and part of speech taggers etc. In short, there seem to be a number of serious questions regarding the commonly used technique of reporting percent correct on a few words chosen by hand. Apparently, the literature on evaluation of word-sense disambiguation fails to offer a clear model that we might follow in order to quantify the performance of our disambiguation algorithms. (p. 252)

The paper was written at a time of increasing interest in evaluation in Language Engineering in general, and the concerns they list are in large part those that are resolved by collaborative, co-ordinated community-wide evaluation exercises as in the DARPA model.

The topic was raised again four years later, as the central issue of a workshop of the ACL Lexicon Special Interest Group (SIGLEX) in Washington, April 1997. The DARPA community had been baffled by the difficulty, perhaps impossibility, of determining a methodology for the evaluation of semantic interpretation. There was not even a consensus on the right level of semantic representation, let alone what that representation should contain. Martha Palmer, as chair of SIGLEX, suggested that a workshop be organised around the central questions of whether or not “hand tagged text [would] also be of use for assigning semantic characteristics to words in their context . . . to what end should hand tagging be performed, what lexical semantic information should be hand tagged, and how should this tagging be done?”

During the workshop, chaired by Marc Light, sense tagging was recognised as a relatively uncontroversial level of semantic analysis that might be more amenable to evaluation than other, more problematic levels. Resnik and Yarowsky made some practical proposals for evaluation of WSD systems using machine learning techniques (Resnik and Yarowsky, 1997). These were broadly welcomed, and led to extensive and enthusiastic discussions. There was a high degree of consensus that the field of WSD would benefit from careful evaluation,


and that researchers needed to collaborate and make compromises so that an evaluation framework could be agreed upon. An actual experiment in a community-wide evaluation exercise would allow us to address three fundamental questions:
1. What evidence is there for the ‘reality’ of sense distinctions?
2. Can we provide a consistent sense-tagged Gold Standard and appropriately measure system performance against it?
3. Is sense tagging a useful level of semantic representation: what are the prospects for WSD improving overall system performance for various NLP applications?

Following the Washington meeting, Adam Kilgarriff undertook the coordination of a first evaluation exercise, christened SENSEVAL.1 The exercise culminated in a workshop (held at Herstmonceux Castle, Sussex, England) in September 1998. Most of the papers in this special issue have their origins in presentations at that workshop. The evidence of the workshop sheds light on the first question, and gives an unequivocal ‘yes’ to the second. The third is more complex, and we return to it in Section 4.

3. Papers

3.1. Languages covered; ‘framework’ papers

Most research in WSD has been on English. There are many resources available for English, much commercial interest, and much expertise in the problems it presents. It is easiest to set up an exercise for English. However, there was no desire for hegemony, so ACL SIGLEX’s position was simply that, wherever there was an individual or group with the commitment and resources to set up an exercise for a given language, they would be welcomed and encouraged, though they would then be responsible for all the language-specific work (including funding the resource development). There were preliminary discussions regarding six languages in all, and for the first SENSEVAL, there were English, French and Italian tasks. The French and Italian teams worked together under the banner of ROMANSEVAL and adopted parallel designs.

For each of the three exercises, there is a paper describing how the exercise was set up, and the results: for English, by Kilgarriff and Rosenzweig; for French, by Segond; and for Italian, by Calzolari and Corazzari. These papers describe the choice of lexicon and corpus for each task; the methods used for choosing a sample of word types; the approach to manual sense tagging; the level of agreement between different human sense-taggers; baselines; system results; and problems and anomalies encountered during the whole process.

An evaluation needs a scoring metric, and one of the issues raised by Resnik and Yarowsky (1997) was that a simple metric, whereby a correct response scores 1 and anything else scores 0, is not satisfactory. It says nothing about what to do where there are multiple correct answers, or where a system returns multiple responses, or where the tags are hierarchically organised, so that one tag may be a generalisation or specialisation of another.


Table I. Numbers of participants for each language

            Systems   Research groups   Papers   Brief note
English        18            17            15          3
French          5             4             1          3
Italian         2             2             1          0
Totals         25            23            17          6

In the one paper in the special issue which is not specific to WSD, Melamed and Resnik present a scoring scheme meeting the desiderata. The scheme underlay the scoring strategies used in SENSEVAL.

Krishnamurthy and Nicholls describe the process of manually tagging the English test corpus, with detailed discussion of the cases where the lexical entry and/or corpus instance meant that there was not a straightforward, single, correct sense tag for the corpus instance. They thereby provide a research agenda for work in the area: what must one do, to the dictionary, or WSD system, or larger theoretical framework, to not inevitably go wrong, for each of these types of cases?

In a short note, Moon asks what the scale of the WSD problem is, and shows that it relates, for general English, to the order of 10,000 words – a consideration that becomes critical should it be necessary to do lexicographical work on each one of those words.

3.2. Participating systems

All research teams which participated in the evaluation – that is, which applied their WSD system to the test data and returned results – were invited to submit descriptions of their system and its performance on the task to the special issue. Table I shows, for each language, how many participating systems, research groups and special issue papers there are.2 For most of the 25 participating systems, there is a paper in the special issue (and for six of the remainder, there are brief descriptions inserted as appendices to the appropriate ‘framework’ paper). The systems use a range of machine learning algorithms and consult a variety of lexical resources.

When this exercise was first proposed, in Washington in 1997, it was notable that the participants fell into opposing camps – the proponents of machine learning techniques versus the proponents of hand-crafted lexical resources. Each camp eagerly anticipated demonstrating their superiority in SENSEVAL. Notable at the workshop was the frequency with which participants had merged the two approaches. Several ‘unsupervised systems’ – those relying on lexical resources – made extensive use of the training data to fine-tune their systems, and several ‘supervised systems’ – those relying on machine learning from training data – had a lexical resource as a fall-back where the data was insufficient. When it came to getting the task done, the purity of the approach was less important than the robustness of the system performance. The extensive discussion of criteria for a sense inventory also created more awareness among the participants of how fundamental the lexicon is to the task. It is only worth learning sense distinctions if they can in fact be distinguished.

The English exercise was set up with substantial amounts of training data, which supported machine-learning approaches. This was clearly reflected in the results, with the machine learning systems performing best. The highest-performing systems utilised a wide range of features, including the inflectional form of the word to be disambiguated, part-of-speech tag sequences, semantic classes, and collocates at specific positions as well as ‘anywhere in a k-word window of the target word’. Some of these features are dependent on others, so techniques such as O’Hara et al.’s, which do not assume independence when incorporating features, could make more principled use of the data. This makes the good performance of Chodorow et al. intriguing, as their Bayesian model does assume independence. One system (Hawkins’s) used some manually rather than automatically derived features, with the manual acquisition organised so that it could be rapidly bootstrapped from untagged training material. Veenstra et al. improved their system performance when they optimised the settings in their model for each individual word, based on performance in a cross-validation exercise. They got quite distinct settings for each individual lexical item. Approaches that are sensitive to such individual differences are clearly necessary, but the requisite amount of training data is disconcerting. An ability to leverage sparse data effectively, as was done by exemplar-based approaches, mitigates this need to some degree.

One of the pleasant outcomes of the evaluation was that many groups were clearly using the data to test a particular attribute of their system, rather than focusing simply on maximising results. Systems that used only grammatical relations or subcategorisation frames did not fare as well in the performance comparisons, but gained valuable information about the contribution of individual feature types. This type of scholarly approach to training and testing benefits the field as much as an approach that is primarily focused on winning the bake-off. Future SENSEVALs will do well to continue to foster this exploratory attitude.
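To give a concrete flavour of the supervised, feature-based approaches described above, the sketch below shows a minimal Naive Bayes sense classifier whose only features are the words in a k-word window around the target. It is an illustration of the general technique, not a reconstruction of any participating system; the class name, the smoothing constant and the bag-of-words feature set are our own simplifications.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesWSD:
    """Minimal Naive Bayes word-sense classifier.

    Features are simply the words within a k-word window of the target;
    the SENSEVAL systems discussed above used much richer feature sets
    (POS sequences, positional collocates, semantic classes)."""

    def __init__(self, k=5, alpha=1.0):
        self.k = k                      # context window half-width
        self.alpha = alpha              # add-alpha smoothing constant
        self.sense_counts = Counter()   # sense -> number of training instances
        self.feature_counts = defaultdict(Counter)  # sense -> word -> count
        self.vocab = set()

    def _features(self, tokens, target_index):
        lo = max(0, target_index - self.k)
        hi = target_index + self.k + 1
        return [w.lower() for i, w in enumerate(tokens[lo:hi], start=lo)
                if i != target_index]

    def train(self, instances):
        """instances: iterable of (tokens, target_index, sense) triples."""
        for tokens, idx, sense in instances:
            self.sense_counts[sense] += 1
            for f in self._features(tokens, idx):
                self.feature_counts[sense][f] += 1
                self.vocab.add(f)

    def classify(self, tokens, target_index):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, n in self.sense_counts.items():
            score = math.log(n / total)          # log prior for the sense
            denom = (sum(self.feature_counts[sense].values())
                     + self.alpha * len(self.vocab))
            for f in self._features(tokens, target_index):
                # conditional independence assumption, as in Chodorow et al.'s model
                score += math.log((self.feature_counts[sense][f] + self.alpha) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense
```

A classifier of this kind would be trained separately for each word in the lexical sample, using the hand-tagged training instances distributed for that word.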

3.3. Discussion papers

The papers by Hanks, Palmer, Ide and Wilks examine the fundamental question of how sense distinctions can be made reliably, providing critical perspectives and suggestions for future tasks. The question of the role of WSD in a complete NLP system is also raised.


Hanks asks, simply, “Do word meanings exist?” and reminds us of the extent to which they are figments of the lexicographer’s working practice. As he says, “if senses don’t exist, then there is not much point in trying to disambiguate them”. His corpus analyses of bank, climb and check show how different components of the meaning potential of the word are activated in different contexts. His paper is a call for representations of word meaning that go beyond “checklist theories of meaning” and record meaning components, organised into hierarchies and constellations of prototypes, and for algorithms that work out which of the components are activated in a context of use.

The Palmer paper is complementary, in that it asks the same question but from the perspective of an NLP system. How are different senses of the same word characterised in a computational lexicon? She focuses on verb entries. Since they typically consist of predicate argument structures with possible semantic class constraints on the arguments, possible syntactic realizations and possible inferences to be drawn, alternative senses must differ concretely in one or more of these aspects. The more closely each entry in a dictionary “checklist” can be associated with a concrete change along one or more of these dimensions, the more readily a computational lexicon can capture the relevant distinctions. The meaning components desired by Hanks can correspond to one or more elements of this type of representation, suggesting a measure of convergence between the lexicographic community and the computational lexical semantics community.

Ide presents a study into the use of aligned, parallel corpora for identifying word senses as items that get systematically translated into one or more other languages in the same way. This is a highly appealing notion, and is indeed a strategy used by lexicographers in determining the senses a word has in the first place. It offers the prospect of taking the confounding factors of lexicographic practice out of the definition of word senses. Ide’s study is small-scale, but charts the issues that would need addressing if the strategy were to be adopted more widely (see also section 5 below).

Wilks asks several central questions about the way in which the WSD field is proceeding: will data-driven methods reach their upper bound all too soon, precipitating a return to favour of AI strategies? Where do discussions of lexical fields and vagueness take us? He presents the case against the “lexical sample” aspect of the design of the SENSEVAL task.3 He also addresses the larger question of the usefulness of WSD for complete NLP systems and notes that Kilgarriff is associated with a sceptical view, which sits oddly for one organising SENSEVAL:

    There need be no contradiction there, but a fascinating question about motive lingers in the air. Has he set all this up so that WSD can destroy itself when rigorously tested? . . . [the issue goes] to the heart of what the SENSEVAL workshop is for: is it to show how to do better at WSD, or is it to say something about word sense itself?

Let me (Kilgarriff) take this opportunity to respond. SENSEVAL is, from one point of view, an experiment designed to replace scepticism about both the reality of word senses and the effectiveness of WSD, by percentages. It answers some simple, quantitative questions: what is the upper bound for human inter-tagger agreement (95%), and at what level do state-of-the-art systems perform (75–80%)? (Both answers are relative to a fine-grained, corpus-based dictionary; see Kilgarriff and Rosenzweig, this volume, for discussion.) SENSEVAL provided a clear picture of the types of systems that performed best (the ‘empiricist’ methods, using as much training data as was available) and, as a side-product, provided an extensive sense-tagged corpus where instances that had given rise to tagger disagreement could be identified for further analysis (Kilgarriff, 2000). We return to the relation between SENSEVAL and the usefulness of WSD in complete NLP systems in the next section.

4. Responses to Criticisms

Given our conscious similarity to the DARPA quantitative evaluation paradigm, the recurring criticisms of it are the first ones to be addressed. These are as follows:4
1. It discourages novel approaches and risk taking, since the focus is on improving the error rate. This can be done most reliably by duplicating the familiar methods that are currently scoring best.
2. There is a substantial overhead involved both in setting up the evaluations and in participating in them.
3. It encourages a competitive (as opposed to collaborative) ethos.
4. Unless the tasks are carefully chosen to focus on the fundamental problems in the field, they will draw energy away from those problems.

The first criticism cannot hold of a first evaluation of a given task (and is unlikely to apply unless the evaluation becomes a substantial undertaking with reputations hanging on the outcome). Indeed, the informal flavour of SENSEVAL fostered experimentation and diversity.

The second also does not apply to this first, small-scale evaluation (where much was done on goodwill) but is likely to apply to future, hopefully larger-scale evaluations. The case will have to be made that the substantial costs reap commensurate benefits. There are of course many precedents for this; as Hirschman (1998) says,

    Evaluation is itself a first-class research activity: creation of effective evaluation methods drives rapid progress and better communication within a research community. (pp. 302–303)

The third is a concern that was discussed at length in the course of SENSEVAL, particularly in relation to the question, should the full set of results be made public? This would potentially embarrass research teams whose systems did not score so well, and may deter people from participating in the future. It was eventually agreed that, given the early stage of maturity of the field, the merits of having all results in the open outweighed the risks, but not without dissenters.

In more general terms, our experience has been that of other DARPA evaluations: both the fellow-feeling that comes of working on the same problem and the modest dose of competitive tension have been productive.

The last criticism demands much fuller discussion, and lies at the heart of evaluation design. It was the third fundamental question that we were hoping to address: is sense tagging a useful level of semantic representation, and what are the prospects for WSD improving overall system performance for various NLP applications? One critic of the process chose not to participate because, in their system, WSD occurred as a byproduct of deeper reasoning. It would not make sense to participate in an exercise that treated WSD as of interest in its own right. They were engaged in a harder task, so had no inclination to work on intermediate outputs as defined by an easier task. The sense distinctions that needed making would also only be identified in the course of specifying the overall NLP system outputs, so taking them from a dictionary was not a relevant option (see also Kilgarriff, 1997). The question recurs in the evaluation literature, as, for any subtask, the validity of evaluation is contingent on the validity of the analysis that identifies the subtask as a distinct process (Palmer et al., 1990; Sparck Jones and Galliers, 1996; Gaizauskas, 1998).

Despite being theory-dependent in this way, subtask evaluations can clearly be of great value. Evaluations focused on end results (which are often also user-oriented) tend not to help developers determine the contributions of individual components of a complex system. Thus parsing is generally agreed upon as a separable NLP task, and evaluations associated with the Penn Treebank have emphasised syntactic parsing as a separate component. The focus has resulted in significantly improved parsing performance (even though re-integrating these improved parsers into NLP applications is itself a non-trivial task that has yet to be achieved).

SENSEVAL can be seen as an experiment to test the hypothesis that “WSD is a separable NLP subtask”. It would seem that some parts of the task, such as homograph resolution, can be effectively addressed with nothing more than shallow-processing WSD techniques, while others, such as metaphor resolution, require full-fledged NLP. Results suggest that at least 75% of the task could usefully be allocated to a shallow-processing WSD module, and that at least 5% could not. Although we may have demonstrated that WSD can be defined as a separate task, we have not established that good WSD performed as a separate stage of processing can improve the overall performance of an NLP application such as IR or MT. Indeed, the difficulty of demonstrating the positive impact of natural language processing subcomponents on Information Retrieval has dogged the field for decades. These subcomponents, whether they perform noun phrase chunking or WSD, may show improved performance on their individual subtasks, but they have little effect on overall task performance (Buckley and Cardie, 1998; Voorhees, 1999). Machine Translation and cross-linguistic IR would seem more promising areas for illustrating the benefit of WSD. A clear demonstration would require establishing the baseline performance of a given NLP system, and then showing a significant percentage improvement in those figures when WSD is added. For instance, specific lexical items can be highlighted in a Machine Translation task, and the number of errors in translation of these items, both with and without WSD, calculated. Future SENSEVALs must address this issue more directly.


5. Towards Future SENSEVALs

SENSEVAL participants were enthusiastic about future SENSEVALs, with several provisos. Some wanted evaluation on texts with all content words tagged. General NLP systems that perform WSD on the route to a comprehensive semantic representation need to disambiguate every word in the sentence, so, for people with this goal on their medium-term horizon, an evaluation which looked only at corpus instances of selected words missed the central issue. Also, it seems likely that tag assignments are mutually constraining: only data with tags for several of the words in each sentence can pinpoint the interactions. A pilot study for the tagging of running text with revised WordNet senses was presented at SIGLEX99 and positively received (Palmer et al., 2000).

Participants also wanted confirmation that the senses they were distinguishing were relevant to some type of NLP task, such as Information Retrieval or Machine Translation. (There is a close overlap between this concern and the goal of confirming WSD as a separable NLP subtask, as discussed above.) At the Herstmonceux workshop, we resolved to tie WSD more closely to Machine Translation, and to attempt to use sense inventories which were appropriate for Machine Translation tasks. The foundational work of Resnik and Yarowsky (1997, 1999) and Ide (this volume) on clustering together monolingual usages based on similar translations provides a preliminary framework. It is of course well known that languages often share several senses for single lexical items that are translations of each other, and translation simply preserves the ambiguities. Conversely, different translations in another language do not always correlate with a valid sense distinction in the source language (Palmer and Wu, 1995). Having the same translation does not ensure sense identity, and having separate translations does not ensure sense distinctions. However, multiple translations of a single word can provide objective evidence for possible sense distinctions, and, given our current state of knowledge, any such evidence is to be embraced.

6. Conclusion

This special issue provides an account of SENSEVAL, the first open, community-based evaluation for WSD programs. There were tasks for three languages, and 23 research teams participated. By making direct comparisons between systems possible, and by forcing a level of agreement on how the task should be defined, the exercise sharpened the focus of WSD research.

The volume contains detailed accounts of how the evaluation exercises were set up, and the results.


Most of the participating systems are described, and there are position papers on several of the difficult issues surrounding WSD and its evaluation: what word senses are, how they should be identified, and how separable from a particular application context the WSD task, and any specific sense inventory, will ever be. As this introduction conjectures, for some of these questions, the outcomes from SENSEVAL can be seen as quantitative answers. We hope that SENSEVAL, and this volume, will provide a useful reference point for future SENSEVALs and other future WSD research worldwide.

Acknowledgements

We would like to thank Cambridge University Press, EPSRC (grant M03481), ELRA (European Linguistic Resources Association), the European Union (DG XIII), Longman Dictionaries and Oxford University Press for their assistance in goods and kind with the SENSEVAL exercise. We would also like to thank Carole Tiberius for her role in organising the workshop.

Resources available: see website http://www.itri.brighton.ac.uk/events/senseval

Notes
1 The name is due to David Yarowsky.
2 For the purposes of this table, ‘research teams’ are treated as distinct if they are responsible for different systems, and the different systems have different writeups, even if the individuals overlap.
3 For the case for the lexical sample approach, see section 2 of Kilgarriff and Rosenzweig, this volume.
4 For discussion see Sproat et al. (1999).

References

Buckley, C. and C. Cardie. “EMPIRE and SMART Working Together”. Presentation at the DARPA/Tipster 24-Month Meeting, 1998.
Gaizauskas, R. “Evaluation in Language and Speech Technology: Introduction to the Special Issue”. Computer Speech and Language, 12(4) (1998), 249–262.
Gale, W., K. Church and D. Yarowsky. “Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs”. In Proceedings, 30th ACL. 1992, pp. 249–256.
Hirschman, L. “The Evolution of Evaluation: Lessons from the Message Understanding Conferences”. Computer Speech and Language, 12(4) (1998), 281–307.
Ide, N. and J. Véronis. “Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art”. Computational Linguistics, 24(1) (1998), 1–40.
Kilgarriff, A. “Foreground and Background Lexicons and Word Sense Disambiguation for Information Extraction”. In Proc. Workshop on Lexicon Driven Information Extraction. Frascati, Italy, 1997, pp. 51–62.
Kilgarriff, A. “Generative Lexicon Meets Corpus Data: The Case of Non-Standard Word Uses”. In Word Meaning and Creativity. Ed. P. Bouillon and F. Busa. Cambridge: Cambridge University Press, forthcoming, 2000.


Palmer, M., H.T. Dang and J. Rosenzweig. “Sense Tagging the Penn Treebank”. Submitted to the Second Language Resources and Evaluation Conference. Athens, Greece, 2000.
Palmer, M., T. Finin and S. Walters. “Evaluation of Natural Language Processing Systems”. Computational Linguistics, 16(3) (1990), 175–181.
Palmer, M. and Z. Wu. “Verb Semantics for English-Chinese Translation”. Machine Translation, 10 (1995), 59–92.
Resnik, P. and D. Yarowsky. “A Perspective on Word Sense Disambiguation Methods and Their Evaluation”. In Tagging Text with Lexical Semantics: Why, What and How? Ed. M. Light. Washington, 1997, pp. 79–86.
Resnik, P. and D. Yarowsky. “Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation”. Natural Language Engineering Journal, to appear.
Sparck Jones, K. and J. Galliers. Evaluating Natural Language Processing Systems: An Analysis and Review. Berlin: Springer-Verlag, 1996.
Sproat, R., M. Ostendorf and A. Hunt. “The Need for Increased Speech Synthesis Research”. Report of the 1998 NSF Workshop for Discussing Research Priorities and Evaluation Strategies in Speech Synthesis, 1999.
Voorhees, E.M. “Natural Language Processing and Information Retrieval”. In Proceedings of the Second Summer School on Information Extraction. Springer-Verlag, Lecture Notes in Artificial Intelligence, 1999.
Yarowsky, D. “Word-Sense Disambiguation Using Statistical Models of Roget’s Categories Trained on Large Corpora”. In COLING 92. Nantes, 1992.

Computers and the Humanities 34: 15–48, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Framework and Results for English SENSEVAL

A. KILGARRIFF1 and J. ROSENZWEIG2
1 ITRI, University of Brighton, Brighton, UK; 2 University of Pennsylvania, Pennsylvania, USA

Abstract. SENSEVAL was the first open, community-based evaluation exercise for Word Sense Disambiguation programs. It adopted the quantitative approach to evaluation developed in MUC and other ARPA evaluation exercises. It took place in 1998. In this paper we describe the structure, organisation and results of the SENSEVAL exercise for English. We present and defend various design choices for the exercise, describe the data and gold-standard preparation, consider issues of scoring strategies and baselines, and present the results for the 18 participating systems. The exercise identifies the state of the art for fine-grained word sense disambiguation, where training data is available, as 74–78% correct, with a number of algorithms approaching this level of performance. For systems that did not assume the availability of training data, performance was markedly lower and also more variable. Human inter-tagger agreement was high, with the gold standard taggings being around 95% replicable.

Key words: evaluation, SENSEVAL, word sense disambiguation

1. Introduction

In this paper we describe the structure, organisation and results of the SENSEVAL exercise for English. The architecture of the evaluation was as in MUC and other ARPA evaluations (Hirschman, 1998). First, all likely participants were invited to express their interest and participate in the exercise design. A timetable was worked out. A plan for selecting evaluation materials was agreed. Human annotators were set on the task of generating a set of correct answers, the ‘gold standard’. The gold standard materials, without answers, were released to participants, who then had a short time to run their programs over them and return their sets of answers to the organisers. The organisers then scored the answers, and the scores were announced and discussed at a workshop.

Setting up the exercise involved a number of choices – of task, corpus and dictionary, words to be investigated and relation to word class tagging. In sections 2–5, we describe the theoretical and practical considerations and the choices that were made. In the following sections we describe the data, the manual tagging process (including an analysis of inter-tagger agreement), the scoring regime, the participating systems, and the baselines. Section 11 presents the results.


Section 12 considers the relations between polysemy, entropy and task difficulty, and section 13, an experiment in pooling the results of different systems. The first three Appendices briefly describe the three systems for which there is no full paper in the Special Issue, and the fourth presents samples of the dictionary entries and corpus instances used in SENSEVAL.

A note on terminology: in the following, a ‘corpus instance’ or ‘instance’ is an instance of a word occurring in context in a corpus, or, a particular token of the word. A ‘word’ is a word type, or lexical word. Thus the sentence Dog eats dog contains two, not three, words.

2. Choice of Task: ‘All-Words’ vs. ‘Lexical-Sample’

Evidently, the task was word sense disambiguation (WSD), in English. Two variants of the WSD task are ‘all-words’ and ‘lexical-sample’. In all-words, participating systems have to disambiguate all words (or all open-class words) in a set of texts. In lexical-sample, first, a sample of words is selected. Then, for each sample word, a number of corpus instances are selected. Participating systems then have to disambiguate just the sample-word instances.

For SENSEVAL, the lexical-sample variant was chosen. The reasons were linked with issues of dictionary choice and corpus choice. They included the following:
− Cost-effectiveness of tagging: it is easier and quicker for humans to sense-tag accurately if they concentrate on one word, and tag multiple occurrences of it, than if they have to focus on a new dictionary entry for each word to be tagged.
− The all-words task requires access to a full dictionary. There are very few full dictionaries available (for low or no cost), so dictionary choice would have been severely limited. The lexical-sample task required only as many dictionary entries as there were words in the sample.
− Many of the systems interested in participating could not have participated in the all-words task, either because they needed sense-tagged training data (see also below) or because they needed some manual input to augment the dictionary entry for each word to be disambiguated.
− It would be possible for systems designed for the all-words task to participate in the lexical-sample task, whereas the converse was not possible (except for a hopelessly small subset of the data). A system that tags all words does, by definition, tag a subset of the words.
− Provided the sample was well-chosen, the lexical-sample strategy would be more informative about the current strengths and failings of WSD research than the all-words task. The all-words task would provide too little data about the problems presented by any particular word to sustain much analysis.1

2.1. A question of timing

All-words systems can participate in the lexical-sample task, but at a disadvantage. The disadvantage would be substantially offset if the words in the lexical sample were not announced prior to the distribution of the evaluation material. Then, it would be possible for supervised learning systems to participate and to exploit training materials, but there would not be time for non-automatic tailoring of systems to the particular problems presented by the words in the sample. This strategy was considered, and was partially adopted, with the words being announced just two weeks (in principle) before the test data was released. The constraints on its adoption were both practical and theoretical:
− Systems such as CLRES and UPC-EHU2 perform extensive analyses of dictionary definitions. The software needs to be adapted to work with the particular dictionary format. For these systems to participate, a substantial sample of entries was required for porting the system to the new dictionary. To this end, a set of ‘dry run’ dictionary entries was distributed early. It was however possible that the forty lexical entries in the dry-run sample did not exhibit the full range of dictionary-formatting phenomena found in the thirty-five evaluation sample entries.
− The organisers did not share the assumption of some researchers that manual input, for the lexical entry of each word to be disambiguated, should be viewed as illegitimate. One high-performing system (DURHAM) owed some of its accuracy to what was, in effect, additional lexicography undertaken for the words in the evaluation sample. Harley and Glennon (1997) describe a high-quality WSD system built on the basis of telling lexicographers to put into the dictionary the information that would be required for WSD. The objection to this approach is economic: there are vast numbers of ambiguous words, so it is too expensive. That need not be so. As Moon (this volume) shows, the number of words requiring disambiguation in English is in the order of 10,000: if each requires fifteen minutes of human input, the whole lexicon calls for around two person-years, which is no more than many WSD systems have taken to design and build. The customer for a WSD system will be interested in its performance, not the purity of its knowledge-acquisition methods.
− In practice, it was not viable to draw a line between legitimate ‘debugging’ and possibly illegitimate ‘manual system enhancement’. Nor was it possible to set the deadlines very tightly, given the usual complications of conflicting deadlines, absences from the office, etc. ‘Manual system enhancement’ could not be severely constrained by time limits.


3. Choice of Dictionary and Corpus

The HECTOR lexical database was chosen. HECTOR was a joint Oxford University Press/Digital project (Atkins, 1993) in which a database with linked dictionary and corpus was developed. For a sample of words, dictionary entries were written in tandem with sense-tagging all occurrences of the word in a 17M-word corpus (a pilot for the British National Corpus3). The sample of words comprised those items with between 300 and 1000 instances in the corpus. The tagger-lexicographers were highly skilled and experienced. There was some editing, with a second lexicographer going through the work of the first, but no extensive consistency checking.

The primary reason for the choice was a simple one. At the time when a choice was needed, it was not evident whether there was any funding available for manual tagging. Had funding not been forthcoming, then, with the HECTOR data, it would still have been possible to run SENSEVAL, as corpus instances had been manually tagged in the HECTOR project. (In the event, there was funding,4 and all evaluation data was doubly re-tagged. Un-re-tagged HECTOR data was used for the training dataset.) The resource has been offered for use under licence in SENSEVAL, without charge, by Oxford University Press.

There was one other possible source of already tagged data: the SEMCOR corpus, tagged according to WordNet senses (Fellbaum, 1998). However, SEMCOR was already widely used in the WSD community, so it could not provide ‘unseen’ data for evaluation. Also, it had been tagged according to an all-words strategy, so would have pointed to an all-words evaluation.

Supplementary reasons for choosing the HECTOR data were:
− The dictionary entries were fuller than in most paper dictionaries or WordNet, and this was likely to be beneficial for WSD.
− The lexicography was highly corpus-driven, and was thus (arguably) representative of the kind of lexicography that is likely to serve NLP well in the future.
− No previous WSD work had used HECTOR, so no WSD team was at a particular advantage.
− The corpus was of general English. It had been decided at a previous ACL SIGLEX meeting (Kilgarriff, 1997) that WSD evaluation should aim to use general language rather than a specific domain.

One disadvantage of the HECTOR corpus material, in the form in which it was received from OUP, was that corpus instances were associated with very little context: generally two sentences and sometimes just one sentence. Strategies for gleaning information from a wider context would not show their strength.


4. Lexicon Sampling

A criticism of earlier forays into lexical-sample WSD evaluation is that the lexical sample had been chosen according to the whim of the experimenter (or to coincide with earlier experimenters’ selections). For SENSEVAL, a principled approach based on a stratified random sample was used. A simple random sample of polysemous words would have been inappropriate since, given the Zipfian distribution of word frequencies, most or all of the sample would have been of low-frequency words. High-frequency words are both intrinsically more significant (as they account for more word tokens) and tend to present a more challenging WSD problem (as there is a high correlation between frequency and semantic complexity).

For English SENSEVAL, a sampling frame was devised in which words were classified according to their frequency (in the BNC) and their polysemy level (in WordNet). For each word class under consideration (noun, verb, adjective), frequency and polysemy were divided into four bands, giving a 4 × 4 grid. A sample size of 40 words was then set (for both dry-run and evaluation samples). The sample was divided between the grid cells according to (1) the number of words in the grid cell and (2) the proportion of corpus tokens they accounted for. We were constrained to use HECTOR words, so we then took a random sample of the required size of the HECTOR words in each grid cell. (For some grid cells, there were not enough HECTOR words, so substitutes were taken from other cells.)5

The number of gold-standard corpus instances per word was also based on the grid. For simpler words (with lower frequency and polysemy) a smaller number was appropriate. Higher-frequency or more polysemous words tend to be more complex and harder for WSD, so more data was needed. Different grid cells were associated with different numbers of corpus instances per word type, from 160, for the least common and polysemous words, to 400, for the most.
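As an illustration of the sampling procedure, the sketch below draws a stratified sample over a frequency × polysemy grid. It is a hypothetical re-implementation under simplifying assumptions: the function and variable names are ours, the band cutoffs are placeholders, and the cell allocation here is weighted by token counts only, whereas the actual allocation also took account of the number of words in each cell.

```python
import random

def band(value, cutoffs):
    """Return the index of the band a value falls into, given ascending cutoffs."""
    for i, cut in enumerate(cutoffs):
        if value < cut:
            return i
    return len(cutoffs)

def stratified_sample(words, freq, senses, n_total, freq_cuts, poly_cuts, seed=0):
    """words: candidate (HECTOR) words; freq: word -> BNC frequency;
    senses: word -> WordNet polysemy. Returns a sample of roughly n_total
    words spread over the 4 x 4 grid, each cell's share here being
    proportional to the corpus tokens it accounts for."""
    rng = random.Random(seed)
    cells = {}
    for w in words:
        key = (band(freq[w], freq_cuts), band(senses[w], poly_cuts))
        cells.setdefault(key, []).append(w)
    total_tokens = sum(freq[w] for w in words)
    sample = []
    for members in cells.values():
        share = sum(freq[w] for w in members) / total_tokens
        n_cell = max(1, round(n_total * share))
        n_cell = min(n_cell, len(members))  # short cells: substitutes came from other cells
        sample.extend(rng.sample(members, n_cell))
    return sample
```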

5. Gold-Standard Specifications

5.1. Word class (and part-of-speech tagging): words and tasks

Word class issues complicated the task definition. The primary issue was: was the assignment of word class (POS-tagging) to be seen as part of the WSD task? In brief, the argument for was that, in any real application, word sense tagging and POS-tagging will be closely related, with each potentially providing constraints on the other. The argument against was ‘divide and rule’: POS-tagging is a distinct sub-area of NLP, with its own strategies and issues, and (arguably) a high accuracy rate, so it was best kept out of the equation, the better to focus on WSD performance. A previous SIGLEX meeting had seen a majority in favour of decoupling, but no unanimity.

For English SENSEVAL, for most of the evaluation words, the tasks were decoupled, with the part-of-speech (noun, verb or adjective) of the corpus instance specified by the organisers as part of the input to the WSD task. However, for five words, the tasks were not decoupled, so participating systems had to assign a sense without prior knowledge of word class.

This gave rise to a distinction between words and ‘tasks’. Each SENSEVAL task was identified by a word and either a word class (noun, verb or adjective) or p for ‘part of speech not provided’. The task name comprised the word and one of -n, -v, -a or -p. Some words were associated with more than one task, e.g. sack has sack-n and sack-v.6 Thus there are both words that occur with different parts of speech in different tasks, and words that occur with unspecified part of speech in a single -p task. The evaluation sample comprised 34 words and 41 tasks.7

The manual taggers assigned word class as well as sense tag so that, for example, a corpus instance of sack could be allocated to either the sack-n or sack-v task. Most of the time this was straightforward, but there were exceptions, notably gerunds (his sanctioning of the initiative), participles (severely shaken, he . . . ) and modifiers (bitter beer). Gerund instances were taken out of the -v tasks, as they were not verbal. Participles and nominal modifiers revealed a deeper issue. It was a useful simplifying assumption that lexical word class matched corpus-instance word class, but there were exceptions. Thus verbal float had a ‘sound’ sense, “to be heard from a distance”, and adjectival floating had no corresponding sense, yet the instance the floating melody reached even the Vizier’s ears was clearly an adjectival use of the ‘sound’ sense. In the gold standard there are a very small number of instances where there is a mismatch between the word class of the corpus instance and the word class of the semantically closest word sense.

5.2. Proper names

Straightforward proper-name instances were not included in the gold standard materials. There were, however, also a number of instances where the word was being used in one of its standard senses within a proper name. Thus the Cheltenham Hurdle is a hurdle race, and Brer Rabbit is a rabbit. These cases were included in the gold standard, with the complete correct answer having two parts: the appropriate sense for hurdle or rabbit, and the proper-name tag, PROPER, which was available for all words.

5.3. Other difficult cases

For cases where more than one word sense applied, or appeared equally valid, or there was insufficient context to say which applied, the gold standard specifies all salient senses. Where none of the HECTOR senses fit, the gold standard states “unassignable”, with the universally available tag UNASS. For ‘exploitations’, where the use is related to one of the senses in some way but does not directly match it, the gold standard specifies both the sense and UNASS. (In the taggers’ first pass, there was a finer-grained analysis of the misfit categories, but for WSD evaluation, a scheme simple enough to score by was required.) For the taggers’ perspective on the exercise, and the instances that made the work difficult and interesting, see Krishnamurthy and Nicholls (this volume).


Table I. Dry run data: words and numbers of instances

attribute      364    bake         346    beam         337    boil         567
brick          586    bucket       174    cell         698    civilian     582
collective     495    comic        502    complain    1116    confine      586
connect        516    cook        1922    creamy       101    curious      465
dawn           551    drain        578    drift        515    expression   917
govern         593    impress      641    impressive   711    intensify    234
layer          492    lemon        225    literary     690    overlook     437
port           874    provincial   373    raider       164    sick         639
spite          577    storm        763    sugar        855    threaten     307
underground    519    vegetable    636

6. The Data

There were three data distributions. The target dates were:
end April: dry-run data
end June: training data
mid July: evaluation data

6.1. Dry-run data

The dry-run data comprised lexical entries and hand-tagged corpus instances, and was sampled in the same way as the training and evaluation data. It could be used to adapt systems to the format and style of data that would be used for evaluation. It comprised the words and associated numbers of instances shown in Table I.

6.2. Training data

The training distribution comprised lexical entries and hand-tagged corpus instances for the lexical sample that was to be used for evaluation. The lexical entries were provided so that participants could ensure that their systems could parse and exploit the dictionary entries, and add to them where necessary (see the discussion on timing above).


Table II. Evaluation tasks and dataset sizes

Nouns -n            Verbs -v            Adjectives -a       Indeterminates -p
accident     267    amaze        70     brilliant    229    band         302
behaviour    279    bet1        177     deaf2        122    bitter       373
bet1         274    bother      209     floating1     47    hurdle2      323
disability2  160    bury        201     generous     227    sanction     431
excess       186    calculate   217     giant1        97    shake        356
float1        75    consume     186     modest       270
giant1       118    derive      216     slight       218
knee         251    float1      229     wooden       195
onion        214    invade      207
promise1     113    promise1    224
rabbit2      221    sack1       178
sack1         82    scrap1      186
scrap1       156    seize       259
shirt        184
steering2    176
TOTAL       2756    TOTAL      2501     TOTAL       1406    TOTAL       1785

1 Multiple tasks for these words: training data shared.
2 No training data for these items.

The corpus instances were provided so that supervised-training systems could be trained for the words in the lexical sample. For five words there was no training data (see Table II), and for the remainder, the quantity varied widely, between 26 and 2008 instances, depending simply on how many were available. In both dry-run and training data, corpus instances were provided complete with the sense-tag that had been assigned as part of the original HECTOR tagging, but there had been no re-tagging. Unlike the evaluation data, there was no explicit information on word class, though this was deducible from the sense-tag with over 99% accuracy.8

6.3. Evaluation data

The evaluation distribution simply contained a set of corpus instances for each task. Each had been tagged by at least three humans, though these tags were, of course, not part of the distribution. (It did not contain lexical entries because they were already available in the training distribution.)


Examples of lexical entries and corpus instances are included in Appendix 4. Lexical entries were distributed in their native format, minimally structured SGML, with a utility to convert them into LaTeX and thereby produce output of the form shown in Appendix 4. Corpus entries were distributed as ASCII texts, with the word to be tagged indicated by a tag, each instance having a six-digit reference number (starting with 7, unique within a given task), one sentence on each line, and instances separated by an empty line. There were 8448 corpus instances in total in the evaluation data. The tasks and associated quantities of data are presented in Table II.
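A minimal reader for this corpus-instance format might look like the sketch below. It is our own illustration: it assumes that the reference number opens the first line of each instance, and it uses a placeholder string for the target-word markup, since the exact SGML tag is specified in the SENSEVAL distribution rather than in this paper.

```python
def read_instances(path, target_tag="<tag>"):
    """Read instances from an ASCII file in the format described above:
    one sentence per line, instances separated by an empty line, and a
    six-digit reference number (starting with 7) assumed to open the
    first line of each instance."""
    instances, current = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line ends an instance
                if current:
                    instances.append(current)
                    current = []
            else:
                current.append(line)
    if current:
        instances.append(current)
    parsed = []
    for inst in instances:
        ref = inst[0].split()[0]         # assumed position of the reference number
        text = " ".join(inst)
        parsed.append({"ref": ref, "text": text,
                       "has_target": target_tag in text})
    return parsed
```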

6.4. WORDNET MAPPING

For participants whose systems output WordNet senses, a mapping from WordNet senses to HECTOR senses was provided. As previous work on sense-mapping has consistently found (e.g., Byrd et al., 1987), the result is not altogether satisfactory, with gaps and with one-to-many and many-to-many mappings.

6.5. SPECIFICATIONS FOR RETURNING RESULTS

Systems were required to return, for scoring, a one-line answer for each corpus instance for which they were returning a result. A line comprised:
1. the task;
2. the reference number for the instance;
3. one or more sense tags, optionally with associated probabilities. Where there were no numbers, the probability mass was shared equally between all listed tags.

7. Gold Standard Preparation: Manual Tagging

The preparation of the gold standard included:
− obtaining funding to pay taggers;
− selecting individuals;
− selection of materials, including weeding out anomalous items;9
− preparation of detailed tagging instructions, including fine-grained definition of the evaluation task in relation to, e.g., word class, proper names and hard-to-tag cases, and of data formats for distributing work to taggers and for them to return their answer keys;
− sending out data to taggers;
− processing returned work to identify those cases where there was unanimity amongst taggers, and those where there was not (so arbitration was required);
− administration of the arbitration phase.

All stages were completed between March and August 1998.

7.1. INTER-TAGGER AGREEMENT AND REPLICABILITY

Preparation of a gold standard worthy of the name was critical to the validity of the whole SENSEVAL exercise. The issue is discussed in detail in (Gale et al., 1992) and (Kilgarriff, 1998). A gold standard corpus must be replicable to a high degree: the taggings must be correct, and they can only be deemed correct if different individuals or teams tagging the same instance dependably arrive at the same tag.

Gale et al. identify the problem as one of identifying the 'upper bound' for the performance of a WSD program. If people can only agree on the correct answer x% of the time, a claim that a program achieves more than x% accuracy is hard to interpret, and x% is the upper bound for what the program can (meaningfully) achieve. There have been some discussions as to what this upper bound might be. Gale et al. review a psycholinguistic study (Jorgensen, 1990) in which the level of agreement averaged 68%. But an upper bound of 68% is disastrous for the enterprise, since it implies that the best a program could possibly do is still not remotely good enough for any practical purpose. Even worse news comes from (Ng and Lee, 1996), who re-tagged parts of the manually tagged SEMCOR corpus (Fellbaum, 1998): the taggings matched only 57% of the time.

For SENSEVAL, it was critical to achieve a higher replicability figure. To this end, the individuals to do the tagging were carefully chosen: whereas other tagging exercises had mostly used students, SENSEVAL used professional lexicographers. A dictionary which would facilitate accurate tagging was selected. Taggers were encouraged to give multiple tags (one of which might be UNASS), rather than make a hard choice, where more than one tag was a good candidate. Finally, the material was multiply tagged and an arbitration phase was introduced: first, two or three lexicographers provided taggings; then, any instances where these taggings were not identical were forwarded to a further lexicographer for arbitration.

At the time of the SENSEVAL workshop, the tagging procedure (including arbitration) had been undertaken once for each corpus instance. Individual lexicographers' initial pre-arbitration results were scored against the post-arbitration results, using the same scoring algorithm as for system scores. The scores ranged from 88% to 100%, with just five out of 122 results for pairs falling below 95%.

To determine the replicability of the whole process in a thoroughgoing way, the exercise was repeated for a sample of four of the words. The words were selected to reflect the spread of difficulty: we took the word which had given rise to the lowest inter-tagger agreement in the previous round (generous, 6 senses), the word that had given rise to the highest (sack, 12 senses), and two words from the middle of the range (onion, 5, and shake, 36). The 1057 corpus instances for the four words were tagged by two lexicographers who had not seen the data before; the non-identical taggings were forwarded to a third for arbitration. These taggings were then compared with the ones produced previously.


Table III. Replicability of manual tagging

  Word        Inst      A      B    Agr %
  generous     227     76     68     88.7
  onion        214     10     11     98.9
  sack         260      0      3     99.4
  shake        356     35     49     95.1
  ALL         1057    121    131     95.5

Table III shows, for each word, the number of corpus instances (Inst), the number of multiply-tagged instances in each of the two sets of taggings (A and B), and the level of agreement between the two sets (Agr). There were 240 partial mismatches, with partial credit assigned, in contrast to just 7 complete mismatches. For evidence of the kinds of cases on which there were differences of taggings, see Krishnamurthy and Nicholls (this volume). This was a most encouraging result, which showed that it was possible to organise manual tagging in a way that gave rise to high replicability, thereby validating the WSD enterprise in general and SENSEVAL in particular.

8. Scoring

Three granularity levels for scoring were defined. At the fine-grained level, only identical sense tags counted as a match. At the coarse-grained level, all subsense tags (corresponding to codes such as 1.1 or 2.1) were assimilated to main sense tags (corresponding to codes such as 1 or 2) in both the answer file and the key file, so a guess of 1.1 in the answer file counts as an exact match of a correct answer of 1, 1.1 or 1.2 in the key. At the third, 'mixed-grain' level, full credit for a guess was awarded if it was subsumed by an answer in the key file, and partial credit if it subsumed such an answer, as described in Melamed and Resnik (this volume; hereafter MR).

For many instances in HECTOR, it does seem appropriate to give credit for a sense when the correct answer is a subsense of that sense, and vice versa; but in others it does not. Consider HECTOR's sense 1 of shake, MOVE, defined as 'to move (someone or something) forcefully or quickly up and down'. Sense 1.1, CLEAN, is 'to remove (a substance, dirt, object etc.) from something by agitating it', and it does seem appropriate to give credit where sense 1.1 is given for 1 or vice versa. But sense 1.2, DUST, is 'to leave that place or abandon that thing for ever', as in shaking the dust of Kingston off her feet forever.


While the etymological link to senses 1 and 1.1 is evident, the difference in meaning is such that it seems quite inappropriate to assign credit to a guess of 1.2 where the correct answer was 1. The validity of subsuming subsenses under main senses remains open to question. In the event, the choice of scoring scheme made little difference to the relative scores of different systems, or of systems on different tasks. Except where explicitly noted, the remainder of the paper refers only to fine-grained scores.

Where a system returned several answers, it was assumed that the probability mass was shared between them, and credit was assigned as described in MR.10 All the scoring policies make the MR assumption that there is exactly one correct answer for each instance. This is so even though provision is made for multiple answers in the answer key, because these answers are viewed disjunctively; that is, the interpretation is that any of them could be the correct answer, not that the correct answer comprises all of them. It is hard to determine on a general basis whether a given instance of multiple tags in the key should be interpreted conjunctively or disjunctively (see also Calzolari and Corazzari, this volume).

The precision or performance of a system is computed by summing the scores over all test items that the system guesses on, and dividing by the number of guessed-on items. Recall is computed by summing the system's scores over all items (counting unguessed-on items as a zero score), and dividing by the total number of items in the evaluation dataset or subtask of the evaluation. These measures may be viewed as the expected precision and recall of the system in a simpler testing situation where only one answer for each question may be returned, and where each answer either matches the key exactly or does not match it at all.11
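As a concrete illustration, the following sketch implements the fine- and coarse-grained scoring and the precision and recall measures just described. It is not the official scorer: the mixed-grain scheme of MR is not reproduced, answer weights are assumed to have already been normalised to sum to one (see note 10), multiple key tags are treated disjunctively, and the data structures are illustrative.

```python
def coarsen(tag):
    """Assimilate a subsense code such as "1.1" to its main sense code "1"."""
    return tag.split(".")[0]

def item_score(guess, key, grain="fine"):
    """guess: dict of sense tag -> probability (summing to one);
    key: set of acceptable tags for the instance, read disjunctively."""
    if grain == "coarse":
        key = {coarsen(t) for t in key}
        merged = {}
        for tag, p in guess.items():
            merged[coarsen(tag)] = merged.get(coarsen(tag), 0.0) + p
        guess = merged
    return sum(p for tag, p in guess.items() if tag in key)

def precision_recall(answers, keys):
    """answers: dict item_id -> guess, for guessed-on items only;
    keys: dict item_id -> key set, for every item in the (sub)task."""
    scores = {item: item_score(answers[item], keys[item]) for item in answers}
    precision = sum(scores.values()) / len(answers) if answers else 0.0
    recall = sum(scores.values()) / len(keys)
    return precision, recall
```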

9. Systems

The 18 systems which returned results are shown in Table IV.12 Systems differ greatly in terms of the input data they require and the methodology they employ. This makes comparisons particularly odious, but, to make the comparisons marginally more palatable, they were classified into two broad categories: the supervised systems, which needed sense-tagged training instances of each word they were to disambiguate, and the ones which did not, hereafter 'unsupervised'.13 The scheme is a first pass, and various classifications seem anomalous. Some supervised systems are also equipped to fall back on alternative tagging strategies in the absence of an annotated training corpus, while some unsupervised systems default to a frequency-based guess if information from a training corpus is available. Systems such as SUSS and CLRES were in principle unsupervised, but used the training data (as well as the dry-run data) to debug and improve the configuration of their programs. We use the scheme to simplify the presentation of results, but ask the reader to treat it indulgently.


Table IV. Participating systems for English

  Group                              Contact      Shortname
  Unsupervised
    CL Research, USA                 Litkowski    clres
    Tech U Catalonia, Basque U       Agirre       upc-ehu-un
    U Ottawa                         Barker       ottawa
    U Manitoba                       Lin          mani-dl-dict
    U Sunderland                     Ellman       suss
    U Sussex                         McCarthy     sussex
    U Sains Malaysia                 Guo          malaysia
    XEROX-Grenoble, CELI, Torino     Segond       xeroxceli
  Post-workshop results only
    CUP/Cambridge Lang Services      Harley       cup-cls
  Supervised
    Bertin, U Avignon                de Loupy     avignon
    Educ Testing Service, Princeton  Leacock      ets-pu
    Johns Hopkins U                  Yarowsky     hopkins
    Korea U                          Ho Lee       korea
    New Mex State, UNC Asheville     O'Hara       grling-sdm
    Tech U Catalonia, Basque U       Agirre       upc-ehu-su
    U Durham                         Hawkins      durham
    U Manitoba                       Suderman     manitoba-ks
    U Manitoba                       Lin          manitoba-dl
    U Tilburg                        Daelemans    tilburg

All systems are described by their authors in this Special Issue, either in a paper or, for CUP-CLS, MALAYSIA and OTTAWA, in the Appendices to this paper.

9.1. UPPER BOUND USING WORDNET MAPPING

Four of the systems (UPC-EHU-UN, UPC-EHU-SU, SUSSEX and OTTAWA) disambiguated according to WordNet senses and used the HECTOR–WordNet map provided by the organisers. To assess how system performance was degraded by the mapping, we computed an upper bound by taking the gold-standard answers, mapping them to WordNet tags (using an inverted version of the same mapping) and then mapping them back to HECTOR tags. The resulting tagging was scored using the standard scoring software. The strategy gave answers for just 79% of instances; for the remaining 21%, the correct HECTOR tag did not feature in the mapping. Precision was also 79%: even though the set of tags is guaranteed to include all correct tags, on this algorithm the mappings in both directions are frequently one-to-many, so the correct answer is diluted. Evidently, systems using the WordNet mapping were operating under a severe handicap and their performance cannot usefully be compared with that of systems using HECTOR tags directly.14 (Other systems such as ETS-PU and the two MANITOBA systems used WordNet or other lexical resources, but not in ways which left them crucially reliant on the sense-mapping.)
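A sketch of this round-trip computation is given below, under the assumption that the mapping is held as dictionaries from HECTOR tags to sets of WordNet senses and back; the names and data layout are illustrative. The returned set would then be scored like any other multi-tag answer, with the probability mass shared equally, which is what dilutes precision even though the correct tag is always in the set.

```python
def round_trip(gold_tag, hector_to_wn, wn_to_hector):
    """Return the set of HECTOR tags reachable from gold_tag via WordNet,
    or None if the gold tag does not feature in the mapping."""
    wn_senses = hector_to_wn.get(gold_tag)
    if not wn_senses:
        return None                          # the 21% of instances with no answer
    back = set()
    for sense in wn_senses:
        back.update(wn_to_hector.get(sense, ()))
    return back or None
```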


10. Baselines

Two sets of baselines are used: those that make use of the corpus training data, and those that only make use of the definitions and illustrative examples found in the dictionary entries for the target words. The baselines which use training data are intended for comparison with supervised systems, while the ones that use only the dictionary are suitable for comparisons with unsupervised systems. None of the baselines in either set draws on any form of linguistic knowledge, except for those that are coupled with the phrase filter, which recognises inflected forms of words and applies rudimentary ordering constraints for multi-word expressions. The baselines, like the systems, are free to exploit the pre-specified part-of-speech tags of the words to be disambiguated for the noun, verb and adjective (hereafter -nva) tasks. Some of the baselines also make use of the root forms of the words to be disambiguated.15 The baselines used for comparison in this paper are:

RANDOM: gives equal weight to all sense tags that match the test word's root form and, for -nva tasks, part of speech.16

COMMONEST: always selects the most frequent of the training-corpus sense tags that match the test word's root form (and, for -nva tasks, part of speech). The frequency calculation ignores cases involving multiple sense tags or where the tag is PROPER or UNASS. It makes no guesses on the words for which no training data was available.

LESK: uses a simplification of the strategy suggested by (Lesk, 1986), choosing the sense of a test word's root whose dictionary definition and example texts have the most words in common with the words around the instance to be disambiguated. The strategy is, for each word to be tagged:


(a) for each sense s of that word,
(b) set weight(s) to zero;
(c) identify the set of unique words W in the surrounding sentence;
(d) for each word w in W,
(e) for each sense s of the word to be tagged,
(f) if w occurs in the definition or example sentences of s,
(g) add weight(w) to weight(s);
(h) choose the sense with the greatest weight(s).

Weight(w) is defined as the inverse document frequency (IDF) of the word w over the definitions and example sentences in the dictionary. IDF is a standard measure used in information retrieval which serves to discount function words in a principled way, since it is inversely proportional to a word's likelihood of appearing in an arbitrary definition or example. The IDF of words like the, and, and of is low, as they appear in most definitions, while the IDF of content words is high. The IDF of a word w is computed as −log(p(w)), where p(w) is estimated as the fraction of dictionary 'documents' which contain the word w. Each definition or example in the dictionary is counted as one separate document. At no point are the words stemmed or corrected for case if capitalised.

LESK-DEFINITIONS: as LESK, but using only the dictionary definitions, not the dictionary examples. This baseline was included because the HECTOR dictionary has far more examples than most dictionaries, so, where systems assumed more standard dictionaries and did not exploit what was, effectively, a small sense-tagged corpus, LESK-DEFINITIONS would be a more salient baseline.

LESK-CORPUS: as LESK, but also considers the tagged training data for words where it is available, so can be compared with supervised systems. For each word w in the sentence containing the test item, this baseline not only tests whether w occurs in the dictionary entry for each candidate sense, but also whether it appears in the same sentence as one of the instances of that sense in the training corpus. That is, (f) above is replaced with:
(f') if w occurs in the definition, example sentences or training-corpus contexts of s.
In this case the IDF weights of words are computed for the words' distribution in both the dictionary and the corpus. Each definition or example in the dictionary is counted as one separate document, and each set of training-corpus contexts for a sense tag is counted as a single additional document. For sense tags which do not appear in the training corpus, the baseline reverts to the strategy of unsupervised LESK, but with the benefit of corpus-derived inverse document frequency weights for words.


Although LESK-CORPUS does not explicitly represent the relative corpus frequencies of sense tags, it implicitly favours common tags because these have larger context sets, and an arbitrary word in a test-corpus sentence is therefore more likely to occur in the context set of a commoner training-corpus sense tag.

...+PHRASE-FILTER: all of the above are also coupled with a phrase filter designed to scan for multi-word expressions in a very shallow way. The phrase filter uses only the dictionary, an inflected-word-forms recogniser, and some rudimentary knowledge about the ordering of the words in each phrase. The phrase filter is used in conjunction with the baselines as a pre-processor: it runs first, vetoing all senses for multi-word items if there is no evidence for them in the test instance, and vetoing all senses except those for the appropriate multi-word if evidence for one of the dictionary instances is found.
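A minimal sketch of the LESK baseline as specified above (without the phrase filter) follows. Each definition and each example sentence counts as one 'document' for the IDF computation, and no stemming or case normalisation is applied; the function names and data structures are illustrative, not the organisers' code.

```python
import math

def idf_weights(documents):
    """documents: one string per dictionary 'document'
    (every definition and every example sentence)."""
    n = len(documents)
    doc_freq = {}
    for doc in documents:
        for w in set(doc.split()):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    # IDF(w) = -log(p(w)), where p(w) is the fraction of documents containing w
    return {w: -math.log(df / n) for w, df in doc_freq.items()}

def lesk(context_sentence, sense_docs, idf):
    """sense_docs: dict mapping each candidate sense tag to its list of documents
    (definition plus example sentences); returns the highest-weighted sense."""
    context_words = set(context_sentence.split())     # no stemming or case folding
    best_sense, best_score = None, float("-inf")
    for sense, docs in sense_docs.items():
        sense_vocab = {w for doc in docs for w in doc.split()}
        score = sum(idf.get(w, 0.0) for w in context_words & sense_vocab)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

LESK-CORPUS differs only in adding, for each sense, one further document containing all its training-corpus contexts, and in computing the IDF weights over this enlarged document collection.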

10.1. PROLOGUE TO RESULTS

Scores were computed on various subsets of the test data, where each subset is intended to highlight a different aspect of the task. There are subtasks for measuring system performance on particular parts of speech, on words for which no training data is available, and on words tagged by the annotators as proper nouns. However, the items on which individual systems significantly outperform or underperform the average did not correlate strongly with any of these broad subsets, so it was not easy to discern which techniques suited which kinds of words or instances.

Individual items in the dataset are not graded in any way for difficulty. This is a limitation of the evaluation, since most systems did not tag the entire dataset but carved out more or less idiosyncratic subsets of it, abstaining from guessing about the remainder. Without difficulty ratings for items, we cannot say whether two systems that tag only part of the data have chosen equally hard subsets, and results may not be comparable. In particular, systems which focus on high-frequency phenomena for which reliable cues are available may benefit from saying nothing about more difficult cases.

The highly skewed distribution of language phenomena, with a few very frequent phenomena and a long tail of rarer ones, also means that systems will primarily be evaluated with respect to their ability to handle a few common types of problems. Their ability to handle a range of rarer problems will have little impact on their score. Even if a system does not choose to restrict itself to the subset of common cases, there will be little else for it to demonstrate its versatility on.


Figure 1. System performance on all test items (precision plotted against recall).

11. Results for Participating Systems

The following graphs summarise system performance on several main tasks of the evaluation. Unsupervised systems are in italics, supervised ones in roman. The human score, HECTOR, corresponds to the annotations made by the lexicographers who initially marked up the test corpus. All the graphs show fine-grained, non-minimal scores. Five baselines are also provided for comparison: LESK-CORPUS, LESK, LESK-DEFS (all with the phrase filter), COMMONEST and RANDOM. Baselines are bold or italic, according to whether they use the training corpus or not, and have scores marked with stars, whereas competing systems have diamonds.

Figure 1 demonstrates that the state of the art, for a fine-grained WSD task where there is training data available, is at around 77%: the highest-scoring system scored 77.1%.17 Where there is training data available, systems that use it perform substantially better than ones that do not. The Lesk-derived baselines performed well: the majority of systems were outperformed by the best of the baselines for their system type.

Eleven systems also returned results by a later deadline. This was mainly to allow further debugging, where the rush to meet the pre-workshop deadline had meant that a system was still very buggy. Ten of the second-round systems were revised versions of first-round systems and one, CUP-CLS, was a new participant. The highest-scoring of the second-round systems had a marginally higher score (78.1%) than the highest-scoring of the first-round systems. Second-round results are shown in Figure 2.


Figure 2. Later-deadline system performance on all test items (precision plotted against recall).

Figures 3 and 4 show performance on the nouns and on the verbs. For nouns, the top performance was over 80%; for the verbs, the best systems scored around 70%.18

11.1. TASKS WITH AND WITHOUT TRAINING DATA

Some of the supervised systems (DURHAM, HOPKINS, MANITOBA-DL) were designed to fall back on unsupervised techniques, or to rely on dictionary examples, when no corpus training data was available. One might have expected these systems to perform at the same levels as the unsupervised ones for those tasks where there was no training data. But this was not the case: the supervised systems performed better than the unsupervised even for these words.

In general, the systems that attempt both the no-training-data words and the others do better on the no-training-data words. This is a consequence of frequency: corpus data was supplied wherever there was any data left over after the test material was taken out of the HECTOR corpus, so the no-training-data words were the rarer words, and low polysemy is correlated with low frequency: in this case, 7.28 senses per word on average as opposed to 10.79 for words with corpus training data. The entropy is also lower on average: 1.57 versus 1.92 for words with training data.19 As a result, supervised systems which do not attempt to tag these words are at a disadvantage compared with supervised systems that do somehow manage to tag them.


Figure 3. System performance on the nouns subtask (precision plotted against recall).

Figure 4. System performance on the verbs subtask (precision plotted against recall).



11.2. SCORING BASED ON REDUCTION OF BASELINE ERROR RATE

Participants were free to return guesses for as many or as few of the items as they chose. Hence, participants who, by accident or design, only returned guesses for the easier items may be considered to have inflated scores, and those who returned guesses for difficult cases, deflated ones. Thus, the SUSSEX system returned guesses for just 879 (10%) of the items in the dataset (just those items where the word to be tagged was the head of the object noun phrase of one of a particular set of high-frequency verbs). The overall precision of SUSSEX (based on its performance on just these items) is 0.36, as compared to 0.39 for the LESK-DEFINITIONS baseline on the dataset as a whole. However, if we look only at the 879 items for which SUSSEX returned an answer, SUSSEX performed better than the baseline: it so happens that SUSSEX had selected a harder-than-average set of items to return guesses for, and its performance should be seen in that light.

On one large subset of the data, the 2500 items in the verb tasks, none of the systems is capable of achieving more than a 2% improvement over the best baseline's error rate.
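For reference, the error-reduction figures quoted in this section and below can be computed as in the hedged helper below; it assumes error rate = 1 − precision measured on a common set of items, which is an interpretation of the text rather than a specification taken from it.

```python
def error_reduction(system_precision, baseline_precision):
    """Fraction of the baseline's error rate eliminated by the system,
    assuming error rate = 1 - precision on the same item set."""
    baseline_error = 1.0 - baseline_precision
    return (baseline_error - (1.0 - system_precision)) / baseline_error
```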

11.3. PART-OF-SPEECH ACCURACY

For the -p tasks, the input did not provide a part-of-speech specification, so the system had, implicitly, to provide one. Most systems guessed part of speech correctly over 90% of the time, the two lowest scores being 78% (MANI-DL-DICT) and 87% (MANITOBA-KS). POS-tagging accuracy was not correlated with sense-tagging accuracy.

For most systems, the results relative to baseline are better for -p tasks than for -nva tasks. For -nva tasks, systems and baselines alike can look up the correct part of speech simply by checking the filename suffix. For -p tasks, the baselines, unlike the systems, had no POS-tagging module, so made many word-class errors. For example, TILBURG achieves 13.05% error reduction relative to the LESK-CORPUS baseline. However, much of this is due to the baseline's performance on the indeterminate items, where it makes many more errors simply because it is not equipped with a part-of-speech tagger. If consideration is restricted to the -nva tasks, the error reduction due to TILBURG decreases to 4.52%.

There were a total of 286 items tagged with PROPER in the answer key. These items are always also assigned a dictionary sense tag in addition to PROPER (see section 5.2). Only three systems ever guess PROPER: HOPKINS, TILBURG and ETS-PU.20 Of these, HOPKINS succeeds in recognising 56.1%, TILBURG 14.3%, and ETS-PU 5.6%. Of the remaining systems, some seem able to distinguish likely proper nouns, as they tend to abstain from guessing more often on the PROPER instances.


Figure 5. Improvement in system performance when responses are limited to sense tags with a part of speech appropriate to the file type of each test item; unfilled circles show original scores, filled circles, improved ones.

As discussed in section 5.1, it was possible for a sense tag from the 'wrong' word class to apply: for example, although the SOUND sense for float was a sense of verbal float, it could be the most salient sense (and hence the one found in the gold standard) for an adjectival instance. Thus the task definition permitted any sense tag for any word class of the word (as well as PROPER and UNASS) as a possibility. If that was interpreted as indicating that the -n, -v or -a label on the task imposed no constraint on the sense tags which could apply, then the label provided no, or very little, useful information. In practice, this occurred less than 1% of the time, and systems which only ever guessed 'right' word class senses benefited from the simplifying assumption. Systems which did not make this assumption frequently paid heavily, committing 10% of their total errors in this way.21

Figure 5 shows, for those systems, how much their performance improves if we ignore errors which would not have occurred had they heeded the part-of-speech constraint. The shift in precision is accomplished by throwing out any guesses that the system makes in the wrong part of speech. Since all of these were wrong anyway, recall is not affected, but precision increases, sometimes dramatically.
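A sketch of the adjustment behind Figure 5, under the assumption that a helper tag_pos() can read the word class off a sense tag (the helper and the data layout are illustrative):

```python
def filter_wrong_pos(answers, task_pos, tag_pos):
    """answers: dict item_id -> dict of sense tag -> weight;
    task_pos: the part of speech given by the task's file suffix;
    tag_pos: assumed helper returning the word class encoded in a sense tag."""
    filtered = {}
    for item, guess in answers.items():
        kept = {t: w for t, w in guess.items() if tag_pos(t) == task_pos}
        if kept:                      # items with nothing left become abstentions
            filtered[item] = kept     # weights deliberately not renormalised, so
    return filtered                   # per-item scores (hence recall) are unchanged
```

Because the discarded guesses all scored zero, the summed score is unaffected; only the count of guessed-on items can shrink, which is why precision rises while recall stays put.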


Figure 6. Distribution of sense tags for generous versus slight in training corpus.

12. Polysemy, Entropy, and Task Difficulty

The distribution of sense tags in the training and evaluation data is highly skewed, with a few very common sense tags and a long tail of rarer ones. This suggests that the distributions of sense tags for individual words in the data will also be quite skewed, and that the entropy of these distributions will be fairly low. However, there is substantial variation of entropy across words. For instance, both generous and slight are adjectives with 6 senses, but the entropy of slight is 1.28 while that of generous is 2.30. This is because of the unusually even distribution of sense tags for generous, as shown in Figure 6, which plots the training-data distributions for the two adjectives.

Polysemy and entropy often vary together, but not always. As Table V shows, the nouns, on average, had higher polysemy than the verbs, but the verbs had higher entropy: for verbs, the corpus instances were spread across the dictionary senses more evenly than for nouns. Systems tend to do better on the nouns than the verbs, suggesting that entropy is the better measure of the difficulty of the tasks. The correlation between task polysemy and system performance is –0.258. The correlation between entropy and system performance is stronger: –0.510.


Table V. Polysemy and entropy of selected evaluation subtasks

  Task               Average polysemy    Average entropy
  eval (all items)        10.37               1.91
  nouns                    9.16               1.74
  verbs                    7.79               1.86
  adjectives               6.76               1.66

When considering just the supervised systems, the correlation with entropy is –0.699; the correlation with polysemy for these systems is –0.247. This might be thought surprising: where a sense-tag distribution has high entropy, most candidate senses are well-represented in the training corpus, so supervised systems should be able to arrive at good models for all of them and discriminate between them reliably. Against that stand two arguments, one mathematical, one lexicographic. The mathematical one is that low-entropy distributions are often dominated by a single sense, in which case the system can perform well by guessing the dominant sense wherever it does not have good evidence to the contrary. The lexicographic one is this: in deciding what senses to list for a word, lexicographers will only give rarer possibilities the status of a sense where they are quite distinct (Kilgarriff, 1992, chapter 4). Senses which are quite distinct to the lexicographer will tend to be those that are easier for systems to discriminate. At one end of the spectrum are tasks like generous-a, where all the meaning distinctions are subtle and overlapping, and the senses tend to be of comparable frequency, giving high entropy for the number of senses. At the other end are tasks like slight-a, where the sense distinctions are reasonably clear for lexicographers and systems alike, but the rarer senses are far rarer than the dominant one or two, giving low entropy.

The relations between polysemy and precision, and entropy and precision, are depicted in Figures 7 and 8.22 There are a few outliers. The two vertical lines on the right of the polysemy graph correspond to the tasks band-p (29 senses) and shake-p (36 senses). Systems perform quite well on these tasks despite their high polysemy. In the case of band-p, this relates to its low entropy (1.75); system performance on band is close to system performance on other tasks with similar entropy. This in turn relates to the high incidence of compound nominals among the senses of band: big band, band saw, elastic band etc. These have distinct, unpredictable real-world meanings, so the lexicographer is inclined to treat them as distinct senses even if they are infrequent; for WSD systems, they will be easy to get right. Shake-p has high entropy (3.69), so the good system performance on this word cannot be explained by the effect of this variable. For shake, like band, multi-word expressions hold the key: shake one's head is the commonest use of shake in the training data, and over 50% of the test items involve some multi-word expression.
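The entropy measure used throughout this section (see note 19) can be computed per word from the training data as in the sketch below; logarithms are taken to base 2, which is consistent with the values quoted above for generous and slight.

```python
import math
from collections import Counter

def sense_entropy(training_tags):
    """training_tags: the list of sense tags observed for one word in the
    training data; entropy = -sum(p(x) * log2(p(x))) over its sense tags x."""
    counts = Counter(training_tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, 2) for c in counts.values())
```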


Figure 7. Precision of all systems on words with different numbers of senses.

Figure 8. Precision of all systems on words with different entropy measures.


13. Pooling the Results of Multiple Systems

Improvements in precision can be achieved by having sets of participating systems vote on which sense tag should be assigned to each test item.23 Three voting schemes were explored. UNANIMOUS only assigns a tag if all the systems in the voting pool agree on that tag unequivocally (or abstain from tagging it). ABSOLUTE MAJORITY assigns a tag if one tag gets more of the non-abstaining systems' votes than all the others combined; if no tag gets an absolute majority of votes, no guess is made. WINNER simply guesses the tag or tags that receive more votes than any others. For ABSOLUTE MAJORITY and WINNER, systems which assign weights to multiple sense tags are counted as voting fractionally for each of these sense tags according to the weight they assign them.

The voting schemes were applied to various sets of systems, including: all (the complete set of participating systems); all S (all the supervised systems); and best S (the better half of the supervised systems, as measured by their overall precision). All the voting schemes gave higher precision than any of their contributing systems. However, all systems agree unanimously on only 3% of items, and even then there are several cases where they do not get the right tag. The agreement between better-performing systems is generally higher than the agreement between systems that do not perform so well.

By combining the best supervised systems in the best S voting pool, we achieve 96% precision on a substantial fragment of the dataset (53%). This is comparable to human precision on this task, as measured by the lexicographers' annotations. The recall is of course substantially lower, and the cases that are left out are evidently the more difficult ones. The shallow LESK-CORPUS baseline with the phrase filter attains 86.4% precision on the same subset of the test data, as compared with 49.4% on the remaining test items, which the voting pool cannot agree on. The voting pool therefore achieves 66.2% error reduction over the baseline on the fragment of the test data that it tags, as opposed to the 85.1% that one would expect if the items tagged by the voting pool were an arbitrary sample of the test data. Nonetheless, such a high-precision partial annotation, produced automatically, can still be extremely useful: it can serve as a valuable first pass over raw data, and one can anticipate it being used in a variety of ways, including the preparation of gold-standard data for future SENSEVALs.
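The following sketch implements the three voting schemes as described above for a single test item, under the assumptions that each system's answer is a dictionary of sense tags to weights summing to one (or None for an abstention), and that a 'unequivocal' vote means a single-tag answer; these are interpretations, and the names are illustrative.

```python
from collections import defaultdict

def vote(answers, scheme="winner"):
    """answers: one guess (dict tag -> weight, or None) per system in the pool."""
    voting = [a for a in answers if a]           # drop abstaining systems
    if not voting:
        return None
    if scheme == "unanimous":
        if any(len(a) > 1 for a in voting):
            return None                          # a hedged bet is not unequivocal
        tags = {next(iter(a)) for a in voting}
        return tags if len(tags) == 1 else None
    totals = defaultdict(float)
    for a in voting:                             # fractional votes for weighted tags
        for tag, weight in a.items():
            totals[tag] += weight
    if scheme == "absolute-majority":
        winners = {t for t, v in totals.items() if v > len(voting) / 2}
        return winners or None
    best = max(totals.values())
    return {t for t, v in totals.items() if v == best}   # "winner": ties allowed
```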


14. Conclusion

English SENSEVAL was an engaging and successful exercise. The strategy developed for the evaluation made evaluation possible and meaningful. Others have worried that WSD cannot be meaningfully evaluated because people so often disagree on what the correct sense is; in the course of the data preparation phase, this ghost was laid to rest, as the human sense-tagging proved to be replicable with a high degree of accuracy.

There was a very high level of interest and engagement with the exercise, with eighteen systems from sixteen research groups participating. Participants were in general grateful that the exercise had been organised, as it enabled them to find out how their system (and its various components) compared with others, in a way that had been near impossible before. It also promoted the coherence of the field by providing a common reference point for evaluation data and methodology.

The exercise identified the state of the art for fine-grained WSD. Where a reasonable quantity of pre-tagged training data was available, the best current systems were accurate 74–78% of the time (where they aimed to tag all instances, i.e. maximising recall). It is interesting to note that a number of systems had very similar scores at the top end of the range, and that the LESK-CORPUS baseline, which simply used overlap between words in the training data and the test instance, was not far below, at 69%. For systems that did not assume the availability of training data, scores were both lower and more variable. Where training data was available, there has been some convergence on the appropriate methods to use, but where a dictionary is the major source, there has been no such convergence.

System performance correlates more closely with entropy than with polysemy. However, there are many outliers and exceptions, and there remains much work to be done in identifying which kinds of words are easy for WSD, and which are difficult.

Limitations of the exercise included the limited amount of context available for each test instance; the small number of words investigated; and, most centrally, uncertainty about the sense inventory that had been selected for the exercise. HECTOR senses may be as valid as those from any other dictionary, but was that good enough? Were they relevant for any NLP task that a WSD module might be useful for? This issue is discussed further in the Introduction and discussion papers in the special issue.

We believe SENSEVAL has done much to take WSD research forward. We look forward to future SENSEVALs with the continued engagement and co-operation of all researchers in the area.

Notes

1 For a fuller statement of the case see (Kilgarriff, 1998). For the counter-arguments, see (Wilks, this volume).
2 All systems are referred to by their short names, as given in Table IV.
3 Hereafter the BNC: for more information see http://info.ax.ac.uk/bnc
4 The funding was from the UK EPSRC under grant M03481.
5 The sampling strategy is fully described in (Kilgarriff, 1998).
6 This was motivated by economy: it made an extra pass over the data to determine part-of-speech unnecessary.


7 float was associated with three tasks, float-v, float-n and floating-a, sometimes also called float-a.
8 In the event, there were some differences of format between the dry-run and training data, and the evaluation data, because, between the releases, there was more time to clean up data and to work on the task specification. This caused some participants substantial inconvenience.
9 For example, numerous corpus instances had been used as HECTOR dictionary examples. These needed weeding out from the evaluation materials. With thanks to Frédérique Segond and Christiane Fellbaum for pointing this out.
10 If the numbers associated with multiple guesses that a system returned did not sum to one, they were first normalised so that they did.
11 There was one further variable in the scoring: 'minimal' vs 'full' scoring. Minimal scoring was defined as the score a system achieved if it was evaluated only on those instances where the key was a single sense. The intention was to provide a score with a clear, unequivocal interpretation. In the event, once again, the choice of scheme made little difference to the relative scores and the remainder of the paper refers only to full scores.
12 All but one returned results before the workshop. Several returned further results by a later, post-workshop deadline. CUP-CLS was the one system that only returned results by the later date.
13 Earlier classifications made a further distinction within the unsupervised systems, between the 'all-words' systems that could disambiguate all (content) words, and 'others', which could not. In the event this distinction was hard to draw, and there was only one likely candidate for this 'other' category, so the distinction is not used here.
14 CUP-CLS was under a similar handicap, as it used a mapping for the CIDE dictionary.
15 The root form is given as the prefix of the file name that a test item occurs in, so is, in this exercise, available to all systems. If it were not given in the file name, some linguistic analysis would be required to obtain it.
16 Here and for other comparable computations below, PROPER and UNASS tags are left out, since giving them equal weight would greatly reduce the weights for actual dictionary senses of low-polysemy words.
17 For the coarse-grained task, the equivalent figures would be 5% higher. The performance of all systems improves under coarse-grained scoring, but in general the relative performance of the systems was not affected (even though some systems had been optimised for the coarse-grain level). The average system precision score on all test items improves from 0.55 to 0.66, or 20%, when scoring is at the coarse-grained instead of the fine-grained level.
18 The other two categories, adjectives and -p tasks, had top levels between these two.
19 Entropy is calculated as −Σ(p(x) · log(p(x))), where x ranges over all sense tags of a word, and p(x) is the fraction of training occurrences of the word tagged with x.
20 PROPER tags do occur in the responses of a couple of other systems, but at most only once or twice per system.
21 In one case, KOREA, the wrong guesses resulted from a systematic false assumption.
22 Out-of-candidate-set guesses (for sense tags of the wrong part of speech) have been disregarded in computing the systems' performance on the above graphs, as the inflated polysemy levels, where, e.g., all adjectival senses were included as possibilities for a verbal task, would complicate the figure.
23 The idea was suggested by Eneko Agirre and David Yarowsky.

Appendix 1: The CUP-CLS system

The CUP-CLS sense tagger was created at Cambridge University Press with support from the EC-funded project ACQUILEX II, and developed further by Cambridge Language Services with support from the DTI/SALT-funded project Integrated Language Database; it is fully described in Harley and Glennon (1997). No further modifications have been made to the tagger since that date, and there was no fine-tuning for the HECTOR tags or data.


The mapping between CIDE (CIDE, 1995), the dictionary used by the CUP/CLS tagger, and the HECTOR dictionary was done by Guy Jackson, to the simple guideline of noting a map wherever there was an overlap between a CIDE sense and a HECTOR sense. In particular, this meant that many CIDE senses often mapped to one HECTOR sense. This in turn meant that the tagger, which only chooses one CIDE sense for each instance, inevitably tagged many words with multiple HECTOR senses solely because of the mapping. The upper bound for the CIDE mapping (computed as described for WordNet in Section 9) gave figures of 90% attempted and 71% precision. In the evaluation, one of the tags chosen by the CUP/CLS sense tagger after the mapping was correct 64% of the time, i.e. the tagger was definitely wrong 36% of the time. The tagger itself could be improved by a number of measures mentioned in the 1997 paper, in particular by using an external part-of-speech tagger (the tagger was not given part-of-speech information for the evaluation). The mapping could be improved by mapping only the most likely matches rather than all possible matches, or by mapping to the fine-grained CIDE 'example' level rather than to the coarser CIDE definition level as now.

References

Harley, A. and D. Glennon. "Combining Different Tests with Additive Weighting and Their Evaluation". In Tagging Text with Lexical Semantics: Why, What and How? Ed. M. Light, Washington, 1997, pp. 74–78.

Appendix 2: The OTTAWA system

The OTTAWA system for word sense disambiguation is part of a larger project that aims to acquire knowledge from technical text semi-automatically. In the absence of hand-coded domain knowledge, the knowledge acquisition tools rely on linguistic knowledge, a co-operating user and general-purpose, publicly available information sources, such as WordNet. For word sense disambiguation, it is possible to use the semantic relationships among nouns in WordNet to compute a measure of semantic similarity of each of the senses of two words. The WSD algorithm attempts to disambiguate nouns by measuring the semantic similarity of senses of words appearing in the same syntactic context: the direct object of a verb. For example, if two nouns appear as direct objects of the same verb, the algorithm measures the similarity of each sense of one noun with each sense of the other noun. The two nouns are disambiguated to the two most similar senses. The algorithm is presented in detail in Li et al. (1995) and Szpakowicz et al. (1996).
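A minimal sketch of that pairwise step, with the WordNet-based similarity measure left as a stand-in function (the measure itself, and the names used here, are not specified in this summary):

```python
from itertools import product

def disambiguate_pair(senses1, senses2, similarity):
    """Given the candidate senses of two nouns sharing a syntactic context
    (direct objects of the same verb), return the most similar sense pair."""
    return max(product(senses1, senses2), key=lambda pair: similarity(*pair))
```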

References

Li, X., S. Szpakowicz and S. Matwin. "A WordNet-based Algorithm for Word Sense Disambiguation". In Proceedings, IJCAI '95. Montreal, 1995, pp. 1368–1374.
Szpakowicz, S., S. Matwin and K. Barker. "WordNet-based Word Sense Disambiguation that Works for Small Texts". Technical Report Computer Science TR-96-03, School of Information Technology and Engineering, University of Ottawa, 1996.


Appendix 3: The MALAYSIA System

MALAYSIA uses a prescriptive, semantic-primitive-based approach to tagging. Its vocabulary was around 2,000 words for SENSEVAL. The strategy is described in (Wilks et al., 1989) and (Guo, 1995).

References

Guo, C.-M. Constructing an MTD from LDOCE (Part 2). Norwood, New Jersey: Ablex, 1995, pp. 145–234.
Wilks, Y., D. Fass, C.-M. Guo, J. McDonald, T. Plate and B. Slator. "A Tractable Machine Dictionary as a Resource for Computational Semantics". In Computational Lexicography for Natural Language Processing. Eds. B. K. Boguraev and E. J. Briscoe, Harlow: Longman, 1989, pp. 193–238.

Appendix 4: HECTOR Lexical Entries and Corpus Instances for Generous and Onion

GENEROUS: Dictionary Entry

1 unstint (512274)[adj-qual] (of a person or an institution) giving willingly more of something, especially money, than is strictly necessary or expected; (of help) given abundantly and willingly 1. Kodak, one of British athletics’ most faithful and generous sponsors, have officially ended their five-year, £5 million backing. [[= sponsor]] 2. The British people historically have been extraordinarily generous at disaster giving. [[subj[person] comp/= at; c/n/giving]]

3. Grateful thanks to Mr D.S.V. Fosten for his generous help, advice and knowledge freely given. [[= help]] 4. It is fashionable to attack doctors for being too liberal in dispensing medication and less than generous with their explanations. [[= with]] 5. The US jazz press has been generous in its praise. [[= in poss nu]] 6. He was generous with the time he gave to professional organisations. [[= with time]] (note = entry is oversplit – WRT)

2 bigbucks (512309)[adj-qual] (of something monetary) consisting of or representing a large amount of money, sometimes with the implication that the amount is greater than is deserved 1. The Government is unlikely to be pushed into generous concessions by the rash of public sector disputes. [[= concession]] 2. It pays you generous interest on your money. [[= interest]] 3. Butler had assembled a complicated financial package which included generous loans to enable the voluntary bodies to build or convert schools for secondary purposes. [[= [money]]]

4. Generous offers from News International have helped drive up pay. [[= offer]] 5. I can offer you . . . a cheque for the generous sum of £15,000. [[= [money]]]


3 kind (512277)[adj-qual; often pred] (of a person or an action) manifesting an inclination to recognize the positive aspects of someone or something, often disinterestedly; (of something that is offered by one person to another) favouring the recipient’s interests rather than the giver’s 1. He was always generous to the opposition. [[= to the opposition]] 2. His interpretation of my remarks had been generous, often creatively so, making of them something far more brilliant than I had intended. [[subj/interpretation comp/=]] 3. This generous desire to show us the best in an author is manifested in his long chapter about Spenser. [[= desire]] 4. Some high-minded men believed that the Germans would turn against Hitler if offered generous enough terms. 5. The emotions are generous –. altruistic almost –. . . . we feel disturbed personally for other people, for people who have no direct connection with us. [[subj[emotion] comp/=]] 4 liberal (512410)[adj-qual; often attrib] leaning toward the positive; liberal 1. A 25 per cent success rate would be a generous estimate. [[= estimate]] 2. Salaries are based on a generous comparison with those paid by the federal civil service of the richest country in the world, the USA. [[= comparison]] 3. With the wheels lowered (limiting speed a generous 134 kts) an Apache will settle at 95-100 kts. [[= [measurement]]]

5 copious (512310)[adj-qual; usu attrib] (of something that can be quantified) abundant; copious 1. Serve immediately with generous amounts of fresh Parmesan. [[= [quantity]]] 2. In winter protect your cheeks with a generous application of moisturiser. [[= application]]

3. Labour spokesmen made generous use of statistics to castigate the government for refusing to spend more money on science. [[= use]]

6 spacious (512275)[adj-qual; usu attrib] (of a room or building) large in size; spacious; (of clothing) ample 1. As if the house were not large enough, there are generous attics stretching right across it, offering another five rooms for expansion.[[= [room]]] 2. A generous grill pan large enough to take a family-sized mixed grill [[= pan]] 3. A cream crepe dress . . . with generous puffed sleeves and a pleated skirt [[c/[garment]]]

GENEROUS: Corpus Instances 700002 As he said in another context, “it was a yell rather than a thought.” The wildness of the suggestion that their own father should wait until they had grown up before being allowed access to his own sons revealed, as well as pain, a < tag >generous< / > love. 700003 Broderick launches into his reply like a trouper.


“Oh, it was wonderful, fascinating, a rich experience. He’s a very < tag >generous< / > actor and obviously he’s very full.” 700004 Man Ray, born Emmanuel Radnitzky of Jewish immigrants in Philadelphia in 1890, renounced deep family and ethnic ties in his allegiance to the cult of absolute artistic freedom. Paradoxically, his fame as the almost hypnotic photo-portrayer of the leading artistic figures around him, his novel solarisations, rayographs and cliches de verre (the last two cameraless manipulations of light and chemistry alone), and his original work for Vogue and Harper’s became a diamond-studded albatross about the neck of a man who wanted to be recognised, first and foremost, as a painter. A more < tag >generous< / > supply of illustrations might have helped the reader place him in the history of 20th-century art. 700005 Mrs Brown said: “It’s a really great way of attracting people’s attention, because they can’t fail to notice us.” “People have been very < tag >generous< / > and we raised about #200 within the first few hours.” 700006 A super year for all cash, career and personal affairs. ARIES (Mar 21–Apr 20): There are some hefty hints being thrown around on Tues day from folk who may be angling for a favour, a promise or a < tag >generous< / > gesture. 700007 Seconds later, airborne missiles whooshed through the air from all directions, apparently aimed at our heads. It would be < tag >generous< / > to call them fireworks, but that implies something decorative, to which one’s response is “Aaah”, not “Aaagh”. 700008 Although he has spent most of his working life in academia he did have an eight-year stint, from 1963, in industrial research. Industry is < tag >generous< / > to Imperial &dash. it endows chairs, sponsors students and gives the college millions of pounds of research contracts every year &dash. but, despite that, Ash is still very critical of it. 700009 This was typical of the constant negotiation and compromise that characterised the wars. The Dunstanburgh agreement was made at Christmas-time in 1462, but it was not just the season which put the Yorkist government in a < tag >generous< / > mood. 700010 The third concert, of Brahms’s Third and First symphonies, revealed the new Karajan at his most lovable, for these were natural, emotional, and &dash. let the word escape at last &dash. profound interpretations: voyages of discovery; loving traversals of familiar, exciting ground with a fresh eye and mind, in the company of someone prepared to linger here, to exclaim there; summations towards which many of his earlier, less intimate performances of the works had led. Karajan had pitched camp with Legge and the Philharmonia in 1949 when a < tag > generous< / > grant from the Maharaja of Mysore had stabilized the orchestra’s fin-


ances and opened up the possibility, in collaboration with EMI, of extensive recording, not only of the classic repertory but of works that caught Karajan's and Legge's fancy: Balakirev's First Symphony, Roussel's Fourth Symphony, the still formidably difficult Music for Strings, Percussion, and Celesta by Bartók, and some English music, too.

ONION: Dictionary Entry

1 veg (528347) [nc, nu] (field = Food) the pungent swollen bulb of a plant, having many concentric skins, and widely used in cooking as a vegetable and flavouring
1. . . . mutton stew, with potatoes and onions floating in the thickened parsley sauce.
2. . . . a finely chopped onion.
3. Gently fry the onion and garlic for 5 minutes.
4. . . . served with chips, tomatoes, onion rings and side salad.
5. . . . french onion soup.

(kind = cocktail onion, salad onion, Spanish onion, spring onion) (note = cannot separate successfully nu and nc senses)

2 plant (528344) [nc] (field = Botany) the liliaceous plant, Allium cepa, that produces onions, having a short stem and bearing greenish-white flowers; any similar or related plant
1. When carrots are grown alongside onions, they protect each other from pests.
2. Shallots belong to the onion family.
3. . . . onion sets
4. Allium giganteum is an attractive onion with four feet tall stems topped with dusky purple flowers.

onion dome basil (528376)[nc] (field = Architecture) a bulbous dome on a church, palace, etc 1. . . . the multicoloured onion domes of St Basil’s Cathedral. [[=]] (note = typically Russian?)

onion-domed roofed (528375)[adj-classif] (field = Architecture) (of a church or other building) having one or more onion domes 1. Soll is a charming cluster of broad roofed houses and inns sprawling lazily around an onion domed church. [[=]]

spring onion spring (528348)[nc] (field = Botany, Food) a variety of onion that is taken from the ground before the bulb has formed fully, and is typically eaten raw in salads 1. Garnish with spring onions and radish waterlilies. [[=]]


ONION: Corpus Instances

700001 They had obviously simply persuaded others to go through this part of their therapy for them. "I want salt and vinegar, chilli beef and cheese and <tag>onion</>!" said Maisie.
700002 "Or perhaps you'd enjoy a bratwurst omelette?" Pale, Chay told the waiter to have the kalbsbratwursts parboiled for four minutes at simmer then to grill them and serve them with smothered fried <tag>onions</> and some Dijon mustard.
700003 With the motor running, slowly add the oil until the mixture is the consistency of a thick mayonnaise. Stir in the <tag>onion</>, add the salt and pepper or a little more lemon juice if required.
700004 The huge browned turkey was placed in the centre of the table. The golden stuffing was spooned from its breast, white dry breadcrumbs spiced with <tag>onion</> and parsley and pepper.
700005 Ingredients: 12oz/375g mince 1oz/30ml vegetable or olive oil 2 medium <tag>onions</>, diced 1 green pepper, diced 3 stalks celery, sliced 1 tin (14oz/400g) plum tomatoes 1tsp sugar Cayenne pepper to taste (at least 1/2 tsp) Salt, pepper Half a 14oz/400g tin of red kidney beans, drained, or 7oz/200g tin of sweetcorn, drained 1 jalapeno pepper, sliced (optional) For the cornbread: 4oz/125g cornmeal (yellow coarse grind – the Encona brand is widely available) 1oz/30g plain flour 1/2 tsp salt 1tsp baking powder 1 egg 5oz/150ml milk 1tbs vegetable oil 2oz/60g grated cheese Method: In a saute pan, brown meat in oil; stir in onions, green pepper and celery.
700007 Heat the oil in a heavy-bottomed pan and add the beef. Fry, turning frequently to seal the meat. Add the <tag>onion</>, garlic, carrot, celery and leek and cook for 2 minutes.
700008 Pre-heat the oven to gas mark 1 " / " 2 60°ree. 1 " / " 2 25°ree.F. 2, Heat the oil and butter together in a heavy pan or casserole dish, add the <tag>onion</> and peppers and cook until soft.
700009 If you have no greenhouse then sow one row thinly and transplant the thinnings, raking in two handfuls of fertiliser per square yard before sowing or planting. Spring <tag>onions</> are treated in the same way as radish, while parsnips must go in early, should be sown in shallow drills with around three or four seeds together at six inch intervals after a handful of fertiliser per square yard has been worked in.
700010 One of the best bulbous plants for drying is Allium albopilosum (christophii). This ornamental <tag>onion</> blooms in June with large globe-shaped flowers up to ten inches in diameter, with small star-shaped silver-lilac flowers.

References

Atkins, S. "Tools for Computer-Aided Corpus Lexicography: The Hector Project". Acta Linguistica Hungarica, 41 (1993), 5–72.
Byrd, R. J., N. Calzolari, M. S. Chodorow, J. L. Klavans, M. S. Neff and O. A. Rizk. "Tools and Methods for Computational Lexicology". Computational Linguistics, 13 (1987), 219–240.
CIDE. "Cambridge International Dictionary of English". Cambridge, England: CUP, 1995.
Fellbaum, C. (ed.). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
Gale, W., K. Church and D. Yarowsky. "Estimating Upper and Lower Bounds on the Performance of Word-sense Disambiguation Programs". In Proceedings, 30th ACL, 1992, pp. 249–256.
Harley, A. and D. Glennon. "Combining Different Tests with Additive Weighting and Their Evaluation". In Tagging Text with Lexical Semantics: Why, What and How? Ed. M. Light, Washington, 1997, pp. 74–78.
Hirschman, L. "The Evolution of Evaluation: Lessons from the Message Understanding Conferences". Computer Speech and Language, 12(4) (1998), 281–307.
Jorgensen, J. C. "The Psychological Reality of Word Senses". Journal of Psycholinguistic Research, 19(3) (1990), 167–190.
Kilgarriff, A. "Polysemy". Ph.D. thesis, University of Sussex, CSRP 261, School of Cognitive and Computing Sciences, 1992.
Kilgarriff, A. "Evaluating Word Sense Disambiguation Programs: Progress Report". In Proc. SALT Workshop on Evaluation in Speech and Language Technology. Ed. R. Gaizauskas, Sheffield, 1997, pp. 114–120.
Kilgarriff, A. "Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs". Computer Speech and Language, 12(4) (1998), 453–472. Special Issue on Evaluation of Speech and Language Technology, edited by R. Gaizauskas.
Lesk, M. E. "Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone". In Proc. 1986 SIGDOC Conference. Toronto, Canada, 1986.
Ng, H. T. and H. B. Lee. "Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach". In ACL Proceedings. Santa Cruz, California, 1996, pp. 40–47.

Computers and the Humanities 34: 49–60, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Framework and Results for French

FRÉDÉRIQUE SEGOND∗
Xerox Research Centre Europe, Meylan, France

1. Setting Up the French Exercise

To make the evaluation exercise valuable and useful, it is important to prepare the evaluation material according to a rigorous methodology. This includes having clear criteria for choosing the words and being aware of the consequences that the choice of dictionary and corpus has on the evaluation. Also, because sense disambiguation is a difficult task even for human beings, it is important to provide comparison figures for human tagger agreement. In the following sections we present the adopted methodology together with the material used.

1.1. CHOOSING THE CORPUS

The corpus used for the ROMANSEVAL1 exercise is the same as the one used within the ARCADE project.2 It is a parallel corpus comprising nine European languages3 (ca. 1.1 million words per language). This corpus has been developed within the MLCC-MULTEXT projects.4 It is composed of written questions asked by members of the European Parliament on a wide variety of topics and of the corresponding answers from the European Commission. The format is plain text. Sentences are relatively long and the style is, unsurprisingly, rather administrative. Although we have not yet used the parallel aspect of this corpus, we plan to exploit it in order to study, for instance, relationships between sense tagging and translation.

1.2. CHOOSING THE WORDS

The choice of test words is particularly difficult and cannot be left to intuition. Frequency criteria alone have serious drawbacks: while the frequent content words of a text are often highly polysemous, it has also been shown that a large number of frequent words tend to be mostly monosemous in a given corpus. As such, a list of frequent words does not permit a proper evaluation of automatic WSD systems. Choosing the most polysemous words of a dictionary also has drawbacks: chances are high that only a few of their senses appear in the corpus.


Table I. Average polysemy across four dictionaries.

                        verbs    adjectives    nouns
French dictionary       12.6     6.3           7.6
Italian dictionary      5.3      4.7           4.9
English dictionary7     5.1      4.4           5.0
WordNet                 8.63     7.95          4.74

We used a combination of these two methods. We extracted 60 words (20 nouns, 20 verbs, 20 adjectives) from 3 lists of 200 non part-of-speech ambiguous5 words obtained according to frequency criteria. The chosen words had word forms with comparable frequencies in the corpus, around the desired number of 50, so that, for each test word, all of its contexts could be tested. These words were then proposed to 6 human judges who had to decide, for each of them, whether or not they were polysemous in the evaluation corpus.6 A score was then attributed to each word by summing up the answers, and the 20 words per part of speech with the highest score were selected (a sketch of this selection step is given at the end of this subsection). Altogether, full agreement on polysemy was achieved on only 4.5% of the words. Conversely, 40.8% of the words were unanimously judged as having only one sense; the rest received mixed judgement.

The selected words are presented below. The numbers in brackets are, first, the full number of senses (where each sense or subsense is treated as distinct) and, second, the number of "top-level" sense distinctions. Petit Larousse dictionary entries are often hierarchical, and it is likely that, for many NLP tasks, top-level disambiguation is sufficient.

nouns: barrage (6;2), chef (7;6), communication (4;2), compagnie (8;4), concentration (4;4), constitution (6;4), degré (17;4), détention (2;2), économie (8;2), formation (13;9), lancement (3;3), observation (7;3), organe (5;5), passage (12;2), pied (15;5), restauration (7;2), solution (4;2), station (7;3), suspension (8;3), vol (9;2)

adjectives: biologique (3;3), clair (9;2), correct (3;3), courant (6;6), exceptionnel (2;2), frais (8;3), haut (10;3), historique (4;3), plein (11;9), populaire (4;4), régulier (12;2), sain (6;2), secondaire (10;3), sensible (11;9), simple (11;4), strict (4;4), sûr (5;5), traditionnel (2;2), utile (2;2), vaste (3;3)

verbs: arrêter (8;3), comprendre (4;2), conclure (4;3), conduire (6;4), connaître (9;4), couvrir (16;3), entrer (9;4), exercer (6;6), importer (5;2), mettre (20;5), ouvrir (16;10), parvenir (4;4), passer (37;9), porter (26;8), poursuivre (5;5), présenter (13;4), rendre (12;3), répondre (9;3), tirer (30;9), venir (12;3)

Because the chosen words are the same as those chosen within ARCADE, it will be possible to adopt a multilingual perspective on WSD systems.
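The selection step just described can be made concrete with a small sketch. This is not the organisers' script: the treatment of "I don't know" answers (ignored here, only "yes" answers are counted) and the data structures are illustrative assumptions.

```python
from collections import Counter

def select_test_words(candidates, judgements, n=20):
    """candidates: list of (word, pos) pairs from the frequency-based lists.
    judgements: dict mapping word -> list of 'yes'/'no'/'dont_know' answers
    from the six judges on whether the word is polysemous in the corpus.
    Each word is scored by its number of 'yes' answers and the n
    highest-scoring words per part of speech are kept."""
    def score(word):
        return Counter(judgements[word])["yes"]
    selected, per_pos = [], Counter()
    for word, pos in sorted(candidates, key=lambda wp: score(wp[0]), reverse=True):
        if per_pos[pos] < n:
            selected.append((word, pos))
            per_pos[pos] += 1
    return selected
```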

1.3. CHOOSING THE DICTIONARY

For French we used the Petit Larousse (Larousse95) dictionary. It is a monolingual dictionary of 54,900 entries which is widely available on CD-ROM. Most French speakers are familiar with this dictionary, and therefore no particular training was required for the human taggers. There are many differences in the lexical resources used for the different languages. One difference is the average number of senses given by each dictionary for each part of speech (see Table I). All else being equal, the more senses, the more difficult the disambiguation task.8 Another difference concerns the way these resources have been built. For instance, the Oxford English dictionary used within SENSEVAL is corpus and frequency based, while the Petit Larousse is a traditional dictionary with a clear encyclopedic bias. Corpus and frequency based dictionaries display first the senses which have the highest frequency in corpora. This influences evaluation results in terms of comparison with the baseline as well as in terms of inter-tagger agreement.9 Also of importance is the fact that, unlike for the English exercise, for French and Italian there was no particular adequacy of the dictionaries to the corpora. Indeed, the English experiment in SENSEVAL was in an especially favorable situation: contexts from the HECTOR corpus were tagged with the HECTOR dictionary, which is based on the same corpus. The high inter-tagger agreement reached is in accordance with Kilgarriff's (1998b) hope that such a setting would ease the taggers' task. None of the French participants had the advantage of using their own dictionary/ontology; they all had to map theirs to the Larousse dictionary. This mapping has many consequences for system evaluation, especially when participating systems had to map fine-grained dictionaries onto the Petit Larousse.

1.4. TAGGING TEXT

In order to create an evaluation corpus, six human informants10 were asked to semantically annotate the corpus. Each of the 60 words appeared in 50 different contexts which yielded 3000 contexts to be manually sense-tagged.11 Annotators were instructed to choose either zero, one, or several senses for each word in each context. (A question mark was used when none of the senses matched the given context. The question-mark sense was treated as an additional sense for each word, taking together all meanings not found in the dictionary.) Because the Petit Larousse encodes more senses for verbs than for adjectives and nouns, annotators gave more senses per context for this part of speech. Still, it appeared that the average number of senses (used by a single judge in a given context) per part of speech is not very high. The average number of answers per word ranged from 1 to 1.3. Annotators used up to six senses in a single answer for a given context.


Table II. Inter-tagger agreement for French

                 Full Min.    Full Max.    Pair Min.    Pair Max.    Pair Wei.    Agree cor.
Nouns            44%          45%          72%          74%          73%          46%
Verbs            29%          34%          60%          65%          63%          41%
Adjectives       43%          46%          49%          72%          71%          41%

Agreement among annotators was computed according to the following measures:
− Full agreement among annotators. Two variants have been computed:
  • Min: counts agreement when judges agree on all the senses proposed for a given context
  • Max: counts agreement when judges agree on at least one of the senses proposed for a given context
− Pairwise agreement. Three variants have been computed:
  • Min: counts agreement when judges agree on all the senses proposed for a given context
  • Max: counts agreement when judges agree on at least one of the senses proposed for a given context
  • Weighted: accounts for partial agreement using the Dice coefficient, Dice = 2|A ∩ B| / (|A| + |B|)
− Weighted pairwise agreement corrected for chance, using the kappa statistic:12

  k = (P_observed − P_expected) / (1 − P_expected)

A kappa value of 1 indicates perfect agreement, and 0 indicates that agreement is no better than chance. (It can also become negative in case of systematic disagreement.) According to each of the above measures, the inter-tagger agreement for French is as shown in Table II. The kappa values here are low, and indicate an enormous amount of disagreement between judges. Looked at word by word, the values range between 0.92 and 0.01; for some words, agreement was no better than chance.
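To make these measures concrete, the following is a minimal sketch (not the organisers' scoring code) of the Dice-weighted pairwise agreement and of the chance-corrected kappa defined above; the toy data and the way P_expected would be estimated are illustrative assumptions.

```python
from itertools import combinations

def dice(a, b):
    """Dice coefficient between two sense sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def pairwise_weighted_agreement(contexts):
    """contexts: one entry per tagged occurrence, each a list of sense sets
    (one set per judge).  Averages the Dice score over all judge pairs."""
    scores = [dice(x, y)
              for judges in contexts
              for x, y in combinations(judges, 2)]
    return sum(scores) / len(scores)

def kappa(p_observed, p_expected):
    """Chance-corrected agreement: (P_observed - P_expected) / (1 - P_expected).
    Estimating P_expected from the judges' marginal sense distributions
    (Cohen 1968) is left out of this sketch."""
    return (p_observed - p_expected) / (1 - p_expected)

# toy example: two judges, three occurrences of one word
contexts = [[{"1"}, {"1"}], [{"1", "2"}, {"2"}], [{"3"}, {"1"}]]
print(round(pairwise_weighted_agreement(contexts), 2))  # 0.56
print(round(kappa(0.73, 0.50), 2))                      # 0.46
```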


This semantically hand-tagged corpus has been used for evaluation purposes only. Participating systems did not benefit from a training corpus, either to train their systems or to tune their sense mappings; for training they were only given an untagged corpus containing the test words. This was due to lack of time and resources.

2. Participating Systems and Evaluation Procedure

Four institutions participated with five systems in the French ROMANSEVAL exercise. They were:
EPFL – Ecole Polytechnique Fédérale de Lausanne
IRISA – Institut de Recherche en Informatique et Systèmes Aléatoires, Rennes
LIA-BERTIN – Laboratoire d'Informatique, Université d'Avignon, and BERTIN, Paris
XRCE – Xerox Research Centre Europe, Grenoble

The first three systems are briefly described in the Appendix. The fourth has a paper of its own in this Special Issue. The test procedure followed the steps described below:
− each site received the raw corpus well in advance, in order to get familiar with the format and to interface, tune and train their systems as much as possible;
− a dry run was organised in order to check the procedures and evaluation programs;
− each site received the test words;
− each site returned the semantically-tagged test words.
Then each system was evaluated according to the metrics described in the next section.

3. Evaluation Metrics and Results

The measure of human inter-tagger agreement sets the upper bound for the efficiency measures. It would be unrealistic to expect WSD systems to agree more with the reference corpus than human annotators do among themselves. Given the low human inter-tagger agreement, we tried to be as generous as possible. We treated the gold standard as the union of all answers given by all human taggers and adopted the following metrics:
− Agree, which counts matches between the system and the gold standard, weighted by the number of proposed senses: Agree = |human ∩ system| / |system|
− Kappa, which is as above, corrected for chance agreement
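As an illustration of the Agree metric, here is a minimal sketch under the reading of the formula given above (matches between the system's senses and the union of the human answers, normalised by the number of senses the system proposed); it is not the official evaluation script, and the example senses are invented.

```python
def agree(system_senses, human_answers):
    """Score one context: |union(human) ∩ system| / |system|.
    `human_answers` is a list of sense sets, one per judge; the gold
    standard is taken as the union of all human answers, as in the text."""
    gold = set().union(*human_answers)
    if not system_senses:
        return 0.0
    return len(gold & system_senses) / len(system_senses)

# toy example for one occurrence of "barrage"
print(agree({"2"}, [{"1"}, {"2"}, {"1", "2"}]))  # 1.0
print(agree({"2", "3"}, [{"1"}, {"2"}]))         # 0.5
```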


Figure 1. Results for adjectives, nouns, verbs, all senses.

Figure 2. Results for adjectives, nouns, verbs, top-level senses only.


Figure 3. Average kappa score per word for agreement between systems and human taggers.


Figure 4. Results according to Precision and Recall.

In order to provide a point of comparison, we also computed results for two baseline "trivial" systems, which we called Base and Cheap. Base always chooses the first sense proposed in the Petit Larousse dictionary. (As already noted, one cannot assume that the first sense is the most common.) Cheap is a variant of Lesk's method (Lesk86), which relies on finding the best overlap between a word's context and a dictionary definition (a sketch of both baselines is given after this paragraph). The results are presented in Figures 1 and 2. The first considers all senses and subsenses as distinct. The second considers only "top level" sense distinctions: for this calculation, all subsenses were treated as equivalent to the top-level sense they fell under. Consider the case where, at the first level of the hierarchy, a word has senses 1 and 2, and sense 1 has subsenses a and b. If the Gold Standard answer is 1a and a system response is 1b, then, in the top-level calculation, the system response is correct, since both the Gold Standard and system responses are equivalent to 1. (All other results figures are calculated on the basis of all senses.) It is also interesting to explore which words were easier and which harder. Figure 3 shows, for each word, the average kappa score for agreement between the system and the human taggers, for all seven systems. The graph indicates that some words presented far more problems than others.
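The two baselines can be sketched as follows. This is an illustrative reimplementation, not the code used in the exercise; the toy dictionary entry and its definitions are invented for the example.

```python
import re

def base(senses):
    """'Base': always return the first sense listed in the dictionary entry."""
    return senses[0]["id"]

def cheap(context, senses):
    """'Cheap': a Lesk-style overlap -- pick the sense whose definition
    shares the most words with the context."""
    tokenize = lambda s: set(re.findall(r"\w+", s.lower()))
    ctx = tokenize(context)
    return max(senses, key=lambda s: len(ctx & tokenize(s["definition"])))["id"]

# toy Petit-Larousse-like entry for "barrage" (definitions are invented)
entry = [
    {"id": "1", "definition": "action de barrer un passage"},
    {"id": "2", "definition": "ouvrage qui barre un cours d'eau pour retenir l'eau"},
]
print(base(entry))                                              # '1'
print(cheap("le barrage retient l'eau du cours d'eau", entry))  # '2'
```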


All metrics have their own advantages, and we decided to use the usual precision and recall figures as a secondary source, for ease of comparison with the English exercise. In our case, precision is correct senses / total senses proposed, and recall is correct senses retrieved / total senses in the reference. The precision/recall results are shown in Figure 4.

The quantitative results still need to be refined (for example in terms of metrics) and discussed among participants. A qualitative study also still needs to be undertaken, asking, for instance: which words are difficult for systems, why are they difficult, what is the impact of sense mapping, what is the impact of the evaluation metrics, and what are the multilingual issues involved and the relationship with translation? We invite readers to participate in this process.

The overall exercise went very well thanks to the dedication and the motivation of all participants. We have been able to achieve a great deal in a little time and with few resources. We have laid the methodology and groundwork for a larger-scale evaluation. Further experiments could include the addition of new texts, the use of different dictionaries, running an all-words tagging exercise, and measuring the efficiency of WSD in real tasks.

Notes

∗ I am especially grateful to Jean Véronis, with whom I organised the ROMANSEVAL exercise. This paper is mainly a compilation of previous publications by Jean Véronis (see in particular Véronis 1998, Véronis et al. 1998, and Ide and Véronis 1998). Many thanks also to Marie-Hélène Corréard, Véronika Lux and Corinne Jean for comments on previous versions of the paper.
1 See http://www.lpl.univ-aix.fr/projects/romanseval
2 See http://www.lpl.univ-aix.fr/projects/arcade
3 The languages are: Dutch, Danish, English, French, German, Greek, Italian, Portuguese and Spanish.
4 MLCC stands for Multilingual Corpora for Cooperation; see MLCC, 1997.
5 This was to eliminate the need for POS tagging of the corpus, and the associated hand-validation.
6 The question asked was "According to you, does the word X have several senses in the following contexts?" Judges had three possible answers: "yes", "no" and "I don't know".
7 These figures do not take into account the four POS-ambiguous words.
8 This holds for both humans (according to Fellbaum, 1997) and automatic systems.
9 Fellbaum (1997) reports higher inter-tagger agreement when senses in dictionary entries are ordered according to their frequency of occurrence in the corpus, with the most frequent sense placed first.
10 The informants were linguistics students at Université de Provence.
11 We would like to thank Corinne Jean and Valérie Houitte for their help in coordinating the task.
12 The kappa statistic (Cohen, 1960; Carletta, 1996) measures the "true" agreement, i.e. the proportion of agreement above what would be expected by chance. The extension of kappa for partial agreement, as proposed in Cohen (1968), was used.


Appendix: Brief Descriptions of Three ROMANSEVAL WSD Systems for French

IRISA WSD SYSTEM

Ronan Pichon and Pascale Sébillot

The WSD system that we have developed is based on a clustering method, which consists of associating a contextual vector with each noun, verb and adjective occurrence in the corpus (not only with the 60 words of the test) and of aggregating the most "similar" elements at each step of the clustering. The contents (the words and their frequencies) of the clusters in which test occurrences appear are then used to choose the most relevant Petit Larousse sense(s).

Some problems
Concerning verbs, results are not very good; in fact, we stopped searching for the meanings of the test occurrences. One explanation: there are greedy clusters which "swallow" a lot of verbs, and therefore the interpretation of the class is impossible. This greedy-cluster phenomenon also occurs for the other categories, but it is very accentuated for verbs. A "normal" class contains about 30–50 elements (that is, about 6 to 8 distinct lemmas); a greedy cluster can contain 2000 elements; the maximal cluster for verbs that we found had 20000 elements. Different contexts for nouns, verbs and adjectives will probably improve the results. For example, we think that for adjectives it would be better to consider a closer context (rather than the whole sentence).

WSD System of Laboratoire Informatique D'Avignon and Bertin Technologies

Claude de Loupy, Marc El-Bèze and Pierre-François Marteau

Due to the lack of a training corpus in ROMANSEVAL, it was impossible to use the automatic method we implemented for the English SENSEVAL (see our full paper in this volume for a description of the SCT method). This led us to perform a semi-automatic experiment for the French task. This procedure makes use of the test corpora. For each word to be tagged, the set of sentences was submitted to the same automatic preprocessing as for the English task. We then manually extracted some patterns and assigned them to one or more senses, where possible. When more than one sense could be attached to a corpus instance, the instance was duplicated for each sense. Some omissions in the definitions caused problems for the manual assignment of senses. For instance, the very frequent chef-d'oeuvre was not represented.


This work was done for the French corpus and its English counterpart. Moreover, samples were extracted from the definitions. The confidence of a sample depends both on the number of times it appears and on an arbitrary score given by a human judge. The very good results we obtained in this way may be considered as an upper bound of French WSD performance for an automatic system using the SCT method and a very large coverage bilingual corpus.

WSD System of EPFL, Swiss Federal Institute of Technology

Martin Rajman

The EPFL team proposed a disambiguation model based on Distributional Semantics (DS), which is an extension of the standard Vector Space (VS) model. The VS model represents a textual document d_n as a vector (w_n1, . . . , w_nM), called a lexical profile, where each component w_nk is the weight (usually the frequency) of the term t_k in the document (terms here are various predefined textual units, such as words, lemmas or compounds). The DS model further takes into account the co-frequencies between the terms in a given reference corpus. These co-frequencies are considered to provide a distributional representation of the "semantics" of the terms. In the DS model, each term t_i is represented by a vector c_i = (c_i1, . . . , c_iP) (its co-occurrence profile), where each component c_ik is the frequency of co-occurrence between the term under consideration t_i and the indexing term t_k. The documents are then represented as the average vector of the co-occurrence profiles of the terms they contain:

  d_n = Σ_{i=1}^{M} w_ni c_i
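A minimal sketch of the DS representation just described, together with the cosine-based sense selection explained in the following paragraph, is given below. The window size, the raw-frequency weighting and all names are illustrative assumptions rather than details of the EPFL system.

```python
import numpy as np

def cooccurrence_profiles(corpus_sentences, vocab, window=5):
    """Build a co-occurrence profile c_i for every term in `vocab`
    from a tokenised reference corpus (list of token lists)."""
    index = {t: k for k, t in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for sentence in corpus_sentences:
        for i, t in enumerate(sentence):
            if t not in index:
                continue
            for u in sentence[max(0, i - window): i + window + 1]:
                if u in index and u != t:
                    C[index[t], index[u]] += 1
    return C, index

def ds_vector(tokens, C, index):
    """Represent a text as the weighted sum of the co-occurrence
    profiles of its terms; weights are simple term frequencies."""
    v = np.zeros(C.shape[1])
    for t in tokens:
        if t in index:
            v += C[index[t]]
    return v

def cos(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

def disambiguate(context_tokens, sense_definitions, C, index):
    """Pick the sense whose definition vector is most similar
    (cosine similarity) to the context vector in the DS space."""
    ctx = ds_vector(context_tokens, C, index)
    scores = {s: cos(ctx, ds_vector(d, C, index))
              for s, d in sense_definitions.items()}
    return max(scores, key=scores.get)
```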

In the DS-based disambiguation model, the context of an ambiguous word and each of its definitions are first positioned in the DS vector space. Then, the semantic similarity between the context (represented by a vector C) and each of the definitions (represented by a vector D_i) is computed according to a similarity measure such as the cosine similarity, cos(C, D_i) = C·D_i / (||C|| ||D_i||), and the definition corresponding to the highest similarity is selected.

References

Carletta, J. "Assessing agreement on classification tasks: the kappa statistic". Computational Linguistics, 22(2) (1996), 249–254.
Cohen, J. "A coefficient of agreement for nominal scales". Educational and Psychological Measurement, 20 (1960), 37–46.
Cohen, J. "Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit". Psychological Bulletin, 70(4) (1968), 213–220.


Fellbaum, C., J. Grabowski and S. Landes. "Analysis of a hand-tagging task". In Proceedings of the ANLP Workshop on Tagging Text with Lexical Semantics: Why, What and How?, Washington D.C., April 1997.
Ide, N. and J. Véronis. "Introduction to the special issue on word sense disambiguation: the state of the art". Computational Linguistics, 24(1) (1998), 1–40.
Kilgarriff, A. "SENSEVAL: an exercise in evaluating word sense disambiguation programs". In Proceedings of LREC, Granada, May 1998, pp. 581–588.
Kilgarriff, A. "Gold standard datasets for evaluating word sense disambiguation programs". Computer Speech and Language, 12(4) (1998b), 453–472.
Le Petit Larousse illustré – dictionnaire encyclopédique. Edited by P. Maubourguet. Paris: Larousse, 1995.
Lesk, M. E. "Automated sense disambiguation using machine-readable dictionaries: how to tell a pine cone from an ice-cream cone". In Proceedings of the 1986 SIGDOC Conference, Toronto, June 1986. New York: Association for Computing Machinery, pp. 24–26.
MLCC. Multilingual Corpora for Co-operation. Distributed by ELRA, 1997.
Véronis, J. "A study of polysemy judgements and inter-annotator agreement". In Programme and Advanced Papers of the SENSEVAL Workshop, Herstmonceux Castle, September 1998.
Véronis, J., V. Houitte and C. Jean. "Methodology for the construction of test material for the evaluation of word sense disambiguation systems". In Workshop WLSS, Pisa, April 1998.

Computers and the Humanities 34: 61–78, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Senseval/Romanseval: The Framework for Italian

NICOLETTA CALZOLARI and ORNELLA CORAZZARI
Istituto di Linguistica Computazionale (ILC) – CNR, Via della Faggiola 32, Pisa, Italy (E-mail: {glottolo,corazzar}@ilc.pi.cnr.it)

Abstract. In this paper we present some observations concerning an experiment of (manual/automatic) semantic tagging of a small Italian corpus performed within the framework of the SENSEVAL/ROMANSEVAL initiative. The main goal of the initiative was to set up a framework for evaluation of Word Sense Disambiguation systems (WSDS) through the comparative analysis of their performance on the same type of data. In this experiment there are two aspects which are of relevance: first, the preparation of the reference annotated corpus, and, second, the evaluation of the systems against it. In both aspects we are mainly interested here in the analysis of the linguistic side which can lead to a better understanding of the problem of semantic annotation of a corpus, be it manual or automatic annotation. In particular, we will investigate, firstly, the reasons for disagreement between human annotators, secondly, some linguistically relevant aspects of the performance of the Italian WSDS and, finally, the lessons learned from the present experiment.

Key words: semantic tagging, word sense disambiguation, WSDS evaluation, inter-annotator agreement, Italian corpus annotation

1. Introduction

One of the most important aspects of the SENSEVAL/ROMANSEVAL initiative was the objective of setting up a comparative framework for evaluating WSDS in a multilingual environment, with two Romance languages – French and Italian – in addition to English. An innovative side was the selection of the corpus material for French and Italian and the definition of a common annotation methodology in order to allow cross-lingual comparison and evaluation of data and results. The experiment on semantic tagging implied different phases:
1) selection of the material, i.e., a corpus and a reference dictionary;
2) selection of a list of lemmas and extraction of a subset of their corpus occurrences;
3) semantic tagging performed in different sites, consisting of the assignment of the dictionary reading numbers to the corpus occurrences;
4) comparison and evaluation of the results;
5) running of the WSDS;
6) evaluation and comparison of the WSDS' results;
7) evaluation of the experiment in view of future extensions.


A further step, consisting of a cross-lingual comparison of French and Italian, can be performed in cooperation between the University of Aix-en-Provence (Laboratoire Parole et Langage), the Rank Xerox Research Centre of Grenoble and the Institute of Computational Linguistics (ILC) of Pisa.

In this introductory section we provide an overview of the selected text corpus, lemmas, dictionary and the rules defined for manual annotation. The selected corpus was a parallel multilingual corpus of approximately 1.1 million words per language, consisting of extracts from the Journal of the European Commission, Written Questions (1993).1

The dictionary selected was a medium-sized printed Italian dictionary of about 65,000 lemmas (Garzanti, 1995), with no hierarchical structure within entries and neither corpus nor frequency based. This choice was determined by the fact that presently no large-coverage computational semantic lexicon exists for Italian, even though such a printed dictionary is obviously of less interest in view of automatic tagging for Language Engineering (LE) applications. Moreover, a medium-sized dictionary was preferred to a more fine-grained and larger dictionary, since an extended set of reading numbers – not necessarily and always well differentiated – would make not only the automatic WSD task too complex, but also the evaluation task much more difficult, since annotators would tend to disagree or to assign multiple tags, thus augmenting the disagreement rate.

As to the selection of the words to be tagged, it was based on three criteria: (i) their being translations of words chosen for French, in order to allow comparative evaluations of the results, (ii) their polysemy, (iii) the number of occurrences in the corpus (at least 50). Twenty nouns, 20 verbs and 18 adjectives were selected and their corpus occurrences were extracted. Of these, 40 words were translations of words selected for French. Not all translated lemmas were kept, as some were not polysemous in Italian. The number of corpus occurrences to be tagged was 2701 (954 nouns, 857 verbs and 890 adjectives).

The semantic annotation was performed – for each word – by two human annotators. Three sites were involved in tagging (Pisa: ILC; Roma: University of Tor Vergata; Torino: CELI). The result is a list of occurrences with two reading numbers (assigned by the two annotators) taken from the definitions of the dictionary. A few conventional tags were defined to cover some particular cases, i.e.: (i) a question mark (?) when the meaning of the occurrence was missing in the paper dictionary, or more generally when semantic annotation was quite problematic, (ii) reading numbers separated by a slash when more than one dictionary meaning could be assigned to the same corpus occurrence, (iii) a star (*) to mark cases in which a different POS was wrongly selected among the occurrences of a given syntactic category.

The main issues on which we report in the following sections are: (i) the level of agreement between human annotators, (ii) the evaluation of the main reasons for disagreement, focusing on the linguistic aspects, (iii) some general observations concerning the performance of the Italian WSDS, and (iv) the lessons learned from the present experiment in view of future evaluation tasks.

2. Manual Annotation: Agreement vs. Disagreement Rate

A single reading number was assigned by the annotators 91% of the time; in a much smaller number of cases two or more reading numbers (4.8%) or a question mark (1.9%) were given.2 Therefore, in 6.7% (4.8% + 1.9%) of the cases, the paper dictionary turned out to be not sufficiently representative of the language attested in the text corpus. The specificity of the corpus partially explains this, but this crucial point will be further examined with illustrative examples in the following section. We mainly focus here on the comparison of the semantic tagging by the different human annotators. The level of agreement among annotators was computed according to two criteria:
− full agreement, when there is complete agreement on all senses proposed for a given wordform;
− partial agreement, when there is agreement on at least one of the senses proposed for a given wordform (this can be obtained, e.g., between senses 1 and 1/2).
The following table displays the results in terms of partial vs. full agreement for each POS:

PoS     Occurr.    Part. Agr.        Full Agr.
N       954        863 (90.4%)       814 (85.3%)
V       857        716 (83.5%)       681 (79.4%)
A       890        677 (76%)         552 (62%)
Tot.    2701       2256 (83.5%)      2047 (75.7%)

We can notice a rather broad convergence between annotators, probably due also to a dictionary with not too fine-grained distinctions. The highest level of agreement was reached on nouns, while the other two syntactic categories, especially adjectives, show more divergence. It is evident that by allowing the assignment of multiple tags to the same wordform, and accepting only partial agreement (e.g., between 1/2 and 2), the opportunities for agreement between annotators are noticeably increased. On the other hand, considering the results in terms of full agreement, the distance between nouns and verbs slightly decreases, but the distance between verbs and adjectives becomes much higher (adjectives seem more difficult to agree on).

If we now take into account the tags assigned to each occurrence by the two annotators, we obtain three types of possible combinations: (i) the two tags are identical; (ii) the two tags are only partially equivalent (e.g., 1/2 and 2); (iii) the two tags are completely different. Almost all identical answers are single-sense tags, while multiple tags are rarely exactly the same (only 6 cases). On the other hand, complete divergences are mainly due to different single reading numbers, but also to the fact that in a high number of cases at least one annotator judged a given word meaning to be missing from the dictionary.

Equiv. Tags                     N              V              A
I (e.g., 1 and 1)               812 (85.1%)    661 (77.1%)    514 (57.7%)
II (* and *)                    –              6 (0.7%)       33 (3.7%)
III (? and ?)                   –              14 (1.6%)      1 (0.1%)
IV (e.g., 1/2 and 1/2)          2 (0.2%)       –              4 (0.4%)
Tot.                            814 (85.3%)    681 (79.4%)    552 (62%)

Part. Equiv. Tags               N              V              A
I (e.g., 1/2 and 1)             49 (5.1%)      35 (4%)        117 (13.1%)
II (e.g., 1/2 and 1/5)          –              –              8 (0.8%)
Tot.                            49 (5.1%)      35 (4%)        125 (14%)

Divergent Tags                  N              V              A
I (e.g., 1 and 2)               71 (7.4%)      92 (10.7%)     154 (17.3%)
II (e.g., 1 and ?)              17 (1.7%)      37 (4.3%)      22 (2.4%)
III (e.g., 1 and *)             1 (0.1%)       2 (0.2%)       17 (1.9%)
IV (e.g., 1 and 4/5)            2 (0.2%)       8 (0.9%)       17 (1.9%)
V (e.g., 1/2 and 4/5)           –              –              3 (0.3%)
VI (* and ?)                    –              1 (0.1%)       –
VII (e.g., 1/2 and ?)           –              1 (0.1%)       –
Tot.                            91 (9.5%)      141 (16.4%)    213 (23.9%)
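A small sketch of how a pair of annotations can be classified into the three combination types shown above, using the tag conventions introduced earlier (reading numbers separated by a slash, '?' for a missing sense, '*' for a wrong POS); the parsing is simplified and purely illustrative.

```python
def parse_tag(tag):
    """Parse an annotator's tag: slash-separated reading numbers,
    '?' (missing sense) or '*' (wrong POS) become one-element sets."""
    return set(tag.split("/"))

def combination_type(tag_a, tag_b):
    """Classify a pair of annotations as in the tables above."""
    a, b = parse_tag(tag_a), parse_tag(tag_b)
    if a == b:
        return "identical"
    if a & b:
        return "partially equivalent"
    return "divergent"

print(combination_type("1", "1"))      # identical
print(combination_type("1/2", "2"))    # partially equivalent
print(combination_type("1", "4/5"))    # divergent
print(combination_type("1", "?"))      # divergent
```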

Finally, it is worth noting that the agreement between annotators also depends on the individual words. Upon closer analysis, it turns out that two verbs which have two senses in the dictionary (arrestare (to arrest; to stop), comprendere (to understand; to include)) and three nouns (agente (agent) (3 senses), compagnia (company; group) (6 senses), lancio (throwing; launching) (3 senses)) were annotated in exactly the same way. In terms of partial agreement, the verb rendere (to render; to return) (6 senses), the noun corso (course; stream; current use; circulation) (8 senses) and the adjective stretto (narrow; tight; close) (5 senses) were also treated in the same way.

There is, moreover, an apparent absence of correlation between the polysemy of a lemma and the agreement vs. disagreement rate. Indeed, highly polysemous words such as passare (to pass) (16 readings) and corso (8 readings) do not have the highest disagreement rates (16% and 1.9%), while lemmas such as biologico (biological) (3 readings) and popolare (popular) (4 readings) show a remarkable disagreement between annotators (73.6% and 75%). However, this is mainly due to the fact that only 4 senses of passare and 2 of corso are attested in the selected corpus. In fact, because of the specificity of the corpus at hand, only some senses of the most polysemous words are attested. For instance, compagnia has six senses in the dictionary but only three of them occur in the corpus according to all annotators. For the same reason the verb importare occurs only with the meaning to import and never with the meaning to matter. Indeed, the degree of attested and actual polysemy in the corpus seems more important than the more 'abstract' or potential degree of polysemy displayed in the dictionary.

3. Major Reasons for Disagreement between Annotators

In this section we discuss the most frequent and regular types of disagreement between annotators and illustrate their causes. We examined in detail the cases where the annotators had disagreed, and classified them according to the scheme below. Generally speaking, divergences of judgement seem to be due to all the elements involved in the experiment, namely the dictionary (88.3%), the human annotators (7.9%), and the corpus (2.3%). The weight of the first element with respect to the other ones is striking. We mainly focus here on the problems related to the dictionary and the corpus, which can be subclassified as shown in the table below.

Causes of Divergence            N              V              A              Tot.
Dictionary Problems
  Ambiguity of Dict. Read.      107 (76.4%)    103 (58.5%)    285 (84.3%)    495 (75.6%)
  Missing Reading               11 (7.8%)      34 (19.3%)     15 (4.4%)      60 (9.1%)
  Multiword Expression          4 (2.8%)       3 (1.7%)       17 (5%)        24 (3.6%)
  Metaphorical usage            –              7 (3.9%)       –              7 (1%)
Corpus Problems
  Too short context             1 (0.7%)       4 (2.2%)       8 (2.3%)       13 (1.9%)
  Type of text                  –              3 (1.7%)       –              3 (0.4%)
Human Errors and Others         17 (12.1%)     22 (12.5%)     13 (3.8%)      52 (7.9%)
Tot.                            140            176            338            654

The ambiguity of dictionary readings is the most important cause of divergence for all POS, and especially for nouns and adjectives. On the other hand, many verbal occurrences were tagged differently because their sense in the corpus was considered missing from the dictionary by one annotator. The other reasons for divergence between annotators seem to be far less important. Nevertheless, their relevance has to be measured with respect to the type of selected corpus. For instance, multiword expressions (from now on MWEs) do not seem to be numerous in the text corpus under scrutiny.

3.1. AMBIGUITY OF DICTIONARY READINGS

By ambiguity of dictionary readings we mainly refer to three different problematic aspects of dictionary definitions that will be examined one by one in this section: vagueness, excessive granularity, inconsistency. In a high number of cases the disagreement between annotators about the interpretation, and therefore assignment, of two or more readings is due to the above mentioned problems. For instance, for the word soluzione (solution), in 31 cases out of 51, one annotator chose reading No. 2, the other, reading No. 3, thus showing the difficulties raised by the choice between the ‘event’ interpretation of reading No. 2 (to solve, to be solved) and the ‘result’ interpretation of reading No. 3 (solution, agreement). Another example is alto which, in 24 cases out of 51, receives reading No. 4 and 8 by different annotators. In this case the problem was to select between alto as big, tall of reading No. 4 and important, elevated of reading No. 8. There are many cases of such ‘regular disagreement’, and the most striking cases of this kind are listed below:

PoS    Lemma          Dic. Readings    Number of Disagr.    N. Occ.
N      soluzione      2 and 3          31                   51
N      ordine         1 and 2          14                   51
N      esercizio      1 and 3          14                   51
N      diritto        3 and 5          11                   51
V      mantenere      1 and 2          14                   51
V      chiedere       1 and 2          11                   51
V      rispondere     5 and 6          10                   51
A      stretto        2 and 4          35                   51
A      utile          1 and 2          24                   51
A      alto           4 and 8          24                   51
A      civile         1 and 2          19                   51
A      particolare    1 and 2          13                   51
A      biologico      2 and 3          11                   38
A      sicuro         1 and 4          10                   43
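A short sketch of how such cases of 'regular disagreement' could be extracted from the doubly annotated corpus (counting, per lemma, the most frequent pair of distinct single readings chosen by the two annotators); the data format is an assumption, not the format actually used.

```python
from collections import Counter

def regular_disagreements(annotations):
    """annotations: list of (lemma, tag_annotator1, tag_annotator2) over all
    corpus occurrences; returns, per lemma, the most frequent pair of
    distinct single readings chosen by the two annotators."""
    pairs = Counter()
    for lemma, t1, t2 in annotations:
        single = "/" not in t1 + t2 and t1 not in "?*" and t2 not in "?*"
        if t1 != t2 and single:
            pairs[(lemma, tuple(sorted((t1, t2))))] += 1
    per_lemma = {}
    for (lemma, readings), count in pairs.items():
        if count > per_lemma.get(lemma, (None, 0))[1]:
            per_lemma[lemma] = (readings, count)
    return per_lemma

# e.g. {'soluzione': (('2', '3'), 31), 'alto': (('4', '8'), 24), ...}
```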

Let us examine the problems of dictionary interpretation in more detail by providing illustrative examples.

3.1.1. Vagueness

The borderline between slightly different meanings is not always clearly stated in dictionary definitions, and neither the examples nor the synonyms provided for each meaning allow a better differentiation. For example, mantenere (to maintain/to keep) – which means in the dictionary both 1. tenere, far durare in modo che non venga meno (i contatti) (to keep contacts) and 2. tenere saldo, difendere (un primato) (to hold the supremacy/a position) – occurs, among others, in the following 'ambiguous' contexts (i.e., where both readings can apply):
− le Nazioni Unite dispongono di forze armate proprie per mantenere la pace. (The United Nations have their own army to maintain peace.)
− Potranno essi ad esempio mantenere la loro condizione di neutralità? (Will they be able to hold, for instance, their position of neutrality?)
− Mentre taluni donatori sono disposti a mantenere l'attuale livello dei loro stanziamenti di aiuto (While some donors are ready to maintain their level of financial help)
In 14 cases, reading No. 1 was chosen by one annotator while the other one assigned reading No. 2 (ten cases) or 1/2 (four cases) for the same corpus occurrences. The vagueness of some sense distinctions in the dictionary is definitely the most important cause of disagreement.

3.1.2. Excessive granularity: need for under-specification

In a number of occurrences, the sense in the corpus context is under-specified with respect to the distinctions in the dictionary, which are, by the way, good and necessary in other contexts. This is a consequence of the lexicographer's need to classify in disjoint classes what frequently appears – in actual usage – as a 'continuum' resistant to clear-cut disjunctions. For instance, conoscere (to know) is defined both as 1. sapere, avere esperienza (to know, to have experience) and as 2. avere notizia, cognizione di qualcosa (to be informed). This distinction is in some ways too fine-grained and cannot be easily applied to all contexts. For example:
− La Commissione conosce i gravi problemi che la siccità pone all'agricoltura portoghese. (The Commission is aware of the big problems that drought causes to Portuguese agriculture.)
− La Commissione conosce perfettamente l'insoddisfacente situazione fiscale in cui si trovano le persone soggette all'imposta sul reddito. (The Commission is fully aware of the unsatisfactory fiscal situation of people who have to pay tax on their income.)
In five cases one annotator chose reading No. 1 and the other reading No. 2, while in two cases the choice was respectively reading No. 2 and 1/2. For these contexts it would be necessary, in reality, to have a reading which is underspecified with respect to the source of the knowledge.


3.1.3. Inconsistency

The same linguistic phenomenon is sometimes treated in different ways in the dictionary. This lack of a coherent theoretical approach behind dictionary definitions forces the annotators to decide individually about the treatment of particular cases. In this sense, dictionary inconsistency is indirectly responsible for the disagreement between different annotators. An interesting example is provided by deverbal nouns, which often have both a 'process/event' and a 'result' interpretation. The dictionary is rather incoherent with respect to this property, since it provides this distinction for lexical items such as acquisto (buying), produzione (production), etc., but not, for instance, for comunicazione (communication), etc., which are defined only as event nominals. Indeed, the disambiguation of these two senses is perhaps translationally and syntactically irrelevant and quite problematic in most contexts, e.g., in the following:
− In una comunicazione al Consiglio e al Parlamento europeo, del 30 aprile 1992 (1), la Commissione ha illustrato le sue riflessioni sulle future relazioni tra la Comunità europea e il Magreb. (In a communication to the Council and to the European Parliament on 30 April 1992, the Commission illustrated its observations about the future contacts between the European Community and the Maghreb.)
so that it seems unrealistic to expect both readings to be present and distinct in the dictionary; but, at least, both word senses should be mentioned together in the definition. However, some contexts clearly select one or the other meaning, as in the following example, where the lemma comunicazione has only a 'result' interpretation:
− La Commissione continuerà pertanto ad esaminare la comunicazione della Commissione del 18 gennaio 1990 dal titolo 'Un grande mercato interno dell'automobile'. (The Commission will continue therefore to examine the communication of the Commission of 18 January 1990, entitled 'A big internal car market'.)

3.2. MISSING READINGS

A surprising regularity of treatment is found for occurrences which receive a question mark from one annotator (judging that the meaning is missing from the dictionary) and one reading number (which, at a closer analysis, looks like the most general sense) from the other. These cases reveal the presence of a real problem of interpretation of the context that the dictionary does not help to solve. For example, coprire (to cover) combined with the contexts settori (areas), zone rurali (rural areas), foreste lontane (far forests), i casi (cases), un divario (a gap), tutte le regioni (all the departments), il fabbisogno (needs), le esigenze (requirements) receives reading No. 4 from one annotator (No. 4: proteggere, difendere dall'offensiva del nemico o dell'avversario: – la ritirata – nel linguaggio bancario e delle assicurazioni, garantire: – un rischio – le spese, recuperare le spese sostenute (to protect, to defend from the enemy's attack – in the banking and insurance domain: to guarantee risks, expenses, to get one's money back)), while the other considers these occurrences as senses missing from the dictionary. Another example is perseguire (to pursue). In 7 corpus contexts, it has a juridical meaning which is not explicitly mentioned in the dictionary (1. cercare di raggiungere, ottenere (un obiettivo) (to pursue an aim); 2. perseguitare (to prosecute; to indict)), as in perseguire i responsabili di gravi violazioni dei diritti internazionali / perseguire le violazioni commesse dagli Stati membri . . . (to prosecute those who are responsible for violations of international rights). One annotator judged this meaning as missing from the dictionary, while the other assigned reading No. 2, which seems to be the closest to the meaning of these corpus occurrences. Once more, the dictionary turns out to be unsatisfactory, at least when confronted with this corpus.

3.3. MULTIWORDS AND METAPHORICAL USAGES

One of the problems of semantic tagging is the treatment of MWEs, even though their frequency depends on the type of selected text corpus and lemmas. For example, breve (short) in the corpus at hand occurs only in MWEs, as do most of the occurrences of capo (head). Examples of MWEs are fare capo alla direzione generale (to link up with the administrative department); in ordine al prelievo parafiscale (as to the fiscal system); libero arbitrio; ribadire a chiare lettere (to say clearly), etc. The semantic tagging of the words in bold raises the following problem: how should we annotate MWEs, i.e., should they be annotated as (i) a set of single elements, or as (ii) non-compositional units? In the first case, which semantic tag (reading number) should be assigned to each element? These questions are strictly related to the way traditional dictionaries provide and structure lexical information. Indeed, (a) only a restricted number of MWEs is provided, and (b) they are usually more or less arbitrarily assigned to one or another reading of the lemma. For instance, figurative expressions such as aprire gli occhi a qualcuno (to make someone aware of something), aprire l'animo a qualcuno (to open one's heart) are considered equivalent to aprire una bottiglia (to open a bottle) and included in the first reading of the verb aprire (to open), which is dischiudere, disserrare (to disclose). In this case, should reading No. 1 be assigned to the verb as a single word?

The previous questions are also connected to the semantic and syntactic peculiarity of MWEs, i.e., to their 'non-compositionality' (Corazzari, 1992). Indeed, the semantic annotation of their single components does not allow us to access all the semantic – and indirectly morpho-syntactic – properties of the sequences as a whole. For instance, if we consider the example aprire la strada a qualcuno/qualcosa (lit. trans.: to open the road to someone/something) we may say that:
− although this expression is structurally complex, it behaves semantically, as well as syntactically, like a single predicate;
− the global meaning of the MWE cannot be derived from the meaning of its components;
− the selectional restrictions as well as the argument structure of the verb aprire are not the same as those of the expression aprire la strada: the first one selects a Subject and an Object, while aprire la strada requires a Subject (either 'human' or 'non-human') and an obligatory Indirect Object.
Also, much simpler MWEs show the same properties. For instance, in ordine al problema economico (as far as the economic problem is concerned) is a combination of two prepositions and a noun, but has a prepositional function as a whole. The non-compositionality of this MWE is particularly evident at the translational level, where in order to (the literal translation) has a totally different meaning. We have just outlined some obvious and well-known reasons in favour of an annotation of MWEs as non-compositional units.

Another phenomenon – somehow connected to MWEs – is an important cause of disagreement between annotators, i.e., the metaphorical usage of a lemma. The borderline between MWEs and metaphorical expressions is sometimes quite fuzzy, even though the latter are potentially unlimited and unpredictable, depending only on the writer/speaker's imagination. Indeed, only the most commonly used metaphorical usages of lemmas are included in the dictionary, under the label 'figurative meanings'. A specific annotation strategy should be set up for handling the metaphorical usage of lemmas coherently, reminding us that they could never be exhaustively listed in the dictionary.

3.4. PROBLEMS RELATED TO THE CORPUS

The annotation problems related to the corpus concern, on the one hand, the type of text and, on the other hand, the size of the context of the word occurrences. Dealing with a multilingual corpus and therefore – as far as Italian is concerned – with a translated corpus, we find wrong or unusual Italian expressions which cannot be easily classified according to the dictionary definitions. For instance, non aprono nessun diritto particolare (lit. trans.: they do not open any particular right) does not seem a correct Italian expression: indeed, aprire is used improperly and therefore it is quite difficult to choose among the different dictionary reading numbers.


Other cases which were differently coded by the annotators for the same reason are:
− condurre una riflessione (lit. trans.: to do an observation)
− condurre una politica di parità (to do a policy of equality)
As to the second problem, the context size – which was established as the sequence of variable length included between two carriage returns – turned out to be insufficient in some rare cases.

4. Some Observations about the Performance of the WSDS

Two systems participated in the evaluation for Italian: from Pisa (ILC) and Rome (Eulogos). The quantitative evaluation of their results is given in Véronis (1998). We provide here only a few observations concerning linguistic aspects related to their performance.

4.1. POLYSEMY AND PERFORMANCE

Also for WSDS there is no clear correlation between the degree of polysemy and the performance of the systems, i.e., the correctness of their results. For instance, the adjectives alto (8 senses) and biologico (3 senses) are wrongly tagged (by one system) in 29 of their occurrences, legale (legal) (2 senses) in 41, and libero (free) (8 senses) in 27. The same is true for nouns and verbs: e.g., centro (centre) (8 senses) is wrong in 6 of its occurrences, while concentrazione (concentration) (2 senses) is wrong in 39; rendere (6 senses) is tagged completely correctly by one system (syntactic clues were very relevant for this particular verb), and passare, with 16 senses, receives just one incorrect tag. We must observe, however, that most words are used in the chosen corpus in just a very few senses: e.g., libero in 2 of its 8 senses, centro also in 2 out of 8, etc. This may have a strong impact on performance and may be more relevant than dictionary polysemy.

It is worth noting that sometimes wrong tags were assigned by a system exactly where the human annotators were in disagreement. This happens more often than expected by chance, and signals clear cases of not enough or not good enough information either in the corpus context or in the dictionary. In a few cases we also observed that a system produced a disjunction of tags exactly in those cases where annotators gave a multiple tag. This is a strong sign of real ambiguity (or too great a similarity) in the dictionary definitions.

4.2. DIFFERENCE IN PERFORMANCE BETWEEN THE SYSTEMS

The two Italian WSDS, even though similar in terms of a global quantitative evaluation (see Véronis, 1998), very often present quite different distributions of wrong and correct tags, obviously due to the different techniques and approaches used. This is a sign of the need for a qualitative analysis/evaluation of the results accompanying the quantitative one, both for an interpretation of the reasons for success and failure, and for the evaluation task to be of real help in improving the systems. We enumerate here some of the differences:
(i) the use of multiple tags was much more frequent in one system, thus increasing the possibility of 'partial agreement' with the reference corpus;
(ii) the '?' sign was much more frequently used by one system, to signal cases of inability to assign a tag, thus increasing precision;
(iii) one system gave one and the same tag to all corpus occurrences for many words (8 verbs out of 20), thus hinting at the possible technique of choosing the most probable word sense (the disadvantage being that they may be all wrong, as happened with one word!).

4.3. USE OF MULTIPLE TAGS AND CASES OF DISAGREEMENT IN HUMAN ANNOTATION: THEIR EFFECTS ON THE EVALUATION

The use of multiple tags or – even worse – the cases of disagreement in the human annotated corpus largely increase the possibility of success for the WSDS calculated in terms of ‘partial agreement’. Where human annotators disagree, the ‘gold standard’ includes all the tags that either of the annotators gave, so there is much more chance of a WSDS coinciding with at least one of the two (or more) tags. Therefore, the paradoxical situation arises that the most complex or difficult cases (where multiple tags are given or there is disagreement between annotators) are somehow the easiest for the systems if calculation of success is done in terms of partial agreement. This has to be weighted in the quantitative evaluation of WSDS.
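The effect described above can be seen with a toy computation: when the gold standard is the union of the tags given by disagreeing annotators, any system answer that matches either annotator counts as a success. The scoring function below is a deliberately simplified illustration, not the evaluation code that was used.

```python
def union_score(system, annotator_tags):
    """Partial-agreement scoring against the union of all annotators' tags."""
    return 1.0 if system & set().union(*annotator_tags) else 0.0

# one context where the two annotators disagree completely
annotators = [{"1"}, {"2"}]
print(union_score({"2"}, annotators))        # 1.0 -- counted as a success
print(union_score({"1"}, annotators))        # 1.0 -- also a success, whichever it picks
# with agreeing annotators, only one answer scores
print(union_score({"2"}, [{"1"}, {"1"}]))    # 0.0
```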

4.4. SOME CONCLUSIONS WITH RESPECT TO THE ANALYSIS OF WSDS' RESULTS

The first important observation is that it is necessary to analyse qualitatively (not only quantitatively) the results, because the simple numbers can be misleading, e.g.:
− a specific text type may privilege one or two readings only, thus allowing an easier tuning of the system;
− a text with many recurrent MWEs may facilitate disambiguation.
It is therefore better to test systems with contexts taken from many different text types, so that a larger variety of readings is attested. We noticed in fact that actual polysemy in the text corpus is much more problematic than theoretical/potential polysemy in the dictionary.


In general, there is no correlation between multiple tags assigned by annotators and by the systems. However, the contexts with different tags given by different annotators present a quite different typology of cases, which must be carefully considered in order to better evaluate the quantitative results. The following cases require a different interpretation:
− at a better analysis, one tag is correct, the other is wrong: the 'partial agreement' evaluation with respect to one tag only (the incorrect one) may wrongly inflate the success rate;
− a '?' tag, saying that a reading is missing, and a reading number are given: this is more difficult to match by the system than if two different reading numbers are given (one of the two may be more easily matched);
− the two tags are both applicable, because the context is actually ambiguous between the two and/or the dictionary readings are not differentiated enough (e.g., chiedere (to ask), between 1. in order to obtain and 2. in order to know, or conoscere, between 1. to experience and 2. to know): many contexts can express both senses at the same time (these are the cases for which an underspecified reading/tag would be useful).
This last type of disagreement, i.e., the cases of 'real' ambiguity, is common in the contexts examined. This is clear evidence of the gap existing very often between (i) a sort of 'theoretical language', used by linguists/lexicographers who have to classify the linguistic world in disjoint classes, and (ii) actual usage of the language, which is very often a 'continuum' resistant to clear-cut disjunctions, and needs to remain ambiguous with respect to imposed classifications. This is particularly true at the level of semantic analysis and annotation, where vagueness of language is a 'requirement' and not a 'problem' to be eliminated. The problem is then how to individuate when this second type of 'only apparent disagreement' is present, thus pointing to a problem in the dictionary used: partial agreement by the system is here perfectly acceptable.

Again, figures must be carefully handled. Paradoxically, if there is disagreement between annotators or if there are multiple tags – as said above – it is much easier for a system to agree with at least one annotator: if both tags are possible (as for ambiguous contexts) there is no problem, but if only one tag is correct and the system agrees with the other tag, then a system may be evaluated highly while making mistakes. The same situation arises if it is the system which uses many multiple tags (at least one may more easily agree with an annotator). The conclusion in these cases is that 'the more difficult the easier' for a system. On the other side, to discard all cases of disagreement or multiple tags is obviously incorrect: they have different meanings in different situations – as said above. The conclusion is that more attention should be paid to the definition of the quantitative criteria for evaluation, to take care of these aspects.


5. Lessons Learned from the Present Experiment and Main Conclusions

Finally, we would like to draw some conclusions about the way the experiment was conducted, in order to point out its limits and to contribute to improving future initiatives of this kind.

5.1. THE DICTIONARY: TOWARDS A COMPUTATIONAL LEXICON WITH SEMANTICS

The choice and interpretation of the dictionary turned out to be a critical issue. In particular, the printed dictionary proved not to be sufficiently representative of the language attested in the text corpus. In a future round, a computational lexicon could be used for Italian, e.g., the EuroWordNet (Alonge et al., 1998; Rodriguez et al., 1998) or SIMPLE (Ruimy et al., 1999) lexicons (with their extensions as provided in the Italian National Projects starting in '99). This would give more coherent and useful results from an LE viewpoint, with the use of semantic types and hierarchical information enabling semantic generalisations. In general, disagreement between annotators (and sometimes the use of multiple tags) is to be interpreted as a warning that there is something wrong in the dictionary used (or in its interpretation by the annotator, which frequently amounts to something not being clear in it). Some important requirements for a computational lexicon with semantics, as emerged from this analysis, are the following:
− the need for underspecified readings in particular cases (perhaps subsuming more granular distinctions, to be used only when disambiguation is feasible in a context): this implies paying careful attention to the phenomenon of regular underspecification/polysemy as it occurs in texts;
− the need for different readings to be well differentiated, otherwise annotators and systems tend to disagree or to give multiple tags, thus inappropriately augmenting the chances of success in the evaluation;
− the need for good dictionary coverage with respect to attested readings (to avoid the gap between current dictionaries' 'theoretical' language and the 'actual' language used in text corpora), possibly with an indication of domain/text-type differences;
− the need for encoding/listing MWEs;
− the need for encoding metaphorical usage.
A detailed analysis of the representation and encoding of the last two aspects remains to be done. It is worth noting that, from a practical point of view, a better encoding of MWEs could simplify automatic annotation, since they could be provided as a simple list to WSDS.
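Purely as an illustration of this last point (not something used in the ROMANSEVAL exercise), a longest-match lookup over a plain MWE list could be run before word-level sense assignment; the MWE entries, function names and tokenisation below are hypothetical.

```python
# A minimal sketch of MWE pre-tagging: before word-level sense assignment,
# mark the longest multi-word expression found at each position so that a
# WSD system can treat it as a single unit. The MWE list and tokeniser are
# illustrative placeholders, not part of the SENSEVAL/ROMANSEVAL materials.

MWES = {("in", "breve"), ("accident", "and", "emergency")}  # hypothetical entries
MAX_LEN = max(len(m) for m in MWES)

def mark_mwes(tokens):
    """Return a list of units: matched MWEs joined into one unit, single tokens otherwise."""
    units, i = [], 0
    while i < len(tokens):
        match = None
        # try the longest span first
        for span in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + span])
            if candidate in MWES:
                match = candidate
                break
        if match:
            units.append(" ".join(tokens[i:i + len(match)]))
            i += len(match)
        else:
            units.append(tokens[i])
            i += 1
    return units

print(mark_mwes("Torno in breve tempo".split()))  # ['Torno', 'in breve', 'tempo']
```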


Crucial questions for a semantic computational lexicon are the following:
− Should/could a dictionary contain an indication of the clues for disambiguation associated with each reading (e.g., syntactic vs. semantic vs. lexical clues), when this is feasible?
− If so, could we profit from the task of manual semantic annotation of the so-called 'gold standard', and ask lexicographers to make such clues explicit where they can be identified? It is well known that this is not an easy task, because different strategies, working at different levels of linguistic analysis, are often at play in a disambiguation task. This is one of the aspects that makes semantic disambiguation such a difficult and challenging task.
− Do available dictionaries contain all that is needed for semantic classification/disambiguation? Or is there a need for other dimensions?
These are non-trivial aspects which deserve attention when planning and designing a computational lexicon.

5.2. THE NATURE OF 'MEANING'

Nevertheless, it is worth noting that one of the central questions is the nature of 'meaning' itself, which is rather a 'continuum' resistant to clear-cut distinctions, as the need for multiple tags in semantic annotation proves. Indeed, human intuition and sensibility still play a relevant role in word sense disambiguation, especially when dictionary definitions are unsatisfactory and leave to the annotator the task of interpreting them. From this point of view it is interesting that, for instance, multiple tags are very rarely equivalent between annotators. Underspecification is a partial attempt to tackle this aspect.

5.3. THE CORPUS: TOWARDS A SEMANTICALLY TAGGED CORPUS FOR LE

The phase of selecting the corpus material appears to be crucial for the correct performance of the experiment. In particular, it seems advisable to select a 'balanced' reference corpus which reflects a variety of text types, genres and domains, rather than a specific text corpus like the one that we chose to satisfy the multilingual requirement. (It is well known that only a narrow range of parallel corpora is available.) Indeed, a specific text type may privilege only a subset of the senses of a given lemma, thus simplifying the annotation task and increasing the chances of success, since a WSDS may be tuned ad hoc to choose only among the most probable readings in that domain/text type/genre. At the same time, a text with many recurrent MWEs may also facilitate disambiguation, since WSDS can be provided with an ad hoc list. Conversely, a text corpus with a small number of MWEs gives a distorted view of the language, leading to the conclusion that MWEs are not an important problem. Variety and representativity of (i) lemmas, (ii) MWEs, (iii) senses, and (iv) linguistic problems are only guaranteed by a well-balanced corpus, in the same way as the correctness/reliability of the results is guaranteed by a well-designed dictionary. Again, in a future round a more balanced semantically tagged corpus, produced within the Italian National Project, will be used, similarly to what happened this time for English.

5.4. THE CHOICE OF THE LEMMAS

Considering now the selected lemmas, it is advisable to extract for each of them a reasonable number of different word-forms, for two main reasons: first, some specific senses are connected to a particular morpho-syntactic form, which implies that by excluding a certain word-form we also exclude some senses; secondly, a particular word-form may occur in a given text type predominantly with only one sense, providing a partial view of the lemma's different senses and a distorted view of their frequency (this is why all the examined corpus occurrences of breve are the same, i.e., the MWE in breve). As we have already stressed, the context size of the occurrence is also relevant to correct semantic annotation. It seems advisable to choose a larger, more informative context window, in order to allow better sense disambiguation, be it manual or automatic.

5.5. INTERACTION BETWEEN SEMANTICS AND SYNTAX

The interaction between semantics and syntax is interesting from the perspective of automatic tagging, i.e., for WSDS. An analysis of the linguistic level at which the optimal clues for disambiguation are found (e.g., a particular subcategorised preposition, a lexical collocation, the co-occurrence with a specific subject, or even a particular morphological inflection) could lead to adding a very useful type of information to the different senses of an entry in a computational lexicon. The expensive phase of human semantic annotation, necessary to build a large and representative semantically tagged corpus, could also aim at this result, i.e., at identifying, when possible, the clues for disambiguation, so that they can be encoded in a computational lexicon.

5.6. NEED FOR A COMMON ENCODING POLICY?

The present initiative was intended to prepare the ground for a future real task of semantic tagging/evaluation for LE applications. From this perspective, one of the questions to be asked is the following:
− Can we define a 'gold standard' for evaluation (and training) of WSDS, and how?


To answer this question in a way that is meaningful for LE applications implies not only an analysis of the state of the art, and experiments like the present one, but also careful consideration of the needs of the community, including applicative/industrial requirements, before starting any large corpus-annotation development initiative intended to fulfil NLP application requirements with respect to WSD. This aspect has not really been considered in the present initiative. The above question implies other questions:
− Can we agree on a common encoding policy? Is it feasible? Desirable? To what extent?
A few actions in this direction could be the following:
− to base semantic tagging on commonly accepted standards/guidelines (with implications for a future EAGLES initiative): to what level this can be done remains to be considered;
− to involve the community, and to collect and analyse existing semantically tagged corpora used for different applications;
− to analyse the different application requirements before providing the necessary common platform of semantically tagged corpora;
− to build a core set of semantically tagged corpora, encoded in a harmonised way, for a number of languages.
A future EAGLES group could work on these tasks, building on and extending the results of the current group on Lexicon Semantics (Sanfilippo et al., 1999), towards the objective of creating a large harmonised infrastructure for evaluation and training, which is so important in Europe, where all the difficulties connected with the task of building language resources are multiplied by the multilingual factor.

Notes
1 The corpus is part of the MLCC Corpus distributed by ELRA.
2 The missing 2.3% concerns the semantic tags consisting of a star, which were not considered in the calculation because they are irrelevant to the present discussion.

References

Alonge, A., N. Calzolari, P. Vossen, L. Loksma, I. Castellon, M. A. Marti and W. Peters. "The Linguistic Design of the EuroWordNet Database". Special Issue on EuroWordNet, Computers and the Humanities, 32(2–3) (1998).
Busa, F., N. Calzolari, A. Lenci and J. Pustejovsky. "Building a Lexicon: Structuring and Generating Concepts". Computational Semantics Workshop, Tilburg, 1999.
Corazzari, O. Phraseological Units. NERC Working Paper, NERC-92-WP8-68, 1992.
Garzanti Editore. Dizionario Garzanti di Italiano. Garzanti Editore, Milano, 1995.


Rodriguez, H., S. Climent, P. Vossen, L. Loksma, W. Peters, A. Alonge, F. Bertagna and A. Roventini. "The Top-Down Strategy for Building EuroWordNet: Vocabulary Coverage, Base Concepts and Top Ontology". Special Issue on EuroWordNet, Computers and the Humanities, 32(2–3) (1998).
Ruimy, N. et al. SIMPLE – Lexicon Documentation for Italian. D.03.n.1, Pisa, 1999.
Sanfilippo, A. et al. Preliminary Recommendations on Semantic Encoding. EAGLES LE3-4244, 1999.
Veronis, J. Presentation of SENSEVAL. Workshop Proceedings, Herstmonceux, 1998.

Computers and the Humanities 34: 79–84, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Tagger Evaluation Given Hierarchical Tag Sets

I. DAN MELAMED1 and PHILIP RESNIK2
1 West Group (E-mail: [email protected]); 2 University of Maryland (E-mail: [email protected])

Abstract. We present methods for evaluating human and automatic taggers that extend current practice in three ways. First, we show how to evaluate taggers that assign multiple tags to each test instance, even if they do not assign probabilities. Second, we show how to accommodate a common property of manually constructed “gold standards” that are typically used for objective evaluation, namely that there is often more than one correct answer. Third, we show how to measure performance when the set of possible tags is tree-structured in an IS-A hierarchy. To illustrate how our methods can be used to measure inter-annotator agreement, we show how to compute the kappa coefficient over hierarchical tag sets. Key words: evaluation, ambiguity resolution, WSD, inter-annotator agreement

1. Introduction

Objective evaluation has been central in advancing our understanding of the best ways to engineer natural language processing systems. A major challenge of objective evaluation is to design fair and informative evaluation metrics, and algorithms to compute those metrics. When the task involves any kind of tagging (or "labeling"), the most common performance criterion is simply "exact match," i.e. exactly matching the right answer scores a point, and no other answer scores any points. This measure is sometimes adjusted for the expected frequency of matches occurring by chance (Carletta, 1996). Resnik and Yarowsky (1997, 1999), henceforth R&Y, have argued that the exact match criterion is inadequate for evaluating word sense disambiguation (WSD) systems. R&Y proposed a generalization capable of assigning partial credit, thus enabling more informative comparisons on a finer scale.
In this article, we present three further generalizations. First, we show how to evaluate non-probabilistic assignments of multiple tags. Second, we show how to accommodate a common property of manually constructed "gold standards" that are typically used for objective evaluation, namely that there is often more than one correct answer. Third, we show how to measure performance when the set of possible tags is tree-structured in an IS-A hierarchy. To illustrate how our methods can be applied to the comparison of human taggers, we show how to compute the kappa coefficient (Siegel and Castellan, 1988) over hierarchical tag sets.


Table I. Hypothetical output of four WSD systems on a test instance, where the correct sense is (2). The exact match criterion would assign zero credit to all four systems. Source: (Resnik and Yarowsky, 1997)

sense of interest (in English)        WSD System 1   WSD System 2   WSD System 3   WSD System 4
(1) monetary (e.g. on a loan)              0.47           0.85           0.28           1.00
(2) stake or share ⇐ correct               0.42           0.05           0.24           0.00
(3) benefit/advantage/sake                 0.06           0.05           0.24           0.00
(4) intellectual curiosity                 0.05           0.05           0.24           0.00

Our methods depend on the tree structure of the tag hierarchy, but not on the nature of the nodes in it. For example, although these generalizations were motivated by the SENSEVAL exercise (Kilgarriff and Palmer, this issue), the mathematics applies just as well to any tagging task that might involve hierarchical tag sets, such as part-of-speech tagging or semantic tagging (Chinchor, 1998). With respect to word sense disambiguation in particular, questions of whether part-of-speech and other syntactic distinctions should be part of the sense inventory are orthogonal to the issues addressed here.

2. Previous Work

Work on tagging tasks such as part-of-speech tagging and word sense disambiguation has traditionally been evaluated using the exact match criterion, which simply computes the percentage of test instances for which exactly the correct answer is obtained. R&Y noted that, even if a system fails to uniquely identify the correct tag, it may nonetheless be doing a good job of narrowing down the possibilities. To illustrate the myopia of the exact match criterion, R&Y used the hypothetical example in Table I. Some of the systems in the table are clearly better than others, but all would get zero credit under the exact match criterion. R&Y proposed the following measure, among others, as a more discriminating alternative:

\[ \mathrm{Score}(A) = \Pr_A(c \mid w, \mathrm{context}(w)) \tag{1} \]

In words, the score for system A on test instance w is the probability assigned by the system to the correct sense c given w in its context. In the example in Table I, System 1 would get a score of 0.42 and System 4 would score zero.


3. New Generalizations

The generalizations below start with R&Y's premise that, given a probability distribution over tags and a single known correct tag, the algorithm's score should be the probability that the algorithm assigns to the correct tag.

3.1. NON-PROBABILISTIC ALGORITHMS

Algorithms that output multiple tags but do not assign probabilities should be treated as assigning uniform probabilities over the tags that they output. For example, an algorithm that considers tags A and B as possible, but eliminates tags C, D and E for a word with 5 tags in the reference inventory should be viewed as assigning probabilities of 0.5 each to A and B, and probability 0 to each of C, D, and E. Under this policy, algorithms that deterministically select a single tag are viewed as assigning 100% of the probability mass to that one tag, like System 4 in Table I. These algorithms would get the same score from Equation 1 as from the exact match criterion.

3.2. MULTIPLE CORRECT TAGS

Given multiple correct tags for a given word token, the algorithm's score should be the sum of all probabilities that it assigns to any of the correct tags; that is, multiple tags are interpreted disjunctively. This is consistent with the instructions provided to the SENSEVAL annotators: "In general, use disjunction . . . where you are unsure which tag to apply" (Krishnamurthy and Nicholls, 1998). In symbols, we build on Equation 1:

\[ \mathrm{Score}(A) = \sum_{t=1}^{C} \Pr_A(c_t \mid w, \mathrm{context}(w)) \tag{2} \]

where t ranges over the C correct tags. Even if it is impossible to know for certain whether annotators intended a multi-tag annotation as disjunctive or conjunctive, the disjunctive interpretation gives algorithms the benefit of the doubt.
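To make the policies of Sections 3.1 and 3.2 concrete, here is a minimal Python sketch (an illustration, not the SENSEVAL scorer itself) that turns a non-probabilistic multi-tag output into a uniform distribution and computes the disjunctive score of Equation 2; the function and variable names are illustrative.

```python
# Minimal sketch of the flat (non-hierarchical) scoring of Equations 1-2:
# non-probabilistic outputs become uniform distributions over the tags the
# algorithm kept, and multiple correct tags are interpreted disjunctively.

def to_distribution(output_tags):
    """Spread probability mass uniformly over the tags an algorithm output."""
    p = 1.0 / len(output_tags)
    return {tag: p for tag in output_tags}

def score(output, correct_tags):
    """Equation 2: sum of the probability assigned to any correct tag."""
    dist = to_distribution(output) if isinstance(output, (list, set, tuple)) else output
    return sum(dist.get(tag, 0.0) for tag in correct_tags)

# System 1 from Table I (probabilistic) on the correct sense (2):
system1 = {"monetary": 0.47, "stake": 0.42, "benefit": 0.06, "curiosity": 0.05}
print(score(system1, {"stake"}))                  # 0.42

# A non-probabilistic system that narrowed the choice down to two tags:
print(score(["monetary", "stake"], {"stake"}))    # 0.5
```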

3.3. TREE-STRUCTURED TAG SETS

The same scoring criterion can be used for structured tag sets as for unstructured ones: What is the probability that the algorithm assigns to any of the correct tags? The complication for structured tag sets is that it is not obvious how to compare tags that are in a parent-child relationship. The probabilistic evaluation of taggers can be extended to handle tree-structured tag sets, such as HECTOR (Atkins, 1993), if the structure is interpreted as an IS-A hierarchy. For example, if word sense A.2 is a sub-sense of word sense A, then any word token of sense A.2 also IS-A token of sense A.


Figure 1. Example tag inventory.

Under this interpretation, the problem can be solved by defining two kinds of probability distributions:
1. Pr(occurrence of parent tag | occurrence of child tag);
2. Pr(occurrence of child tag | occurrence of parent tag).
In a tree-structured IS-A hierarchy Pr(parent|child) = 1, so the first one is easy. The second one is harder, unfortunately; in general, these ("downward") probabilities are unknown. Given a sufficiently large training corpus, the downward probabilities can be estimated empirically. However, in cases of very sparse training data, as in SENSEVAL, such estimates are likely to be unreliable, and may undermine the validity of experiments based on them. In the absence of reliable prior knowledge about tag distributions over various tag-tree branches, we appeal to the maximum entropy principle, which dictates that we assume a uniform distribution of sub-tags for each tag. This assumption is not as bad as it may seem. It will be false in most individual cases, but if we compare tagging algorithms by averaging performance over many different word types, most of the biases should come out in the wash.
Now, how do we use these conditional probabilities for scoring? The key is to treat each non-leaf tag as under-specified. For example, if sense A has just the two subsenses A.1 and A.2, then tagging a word with sense A is equivalent to giving it a probability of one half of being sense A.1 and one half of being sense A.2, given our assumption of uniform downward probabilities. This interpretation applies both to the tags in the output of tagging algorithms and to the manual (correct, reference) annotations.

4. Example

Suppose our sense inventory for a given word is as shown in Figure 1. Under the assumption of uniform downward probabilities, we start by deducing that Pr(A.1|A) = 0.5, Pr(A.1a|A.1) = 0.5 (so Pr(A.1a|A) = 0.25), Pr(B.2|B) = 1/3, and so on. If any of these conditional probabilities is reversed, its value is always 1. For example, Pr(A|A.1a) = 1. Next, these probabilities are applied in computing Equation 2, as illustrated in Table II.


Table II. Examples of the scoring scheme, for the tag inventory in Figure 1.

Manual Annotation    Algorithm's Output    Score
B                    A                     0
A                    A                     1
A                    A.1                   1
A                    A.1b                  1
A.1                  A                     0.5
A.1 and A.2          A                     0.5 + 0.5 = 1
A.1a                 A                     0.25
A.1a and B.2         B                     Pr(B.2|B) = 1/3
A.1a and B.2         A.1                   0.5
A.1a and B.2         A.1 and B.2           0.5 × 0.5 + 0.5 × 1 = 0.75
A.1a and B.2         A.1 and B             0.5 × 0.5 + 0.5 × 1/3 ≈ 0.42
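The following Python sketch reproduces this scoring scheme by distributing every tag uniformly over the leaves of an inventory matching the description of Figure 1; it is an illustration, not the authors' implementation, and the names of B's sub-senses other than B.2 are assumptions (only the three-way split under B is given in the text).

```python
# Sketch of hierarchical scoring under uniform downward probabilities:
# every tag (manual or system) is distributed uniformly over its leaves, and
# the score is the system mass that falls on leaves covered by the correct
# annotation. Inventory assumed from the text: A -> A.1, A.2; A.1 -> A.1a,
# A.1b; B -> B.1, B.2, B.3 (B.1 and B.3 are hypothetical names).

CHILDREN = {
    "A": ["A.1", "A.2"], "A.1": ["A.1a", "A.1b"],
    "B": ["B.1", "B.2", "B.3"],
}

def leaf_distribution(tags):
    """Distribute a disjunctive set of tags uniformly over the leaf senses."""
    dist = {}
    share = 1.0 / len(tags)

    def spread(tag, mass):
        kids = CHILDREN.get(tag)
        if not kids:                      # leaf: keep the mass here
            dist[tag] = dist.get(tag, 0.0) + mass
        else:                             # non-leaf: uniform downward probability
            for kid in kids:
                spread(kid, mass / len(kids))

    for tag in tags:
        spread(tag, share)
    return dist

def hierarchical_score(system_tags, correct_tags):
    sys_dist = leaf_distribution(system_tags)
    correct_leaves = leaf_distribution(correct_tags).keys()
    return sum(sys_dist.get(leaf, 0.0) for leaf in correct_leaves)

# A few rows of Table II:
print(hierarchical_score(["A"], ["A.1a"]))                  # 0.25
print(hierarchical_score(["A.1", "B.2"], ["A.1a", "B.2"]))  # 0.75
print(hierarchical_score(["A.1", "B"], ["A.1a", "B.2"]))    # ≈ 0.4167 (Table II: ≈ 0.42)
```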

5. Inter-Annotator Agreement Given Hierarchical Tag Sets

Gold standard annotations are often validated by measurements of inter-annotator agreement. The computation of any statistic that may be used for this purpose necessarily involves comparing tags to see whether they are the same. Again, the question arises as to how to compare tags that are in a parent-child relationship. We propose the same answer as before: treat non-leaf tags as underspecified. To compute agreement statistics under this proposal, every non-leaf tag in each annotation is recursively distributed over its children, using uniform downward probabilities. The resulting annotations involve only the most specific possible tags, which can never be in a parent-child relationship. Agreement statistics can then be computed as usual, taking into account the probabilities distributed to each tag. One of the most common measures of pairwise inter-annotator agreement is the kappa coefficient (Siegel and Castellan, 1988):

\[ K = \frac{\Pr(A) - \Pr(E)}{1 - \Pr(E)} \tag{3} \]

where Pr(A) is the proportion of times that the annotators agree and Pr(E) is the probability of agreement by chance. Once the annotations are distributed over the leaves L of the tag inventory, these quantities are easy to compute. Given a set of test instances T,

\[ \Pr(A) = \frac{1}{|T|} \sum_{t \in T} \sum_{l \in L} \Pr(l \mid \mathrm{annotation}_1(t)) \cdot \Pr(l \mid \mathrm{annotation}_2(t)) \tag{4} \]

\[ \Pr(E) = \sum_{l \in L} \Pr(l)^2 \tag{5} \]
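A small illustrative Python sketch of Equations 3–5, operating on annotations that have already been distributed over leaf senses (e.g. by the leaf-distribution sketch above). How Pr(l) in Equation 5 is estimated is not spelled out here; pooling the two annotators' leaf distributions is one plausible choice and is an assumption of this sketch.

```python
# Sketch of the kappa computation of Equations 3-5 over leaf-distributed
# annotations: one dict of leaf probabilities per test instance and annotator.

def kappa(leaf_dists_1, leaf_dists_2):
    n = len(leaf_dists_1)

    # Equation 4: observed agreement, averaged over test instances.
    p_agree = sum(
        sum(d1.get(leaf, 0.0) * d2.get(leaf, 0.0) for leaf in set(d1) | set(d2))
        for d1, d2 in zip(leaf_dists_1, leaf_dists_2)
    ) / n

    # Pr(l): pooled relative frequency of each leaf over both annotators (assumption).
    pooled = {}
    for dists in (leaf_dists_1, leaf_dists_2):
        for d in dists:
            for leaf, mass in d.items():
                pooled[leaf] = pooled.get(leaf, 0.0) + mass / (2 * n)

    p_chance = sum(p * p for p in pooled.values())   # Equation 5
    return (p_agree - p_chance) / (1 - p_chance)     # Equation 3

# Annotator 1 tagged instance 1 with A.1 and instance 2 with B (distributed);
# annotator 2 used the leaves A.1a and B.2 directly.
ann1 = [{"A.1a": 0.5, "A.1b": 0.5}, {"B.1": 1/3, "B.2": 1/3, "B.3": 1/3}]
ann2 = [{"A.1a": 1.0}, {"B.2": 1.0}]
print(round(kappa(ann1, ann2), 3))
```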


Computing these probabilities over just the leaves of the tag inventory ensures that the importance of non-leaf tags is not inflated by double-counting.

6. Conclusion

We have presented three generalizations of standard evaluation methods for tagging tasks. Our methods are based on the principle of maximum entropy, which minimizes potential evaluation bias. As with the R&Y generalization in Equation 1, and the exact match criterion before it, our methods produce scores that can be justifiably interpreted as probabilities. Therefore, decision processes can combine these scores with other probabilities in a maximally informative way by using the axioms of probability theory.
Our generalizations make few assumptions, but even these few assumptions lead to some limitations on the applicability of our proposal. First, although we are not aware of any algorithms that were designed to behave this way, our methods are not applicable to algorithms that conjunctively assign more than one tag per test instance. A potentially more serious limitation is our interpretation of tree-structured tag sets as IS-A hierarchies. There has been considerable debate, for example, about whether this interpretation is valid for such well-known tag sets as HECTOR and WordNet.
This work can be extended in a number of ways. For example, it would not be difficult to generalize our methods from trees to hierarchies with multiple inheritance, such as WordNet (Fellbaum, 1998).

References

Atkins, S. "Tools for computer-aided lexicography: the Hector project". In Papers in Computational Lexicography: COMPLEX '93. Budapest, 1993.
Carletta, J. "Assessing agreement on classification tasks: the Kappa statistic". Computational Linguistics, 22(2) (1996), 249–254.
Chinchor, N. (ed.) Proceedings of the 7th Message Understanding Conference. Columbia, MD: Science Applications International Corporation (SAIC), 1998. Online publication at http://www.muc.saic.com/proceedings/muc_7_toc.html.
Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
Krishnamurthy, R. and D. Nicholls. "Peeling an onion: the lexicographer's experience of manual sense-tagging". In SENSEVAL Workshop. Sussex, England, 1998.
Resnik, P. and D. Yarowsky. "A perspective on word sense disambiguation methods and their evaluation". In M. Light (ed.): ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How? Washington, D.C., 1997.
Resnik, P. and D. Yarowsky. "Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation". Natural Language Engineering, 5(2) (1999).
Siegel, S. and N. J. Castellan, Jr. Nonparametric Statistics for the Behavioral Sciences. Second edition. McGraw-Hill, 1988.

Computers and the Humanities 34: 85–97, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Peeling an Onion: The Lexicographer's Experience of Manual Sense-Tagging

RAMESH KRISHNAMURTHY1,∗ and DIANE NICHOLLS2
1 [email protected]; 2 [email protected]

Abstract. SENSEVAL set itself the task of evaluating automatic word sense disambiguation programs (see Kilgarriff and Rosenzweig, this volume, for an overview of the framework and results). In order to do this, it was necessary to provide a ‘gold standard’ dataset of ‘correct’ answers. This paper will describe the lexicographic part of the process involved in creating that dataset. The primary objective was for a group of lexicographers to manually examine keywords in a large number of corpus contexts, and assign to each context a sense-tag for the keyword, taken from the Hector dictionary. Corpus contexts also had to be manually part-of-speech (POS) tagged. Various observations made and insights gained by the lexicographers during this process will be presented, including a critique of the resources and the methodology. Key words: context, corpus, evaluation, lexicography, part-of-speech tagging, word sense disambiguation, sense-tagging

1. Introduction

Lexicography is a multi-faceted activity. Far from being a harmless drudge, a lexicographer needs to access a wide range of linguistic and cultural knowledge and employ analytical and editorial skills in a creative process that is neither wholly art nor wholly science. Using corpus contexts is a relatively recent methodology (Sinclair, 1987). It can add enormously to intuition and introspection, especially in terms of accuracy and frequency. It can also be expensive and time-consuming (not to mention repetitive and tedious for the lexicographer). Getting any two human beings to agree on anything can be difficult, and lexicographers are generally more disputatious than average. In this particular task, knowing that a crucial aspect of our role was in providing independent taggings in order to gauge the degree of consensus among human taggers, the lexicographers deliberately did most of the work in isolation. We knew that others might be analysing the same word, but did not communicate with them about it in any detail.
Six highly experienced lexicographers participated in the manual tagging,1 and the whole exercise spanned approximately two months. In late May, the lexicographers were sent draft tagging instructions, Hector dictionary entries (see Atkins, 1993), and 100 corpus contexts for the test word promise. This was followed by a face-to-face meeting in Brighton in early June, to compare experiences, fine-tune the procedures and so on. Thereafter, there was very little communication, apart from the occasional email or telephone call. The deadline was fixed for 17th July. Subsequently, there was a brief second phase, during which disagreements between human taggers and/or Hector's tagging were reviewed by three of the lexicographers (see Kilgarriff and Rosenzweig, this volume). This paper is based on the experiences and comments of all the lexicographers who took part, but the responsibility for any errors or misrepresentations lies with the authors.
Throughout this paper, Hector dictionary headwords and POS tags are in bold, sense-tags are underlined, context words are in capitals, and corpus contexts are in italics.

2. Procedures

The Hector dictionary entries consisted of headwords with numbered senses and subsenses, each associated with a mnemonic codeword, some clues (syntax, collocates), optionally a subject field or register specification, a definition, and one or more examples (often with additional notes). The corpus contexts were numbered, and the word to be tagged was the first instance of the headword in the last sentence of the context. Lexicographers were to return files containing: context number, Hector sense-mnemonic, and part-of-speech. Various options were available for complex cases, with unassignable, typo (i.e. typographic error) and proper noun as special tags. Specific instructions to lexicographers included the following:
(a) If a corpus instance clearly matches a subsense, assign the subsense. If the match is not clear, assign the main-level sense (e.g. sense 4.1 of promise is 'make a promise', so a corpus instance such as He muttered a promise should not be assigned to sense 4.1, but to the more general main sense 4).
(b) Tag gerunds as a (for adjective) or n (for noun). Note that 'promising' may be a noun form, but it is not the common noun form of 'promise' (which is "promise"!), so it would be misleading to tag it n. In such cases, use the POS-tag "?".
(c) Treat heads of reduced relative clauses (i.e. -ed and -ing forms) as verb occurrences.
(d) When assigning POS, do not treat the lexical unit as something larger than the single word (even if it is linguistically accurate to do so). Give the POS for the target word alone.
(e) In return files, the first column is for reference number, second for mnemonic, third for POS.
(f) Where there is not enough context to be certain which sense applies, write 'no-context' in the fourth column.
(g) Use disjunction ('mnemonic1 or mnemonic2 or mnemonic3') in the mnemonic column.
(h) In general, use disjunction rather than opting for just one tag, or using the ? or x (for one-off 'exploitations') suffixes, where you are unsure which tag to apply.

3. Time Constraints and Working Methods

The average working rate was 66 contexts per hour (reported rates varied from 40 contexts per hour to over 100). The rates were lower at the beginning of the task, and also varied according to the difficulty of each word. All the lexicographers found that they worked faster as they became more accustomed to the dictionary sense-divisions and the mnemonic tags. Also, whereas they tended to look at the whole context initially, for later contexts a quick glance at the immediately surrounding context was often sufficient.

4. Hector Dictionary Entries

4.1. SENSE DIVISIONS

Sense-division is a notoriously difficult area of lexicography (Stock, 1984; Kilgarriff, 1998), and one that can give rise to heated and acrimonious debate on occasion. The lexicographers in this exercise were quite critical of the Hector sense divisions that they were being compelled to apply to the corpus contexts. They frequently suggested that the distinctions made in Hector were not sufficiently fine to reflect the corpus contexts:

accident: add an extra sense or sub-sense between sense 1 (crash: "an unfortunate or disastrous incident not caused deliberately; a mishap causing injury or damage . . . ") and sense 2 (chance: ". . . something that happens without apparent or deliberate cause; a chance event or set of circumstances.") to cover broken windows and spilt coffee (rather than car crashes or nuclear meltdowns), characterised by contexts such as have an accident, it was an accident, etc.

accident: the expression 'accident and emergency' (used to denote a medical speciality and hospital department) should be treated as a separate sense.

generous: sense 3 kind (for definition, see Appendix to Kilgarriff and Rosenzweig, this volume) is really two different senses; the definition is in two halves ("recognizing positive aspects" and "favouring recipient rather than giver"); if subdivided, the second definition could then be expanded to cover generous conditions of employment, generous odds, etc.


hurdle: add a sub-sense, 'threshold, qualifying standard', to sense 5 (obstacle: "(in metaphorical use) an obstacle or difficulty") for contexts like the 5% hurdle in elections, the quality hurdle for TV franchises.

knee: add a sub-sense to sense 1 (patella: "the joint between thigh and lower leg") for 'marker of height on the body' (cf. the Hector dictionary's 4th and 5th examples . . . any hemline above the knee . . . and . . . you'd be up to your knees in a bog.)

shake: the physical, literal sense of 'shake someone off' or 'shake someone's hand/arm off' is missing in Hector, but present in the corpus lines.

slight: important to split sense 1 (tiny: "very small in amount, quantity, or degree") to distinguish examples with negative force (mainly predicative) from those with positive/neutral force (mainly attributive); (cf. 'little' and 'a little' etc.).

Very few comments suggested that the Hector senses were too finely distinguished:

generous: often difficult from the context to decide between sense 3 (kind) and sense 1 (unstint) (for definitions, see Appendix to Kilgarriff and Rosenzweig, this volume), so create an umbrella sense covering 'a person or their character'.

4.2. GRAMMAR

The Hector dictionary aimed to place semantics first, with syntax merely acting in a secondary, supporting role. This meant that syntactic coding could not be taken as definitive. Also, the coding did not distinguish obligatory from optional syntactic features. The lexicographers certainly noticed many instances of corpus contexts which matched a Hector sense in terms of meaning, but did not match the sense's syntactic specification.

band: senses 1 (mus: "a group of musicians") and 2 (group: "a group of people with a common interest, often of a criminal nature") are labelled as nc (countable noun) in Hector, but need to be additionally labelled as collective nouns, because they can be used with a plural verb.

behaviour: sense 1 (socialn: "the way in which one conducts oneself") is marked nu (uncountable noun), but there are several nc instances.

consume: 'consumed with' is not covered in the syntax or examples, yet is common.

invade: senses 2 (takeover: "(of persons, less commonly of animals/things) to make an incursion into an area, etc"), 2.1 (infest: "(of parasite/disease) to infest an organism"), and 2.2 (habitat: "(of animal/plant) to spread into a new area/habitat") are only marked vt (transitive verb) and need to be additionally marked vi (intransitive verb).

sanction: sense 2 (penalty: "a penalty for not complying with a rule, a coercive/punitive measure") is marked nc, but there are nu contexts.

4.3. MULTI-WORD ITEMS

One problem raised was how to code an item when an n-mod (noun used as modifier) sense was specified in the dictionary and the item was part of a compound, in cases where the whole compound was modifying, rather than the headword on its own (e.g. ACCIDENT in personal accident insurance). The variability of phrases was also a matter of concern:

bother: if you can be bothered and couldn't be bothered didn't exactly match the phrase "can't be bothered" in the entry for bother, yet they were clearly closely related to it.

4.4. DEFINITIONS

One consistent difficulty was with the distinctions made between animate and inanimate entities.

behaviour: a lot of contexts show institutions (e.g. banks and unions) acting as conscious entities.

The use of near-synonyms in separate definitions also caused problems. In the entry for bitter, sense 2 (feelings) says "(of people, their remarks, or feelings)", and sense 4 (unpleasant) says "(of an experience, situation, event, emotion, etc.)". The difference between 'feelings' and 'emotion' was difficult to resolve.

4.5. EXAMPLES

Occasional criticisms were made of the examples given in the Hector dictionary:

shake: They were badly shaken by the affair was tagged by the lexicographers as the verb sense disturb ("(of a person, event, phenomenon, etc) to disturb, disconcert, or upset the equilibrium (of a society, group, person)") or as the adjective sense troubled ("(of a person) severely upset or shocked, as after an accident, bad news, etc"). The distinction is not clear in the Hector examples.


5. Lexicographers' Observations on the Corpus Contexts

Once the human taggers had established a working procedure, familiarized themselves with the various aspects of the Hector dictionary outlined above, received their individual assigned words and digested the sense definitions available to them, they then turned to the corpus contexts for each word. Although the majority of contexts were clear and simple to tag, the taggers encountered a number of difficulties.

5.1. INSUFFICIENT CONTEXT

Some contexts, particularly the more literary or specialised, were too brief for a sense to be assigned. Others were either too vague or the dictionary sense distinction didn't help. For example, in bet (n), the senses are either wager:n (an arrangement between two or more people whereby each risks a sum of money or property on the outcome of a race, competition, game or other unpredictable event), or speculation:n (an act of speculation or an opinion about an outcome or future situation). These two have the same syntactic information and, semantically, only differ regarding whether money or property is involved. There were at least seven contexts where it was not clear whether money or a simple statement of certainty was involved, so the tagger could not know which of the two possible senses to assign. For example:

700235 Opinions are opinions, of course, but when they are so uniform and consistent (particularly about a polling result which can be interpreted completely differently), we readers have to ask whether you might collectively be trying to tell us something? TODAY a contest will begin that may finally settle a BET made 21 years ago.

700296 Temple Cowley Pool: No, I have not lost my BET!

Some contexts simply made no sense at all to the tagger, or at least left the taggers with a feeling that there was a large gap in their world knowledge, or a sense missing in the dictionary of which they themselves were unaware:2

700004 In fact it is not all that obvious, and I did take the precaution of simulating it on a computer to check that intuition was right. Grudger does indeed turn out to be an evolutionarily stable strategy against sucker and cheat, in the sense that, in a population consisting largely of grudgers, neither cheat nor sucker will INVADE.

700007 The locally stable strategy in any particular part of the trench lines was not necessarily Tit for Tat itself. Tit for Tat is one of a family of nice, retaliatory but forgiving strategies, all of which are, if not technically stable, at least difficult to INVADE once they arise.

5.2. ENCYCLOPAEDIC OR 'REAL-WORLD' KNOWLEDGE

A broad bank of encyclopaedic or real-world knowledge and the ability to make assumptions and leaps of logic were a distinct advantage. The taggers could draw on their own experience of the world when assigning senses to contexts. This advantage was very much in evidence in the tagging of band, where tagging would have been difficult, if not impossible in some cases, if the tagger had not known, for example, that 'Top of The Pops' is a popular music programme on British TV:

700277 'It'd be something they remembered.' 'It's good to see real BANDS on Top Of The Pops,' adds John.

Without this knowledge, a tagger could, potentially, based on syntactic information alone, select any of nine noun senses. Likewise, if the tagger didn't know or couldn't guess who The Stones, The Beatles, The Smiths, Hue And Cry or The Who were, they could justifiably assume that they were simply a gang of people: 'a group of people who have a common interest or object, often of a criminal nature':

700231 WHILE The Stones appealed to the students, and the Beatles to the girls and their mums, The Who were always the lads' BAND.

700284 The Smiths, yeah, they are a thinking man's BAND.

700087 Scots music is all about the voice and the person – That's why Country, folk-rock and older, more emotional forms are so dominant. And, for better or worse, Hue And Cry are a Glasgow BAND."

Equally, in the shake context below:

700390 "She believed she was not a lover of women because there was no genital contact." Three weeks before Beauvoir's death, Dr Bair was still SHAKING her.

The human taggers' deductive abilities were clear in their choice of disturb:v (to disturb, disconcert or upset the equilibrium of a society, group or person) over move:v (to move someone or something forcefully or quickly up and down or to and fro).

5.3. TAGGERS' WORLD VIEW OR PERSONAL BIAS

How a line was tagged sometimes depended on the tagger's individual 'view of the world'. In the shake context below, tags varied depending on whether it was thought that a ghost was a person ("shake off a pursuer") or a thing ("shake off a bad memory").

700176 A curious combination of two basses, fiddle and accordion meeting the Guardian Women's page. Crawling out from the wreckage of The Cateran, the Joyriders feature two ex-members, Murdo MacLeod and Kai Davidson, plus one tall American named Rick on drums. It takes four songs for them to SHAKE off their own ghost, but halfway through the aptly named Long Gone it disappears.

Similarly, at rabbit, there were several contexts containing references to 'Roger Rabbit' and 'Peter Rabbit', and tagging varied depending on whether the tagger saw them as toys or animals or neither (in addition, in each case, to them being proper names).3

700090 Beatrix Potter's Peter RABBIT is one of Japan's most famous characters: he is often the first Englishman encountered by young readers, and the miniature quality of Potter's stories and illustrations strikes some deep chord in the Japanese heart.

700240 The sight of Peter RABBIT hanging up in an old-fashioned butcher's window brings tears to our eyes, while pretty pink portions prepared and hacked by the supermarket cause no such qualms.

Similarly, when tagging onion, the tagger was faced with a choice between two senses: veg:n "the pungent bulb of a plant . . . , widely used in cooking"; and plant:n "the plant that produces onions". But the matter of when an onion is a vegetable and when it is a plant is a difficult question. For example, when you 'plant the onions', are you putting the bulb (veg:n) in the ground or creating a potential onion plant (plant:n)? And when you 'harvest the onions' or 'lift the onions out of the soil', are they vegetables or still plants? Since the sense boundaries were blurred, it was necessary to develop a policy, and one tagger decided to select plant:n when the onions were still in the soil, had foliage, were being grown, harvested, watered etc., and veg:n when they were being peeled, cooked, sliced etc. However, if I say 'I enjoy growing onions', I surely mean the vegetables not the plants. It seemed that which senses were assigned to the contexts depended on the tagger's personal understanding of when an onion was an onion, and while each tagger developed a policy for their decision-making and could defend their choices, they were keenly aware that another tagger, particularly one who was a keen gardener or cook, could have a different view that was equally defensible.


700095 Lift the ONIONS carefully with a fork and lay them out in a sunny place for a few days for them to dry.

700135 By August, the foliage will begin to topple and go yellow. Ease a fork under each ONION to break the roots and leave them on top of the soil to ripen in the sun.

700028 Wire mesh, or Netlon stretched from twigs, will also protect the sets from birds and cats. Weed regularly and water thoroughly in dry weather. Your ONIONS will be ready to harvest in late July or August when the foliage dies and begins to flop over.

The reportedly personal and largely ad hoc nature of taggers' strategies for coping with lexical ambiguity in such cases did not, however, prevent a high level of inter-tagger agreement.

5.4. NON-STANDARD USES OF LANGUAGE

Just as people do not always follow the rules of grammar and syntax, they also use the semantic aspects of language imaginatively and creatively. Beyond the inclusion of recognised figurative sense extensions in the dictionary, there is little provision for this unpredictable aspect of language use.

5.4.1. Hyperbole, Metaphor and Anthropomorphism

A problem frequently presented itself when inanimate objects were given human characteristics or emotions.

700254 Then Olybrius' fury flared and even the ground SHOOK in fear.

Only humans or animals can shake with fear, and this is made explicit in the dictionary sense: tremble:v "(especially of people or their limbs) to tremble or quiver, especially through fear, illness, shock, or strong emotion". This deliberate contravention of selectional preferences is used by the author for hyperbolic or humorous effect. This is by no means an uncommon phenomenon in language. While lexicographers attempt to set down generalisations about syntactic or semantic behaviour, identifying constraints and organising linguistic features into manageable categories, language users continue to subvert language for their own ends, be they emphatic, comic, or ironic, or simply because they can. The human taggers were faced with a choice between tremble:v and move:v "to move (someone or something) forcefully or quickly up and down or to and fro". The sense move:v would certainly cover the ground shook, but since 'fear' is the asserted cause of the shaking and is normally restricted to animate objects, it is clear that this is a figurative use and that what is implied at a deeper semantic level is tremble:v. Should the taggers ignore both what they know to be possible in reality and the semantic features set down in the dictionary entry for tremble:v, or ignore the poetic aspect ('in fear') of the context itself and tag it at the literal level? No policy was developed to deal with such cases and the decision was left to the individual taggers. They were also instructed not to confer with each other. The taggers differed in their choices.
A similar case is seen in the use of metaphor in the following consume context:

700063 Apart from the obvious advantage of quieter touring brought by the fifth ratio, the five-speed 'box also seems to have done the SL's fuel consumption no small favour. Overall its exclusively lead-free diet was CONSUMED at the rate of 10.2mpg, even with a thirsty catalyst as standard.

Here the author has deliberately taken advantage of the ambiguity between the concrete eat:v "(of a person or animal) to eat, drink or ingest (food, drink or other substances)" sense and the more figurative resource:v "(of people, societies, project, machines, etc) to use up (time, money, fuel or other resources)". An engine is described as 'consuming' a 'diet' of unleaded petrol and having a 'thirsty' catalyst. The language characteristic of the eat:v sense is used to anthropomorphise the engine, but the meaning is the resource:v sense. The human tagger, whilst aware that the context operates on two semantic levels, must choose between the two senses, though neither fully captures what is essentially a concatenation of two senses. Should the tagger assign it a sense according to the language of the imagery or according to the underlying sense? Dictionaries do not allow for metaphor. This dilemma is echoed in the context below:

700160 The production will be a flop. In the past couple of years the opposition parties have become skilled at being anti-Thatcherite, CONSUMING rich pickings from the slow collapse of Thatcherism.

The imagery is of vultures dining on a carcass, but the actual reference is to political advantages, resources, benefits etc. A perfect example of an extended metaphor which leaves a human tagger wondering whether to tag the literal use or the actual metaphorical sense is shown in the context below:

700171 What was designed by Mrs Thatcher as a Conservative flagship has become, in the words of John Biffen, the Tories' Titanic. Meanwhile, back on the bridge, a new tremor has SHAKEN the ship with news of a Treasury instruction that low-spending councils must be ready to bail out the high-spenders to reduce the impact.


On the literal level, a tremor has 'shaken' a ship. But the tremor is a metaphor for the bad news, and the ship is a metaphor for a human institution. Literally, the sense used is move:v, but metaphorically, it is disturb:v.

5.4.2. Literary References and Idioms

In the corpus contexts for bury, there were three examples of variation on the well-known quotation from Shakespeare's 'Julius Caesar' – 'I come to bury Caesar not to praise him'. In fact, all three instances take the original idiom and capitalize on the ambiguity between the inter:v sense (the original sense intended in the play) and the defeat:v sense.

bury 1 [inter] [vt; often pass] (of a person or group) to deposit (a corpse or other remains of the dead) in a permanent resting place such as a grave, a tomb, the sea, etc., usually with funeral rites.

bury 6.1 [defeat] [vt] to overwhelm (an opponent) totally or beyond hope of recovery.

700107 Gift's performance will either strike a blow for that much-maligned species, the rock star turned serious actor, or reinforce the opinion that such forays are ego-fueled flights of fancy. No doubt Roland Gift's Shakesperian debut will be attended by critics who will have come not to praise but to BURY him.

700132 It will be her 111th singles match at Wimbledon, one more than Billie Jean King. She has contested 10 finals over the past 18 years. Graf will not be there to praise the American but to BURY her, just as the 18-year-old Andrea Jaeger annihilated King in their semi-final six years ago, 6-1, 6-1 in 56 minutes.

As with the metaphorical uses described in the previous section, this use of a popular idiom can be read on two levels, the original or literal sense and the underlying extended sense. The dilemma here would at least give the human tagger cause to hesitate.

5.4.3. Zeugma

Another non-standard use of language is seen in the zeugmatic context below:

700028 Kadar's funeral is the first event to involve workers on a large scale since Mr Grosz replaced him as general-secretary 13 months ago. Mr Pal Kollat, a shipbuilder, described Kadar as an honest man and 'a leader whose lectures we could understand and whose lectures made sense'. The question now is whether the workers' respect for the party will be BURIED along with Kadar.


The author uses one verb with two nouns, but to each noun a different verb sense applies. While Kadar's burial is literal (inter:v), respect's burial is another, figurative, sense – consign:v (to consign to oblivion or obscurity; to put an end to). It certainly seemed that this context could not be assigned a single sense. This is a further example of the many ways in which language users flout the 'rules' of their language in order to take advantage of its endlessly productive potential.
The various problems encountered by the lexicographers when asked to pair the extremely diverse styles, registers, genres and subject matters covered in a large set of corpus instances with a closed set of dictionary senses are the same problems which humans encounter in their everyday communicative activity. The exercise was carried out under fairly strict time constraints and the lexicographers did not discuss their dilemmas among themselves, neither were they called upon to justify the decisions they made. Discussion of the processes by which such decisions are made is, unfortunately, beyond the scope of this paper.

6. Conclusion

It might be expected, from the extensive catalogue of problematic contexts surveyed in this paper, that the human taggers would have been permanently at odds with each other, and that very little consensus in the sense-tags would have occurred. However, in the total of 8,449 contexts tagged, the rate of agreement was over 95% in most cases (see Kilgarriff and Rosenzweig, this volume). Almost miraculously, human beings are able to navigate through the multitude of contradictory or mutually incompatible linguistic signals encoded in a text, and with only a small contextual environment as guide, to arrive at a preferred semantic interpretation that is shared by others in their language community. It remains to be seen, from the evaluation of the automatic software tagging results, to what extent the sophisticated techniques employed have managed to approximate to this most human of skills. Can a computer peel an onion?

Notes

∗ This article is based on a paper given at the SENSEVAL Workshop, Herstmonceux Castle, Sussex, England, 2–4 September 1998.
1 In addition to the authors, they were Lucy Hollingworth, Guy Jackson, Glennis Pye, and John Williams.
2 One of the referees of this paper informed us that these two examples are in fact from a game-theoretic puzzle called 'The Prisoner's Dilemma' for which suggested computational strategies were named 'grudger', 'cheat', 'sucker' etc. 'Tit for Tat' was the strategy that consistently beat all the others!
3 Fully fledged proper names, where there was no relation between any of the word's meanings and its use in the name, were removed from the set of corpus instances to be tagged. However, instances such as 'General Accident' and 'Peter Rabbit', where the word both had one of its usual meanings and was in a name, were tagged with relevant sense and P (Proper Name).


References

Atkins, S. "Tools for Computer-Aided Corpus Lexicography: The Hector Project". Acta Linguistica Hungarica, 41 (1993), 5–72.
Atkins, B.T.S. and Levin, B. "Admitting Impediments". In: Zernik, Uri (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, New Jersey, 1991, pp. 233–262.
Kilgarriff, A. "The Hard Parts of Lexicography". International Journal of Lexicography, 11(1) (1998), 51–54.
Sinclair, J.M. (ed.) Looking Up: An Account of the COBUILD Project in Lexical Computing. Collins, London, 1987.
Stock, P.F. "Polysemy". In: Hartmann, R.R.K. (ed.), LEXeter '83 Proceedings. Max Niemeyer Verlag, Tübingen, 1984, pp. 131–140.

Computers and the Humanities 34: 99–102, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Lexicography and Disambiguation: The Size of the Problem

ROSAMUND MOON
Department of English, University of Birmingham, Great Britain

This contribution is by way of a footnote to discussions of the sense disambiguation problem, and it sets out quantifications based on the Collins Cobuild English Dictionary (hereafter CCED).1 These suggest that as few as 10000–12000 items in the central vocabulary of English are polysemous: that is, potentially ambiguous at the word level. Other items have only one meaning in general or common use.
CCED contains 39851 different lemmas/headwords and headphrases.2 It is intended for advanced learners of English as a second or foreign language, and its aim is to cover the central vocabulary of English as it appears in a large corpus of current text, the Bank of English (now 329 million words, but 200 million at the time of preparing CCED). While, clearly, no dictionary is either perfectly comprehensive or perfectly accurate, and while lexicographers produce at best only approximations of semantic and lexical truth, it has been assumed in the following that the analysis and coverage of meanings in CCED, because it is corpus-based, are a reasonable representation of those words and meanings found in general current English, and that the words and meanings not included in CCED are likely to be relatively rare, technical, or specialized in other ways. In fact, the headwords in CCED seem to account for about 94% of alphabetical tokens in the Bank of English, and over half the remaining tokens in the corpus are names, variant or aberrant spellings, and hapaxes.
Table I gives the number of headword items in CCED with particular numbers of senses. These headwords represent lemmas: CCED does not normally separate homographic forms into separate headword entries, even where they belong to different wordclasses or parts of speech, or where they are etymologically discrete. Accordingly, for example, nominal and verbal senses of cut or dream are treated together in single entries, as are bark 'noise made by dog' and bark 'outer covering of trees'.3 It can be seen that the majority of entries in CCED have just one sense, and only 5384, or 13%, have more than two. The average number of senses per dictionary entry is 1.73. If phrasal verbs, phrases, and idioms are excluded, the average number of senses per item is 1.84. Phrasal verbs each have an average of 1.62 senses.


Table I. Headwords and senses in CCED.

  Number of senses    Number of headwords    Proportion of headwords
  1                   27600                  69.26%
  2                    6867                  17.23%
  3                    2323                   5.83%
  4                    1103                   2.77%
  5                     591                   1.48%
  6–9                   912                   2.29%
  10–14                 289                   0.73%
  15–19                  96                   0.24%
  20/over                70                   0.18%

Phrases and idioms are generally not polysemous: 89% of those items included in CCED have only one sense, and the average is 1.06 senses.4 Many of the entries with two or more senses operate in two different wordclasses, and so are polyfunctional rather than, or as well as, polysemous: this means that the disambiguation process is simplified. If syntax as well as form is taken into account, a more refined assessment is possible.

The next set of figures is based on a division of CCED entries according to wordclass, with nominal, verbal, adjectival, adverbial, phrasal, and other (miscellaneous) senses separated out. Closed-set items – determiners, prepositions, and conjunctions, such as a, about, across, all, and – have been excluded here for the sake of simplicity, and because the nature of the distinctions between their different 'senses' is generally syntactic, functional, or discoursal, rather than semantic: they are therefore not the primary targets for sense disambiguation work. This division of CCED entries produces a new total of 49420 items, of which about 25% are polysemous. There are now in absolute terms more two- and three-sense items, since many of the heavily polysemous headwords in CCED have at least two senses in two or more wordclasses, but the average number of senses per item has fallen slightly to 1.69. Table II gives the profile for the whole set and for the specified wordclasses (the numbers of nouns etc. with two or more senses).

About 14% of the two-sense nouns can be disambiguated syntactically, through countability differences between senses, or formally, because one sense is capitalized: that is, pluralizability, determiner concord, and spelling are distinguishing characteristics. Although polysemous verb and adjective senses can sometimes be distinguished through transitivity, gradability, and prepositional or clausal complementation, this is comparatively infrequently a simple matter of binary distinctions: collocation and valency generally seem more significant criteria for lexicographers. There is of course a close correlation between frequency and polysemy, and more heavily polysemous items are usually more frequent.


Table II. Headwords, senses, and wordclasses in CCED.

  Number of senses   Number of items   Proportion of items   Nouns   Verbs   Adjectives   Adverbs   Phrasal verbs   Phrases
  2                  7362              14.9%                 3513    1152    1616         369       357             242
  3                  2384               4.82%                1132     501     471         101       113              16
  4                   994               2.01%                 444     263     153          44        57               2
  5                   527               1.07%                 239     142      75          26        23               0
  6–9                 666               1.35%                 247     196     111          20        34               2
  10–14               163               0.33%                  39      54      28           3         2               0
  15–19                49               0.1%                    5      17      11           0         0               0
  20/over              33               0.07%                   2      17       1           0         0               0

The 455 headwords in CCED with 10 or more senses all have at least 10 tokens per million words in the Bank of English, and together they account for nearly 50% of its alphabetical tokens. Many of the 455 are closed-set items: they are generally of very high frequency, alone accounting for 40% of the corpus. Many of the rest are versatile words occurring in many different collocations and contexts: CCED has used such features as criteria in making sense distinctions, even though there may be little substantial difference in core meaning. The most heavily polysemous of these items in CCED are the nouns line, service, side, thing, time, and way; the verbs get, go, hold, make, run, and take; and the adjectives dry, heavy, light, low, open, and strong. At least some of these words are likely to have been finely split and undergeneralized in CCED in order to simplify explanations of meaning and to demonstrate typical lexicogrammatical and textual environments, for the benefit of CCED's target users. (See the paper by Krishnamurthy and Nicholls in this issue for a discussion of lexicographical procedures in relation to the Hector data.)

The above represents just one dictionary's account of polysemy and selection of headwords and meanings: other dictionaries of different sizes and types may provide different statistics. However, it may be used as a benchmark and as an indication of the extent of the task of sense disambiguation, whether manual or automatic. In this particular lexicon (approximately 40000 different entries, or 50000 if wordclass is taken into account), about 75% of items have either only one sense or only one sense per wordclass: nearly 1% more are closed-set items which do not need this kind of disambiguation. This leaves approximately 10000 of the headwords found in CCED (12000 items if separated into wordclasses) to focus on: about 7000 of these, in either case, have just two senses, and there are probably only 1000 very complex items to deal with.


Notes
1 Collins Cobuild English Dictionary (1995, 2nd edition). London & Glasgow: HarperCollins.
2 All data is copyright Cobuild and HarperCollins Publishers, and is reproduced with their permission.
3 For convenience, exact counts are given here. These represent best attempts to extract the information from dictionary files, but variability and inconsistency in coding mean that other methods of counting could lead to slightly different figures. Note that the count of headwords/headphrases corresponds to 'lemmas', not to the dictionary publishers' conventional count of 'references', of which CCED contains 75000.
4 Some of these items may be potentially ambiguous in another way, since identical strings with literal meanings can be found: for example, in hot water and sit on the fence can be used literally to denote physical location as well as idiomatically or metaphorically to denote non-physical situation or mental state. However, corpus studies suggest that this kind of ambiguity is relatively infrequent, and the institutionalization of an idiomatic meaning is typically associated with non-use of possible literal meanings.

Computers and the Humanities 34: 103–108, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Combining Supervised and Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation
E. AGIRRE1, G. RIGAU2, L. PADRÓ2 and J. ATSERIAS2
1 LSI saila, Euskal Herriko Unibertsitatea, Donostia, Basque Country; 2 Departament de LSI, Universitat Politècnica de Catalunya, Barcelona, Catalonia

Abstract. This work combines a set of available techniques – which could be further extended – to perform noun sense disambiguation. We use several unsupervised techniques (Rigau et al., 1997) that draw knowledge from a variety of sources. In addition, we apply a supervised technique to show that supervised and unsupervised methods can be combined to obtain better results. The paper shows that, with an appropriate method for combining these heuristics, words in free-running text can be disambiguated with reasonable precision.
Key words: combining knowledge sources, word sense disambiguation

1. Introduction
The methods used by our sense disambiguating system are mainly unsupervised. Nevertheless, it may incorporate supervised knowledge when available. Although fully supervised systems have been proposed (Ng, 1997), it seems impractical to rely only on these techniques, given the high human labor cost they imply.

The techniques presented in this paper were tried on the Hector corpus in the framework of the SENSEVAL competition. Since most of our techniques disambiguate using WordNet, we had to map WordNet synsets into Hector senses. Although the techniques can be applied to most parts of speech, for the time being we focused on nouns.

This paper is organized as follows: section 2 shows the methods we have applied. Section 3 deals with the lexical knowledge used and section 4 shows the results. Section 5 discusses previous work, and finally, section 6 presents some conclusions.

2. Heuristics for Word Sense Disambiguation
The methods described in this paper are to be applied in a combined way. Each one must be seen as a container of part of the knowledge needed to perform a correct sense disambiguation.


Each heuristic assigns a weight ranging in [0, 1] to each candidate sense. These votes are later joined in a final decision.

Heuristic H1 (Multi-words) is applied when the word is part of a multi-word term. In this case, the Hector sense corresponding to the multi-word term is assigned. Only H1 and H8 yield Hector senses.

Heuristic H2 (Entry Sense Ordering) assumes that senses are ordered in an entry by frequency of usage. That is, the most used and important senses are placed in the entry before less frequent or less important ones. This heuristic assigns the maximum score to the first candidate sense and linearly decreasing scores to the others. The sense ordering used is that provided by WordNet.

Heuristic H3 (Topic Domain) selects the WordNet synset belonging to the WN semantic file most frequent among the semantic files for all words in the context, in the style of Liddy and Paik (1992).

Heuristic H4 (Word Matching) is based on the hypothesis that related concepts are expressed using the same content words, counting the number of content words shared by the context and the glosses (Lesk, 1986).

Heuristic H5 (Simple Co-occurrence) uses co-occurrence data collected from a whole dictionary. Thus, given a context and a set of candidate synsets, this method selects the target synset which returns the maximum sum of pairwise co-occurrence weights between a word in the context and a word in the synset. The co-occurrence weight between two words is computed as an Association Score (Resnik, 1992).

Heuristic H6 (Co-occurrence Vectors) is based on the work by Wilks et al. (1993), who also use co-occurrence data collected from a whole dictionary. Given a context and a set of candidate synsets, this method selects the candidate which yields the highest similarity with the context. This similarity can be measured by the dot product, the cosine function or the Euclidean distance between two vectors. The vector for a context or a synset is computed by adding the co-occurrence information vectors of the words it contains. The co-occurrence information vector for a word is collected from the whole dictionary using the Association Score (see section 3).

Heuristic H7 (Conceptual Density) (Agirre and Rigau, 1996; Agirre, 1998) provides a relatedness measure among words and word senses, taking as reference a structured hierarchical net. Conceptual Density captures the closeness of a set of concepts in the hierarchy, using the relation between the weighted number of word senses and the size of the minimum subtree covering all word senses. Given the target word and the nouns in the surrounding context, the algorithm chooses the sense of the target word which lies in the sub-hierarchy with the highest Conceptual Density, i.e., the sub-hierarchy which contains a larger number of possible senses of context words in a proportionally smaller hierarchy.

Heuristic H8 (Decision Lists). Given a training corpus where the target word has been tagged with the corresponding sense, frequencies are collected for: appearances of each word sense, bigrams of each word sense (form, lemma, and POS tag for the left and right words), trigrams of each word sense, and a window of surrounding lemmas. Frequencies are filtered, converted to association scores and organized in decreasing order as decision lists. At test time, the features found in the context are used to select the word sense, going through the decision list until a matching feature is found (Yarowsky, 1994). As the training corpus is tagged with Hector senses, this heuristic also outputs Hector senses.

Combination. Finally, the ensemble of the heuristics is also taken into account. The way to combine all the heuristics in a single decision is simple. The weights assigned to the competing senses by each heuristic are normalized by dividing them by the highest weight. The votes collected from each heuristic are then added up for each competing sense.
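A minimal sketch of this combination scheme, assuming each heuristic returns a mapping from candidate senses to weights in [0, 1]; the function and data names below are illustrative, not the authors' implementation.

```python
# Sketch of the vote-combination step described above: each heuristic's weights
# are normalized by its highest weight, then summed per candidate sense.

def combine_heuristics(votes_per_heuristic):
    totals = {}
    for votes in votes_per_heuristic:
        if not votes:
            continue  # this heuristic abstains for the target word
        top = max(votes.values())
        if top == 0:
            continue
        for sense, weight in votes.items():
            totals[sense] = totals.get(sense, 0.0) + weight / top
    return totals

if __name__ == "__main__":
    # Two hypothetical heuristics voting over candidate senses of one noun.
    entry_ordering = {"sense_1": 1.0, "sense_2": 0.66, "sense_3": 0.33}
    conceptual_density = {"sense_1": 0.2, "sense_2": 0.9}
    scores = combine_heuristics([entry_ordering, conceptual_density])
    print(scores, "->", max(scores, key=scores.get))
```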


Table I. Words frequently co-occurring with wine

  word       AS        word       AS        word        AS
  grapes     10.5267   bottle     8.1675    eucharist   7.1267
  bottles     8.3157   Burgundy   7.2882    cider       6.9273
  bread       8.2815   drink      7.2498    Bordeaux    6.6316

3. Derived Lexical Knowledge Resources
According to Wilks et al. (1993), two words co-occur in a dictionary if they appear in the same definition. In our case, a lexicon of 500,413 content word pairs of 41,955 different word forms was derived from the Collins English Dictionary. Table I shows the words co-occurring with wine with the highest Association Scores. The lexicon produced in this way from the dictionary is used by heuristics H5 and H6.


Table II. Results obtained at each stage of development

               Unsupervised (H1 to H7)           Supervised (H1 to H8)
               July     October   November       July     October   November
  recall       38.8%    38.8%     40.4%          60.7%    63.9%     66.9%
  precision    41.6%    41.8%     43.5%          62.0%    65.3%     68.3%
  coverage     93.0%    93.0%     93.0%          98.0%    98.0%     98.0%

Table III. Overall results for isolated heuristics

               random   H1      H2      H3      H4      H5      H6      H7      H8
  recall       16.6%    5.7%    38.4%   30.1%   34.7%   27.6%   32.8%   29.5%   51.3%
  precision    16.6%    84.4%   45.4%   35.6%   41.5%   32.7%   38.8%   37.3%   71.6%
  coverage     100%     6.8%    84.5%   84.5%   84.5%   84.5%   84.5%   79.0%   71.6%

Our systems perform well in both the supervised and unsupervised categories of the SENSEVAL competition, especially considering that nearly all our techniques – except H1 and H8 – disambiguate to WordNet senses. In order to yield Hector senses, we used a mapping provided by the SENSEVAL organization. The WordNet to Hector mapping adds a substantial handicap. Concerns were raised in the SENSEVAL workshop regarding the quality (gaps in either direction, arguable mappings, etc.) of this mapping. Also, the POS tagger we used was very simple.

5. Comparison with Previous Work
Several approaches have been proposed for attaching the correct sense to a word in context. Some of them are only models for simple systems, such as connectionist methods (Cottrell and Small, 1992) or Bayesian networks (Eizirik et al., 1993), while others have been fully tested on real-size texts, like statistical methods (Yarowsky, 1992; Yarowsky, 1994; Miller et al., 1994), knowledge-based methods (Sussna, 1993; Agirre and Rigau, 1996), or mixed methods (Richardson, 1994; Resnik, 1995). Reported WSD performance is now high, although usually only small sets of words with clear sense distinctions are selected for disambiguation. For instance, Yarowsky (1995) reports a success rate of 96% disambiguating twelve words with two clear sense distinctions each, and Wilks et al. (1993) report a success rate of 45% disambiguating the word bank (thirteen senses from the Longman Dictionary of Contemporary English) using a technique similar to heuristic H6.


This paper has presented a general technique for WSD which is a combination of statistical and knowledge-based methods, and which has been applied to disambiguate all nouns in free-running text.

6. Conclusions and Further Work
Our system disambiguates to WordNet 1.6 senses (the only exceptions being heuristics H1 and H8). In order to yield Hector senses, the results were automatically converted using a mapping provided by the SENSEVAL organization. It is clear that precision is reduced if a sense mapping is used.

We have shown that the ensemble of heuristics is a useful way to combine knowledge from several lexical knowledge methods, outperforming each technique in isolation (in coverage and/or precision). Better results can be expected from adding new heuristics with different methodologies and different knowledge sources (e.g., from corpora). More sophisticated methods to weight the contribution of each heuristic should also improve the results. Another possible improvement – after Wilks and Stevenson (1998) – would be to use a supervised learning process to establish the best policy for combining the heuristics.

In order to get a fair evaluation, we plan to test our system on a corpus tagged with WordNet senses, such as SemCor. We believe that an all-word task provides a more realistic setting for evaluation. If we want to get an idea of the performance that can be expected from a running system, we cannot depend on the availability of training data for all content words.

References
Agirre, E. and G. Rigau. "Word Sense Disambiguation Using Conceptual Density". In Proceedings of COLING'96. Copenhagen, Denmark, 1996.
Agirre, E. Formalization of Concept-Relatedness Using Ontologies: Conceptual Density. Ph.D. thesis, LSI saila, University of the Basque Country, 1998.
Cottrell, G. and S. Small. "A Connectionist Scheme for Modeling Word Sense Disambiguation". Cognition and Brain Theory, 6(1) (1992), 89–120.
Eizirik, L., V. Barbosa and S. Mendes. "A Bayesian-Network Approach to Lexical Disambiguation". Cognitive Science, 17 (1993), 257–283.
Lesk, M. "Automatic Sense Disambiguation: How to Tell a Pine Cone from an Ice Cream Cone". In Proceedings of the SIGDOC'86 Conference, ACM, 1986.
Liddy, E. and W. Paik. "Statistically-Guided Word Sense Disambiguation". In AAAI Fall Symposium on Statistically Based NLP Techniques, 1992.
Miller, G., M. Chodorow, S. Landes, C. Leacock and R. Thomas. "Using a Semantic Concordance for Sense Identification". In Proceedings of the ARPA Workshop on Human Language Technology, 1994.
Ng, H.T. "Getting Serious about Word Sense Disambiguation". In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How? Washington DC, USA, 1997.
Resnik, P. "WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery". In AAAI Spring Symposium on Statistically Based NLP Techniques, 1992.


Resnik, P. "Disambiguating Noun Groupings with Respect to WordNet Senses". In Proceedings of the Third Workshop on Very Large Corpora. MIT, 1995.
Richardson, R., A.F. Smeaton and J. Murphy. Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. Working Paper CA-1294, School of Computer Applications, Dublin City University, 1994.
Rigau, G., J. Atserias and E. Agirre. "Combining Unsupervised Lexical Knowledge Methods for WSD". In Proceedings of joint ACL-EACL'97. Madrid, Spain, 1997.
Sussna, M. "Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network". In Proceedings of the Second International Conference on Information and Knowledge Management. Arlington, Virginia, USA, 1993.
Wilks, Y., D. Fass, C. Guo, J. McDonald, T. Plate and B. Slator. "Providing Machine Tractable Dictionary Tools". In Semantics and the Lexicon. Ed. J. Pustejovsky, Kluwer Academic Publishers, 1993, pp. 341–401.
Wilks, Y. and M. Stevenson. "Word Sense Disambiguation Using Optimized Combinations of Knowledge Sources". In Proceedings of joint COLING-ACL'98. Montreal, Canada, 1998.
Yarowsky, D. "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora". In Proceedings of COLING'92. Nantes, France, 1992, pp. 454–460.
Yarowsky, D. "Decision Lists for Lexical Ambiguity Resolution". In Proceedings of ACL'94. Las Cruces, New Mexico, 1994.
Yarowsky, D. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods". In Proceedings of ACL'95. Cambridge, Massachusetts, 1995.

Computers and the Humanities 34: 109–114, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Word Sense Disambiguation Using Automatically Acquired Verbal Preferences
JOHN CARROLL and DIANA McCARTHY
Cognitive & Computing Sciences, University of Sussex, 78 Surrenden Park, Brighton BN1 6XA, UK (E-mail: {johnca;dianam}@cogs.susx.ac.uk)

Abstract. The selectional preferences of verbal predicates are an important component of a computational lexicon. They have frequently been cited as being useful for WSD, alongside other sources of knowledge. We evaluate automatically acquired selectional preferences on the level playing field provided by SENSEVAL to examine to what extent they help in WSD.
Key words: selectional preferences
Abbreviations: WSD – word sense disambiguation; ATCM – Association Tree Cut Model; POS – part-of-speech; SCF – subcategorization frame

1. Introduction
Selectional preferences have frequently been cited as a useful source of information for WSD. It has, however, been noted that their use is limited (Resnik, 1997) and that additional sources of knowledge are required for full and accurate WSD. This paper outlines the use of automatically acquired preferences for WSD, and an evaluation of them at the SENSEVAL workshop. The preferences are automatically acquired from raw text using the system described in sections 2.1–2.3. The target data is disambiguated as described in section 2.4.

1.1. SCOPE

The preferences are obtained for the argument slots of verbal predicates where those slots involve noun phrases, i.e. subject, direct object and prepositional phrases. Preferences were not obtained in this instance for indirect objects, since these are less common. The system has not at this stage been adapted for other relationships. For this reason, disambiguation was only attempted on nouns occurring as argument heads in these slot positions. Moreover, preferences are only obtained where there is sufficient training data for the verb (using a threshold of 10 instances). Disambiguation only takes place where the preferences are strong enough (above a threshold on the score representing preference strength) and where the preferences can discriminate between the senses.


Figure 1. System Overview

Proper nouns were neither used nor disambiguated. Some minor identification of multi-word expressions was performed, since these items are easy to disambiguate and we would not want to use the preferences in these cases.

2. System Description
The system for acquisition is depicted in figure 1. Raw text is tagged and lemmatised and fed into the shallow parser. The output from this is then fed into the SCF acquisition system, which produces argument head data alongside the SCF entries. From this, argument head tuples consisting of the slot, verb (and preposition for prepositional phrase slots) and noun are fed to the preference acquisition module. To obtain the selectional preferences, 10.8 million words of parsed text from the BNC were used as training data. Some rudimentary WSD is performed on the nouns before preference acquisition. The selectional preference acquisition system then produces preferences for each verb and slot. These preferences are disjoint sets of WordNet (Miller et al., 1993a) noun classes, covering all WordNet nouns, with a preference score attached to each class. The parser is then used on the target data and disambiguation is performed on target instances in argument head position. All these components are described in more detail below.

2.1. SHALLOW PARSER AND SCF ACQUISITION
The shallow parser takes text (re-)tagged by an HMM tagger (Elworthy, 1994) using the CLAWS-2 tagset (Garside et al., 1987) and lemmatised with an enhanced version of the GATE system morphological analyser (Cunningham et al., 1995). The shallow parser and SCF acquisition are described in detail by Briscoe and Carroll (1997); briefly, the POS tag sequences are analysed by a definite clause grammar over POS and punctuation labels, the most plausible syntactic analysis (with respect to a training treebank derived from the SUSANNE corpus (Sampson, 1995)) being returned.


Subject and (nominal and prepositional) complement heads of verbal predicates are then extracted from successful parses, and from parse failures sets of possible heads are extracted from any partial constituents found.

2.2. WSD OF THE ARGUMENT HEAD DATA
WSD of the input data seems to help preference acquisition itself (Ribas, 1995b; McCarthy, 1997). We use a cheap and simple method using frequency data from the SemCor project (Miller et al., 1993b). The first sense of a word is selected provided that (a) the sense has been seen more than three times, (b) the predominant sense is seen more than twice as often as the second sense and (c) the noun is not one of those identified as 'difficult' by the human taggers.
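A minimal sketch of this first-sense filter, assuming per-noun SemCor sense frequencies (in descending order) and a set of 'difficult' nouns are available; the data structures and names below are illustrative, not the authors' code.

```python
# Illustrative sketch of the SemCor-based first-sense heuristic described above.

def first_sense_or_none(noun, sense_counts, difficult_nouns):
    counts = sense_counts.get(noun, [])
    if noun in difficult_nouns or not counts:
        return None
    first = counts[0]
    second = counts[1] if len(counts) > 1 else 0
    # (a) seen more than three times; (b) more than twice as often as the runner-up
    if first > 3 and first > 2 * second:
        return 0  # index of the predominant sense
    return None

# Toy data: a strongly skewed noun passes, a balanced one does not.
print(first_sense_or_none("bank", {"bank": [78, 12, 5]}, set()))      # -> 0
print(first_sense_or_none("interest", {"interest": [10, 9]}, set()))  # -> None
```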

2.3. SELECTIONAL PREFERENCE ACQUISITION
The preferences are acquired using Abe and Li's method (Abe and Li, 1996) for obtaining preferences as sets of disjoint classes across the WordNet noun hypernym hierarchy. These classes are each assigned 'association scores' which indicate the degree of preference between the verb and class given the specified slot. The ATCM is collectively the set of classes with association scores provided for a verb. The association scores are given by p(c|v)/p(c), where c is the class and v the verb. A small portion of an ATCM for the direct object slot of eat is depicted in figure 2. The verb forms are not disambiguated. The ambiguity of a verb form is reflected in the preferences given on the ATCM. The models are produced using the Minimum Description Length principle (Rissanen, 1978). This makes a compromise between a simple model and one which describes the data efficiently. To obtain the models, the hypernym hierarchy is populated with frequency information from the data and the estimated probabilities are used for the calculations that compare the cost (in bits) of the model and the data when encoded in the model.

2.4. WORD SENSE DISAMBIGUATION USING SELECTIONAL PREFERENCES
WSD using the ATCMs simply selects all senses for a noun that fall under the cut node with the highest association score among those covering senses of this word. For example, the sense of chicken under FOOD would be preferred over the senses under LIFE FORM when occurring as the direct object of eat. The granularity of the WSD depends on how specific the cut is. Target instances are disambiguated to a WordNet sense level. Each WordNet sense was mapped to the Hector senses required for SENSEVAL, using the mapping provided by the organisers.
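The following sketch illustrates this selection step under simplified assumptions: the cut is given as a list of (class, association score, member senses), standing in for the WordNet classes and ATCM scores; the names and data are hypothetical.

```python
# Sketch of sense selection from a tree cut model, as described above.

def select_senses(noun_senses, cut):
    """Return the noun's senses under the best-scoring cut class that covers any of them."""
    best = None
    for class_name, score, class_senses in cut:
        covered = [s for s in noun_senses if s in class_senses]
        if covered and (best is None or score > best[0]):
            best = (score, covered)
    return best[1] if best else []

# Toy example for 'chicken' as the direct object of 'eat'.
cut_for_eat_dobj = [
    ("FOOD",      2.1, {"chicken%meat", "bread%1"}),
    ("LIFE_FORM", 0.3, {"chicken%bird", "dog%1"}),
]
print(select_senses({"chicken%meat", "chicken%bird"}, cut_for_eat_dobj))
# -> ['chicken%meat']
```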


Figure 2. ATCM for eat Direct Object

3. Results
The preferences were only applied to nouns. For the all-nouns task, fine-grained precision is 40.8% and recall 12.5%. The low recall is to be expected, since many of the test items occur outside the argument head positions that we use. Coarse-grained precision is 56.2% and recall 17.2%. Performance is better when we look at the items which do not need disambiguation for POS. For these, coarse-grained precision is 69.4% and recall 20.2%.

An important advantage of our approach is that our preferences do not require sense-tagged data and so can perform the untrainable-nouns task. On the fine-grained untrainable-nouns task our system obtains 69.1% precision and 20.5% recall.

3.1. SOURCES OF ERROR
1. POS errors – These affect the parser. POS errors also contribute to the errors on the all-nouns task, where many of the items require POS disambiguation. 30% of the errors for shake were due to POS errors.
2. Parser errors – Preference acquisition in the training phase is subject to parser errors in identifying SCFs, although some of these are filtered out as 'noise'. Errors in parsing the target data are more serious, since they might result in heads being identified incorrectly. Lack of coverage is also a problem: only 59% of the sentences in the target data were parsed successfully. Empirically, the grammar covers around 70–80% of general corpus text (Carroll and Briscoe, 1996), but the current disambiguation component appears to be rather inefficient, since 15% of sentences fail due to being timed out. Data from parse failures is of lower quality, since sets of possible heads are returned for each predicate, rather than just a single head.


3. Multi-word expression identification – Many of the multi-word expressions were not detected, due to easily correctable errors. This resulted in the preferences being applied where they were inappropriate.
4. Mapping errors – Errors arising from the mapping between WordNet and Hector.
5. Thresholding – WordNet classes with a low prior probability are removed in the course of preference acquisition. Because of this, some senses are omitted from the outset.
6. Preference errors – Other contextual factors should be taken into consideration as well as preferences.
Our system performs comparably (in terms of precision and recall) to other systems using verbal preferences alone.

4. Discussion
The results from SENSEVAL indicate that selectional preferences are not a panacea for WSD. A fully fledged system needs other knowledge sources. We contend that selectional preferences can help in situations where there are no other salient cues and the preference of the predicate for the argument is sufficiently strong. One advantage of automatically acquired selectional preferences is that they do not require supervised training data. Although our system does use sense ranking from SemCor when acquiring the preferences, it can be used without this. Another advantage is that domain-specific preferences can be acquired without any manual intervention if further text of the same type as the target text is available.

SENSEVAL has allowed different WSD strategies to be compared on a level playing field. What is now needed is further comparative work to see the relative strengths and weaknesses of different approaches and to identify when and how complementary knowledge sources can be combined.

Acknowledgements
This work was supported by CEC Telematics Applications Programme project LE1-2111 "SPARKLE: Shallow PARsing and Knowledge extraction for Language Engineering" and by a UK EPSRC Advanced Fellowship to the first author.

References
Abe, N. and H. Li. "Learning Word Association Norms Using Tree Cut Pair Models". In: Proceedings of the 13th International Conference on Machine Learning (ICML). 1996, pp. 3–11.
Briscoe, T. and J. Carroll. "Automatic Extraction of Subcategorization from Corpora". In: Fifth Applied Natural Language Processing Conference. 1997, pp. 356–363.
Carroll, J. and E. Briscoe. "Apportioning development effort in a probabilistic LR parsing system through evaluation". In: Proceedings of the 1st ACL SIGDAT Conference on Empirical Methods in Natural Language Processing. 1996, pp. 92–100.
Cunningham, H., R. Gaizauskas and Y. Wilks. "A general architecture for text engineering (GATE) – a new approach to language R&D". Technical Report CS-95-21, University of Sheffield, UK, Department of Computer Science. 1995.


Elworthy, D. "Does Baum-Welch re-estimation help taggers?". In: 4th ACL Conference on Applied Natural Language Processing. 1994, pp. 53–58.
Garside, R., G. Leech and G. Sampson. The computational analysis of English: A corpus-based approach. Longman, London. 1987.
McCarthy, D. "Word Sense Disambiguation for Acquisition of Selectional Preferences". In: Proceedings of the ACL/EACL 97 Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. 1997, pp. 52–61.
Miller, G. A., C. Leacock, R. Tengi and R. T. Bunker. "A semantic concordance". In: Proceedings of the ARPA Workshop on Human Language Technology. 1993a, pp. 303–308.
Miller, G., R. Beckwith, C. Fellbaum, D. Gross and K. Miller. "Introduction to WordNet: An On-Line Lexical Database". ftp://clarity.princeton.edu/pub/WordNet/5papers.ps. 1993b.
Resnik, P. "Selectional Preference and Sense Disambiguation". In: Proceedings of the Workshop on Tagging Text with Lexical Semantics: Why, What and How? 1997, pp. 52–57.
Ribas, F. "On Acquiring Appropriate Selectional Restrictions from Corpora Using a Semantic Taxonomy". Ph.D. thesis, University of Catalonia. 1995.
Rissanen, J. "Modeling by Shortest Data Description". Automatica 14 (1978), 465–471.
Sampson, G. English for the computer. Oxford University Press. 1995.

Computers and the Humanities 34: 115–120, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


A Topical/Local Classifier for Word Sense Identification
MARTIN CHODOROW1, CLAUDIA LEACOCK2 and GEORGE A. MILLER3
1 Department of Psychology, Hunter College of CUNY, 695 Park Avenue, New York, NY 10021, USA (E-mail: [email protected]); 2 Department of Cognitive and Instructional Science, Educational Testing Service, Princeton, NJ 08541, USA (E-mail: [email protected]); 3 Cognitive Science Laboratory, Princeton University, 221 Nassau Street, Princeton, NJ 08542, USA (E-mail: [email protected])

Abstract. TLC is a supervised training (S) system that uses a Bayesian statistical model and features of a word's context to identify word sense. We describe the classifier's operation and how it can be configured to use only topical context cues, only local cues, or a combination of both. Our results on Senseval's final run are presented along with a comparison to the performance of the best S system and the average for S systems. We discuss ways to improve TLC by enriching its feature set and by substituting other decision procedures for the Bayesian model. Future development of supervised training classifiers will depend on the availability of tagged training data. TLC can assist in the hand-tagging effort by helping human taggers locate infrequent senses of polysemous words.
Key words: disambiguation, Senseval, Bayesian classifier

1. Introduction
Our goal in developing TLC (a Topical/Local Classifier) was to produce a generic classifier for word sense disambiguation that uses publicly available resources and a standard Bayesian statistical model. We designed it to be flexible enough to incorporate topical context, local context, or a combination of the two. Topical context consists of the substantive words within the sentence. Local context consists of all words within a narrow window around the target.

The next section gives a brief description of TLC's design and describes how it was used in Senseval. (A more detailed account of TLC can be found in Leacock et al., 1998.) Section 3 focuses on our treatment of multiword expressions and proper names for Senseval. In Section 4 we discuss our Senseval results, which are presented as ets-pu in Kilgarriff and Rosenzweig (in this volume). In Section 5 we suggest some ways to improve performance. In the final section, we describe an application of TLC in manual sense tagging.


2. Overview of the Classifier
A word sense classifier can be thought of as comprising a feature set and a decision procedure. Operationally, it can be viewed as a sequence of processing stages. Here we describe TLC's operation, the features it extracts and the decision procedure it employs.

TLC's operation consists of preprocessing, training, and testing. (For Senseval, an extra preprocessing step was used to locate targets that are multiword expressions. This will be described in Section 3.) During preprocessing, example sentences are tagged with part-of-speech (Brill, 1994) and each inflected open-class word found in WordNet (Fellbaum, 1998) is replaced with its base form. These steps permit TLC to normalize across morphological variants while preserving inflectional information in the tags. Training consists simply of counting the frequencies of the various contextual features (the cues) in each sense.

When given a test sentence containing the polysemous word, TLC uses a Bayesian approach to find the sense s_i which is the most probable given the cues c_j contained in a context window of ±k positions around the polysemous target. For each s_i, the probability is computed with Bayes' rule:

  p(s_i \mid c_{-k}, \ldots, c_k) = \frac{p(c_{-k}, \ldots, c_k \mid s_i)\, p(s_i)}{p(c_{-k}, \ldots, c_k)}

Since the term p(c_{-k}, ..., c_k | s_i) is difficult to estimate because of the sparse data problem, we assume, as is often done, that the occurrence of each feature is conditionally independent of the others, so that the term can be replaced with:

  p(c_{-k}, \ldots, c_k \mid s_i) = \prod_{j=-k}^{k} p(c_j \mid s_i)

We can estimate p(c_j | s_i) from the training data, but the sparse data problem affects these probabilities too, and so TLC uses the Good-Turing formula (Good, 1953; Chiang et al., 1995) to smooth the values of p(c_j | s_i) and provide probabilities for cues that did not occur in the training. TLC actually uses the mean of the Good-Turing value and the training-derived value, an approach that has yielded consistently better performance than relying on the Good-Turing values alone.

There are four types of contextual features that TLC considers: (1) topical cues consisting of open-class words (nouns, verbs, adjectives and adverbs) found in the Senseval context; (2) local open-class words found within a narrow window around the target; (3) local closed-class items (non-open-class words, e.g., prepositions and determiners); (4) local part of speech tags. The local windows do not extend beyond a sentence boundary. Procedures for estimating p(c_j | s_i) and p(c_j) differ somewhat for the various feature types.
1. The counts for open-class words (common nouns, verbs, adjectives, and adverbs) from which the topical word probabilities are calculated are not sensitive to position anywhere within a wide window covering the entire example (the "bag of words" method). By contrast, the local cue probabilities do take relative position into account.
2. For open-class words found in the three positions to the left of the target (i.e., j = –3, –2, –1), p(c_j | s_i) is the probability that word c_j appears in any of these positions. This permits TLC to generalize over variations in the placement of premodifiers, for example. Similarly, we generalize over the three positions to the right of the target. The window size of ±3 was chosen on empirical grounds (Leacock et al., 1998).
3. Local closed-class items include determiners, prepositions, pronouns, and punctuation. For this cue type, p(c_j | s_i) is the probability that item c_j appears precisely at location j for sense s_i. Positions j = –2, –1, 1, 2 are used. The global probabilities, for example p(the_{-1}), are based on counts of closed-class items found at these positions relative to the nouns in a large textual corpus.
4. Finally, part of speech tags in the positions j = –2, –1, 0, 1, 2 are used. The probabilities for these tags are computed for specific positions (e.g., p(DT_{-1} | s_i), p(DT_{-1})) in the same way as in (3) above.

When TLC is configured to use only topical information, feature type (1) is employed. When it is configured for local information, types (2), (3), and (4) are used. Finally, in combined mode, the set of cues contains all four types. We determined which of the three configurations was best for each Senseval item by dividing the training materials into two subsets: one was used for training TLC, the remainder for evaluating the performance of each configuration. We then used the best configuration of TLC in Senseval's final run. For twenty-four of the items this was the combined classifier, for ten it was the local configuration, and for two, the topical configuration.
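A minimal sketch of the scoring step, assuming smoothed per-sense cue probabilities have already been estimated; the names, the log-space scoring, and the constant floor standing in for the Good-Turing smoothing are illustrative, not TLC's actual code.

```python
import math

# Sketch of the Naive Bayes decision described above: score each sense by its
# prior plus the sum of log smoothed cue probabilities, and return the argmax.

def most_probable_sense(cues, priors, cue_probs, unseen=1e-6):
    best_sense, best_score = None, float("-inf")
    for sense, prior in priors.items():
        score = math.log(prior)
        for cue in cues:
            score += math.log(cue_probs.get((cue, sense), unseen))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy example with two senses of "bank" and two observed cues.
priors = {"bank/finance": 0.78, "bank/river": 0.22}
cue_probs = {("loan", "bank/finance"): 0.03, ("water", "bank/river"): 0.05}
print(most_probable_sense(["loan", "water"], priors, cue_probs))
```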

3. Multiword dictionary expressions and Proper Names
During the development of TLC (Leacock et al., 1998), collocations (called multiword expressions in Senseval) were not included in the training/testing corpus – for the simple reason that collocations are usually monosemous. For example, if "rubber band" had only one sense in WordNet, the term was not included in the training or testing corpora. We emulated this filtering procedure for Senseval as follows.

When a multiword expression appeared as a head word in the Hector dictionary, we automatically generated a regular expression to match morphological and other variants, and searched the Senseval final-run corpus for the regular expression. For example, to find instances of "rubber band", we searched for "/rubber band[s]?/" in the test corpus, and assigned any matches to the "rubber band" sense of "band". TLC was not subsequently trained on that sense of band. As a result, if the regular expression match failed, test examples could not be assigned the correct sense.
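A sketch of how such filtering patterns could be generated and applied; the simple pluralisation rule and the helper names are illustrative stand-ins, not the patterns actually used.

```python
import re

# Sketch of the multiword filtering described above: build a pattern like
# /rubber band[s]?/ from a multiword headword and assign matches to its sense.

def multiword_pattern(headword):
    return re.compile(r"\b" + re.escape(headword) + r"[s]?\b", re.IGNORECASE)

def assign_multiword_sense(sentence, headword, sense_label):
    if multiword_pattern(headword).search(sentence):
        return sense_label
    return None

print(assign_multiword_sense("He snapped the rubber bands.", "rubber band", "band: rubber band"))
```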


Table I. Comparison of TLC to best and average S system performance on trainable words (fine-grained scoring).

  part of speech   TLC                    Best S System          Mean of S Systems
                   precision   recall     precision   recall     precision   recall
  All              .756        .755       .771        .771       .733        .657
  Nouns            .806        .806       .850        .850       .789        .787
  Verbs            .709        .709       .709        .709       .687        .686
  Adjectives       .744        .743       .761        .761       .724        .723
  Multi-word       .785        .704       .907        .906       .757        .682
  Proper name      .811        .360       .937        .937       .758        .480

This procedure worked surprisingly well. About 25 regular expressions were generated, matching almost 7% of the test sentences. Of these, 84% were correctly identified.

Other multiword expressions in the Hector dictionary are often listed as kinds or as idioms within a Hector word sense. For example, "jazz band" and "rock band" are kinds of one sense of band. Again, regular expressions were used to locate and assign a sense to these collocations. However, since many other kinds of bands, like "rhythm and blues band", are subsumed under the same sense but are not explicitly specified as a kind, the classifier was also trained in the usual way on that sense. This meant that even if the regular expression match failed, the correct sense might still be identified based on TLC's cues.

In developing TLC we did not consider proper names, again because they are not polysemous. Proper name identification is a field unto itself, and our working assumption has been that a proper name filter would be applied to text prior to TLC's operation. Since we do not have such a filter as part of TLC's preprocessing, the proper names in Senseval were treated as separate senses, with training performed on each independently.

4. Results
We used TLC to assign senses to the 36 trainable words only. Features were extracted from the supervised training materials, but the definitions and example sentences provided in the Hector dictionary were not used. However, as described in Section 3, we did filter the collocations listed as kinds or idioms during preprocessing. The results indicate that TLC's precision increased with the size of the training data (Pearson correlation coefficient r = 0.33, p < 0.05, two-tailed), but there was no significant effect of the number of senses (r = –0.15, p > 0.10). As expected for a Bayesian classifier, its performance was strongly affected by item entropy (r = –0.63, p < 0.01).


Table I shows the classifier’s performance over all trainable words when scored by the fine-grained method. It also lists the results by nouns, verbs, adjectives, multi-word expressions, and proper nouns. The data are taken from the final Senseval run and are designated ets-pu in the main summary tables. For purposes of comparison, Table I also gives performance figures for the best supervised training (S) system, as well as the mean for all S systems.

5. System Improvements
Most classifiers consist of two independent components: the statistical model and the set of features they manipulate. Ideally, these two should be prised apart and evaluated independently of one another. For example, TLC's Bayesian model assumes conditional independence of the features, which is clearly a false assumption. Other models for word sense disambiguation that do not assume independence are emerging, such as maximum entropy and TiMBL (Veenstra et al., this volume). It is quite possible that replacing the Bayesian model with one of these would improve the classifier's overall performance.

It is also likely that the feature set TLC uses can be improved. For example, it currently uses the Penn Treebank part-of-speech tag set. Enriched tags that encode configurational information, such as supertags (Joshi and Srinivas, 1994), have recently been developed and might also improve the system's performance.

6. A Sense-Tagging Application
Miller et al. (submitted) are currently preparing a hand-tagged corpus for several hundred common words of English, as a resource for future development of statistical classifiers. Preparation of these materials is time-consuming and labor-intensive, in part because many words have secondary senses that are so infrequent that it is difficult to find examples, except by sifting through hundreds of cases of the primary sense. For instance, in every 100 occurrences of "bank", 78 are likely to be examples of the "financial institution" sense, with the remaining 22 representing the other 8 senses. We wondered if TLC could perform a pre-screening function by flagging many examples of the primary sense, and in this way save the human taggers much time and effort.

As an experiment, for each of eight words that have a single salient sense, we trained TLC on this sense and on the union of all the other senses of the word, so that the classifier could score new examples in terms of "primary" sense and "other". When we looked at examples that were classified as high-probability primary and low-probability other, there were very few misclassifications. This screening procedure should speed up the tagging process by allowing human taggers to concentrate their efforts on sentences in which a non-primary sense is more likely to be used. We hope that in this way, by assisting in the manual tagging of training corpora, TLC can contribute to the future development of all supervised training systems, including its own.


References
Brill, E. "Some advances in rule-based part of speech tagging". Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle: AAAI. 1994.
Chiang, T-H., Y-C. Lin and K-Y. Su. "Robust learning, smoothing, and parameter tying on syntactic ambiguity resolution", Computational Linguistics, Vol. no. 21-3, 1995, pp. 321–349.
Fellbaum, C. (ed). WordNet: An Electronic Lexical Database, Cambridge: MIT Press. 1998.
Good, I. F. "The population frequencies of species and the estimation of population parameters", Biometrica, Vol. no. 40, 1953, pp. 237–264.
Joshi, A.K. and B. Srinivas. "Disambiguation of Super Parts of Speech (or Supertags): Almost parsing", Proceedings of COLING 1994, 1994, pp. 154–160.
Leacock, C., M. Chodorow and G. A. Miller. "Using corpus statistics and WordNet relations for sense identification", Computational Linguistics, Vol. no. 24-1, 1998, pp. 147–165.
Miller, G. A., R. Tengi and S. Landes (submitted for publication). "Matching the Tagging to the Task".

Computers and the Humanities 34: 121–126, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


GINGER II: An Example-Driven Word Sense Disambiguator
LUCA DINI1,∗, VITTORIO DI TOMASO1 and FRÉDÉRIQUE SEGOND2
1 Centro per l'Elaborazione del Linguaggio e dell'Informazione (E-mail: {dini,ditomaso}@celi.sns.it); 2 Xerox Research Centre Europe (E-mail: [email protected])
∗ We are grateful to Gregory Grefenstette and Christopher Brewster for their comments on earlier versions of this paper. Our thanks also go to Rob Gaizauskas, Wim Peters, Mark Steventson and Yorick Wilks for fruitful discussions about the methodology. Any remaining errors are our own.

1. Introduction
Ginger II performs "all word" unsupervised word sense disambiguation for English, exploiting information from machine-readable dictionaries in the following way. To automatically generate a large, dictionary-specific semantically tagged corpus, we extract the example phrases found in the dictionary entries. We attach to each headword in this text the dictionary sense number under which the text was found. This provides the sense label for the head word in that context. GINGER II then builds a database of semantic disambiguation rules from this labelled text by extracting functional relations between the words in these corpus sentences. As in GINGER I (Dini et al., 1998), the acquired rules are two-level rules involving the word level and/or the ambiguity class level. In contrast to the algorithm used in GINGER I, which was a variant of Brill's tagging algorithm (Brill, 1997), iteratively validating adjacency rules on a tagged corpus, GINGER II is now based on a completely non-statistical approach. GINGER II directly extracts semantic disambiguation rules from dictionary example phrases using all functional relations found there. The dictionary, providing typical usages of each sense, needs no iterative validation.

GINGER II provides the following improvements over GINGER I:
− it relies on dictionary sense numbering to semantically tag dictionary examples
− it uses syntactic parsing of dictionary examples to extract semantic disambiguation rules


− it uses two sets of semantic information to produce semantic disambiguation rules: the dictionary numbering provided by HECTOR (Atkins, 1993) and the 45 top-level categories of WordNet.
We present below the building blocks of GINGER II as well as the components and the resources it uses.

2. The GINGER II Approach to Semantic Disambiguation within the SENSEVAL Competition
GINGER II is an unsupervised rule-based semantic tagger which works on all words. Semantic disambiguation rules are directly extracted from dictionary examples and their sense numberings. Because senses and examples have been defined by lexicographers, they provide a reliable linguistic source for constructing a database of semantic disambiguation rules. GINGER II first builds, using dictionary examples, a database of rules which are then applied to a new text, returning as output a semantically tagged text.

To learn the semantic disambiguation rules GINGER II uses the following components:
− the HECTOR Oxford Dictionary of English (OUP),
− the Xerox Incremental Finite State Parser for English (XIFSP),
− WordNet 1.6 (English).

GINGER II uses dictionary example phrases as a semantically tagged corpus. When an example z is listed under the sense number x of a dictionary entry for the word y, GINGER II creates a rule which stipulates that, in usages similar to z, the word y has the meaning x. Using XIFSP,1 we first parse all the OUP example phrases for the selected SENSEVAL words. XIFSP is a finite-state shallow parser relying on part-of-speech information only to extract syntactic functions without producing complete parse trees in the traditional sense. GINGER II makes use of the syntactic relations subject-verb, verb-object and modifier. Subject-object relations include cases such as passives, reflexives and relative constructions. Modifier relations include prepositional and adjectival phrases as well as relative clauses. GINGER II also uses XIFSP-extracted information about appositions. Altogether GINGER II uses 6 kinds of functional relations. Although XIFSP also extracts adverbial modification, GINGER II does not use it, since our semantic disambiguation also uses, as shown below, the 45 top-level WordNet categories, where all adverbs are associated with the same unique semantic class.

Once all OUP examples have been parsed, each word of each functional pair is associated with semantic information.


Two sets of semantic labels are used: the HECTOR sense numbers and the 45 WordNet top-level categories. HECTOR sense numbers are used to encode the example headword, while the WordNet tags are used to encode all remaining words appearing in the examples. We use the relatively small number of WordNet top-level categories so as to obtain sufficiently general semantic disambiguation rules. If we used only HECTOR sense numbers, on the assumption that they were extended to all items in a dictionary, this would result in far too many semantic rules, each with a very limited range of application.

GINGER II deduces semantic rules2 from these functional-semantic word pairs. These rules, like those of Brill,3 are of two kinds. There are rules at the word level and rules at the ambiguity class level. The example below summarizes the above steps for the example he shook the bag violently, registered under the HECTOR sense number (sen uid = 504338) of the OUP entry for shake.

First XIFSP extracts the syntactic functional relations: SUBJ(he, shake), DOBJ(shake, bag). These functional relations are then transformed into functional pairs. For instance, OBJ(shake, bag) becomes (shake^HasObj, bag^HasObj-1). These functional pairs are then augmented with semantic information: the target word, here shake, is associated with its HECTOR sense numbers (504338, 516519, 516517, 516518, ..., 516388) and the other word, here bag for the verb-object relation, is associated with its WordNet tag sense numbers (6, 23, 18, 5, 4). The resulting pair can be read as:

  (shake^{HasObj}_{504338_516519_516517_516518_..._516388}, bag^{HasObj^{-1}}_{6_23_18_5_4})

From this pair we extract the following two disambiguation rules:
(1) bag WRIGHT bi504338_bi516519_bi516517_bi516518_..._bi516388 bi504338
(2) b6_b23_b18_b5_b4 WRIGHT bi504338_bi516519_bi516517_bi516518_..._bi516388 bi504338
where b represents the object relation and bi its inverse.

Rule (1) can be read as: the ambiguity class (504338, 516519, 516517, 516518, ..., 516388) disambiguates as 504338 when it has as object the word bag. Rule (2) can be read as: the ambiguity class (504338, 516519, 516518, ..., 516388) disambiguates as 504338 when it has as object the WordNet ambiguity class (6, 23, 18, 5, 4).
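A schematic illustration of how a word-level and a class-level rule could be derived from one sense-labelled functional pair, as in the shake/bag example above; the tuple format and rule encoding are simplified stand-ins, not GINGER II's own rule syntax.

```python
# Illustrative only: build one word-level and one class-level rule from a
# sense-labelled functional pair (relation, target ambiguity class, target
# sense, context word, context WordNet classes).

def rules_from_pair(relation, target_classes, target_sense, context_word, context_classes):
    ambiguity = "_".join(target_classes)
    word_rule = (relation, context_word, ambiguity, target_sense)
    class_rule = (relation, "_".join(context_classes), ambiguity, target_sense)
    return [word_rule, class_rule]

for rule in rules_from_pair(
    relation="object",
    target_classes=["504338", "516519", "516517", "516518", "516388"],
    target_sense="504338",
    context_word="bag",
    context_classes=["6", "23", "18", "5", "4"],
):
    print(rule)
# ('object', 'bag', '504338_516519_516517_516518_516388', '504338')
# ('object', '6_23_18_5_4', '504338_516519_516517_516518_516388', '504338')
```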


Figure 1. GINGER II: general architecture.

All dictionary example phrases are translated into semantic disambiguation rules and form a rule database. GINGER II then applies these rules to any new input text and gives as output a semantically tagged text. The applier, designed at CELI, uses several heuristics in order to drive the application of the disambiguating rules. In particular, it exploits the notion of tagset distance in order to determine the best matching rule. The tagset distance is a metric which calculates the distance between two semantic classes within WordNet. The metric for computing the distance can be set by the user and can vary across applications.

The applier first parses the new text and extracts the functional dependencies. Then it extracts the potential matching rules. In case of conflict between rules, priority is given to word-level rules. If no word-level rule can apply, then priority is given to the rule with the lowest or the highest (depending on how the user sets the metric) distance. The system is now complete and can run on all words of any text. The general architecture of GINGER II is summarized in Figure 1.
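A minimal sketch of this conflict-resolution policy, assuming candidate rules have already been matched against the parsed input; the rule dictionaries and the `tagset_distance` function stand in for GINGER II's data structures and WordNet-based metric and are hypothetical.

```python
# Sketch of the rule-application priority described above: word-level rules win;
# otherwise the class-level rule with the smallest tagset distance is chosen.

def apply_rules(candidate_rules, tagset_distance):
    word_rules = [r for r in candidate_rules if r["level"] == "word"]
    if word_rules:
        return word_rules[0]["sense"]
    class_rules = [r for r in candidate_rules if r["level"] == "class"]
    if not class_rules:
        return None
    best = min(class_rules,
               key=lambda r: tagset_distance(r["context_class"], r["observed_class"]))
    return best["sense"]

rules = [
    {"level": "class", "sense": "516519", "context_class": "artifact", "observed_class": "food"},
    {"level": "class", "sense": "504338", "context_class": "container", "observed_class": "container"},
]
print(apply_rules(rules, lambda a, b: 0 if a == b else 1))  # -> 504338
```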

3. Evaluation and Future Perspectives
For the overall SENSEVAL task GINGER II obtained a precision of 0.46 and a recall of 0.37, which places it in the upper band of the unsupervised systems and in the average band of the supervised systems. Unlike many systems in this range, however, GINGER II is a general system which works on all words and, in the SENSEVAL exercise, it did not take any advantage of knowing the word's part of speech in advance. Moreover, because it directly uses HECTOR senses, it did not have the disadvantage of the "mapping senses" phase.


We expect these results to improve, since a new English tagger, which performs better than the one we used, is now integrated in XIFSP. Future versions of GINGER will include more functional relations and richer dictionary information. We are also interested in testing possible improvements in system performance using, for instance, triples rather than pairs: for example, using subject-verb-object relations rather than subject-verb and verb-object relations.

Encouraged by GINGER's robustness, we are now integrating such a WSD component into XeLDA (Xerox Linguistic Development Architecture), making use of additional dictionary information such as collocates and subcategorization. All this information yields a rule database attached to a particular dictionary, leading to a dictionary-based semantic tagger.4 Other areas of investigation concern deciding which semantic tags would be best to use, and associating weights with the semantic rules of the database.

The results of GINGER II indicate that even if dictionaries, seen as hand-tagged corpora, are reliable sources of information from which to extract semantic disambiguation rules, they can be improved. We believe that one important way of creating better linguistic resources for many Natural Language Processing tasks is to enrich dictionaries with prototypical example phrases. Because it is unsupervised, the method used within GINGER II can be applied to any language for which on-line dictionaries exist but for which significantly large semantically pre-tagged corpora are not available.

Notes
1 See Ait-Mokhtar and Chanod (1997).
2 The rule extractor has been implemented as a Java program which parses dictionary entries in order to gather all the relevant information.
3 See Brill (1995, 1997).
4 See Segond et al. (1999).

References
Ait-Mokhtar, S. and J.-P. Chanod. "Subject and Object Dependency Extraction Using Finite-State Transducers". In Proceedings of the Workshop on Information Extraction and the Building of Lexical Semantic Resources for NLP Applications, ACL, Madrid, Spain, 1997.
Atkins, S. "Tools for Corpus-aided Lexicography: The HECTOR Project". In Acta Linguistica Hungarica 41, 1992–1993. Budapest, 1993, pp. 5–72.
Brill, E. "Transformation-based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging". In Computational Linguistics, 1995.
Brill, E. "Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging". In Natural Language Processing Using Very Large Corpora. Kluwer Academic Press, 1997.
Dini, L., V. Di Tomaso and F. Segond. "Error Driven Word Sense Disambiguation". In Proceedings of COLING/ACL, Montreal, Canada, 1998.
Miller, G. "WordNet: An On-line Lexical Database". International Journal of Lexicography, 1990.
Resnik, P. and D. Yarowsky. "A Perspective on Word Sense Disambiguation Methods and Their Evaluation". In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C., USA, 1997.
Segond, F., E. Aimelet, V. Lux and C. Jean. "Dictionary-driven Semantic Look-up". In Computers and the Humanities, this volume.
Wilks, Y. and M. Stevenson. "Word Sense Disambiguation Using Optimised Combinations of Knowledge Sources". In Proceedings of COLING/ACL, Montreal, Canada, 1998.
Yarowsky, D. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods". In Proceedings of the ACL, 1995.

Computers and the Humanities 34: 127–134, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Word Sense Disambiguation by Information Filtering and Extraction
JEREMY ELLMAN, IAN KLINCKE and JOHN TAIT
School of Computing & Information Systems, University of Sunderland, UK (E-mail: [email protected])

Abstract. We describe a simple approach to word sense disambiguation using information filtering and extraction. The method fully exploits and extends the information available in the Hector dictionary. The algorithm proceeds by the application of several filters to prune the candidate set of word senses, returning the most frequent if more than one remains. The experimental methodology and its implications are also discussed.
Key words: word sense disambiguation, information filtering, SENSEVAL

1. Introduction Our interest in word sense disambiguation comes from experiences with “Hesperus”, a research system that clusters Internet web pages based on their similarity to sample texts (Ellman and Tait, 1997). Hesperus uses a development of Morris and Hirst’s (1991) idea of lexical chains–that coherent texts are characterised by sets of words with meanings related to the topics and that these topics may be detected by reference to an external thesaurus. Of course, many words are ambiguous and have meanings corresponding to different thesaural headwords. This represents a problem for lexical chaining, since the selection of an incorrect sense means that the word may be joined to an inappropriate chain. It will also be excluded from the correct chain, disrupting the apparent topic flow of the text, and degrading the accuracy of the procedure. This is counteracted using a word sense disambiguation pre-processor. The function of the pre-processor is as much to filter out spurious sense assignments as to wholly provide unique sense identifications. This increases the accuracy of word sense disambiguation which is one of the effects of the lexical chaining process (Okumura and Honda, 1994) by early elimination of inappropriate senses. The pre-processor follows the “sliding window” approach described in Sussna (1993), where ambiguous words are examined within the several words of their surrounding local context. This is compatible with the Senseval task (based as it is


on lexical samples), and was consequently re-implemented for Senseval, where it competed as SUSS. SUSS’s principal objective in Senseval was to evaluate different disambiguation techniques that could be used to improve the performance of a future version of Hesperus. This excludes both training and deep linguistic analysis. Training, as in machine learning approaches, implies the existence of training corpora. Such corpora tend only to exist in limited subject areas, or are restricted in scope. Machine Learning approaches were consequently excluded since Hesperus is intended to be applicable to any subject area. Indeed, we could argue that the associations found in thesauri contain the most common representations and subsume the associations found in normal text. Deep linguistic analysis is rarely robust, and often slow. This makes it incompatible with Hesperus, which is designed as a real-time system. A derived objective was to maximise the number of successful disambiguations – the essential competition requirement! SUSS extensively exploited the Hector machine readable dictionary entries used in Senseval. There were two reasons for this: Firstly, Hector dictionary entries are extremely rich, and allowed us to consider disambiguation techniques that would not have been possible using Roget’s Thesaurus alone (as used in Hesperus). Secondly, Hector sense definitions were much finer grained than those used in Roget. A system that used Roget would have been at a considerable disadvantage since it would not have been able to propose exact Hector senses in the competition. One noteworthy technique made possible by the Hector dictionary was the conversion and adaptation of dictionary fields to patterns, as used in Information Extraction (e.g. Onyshkevych, 1993; Riloff, 1994). Where possible, this allowed the unique selection of a candidate word sense, with minimal impact on the performance of the rest of the algorithm. 2. SUSS: The Sunderland University Senseval System SUSS is a multi-pass system that attempts to reduce the number of candidate word senses by repeated filtering. Following an initialisation phase, different filters are applied to select a preferred sense tag. The order of filter application is important. Word and sense-specific techniques are applied first; more general techniques are used if these fail. Specific techniques are not likely to affect any other than their prospective targets, whereas general methods introduce probable misinterpretation over the entire corpus. For example, a collocate such as “brass band” uniquely identifies that sense of “band”, with no impact on other word senses. Other techniques required careful assessment to ensure that their overall effect was positive. This was part of a structured development strategy.

2.1. THE SUSS DEVELOPMENT STRATEGY

We used the training data considerably to develop SUSS, not to train the system but to ensure that promising techniques for some types of ambiguity did not adversely influence the overall performance of the system. The strategy was as follows:
1. A basic system was implemented that processed the training data.
2. A statistics module was implemented that displayed disambiguation effectiveness by word, word sense, and percentage precision.
3. As different disambiguation techniques were developed, effectiveness was measured on the whole corpus.
4. Techniques that improved performance (as measured by total percentage successful disambiguations) were further developed. Those that degraded performance were dropped. (Since the competition was time limited, it was not cost effective to pursue interesting but unsuccessful approaches.)

2.2. SUSS INITIALISATION PHASE

SUSS used a preparation phase that included dictionary processing and other preparations that would otherwise be repeated for each lexical sample to be processed. The HECTOR dictionary was loaded into memory using a public domain program that parses SGML instances. This made the definition available as an array of homographs that is further divided into an array of finer sense distinctions. Each of these contained fields, such as the word sense definition, part of speech information, plus examples of usage. The usage examples were used in the "Example Comparison Filter" and the "Semantic Relations Filter" techniques (described below). They were reduced to narrow windows W words wide centred on the word to be disambiguated from which stopwords (Salton and McGill, 1983) have been eliminated. This facilitated comparison with identically structured text windows produced from the test data. The main SUSS algorithm is as follows.

2.3. SUSS ALGORITHM PROCESSING PHASE

1. For each sample:
2. Filter possible entries as collocates. (DONE if there is only one candidate sense.)
3. Filter remaining senses for information extraction pattern. (DONE if there is only one candidate sense.)
4. Filter remaining senses for idiomatic phrases. (DONE if there is only one candidate sense.)
5. Eliminate stopwords from sample.
6. Produce window w words wide centred on word to be disambiguated.
7. For each example in the Hector dictionary entry, match the sample window against the example window. Select the sense that has the highest example matching score.
8. If no unique match found, return the most frequent sense of those remaining from the training corpus (or first remaining dictionary entry – note 1).
We now go on to describe the specific techniques tested.
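Read as a whole, steps 1–8 form a cascade of filters over the candidate sense set. The sketch below is a paraphrase of that control flow, not the SUSS code; the individual filter functions, the window helpers and most_frequent_sense are assumed to exist.

```python
# Paraphrase of the SUSS multi-pass control flow (illustration only).

def disambiguate(sample, senses, filters, window_width=10):
    """Return one sense tag for `sample`, an occurrence of an ambiguous word.

    senses  -- candidate Hector senses, each assumed to carry .example_windows
    filters -- ordered collocation, IE-pattern and idiom filters; each maps
               (sample, candidates) -> reduced candidate list
    """
    candidates = list(senses)
    for f in filters:                          # steps 2-4: specific filters first
        candidates = f(sample, candidates)
        if len(candidates) == 1:               # DONE as soon as one sense survives
            return candidates[0]

    window = make_window(remove_stopwords(sample), window_width)   # steps 5-6
    scored = [(example_match_score(window, ex), sense)             # step 7
              for sense in candidates
              for ex in sense.example_windows]
    best_score, best_sense = max(scored, key=lambda t: t[0], default=(0, None))
    if best_sense is not None and best_score > 0:
        return best_sense
    return most_frequent_sense(candidates)                         # step 8
```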

2.4. COLLOCATION FILTER

Collocations are short, set expressions which have undergone a process of lexicalisation. For example, consider the collocation 'brass band'. This expression, without context, is understood to refer to a collection of musicians, playing together on a range of brass instruments, rather than a band made of brass to be worn on the wrist. For these reasons it is possible for the Hector dictionary to define such expressions as distinct senses of the word. Given the set nature of collocations, therefore, it was considered that to look for these senses early in the disambiguation process would be a simple method of identifying or eliminating them from consideration. The collocation identification module, therefore, worked as a filter using simple string matching. If a word occurrence passing through the module corresponded to one of the collocational senses defined in the dictionary it would be tagged as having that sense. If none of these senses were applicable, however, all senses taking a collocational form were filtered out.

2.5. INFORMATION EXTRACTION PATTERN FILTER

The Information Extraction filter refers exclusively to enhancements to the Hector dictionary entries made specifically to support word sense disambiguation. The HECTOR dictionary is primarily intended for human readers. Many entries contain a clues field in a restricted language that indicates typical usage. Examples include phrases such as "learn at mother's knee, learn at father's knee, and variants", or "usu on or after". Such phrases have long been proposed as an important element of language understanding (Becker, 1975). These phrases were manually converted into string matching patterns and successfully used to identify individual senses. For example, "shake" contains the following:
shake in one's shoes, shake in one's boots v/= prep/in pron-poss prep-obj/(shoes,boots,seat)
The idiom field can be converted (using PERL patterns) as follows:
shake in \w* (shoes|boots|seat)
This may now be used to match against any of the idiomatic expressions "shake in her boots", "your boots", etc.


We call a related method "phrasal patterns". A phrasal pattern is a non-idiomatic multiple-word expression that strongly indicates use of a word in a particular sense. For example, "shaken up" seems to occur only in past passive forms. Adding appropriate phrasal patterns to a dictionary sense was found to increase disambiguation performance for that sense. The majority of phrasal patterns were manually derived from the Hector dictionary entries. Others were identified by observing usage patterns in the dictionary examples, or the training data. Collocation and other phrasal methods are important since they are tightly focused on one word, and on one sense that word may be used in. They do not affect other word senses, and cannot influence the interpretation of other words.

2.6. IDIOMATIC FILTER

Idiomatic forms identify some word senses. Unlike collocations, however, idiomatic expressions are not constant in their precise wording. This made it necessary to search for content words in a given order, rather than looking for a fixed string. An idiom was considered present in the text if a subset of its content words was found that exceeded a certain (heuristically determined) threshold value. For example, the meaning of "too many cooks" is clear, without giving the precise idiom. Dictionary entries that contained idiomatic forms were processed as follows. Firstly, two-word idioms were checked for specifically. If the idiom was longer, stopwords were removed from the idiomatic form listed, and the remaining content words compared in order with words occurring in the text. If 60% of the content words were found in the region of the target word, the idiomatic filter succeeded, and senses containing that idiom were selected. Otherwise, senses containing that idiomatic form were excluded from further consideration.
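A minimal reconstruction of this test is sketched below; the stopword list is an assumption, and the 60% threshold is taken from the text.

```python
# Reconstruction of the idiomatic filter test (illustration, not the SUSS code).
STOPWORDS = {"a", "an", "the", "in", "of", "to", "too", "one's"}   # assumed list

def idiom_matches(idiom, context_words, threshold=0.6):
    """True if enough of the idiom's content words appear, in order, in the context."""
    words = idiom.lower().split()
    text = " ".join(w.lower() for w in context_words)
    if len(words) == 2:                       # two-word idioms: fixed-string check
        return " ".join(words) in text
    content = [w for w in words if w not in STOPWORDS]
    if not content:
        return False
    lowered = [w.lower() for w in context_words]
    pos, found = 0, 0
    for word in content:
        try:
            pos = lowered.index(word, pos) + 1
            found += 1
        except ValueError:
            continue
    return found / len(content) >= threshold

# e.g. idiom_matches("too many cooks spoil the broth",
#                    "far too many cooks will spoil whatever broth you make".split())
```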

2.7. EXAMPLE COMPARISON FILTER

The Example Comparison Filter tries to match the examples given in the dictionary against the word to be disambiguated, looking at the local usage context. It assigns a score for each sense based on identical words occurring in the text and dictionary examples and their relative positions. We take a window of words surrounding the target word, with a specified width and specified position of the target, in the text and in a similar window from each dictionary example. For each example in each sense, all the words occurring in each window are compared and, where identical words are found, a score, S, is assigned, where

$$S = \sum_{w \in W} d_S \, d_E$$

and w is a word in window W, and dS and dE are functions of the distance of the word from the target word in the sample and example windows respectively, such that greater distances result in lower scores. The size of the window was determined empirically. Window sizes of 24, 14, and 10 words were tried. Larger window sizes increased the probability of spurious associations, and a window size of ten words (which is five words before and five words after the target word) was selected as optimal. When all the example scores have been calculated for each word sense, the sense with the highest example score is chosen as the correct sense of that occurrence. In cases where this does not produce a result, the most frequently occurring sense (or first dictionary sense – see note 1) that has not been previously eliminated is chosen.
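The scoring function can be sketched as follows. The exact distance weightings dS and dE are not given in the paper, so a simple inverse-distance form is assumed here purely for illustration.

```python
# Sketch of the example-comparison score S = sum over shared words of dS * dE.
# The real dS and dE are unspecified; 1/(1 + distance) is an assumed stand-in.

def window(words, target_index, width=10):
    """Map each word within +/- width//2 of the target to its distance from it."""
    half = width // 2
    return {w.lower(): abs(i - target_index)
            for i, w in enumerate(words)
            if i != target_index and abs(i - target_index) <= half}

def example_score(sample_window, example_window):
    score = 0.0
    for word, d_sample in sample_window.items():
        if word in example_window:
            d_example = example_window[word]
            score += (1.0 / (1 + d_sample)) * (1.0 / (1 + d_example))
    return score

# The sense whose dictionary example yields the highest example_score is chosen.
```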

2.8. OTHER TECHNIQUES EVALUATED

One of the objectives of SUSS was to evaluate different disambiguation techniques. Below we describe two methods that were evaluated, but not used in the final system, since they led to decreased overall performance.

2.9. PART OF SPEECH FILTER

Wilks and Stevenson (1996) have claimed that much of sense tagging may be reduced to part-of-speech tagging. Consequently, we used the Brill (1992) Tagger on the subset of the training data set that required part-of-speech discrimination. This should have improved disambiguation performance by filtering out possible senses not appropriate to the assigned part of speech. However, due to tagging inaccuracy, this was just as likely to eliminate the correct word sense, too. Consequently, it did not make a positive contribution. Another routine that used the part-of-speech tags attempted to filter out the senses of words marked as noun modifiers by the dictionary grammar labels where the following word was not marked as a noun by the tagger. This routine also checked words that contained an 'after' specification in the grammar tag and eliminated these senses where the occurrence did not follow the word given. However, it gave no overall benefit to the results either. One possible cause of this is in occurrences where there are two modifiers joined by a conjunction so that the first is, legitimately, not followed immediately by a noun.

2.10. SEMANTIC RELATIONS FILTER

The Semantic Relations Filter is an extension of the example comparison filter that uses overlapping categories and groups in Roget’s thesaurus, rather than identical word matching. This should allow us to recognise that “accident” is used in the same sense in “car accident” and “motor-bike accident”, since both are means of transport. Appropriate scores are allocated for each category in Roget that the test sentence window has in common with the dictionary example window. As in the example


comparison, the sense that contains the highest scoring example is selected as the best. Disappointingly, this technique finds many spurious relations where words in the local context are interpreted ambiguously. This led to an overall performance degradation over the test set, and so the technique was not part of the final SUSS algorithm.

3. Discussion and Conclusion
SUSS consisted of a number of individual disambiguation techniques that were applied to the data sequentially. Each of these techniques was designed to have one of two effects: either to attempt to assign a unique dictionary sense for the occurrence, or to eliminate one or more invalid senses from consideration. During development a range of techniques were tested to determine whether they were effective in increasing the disambiguation accuracy of the algorithm. Details of the different techniques applied, and their relative effectiveness, are given in Ellman (forthcoming). The testing procedures utilised the training data, with the algorithm being applied both with and without the technique activated. The results of these applications were compared over different senses and different words, against the human-tagged training data. The statistics produced were used to determine whether the technique improved the overall accuracy of the disambiguation and, hence, whether it was a useful technique. Some techniques, for example, produced a great improvement in accuracy on particular words or specific senses,2 yet the overall effect was a reduction in accuracy over all words. This result reflects the interaction between word-specific and generic sense disambiguation methods. A generic disambiguation technique needs to have better accuracy than that which would be given by selecting a default sense. For example, in the training corpus, "wooden" means "made of wood" in 95% of the samples. Thus, a generic technique applied to "wooden" needs to exceed this level of accuracy or it will degrade overall performance. Regular Information Extraction patterns provided a particularly effective sense-specific disambiguation. However, it was necessary to convert each pattern by hand. A clear next step would be the development of a module to automatically produce the patterns from the relevant dictionary fields. SUSS performed surprisingly well considering its lack of sophistication, with above average performance compared to other systems. It is particularly interesting to note that it was placed in the first three systems where no training data was supplied.


Notes
1 The calculation of sense occurrence statistics was designed to counter a perceived deficiency in Hector, where the ordering of senses did not appear to match that of sense frequency in the corpus. This was considered to be a training technique, and SUSS was classified as a learning system. The SUSS-Dictionary system did not use this technique and was considered as an "all words" system.
2 A future variation of SUSS could use different settings for each word to be disambiguated. These could be determined automatically using machine learning algorithms.

References
Brill, E. "A Simple Rule-based Part-of-speech Tagger". Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 1992.
Becker, J. D. "The Phrasal Lexicon". In Proceedings of the Conference on Theoretical Issues in Natural Language Processing, Cambridge, MA, 1975, pp. 70–77.
Ellman, J. (forthcoming). "Using Roget's Thesaurus to Determine the Similarity of Texts". PhD Thesis, University of Sunderland, UK.
Ellman, J. and J. Tait. "Using Information Density to Navigate the Web". UK IEE Colloquium on Intelligent World Wide Web Agents, March 1997.
Okumura, M. and T. Honda. "Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion". Proc. COLING 1994, vol. 2, pp. 755–761.
Onyshkevych, B. "Template Design for Information Extraction". Proceedings of the Fifth Message Understanding Conference (MUC-5), 1993.
Morris, J. and G. Hirst. "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text". Computational Linguistics, 17(1) (1991), 21–48.
Riloff, E. and W. Lehnert. "Information Extraction as a Basis for High-precision Text Classification". ACM Transactions on Information Systems, 12(3) (July 1994), 296–333.
Salton, G. and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
Stairmand, M. "A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval". PhD Thesis, Dept of Computational Linguistics, UMIST, UK, 1996.
St-Onge, D. Detecting and Correcting Malapropisms with Lexical Chains. MSc Thesis, University of Toronto, 1995.
Sussna, M. "Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network". Proceedings of the Second International Conference on Information and Knowledge Base Management, 1993, pp. 67–74.
Wilks, Y. and M. Stevenson. The Grammar of Sense: Is Word-sense Tagging Much More Than Part-of-speech Tagging? Technical Report CS-96-05, University of Sheffield, 1996.

Computers and the Humanities 34: 135–140, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Large Scale WSD Using Learning Applied to SENSEVAL
PAUL HAWKINS and DAVID NETTLETON
University of Durham, Southampton, UK

Abstract. A word sense disambiguation system which is going to be used as part of an NLP system needs to be large scale, able to be optimised towards a specific task and, above all, accurate. This paper describes the knowledge sources used in a disambiguation system that meets all three of these criteria. It is a hybrid system combining sub-symbolic, stochastic and rule-based learning. The paper reports the results achieved in Senseval and analyses them to show the system's strengths and weaknesses relative to other similar systems.

1. Introduction
The motivation behind this work is to develop a core Word Sense Disambiguation (WSD) module which can be integrated into an NLP system. An NLP system imposes three requirements on any dedicated WSD module it may use:
• To be large scale and disambiguate all words contained in all open class categories.
• To be able to be optimised towards a specific task.
• To be accurate.
Senseval facilitated the evaluation of all three of these requirements. Senseval enabled the comparison of disambiguation accuracy with other state-of-the-art systems. It also provided the first opportunity to test if this system was lexicon independent, which enables optimisations towards a specific task. The main features of this system are the way different knowledge sources are combined, how contextual information is learnt from a corpus and how the disambiguation algorithm eliminates senses. This paper concentrates on the knowledge sources used. A detailed examination of all components of the system can be found in (Hawkins, 1999).


2. Knowledge Sources Three knowledge sources are used to aid disambiguation: frequency, clue words and contextual information. They are all combined together to produce a hybrid system which takes advantage of stochastic, rule-based and sub-symbolic learning methods. A hybrid system seems appropriate for the WSD task because words differ considerably in the number of different senses, the frequency distribution of those senses, the number of training examples available and the number of collocates which can help disambiguation. This makes the task very different for each word, and affects the amount each of the knowledge sources is able to help disambiguation for that particular word. By combining these knowledge sources the aim is to take the useful information each is able to offer, and not allow them to cause confusion in cases where they are unable to help. Each of the three knowledge sources is now described.

2.1. FREQUENCY

The frequency information is calculated from the Hector training corpus which has been manually sense tagged. The frequency of each sense is calculated for each word form rather than the root form of each word. In some instances this morphological information greatly increases the frequency baseline.1 For example, the frequency distribution of senses is very different for word forms sack and sacks than it is for sacking. The results show that using frequency information in this way increases the frequency baseline for sack from 50% to 86.6%.
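The word-form baseline can be computed with a few lines of code; the sketch below assumes the hand-tagged training corpus has been reduced to (word form, sense tag) pairs, which is our simplification.

```python
# Sketch: most-frequent-sense baselines computed per word form, not per root form.
from collections import Counter, defaultdict

def word_form_baselines(tagged_tokens):
    """tagged_tokens: iterable of (word_form, sense_tag) pairs from the training corpus."""
    counts = defaultdict(Counter)
    for form, sense in tagged_tokens:
        counts[form.lower()][sense] += 1
    baselines = {}
    for form, senses in counts.items():
        sense, freq = senses.most_common(1)[0]
        baselines[form] = (sense, freq / sum(senses.values()))
    return baselines

# baselines["sacking"] can then differ sharply from baselines["sack"], which is
# what lifts the baseline for "sack" in the example above.
```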

2.2. CLUE WORDS

Clue words are collocates or other words which can appear anywhere in the sentence. The clue words are manually identified, which does pose a scaleability problem. However, given the size of the Senseval task it seemed appropriate to take advantage of human knowledge. On average less than one hour was dedicated by an unskilled lexicographer to identifying clues for each word. This is substantially less than the skilled human effort required to manually sense tag the training data. The success of this knowledge source on this scale may influence the decision to invest resources in clue words on a larger scale. In general, clues give very reliable information and therefore they can often be used even with words which have a very high frequency baseline. If an infrequent sense has a good clue then it provides strong enough evidence to out-weigh the frequency information. For the ambiguous word wooden, spoon provides an excellent clue for an infrequently used sense. This enabled the system to achieve 98% accuracy – 4% above the frequency baseline. The learning algorithm was unable to help for this word as it does not suggest senses with a high enough confidence to ever out-weigh the frequency information.

2.3. CONTEXTUAL INFORMATION

This section introduces the notion of a contextual score which represents a measure for the contextual information between two concepts. Whilst it contributes less to the overall accuracy than the frequency or clue words information, contextual information aims to correctly disambiguate the more difficult words. It uses a sub-symbolic learning mechanism and requires training data. As with most subsymbolic approaches it is difficult to obtain an explanation for why a particular sense is chosen. The contextual score uses the WordNet hierarchy to make generalisations so that the most is gained from each piece of training data. These scores differ from a semantic similarity score described in Sussna (1993), by representing the likelihood of two concepts appearing in the same sentence rather than a measure of how closely related two concepts are. As WordNet does not attempt to capture contextual similarity which is required for WSD (Karov and Edelman, 1996) this information is learnt. This greatly reduces the dependency on the WordNet hierarchy making the system more domain independent. For example, in WordNet doctor and hospital would be assigned a very low semantic similarity as one is a type of professional and the other is a type of building. However, the concepts do provide very useful contextual information which would be learnt during training. Contextual scores are learnt by increasing scores between the correct sense and the contextual words and decreasing scores between the incorrectly chosen sense and the contextual words. The mechanism by which this is performed is beyond the scope of this paper. The contextual scores between concepts are stored in a large matrix. Only the nodes and their hypernyms which have occurred more than 20 times in the SemCor training data are included in the matrix which comprises about 2000 nodes. Whilst it would be possible to include all WordNet nodes in the matrix, the amount of training data required to train such a matrix is currently not available. Also the purpose of the matrix is to learn scores between more general concepts in the higher parts of the hierarchy and to accept the WordNet structure in the lower parts. To find the contextual score between two nodes they are looked up to see if they are contained in the matrix; if they are not their hypernyms are moved up until a node is found which is in the matrix. The contextual scores between nodes in the matrix are learnt during training. Given a training sentence such as “I hit the board with my hammer” , where board is manually sense tagged to the Board(plank) sense, Hit and Hammer are contextual words, but only Hammer will be considered in this example. Figure 1 shows how scores are changed between nodes. Let us assume that the system incorrectly assigns the Circuit Board sense to board. Hammer is represented by Device in the contextual matrix, the correct sense of board is represented by Building Material and the incorrectly chosen sense is represented by Electrical Device. The training process increases the contextual score between Device and Building Material and


decreases the score between Electrical Device and Device, thus making hammer a better contextual clue for Board (plank) and a worse contextual clue for Circuit Board.

Figure 1. Diagram showing the changes in contextual scores if 'hammer' and the 'board, plank' sense of board appear in a training sentence.

The diagram highlights the benefit of the contextual matrix operating above the word level. The training sentence also enables Nail to obtain a higher contextual score with Board (plank). The greatest benefit of the contextual score has proved to be for words which are difficult to disambiguate. Typically these words have a low frequency baseline and clue words are unable to improve accuracy. Contextual scores can be learnt for concepts with different POS. This vastly increases the amount of contextual information available for each ambiguous word and also enables all words of all POS to be disambiguated. This is important in order to meet the large-scale requirement imposed on the system. As contextual scores are learnt there is a reliance on training data. However, as the system is not dependent on the WordNet hierarchy, a system trained on SemCor should be able to be used on a different lexicon without re-learning. Using the Hector lexicon during Senseval was the first opportunity to test this feature. Analysis of the results in section 3 shows that the learning aspects of the system do exhibit lexicon independent features.
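Since the update mechanism itself is described as beyond the scope of the paper, the following is only a schematic sketch of the reward/penalty step illustrated by Figure 1; the additive step size and the helper names are our assumptions.

```python
# Schematic sketch of the contextual-score update (the paper does not give the
# actual update rule; a simple additive reward/penalty with step `delta` is assumed).

def matrix_node(concept, matrix_nodes, hypernym_of):
    """Walk up the hypernym chain until a concept stored in the matrix is reached."""
    node = concept
    while node is not None and node not in matrix_nodes:
        node = hypernym_of(node)       # e.g. Circuit Board -> Electrical Device
    return node

def update_contextual_scores(matrix, matrix_nodes, hypernym_of,
                             correct_sense, chosen_sense, context_word, delta=0.1):
    correct = matrix_node(correct_sense, matrix_nodes, hypernym_of)  # Building Material
    chosen = matrix_node(chosen_sense, matrix_nodes, hypernym_of)    # Electrical Device
    context = matrix_node(context_word, matrix_nodes, hypernym_of)   # Device, for 'hammer'
    if chosen == correct or None in (correct, chosen, context):
        return
    matrix[(context, correct)] = matrix.get((context, correct), 0.0) + delta
    matrix[(context, chosen)] = matrix.get((context, chosen), 0.0) - delta
```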


Table I. The effect of each knowledge source on overall accuracy

                                   Onion   Generous   Shake   All words
(1) Root Form Frequency            84.6    39.6       23.9    57.3
(2) Word Form Frequency            85      37         30.6    61.6
(3) Clue words + 2                 92.5    44.9       71.1    73.7
(4) Contextual scores + 2          85      50.1       61.8    69.8
(5) Full System 2 + 3 + 4          92.5    50.7       69.9    77.1
(6) Coarse Grained 2 + 3 + 4       92.5    50.7       72.5    81.4

3. Results
Table I shows the contribution that frequency, clue words and contextual scores have made to the overall accuracy of the system. Apart from the final row, all scores quoted are 'fine-grained' results. Precision and recall values are the same because this system attempted every sentence. Row (2) shows that the overall accuracy is increased by 4.3% by using word form rather than root form frequencies. Row (4) shows that this system performs quite well even without the use of manually identified clue words; such a system would have no scaleability problems. Out of the three words identified, generous benefits the most from the contextual scores. This is because it has a low frequency baseline and there are very few clue words which are able to help. Row (5) shows that the overall system achieves much higher accuracy than any sub-section of it. This shows that the clue words and contextual scores are useful for disambiguating different types of words and so can be successfully combined.

4. Conclusion and Comparison The real benefits of the Senseval evaluation are now briefly exploited by comparing different systems’ results. Figure 2 uses Kappa to analyse results of the four systems which achieved the highest overall precision, all of which used supervised learning. Kappa gives a measure of how well the system performed relative to the frequency baseline. This enables the relative difficulty of disambiguating different categories of words to be examined. The graph shows that all systems found that nouns were the easiest POS to disambiguate and adjectives proved slightly more difficult than verbs. Relative to other systems Durham did well for nouns and least well for verbs. Possible reasons for this are that the Durham system only uses semantic information in the context, and gives equal weight to all words in the sentence. Other systems also use syntactic clues and often concentrate on the words immediately surrounding


Figure 2. Graph showing comparison between 4 learning systems in Senseval.

the ambiguous word, which may be more beneficial for discriminating between verb senses. The Durham system performed very well on the words where no training data was given. This highlights its lexicon independence feature, as it was able to take advantage of training performed using SemCor and the WordNet lexicon.

Note
1 The accuracy achieved by a system which always chooses the most frequent sense.

References
Hawkins, P. "DURHAM: A Word Sense Disambiguation System". Ph.D. thesis, Durham University, 1999.
Karov, Y. and S. Edelman. "Similarity-based Word Sense Disambiguation". Computational Linguistics, 24(1) (1996), 41–59.
Sussna, M. "Word Sense Disambiguation for Free-Text Indexing Using a Massive Semantic Network". In Proceedings of the 2nd International Conference on Information and Knowledge Management, 1993, pp. 67–74.

Computers and the Humanities 34: 141–146, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Word Sense Disambiguation Using the Classification Information Model: Experimental Results on the SENSEVAL Workshop
HO LEE1, HAE-CHANG RIM1 and JUNGYUN SEO2
1 Korea University, Seoul, 136, Korea (E-mail: {leeho,rim}@nlp.korea.ac.kr); 2 Sogang University, Seoul, 121, Korea (E-mail: [email protected])

Abstract. A Classification Information Model is a pattern classification model. The model decides the proper class of an input instance by integrating individual decisions, each of which is made with each feature in the pattern. Each individual decision is weighted according to the distributional property of the feature deriving the decision. An individual decision and its weight are represented as classification information which is extracted from the training instances. In word sense disambiguation based on the model, the proper sense of an input instance is determined by the weighted sum of all the individual decisions derived from the features contained in the instance.
Key words: Classification Information Model, classification information, word sense disambiguation

1. Introduction
Word sense disambiguation can be treated as a kind of classification process. Classification is the task of classifying an input instance into a proper class among pre-defined classes, using features extracted from the instance. When the classification technique is applied to word sense disambiguation, an instance corresponds to a context containing a polysemous word. At the same time, a class corresponds to a sense of the word, and a feature to a clue for disambiguation. In this paper, we propose a novel classification model, the Classification Information Model (Lee et al., 1997), and describe the task of applying the model to the case of word sense disambiguation.

2. Classification Information Model
Classification Information Model is a model of classifying the input instance by use of the binary features representing the instance (Lee et al., 1997). We assume that each feature is independent from any other features. In the model, the proper class of an input instance, X, is determined by equation 1.

$$\text{proper class of } X \stackrel{\text{def}}{=} \arg\max_{c_j} \mathrm{Rel}(c_j, X) \quad (1)$$


where cj is the j-th class and Rel(cj, X) is the relevance between the j-th class and X. Since it is assumed that there is no dependency between features, the relevance can be defined as in equation 2.1

$$\mathrm{Rel}(c_j, X) \stackrel{\text{def}}{=} \sum_{i=1}^{m} x_i w_{ij} \quad (2)$$

where m is the size of the feature set, xi is the value of the i-th feature in the input instance, and wij is the weight between the i-th feature and the j-th class. In equation 2, xi has binary value (1 if the feature occurs within context, 0 otherwise) and wij is defined by using classification information. Classification information of a feature (fi) is composed of two components. One is the MPCi,2 which corresponds to the most probable class of the instance determined by the feature. The other is the DSi,3 which represents the discriminating ability of the feature. Assuming we consider only a feature fi, we can determine the proper class to be MPCi and assign DSi to the weight of the decision which is made with the feature fi. Accordingly, wij in equation 2 is defined as in equation 3 with classification information of features.

$$w_{ij} \stackrel{\text{def}}{=} \begin{cases} DS_i & \text{if } c_j = MPC_i \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

In order to define classification information, the model uses the normalized conditional probability, p̂ji, defined in equation 4, instead of the conditional probability of classes given features, p(cj|fi).4

$$\hat{p}_{ji} \stackrel{\text{def}}{=} \frac{\frac{N(c)}{N(c_j)}\, p(c_j \mid f_i)}{\sum_{k=1}^{n} \frac{N(c)}{N(c_k)}\, p(c_k \mid f_i)} = \frac{p(f_i \mid c_j)}{\sum_{k=1}^{n} p(f_i \mid c_k)} \quad (4)$$

In equation 4, N(cj) is the number of instances belonging to the class cj and N(c) is the average number of instances per class. With the normalized conditional probability, both components of classification information are defined as in equations 5 and 6.

$$MPC_i \stackrel{\text{def}}{=} \arg\max_{c_j} \hat{p}_{ji} = \arg\max_{c_j} p(f_i \mid c_j) \quad (5)$$


Table I. Example of features and their classification information

Feature          MPC      DS        Feature          MPC      DS
(–1 very)        512274   0.8173    (+1 and)         512274   0.5202
(±5 very)        512274   0.8756    (±5 and)         512274   0.0275
(±5 been)        512274   0.8651    (±5 we)          512309   1.591
(±5 have)        512309   1.017     (±5 raised)      512309   2.585
(±5 about)       512309   1.619     (–B been very)   512274   2.585
(±B very and)    512274   2.585

$$DS_i \stackrel{\text{def}}{=} \log_2 n - H(\hat{p}_i) = \log_2 n + \sum_{j=1}^{n} \hat{p}_{ji} \log_2 \hat{p}_{ji} \quad (6)$$
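Equations (1)–(6) translate into a very small amount of code. The sketch below is our own rendering, assuming feature counts per class have already been collected from the sense-tagged training data; it is not the authors' implementation.

```python
# Sketch of the Classification Information Model, following equations (1)-(6).
import math

def classification_information(feature_counts, class_sizes):
    """feature_counts[i][j]: number of class-j training instances containing feature i.
    class_sizes[j]: N(c_j).  Returns {i: (MPC_i, DS_i)} as in equations (4)-(6)."""
    n = len(class_sizes)
    info = {}
    for i, per_class in feature_counts.items():
        likelihood = [per_class[j] / class_sizes[j] if class_sizes[j] else 0.0
                      for j in range(n)]                     # p(f_i | c_j)
        total = sum(likelihood)
        if total == 0:
            continue
        p_hat = [v / total for v in likelihood]              # equation (4)
        mpc = max(range(n), key=lambda j: p_hat[j])          # equation (5)
        entropy = -sum(p * math.log2(p) for p in p_hat if p > 0)
        info[i] = (mpc, math.log2(n) - entropy)              # DS_i, equation (6)
    return info

def classify(instance_features, info, n_classes):
    """Equations (1)-(3): a weighted vote of the individual feature decisions."""
    relevance = [0.0] * n_classes
    for i in instance_features:            # x_i = 1 for every feature present
        if i in info:
            mpc, ds = info[i]
            relevance[mpc] += ds           # w_ij = DS_i only where c_j = MPC_i
    return max(range(n_classes), key=lambda j: relevance[j])
```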

3. Word Sense Disambiguation Based on the Classification Information Model
When the classification technique is applied to word sense disambiguation, input instances correspond to contexts containing polysemous words. At the same time, classes correspond to senses of the word, and features to clues for disambiguation. There are, however, various types of clues for sense disambiguation within context. Therefore, disambiguation models should be revised in order to utilize them. In addition to word bigram, a set of positional relationships, part-of-speech sequences, co-occurrences in a window, trigrams and verb-object pairs can be useful clues for word sense disambiguation (Yarowsky, 1996). Therefore, we adopt the feature templates used in Yarowsky (1994) in order to represent all types of clues together. The templates of the condition field in our model are as follows:
1. word immediately to the right (+1 W)
2. word immediately to the left (–1 W)
3. word found in ±k word window (±k W)
4. pair of words at offsets –2 and –1 (–B W W)
5. pair of words at offsets –1 and +1 (±B W W)
6. pair of words at offsets +1 and +2 (+B W W)
The features extracted from sentence 700005 in the testing data set for generous, together with their classification information, are shown in Table I.5 There are two advantages of separating the feature extractor from the disambiguation model. One is the language independent characteristic of the model. In order to apply this approach to other languages, only the substitution of feature templates,


Table II. Experimental results on the SENSEVAL data set

Sense degree     Systems         All words   Nouns   Verbs   Adjectives
Fine-grained     best baseline   0.691       0.746   0.676   0.688
                 best system     0.781       0.845   0.720   0.751
                 our system      0.701       0.773   0.646   0.673
Mixed-grained    best baseline   0.720       0.804   0.699   0.703
                 best system     0.804       0.865   0.748   0.764
                 our system      0.740       0.817   0.682   0.712
Coarse-grained   best baseline   0.741       0.852   0.717   0.705
                 best system     0.818       0.885   0.761   0.766
                 our system      0.752       0.835   0.692   0.715

not the modification of the model itself, is required. The other is flexibility for utilizing linguistic knowledge. If new useful linguistic knowledge is provided, the model can easily utilize it by extending feature templates. 4. Experimental Results Some experimental results on the data set of the SENSEVAL workshop are shown in Table II.6 Since our system uses a supervised learning method, the precision for only trainable words are contained in the table. Among the supervised learning systems, our system was ranked middle in performance, and can generally determine senses better than the best baseline method. However, our system was especially weak in determining the sense of verbs. One possible reason for this weakness is that the system exploited only words and parts-of-speech, though other higher level information, such as syntactic relations, is important for determining senses of verbs. Figure 1 shows the correlation between the size of training data and precision: as the size of the data set is decreased, so too is the level of performance. This tendency is fairly regular and is independent of the part-of-speech of target polysemous words. Therefore, additional techniques for relaxing the data sparseness problem are required for our system. 5. Summary Our model is a supervised learning model, based on classification information. It has several good characteristics. The model can exploit various types of clues because it adopted the feature templates. Moreover, the model is language independent since the feature extractor instead of the disambiguation model handles all


Figure 1. Correlation between the size of training data and system performance.

of the language dependent aspects. The time complexity of the algorithm for learning and applying the model is low7 because the disambiguation process requires only a few string matching operations and lookups to the sets of classification information. However, it is essential for our model that we overcome the data sparseness problem. For Korean polysemous words, we have already tried to relax the data sparseness problem by exploiting automatically constructed word class information. The precision was somewhat improved, but the improvement was not remarkable because the clustering method has difficulty with low-frequency words. For future work, we will combine statistical and rule-based word clustering methods and also adopt similarity-based approaches in our model.

Notes
1 The Classification Information Model can be regarded as a kind of linear classifier because the right side of equation 2 is completely matched with that of a linear classifier. The wij of a linear classifier is generally learned by the least-mean-square algorithm. However, the Classification Information Model directly assigns wij with equation 3. According to Lee (1999), the Classification Information Model learns much faster and makes somewhat more precise decisions than a linear classifier based on the least-mean-square algorithm for the data set used in Leacock et al. (1998).
2 The MPC represents the Most Probable Class.
3 The DS represents the Discrimination Score.
4 According to Lee et al. (1997), the normalized conditional probability is useful for preventing the model from overemphasizing the imbalance of the size of the training data set among classes.
5 The features that did not occur in the training data were removed from the table.
6 There was a mistake in the mapping from internal sense numbers to the official sense numbers in our system. The content of Table II is based on the result of the revision of 16 October 1998.
7 The time complexity of the learning algorithm is O(mn), where m is the size of the feature set and n is the number of senses. The time complexity of applying the algorithm is O(n + log2 m) (Lee, 1999).


References
Leacock, C., M. Chodorow and G. A. Miller. "Using Corpus Statistics and WordNet Relations for Sense Identification". Computational Linguistics, 24(1) (1998), 147–165.
Lee, H., D.-H. Baek and H.-C. Rim. "Word Sense Disambiguation Based on The Information Theory". In Proceedings of the Research on Computational Linguistics Conference, 1997, pp. 49–58.
Lee, H. A Classification Information Model for Word Sense Disambiguation. Ph.D. thesis, Department of Computer Science and Engineering (in Korean), Korea University, 1999.
Yarowsky, D. E. "Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French". In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, pp. 88–95.
Yarowsky, D. E. Three Machine Learning Algorithms for Lexical Ambiguity Resolution. Ph.D. thesis in Computer and Information Science, University of Pennsylvania, 1996.

Computers and the Humanities 34: 147–152, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Word Sense Disambiguation with a Similarity-Smoothed Case Library
DEKANG LIN
Department of Computer Science, University of Manitoba, Canada (E-mail: [email protected])

1. Introduction We present a case-based algorithm for word sense disambiguation (WSD). The case library consists of local contexts of sense-tagged examples in the training corpus. For each target word in the testing corpus, we compare its local context with all known cases and assign it the same sense tag as in the most similar case. Like other corpus-based WSD algorithms, data sparseness is a serious problem. In order to alleviate this problem, an automatically generated thesaurus is employed that allows a match between two local contexts to be established even when different words are involved. 2. Representation of Local Context In many WSD algorithms, the local context of a word is mainly made up of the words surrounding the word. In our approach, the local context of a word is a set of paths in the dependency tree of the sentence that contains the word. The nodes in the dependency tree of a sentence represent words in the sentence. The links represent the dependency relationships between the words. A dependency relationship is an asymmetric binary relationship between two words: the head (or governor) and the modifier (or dependent). The properties of the smallest phrase that contains both the head and the modifier are mostly determined by the head. For example, the dependency tree of the sentence (1a) is shown in (1b). (1) a. Ethnic conflicts are shaking the country b.


Table I. Meanings of dependency labels.

Label   Meaning
compl   the relationship between a word and its first complement
det     the relationship between a noun and its determiner
jnab    the relationship between a noun and its adjectival modifier
subj    the relationship between a subject and a predicate
gen     the relationship between a noun and its genitive determiner
rel     the relationship between a noun and its relative clause

The root node of the dependency tree is "shaking". The arrows of the links point to the modifiers. The labels attached to the links are the types of dependency relationships. Explanations of the labels can be found in Table I. We define the local context of a word in a sentence to be a set of paths in the dependency tree of the sentence between the word and other words in the sentence. Each path is a feature of the word. The features are named by concatenating the link labels and part-of-speech tags of the nodes along the paths. The value of a feature is the root form of the word at the other end of the path. The set of feature-value pairs forms the local context of the word. For example, the local context of "shaking" in (1a) is the feature vector (2).
(2) ((V shake), (V:subj:N conflict), (V:subj:N:jnab:A ethnic), (V:be:Be be), (V:compl:N country), (V:compl:N:det:Det the))
We used a broad-coverage parser, called Principar (Lin, 1993), to parse all the training examples and extract the local context of the sense-tagged words. The local contexts of target words in the testing corpus are similarly constructed. The intended meaning of a target word is determined by finding the sense-tagged example whose local context is most similar to the local context of the target word.
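The feature-naming scheme can be illustrated with a short sketch. The toy parse representation below (a list of labelled head–modifier links over POS-tagged tokens) stands in for Principar's output and is our assumption.

```python
# Sketch: turning a dependency parse into path features like those in (2).
# tokens: [(root_form, pos_tag), ...]; links: [(head_index, label, modifier_index), ...]

def path_features(tokens, links, target, max_len=2):
    """Features named by concatenating link labels and POS tags along each path."""
    neighbours = {}
    for head, label, mod in links:
        neighbours.setdefault(head, []).append((mod, label))
        neighbours.setdefault(mod, []).append((head, label))

    root, pos = tokens[target]
    features = {pos: root}                     # e.g. ("V", "shake")
    frontier = [(target, pos)]
    seen = {target}
    for _ in range(max_len):
        next_frontier = []
        for node, name in frontier:
            for other, label in neighbours.get(node, []):
                if other in seen:
                    continue
                seen.add(other)
                other_root, other_pos = tokens[other]
                feat = f"{name}:{label}:{other_pos}"    # e.g. "V:subj:N"
                features[feat] = other_root             # value: root form at path end
                next_frontier.append((other, feat))
        frontier = next_frontier
    return features
```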


(3) a. The guerrillas’ first urban offensive, which has lasted three weeks so far and shows no sign of ending, has shaken a city lulled by the official propaganda. b.

(4) ((V shake), (V:subj:N offensive), (V:compl:N city), . . . )
Compared with the example in (1), the subject and objects of "shake" in (3) are different words. However, by looking up the automatically generated thesaurus, which contains 11,870 noun entries, 3,644 verb entries and 5,660 adjective/adverb entries, our system found the following entries for "offensive" and "conflict":
offensive: attack 0.183; assault 0.168; raid 0.154; effort 0.153; campaign 0.148; crackdown 0.137; strike 0.129; bombing 0.127; move 0.124; invasion 0.123; initiative 0.121; . . . conflict 0.072; . . .
city: state 0.346; town 0.344; country 0.299; country 0.292; university 0.286; region 0.248; village 0.237; area 0.228; . . .
The similarity between "city" and "country" is 0.292 and the similarity between "offensive" and "conflict" is 0.072. The similarities between these words enable the system to recognize the commonality between the local context (4) and (2). If all distinct words were considered as equally different, the sentence "she shook her head" would have as much commonality to (3a) as (1a), which is that the main verb is "shake". Let v be a feature vector and f be a feature. We use l(f) to denote the length of the path that corresponds to f, F(v) to denote the set of features in v and f(v) to denote the value of feature f in v. For example, suppose v is the feature vector in (2) and f is the feature V:subj:N, then l(f) = 1, f(v) is "conflict" and F(v) = {V, V:subj:N, V:subj:N:jnab:A, V:be:Be, V:compl:N, V:compl:N:det:Det}. The function simTo(v1, v2) measures the similarity of v1 to v2. It is defined as follows:

$$\text{simTo}(v_1, v_2) = \frac{\sum_{f \in F(v_1) \cap F(v_2)} 3^{-l(f)}\, \text{sim}\big(f(v_1), f(v_2)\big)\, \big(r \log P(f(v_1)) + \log P(f(v_2))\big)}{r \sum_{f \in F(v_1)} 3^{-l(f)} \log P(f(v_1)) + \sum_{f \in F(v_2)} 3^{-l(f)} \log P(f(v_2))}$$

where r ∈ [0, 1] is a discount factor to make simTo(v1, v2) asymmetrical; sim(w, w′) is the similarity between two words w and w′, retrieved from the automatically generated thesaurus; and P(f(v)) is the prior probability of the value of feature f of the verb v. Suppose v is the verb "shaking" in (1b) and f is the feature V:subj:N. Then f(v) is [N conflict] and P(f(v)) is estimated by dividing the frequency of [N conflict] in a large corpus by the total number of words in the corpus.
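The measure transcribes directly into code. In the sketch below the word-similarity function sim (a thesaurus lookup) and the prior P over feature values must be supplied by the caller and must return positive probabilities; the path-length helper simply counts label/POS steps in the feature name.

```python
# Direct transcription of simTo(); r = 0.1 follows the paper.
import math

def sim_to(v1, v2, sim, P, r=0.1):
    """v1, v2: dicts mapping feature names (e.g. "V:subj:N") to word values."""
    l = lambda f: f.count(":") // 2          # path length implied by the feature name
    shared = set(v1) & set(v2)
    num = sum(3 ** -l(f)
              * sim(v1[f], v2[f])
              * (r * math.log(P(v1[f])) + math.log(P(v2[f])))
              for f in shared)
    den = (r * sum(3 ** -l(f) * math.log(P(v1[f])) for f in v1)
           + sum(3 ** -l(f) * math.log(P(v2[f])) for f in v2))
    return num / den if den else 0.0
```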


The value $3^{-l(f)}$ is used in simTo(v1, v2) to capture the fact that the longer the path, the smaller the influence that the word at the other end can exert on the target word. Examples in the training corpus often contain irrelevant details that have nothing to do with the meaning of the target word. The feature (V:be:Be be) in (2) is one such example. The decision process should focus more on how much of the unknown instance is covered by a known case. This is achieved by using the discount factor r (set to 0.1 in all our experiments) to make simTo(v1, v2) asymmetrical. The value simTo(v1, v2) is high when v1 possesses most of the features of v2. Extra features in v1 that are not shared by v2 are discounted by r. Given a target word and its local context v, our algorithm tags the target word with the sense tag of the example whose local context v′ maximizes the similarity simTo(v′, v).

4. Experimental Results
We submitted two sets of results to the Senseval workshop. The first one used the entire training corpus to construct the case library. In the second one, the case library contains only the examples from the Hector lexicon. Our official Senseval results are as follows:
Trained with the corpus: recall = .701, precision = .706
Trained with the lexicon: recall = .520, precision = .523
All evaluation results reported in this paper are obtained with the "Coarse Grain" scoring algorithm. Our official system had several serious bugs, which were later corrected. Table II shows our unofficial results after the bug fixes. The column caption "R" stands for recall, "P" stands for precision and "F" stands for the F-measure, which is defined as F = 2×P×R/(P+R). Table II includes the results of several variations of the system that we experimented with:
- To gauge the effect of the amount of training data on WSD, we constructed a case library with the training corpus and another one with the examples from the Hector lexicon.
- To see the advantage of the thesaurus, we also ran the system without it. The thesaurus accounted for about 4–6% increase in both precision and recall. It is somewhat surprising that the benefits of the thesaurus are not greater with the smaller training set than with the larger one.
- To determine how the similarity of cases affects the reliability of the disambiguation decisions, we used a threshold θ to filter the system outputs. The system only assigns a sense tag to a word when the similarity of the most similar case is greater than θ. Table II shows that a low threshold seems to produce slight improvements. A high threshold causes the recall to drop drastically with only modest gain in precision.


Table II. Unofficial evaluation results.

Using paths in dependency tree as features

use Thesaurus   training data     θ = 0               θ = 0.25            θ = 0.5
                                  R     P     F       R     P     F       R     P     F
no              corpus            .698  .692  .695    .687  .702  .694    .622  .728  .670
yes             corpus            .748  .754  .750    .733  .771  .751    .684  .781  .729
no              lexicon           .587  .596  .591    .578  .598  .588    .438  .633  .518
yes             lexicon           .628  .637  .632    .614  .650  .631    .541  .663  .596

Using surrounding words as features

use Thesaurus   training data     θ = 0               θ = 0.25            θ = 0.5
                                  R     P     F       R     P     F       R     P     F
no              corpus            .623  .628  .625    .589  .641  .613    .279  .762  .408
yes             corpus            .671  .678  .674    .377  .787  .510    .121  .873  .213
no              lexicon           .462  .466  .464    .370  .458  .409    .082  .711  .147
yes             lexicon           .506  .512  .509    .143  .741  .240    .029  .810  .056

To evaluate the contribution of parsing in WSD, we experimented with a version of the system which uses surrounding words and their part-of-speech tags as features. For example, the feature vector for sentence (1a) is: ((V shake) (prev3:A ethnic) (prev2:N conflict) (prev1:Be be) next1:Det the) (next2:Det city)) The use of the parser leads to about 7% increase in both recall and precision when the training corpus is used and about 12% in both recall and precision when only the Hector examples are used. 5. Related Work Many recent WSD algorithms are corpus-based (e.g., Bruce and Wiebe, 1994; Ng and Lee, 1996; and Yarowsky, 1994), as well as most systems described in this special issue. Leacock and Chodorow (1998) explored the idea of using WordNet to deal with the data sparseness problem. They observed that as the average number of training examples per word sense is increased from 10 to 200, the improvement in the accuracy (roughly equivalent to the precision measure in Senseval) gained by the use of WordNet decreases from 3.5% to less than 1%. In our experiments, however, the improvement in precision gained by the use of the automatically generated thesaurus increases from 5.2% to 6.9% (θ = 0.25) as the average number of examples per sense is increased from 3.67 (in Hector) to 30.32 (in the training corpus).


6. Conclusion

We presented a case-based algorithm for word sense disambiguation. Our results with the Senseval data showed that the use of the automatically generated thesaurus significantly improves the accuracy of WSD. We also showed that defining local contexts in terms of dependency relationships has a substantial advantage over defining local contexts as surrounding words, especially when the size of the training set is very small.

References

Bruce, R. and J. Wiebe. ‘Word Sense Disambiguation Using Decomposable Models’. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, New Mexico, 1994, pp. 139–145.
Leacock, C. and M. Chodorow. ‘Combining Local Context and WordNet Similarity for Word Sense Identification’. WordNet: An Electronic Lexical Database. MIT Press, 1998, pp. 256–283.
Lin, D. ‘Principle-based Parsing without Overgeneration’. Proceedings of ACL-93. Columbus, Ohio, 1993, pp. 112–120.
Lin, D. ‘Automatic Retrieval and Clustering of Similar Words’. Proceedings of COLING/ACL-98. Montreal, 1998, pp. 768–774.
Ng, H. T. and H. B. Lee. ‘Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach’. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics. Santa Cruz, California, 1996, pp. 40–47.
Yarowsky, D. ‘Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French’. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, New Mexico, 1994, pp. 88–95.

Computers and the Humanities 34: 153–158, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Senseval: The CL Research Experience

KENNETH C. LITKOWSKI
CL Research, 9208 Gue Road, Damascus, MD 20872, USA (E-mail: [email protected])

Abstract. The CL Research Senseval system was the highest performing system among the “All-words” systems, with an overall fine-grained score of 61.6 percent for precision and 60.5 percent for recall on 98 percent of the 8,448 texts on the revised submission (up by almost 6 and 9 percent from the first). The results were achieved with an almost complete reliance on syntactic behavior, using (1) a robust and fast ATN-style parser producing parse trees with annotations on nodes, (2) DIMAP dictionary creation and maintenance software (after conversion of the Hector dictionary files) to hold dictionary entries, and (3) a strategy for analyzing the parse trees in concert with the dictionary data. Further considerable improvements are possible in the parser, exploitation of the Hector data (and representation of dictionary entries), and the analysis strategy, still with syntactic and collocational data. The Senseval data (the dictionary entries and the corpora) provide an excellent testbed for understanding the sources of failures and for evaluating changes in the CL Research system. Key words: word-sense disambiguation, Senseval, dictionary software, analysis of parsing output

1. Introduction and Overview

The CL Research Senseval system was developed specifically to respond to the Senseval call, but made use of several existing components and design considerations. The resultant system, however, provides the nucleus for general natural language processing, with considerable opportunities for investigating and integrating additional components to assist word-sense disambiguation (WSD). We describe (1) the general architecture of the CL Research system (the parser, the dictionary components, and the analysis strategy); (2) the Senseval results and observations on the CL Research performance; and (3) opportunities and future directions.

2. The CL Research System

The CL Research system consists of a parser, dictionary creation and maintenance software, and routines to analyze the parser output in light of dictionary entries. In the Senseval categorization, the CL Research system is an “All-words” system (nominally capable of “disambiguating all content words”). We did not actually attempt to disambiguate all content words, only assigning parts of speech to these other words during parsing. A small separate program was used to convert the Hector dictionary data into a form which could be uploaded and used by the
dictionary software. As the analysis strategy evolved during development, some manual adjustments were made to the dictionary entries, but these could have been handled automatically by simple revisions to the original conversion program. Our system could in theory proceed to disambiguate any word for which Hector-style dictionary information is available.

2.1. THE PARSER

The parser used in Senseval (provided by Proximity Technology) is a prototype for a grammar checker. The parser uses an augmented transition network grammar of 350 rules, each consisting of a start state, a condition to be satisfied (either a non-terminal or a lexical category), and an end state. Satisfying a condition may result in an annotation (such as number and case) being added to the growing parse tree. Nodes (and possibly further annotations, such as potential attachment points for prepositional phrases) are added to the parse tree when reaching some end states. The parser is accompanied by an extensible dictionary containing the parts of speech (and frequently other information) associated with each lexical entry. The dictionary information allows for the recognition of phrases (as single entities) and uses 36 different verb government patterns to create dynamic parsing goals and to recognize particles and idioms associated with the verbs. These government patterns follow those used in (Oxford Advanced Learner’s Dictionary, 1989).1 The parser output consists of bracketed parse trees, with leaf nodes describing the part of speech and lexical entry for each sentence word. Annotations, such as number and tense information, may be included at any node. The parser does not always produce a correct parse, but is very robust since the parse tree is constructed bottom-up from the leaf nodes, making it possible to examine the local context of a word even when the parse is incorrect. The parser produced viable output for almost all the texts in the evaluation corpora, 8443 out of 8448 items (99.94 percent).
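As a rough illustration of how such a grammar can be represented, the sketch below encodes transitions as (start state, condition, end state) triples with an optional annotation. The state names, categories, and toy network are invented for illustration and are not taken from the Proximity parser itself.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ATNTransition:
    start_state: str                  # state the transition leaves from
    condition: str                    # non-terminal to recurse into, or a lexical category
    end_state: str                    # state reached when the condition is satisfied
    annotation: Optional[str] = None  # e.g. number or case added to the growing parse tree

# A toy fragment of a noun-phrase network (hypothetical states and categories).
NP_NETWORK = [
    ATNTransition("NP/0", "DET",  "NP/1"),
    ATNTransition("NP/1", "ADJ",  "NP/1"),
    ATNTransition("NP/1", "NOUN", "NP/2", annotation="number"),
]
```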

2.2. DICTIONARY COMPONENT

The CL Research Senseval system relies on the DIMAP dictionary creation and maintenance software as an adjunct to the parser dictionary. This involved using the existing DIMAP functionality to create dictionary entries from the Hector data (with multiple senses, ability to use phrasal and collocational information, and attribute-value features for capturing information from Hector) and using these entries for examining the parser output. Some features were added by hand using DIMAP, rather than revising the Hector conversion program, in the interests of time; the conversion program can be easily modified to automate the process. These features formed the primary information used in making the sense assignments in Senseval.2


2.3. ANALYSIS STRATEGY

The CL Research system is intended to be part of a larger discourse analysis processing system (Litkowski and Harris, 1997). The most significant part of this system for WSD is a lexical cohesion module intended to explore the observation that, even within short texts of 2 or 3 sentences, the words induce a reduced ontology (i.e., a circumscribed portion of a semantic network such as WordNet (Miller et al., 1990) or MindNet (Richardson, 1997)). The implementation in Senseval does not attain this objective, but does provide insights for further development of a lexical cohesion module.

The CL Research system involves: (1) preprocessing the Senseval texts; (2) submitting the sentences to the parser; (3) examining the parse results to identify the appropriate DIMAP entry (relevant only where Hector data gave rise to distinct entries for derived forms and idioms); (4) examining each sense in the DIMAP entry to filter out non-viable senses and adding points to senses that seem preferred based on the surrounding context of a tagged item; and (5) sorting the still viable senses by score to select the answer to be returned. The DIMAP dictionary contained all Hector senses, phrases, and collocations; step 3 particularly focused on recognizing phrases and collocations and selecting the appropriate DIMAP entry (important, for example, in recognizing Hector senses for milk shake and onion dome).

Step 4 is the largest component of the CL Research system and where the essence of the sense selection is made. In this step, we iterate over the senses of the DIMAP entry, keeping an array of viable senses (each with an accompanying score), examining the features for the sense. The features were first used to filter out inappropriate senses. The parse characteristics of the tagged word were examined and flags set based on the part of speech (such as number for nouns and verbs, whether a noun modified another noun, whether a verb had an object, and whether a verb or adjective was a past tense, past participle, or present participle); these characteristics were sometimes used to retrieve a different DIMAP entry (to get an idiom, for example). The flags were then used in conjunction with the Hector grammar codes to eliminate senses for such reasons as countability of nouns, number mismatch (e.g., when a verb required a plural subject), transitivity incompatibility (an intransitive sense when a verb object was present), tense incompatibility (e.g., if a verb sense could never be passive and the past tense flag was set or when a gerundial was required and not present), when there was no modified noun for a noun-modifier sense, and when an adjective sense was required to be in the superlative form.

The system examined grammar codes indicating that a sense was to be used “with” or “after” a specific word or part of speech; if the condition was satisfied, 3 points were added to the sense’s score. Hector clues specifying collocates (e.g., experience for bitter) were used to add 5 points for a sense; clues specifying semantic classes have not yet been implemented.
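A schematic rendering of this filter-and-score step is sketched below. The data structures (senses carrying grammar codes and collocate clues, parse-derived flags) and the particular filter conditions shown are hypothetical simplifications; only the overall flow — eliminate incompatible senses, add 3 points for satisfied "with"/"after" codes and 5 points for matched collocates — follows the description above.

```python
def incompatible(sense, flags):
    # Minimal stand-ins for the grammar-code filters described above:
    # drop intransitive senses when the verb has an object, and senses
    # that can never be passive when a past-participle flag is set.
    if sense.get("intransitive_only") and flags.get("has_object"):
        return True
    if sense.get("never_passive") and flags.get("past_participle"):
        return True
    return False

def score_senses(senses, flags, context_words):
    """Filter the senses of a DIMAP entry, then score the survivors."""
    viable = []
    for sense in senses:
        if incompatible(sense, flags):
            continue
        score = 0
        # +3 when a "with"/"after" grammar code is satisfied by the context.
        for word in sense.get("with_or_after", []):
            if word in context_words:
                score += 3
        # +5 when a Hector clue specifying a collocate is present.
        for collocate in sense.get("collocates", []):
            if collocate in context_words:
                score += 5
        viable.append((sense, score))
    # The surviving senses are later sorted by score, with ties broken by
    # Hector order (described further below).
    return sorted(viable, key=lambda pair: -pair[1])
```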


Table I. Precision for major tasks

Task            Number of texts   Fine grain   Mixed grain   Coarse grain   Attempted
Overall         8448              61.6         66.0          68.3           98.13
Noun            2756              71.1         75.2          78.6           97.86
Verb            2501              53.5         57.8          59.6           98.44
Adjective       1406              61.7         65.2          69.1           98.15
Indeterminate   1785              58.4         64.0          64.2           98.10

The kind feature of Hector definitions (e.g., indie band, jazz band) was generalized into a quasi-regular-expression recognizer for context preceding and following the tagged word (e.g., “on [prpos] =” to recognize any possessive pronoun for on one’s knees). Many of the phrasal or idiom entries were transformed manually3 into kind features in DIMAP senses, facilitating idiom recognition or serving as a backup when the parser did not pick up a phrase as an entity. This mechanism was also used for Hector clues that specified particular words or parts of speech. The kind features were used as strong indicators in matching a sense. When a kind equation was satisfied, any viable senses up to that point were dropped and only senses that satisfied a kind equation were then allowed as viable. Overall, this mechanism only added a couple of percentage points; however, for some words with several kind equations, the effect was much more significant.

After elimination of senses, the viable senses were sorted by score and the top score was the sense selected. In case of ties (such as when no points were added for any senses), the most frequent sense (as reflected in the Hector order) was chosen.

3. CL Research System Results

Table I shows the CL Research system results for the major Senseval tasks. Since most tasks have a high percent attempted, the recall for each task is only slightly lower (around one percent). The CL Research system was the top performing “All-words” system in both the initial and revised submissions for these major tasks. For the initial submission, precision was 6 percent lower and recall 9 percent lower; this was due to the fact that the percent attempted in the initial submission was 92.74 percent. Thus, most of the improvement between the initial and revised submissions resulted from simply being able to provide a guess for about 400 additional tasks. For the initial submission, the CL Research system was the best system on 19 of the 41 individual tasks, above average for 12 more, and worst for 2 tasks.

Table II shows the CL Research system results for three tasks. For onion and generous, the results changed little from the initial to the revised submission. For onion, the results were at the top for the initial submission and second for the revised submission.


Table II. Precision for selected tasks

Task         Number of texts   Fine grain   Mixed grain   Coarse grain   Attempted
Onion-n      214               84.6         84.6          84.6           97.20
Generous-a   227               37.7         37.7          37.7           98.24
Shake-p      356               66.0         68.9          69.8           96.63

For generous, the results were only one above the worst performing system. For shake, there was a seven percent increase at the fine-grained level, with the system as the second-best for the initial submission and the top system for the revised submission; a considerable portion of the improvement was the ability to make a guess for an additional 12 percent of the texts between the initial and revised submissions (primarily due to correcting a faulty mechanism for recognizing the phrase shake up).

These examples illustrate characteristics of the CL Research system. For onion, which has a low entropy (0.86), the high precision is due to the fact that the highest frequency sense is ordered first in the DIMAP dictionary; there was no semantic discrimination in use and the system guessed the first sense. The same is true of generous, where, however, the entropy was much higher (2.30). Since, again, the CL Research system had little semantic information, the most frequent sense was guessed in the largest percentage of cases. Because of the higher entropy, the guesses were more often incorrect and the performance of the CL Research system was very poor.

For shake, there was a much higher entropy (3.70). This might have led to a lower performance, except that there was a considerable amount of additional information in the Hector definitions that permitted sense discrimination. Generally, the system was able to recognize the difference between noun and verb senses. Among the nouns, there were several “kinds” (milk shake, handshake) that were readily recognized. Among the verbs, the CL Research system was able to recognize a large number of phrases, not only specific idioms (shake a leg, shake off), but also, through the extension of the “kind” mechanism, phrases that could include optional elements, both specific words and words of a specific part of speech (shake one’s head, shake in one’s boots).

4. Discussion and Future Directions

The CL Research system contains many opportunities for improvement. Many of the wrong guesses were due to incorrect parses; we can expect significant improvement in overall results from parser changes. Further, we did not fully exploit the information available in the Hector data; we can expect some improvements in
this area. Finally, we can expect some improvements from semantic processing, working off a semantic network like WordNet or MindNet. Since this level of WSD was achieved with very little semantics and with likely improvements from further exploitation of the data, the CL Research system results are consistent with the suggestion in (Wilks and Stevenson, 1997) of achieving 86 percent correct tagging from sense frequency ordering, grammar codes, and collocational data. In addition, our data suggest that WSD can be accomplished within small windows (i.e., short surrounding context) of the tagged word. Finally, the Senseval system (the dictionary entries and the corpora) provides an excellent testbed for understanding the sources of failures and for evaluating changes in the CL Research system.

Notes
1 Source C code (8,000 lines) for the parser, which compiles in several Unix and PC environments, is available upon request from the author, along with 120 pages of documentation.
2 An experimental version of DIMAP, containing all the functionality used in Senseval, is available for immediate download at http://www.clres.com.
3 Most of these kind equations are amenable to automatic generation, but this was not developed for the current Senseval submission.

References

Litkowski, K.C. and M.D. Harris. Category Development Using Complete Semantic Networks, Technical Report 97-01. Gaithersburg, MD: CL Research, 1997.
Miller, G.A., R. Beckwith, C. Fellbaum, D. Gross and K.J. Miller. “Introduction to WordNet: An On-Line Lexical Database”. International Journal of Lexicography, 3(4) (1990), 235–244.
Oxford Advanced Learner’s Dictionary, 4th edn. Oxford, England: Oxford University Press, 1989.
Richardson, S.D. Determining Similarity and Inferring Relations in a Lexical Knowledge Base [Diss]. New York, NY: The City University of New York, 1997.
Wilks, Y. and M. Stevenson. “Sense Tagging: Semantic Tagging with a Lexicon”. In: Tagging Text with Lexical Semantics: Why, What, and How? SIGLEX Workshop. Washington, D.C.: Association for Computational Linguistics, April 4–5, 1997.

Computers and the Humanities 34: 159–164, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Selecting Decomposable Models for Word-Sense Disambiguation: The Grling-Sdm System∗

TOM O’HARA1, JANYCE WIEBE1 and REBECCA BRUCE2
1 Department of Computer Science and Computing Research Laboratory, New Mexico State University, Las Cruces, NM 88003-0001, USA (E-mail: {tomohara,wiebe}@cs.nmsu.edu); 2 Department of Computer Science, University of North Carolina at Asheville, Asheville, NC 28804-3299, USA (E-mail: [email protected])

Abstract. This paper describes the grling-sdm system, which is a supervised probabilistic classifier that participated in the 1998 SENSEVAL competition for word-sense disambiguation. This system uses model search to select decomposable probability models describing the dependencies among the feature variables. These types of models have been found to be advantageous in terms of efficiency and representational power. Performance on the SENSEVAL evaluation data is discussed.

1. Introduction

A probabilistic classifier assigns the most probable sense to a word, based on a probabilistic model of the dependencies among the word senses and a set of input features. There are several approaches to determining which models to use. In natural language processing, fixed models are often assumed, but improvements can be achieved by selecting the model based on characteristics of the data (Bruce and Wiebe, 1999).

The grling-sdm1 system was developed to test the use of probabilistic model selection for word-sense disambiguation in the SENSEVAL competition (Kilgarriff and Rosenzweig, this volume). Shallow linguistic features are used in the classification model: the parts of speech of the words in the immediate context and collocations2 that are indicative of particular senses. Manually-annotated training data is used to determine the relationships among the features, making this a supervised learning approach. However, no additional knowledge is incorporated into the system. In particular, the HECTOR definitions and examples are not utilized.

Note that this model selection approach can be applied to any discrete classification problem. Although the features we use are geared towards word-sense disambiguation, similar ones can be used for other problems in natural language processing, such as event categorization (Wiebe et al., 1998). This paper assumes basic knowledge of the issues in empirical natural language processing (e.g., the sparse data problem). Jurafsky and Martin (1999) provide a good introduction.


2. The Grling-Sdm System

The focus in our research is probabilistic classification, in particular, on automatically selecting a model that captures the most important dependencies among multi-valued variables. One might expect dependencies among, for example, variables representing the part-of-speech tags of adjacent words, where each variable might have the possible values noun, verb, adjective, etc. In practice, simplified models that ignore such dependencies are commonly assumed. An example is the Naive Bayes model, in which all feature variables are conditionally independent of each other given the classification variable. This model often performs well for natural language processing problems such as word-sense disambiguation (Mooney, 1996). However, Bruce and Wiebe (1999) show that empirically determining the most appropriate model yields improvements over the use of Naive Bayes. The grling-sdm system therefore uses a model search procedure to select the decomposable model describing the relationships among the feature variables (Bruce and Wiebe, 1999). Decomposable models are a subset of graphical probability models for which closed-form expressions (i.e., algebraic formulations) exist for the joint distribution. As is true for all graphical models, the dependency relationships in decomposable models can be depicted graphically.

Standard feature sets are used in grling-sdm, including the parts of speech of the words in the immediate context, the morphology of the target word, and collocations indicative of each sense (see Table I). The collocation variable colli for each sense Si is binary, corresponding to the absence or presence of any word in a set specifically chosen for Si.3 There are also four adjacency-based collocational features (WORD ± i in Table I), which were found to be beneficial in other work (Pedersen and Bruce, 1998; Ng and Lee, 1996). These are used only in the revised system, improving the results discussed here somewhat.

A probabilistic model defines the distribution of feature variables for each word sense; this distribution is used to select the most probable sense for each occurrence of the ambiguous word. Several different models for this distribution are considered during a greedy search through the space of all of the decomposable models for the given variables. A complete search would be impractical, so at each step during the search a locally optimal model is generated without reconsidering earlier decisions (i.e., no backtracking is performed). During forward search, the procedure starts with a simple model, such as the model for complete independence or Naive Bayes, and successively adds dependency constraints until reaching the model for complete dependence or until the termination condition is reached (Bruce and Wiebe, 1999). An alternative technique, called backward search, proceeds in the opposite direction, but it is not used here.


Table I. Features used in grling-sdm.

Feature   Description
pos–2     part-of-speech of second word to the left
pos–1     part-of-speech of word to the left
pos       part-of-speech of word itself (morphology)
pos+1     part-of-speech of word to the right
pos+2     part-of-speech of second word to the right
colli     occurrence of a collocation for sense i
word–2    stem of second word to the left
word–1    stem of word to the left
word+1    stem of word to the right
word+2    stem of second word to the right

Figure 1. Forward model search for onion-n

For example, Figure 1 depicts the forward model search for onion-n. This illustration omits the part-of-speech feature variables which were discarded during the search.4 The nodes for the collocational feature variables are labeled by the sense mnemonic: ‘veg’ for sense 528347 and ‘plant’ for sense 528344. In addition, the node ‘other’ covers collocations for miscellaneous usages (e.g., proper nouns). In each step, a new dependency is added to the model. This usually results in one new edge in the graph. However, in step (d), two edges are added as part of a three-way dependency involving the classification variable (onion) and the two main collocation feature variables (veg and plant).

Instead of selecting a single model, the models are averaged using the Naive Mix (Pedersen and Bruce, 1997), a form of smoothing. The system averages three sets of models: the Naive Bayes model; the final model generated by forward search from the Naive Bayes model; and the first k models generated by forward search from the model of independence.
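The shape of such a greedy forward search is sketched below in Python. Here a model is just a set of pairwise dependencies and the scoring function is supplied by the caller; the real procedure restricts candidates to decomposable (chordal) models and evaluates them with the statistical criteria of Bruce and Wiebe (1999), neither of which is reproduced in this simplified sketch.

```python
from itertools import combinations

def forward_model_search(variables, score, max_steps=None):
    """Greedy forward search over dependency structures.

    variables: list of variable names (class variable plus features).
    score: function mapping a set of dependencies (frozenset of variable
    pairs) to a goodness value.  Starting from complete independence, the
    single dependency that most improves the score is added at each step,
    with no backtracking, until no addition helps.
    """
    model = frozenset()                      # no dependencies: complete independence
    all_edges = {frozenset(pair) for pair in combinations(variables, 2)}
    history = [model]
    while max_steps is None or len(history) <= max_steps:
        candidates = [model | {edge} for edge in all_edges - model]
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) <= score(model):      # termination: no improvement
            break
        model = best
        history.append(model)
    return history                           # the sequence of selected models

# Toy usage: prefer models linking the class variable to the 'veg' and
# 'plant' collocation features (an invented scoring function).
if __name__ == "__main__":
    variables = ["onion", "veg", "plant", "other"]
    wanted = {frozenset(("onion", "veg")), frozenset(("onion", "plant"))}
    models = forward_model_search(variables, score=lambda m: len(m & wanted))
    print(models[-1])
```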

3. Analysis of Performance Results

The overall results for the supervised systems participating in SENSEVAL indicate that our system performs at roughly an average level.


Figure 2. Forward search models selected for onion-n and generous-a.

This section discusses how the system performs on the three tasks highlighted in the SENSEVAL discussions: onion-n, generous-a, and shake-p. More details can be found in (O’Hara et al., 1998).

Figure 2 shows the final model selected during forward model search for onion-n. The nodes labeled ‘ID mnemonic’ (e.g., ‘528344 plant’) correspond to the COLLi features discussed earlier, with the lexicographer sense mnemonic included for readability. These are binary feature variables indicating the presence or absence of words found to be indicative of sense ID. Note that there are only collocational feature variables for two of the five possible senses, since three cases don’t occur in the training data. For the evaluation data, the system always selects the vegetable sense of “onion” (528347). This problem is due to insufficient training data, resulting in poor parameter estimates. For instance, there are 15 test sentences containing the sense related to “spring onion” (528348) but no instances of this sense in the training data.

Figure 2 also shows the final model selected during the forward search performed for generous-a. Note the dependencies between the collocation feature variables for senses 512274 (unstint), 512277 (kind), and 512310 (copious). The system has trouble distinguishing these cases. Bruce and Wiebe (1999) describe statistical tests for diagnosing such classification errors. The measure of form diagnostic assesses the feature variable dependencies of a given model, which determine the parameters to be estimated from the training data. The measure is evaluated by testing and training on the same data set (Bruce and Wiebe, 1999). Since all the test cases have already been encountered during training, there can be no errors due to insufficient parameter estimates (i.e., no sparse data problems). For the model shown above, this diagnostic only achieves an accuracy of 48.9%, suggesting that important dependencies are not specified in the model. The measure of feature set is a special case of the measure of form diagnostic using the model of complete dependence. Since all dependencies are considered, errors can only be due to inadequate features. This diagnostic yields an accuracy of 95.2%, indicating that most of the word senses are being distinguished sufficiently, although there
is some error. Thus, the problem with generous-a appears to result primarily from selection of overly simplified model forms.5

We use a fixed Naive Bayes model for shake-p and other cases with more than 25 senses. Running this many features is not infeasible for our model selection approach; however, the current implementation of our classifier has not been optimized to handle a large number of variables. See (O’Hara et al., 1998) for an analysis of this case.

4. Conclusion

This paper describes the grling-sdm system for supervised word-sense disambiguation, which utilizes a model search procedure. Overall, the system performs at the average level in the SENSEVAL competition. Future work will investigate (1) better ways of handling words with numerous senses, possibly using hierarchical model search (Koller and Sahami, 1997), and (2) ways to incorporate richer knowledge sources, such as the HECTOR definitions and examples.

Notes
∗ This research was supported in part by the Office of Naval Research under grant number N00014-95-1-0776. We gratefully acknowledge the contributions to this work by Ted Pedersen.
1 GraphLing is the name of a project researching graphical models for linguistic applications. SDM refers to supervised decomposable model search.
2 The term “collocation” is used here in a broad sense, referring to a word that, when appearing in the same sentence, is indicative of a particular sense.
3 A word W is chosen for sense Si if (P(Si | W) − P(Si)) / P(Si) ≥ 0.2, that is, if the relative percent gain in the conditional probability over the prior probability is 20% or higher. This is a variation on the per-class, binary organization discussed in (Wiebe et al., 1998).
4 After model search, any feature variables that are not connected to the classification variable are discarded.
5 For onion-n, the measure of form diagnostic achieves an accuracy of 79.9% for the model above, and the measure of feature set diagnostic achieves an accuracy of 96.7%.

References

Bruce, R. and J. Wiebe. “Decomposable modeling in natural language processing”. Computational Linguistics, 25(2) (1999), 195–207.
Jurafsky, D. and J. H. Martin. Speech and Language Processing. Upper Saddle River, NJ: Prentice-Hall, 1999.
Koller, D. and M. Sahami. “Hierarchically classifying documents using very few words”. Proc. 14th International Conference on Machine Learning (ICML-97). Nashville, Tennessee, 1997, pp. 170–178.
Mooney, R. “Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning”. Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP-96). Philadelphia, Pennsylvania, 1996, pp. 82–91.
Ng, H. T. and H. B. Lee. “Integrating multiple knowledge sources to disambiguate word sense: an exemplar-based approach”. Proc. of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96). Santa Cruz, California, 1996, pp. 40–47.
O’Hara, T., J. Wiebe and R. Bruce. “Selecting decomposable models for word-sense disambiguation: the grling-sdm system”. Notes of SENSEVAL Workshop. Sussex, England, September 1998.
Pedersen, T. and R. Bruce. “A new supervised learning algorithm for word sense disambiguation”. Proc. of the 14th National Conference on Artificial Intelligence (AAAI-97). Providence, Rhode Island, 1997, pp. 604–609.
Pedersen, T. and R. Bruce. “Knowledge-lean word-sense disambiguation”. Proc. of the 15th National Conference on Artificial Intelligence (AAAI-98). Madison, Wisconsin, 1998, pp. 800–805.
Wiebe, J., K. McKeever and R. Bruce. “Mapping collocational properties into machine learning features”. Proc. 6th Workshop on Very Large Corpora (WVLC-98). Association for Computational Linguistics SIGDAT, Montreal, Quebec, Canada, 1998.

Computers and the Humanities 34: 165–170, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Simple Word Sense Discrimination Towards Reduced Complexity

KEITH SUDERMAN
Department of Computer Science, University of Manitoba, Winnipeg, Canada R3T 2N2 (E-mail: [email protected])

Abstract. Wisdom is a system for performing word sense disambiguation (WSD) using a limited number of linguistic features and a simple supervised learning algorithm. The most likely sense tag for a word is determined by calculating co-occurrence statistics for words appearing within a small window. This paper gives a brief description of the components in the Wisdom system and the algorithm used to predict the correct sense tag. Some results for Wisdom from the Senseval competition are presented, and directions for future work are also explored. Key words: Senseval, statistical WSD, word sense disambiguation

1. Introduction

For any non-trivial problem in computer science, reducing complexity is an important goal. As the problems become more difficult the complexity of solutions tends to increase. Word Sense Disambiguation (WSD) is a non-trivial task, and as the sophistication of the systems that perform WSD increases, the complexity of these systems also increases. Unfortunately, this increase in complexity is frequently exponential rather than linear or (ideally) logarithmic.

This paper describes Wisdom, a WSD system developed for a graduate level course in Natural Language Understanding (NLU) and then expanded to take part in the Senseval competition.1 The initial Wisdom system was an attempt to study the predictive power of co-occurrence statistics without considering other linguistic features. To select a sense tag, the initial system calculated co-occurrence statistics for words within a four-word window. Larger windows were tested; however, the best results were achieved across all words when a small word window was used. This agrees with past observations by Kaplan (1955), Choueka and Lusignan (1985), and others, that humans require only a two-word window to distinguish the correct sense of a word.

For the Senseval exercise, Wisdom was augmented to construct a dependency tree for the context sentence and consult a thesaurus to overcome sparse training data. Wisdom performs very well considering the limited amount of knowledge employed, achieving an overall fine-grained precision of 69.0% with 60.8% recall
on 7,444 words attempted. Only the English language tasks were tested, but the system can be trained with a tagged corpus in any language.

2. Statistical Word Sense Disambiguation

Wisdom can disambiguate any word ω for which a previously tagged corpus S is available. The task of assigning sense tags to the occurrences of ω in an untagged corpus T is divided into two phases, a training phase and a classification phase. During the training phase relevant words are extracted from the sentences in S and a count of the number of times they occur with each possible sense of the word ω is maintained. After the sentences in S have been examined and relevant words counted, the sentences in T are presented and each occurrence of ω is sense-tagged. Identification of relevant words is discussed in detail in the next section.

2.1. RELEVANT WORDS AND PHRASES

Initially, relevant words are considered to be those words immediately adjacent to ω in the context sentence. Empirical testing suggests that only the two words immediately preceding ω and the two words immediately following ω should be considered, including function words and other common stop words. For example, for the adjective generous in the sentence: “They eat reasonably generous meals and they snack in between.” eat, reasonably, meals, and and are considered to be relevant words.

In addition to maintaining occurrence counts for single relevant words, frequencies for combinations of adjacent words are also computed to enable recognition of commonly occurring phrases. If the word ω appears in the phrase “u v ω x y” then frequency statistics are also maintained for the strings uv, vx, xy, and uvxy. These are referred to as relevant phrases.
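The sketch below illustrates one way to extract these relevant words and phrases. The whitespace tokenization, the handling of sentence boundaries, and the space-joined form of the phrase strings are simplifying assumptions made for illustration.

```python
def relevant_words_and_phrases(tokens, position):
    """Return the two words on either side of the target (the relevant
    words) and the adjacent-word combinations uv, vx, xy and uvxy (the
    relevant phrases) for a target at the given index."""
    u = tokens[position - 2] if position >= 2 else ""
    v = tokens[position - 1] if position >= 1 else ""
    x = tokens[position + 1] if position + 1 < len(tokens) else ""
    y = tokens[position + 2] if position + 2 < len(tokens) else ""
    words = [w for w in (u, v, x, y) if w]
    # Phrases joined with a space here for readability; the exact string
    # form used by the system is an assumption.
    phrases = [" ".join(p) for p in ((u, v), (v, x), (x, y), (u, v, x, y))]
    return words, phrases

# "They eat reasonably generous meals and they snack in between."
tokens = "they eat reasonably generous meals and they snack in between".split()
words, phrases = relevant_words_and_phrases(tokens, tokens.index("generous"))
print(words)    # ['eat', 'reasonably', 'meals', 'and']
print(phrases)  # ['eat reasonably', 'reasonably meals', 'meals and', 'eat reasonably meals and']
```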

2.2. TRAINING

During the training phase the sentences in S are parsed, the position of the word ω is determined, relevant words and phrases are identified, and the number of times each relevant word or phrase co-occurs with ω is counted. After all relevant words have been recorded, the occurrence counts are converted to conditional probabilities P(i | r), that is:

p_i = r_i / \sum_{j=1}^{n} r_j

where r_i is the number of times the relevant word r has appeared with sense i, and n is the number of possible sense tag assignments to ω. This yields the probability that ω is an occurrence of sense i given the relevant word r.
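A compact rendering of this counting and normalization is sketched below. The input format, the window handling, and the omission of relevant phrases are simplifying assumptions; only the co-occurrence counts and their conversion to conditional probabilities follow the formula above.

```python
from collections import defaultdict

def train(tagged_sentences):
    """Count how often each relevant word co-occurs with each sense of the
    target word, then convert the counts to conditional probabilities.

    tagged_sentences: iterable of (tokens, position, sense) triples, where
    position is the index of the target occurrence.  Returns a dict
    mapping relevant word -> {sense: P(sense | word)}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, pos, sense in tagged_sentences:
        window = tokens[max(0, pos - 2):pos] + tokens[pos + 1:pos + 3]
        for word in window:
            counts[word][sense] += 1

    probabilities = {}
    for word, per_sense in counts.items():
        total = sum(per_sense.values())           # sum over r_j, j = 1..n
        probabilities[word] = {s: c / total for s, c in per_sense.items()}
    return probabilities

# Hypothetical training data for "generous" with invented sense labels.
examples = [
    ("they eat reasonably generous meals and they snack".split(), 3, "ample"),
    ("a generous donation to the charity".split(), 1, "giving"),
]
model = train(examples)
```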


Figure 1. Dependency tree produced by Minipar.

After parsing the sentences in the training set, the Hector dictionary is searched for special cases of the word ω. A special case is a word, compound word, or morphological form of a word that has only one possible sense assignment. For example, waist band, steel band, and t-shirt all appear in the dictionary with unique sense tags, while wooden spoon and wristband have two possible sense tags and are not, therefore, considered as special cases. Sense tags are assigned to special cases by performing a dictionary lookup and assigning the indicated sense. It should be noted that morphological forms of the word ω are treated separately as distinct words rather than as different forms of the same word. This is an artifact of the original system that used a simple tokenizer, rather than fully parsing the sentence.

After the training phase and before classification, entropy values are calculated for co-occurring words, and all those with entropy above a predetermined threshold are considered poor sense indicators for ω and subsequently ignored. Entropy is calculated for word r as:

entropy(r) = \sum_{i=1}^{n} -v_i \log_2(v_i)

where v_i is the conditional probability P(sense_i | r), and n is the number of possible sense assignments to ω. The threshold used to determine whether a relevant word is ambiguous depends on ω, as well as other factors such as the size and source of the corpus. The system that participated in Senseval simply used the same entropy threshold for all words.
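The entropy filter can then be applied to the probabilities produced by the training sketch above, as shown below. The threshold value is illustrative only; the submitted system used a single global threshold whose value is not reproduced here.

```python
import math

def entropy(sense_probs):
    # entropy(r) = sum_i -v_i * log2(v_i), with v_i = P(sense_i | r)
    return sum(-v * math.log2(v) for v in sense_probs.values() if v > 0)

def prune_ambiguous_indicators(probabilities, threshold=1.0):
    """Drop relevant words whose sense distribution is too flat (high
    entropy) to be a reliable indicator of any one sense."""
    return {word: dist for word, dist in probabilities.items()
            if entropy(dist) <= threshold}
```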

2.3. ADDITIONAL SOURCES OF KNOWLEDGE

If the size of the training set is small, the number of reliable indicators may be insufficient to identify infrequently occurring senses. In such cases, Wisdom uses two additional knowledge sources: First, sentences are parsed with Minipar (Lin, 1993; 1998), a broad coverage parser for English. Minipar generates a dependency tree that specifies, for each word in the sentence, the head of the phrase in which it occurs. For example, for the above sentence Minipar generates the dependency tree shown in Figure 1.


The dependency tree is used to identify the phrase containing the word ω. Relevant words are restricted to adjacent words in the same phrase in the target sentence. For the above example, the relevant words are reasonably, meals, eat, and they. Since parsing with Minipar is a recent addition to the system, this is the only information provided by Minipar that is currently used by Wisdom, although there are clearly possibilities for enhancing the system with additional information from the parse.

While the use of dependency trees improves the quality of the relevant words, it does not overcome the problem of a small training set. Therefore, during classification, if none of the relevant words has been previously encountered, Wisdom consults an electronic thesaurus (Lin, 1998) to find words similar to the relevant words. Each of these is assigned a similarity value by the thesaurus and words above a predetermined threshold are retained.

2.4. CLASSIFICATION

After training, sentences from the test set are presented to the system one at a time for classification, and the relevant words are extracted. The conditional probabilities for relevant words that have been encountered in the training set are summed, and ω is tagged with the sense that has the highest sum of probabilities. If more than one sense shares the highest sum, one of them is chosen at random.

If the system is unable to determine a possible sense assignment, it will attempt to guess the correct sense tag. The sense to be used as a guess is determined during training. A set of 100 trial runs is performed for each possible sense tag. In each set of runs a different sense is used as the default guess: the first sense is used in the first set, the second sense is used in the second set, etc. During each trial run a portion of the training set is drawn at random and presented to the system for training. The remainder of the training set is classified and the score is recorded. The sense that yields the best average score is used as the default guess when classifying the hold-out data.

Interestingly, the most frequently occurring sense is rarely the best sense to select when there are no other cues, since if the training set is sufficiently large there is typically some evidence (in the form of previously encountered relevant words) for the most frequently occurring senses. Therefore, when no relevant words are found, we may assume that this is an instance of a less frequently occurring sense of ω. Use of this information in Wisdom is currently under exploration.
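Classification then reduces to summing the stored conditional probabilities over the relevant words, as sketched below. The random tie-breaking and the fall-back to a precomputed default sense follow the description above; the default sense itself would be chosen by the trial-run procedure just described, which is not repeated in this sketch.

```python
import random
from collections import defaultdict

def classify(tokens, position, probabilities, default_sense=None):
    """Tag the target occurrence with the sense whose summed conditional
    probabilities over the relevant words is highest.

    probabilities: dict word -> {sense: P(sense | word)} from training.
    Falls back to default_sense (chosen by trial runs during training)
    when none of the relevant words was seen in training.
    """
    window = tokens[max(0, position - 2):position] + tokens[position + 1:position + 3]
    scores = defaultdict(float)
    for word in window:
        for sense, p in probabilities.get(word, {}).items():
            scores[sense] += p
    if not scores:
        return default_sense            # guess the precomputed default sense
    best = max(scores.values())
    tied = [sense for sense, s in scores.items() if s == best]
    return random.choice(tied)          # random choice among tied senses
```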


Table I. Overall score for All-trainable words

               Precision   Recall   Attempted   Position
Fine grain     69.0        60.8     7044        5
Mixed grain    71.8        63.3     7444        6
Coarse grain   73.8        65.0     7444        7

Table II. Fine (Coarse)-grained scores by part of speech

             Precision     Recall        Attempted   Position
Nouns        73.4 (79.6)   56.4 (61.2)   2914        6 (7)
Verbs        64.3 (68.3)   64.2 (68.2)   2904        6 (7)
Adjectives   72.1 (76.4)   65.9 (69.8)   1284        5 (4)

3. Results

The results presented here are those from the September competition. No results were submitted for the second evaluation in October. There are still several obvious problems with the system, which are currently under investigation. For example, Wisdom attempted to assign sense tags to five more verbs than the human annotators, which indicates either an incorrect part of speech tagging by the parser or a problem in Wisdom itself.

Table I shows the overall system performance for all trainable words, and Table II shows system performance by part of speech. In relation to other systems, Wisdom performed better than expected, typically finishing in the top five to ten systems for all tasks, and performing slightly better on adjectives than nouns or verbs. While Wisdom's coarse-grained scores tended to be higher than its fine-grained scores, they did not increase as much as those of other systems, and Wisdom typically fell behind the other systems on coarse-grained sense distinctions. However, for all trainable adjectives, Wisdom achieved the fifth highest fine-grained score and the fourth highest coarse-grained score.

4. Future Work

Wisdom represents a first attempt to develop a system for WSD. The original system was developed for a graduate level AI course and was not intended to be extended; however, performance of the system in the Senseval exercise, especially given the simplicity of the system's design, suggests it may be worthwhile to continue to improve the system.

In particular, because Wisdom is a relatively simple system, it should be possible to develop Wisdom in such a way as to enable a systematic study of the contribution of different types of information to the disambiguation task. Currently, most systems employ various kinds of contextual and external information (see Ide and Véronis (1998) for a comprehensive survey). Typically, the contribution of each type of information, especially for disambiguating words in different parts of speech etc., is difficult or impossible to determine, and no systematic study has, to
my knowledge, yet been conducted. However, given the complexity of WSD, such a study could shed light on some of the subtleties involved.

To accomplish this, baseline performance levels need to be firmly established for the system in its current state before other sources of knowledge are added. The results from the Senseval competition need to be studied in detail to determine what, if any, relation exists between the words Wisdom can correctly tag and those it cannot. In addition, parameters need to be tailored specifically to the target word rather than using one set of global parameters across all words. Finally, the relation between the choice of parameters and word classes will also be investigated. Once solid baselines have been established for the system, other sources of linguistic knowledge can be added. In particular, the parser provides much more information than is used.

Note
1 Wisdom appears as manitoba.ks in the Senseval results.

References

Choueka, Y. and S. Lusignan. “Disambiguation by Short Contexts”. Computers and the Humanities, 19 (1985), 147–157.
Ide, N. and J. Véronis. “Word Sense Disambiguation: The State of the Art”. Computational Linguistics, 24(1) (1998), 1–40.
Kaplan, A. “An Experimental Study of Ambiguity and Context”. Mechanical Translation, 2(2) (1955), 39–46.
Lin, D. “Principle-Based Parsing without Overgeneration”. In 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 1993, pp. 112–120.
Lin, D. “Automatic Retrieval and Clustering of Similar Words”. In COLING-ACL98, Montreal, Canada, 1998.

Computers and the Humanities 34: 171–177, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Memory-Based Word Sense Disambiguation

JORN VEENSTRA, ANTAL VAN DEN BOSCH, SABINE BUCHHOLZ, WALTER DAELEMANS and JAKUB ZAVREL
ILK, Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands (E-mail: {veenstra,antalb,buchholz,walter,zavrel}@kub.nl)

Abstract. We describe a memory-based classification architecture for word sense disambiguation and its application to the SENSEVAL evaluation task. For each ambiguous word, a semantic word expert is automatically trained using a memory-based approach. In each expert, selecting the correct sense of a word in a new context is achieved by finding the closest match to stored examples of this task. Advantages of the approach include (i) fast development time for word experts, (ii) easy and elegant automatic integration of information sources, (iii) use of all available data for training the experts, and (iv) relatively high accuracy with minimal linguistic engineering.

1. Introduction

In this paper we describe a memory-based approach to training word experts for word sense disambiguation (WSD) as defined in the SENSEVAL task: the association of a word in context with its contextually appropriate sense tag. In our current system, training of the semantic word experts is based on POS-tagged corpus examples and selected information from dictionary entries. The general approach is completely automatic; it only relies on the availability of a relatively small number of annotated examples for each sense of each word to be disambiguated, and not on human linguistic or lexicographic intuitions. It is therefore easily adaptable and portable.

Memory-Based Learning (MBL) is a classification-based, supervised learning approach. In this framework, a WSD problem has to be formulated as a classification task: given a set of feature values describing the context in which the word appears and any other relevant information as input, a classifier has to select the appropriate output class from a finite number of a priori given classes. In our approach, we construct a distinct classifier for each word to be disambiguated. We interpret this classifier as a word-expert (Berleant, 1995). Alternative supervised learning algorithms could be used to construct such word experts. The distinguishing property of memory-based learning as a classification-based supervised learning method is that it does not abstract from the training data the way that alternative learning methods (e.g. decision tree learning, rule induction, or neural networks) do.


In the remainder of this paper, we describe the different memory-based learning algorithms used, discuss the setup of our memory-based classification architecture for WSD, and report the generalization accuracy on the SENSEVAL data both for cross-validation on the training data and for the final run on the evaluation data.

2. Memory-Based Learning

MBL keeps all training data in memory and only abstracts at classification time by extrapolating a class from the most similar item(s) in memory (i.e. it is a lazy learning method instead of the more common eager learning approaches). In recent work (Daelemans et al., 1999) we have shown that for typical natural language processing tasks, this lazy learning approach is at an advantage because it “remembers” exceptional, low-frequency cases which are nevertheless useful to extrapolate from. Eager learning methods “forget” information, because of their pruning and frequency-based abstraction methods. Moreover, the automatic feature weighting in the similarity metric of a memory-based learner makes the approach well-suited for domains with large numbers of features from heterogeneous sources, as it embodies a smoothing-by-similarity method when data is sparse (Zavrel and Daelemans, 1997).

For our experiments we have used TiMBL1, an MBL software package developed in our group (Daelemans et al., 1998). TiMBL includes the following variants of MBL:

IB1: The distance between a test item and each memory item is defined as the number of features for which they have a different value (overlap metric).
IB1-IG: In most cases, not all features are equally relevant for solving the task; this variant uses information gain (an information-theoretic notion measuring the reduction of uncertainty about the class to be predicted when knowing the value of a feature) to weight the cost of a feature value mismatch during comparison.
IB1-MVDM: For typical symbolic (nominal) features, values are not ordered. In the previous variants, mismatches between values are all interpreted as equally important, regardless of how similar (in terms of classification behaviour) the values are. We adopted the modified value difference metric to assign a different distance between each pair of values of the same feature.
MVDM-IG: MVDM with IG weighting.
IGTREE: In this variant, an oblivious decision tree is created with features as tests, and ordered according to information gain of features, as a heuristic approximation of the computationally more expensive pure MBL variants.

For more references and information about these algorithms we refer the reader to (Daelemans et al., 1998; Daelemans et al., 1999).
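For readers unfamiliar with these metrics, the sketch below shows the core of IB1 and IB1-IG in Python: a (weighted) count of mismatching feature values between a test case and each stored case, followed by extrapolation from the nearest neighbour (k = 1). The MVDM value-difference metric, the gain-ratio refinements, and the IGTREE approximation implemented in TiMBL are not reproduced here, and the example vectors are invented.

```python
def ib1_distance(case_a, case_b, weights=None):
    """Overlap metric: count mismatching feature values; with weights it
    becomes the IB1-IG distance (information-gain weighted mismatches)."""
    if weights is None:
        weights = [1.0] * len(case_a)
    return sum(w for a, b, w in zip(case_a, case_b, weights) if a != b)

def classify(test_case, memory, weights=None):
    """Extrapolate the class of the single most similar memory item;
    memory is a list of (feature_vector, class) pairs."""
    _, best_class = min(memory,
                        key=lambda item: ib1_distance(test_case, item[0], weights))
    return best_class

# Toy usage with invented word/POS context features.
memory = [
    (("in", "IN", "an", "DT"), "sense_532675"),
    (("by", "IN", "the", "DT"), "sense_532676"),
]
print(classify(("in", "IN", "the", "DT"), memory))
```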


3. System Architecture and Experiments

For the WSD task, we train classifiers for each word to be sense-tagged.2 To settle on an optimal memory-based learning algorithm variant (i.e. IB1, IB1-IG, IB1-MVDM, or IGTREE) and parameter settings (e.g. k, the number of similar items taken into account when extrapolating from memory), as well as different possible feature construction settings (see below), ten-fold cross-validation is used: the training data is split into ten equal parts, and each part in turn is used as a test set, with the remaining nine parts as training set. All sensible parameter settings, algorithm variants, and feature construction settings are tested, and those settings giving the best results in the cross-validation are used to construct the final classifier, this time based on all available training data. This classifier is then tested on the SENSEVAL test cases for that word.

Feature Extraction. The architecture described is suited for WSD in general, and this can include various types of distinctions ranging from rough senses that correspond to a particular POS tag, to very fine distinctions for which semantic inferences need to be drawn from the surrounding text. The 36 words and their senses in the SENSEVAL task embody many such different types of disambiguations. Since we do not know beforehand what features will be useful for each particular word and its senses, and because our classifier can automatically assess feature relevance, we have chosen to include a number of different information sources in the representation for each case. All information is taken from the dictionary entries in the HECTOR dictionary and from the corpus files, both of which have been labeled with Part of Speech tags using MBT, our Memory-Based Tagger (Daelemans et al., 1996). We did not use any further information such as external lexicons or thesauri.

The sentences in the corpus files contain sense-tagged examples of the word in context. For example:

800002 An image of earnest Greenery is almost tangible. Eighteen years ago she lost one of her six children in an accident < / > on Stratford Road, a tragedy which has become a pawn in the pitiless point-scoring of small-town vindictiveness.

The dictionary contains a number of fields for each sense, some of which (i.e. the ‘ex’ (example) and ‘idi’ (idiom) fields) are similar to the corpus examples. These underwent the same treatment as the corpus examples: these cases were used to extract both context features (directly neighbouring words and POS-tags, as described in section 3), and keyword features (informative words from a wide neighbourhood; see section 3). The only other field from the dictionary that we used is the ‘def’ field, which gives a definition for a sense. During the cross-validation, the examples which originated from the dictionary were always kept in the training portion of the data to have a better estimate of the generalization error. Note that for both dictionary and corpus examples, we took the sense-tag that it was labeled with as a literal atom,3 and did not take into account the hierarchical
sense/sub-sense structure of the category labels. All cases that were labeled as errors or omissions (i.e. the 999997 and 999998 tags) were discarded. Disjunctions were split into (two) separate cases.

Context Features. We used the word form and the Part-of-Speech (POS) tag of the word of interest and the surrounding positions as features. After some initial experiments, the size of the window was set to two words to the left and to the right. This gives the following representation for the example given above:

800002,in,IN,an,DT,accident,NN,on,IN,Stratford,NNP,532675
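The construction of this fixed-width context representation can be pictured as in the sketch below. The instance identifier and field order follow the example; the boundary padding and helper name are assumptions made for illustration.

```python
def context_features(instance_id, tagged_tokens, position, sense, pad=("_", "_")):
    """Build the word/POS window representation shown above: the target
    plus two (word, POS) pairs to its left and right, with id and sense tag.

    tagged_tokens: list of (word, POS) pairs; position: index of the
    target word.  Boundary positions are padded with a dummy pair.
    """
    padded = [pad, pad] + list(tagged_tokens) + [pad, pad]
    i = position + 2
    window = padded[i - 2:i + 3]          # left-2, left-1, target, right-1, right-2
    fields = [instance_id]
    for word, pos_tag in window:
        fields.extend([word, pos_tag])
    return ",".join(fields + [sense])

tokens = [("in", "IN"), ("an", "DT"), ("accident", "NN"), ("on", "IN"), ("Stratford", "NNP")]
print(context_features("800002", tokens, 2, "532675"))
# -> 800002,in,IN,an,DT,accident,NN,on,IN,Stratford,NNP,532675
```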

Keyword Features. Often the direct context cannot distinguish between two senses. In such cases it is useful to look at a larger context (e.g. the whole text snippet that comes with the example) to guess the meaning from its content words. As there is a large number of possible content words, and each sentence contains a different number of them, it is not practical to represent all of them in the fixed-length feature-value vector that is required by the learning algorithm. We therefore used only a limited set of “informative” words, extracted from i) sentences in the corpus file and ii) the ‘ex’ and ‘idi’ sentences in the dictionary file; we will call these words the keywords. The method is essentially the same as in the work of Ng and Lee (1996), and extracts a number of keywords per sense. These keywords are then used as binary features, which take the value 1 if the word is present in the example, and the value 0 if it is not. A word is a keyword for a sense if it obeys the following three properties: (i) the word occurs in more than M1 percent of the cases with the sense; a high value of M1 thus restricts the keywords to those that are very specific for a particular sense, (ii) the word occurs at least M2 times in the corpus; a high value of M2 thus eliminates low-frequency keywords, (iii) only the M3 most frequently occurring keywords for a sense are extracted, restricting somewhat the number of keywords that are extracted for very frequent senses.

Definition Features. In addition to the keywords that passed the above selection, we use all open class words (nouns, adjectives, adverbs and verbs) in the ‘def’ field of the dictionary entry as features. Comparable to the keyword features, the definition word feature has the value ‘1’ if it occurs in the test sentence, else it has the value ‘0’. The ‘def’ field is only used for this purpose, and is not converted to a training case. After the addition of both types of keywords, a complete case for our example will look as follows:

800002,in,IN,an,DT,accident,NN,on,IN,Stratford,NNP,0,0,...,0,0,0,1,0,0,0,0,0,0,0,532675
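A sketch of the keyword selection and the resulting binary features is given below. The reading of M1 as the proportion of a word's occurrences that carry the sense, the counting of document rather than token frequency for M2, and the whitespace tokenization are interpretive simplifications, not the system's exact definitions.

```python
from collections import Counter, defaultdict

def select_keywords(examples, m1=0.8, m2=5, m3=5):
    """Pick keywords per sense: a word qualifies if more than a fraction
    m1 of its occurrences are with that sense, it occurs at least m2
    times, and it is among the m3 most frequent such words for the sense."""
    word_total = Counter()
    word_with_sense = defaultdict(Counter)
    for words, sense in examples:               # examples: (list of words, sense tag)
        for w in set(words):
            word_total[w] += 1
            word_with_sense[sense][w] += 1

    keywords = {}
    for sense, counter in word_with_sense.items():
        qualified = [w for w, c in counter.items()
                     if word_total[w] >= m2 and c / word_total[w] > m1]
        qualified.sort(key=lambda w: -counter[w])
        keywords[sense] = qualified[:m3]
    return keywords

def keyword_features(words, keywords):
    """One binary feature per selected keyword: 1 if the keyword occurs
    in the example, 0 otherwise."""
    present = set(words)
    all_keywords = sorted({k for kws in keywords.values() for k in kws})
    return [int(k in present) for k in all_keywords]
```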

Post-processing. The ‘dict’ files contain information about multi-word expressions, compounds or collocations of a word related to a specific sense, e.g. the collocation ‘golden handshake’ strongly predicts sense ‘516773’. Using this information in a post-processing step gave a slight improvement in performance.


Table I. The best scoring metrics and parameter settings found after 10-fold cross-validation on the training set (see text). The scores are the baseline, the default and optimal settings on the training set (average of 10-fold cross-validation), and the fine-grained, medium and coarse scores on the evaluation set respectively. The scores on the evaluation set were computed by the SENSEVAL coordinators. The average scores are computed over the percentages in this table.

word        metric    k    M1-M2-M3    baseline  train.def  train.opt  eval.f  eval.m  eval.c
accident    MVDM      3    0.3-3-3     67.0      81.4       90.2       92.9    95.4    98.1
amaze       IB1-IG    1    1.0-500-0   57.9      99.7       100        97.1    97.1    97.1
band        IGTREE    –    0.5-7-4     73.0      85.4       88.8       88.6    88.6    88.6
behaviour   MVDM-IG   9    0.3-5-5     95.9      94.9       96.7       96.4    96.4    96.4
bet-n       MVDM-IG   1    0.0-5-100   25.5      56.7       71.1       65.7    72.6    75.5
bet-v       IB1-IG    3    0.7-3-3     37.3      64.3       88.6       76.9    77.8    81.2
bitter      MVDM-IG   5    0.5-5-100   30.6      57.6       59.1       65.8    66.4    66.4
bother      MVDM-IG   3    0.2-5-100   45.6      72.8       83.6       85.2    87.1    87.1
brilliant   MVDM-IG   1    0.6-2-100   47.3      57.5       58.8       54.6    62.0    62.0
bury        MVDM-IG   3    0.5-5-100   32.4      35.9       46.2       50.2    51.0    51.7
calculate   IB1-IG    7    0.7-3-3     72.0      79.2       83.2       90.4    90.8    90.8
consume     IGTREE    –    0.7-5-5     37.5      32.9       58.8       37.3    43.8    49.7
derive      MVDM      5    0.0-2-100   42.9      63.9       67.3       65.0    66.1    66.8
excess      MVDM-IG   5    0.5-1-1     29.1      82.6       89.3       84.4    86.3    88.2
float-a     IGTREE    –    0.3-3-3     61.9      57.0       73.5       57.4    57.4    57.4
float-n     MVDM-IG   1    0.8-5-5     41.3      50.8       70.2       64.0    65.3    68.0
float-v     IGTREE    –    0.4-2-100   21.0      34.2       44.0       35.4    40.6    44.1
generous    MVDM      15   0.6-5-100   32.5      44.8       49.3       51.5    51.5    51.5
giant-a     IGTREE    –    1.0-500-0   93.1      92.8       94.1       97.9    99.5    100
giant-n     MVDM-IG   5    0.2-5-100   49.4      77.2       82.6       78.8    85.6    97.5
invade      IB1-IG    3    0.1-10-1    37.5      48.0       62.7       52.7    59.2    62.3
knee        MVDM-IG   5    0.0-5-100   42.8      70.3       81.4       79.3    81.8    84.1
modest      MVDM-IG   9    0.0-5-100   58.8      61.1       67.1       70.7    72.8    75.2
onion       IB1       1    0.8-5-5     92.3      90.0       96.7       80.4    80.4    80.4
promise-n   MVDM-IG   5    0.2-5-100   59.2      63.6       75.3       77.0    83.2    91.2
promise-v   IB1-IG    3    0.5-5-10    67.4      85.6       89.8       86.2    87.1    87.9
sack-n      MVDM-IG   1    0.3-3-3     44.3      75.0       90.8       84.1    84.1    84.1
sack-v      IB1       9    1.0-500-0   98.9      97.8       98.9       97.8    97.8    97.8
sanction    MVDM-IG   1    0.5-3-3     55.2      74.9       87.4       86.3    86.3    86.3
scrap-n     IB1       1    0.4-5-100   37.0      58.3       68.3       68.6    83.3    86.5
scrap-v     IGTREE    –    0.7-3-3     90.0      88.3       91.7       85.5    97.8    97.8
seize       IGTREE    –    0.5-5-100   27.0      57.1       68.0       59.1    59.1    63.7
shake       MVDM-IG   7    0.2-5-100   24.7      71.5       73.3       68.0    68.5    69.4
shirt       IGTREE    –    0.7-5-5     56.9      83.7       91.2       84.4    91.8    96.7
slight      IB1-IG    1    0.3-3-3     66.8      92.7       93.0       93.1    93.3    93.6
wooden      IGTREE    –    0.5-1-1     95.3      97.3       98.4       94.4    94.9    94.9
average                                54.1      70.5       78.6       75.1    77.9    79.7


'golden handshake' strongly predicts sense '516773'. Using this information in a post-processing step gave a slight improvement in performance.

Results  In this section we present the results we obtained with the optimal choice of metrics and feature construction parameters found with 10-fold cross-validation on the training data, and the results on the evaluation data, as measured by the SENSEVAL coordination team. For comparison we also provide the baseline results (on the training data), obtained by always choosing the most frequent sense. Table I shows the results per word. The algorithm and metric applied are indicated in the metric column; the value of k in the third column; the values of M1, M2 and M3 in the next column; the accuracy with the optimal settings can be found in the 'train.opt' column; and the accuracy obtained with the default setting (M1 = 0.8, M2 = 5, M3 = 5, the default suggested by Ng and Lee (1996)) and algorithm (IB1-MVDM, k = 1, no weighting) is given in the column 'train.def'. The three rightmost columns give the scores on the evaluation data, measured by the fine-grained, medium, and coarse standard respectively. For an overview of the scoring policy and a comparison to other systems participating in SENSEVAL we refer to Kilgarriff and Rosenzweig (this volume).

4. Conclusion

A memory-based architecture for word sense disambiguation does not require any hand-crafted linguistic knowledge, but only annotated training examples. Since for the present SENSEVAL task dictionary information was available, we made use of this as well, and it was easily accommodated in the learning algorithm. We believe that MBL is well-suited to domains such as WSD, where large numbers of features and sparseness of data interact to make life difficult for many other (e.g. probabilistic) machine-learning methods, and where nonetheless even very infrequent or exceptional information may prove to be essential for good performance. However, since this work presents one of the first (but cf. Ng and Lee (1996) and Wilks and Stevenson (1998)) excursions of MBL techniques into WSD territory, this claim needs further exploration. Although the work presented here is similar to many other supervised learning approaches, and in particular to the exemplar-based method used by Ng and Lee (1996) (which is essentially IB1-MVDM with k = 1), the original aspect of the work presented in this paper lies in the fact that we have used a cross-validation step per word to determine the optimal parameter setting, yielding an estimated performance improvement of 14.4% over the default setting.


Acknowledgements

This research was done in the context of the "Induction of Linguistic Knowledge" (ILK) research programme, which is funded by the Netherlands Foundation for Scientific Research (NWO).

Notes

1 TiMBL is available from: http://ilk.kub.nl/.
2 In some cases, the SENSEVAL task requires sense-tagging a word/POS-tag combination; we will refer to both situations as word sense-tagging.
3 Although we did strip the letter suffixes (such as −x), except for the −p suffix.

References

Berleant, D. "Engineering word-experts for word disambiguation". Natural Language Engineering, 1995, pp. 339–362.
Daelemans, W., A. Van den Bosch and J. Zavrel. "Forgetting exceptions is harmful in language learning". Machine Learning, Special Issue on Natural Language Learning, 1999.
Daelemans, W., J. Zavrel, K. Van der Sloot and A. Van den Bosch. "TiMBL: Tilburg Memory Based Learner, version 1.0, Reference Guide". ILK Technical Report 98-03, available from: http://ilk.kub.nl/, 1998.
Daelemans, W., J. Zavrel, P. Berck and S. Gillis. "MBT: A Memory-Based Part of Speech Tagger Generator". In: E. Ejerhed and I. Dagan (eds.), Proc. of the Fourth Workshop on Very Large Corpora, 1996, pp. 14–27.
Ng, H. T. and H. B. Lee. "Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach". In: Proc. of the 34th Meeting of the Association for Computational Linguistics, 1996.
Wilks, Y. and M. Stevenson. "Word Sense Disambiguation using Optimised Combinations of Knowledge Sources". In: Proceedings of COLING-ACL'98, Montreal, Quebec, Canada, 1998, pp. 1398–1402.
Zavrel, J. and W. Daelemans. "Memory-Based Learning: Using Similarity for Smoothing". In: Proc. of the 35th Annual Meeting of the ACL, Madrid, 1997.

Computers and the Humanities 34: 179–186, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Hierarchical Decision Lists for Word Sense Disambiguation DAVID YAROWSKY Dept. of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, USA (E-mail: [email protected])

Abstract. This paper describes a supervised algorithm for word sense disambiguation based on hierarchies of decision lists. This algorithm supports a useful degree of conditional branching while minimizing the training data fragmentation typical of decision trees. Classifications are based on a rich set of collocational, morphological and syntactic contextual features, extracted automatically from training data and weighted sensitive to the nature of the feature and feature class. The algorithm is evaluated comprehensively in the SENSEVAL framework, achieving the top performance of all participating supervised systems on the 36 test words where training data is available. Key words: word sense disambiguation, decision lists, supervised machine learning, lexical ambiguity resolution, SENSEVAL

1. Introduction

Decision lists have been shown to be effective at a wide variety of lexical ambiguity resolution tasks including word sense disambiguation (Yarowsky, 1994, 1995; Mooney, 1996; Wilks and Stevenson, 1998), text-to-speech synthesis (Yarowsky, 1997), multilingual accent/diacritic restoration (Yarowsky, 1994), named entity classification (Collins and Singer, 1999) and spelling correction (Golding, 1995). One advantage offered by interpolated decision lists (Yarowsky, 1994, 1997) is that they avoid the training data fragmentation problems observed with decision trees or traditional non-interpolated decision lists (Rivest, 1987). They also tend to be effective at modelling a large number of highly non-independent features that can be problematic to model fully in Bayesian topologies for sense disambiguation (Gale, Church and Yarowsky, 1992; Bruce and Wiebe, 1994). This paper presents a new learning topology for sense disambiguation based on hierarchical decision lists, adding a useful degree of conditional branching to the decision list framework. The paper also includes a comprehensive evaluation of this algorithm's performance on extensive previously unseen test data in the SENSEVAL framework (Kilgarriff, 1998; Kilgarriff and Palmer, this volume),


showing its very successful application to the complex and fine-grained HECTOR sense inventory.

2. System Description

The basic decision-list algorithms used in this system are described in Yarowsky (1994, 1997), with key details outlined below. Note that part-of-speech (POS) tagging is treated as a disjoint task from sense tagging, and a trigram POS tagger has been applied to the data first. The POS tagger has not been optimized for the specific idiosyncrasies of the SENSEVAL words and such optimization would likely be helpful.

2.1. FEATURE SPACE

The contextual clues driving the decision list algorithm are a cross-product of rich sets of token types and positions relative to the keyword. The example decision lists in Table I illustrate a partial set of such features. Positional options include relative offsets from the keyword (+1, −1, −2), the keyword itself (+0), co-occurrence within a variable κ-word window (±κ), and larger n-gram patterns (+1+2, −1+1). Another crucial positional class is the wide range of syntactic relations extracted from the data using an island-centered finite-state parser. The valid patterns differ depending on keyword part of speech, and for nouns they are V/OBJ – the verb of which the keyword is an object (e.g. showed very abundant promise), SUBJ/V – the verb of which the keyword is the subject, and MODNOUN – the optional headnoun modified by the noun. Each of these patterns helps capture and generalize sets of very predictive longer-distance word associations.

Five major token types are measured in each of the diversity of syntactic/collocational positions: W = literal word, L = lemma (win/V = win, wins, won, winning), P = part-of-speech, C = word class (e.g. countryname) and Q = question, such as whether the word in the given position is capitalized. Together this rich cross-product of word type and syntactic position offers considerable refinement over the bag-of-words model.
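As an illustration of this cross-product (a simplified sketch, not the author's code: the word-class (C) features and the syntactic V/OBJ, SUBJ/V and MODNOUN relations are omitted because they require external resources, and the lemma list is assumed to be supplied):

```python
def positional_features(tokens, tags, lemmas, i, k=10):
    """Token-type x position features for the keyword at index i."""
    feats = set()
    for off in (-2, -1, 0, 1, 2):                       # relative offsets, incl. the keyword (+0)
        j = i + off
        if 0 <= j < len(tokens):
            feats.add(("W", off, tokens[j]))            # W = literal word
            feats.add(("L", off, lemmas[j]))            # L = lemma
            feats.add(("P", off, tags[j]))              # P = part of speech
            feats.add(("Q", off, tokens[j][0].isupper()))  # Q = question (capitalised?)
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            feats.add(("L", "±k", lemmas[j]))           # wide ±k co-occurrence window
    return feats
```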

2.2. FEATURE WEIGHTING AND BASIC DECISION LIST GENERATION

For each word-position feature f_i, a smoothed log-likelihood ratio log(P(f_i | s_j) / P(f_i | ¬s_j)) is computed for each sense s_j, with smoothing based on an empirically estimated function of feature type and relative frequency. Candidate features are ordered by this smoothed ratio (putting the best evidence first), and the remaining probabilities are computed via the interpolation of the global and history-conditional probabilities.1
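A minimal sketch of this scoring-and-ordering step, assuming simple add-alpha smoothing in place of the empirically estimated smoothing function described above (illustrative only, not the author's implementation):

```python
import math
from collections import Counter

def build_decision_list(cases, alpha=0.1):
    """cases: list of (feature_set, sense). Returns (score, feature, sense)
    entries sorted by the smoothed log-likelihood ratio, strongest first."""
    sense_counts = Counter(s for _, s in cases)
    feat_sense = Counter((f, s) for feats, s in cases for f in feats)
    total = sum(sense_counts.values())
    entries = []
    for (f, s), n_fs in feat_sense.items():
        n_f = sum(c for (g, _), c in feat_sense.items() if g == f)
        p_f_given_s = (n_fs + alpha) / (sense_counts[s] + alpha)
        p_f_given_not_s = (n_f - n_fs + alpha) / (total - sense_counts[s] + alpha)
        entries.append((math.log(p_f_given_s / p_f_given_not_s), f, s))
    entries.sort(key=lambda e: e[0], reverse=True)   # best evidence first
    return entries
```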


2.3. HIERARCHICAL DECISION LISTS

One limitation of traditional flat decision lists is that they do not support conditional branching. Yet it is often the case that, given some major splitting criterion (such as whether a keyword is identified as a noun or verb), we would wish to divide the control flow of the decision procedure into relatively independent paths specialized for the modelling needs of each side of the splitting partition. Decision trees, which entail complete path independence after every node split, pay for this power with wasteful training data fragmentation. Yet a simple forest of uniflow decision lists fails to capture the common hierarchical structure of many decision problems. This proposed hybrid supports several major useful decision-flow partitions, but largely retains the uniflow, non-data-fragmenting benefits of interpolated decision lists. The key to the success of this approach is defining a class of such natural major partitioning questions for the application, and pursuing exhaustive cross-validated search on whether any candidate partition improves modelling of the training data.2

For the application of sense disambiguation, some natural major decision-flow partitioning criteria are the following (a schematic sketch of the resulting control flow is given after the list):

− Split on the part of speech of the keyword. As previously noted, the sense inventory and natural decision lists for the noun and verb senses of words are widely divergent, and thus a top-level split in control flow based on keyword part-of-speech is very natural. The top-level decision list in Table I illustrates this split into subsequent LN (noun) and LV (verb) decision lists for the word promise.

− Split on keyword inflection. Similarly, within a major part-of-speech, different inflectional forms (e.g. promise and promises, or scrap and scraps) often exhibit different sense inventory distributions and different optimal subsequent modeling. In the mid-level list in Table I, promises (NOUN) separately yields a mostly pure sense distribution that effectively excludes senses 5 and 6. In contrast, the singular inflection promise (NOUN) retains this ambiguity, requiring the subsequent decision list L4 to distinguish senses 4, 5 and 6. While this partition could technically have been done with finer-grained parts of speech at the top-level split, the interaction with other mid-level questions (see below) makes this two-tiered part-of-speech partition process worthwhile.

− Split on major idiomatic collocations. Many idiomatic collocations like keep/break/give/make a promise or shake up/down/out/off benefit from a subsequent specialized decision list to resolve the possible sense differences for this specific collocation (e.g. L1 or L2), and when corresponding to a single sense number (e.g. keep a promise → 4.3) can directly yield a sense-tag output (as a specialized decision list would have no residual ambiguity to resolve). Such candidate collocations are extracted from the basic defining inventory mne-uid.map3 (e.g. promise 538409 keep n promise / / 4.3) and/or


Table I. Partial decision list hierarchy for the SENSEVAL word promise.

Top-level Decision List for promise
                                       Empirical Sense Distribution
Loc  Typ  Token   Next List    1    3    4   4.1  4.2  4.3  4.4   5    6
+0   P    NOUN    → LN(⇓)      0    0   297   53    5   37   11   22   93
+0   P    VERB    → LV        440  115    0    0    0    0    0    0    0

⇓ Mid-level Decision List for promise.LN (noun)
                                       Empirical Sense Distribution
Loc    Typ  Token      Next List    4   4.1  4.2  4.3  4.4   5    6
V/obj  L    keep/V     → 4.3        0    0    0   31    0    0    0
V/obj  L    break/V    → 4.4        0    0    0    0   11    0    0
V/obj  L    make/V     → L1         2   44    0    0    0    0    2
V/obj  L    give/V     → L2         0    0    5    1    0    1    2
+0     W    promises   → L3       115    5    0    0    0    0    1
+0     W    promise    → L4(⇓)    180    2    0    1    0   21   88

⇓ (Abbreviated) Terminal Decision List for promise.L4 (promise-noun-singular)
                                                    Empirical Sense Distribution
Loc     Typ  Token         Output Sense  LogL    4   4.1  4.2  4.3  4.4   5    6
+1      W    to            → 4           9.51   41    0    0    0    0    0    0
−1      W    of            → 6           8.16    0    0    0    0    0    0   12
−1      L    early/J       → 6           7.38    0    0    0    0    0    0    7
V/obj   L    show/V        → 6           7.27    0    0    0    0    0    0   13
+1      W    at            → 6           6.16    0    0    0    0    0    0    3
−1      L    firm/J        → 4           5.74    6    0    0    0    0    0    0
+1      L    do/V          → 4           5.70    3    0    0    0    0    0    0
−1      W    such          → 6           5.57    0    0    0    0    0    0    2
−1      W    much          → 6           5.57    0    0    0    0    0    0    2
+1      W    when          → 6           5.57    0    0    0    0    0    0    2
+1      W    on            → 6           5.57    0    0    0    0    0    0    2
+1      W    as            → 6           5.57    0    0    0    0    0    0    2
−1      W    your          → 4           5.16    2    0    0    0    0    0    0
+1      W    during        → 4           5.16    2    0    0    0    0    0    0
±κ      L    free/J        → 4           4.74   15    0    0    0    0    0    0
V/obj   L    trust/V       → 4           4.74    3    0    0    0    0    0    0
±κ      L    support/N     → 4           4.64   14    0    0    0    0    0    0
±κ      L    election/N    → 4           4.29   11    0    0    0    0    0    0
subj/V  L    contain/V     → 4           4.18    2    0    0    0    0    0    0
V/obj   L    win/V         → 4           4.16    2    0    0    0    0    0    0
V/obj   L    repeat/V      → 4           4.16    2    0    0    0    0    0    0
V/obj   L    honour/J      → 4           4.16    2    0    0    0    0    0    0
−1      L    rhetorical/J  → 5           4.09    0    0    0    0    0    0    0
−1      L    increase/V    → 5           4.09    0    0    0    0    0    0    0
−1      L    future/J      → 5           4.09    0    0    0    0    0    0    0



from collocations that are found to be tightly correlated with specialized sense numbers in the training data. The decision to split out any such collocation is based on an empirical test of the global efficacy of doing so.4

− Split on syntactic features. In many cases it is also useful to allow mid-level splits on syntactic questions such as whether a keyword noun premodifies another noun (e.g. the standard syntactic feature MODNOUN != NULL). Such a split is not useful for promise, but is widely applicable to the HECTOR inventory given its tendency to make an NMOD subsense distinction.

− Partition subsenses hierarchically. When a sense inventory has a deep sense/subsense structure, it may be useful to have third-level decision lists focus on major sense partitions (e.g. 4/5/6) and, when appropriate, yield pointers to a finer-grained subsense-resolving decision list (e.g. L5 = 5.1/5.2/5.3). This multi-level subsense resolution is most effective when the subsenses are tightly related to each other and quite different from the other major senses. For performance reasons, however, a flat direct subsense partition (5.1/5.2/5.3/6.1/6.2) was generally pursued on the SENSEVAL data. Recent results indicate that an even more effective compromise in this case is to utilize a deeply hierarchical approach where probabilities are interpolated across sibling subtrees.
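Schematically, applying such a hierarchy at tagging time amounts to walking from the top-level list down through any specialized sub-lists until a sense tag is emitted. The sketch below illustrates this control flow only (hypothetical class and function names; the interpolation machinery and the training-time partition search are omitted):

```python
class Node:
    """A decision-list node: ordered (test, action) pairs, where an action
    is either a final sense tag or a more specialized child Node."""
    def __init__(self, entries, default):
        self.entries = entries        # list of (test_function, action)
        self.default = default        # fallback sense tag

def classify(node, context):
    while True:
        for test, action in node.entries:
            if test(context):
                if isinstance(action, Node):
                    node = action          # descend, e.g. POS -> inflection -> terminal list
                    break
                return action              # terminal entry: emit the sense tag
        else:
            return node.default            # nothing matched: fall back
```

Training-time decisions about which splits to keep would, as described above, be made by cross-validated comparison of the candidate hierarchies.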

3. Evaluation and Conclusion

Table II details the performance of the JHU hierarchical decision list system in the 1998 SENSEVAL evaluation. To help put the performance figures in perspective, the average precision for all supervised systems is given, as is the precision for the best performing system of any type. All data to the left of the vertical line are based on the July 98 bakeoff. Here the JHU system achieved the highest average overall precision on the 36 "trainable" words (for which tagged training data was available). Due to the haste under which the primary evaluation was conducted, and the inability to manually check the output, for three words (bet/Noun, floating/Adj and seize/Verb) the JHU system had errors in mapping from its internal sense number representations (a contiguous sequence 0, 1, 2, 3, . . . ) to the standard output sense IDs (538411, 537573, 537626, etc.). This resulted in significantly lower scores for these three words. Thus for the 2nd round October 98 evaluations, these simple mapping errors were corrected and nothing else was changed. Corrected performance figures are given to the right of the vertical line.

The additional evaluation area consisted of the 5 words for which no annotated training data was available. As a demonstration of robustness, the JHU supervised tagger was applied to these words as well, trained only on their dictionary definitions. Precision for these words was measured at deaf = 94.3, disability = 90.0, hurdle = 69.0, rabbit = 76.5 and steering = 95.0, with an overall average precision


Table II. Performance of the JHU system on the 36 trainable words. Columns to the right of the vertical line give the corrected October 98 figures. Avg. Syst Prec. is the average precision of the supervised systems; Best Syst Prec. and ranks are relative to all participating systems; '+' = above median rank, '−' = below median rank.

Word            POS  Avg. Syst  Initial    Best Syst  JHU Rank  JHU %     | Final      New JHU
                     Prec.      JHU Prec.  Prec.      of 21     of Best   | JHU Prec.  Rank
All Trainable   a    72.7       77.8       77.8       1         100.0     | 77.3       1
All Trainable   n    81.7       84.7       87.0       3         97.4      | 87.0       2
All Trainable   p    73.7       78.1       78.1       1         100.0     | 78.1       1
All Trainable   v    66.4       73.4       73.4       1         100.0     | 74.3       1
All Trainable   all  73.40      78.4       78.4       1         100.00    | 78.9       1
accident        n    92.3       95.6       95.7       2         99.9      |
amaze           v    94.6       100.0      100.0      1         100.0     |
band            p    87.5       90.6       90.6       1         100.0     |
behaviour       n    87.5       96.1       96.4       +         99.7      |
bet             n    60.8       52.2       75.7       −         69.7      | 78.8       1
bet             v    55.5       69.8       78.6       3         88.8      |
bitter          p    63.8       64.9       73.4       +         88.4      |
bother          v    75.3       80.2       86.5       2         92.7      |
brilliant       a    56.1       59.5       61.4       3         96.9      |
bury            v    47.8       46.2       57.3       +         80.6      |
calculate       v    87.9       92.2       92.2       1         100.0     |
consume         v    52.1       53.0       58.5       +         90.6      |
derive          v    59.5       66.4       67.1       2         99.0      |
excess          n    83.5       87.8       90.0       2         97.6      |
float           n    65.1       82.2       82.2       1         100.0     |
float           v    47.1       54.0       61.4       2         87.9      |
floating        a    57.2       0.0        80.9       −         0.0       | 63.6       +
generous        a    53.7       59.5       61.2       2         97.2      |
giant           a    84.1       99.1       99.5       3         99.6      |
giant           n    83.6       85.8       91.0       +         94.3      |
invade          v    56.5       54.6       63.4       −         86.1      |
knee            n    81.5       84.6       87.1       2         97.1      |
modest          a    68.2       71.8       72.9       +         98.5      |
onion           n    86.7       92.1       92.5       2         99.6      |
promise         n    84.4       88.6       88.6       1         100.0     |
promise         v    69.8       90.9       91.3       2         99.6      |
sack            n    76.9       87.8       87.8       1         100.0     |
sack            v    83.1       97.8       97.8       1         100.0     |
sanction        p    76.9       86.5       86.5       1         100.0     |
scrap           n    64.2       75.1       79.5       2         94.5      |
scrap           v    78.7       94.9       95.1       2         99.8      |
seize           v    64.2       65.3       68.4       2         95.5      | 73.5       1
shake           v    68.6       70.9       76.5       3         92.7      |
shirt           n    90.8       92.6       97.8       3         94.7      |
slight          a    92.0       96.3       96.3       1         100.0     |
wooden          a    95.8       97.4       98.0       3         99.4      |


of 81.7%, the 2nd-highest untrainable-word score among all participants, including those systems specialized for unsupervised and dictionary-based training.

Finally, the comparative advantage of hierarchical decision lists relative to flat lists was investigated. Using the most fine-grained inventory scoring and 5-fold cross-validation on the training corpus for these additional studies, average accuracy on the 36 test words dropped by 7.3% when the full 3-level lists were replaced by a single 2-level list splitting only on the part of speech of the keyword. A further 1% drop in average accuracy was observed on the 'p' words (bitter, sanction, etc.) when their top-level POS split was merged as well.5 Taken together these results indicate that optionally splitting dataflow on keyword inflections, major syntactic features, idiomatic collocations and subsenses, and treating these in separate data partitions, can improve performance while retaining the general dataflow benefits of decision lists.

One natural next step in this research is to evaluate the minimally supervised bootstrapping algorithm from Yarowsky (1995) on this data. Results on the word rabbit show a 24% increase in performance using bootstrapping on unannotated rabbit data over the supervised baseline. The major impediment to this work is the lack of discourse IDs in the data (or at least a matrix indicating those test sentences co-occurring in the same discourse). This information is crucial to the co-training of the one-sense-per-collocation and one-sense-per-discourse tendencies that enables the bootstrapping algorithm to gain new beachheads and robustly correct acquired errors or over-generalizations. Thus acquisition of some type of discourse or document IDs for the HECTOR sentences would potentially be a very rewarding investment.

Notes

1 The history-conditional probabilities are based on the residual data for which no earlier pattern in the decision list matches. While clearly more relevant, they are often much more poorly estimated because the size of the residual training data shrinks at each line of the decision list. A reasonable compromise is to interpolate between two conditional probabilities for any given feature f_i at line i of the list, β_i P(s_j | f_i) + (1 − β_i) P(s_j | ¬f_1 ∧ . . . ∧ ¬f_{i−1}), where β_i = 0 corresponds to the original Rivest (1987) decision list formulation.

2 Training time for a single linear decision list is typically under 2 seconds total elapsed clock time on a SPARC Ultra-2. Because there is often a natural hierarchical sequence of split question types, and because many combinations are unnecessary to consider (e.g. nmod and noun inflectional cases under the top-level LV=verb split), the total space of tested split combinations is typically (much) less than 1000, and hence very computationally tractable.

3 http://www.itri.bton.ac.uk/events/senseval/mne-uid.map

4 Note that small numbers of the make/give/break a promise senses 4.1, 4.2 and 4.3 are not caught by the specialized patterns in the mid-level decision list. There are several reasons for this. A majority of these few misses are due to parsing errors that failed to recognize the correct headword given unusually convoluted syntax. In some cases, there may be genuine ambiguity, as in sentence 800848 "that the promises given to him be kept", which is recognized as 4.2 = give a promise but was human labelled as 4.3 = keep a promise.

5 One explanation for this smaller drop is that the feature spaces for different parts of speech are somewhat orthogonal, making it relatively less costly to accommodate their separate decision threads in the same list.

References

Bruce, R. and J. Wiebe. "Word-sense disambiguation using decomposable models." Proceedings of ACL '94, Las Cruces, NM, 1994, pp. 139–146.
Collins, M. and Y. Singer. "Unsupervised models for named entity classification." Proc. of the 1999 Joint SIGDAT Conference, College Park, MD, 1999, pp. 100–110.
Gale, W., K. Church and D. Yarowsky. "A method for disambiguating word senses in a large corpus." Computers and the Humanities, 26 (1992), 415–439.
Golding, A. "A Bayesian hybrid method for context-sensitive spelling correction." Proceedings of the 3rd Workshop on Very Large Corpora, 1995, pp. 39–53.
Kilgarriff, A. "SENSEVAL: An exercise in evaluating word sense disambiguation programs." Proceedings of LREC, Granada, 1998, pp. 581–588.
Mooney, R. "Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning." Proc. of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, 1996, pp. 82–91.
Rivest, R. "Learning decision lists." Machine Learning, 2 (1987), 229–246.
Wilks, Y. and M. Stevenson. "Word sense disambiguation using optimised combinations of knowledge sources." Proceedings of COLING/ACL-98, 1998.
Yarowsky, D. "Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French." Proceedings of ACL '94, 1994, pp. 88–95.
Yarowsky, D. "Unsupervised word sense disambiguation rivaling supervised methods." Proceedings of ACL '95, 1995, pp. 189–196.
Yarowsky, D. "Homograph disambiguation in speech synthesis." In J. van Santen, R. Sproat, J. Olive and J. Hirschberg (eds.), Progress in Speech Synthesis, Springer-Verlag, 1997, pp. 159–175.

Computers and the Humanities 34: 187–192, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Using Semantic Classification Trees for WSD C. de LOUPY1,2, M. EL-BÈZE1 and P.-F. MARTEAU2 1 Laboratoire d’Informatique d’Avignon (LIA), BP 1228, F-84911 Avignon Cedex 9 France (E-mail: {claude.loupy,marc.elbeze}@lia.univ-avignon.fr); 2 Bertin Technologies, Z.I des Gatines –

B.P. 3, F-78373 Plaisir cedex (E-mail: {deloupy,marteau}@bertin.fr)

Abstract. This paper describes the evaluation of a WSD method within SENSEVAL . This method is based on Semantic Classification Trees (SCTs) and short context dependencies between nouns and verbs. The training procedure creates a binary tree for each word to be disambiguated. SCTs are easy to implement and yield some promising results. The integration of linguistic knowledge could lead to substantial improvement. Key words: semantic classification trees, SENSEVAL , word sense disambiguation, WSD evaluation

1. Introduction

While developing a set of Information Retrieval components (de Loupy et al., 1998a), the Laboratoire Informatique d'Avignon (LIA) and Bertin Technologies are investigating semantic disambiguation. In a Document Retrieval framework, identifying the senses associated with the words of a query is expected to lead to some noise reduction for short queries (Krovetz and Croft, 1992). As a second benefit, this knowledge should also result in an increase in recall through query expansion relying on synonymy and other semantic links. In de Loupy et al. (1998d), we experimented with this type of enrichment using WordNet (Miller et al., 1993). Performance was improved when words having a single sense (two if they are not frequent words) were associated with their synonyms. In de Loupy et al. (1998b), we evaluated a first approach based on WordNet, the SemCor (Miller et al., 1993b) and Bisem Hidden Markov Models. These models are not so well adapted to this task for 2 reasons: (i) the context window is too small (2 words), (ii) a very large amount of training corpus (so far unavailable) is required. Semantic Classification Trees (SCTs) (Kuhn and de Mori, 1995) offer an alternative to model right and left contexts jointly. Smaller learning resources are required. Short context dependencies are taken into account. We could have used a pure knowledge-based WSD system. But extending such a system to a large scale requires writing rules for each word. The SCT approach can be seen as an attractive trade-off because it allows building an automatic WSD system without excluding the possibility of introducing knowledge.


2. Preparation of the Data

The SCT method, which requires a training corpus, is well-suited to bring out relevant dependencies between the word to be disambiguated and the words (or types of words) surrounding it. As a first step, we have only attempted to tag nouns and verbs (adjectives have not been tested). More precisely, the evaluation of the proposed approach has been performed on 25 different words (see section 4 for the list). In order to train the models, we have used the examples given by DIC1 (24 examples per word on average) and TRAIN (315 examples per word on average). "Yarowsky [. . . ] suggests that local ambiguities need only a window of size k = 3 or 4, while semantic or topic-based ambiguities require a larger window of 20–50 words" (Ide and Véronis, 1998). Therefore, we have limited the context window size to 3 lemmas before the ambiguous word and 3 lemmas after. If two possible semantic tags are given for the ambiguous word in the learning sample, the information is duplicated to produce one example for each tag. The examples found in DIC and those extracted from TRAIN have been processed exactly in the same way and have the same weight for training.

In order to achieve better WSD, it is important (Dowding et al., 1994; Segond et al., 1993b) to take the grammatical tags of the words into account. For such a task, we have used our ECSTA tagger2 (Spriet and El-Bèze, 1997). Yarowsky (1993) highlights various behaviors based on syntactic categories: directly adjacent adjectives or nouns best disambiguate nouns. Our assumption is quite different; we would like to check to what extent verbs and nouns could disambiguate nouns and verbs.3 The words belonging to the three following grammatical classes are therefore not kept for the disambiguation process: determiners, adverbs and adjectives. The other words are replaced by their lemmas and unknown words are left unchanged.

Some words are so strongly related that, in almost all the cases, it is possible to replace one of them by another without any consequence for the sense of the sentence. For instance, it is not necessary to keep precise information on months. Hence, January, February, etc. are replaced by MONTH. In the same way, the pseudo-lemma DAY stands for Monday, etc., CD for a number, PRP for a pronoun, NNPL for a location (Paris, etc.), NNPG for a group (a company, etc.), NNP for the name of a person and UNK for an Out of Vocabulary word if its initial letter is an uppercase. These substitutions are intended to decrease the variability of the context in which a given word sense can be found. For example, in the definition of sack, sense 504767 ("the pillaging of a city") is given with the example: the horrors that accompany the sack of cities?. This sentence is used to produce the following context example: / horror / that / accompany / sack (504767) / of / city / ? /.

3. Semantic Classification Trees

A very short description of the SCT method is provided hereafter. For more information, one can refer to Kuhn and de Mori (1995). An SCT is a specialized classification tree that learns semantic rules from training data. Each node T of the binary tree contains a question that admits a "Yes/No" answer corresponding to the two branches attached to the node. The preprocessing procedure described in the previous section produces a set of learning samples. A set of questions4 corresponding to each sample is then constructed from the words found in the context of the word to be disambiguated. A quantity called Gini impurity (Breiman et al., 1984) is computed in order to choose the most appropriate question for a given node T. Let S be the set of semantic tags associated with the word to be disambiguated. The Gini impurity is given by i(T) = Σ_{j∈S} Σ_{k∈S, k≠j} p(j|T) × p(k|T), where p(j|T) is the probability of sense j given node T. For each node, the chosen question is the one which leads to a maximal decrease in impurity from the current node to its children, i.e., the one maximizing the change in impurity Δi = i(T) − p_y × i(Yes) − p_n × i(No), where p_y and p_n are the proportions of items respectively sent to Yes and No by the question.

Figure 1. An extract of the SCT for the noun sack.

For instance, let us consider the SCT represented in Figure 1, which has been created for the noun sack.5 Twelve senses are possible for sack-n. The symbols '<' and '>' mark the boundaries of a pattern; '+' indicates a gap of at least one word. For example, < + sack + potato . > models all the sack contexts for which sack is not the first word of the context, one or several words follow, then potato and a period occur. The sense assigned to sack for this context is σ5, that is 504756 ("a plain strong bag"), or σ10, that is 505121 ("sack of potatoes", "something big, inert, clumsy, etc."). The example given in section 2 gives the rule < + sack of city + > and corresponds to sense σ8, that is 504767 ("the pillaging of a city"). A linguist would not have used the same questions as the ones found automatically by the system. However, the score obtained for sack-n is good: 90.2% of correct assignment (the score of a systematic assignation of the most frequent tag being 50.4%).
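For concreteness, the node-splitting criterion can be sketched as follows (an illustration under assumed data structures, not the authors' implementation; questions are represented as arbitrary boolean tests over a preprocessed context):

```python
def gini_impurity(samples):
    """samples: list of (context, sense).  i(T) = sum_{j != k} p(j|T) p(k|T),
    which equals 1 - sum_j p(j|T)^2."""
    n = len(samples)
    if n == 0:
        return 0.0
    counts = {}
    for _, sense in samples:
        counts[sense] = counts.get(sense, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_question(samples, questions):
    """Pick the question maximizing the impurity decrease
    delta_i = i(T) - p_yes * i(Yes) - p_no * i(No).  samples assumed non-empty."""
    i_t = gini_impurity(samples)
    best, best_delta = None, 0.0
    for q in questions:
        yes = [s for s in samples if q(s[0])]
        no = [s for s in samples if not q(s[0])]
        p_yes, p_no = len(yes) / len(samples), len(no) / len(samples)
        delta = i_t - p_yes * gini_impurity(yes) - p_no * gini_impurity(no)
        if delta > best_delta:
            best, best_delta = q, delta
    return best, best_delta
```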


4. Evaluation of SCTs in SENSEVAL

Within SENSEVAL, the SCT method has been used for semantic tag assignment. The results obtained for the 25 words with fine-grained semantic tagging (high precision) are reported6 in Figure 2.

Figure 2. Score of the SCTs for the 25 words (the number of tests per word is given in parentheses).

One could argue that the most important thing for training is the number of examples for each sense-word pair. Indeed, the best scores are obtained for behaviour-n and amaze-v, for which there is a large number of samples (335 and 139 per sense, respectively). This is not the only explanation: scrap-v (13 samples per sense) has better results than promise-v (200 samples per sense) and derive-v (47 samples per sense). Since scrap-v has only 3 semantic tags, the task is obviously easier than for float-v (16 tags, 15 samples per sense). Lastly, the task for amaze-v is the easiest since there is only one sense! Like other systems tested in SENSEVAL, performance is, on average, better for nouns than for verbs.

It is difficult to compare the experiments carried out with the SCT method and with the HMM model described in de Loupy et al. (1998b) since training and test corpora are different. Moreover, the task described in de Loupy et al. (1998b) requires assignment of semantic tags to each word of the Brown corpus.


5. Conclusion

The approach described in this article has yielded some interesting results. Had we used more sophisticated questions when building the SCTs, results could have been better. Moreover, since little data is given for each semantic tag, we have used low thresholds in order to build wider trees.7 Therefore, some rules are too specific and do not reach the generalization objective. Other methods have been tested, leading to the conclusion that SCTs perform better than alternative approaches presented in de Loupy et al. (1998c) (0.51 precision for the other two methods on nouns). Further experiments are necessary in order to assess this result with more reliability.

This method is a numerical one and requires no expertise. Nevertheless, linguistic knowledge could be integrated into the whole process, particularly when drawing up the list of questions. For example, the following word is often a good way to determine the sense of a verb (ex: look around, look for, look after, . . . ). Moreover, the LIA is developing a French semantic lexicon within the framework of the EuroWordNet project (Ide et al., 1998) and intends, with the support of its industrial partner Bertin Technologies, to use it in a cross-language Document Retrieval frame. Future research will be focused on this topic.

Acknowledgements

We are indebted to Frédéric Béchet, Renato De Mori and Roland Kuhn for their help on the implementation of the SCT method.

Notes

1 DIC and TRAIN are used here as in SENSEVAL to abbreviate dictionary and training corpus.
2 ECSTA was evaluated for French in Spriet and El-Bèze (1997), but we do not have a real estimate

of its performance for English.
3 Within the SENSEVAL evaluation, we found that using nouns and verbs to disambiguate nouns improved the effectiveness from 6 to 34% compared to the use of adjectives and nouns, except for 3 nouns for which scores are similar (11.5% improvement on average). For the verbs it is not so clear since the average improvement is less than 2%.
4 Questions are formulated as regular expressions. An example is given in the following paragraph.
5 The noun sack is a better illustration of the SCT method than onion. The SCT for onion can be found in de Loupy et al. (1998c).
6 The SCTs always make a decision. Therefore, precision and recall are the same.
7 The use of high thresholds would lead to building very poor trees and even, with a very high threshold, reduce to a single node (the root), so that the most frequent tag would be systematically assigned.

References

Breiman, L., J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
Dowding, J., R. Moore, F. Andry and D. Moran. Interleaving Syntax and Semantics in an Efficient Bottom-up Parser. ACL-94, Las Cruces, New Mexico, 1994, pp. 110–116.
Ide, N., D. Greenstein and P. Vossen (eds). Special Issue on EuroWordNet. Computers and the Humanities, 32(2–3) (1998).
Ide, N. and J. Véronis. "Introduction to the Special Issue on WSD: The State of the Art". Computational Linguistics, 24(1) (March 1998), 1–40.
Krovetz, R. and W.B. Croft. "Lexical Ambiguity and Information Retrieval". ACM Transactions on Information Systems, 10(1).
Kuhn, R. and R. De Mori. "The Application of Semantic Classification Trees to Natural Language Understanding". IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5) (May 1995), 449–460.
de Loupy, C., P.-F. Marteau and M. El-Bèze. Navigating in Unstructured Textual Knowledge Bases. La Lettre de l'IA, No. 134-135-136, May 1998, pp. 82–85.
de Loupy, C., M. El-Bèze and P.-F. Marteau. Word Sense Disambiguation using HMM Tagger. LREC, Granada, Spain, May 28–30, 1998, pp. 1255–1258.
de Loupy, C., M. El-Bèze and P.-F. Marteau. WSD Based on Three Short Context Methods. SENSEVAL Workshop, Herstmonceux Castle, England, 2–4 September 1998, http://www.itri.brighton.ac.uk/events/senseval/.
de Loupy, C., P. Bellot, M. El-Bèze and P.-F. Marteau. Query Expansion and Classification of Retrieved Documents. TREC-7, Gaithersburg, Maryland, USA, 9–11 November 1998.
Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross and K. Miller. Introduction to WordNet: An On-Line Lexical Database. http://www.cogsci.princeton.edu/~wn, August 1993.
Miller, G., C. Leacock, T. Randee and R. Bunker. "A Semantic Concordance". In Proceedings of the 3rd DARPA Workshop on Human Language Technology. Plainsboro, New Jersey, 1993, pp. 303–308.
Segond, F., A. Schiller, G. Grefenstette and J.-P. Chanod. An Experiment in Semantic Tagging Using Hidden Markov Model Tagging. ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, Madrid, July 1997.
Spriet, T. and M. El-Bèze. "Introduction of Rules into a Stochastic Approach for Language Modelling". In Computational Models for Speech Pattern Processing, NATO ASI Series F, ed. K.M. Ponting, 1997.
Yarowsky, D. One Sense per Collocation. ARPA Human Language Technology Workshop, Princeton, NJ, 1993.

Computers and the Humanities 34: 193–197, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


Dictionary-Driven Semantic Look-up FRÉDÉRIQUE SEGOND1, ELISABETH AIMELET1, VERONIKA LUX1 and CORINNE JEAN2 1 Xerox Research Centre Europe, Meylan, France; 2 Université de Provence and Xerox Research

Centre Europe

1. Introduction

The French Semantic Dictionary Look-up (SDL) uses dictionary information about subcategorization and collocates to perform Word Sense Disambiguation (WSD). The SDL is fully integrated in a multilingual comprehension system which uses the Oxford Hachette French-English bilingual dictionary (OUP-H). Although the SDL works on all words both for French and English, Romanseval results are relevant for French verbs only, because subcategorisation and collocate information is richer for this part of speech in the OUP-H. The SDL uses dictionaries as semantically tagged corpora of different languages, making the methodology reusable for any language with existing on-line dictionaries. This paper first describes the system architecture as well as its components and resources. Second, it presents the work we did within Romanseval, namely sense mapping and results analysis.

2. Semantic Dictionary Look-Up: Goal, Architecture and Components

The SDL selects the most appropriate translation of a word appearing in a given context. It reorders dictionary entries making use of dictionary information. It is built on top of Locolex,1 an intelligent dictionary look-up device which achieves some word sense disambiguation using the word's context: part-of-speech and multiword expression (MWE)2 recognition. However, Locolex choices remain syntactic. Using the OUP-H information about subcategorization and collocates, the SDL goes one step further towards semantic disambiguation. To reorder dictionary entries the SDL uses the following components:
− the Xerox Linguistic Development Architecture (XeLDA),
− the Oxford University Press-Hachette bilingual French-English, English-French dictionary (OUP-H),
− the French Xerox Incremental Finite State Parser (XIFSP).


XeLDA is a linguistic development framework designed to provide developers and researchers with a common architecture for the integration of linguistic services. The OUP-H dictionary look-up and the French XIFSP are both integrated into XeLDA. The OUP-H (French-English),3 an SGML-tagged dictionary, is designed to be used for production, translation, or comprehension, by native speakers of either English or French. The SDL uses OUP subcategorization and collocate tags. Collocate tags encode the kind of subject and/or object a predicate expects. Most of the time, they are given as a list of words, sometimes as a concept.

To extract functional information from input text in order to match it against OUP-H information, we use the French XIFSP. XIFSP adds syntactic information at sentence level in an incremental way, depending on the contextual information available at a given stage. Of particular interest to us is the fact that shallow parsing allows fast automatic recognition and extraction of subject and object dependency relations from large corpora, using a cascade of finite-state transducers. The extraction of syntactic relations does not use subcategorisation information and relies on part-of-speech information only.

For instance, suppose the task is to disambiguate the verb présenter in the sentence: Des difficultés se présentent lorsque l'entreprise d'assurance n'exerce ses activités qu'en régime de libre prestation de services et s'en tient à la couverture de risques industriels. The SDL first calls the XIFSP, which parses the sentence and extracts syntactic relations, among which: SUBJREFLEX(difficulté, présenter). This relation encodes that difficulté is the subject of the reflexive usage of the verb présenter. This information is then matched against collocate information in the OUP-H for the verb présenter. Because matches are found (reflexive usage and collocate), the SDL reorders the dictionary entry and first proposes the translation "to arise, to present itself". If no dictionary information matches the context of the input sentence, it returns, by default, the first sense of the OUP-H.4 In case of information conflict between subcategorisation and collocates, priority is given to collocates.5
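Schematically, the reordering step can be pictured as follows (an illustrative sketch only; the data structures, field names and collocate lists are invented stand-ins for the Locolex/XeLDA output and the OUP-H entry format):

```python
def reorder_senses(entry, relations):
    """entry: list of sense dicts with optional 'reflexive', 'subject_collocates'
    and 'object_collocates' constraints; relations: shallow-parse output such as
    {'reflexive': True, 'subject': 'difficulté'}.  Matching senses are promoted;
    ties keep dictionary order, so the first sense remains the default."""
    def score(sense):
        s = 0
        if sense.get("reflexive") and relations.get("reflexive"):
            s += 1
        if relations.get("subject") in sense.get("subject_collocates", ()):
            s += 2          # collocate evidence outranks subcategorisation alone
        if relations.get("object") in sense.get("object_collocates", ()):
            s += 2
        return s
    return sorted(entry, key=score, reverse=True)   # stable sort preserves ties

presenter = [
    {"translation": "to present, to introduce"},                       # default first sense
    {"translation": "to arise, to present itself", "reflexive": True,
     "subject_collocates": {"difficulté", "problème", "occasion"}},    # hypothetical collocates
]
print(reorder_senses(presenter, {"reflexive": True, "subject": "difficulté"})[0]["translation"])
# -> to arise, to present itself
```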

3. Sense Mapping

Sense mapping is an additional source of discrepancy with the gold standard which has an influence on the evaluation of WSD systems. Mapping, in our case, consists of assigning a Larousse sense tag not to an example but to a sense that is usually illustrated by a number of examples in the OUP-H. We map two different sets of senses which usually do not have the same number of elements. On average, the OUP-H distinguishes more senses than Le Larousse for verbs (15.5 for OUP-H, 12.66 for Larousse) and fewer for nouns and adjectives (for nouns: 5.6 in OUP-H, 7.6 in Larousse; for adjectives: 4.8 in OUP-H, 6.3 in Larousse).6 Clearly, the fewer senses in the initial lexical resource used by the WSD system, the easier the mapping.

These differences show up between any two dictionaries, but in this case they are especially important because of two additional factors: first, the Petit Larousse is monolingual while the OUP-H is bilingual. Second, the Petit Larousse is a traditional dictionary with a clear encyclopedic bias while the OUP-H is corpus and frequency based. Being monolingual and intended for French native speakers, the Petit Larousse provides a sophisticated hierarchy of senses. Being bilingual and intended for non-native speakers, the OUP-H provides a flat set of senses. For the same reason, Larousse gives priority to semantics and provides only indicative syntactic information, while the OUP-H explicitly mentions all the most common syntactic constructions and distinguishes one sense for each of them.

Because of the mapping phase, the output of the SDL can be a disjunction of tags (one sense of the OUP-H maps to several senses of the Petit Larousse) or a question mark (one sense of the OUP-H does not map to any sense of Le Larousse, or the human mapper did not know). Another challenging issue for sense mapping concerns MWEs. While Larousse often includes MWEs in a given word sense, the OUP-H systematically lists them at the end of an entry with no link to any of the other senses. The OUP-H distinguishes one sense for each MWE. Following the OUP-H philosophy we did not attach any of the Larousse senses to the OUP-H MWEs. When the SDL identifies an (OUP-H) MWE, its output is a translation and not a sense tag of the Larousse. As a consequence, all MWEs that were correctly identified by the SDL (about 18% of the verb occurrences) were computed as wrong answers in the evaluation. Paradoxically, one of the SDL's strengths turns out to be a drawback within the ROMANSEVAL exercise.
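Such a mapping can be represented very simply; the sketch below is purely illustrative (the sense identifiers are invented) and shows the three possible outcomes described above: a single tag, a disjunction of tags, or a question mark:

```python
# Hypothetical mapping table: OUP-H sense id -> set of Larousse sense tags.
# An empty set plays the role of the question mark (no mapping / unknown).
oup_to_larousse = {
    "presenter.oup.1": {"larousse.2"},                 # one-to-one
    "presenter.oup.2": {"larousse.1", "larousse.4"},   # disjunction of tags
    "presenter.oup.3": set(),                          # '?': no corresponding sense
}

def map_sense(oup_sense):
    tags = oup_to_larousse.get(oup_sense, set())
    return sorted(tags) if tags else "?"

print(map_sense("presenter.oup.2"))   # ['larousse.1', 'larousse.4']
print(map_sense("presenter.oup.3"))   # ?
```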

4. Evaluation and Conclusion

For complete results and for a comparative analysis of these results with other systems, see Segond (this volume). One of the strengths of the ROMANSEVAL exercise has been to make us understand in greater detail the different factors that influence the evaluation of WSD systems. They include, for instance, the granularity of the dictionaries used by the system (definition dictionaries, bilingual dictionaries, ontologies), how MWEs are handled, as well as what the goal of a given WSD system is. Because what we are interested in is to see how much semantic disambiguation the SDL actually achieves according to our own dictionary (OUP-H) within our own application (comprehension aid), we computed another evaluation for the 20 verbs.7 In this evaluation, we obtain 70% precision and 33% recall. Precision is the number of verbs correctly tagged divided by the number of verbs tagged. By tagged verbs we mean verbs for which dictionary information has been used by the SDL to select


a meaning. Recall is the number of verbs correctly tagged divided by the total number of verbs. It gives an indication of how many times the information needed is encoded in the dictionary.

A study of the results shows that the system tagged 715 verbs out of 1502 verb occurrences. Among these 715 tagged verbs, 400 were tagged using MWE information and 315 using subcategorization and/or collocate information. Among the 400 tagged as MWEs, 279 were properly recognized. Wrong MWEs were recognized because of a too generous encoding of the possible variations of MWEs.8 Among the 315 senses selected using subcategorisation and collocate information, 225 were correctly selected. Incorrect ones are mainly due to the two following factors:
− subject/object extraction errors by the shallow parser,
− false prepositional phrase attachment.9

We see that MWE recognition achieves about 18% of the verb semantic disambiguation while subcategorization and collocates achieve about 14%. In this evaluation we did not take into account cases where we found the right tag using the first OUP-H sense by default. Two reasons guided this decision: first, we wanted to see how well the SDL performed when it actually performed a choice; second, since the first sense of the OUP-H usually does not map to the first sense of the Larousse, this information is difficult to interpret.

The encouraging results obtained for verbs can be improved by using more of the functional relations provided by the XIFSP and richer dictionary information. For instance, we could use relations such as the subject of relative clauses, or the indirect object. We are now working on combining the SDL with the semantic example-driven tagger developed with CELI.10 The resulting semantic disambiguation module, a dictionary-based semantic tagger, will use a rule database encoding together information about subcategorization, collocates and examples. Indeed, looking back at the overall evaluation exercise, we believe that the future of WSD lies not only in combining WSD methods, but also in creating WSD systems attached to a particular lexical resource which has been designed with a given goal. For instance, a WSD system attached to a general bilingual dictionary will perform better in helping to understand English texts from general newspapers than a general ontology containing few sense distinctions.

Notes

1 See Bauer et al. (1995).
2 Multiword expressions range from compounds (salle de bain 'bathroom') and fixed phrases (a priori) to idiomatic expressions (to sweep something under the rug).
3 See Oxford (1994).
4 Note that because of the encyclopedic vs corpus-frequency-based difference between OUP-H and Larousse, the first sense of Larousse often does not match the first sense of OUP-H.
5 A full description of the SDL can be found in (Segond et al., 1998).
6 Individual cases differ considerably from the average. Two particular verbs such as comprendre and parvenir, which respectively have 11 and 3 senses in the OUP-H, both have 4 senses in the Larousse.
7 No collocate information is attached to nouns in the OUP-H, and for the adjectives chosen very little collocate information was provided. When information is not present in the dictionary there is no way for us to perform any disambiguation.
8 Using local grammar rules, Locolex encodes morpho-syntactic variations of MWEs in the OUP-H. In some cases this encoding has been too generous, leading to the over-recognition of such expressions.
9 For instance, in the sentence "une aide destinée à couvrir les dettes des éleveurs" (help which is designed to cover the debts of breeders), the shallow parser analyzes "des éleveurs" as a VMODOBJ of "couvrir" instead of as a complement of the NP "les dettes". This is because in equivalent syntactic constructions such as "couvrir les gens d'or", "d'or" is VMODOBJ of "couvrir".
10 See Dini et al. (this volume).

References

Ait-Mokhtar, S. and J.-P. Chanod. "Subject and Object Dependency Extraction Using Finite-State Transducers". In Proceedings of the Workshop on Information Extraction and the Building of Lexical Semantic Resources for NLP Applications, ACL, Madrid, Spain, 1997.
Bauer, D., F. Segond and A. Zaenen. "LOCOLEX: The Translation Rolls Off Your Tongue". In Proceedings of ACH-ALLC, Santa Barbara, USA, 1995.
Breidt, L., G. Valetto and F. Segond. "Multiword Lexemes and Their Automatic Recognition in Texts". In Proceedings of COMPLEX, Budapest, Hungary, 1996a.
Breidt, L., G. Valetto and F. Segond. "Formal Description of Multi-word Lexemes with the Finite State Formalism IDAREX". In Proceedings of COLING, Copenhagen, Denmark, 1996b.
Larousse. Le petit Larousse illustré – dictionnaire encyclopédique. Ed. P. Maubourguet, Larousse, Paris, 1995.
Oxford-Hachette. The Oxford Hachette French Dictionary. Eds. M.-H. Corréard and V. Grundy, Oxford University Press-Hachette, 1994.
Segond, F., E. Aimelet and L. Griot. "'All You Can Use!' Or How to Perform Word Sense Disambiguation with Available Resources". In Second Workshop on Lexical Semantic Systems, Pisa, Italy, 1998.
Wilks, Y. and M. Stevenson. "Word Sense Disambiguation Using Optimised Combinations of Knowledge Sources". In Proceedings of COLING/ACL, Montreal, Canada, 1998.

Computers and the Humanities 34: 199–204, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


ROMANSEVAL: Results for Italian by SENSE STEFANO FEDERICI, SIMONETTA MONTEMAGNI and VITO PIRRELLI Istituto di Linguistica Computazionale – CNR, Via della Faggiola 32, Pisa, Italy (E-mail: {stefano,simo,vito}@ilc.pi.cnr.it)

Abstract. The paper describes SENSE, a word sense disambiguation system that makes use of different types of cues to infer the most likely sense of a word given its context. Architecture and functioning of the system are briefly illustrated. Results are given for the ROMANSEVAL Italian test corpus of verbs. Key words: analogy-based NLP, semantic similarity, word sense disambiguation

1. Word Sense Disambiguation by SENSE

SENSE (Self-Expanding linguistic knowledge-base for Sense Elicitation) is a specialised version of a general purpose language-learning system (Federici and Pirrelli, 1994; Federici et al., 1996) tailored for sense disambiguation (Federici et al., 1997, 1999). SENSE belongs to the family of example-based Word Sense Disambiguation (WSD) systems as it assigns, to an ambiguous word token Wk in a target context Cj, the sense with which Wk is tagged in another context similar or identical to Cj. Hereafter, a target word token Wk and its context Cj will jointly be referred to for convenience as the "input pattern". Knowledge of the way the senses of Wk appear in context comes from a repertoire of examples of use of word senses, or "Example Base" (EB). The EB of SENSE contains three basic types of such contexts: (i) subcategorisation patterns (e.g. an infinitival construction governed by a given sense of Wk), (ii) functionally annotated word co-occurrence patterns (e.g. the typical objects of Wk if the latter is a verb), (iii) fixed phraseological expressions.

The similarity of a target context Cj to the contexts in EB is measured differently depending on i, ii or iii. Contexts of type i and iii are dealt with through simple pattern-matching: Cj is either identical to another context in EB where Wk occurs (in which case the sense of Wk in that context is selected), or no answer is given. On the other hand, when Cj is part of a functionally annotated word co-occurrence pattern (type ii above), then similarity does not necessarily require full identity. This means that when SENSE fails to find an EB context identical to Cj, it tries to match Cj against a semantically similar context. Semantic similarity is


assessed through a “proportionally-based” similarity measure briefly illustrated in section 1.1. SENSE outputs either one sense of Wk in Cj (if only one sense is supported by EB), or a ranked list of possible alternative senses. The ranking procedure is sketched in section 1.2. 1.1.

1.1. PROPORTIONALLY-BASED SEMANTIC SIMILARITY

The key formal notion used by SENSE to compute similarity between non-identical (functionally annotated) contexts is “proportional analogy”. To illustrate, suppose that SENSE has to guess the sense of the Italian verb accendere in the pair accendere–pipa/O ‘light–pipe’ (where ‘pipe’ is tagged as a direct object) and that the input pattern in question is not already present in EB. Then the system goes into EB looking for functionally annotated patterns entering a proportion such as the following:

t1 fumare1–sigaretta1/O : t2 fumare1–pipa1/O = t3 accendere1–sigaretta1/O : t4 accendere?–pipa1/O
(‘smoke–cigarette/O’ : ‘smoke–pipe/O’ = ‘light–cigarette/O’ : ‘light–pipe/O’)

The proportion involves three EB verb-object pairs where the verb is sense-tagged (t1, t2 and t3), plus the input pattern accendere–pipa/O (t4, or “target term”). The proportion is solved by assigning accendere in t4 the sense accendere1, by analogical transfer from t3 (or “transfer term”). Intuitively, the proportion suggests that the sense of accendere in the input pattern is likely to be the same as the one in the pattern accendere1–sigaretta1, since pipa1 and sigaretta1 are found to be in complementary distribution relative to the same sense fumare1 of the verb fumare ‘smoke’. t1, or “pivot term”, plays the role of linking the target with the transfer term. We can say that analogical proportions are able to transfer word senses across sense-preserving contexts. Note further that here the similarity between contexts depends on Wk (e.g., accendere in the case at hand): ‘pipe’ and ‘cigarette’ are semantic neighbours only relative to some verbs (e.g. ‘smoke’ or ‘light’, as opposed to, e.g., ‘roll’ or ‘fill’).1 Observe that, in the analogical proportion above, nouns stand in the same syntactic relation to verbs. In other cases, however, clusters of nouns which function, say, as the object of a given verb sense also function as typical subjects of other related verb senses. This is captured through proportions of verb-noun pairs involving syntactically-asymmetric constructions, as exemplified below:

t1 rappresentare1–quadro1/S : t2 rappresentare1–foto1/S = t3 attaccare1–quadro1/O : t4 attaccare?–foto1/O
(‘show–painting/S’ : ‘show–photo/S’ = ‘hang_up–painting/O’ : ‘hang_up–photo/O’)


In the proportion, foto1 ‘photo’ and quadro1 ‘painting’ are semantically similar due to their both being subjects of the same sense of the verb rappresentare ‘represent’ (rappresentare1). This similarity is supposed to proportionally carry over to the case of the same two nouns being used as typical objects of attaccare ‘hang’. The inference is made that the sense of attaccare in the target term is attaccare1, by analogy to the transfer term attaccare1–quadro1/O. When proportions are found which support more than one sense interpretation of Wk, alternative interpretations are weighted according to their analogy-based support. The weight reflects: (i) the number of proportions supporting a given sense interpretation and (ii) the semantic entropy of the words in the pivot terms of the supporting proportions (calculated according to the Melamed (1997) definition, i.e. as log2(freq(Wk)), where “freq” counts the number of different functionally annotated EB patterns containing Wk).2
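To make the proportional search and the entropy-based weighting of this section concrete, a minimal Python sketch is given below. The miniature example base, the tuple encoding of patterns and the particular way the Melamed-style entropy enters the weight are our own illustrative assumptions; they are not taken from the actual SENSE implementation.

import math
from collections import defaultdict

# Toy sense-tagged example base: (verb, verb_sense, noun, role) patterns.
EB = [
    ("fumare", "fumare1", "sigaretta", "O"),
    ("fumare", "fumare1", "pipa", "O"),
    ("accendere", "accendere1", "sigaretta", "O"),
    ("accendere", "accendere2", "luce", "O"),
]

def pattern_count(word):
    # Melamed-style count: number of distinct EB patterns containing the word.
    return sum(1 for v, s, n, r in EB if word in (v, n))

def propose_senses(verb, noun, role):
    # Collect analogical support for each sense of `verb` in the pattern verb-noun/role.
    support = defaultdict(float)
    for v3, s3, n3, r3 in EB:                                   # candidate transfer terms t3
        if v3 != verb or r3 != role or n3 == noun:
            continue
        for v1, s1, n1, r1 in EB:                               # candidate pivot terms t1
            if v1 == verb or n1 != n3:
                continue
            # t2: the pivot verb sense must also occur with the target noun in the same role
            if any((v2, s2, n2, r2) == (v1, s1, noun, r1) for v2, s2, n2, r2 in EB):
                entropy = math.log2(max(pattern_count(v1), 2))  # guard against log2(1) = 0
                support[s3] += 1.0 / entropy                    # assumption: low-entropy pivots weigh more
    return sorted(support.items(), key=lambda kv: -kv[1])

print(propose_senses("accendere", "pipa", "O"))                 # [('accendere1', 1.0)]

Run on the toy data, the final call reproduces the accendere–pipa example above and returns accendere1 as the only supported sense.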

1.2. MULTI-CUE WSD AND RANKING OF RESULTS

We deal here with the way SENSE weighs multiple sense assignments depending on what type of EB context supports them. Input patterns are projected onto EB by looking for matching phraseological contexts first (if any), and then for functionally annotated word co-occurrence patterns. Syntactic frames are looked for only as a last resort. Existence of an EB lexical pattern (type ii or iii in section 1) identical to the input pattern is always given full credit, and the corresponding Wk sense is selected. For lack of identical lexical evidence, similar contexts are searched for through analogical proportions. If more than one sense is proportionally supported, the one with the heaviest analogical weight (section 1.1) is selected. Subcategorisation patterns are resorted to only when lexical evidence is inconclusive.

2. Experimental Setting

In the experiment reported here, SENSE is asked to assign senses to verb occurrences in the ROMANSEVAL test corpus on the basis of a bi-partitioned EB.

2.1. THE TEST CORPUS

The ROMANSEVAL test corpus contains 857 input patterns of 20 different polysemous verbs. The verbs show different degrees of polysemy: the number of senses ranges from the 16 senses of passare ‘pass’ to the 2 senses of prevedere ‘foresee’; on average, each verb has 5 different senses. Input patterns are fed into SENSE after a parsing stage (see Federici et al., 1998a,b) which outputs them as syntactically annotated patterns. These patterns are compatible with any of the three types of context in EB (section 1).

2.2. THE EXAMPLE BASE

In this experiment, SENSE uses a bi-partitioned EB. The first partition is a generic resource containing 17,359 functionally annotated verb-noun patterns (6,201 with subject, and 11,148 with object), with no indication of sense for either member of the pair. We will hereafter refer to this partition as the “unsupervised tank”. These patterns were automatically extracted (Montemagni, 1995) from both definitions and example sentences of the verb entries of a bilingual Italian-English dictionary (Collins, 1985) and a monolingual Italian dictionary (Garzanti, 1984). They represent the typical usage of 3,858 different verbs, each exemplified through a comparatively sparse number of patterns (on average 4.5 per verb). Although these patterns were originally sense-tagged on the verb, we could not use these tags, since (a) they referred to sense distinctions coming from different dictionaries, and (b) they could not easily be mapped onto ROMANSEVAL sense distinctions. The second partition is specific to each test word Wk: it contains a number of patterns attesting the different senses of Wk as defined by ROMANSEVAL. The patterns include: (i) patterns originally belonging to the unsupervised tank and manually sense-tagged; (ii) patterns extracted from the lexicon adopted in ROMANSEVAL as a reference resource. This partition contains a comparatively small number of patterns (an average of 31.6 per Wk) exemplifying an average of 6 contexts of use of each of Wk’s senses. Typical word co-occurrence patterns form 87% of the partition, subcategorisation patterns 10% and phraseological expressions about 3%. Note that only Wk is sense-tagged in these patterns, which thus act as “sense seeds” of Wk (Yarowsky, 1995).
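As a rough, assumed picture of how such a bi-partitioned resource might be laid out, together with the back-off order of section 1.2 (phraseology, then co-occurrence patterns, then subcategorisation), consider the sketch below. The partition layout, the invented patterns and the sense labels are ours and only indicate the shape of the data, not its actual content.

# Invented data illustrating the two EB partitions and the look-up order of section 1.2.
unsupervised_tank = [                      # (verb, noun, role) patterns, no sense tags
    ("fumare", "sigaretta", "O"),
    ("fumare", "pipa", "O"),
]

sense_seeds = {                            # per test verb: sense-tagged contexts of types i-iii
    "accendere": {
        "phraseology": {("accendere", "gli", "animi"): "accendere2"},
        "cooccurrence": {("sigaretta", "O"): "accendere1"},
        "subcat": {"NP-V-NP": ["accendere1", "accendere2"]},
    },
}

def disambiguate(verb, context):
    seeds = sense_seeds[verb]
    phrase = tuple(context.get("phrase", ()))
    if phrase in seeds["phraseology"]:                          # type iii: fixed expression
        return [seeds["phraseology"][phrase]]
    pair = (context.get("noun"), context.get("role"))
    if pair in seeds["cooccurrence"]:                           # type ii: identical pattern
        return [seeds["cooccurrence"][pair]]
    # a full system would try analogical proportions over the tank here (section 2.3)
    return seeds["subcat"].get(context.get("frame"), [])        # type i: last resort

print(disambiguate("accendere", {"noun": "sigaretta", "role": "O", "frame": "NP-V-NP"}))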

2.3. ANALOGICAL PROPORTIONS WITH A BI-PARTITIONED EB

In this section we briefly illustrate the way the bi-partite EB described above is used to establish analogical proportions. Given an input pattern, SENSE tries to establish analogical proportions by looking for the transfer term in the partition of sense seeds, while t1 and t2 are looked for in the unsupervised tank. Proportions of this sort are intuitively less constrained than those illustrated in section 1.1, since nouns in the proportion are no longer proved to be in complementary distribution relative to the same verb sense, but simply relative to the same verb. Relaxing this constraint was necessary since, as pointed out above, our EB combines sense distinctions coming from different dictionaries. This evaluation protocol amounts to testing analogy-based WSD in a fully unsupervised way.

3. Results

Results of the experiment are encouraging. Recall, calculated as the number of correct answers relative to the total number of input patterns, is 67% and precision 85%. Correct answers include: (a) one-sense answers (over 95% of the total); (b) more-than-one-sense answers, when the correct sense is given the topmost weight


together with a subset of the attested senses of Wk in EB. SENSE fails on 11% of the input patterns. Input patterns for which SENSE yields no answer amount to 22% of the total. Almost half of them (i.e. 86 out of 192) contain context words missing in EB for which no proportion could possibly be established. It is interesting to consider the individual contribution of each context type (see section 1) to the disambiguation task: 72% of SENSE correct answers are based on lexico-semantic patterns (either fixed phraseological expressions or typical word co-occurrence patterns representative of the selectional preferences of a specific verb sense); 28% are based on subcategorisation information. Analogical proportions contribute 52% of correct sense assignments.3 Note finally that, in the test sample, more-than-one-sense answers are always due to subcategorisation patterns.

4. Concluding Remarks

In this paper, we illustrated an analogy-based system for WSD capable of dealing with different types of linguistic evidence (syntactic and lexico-semantic), and reported the results obtained on the ROMANSEVAL test bed. One of the most innovative features of the system is that similarity between contexts is computed through analogical proportions which, in the reported experiment, are minimally constrained, i.e. they are based on a handful of sense-tagged contexts (or sense seeds) reliably extended through a set of untagged data (forming an unsupervised tank). This amounts to testing analogy-based WSD in a fully unsupervised mode, and it has an obvious bearing on the scalability and exportability of the proposed method. For a given Wk, one can “plug”, into EB, different sense subdivisions (e.g. exhibiting varying degrees of granularity), and disambiguate Wk in context accordingly. Moreover, the unsupervised tank can either be extended through new lexical patterns extracted from unrestricted texts, or specialised through addition of domain-specific contexts.

Notes

1 For a comparison between this operational notion of similarity and alternative proposals used in

other analogy-based systems for WSD, the reader is referred to Federici et al. (1999). 2 A detailed discussion of the weighting procedure can be found in Federici et al. (1997, 1999). 3 This figure is obtained by forcing SENSE to disambiguate all input patterns proportionally, i.e.

pretending that no input pattern was already present in the partition of sense seeds.

References Collins, G. M. English-Italian Italian-English Dictionary. London Firenze, Collins Giunti Marzocco, 1985. Federici, S. and V. Pirrelli. “Linguistic Analogy as a Computable Process”. In Proceedings of NeMLaP. Manchester, UK, 1994, pp. 8–14. Federici, S., S. Montemagni and V. Pirrelli. “Analogy and Relevance: Paradigmatic Networks as Filtering Devices”. In Proceedings of NeMLaP. Ankara, Turkey, 1996, pp. 13–24.


Federici, S., S. Montemagni and V. Pirrelli. “Inferring semantic similarity from Distributional Evidence: an Analogy-based Approach to Word Sense Disambiguation”. In Proceedings of the ACL/EACL Workshop “Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications” . Madrid, Spain, 12 July 1997. Federici, S., S. Montemagni and V. Pirrelli. “Chunking Italian: Linguistic and Task-oriented Evaluation”. In Proceedings of the LRE-98 Workshop on Evaluation of Parsing Systems. Granada, Spain, 26 May 1998, 1998a. Federici, S., S. Montemagni, V. Pirrelli and N. Calzolari. “Analogy-based Extraction of Lexical Knowledge from Corpora: the SPARKLE Experience”. In Proceedings of LRE-98, Granada, Spain, 28–30 May 1998, 1998b. Federici, S., S. Montemagni and V. Pirrelli. “SENSE: an Analogy-based Word Sense Disambiguation System”. In Natural Language Engineering, 1999. Garzanti. Il Nuovo Dizionario Italiano Garzanti. Garzanti, Milano, 1984. Melamed, D. “Measuring Semantic Entropy”. In Proceedings of SIGLEX Workshop on Tagging Text with Lexical Semantics: why, what and how?, ANLP’97, Washington, USA, 4–5 April 1997. Montemagni, S. Subject and Object in Italian Sentence Processing. PhD Dissertation, UMIST, Manchester, UK, 1995. Yarowsky, D. “Unsupervised word sense disambiguation rivaling supervised methods”. In Proceedings of ACL ’95, Cambridge, MA, June 1995, pp. 189–196.


Do Word Meanings Exist?
PATRICK HANKS
Oxford English Dictionaries

1. Introduction My contribution to this discussion is to attempt to spread a little radical doubt. Since I have spent over 30 years of my life writing and editing monolingual dictionary definitions, it may seem rather odd that I should be asking, do word meanings exist? The question is genuine, though: prompted by some puzzling facts about the data that is now available in the form of machine-readable corpora. I am not the only lexicographer to be asking this question after studying corpus evidence. Sue Atkins, for example, has said “I don’t believe in word meanings” (personal communication). It is a question of fundamental importance to the enterprise of sense disambiguation. If senses don’t exist, then there is not much point in trying to ‘disambiguate’ them – or indeed do anything else with them. The very term disambiguate presupposes what Fillmore (1975) characterized as “checklist theories of meaning.” Here I shall reaffirm the argument, on the basis of recent work in corpus analysis, that checklist theories in their current form are at best superficial and at worst misleading. If word meanings do exist, they do not exist as a checklist. The numbered lists of definitions found in dictionaries have helped to create a false picture of what really happens when language is used. Vagueness and redundancy – features which are not readily compatible with a checklist theory – are important design features of natural language, which must be taken into account when doing serious natural language processing. Words are so familiar to us, such an everyday feature of our existence, such an integral and prominent component of our psychological makeup, that it’s hard to see what mysterious, complex, vague-yet-precise entities meanings are. 2. Common Sense The claim that word meaning is mysterious may seem counterintuitive. To take a time-worn example, it seems obvious that the noun bank has at least two senses: ‘slope of land alongside a river’ and ‘financial institution’. But this line of argument is a honeytrap. In the first place, these are not, in fact, two senses of a single word;


they are two different words that happen to be spelled the same. They have different etymologies, different uses, and the only thing that they have in common is their spelling. Obviously, computational procedures for distinguishing homographs are both desirable and possible. But in practice they don’t get us very far along the road to text understanding. Linguists used to engage in the practice of inventing sentences such as “I went to the bank” and then claiming that it is ambiguous because it invokes both meanings of bank equally plausibly. It is now well known that in actual usage ambiguities of this sort hardly ever arise. Contextual clues disambiguate, and can be computed to make choice possible, using procedures such as that described in Church and Hanks (1989). On the one hand we find expressions such as: people without bank accounts; his bank balance; bank charges; gives written notice to the bank; in the event of a bank ceasing to conduct business; high levels of bank deposits; the bank’s solvency; a bank’s internal audit department; a bank loan; a bank manager; commercial banks; High-Street banks; European and Japanese banks; a granny who tried to rob a bank and on the other hand: the grassy river bank; the northern bank of the Glen water; olive groves and sponge gardens on either bank; generations of farmers built flood banks to create arable land; many people were stranded as the river burst its banks; she slipped down the bank to the water’s edge; the high banks towered on either side of us, covered in wild flowers. The two words bank are not confusable in ordinary usage. So far, so good. In a random sample of 1000 occurrences of the noun bank in the British National Corpus (BNC), I found none where the ‘riverside’ sense and the ‘financial institution’ sense were both equally plausible. However, this merely masks the real problem, which is that in many uses NEITHER of the meanings of bank just mentioned is fully activated. The obvious solution to this problem, you might think, would be to add more senses to the dictionary. And this indeed is often done. But it is not satisfactory, for a variety of reasons. For one, these doubtful cases (some examples are given below) do invoke one or other of the main senses to some extent, but only partially. Listing them as separate senses fails to capture the overlap and delicate interplay among them. It fails to capture the imprecision which is characteristic of words in use. And it fails to capture the dynamism of language in use. The problem is vagueness, not ambiguity. For the vast majority of words in use, including the two words spelled bank, one meaning shades into another, and indeed the word may be used in a perfectly natural but vague or even contradictory way. In any random corpus-based selection of citations, a number of delicate questions will arise that are quite difficult to resolve or indeed are unresolvable. For example: How are we to regard expressions such as ‘data bank’, ‘blood bank’, ‘seed bank’, and ‘sperm bank’? Are they to be treated as part of the ‘financial institution’


sense? Even though no finance is involved, the notion of storing something for safekeeping is central. Or are we to list these all as separate senses (or as separate lexical entries), depending on what is stored? Or are we to add a ‘catch-all’ definition of the kind so beloved of lexicographers: “any of various other institutions for storing and safeguarding any of various other things”? (But is that insufficiently constrained? What precisely is the scope of “any of various”? Is it just a lexicographer’s copout? Is a speaker entitled to invent any old expression – say, ‘a sausage bank’, or ‘a restaurant bank’, or ‘an ephemera bank’ – and expect to be understood? The answer may well be ‘Yes’, but either way, we need to know why.) Another question: is a bank (financial institution) always an abstract entity? Then what about 1?

1. [He] assaulted them in a bank doorway.

Evidently the reference in 1 is to a building which houses a financial institution, not to the institution itself. Do we want to say that the institution and the building which houses it are separate senses? Or do we go along with Pustejovsky (1995: 91), who would say that they are all part of the same “lexical conceptual paradigm (lcp)”, even though the superordinates (INSTITUTION and BUILDING) are different?

The lcp provides a means of characterizing a lexical item as a meta-entry. This turns out to be very useful for capturing the systematic ambiguities which are so pervasive in language. . . . Nouns such as newspaper appear in many semantically distinct contexts, able to function sometimes as an organization, a physical object, or the information contained in the articles within the newspaper.
a. The newspapers attacked the President for raising taxes.
b. Mary spilled coffee on the newspaper.
c. John got angry at the newspaper.

So it is with bank1. Sometimes it is an institution; sometimes it is the building which houses the institution; sometimes it is the people within the institution who make the decisions and transact its business. Our other bank word illustrates similar properties. Does the ‘riverside’ sense always entail sloping land? Then what about 2?

2. A canoe nudged a bank of reeds.

3. Ockham’s Razor

Is a bank always beside water? Does it have one slope or two? Is it always dry land? How shall we account for 3 and 4?

3. Philip ran down the bracken bank to the gate.
4. The eastern part of the spit is a long simple shingle bank.

Should 3 and 4 be treated as separate senses? Or should we apply Ockham’s razor, seeking to avoid a needless multiplicity of entities? How delicate do we want our sense distinctions to be? Are ‘river bank’, ‘sand bank’, and ‘grassy bank’ three different senses? Can a sand bank be equated with a shingle bank?


Then what about ‘a bank of lights and speakers’? Is it yet another separate sense, or just a further extension of the lcp? If we regard it as an extension of the lcp, we run into the problem that it has a different superordinate – FURNITURE, rather than LAND. Does this matter? There is no single correct answer to such questions. The answer is determined rather by the user’s intended application, or is a matter of taste. Theoretical semanticists may be more troubled than language users by a desire for clear semantic hierarchies. For such reasons, lexicographers are sometimes classified into ‘lumpers’ and ‘splitters’: those who prefer – or rather, who are constrained by marketing considerations – to lump uses together in a single sense, and those who isolate fine distinctions. We can of course multiply entities ad nauseam, and this is indeed the natural instinct of the lexicographer. As new citations are amassed, new definitions are added to the dictionary to account for those citations which do not fit the existing definitions. This creates a combinational explosion of problems for computational analysis, while still leaving many actual uses unaccounted for. Less commonly asked is the question, “Should we perhaps adjust the wording of an existing definition, to give a more generalized meaning?” But even if we ask this question, it is often not obvious how it is to be answered within the normal structure of a set of dictionary definitions. Is there then no hope? Is natural language terminally intractable? Probably not. Human beings seem to manage all right. Language is certainly vague and variable, but it is vague and variable in principled ways, which are at present imperfectly understood. Let us take comfort, procedurally, from Anna Wierzbicka (1985): An adequate definition of a vague concept must aim not at precision but at vagueness: it must aim at precisely that level of vagueness which characterizes the concept itself. This takes us back to Wittgenstein’s account of the meaning of game. This has been influential, and versions of it are applied quite widely, with semantic components identified as possible rather than necessary contributors to the meaning of texts. Wittgenstein, it will be remembered, wrote (Philosophical Investigations 66, 1953): Consider for example the proceedings that we call ‘games’. I mean board games, card games, ball games, Olympic games, and so on. What is common to them all? Don’t say, “There must be something common, or they would not be called ‘games’ ” – but look and see whether there is anything common to all. For if you look at them you will not see something common to all, but similarities, relationships, and a whole series of them at that. To repeat: don’t think, but look! Look for example at board games, with their multifarious relationships. Now pass to card games; here you find many correspondences with the first group, but many common features drop out, and others appear. When we pass next to ball games, much that is common is retained, but much is lost. Are they all ‘amusing’? Compare chess with noughts and crosses. Or


is there always winning and losing, or competition between players? Think of patience. In ball games there is winning and losing; but when a child throws his ball at the wall and catches it again, this feature has disappeared. Look at the parts played by skill and luck; and at the difference between skill in chess and skill in tennis. Think now of games like ring-a-ring-a-roses; here is the element of amusement, but how many other characteristic features have disappeared! And we can go through the many, many other groups of games in the same way; can see how similarities crop up and disappear. And the result of this examination is: we see a complicated network of similarities overlapping and criss-crossing: sometimes overall similarities, sometimes similarities of detail.
It seems, then, that there are no necessary conditions for being a bank, any more than there are for being a game. Taking this Wittgensteinian approach, a lexicon for machine use would start by identifying the semantic components of bank as separate, combinable, exploitable entities. This turns out to reduce the number of separate dictionary senses dramatically. The meaning of bank1 might then be expressed as:
• IS AN INSTITUTION
• IS A LARGE BUILDING
• FOR STORAGE
• FOR SAFEKEEPING
• OF FINANCE/MONEY
• CARRIES OUT TRANSACTIONS
• CONSISTS OF A STAFF OF PEOPLE
And bank2 as:
• IS LAND
• IS SLOPING
• IS LONG
• IS ELEVATED
• SITUATED BESIDE WATER
On any occasion when the word ‘bank’ is used by a speaker or writer, he or she invokes at least one of these components, usually a combination of them, but no one of them is a necessary condition for something being a ‘bank’ in either or any of its senses. Are any of the components of bank2 necessary? “IS LAND”? But think of a bank of snow. “IS SLOPING”? But think of a reed bed forming a river bank. “IS LONG”? But think of the bank around a pond or small lake. “IS ELEVATED”? But think of the banks of rivers in East Anglia, where the difference between the water level and the land may be almost imperceptible. “SITUATED BESIDE WATER”? But think of a grassy bank beside a road or above a hill farm.
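A checklist-free encoding of this kind is easy to mock up. In the sketch below, each spelling of bank is a bundle of optional components and invented trigger words activate a subset of them; the component names follow the lists above, but the triggers and the activation rule are our own assumptions rather than anything proposed in this article.

# Meaning potentials as bundles of optional components (none is a necessary condition).
BANK1 = {"INSTITUTION", "LARGE_BUILDING", "STORAGE", "SAFEKEEPING",
         "FINANCE", "TRANSACTIONS", "STAFF"}
BANK2 = {"LAND", "SLOPING", "LONG", "ELEVATED", "BESIDE_WATER"}

# Invented trigger words; a real inventory would come from corpus analysis.
TRIGGERS = {
    "account": {"FINANCE", "TRANSACTIONS", "INSTITUTION"},
    "doorway": {"LARGE_BUILDING"},
    "blood":   {"STORAGE", "SAFEKEEPING"},
    "river":   {"LAND", "BESIDE_WATER"},
    "grassy":  {"LAND", "SLOPING"},
}

def activated(context_words):
    # Union of the components triggered by the context; untriggered ones stay latent.
    active = set()
    for w in context_words:
        active |= TRIGGERS.get(w, set())
    return {"bank1": active & BANK1, "bank2": active & BANK2}

print(activated(["blood", "bank"]))   # STORAGE/SAFEKEEPING active, FINANCE latent
print(activated(["grassy", "bank"]))  # LAND/SLOPING active, BESIDE_WATER latent

The point of the toy output is that ‘blood bank’ activates STORAGE and SAFEKEEPING while FINANCE stays latent, and ‘grassy bank’ activates LAND and SLOPING while BESIDE WATER stays latent, which is exactly the kind of partial activation at issue here.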


4. Peaceful Coexistence

These components, then, are probabilistic and prototypical. The word “typically” should be understood before each of them. They do not have to be mutually compatible. The notion of something being at one and the same time an “(ABSTRACT) INSTITUTION and (PHYSICAL) LARGE BUILDING”, for example, may be incoherent, but that only means that the two components are not activated simultaneously. They can still coexist peacefully as part of the word’s meaning potential. By taking different combinations of components and showing how they combine, we can account economically and satisfactorily for the meaning in a remarkably large number of natural, ordinary uses. This probabilistic componential approach also allows for vagueness.

5. Adam sat on the bank among the bulrushes.

Is the component “IS SLOPING” present or absent in 5? The question is irrelevant: the component is potentially present, but not active. But it is possible to imagine continuations in which it suddenly becomes very active and highly relevant, for example if Adam slips down the bank and into the water. If our analytic pump is primed with a set of probabilistic components of this kind, other procedures can be invoked. For example, semantic inheritances can be drawn from superordinates (“IS A BUILDING” implies “HAS A DOORWAY” (cf. 1); “IS AN INSTITUTION” implies “IS COGNITIVE” (cf. 6)).

6. The bank defended the terms of the agreement.

What’s the downside? Well, it’s not always clear which components are activated by which contexts. Against this: if it’s not clear to a human being, then it can’t be clear to a computer. Whereas if it’s clear to a human being, then it is probably worth trying to state the criteria explicitly and compute over them. A new kind of phraseological dictionary is called for, showing how different aspects of word meaning are activated in different contexts, and what those contexts are, taking account of vagueness and variability in a precise way. See Hanks (1994) for suggestions about the form that such a phraseological dictionary might take. A corpus-analytic procedure for counting how many times each feature is activated in a collection of texts has considerable predictive power. After examining even quite a modest number of corpus lines, we naturally begin to form hypotheses about the relative importance of the various semantic components to the normal uses of the word, and how they normally combine. In this way, a default interpretation can be calculated for each word, along with a range of possible variations.

5. Events and Traces

What, then, is a word meaning? In the everyday use of language, meanings are events, not entities. Do meanings also exist outside the transactional contexts in which they are used? It is a convenient shorthand to talk about “the meanings of words in a dictionary”, but


strictly speaking these are not meanings at all. Rather, they are ‘meaning potentials’ – potential contributions to the meanings of texts and conversations in which the words are used, and activated by the speaker who uses them. We cannot study word meanings directly through a corpus any more satisfactorily than we can study them through a dictionary. Both are tools, which may have a lot to contribute, but they get us only so far. Corpora consist of texts, which consist of traces of linguistic behaviour. What a corpus gives us is the opportunity to study traces and patterns of linguistic behaviour. There is no direct route from the corpus to the meaning. Corpus linguists sometimes speak as if interpretations spring fully fledged, untouched by human hand, from the corpus. They don’t. The corpus contains traces of meaning events; the dictionary contains lists of meaning potentials. Mapping the one onto the other is a complex task, for which adequate tools and procedures remain to be devised. The fact that the analytic task is complex, however, does not necessarily imply that the results need to be complex. We may well find that the components of meaning themselves are very simple, and that the complexity lies in establishing just how the different components combine.

6. More Complex Potentials: Verbs Let us now turn to verbs. Verbs and nouns perform quite different clause roles. There is no reason to assume that the same kind of template is appropriate to both. The difference can be likened to that between male and female components of structures in mechanical engineering. On the one hand, the verbs assign semantic roles to the noun phrases in their environment. On the other hand, nouns (those eager suitors of verbs) have meaning potentials, activated when they fit (more or less well) into the verb frames. Together, they make human interaction possible. One of their functions, though not the only one, is to form propositions. Propositions, not words, have entailments. But words can be used as convenient storage locations for conventional phraseology and for the entailments or implications that are associated with those bits of phraseology. (Implications are like entailments, but weaker, and they can be probabilistic. An implicature is an act in which a speaker makes or relies on an implication.) Consider the different implications of these three fragments: 7. the two men who first climbed Mt Everest. 8. He climbed a sycamore tree to get a better view. 9. He climbed a gate into a field. 7 implies that the two men got to the top of Everest. 8 implies, less strongly, that the climber stopped part-way up the sycamore tree. 9 implies that he not only got to the top of the gate, but climbed down the other side. We would be hard put to it to answer the question, “Which particular word contributes this particular implicature?” Text meanings arise from combinations, not from any one word


individually. Moreover, these are default interpretations, not necessary conditions. So although 7′ may sound slightly strange, it is not an out-and-out contradiction.

7′. *They climbed Mount Everest but did not get to the top.

Meaning potentials are not only fuzzy, they are also hierarchically arranged, in a series of defaults. Each default interpretation is associated with a hierarchy of phraseological norms. Thus, the default interpretation of climb is composed of two components: CLAMBER and UP (see Fillmore 1982) – but in 10, 11 and 12 the syntax favours one component over the other. Use of climb with an adverbial of direction activates the CLAMBER component, but not the UP component.

10. I climbed into the back seat.
11. Officers climbed in through an open window.
12. A teacher came after me but I climbed through a hedge and sat tight for an hour or so.

This leads to a rather interesting twist: 13 takes a semantic component, UP, out of the meaning potential of climb and activates it explicitly. This is not mere redundancy: the word ‘up’ is overtly stated precisely because the UP component of climb is not normally activated in this syntactic context.

13. After breakfast we climbed up through a steep canyon.
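The syntax-sensitive behaviour of climb just illustrated can be written down as a small rule table. The sketch below is a toy reconstruction of the CLAMBER/UP analysis; the frame labels and the rule format are invented for illustration, not taken from this article.

# Default components of `climb` and how (invented) syntactic frames modulate them.
CLIMB_DEFAULT = {"CLAMBER", "UP"}

FRAME_RULES = {
    "direct_object":         {"CLAMBER", "UP"},  # "climbed Mt Everest": both components
    "directional_adverbial": {"CLAMBER"},        # "climbed into the back seat": UP suppressed
}

def climb_components(frame, particles=()):
    active = set(FRAME_RULES.get(frame, CLIMB_DEFAULT))
    if "up" in particles:          # "climbed up through a canyon": UP reinstated explicitly
        active.add("UP")
    return active

print(climb_components("direct_object"))                              # {'CLAMBER', 'UP'}
print(climb_components("directional_adverbial"))                      # {'CLAMBER'}
print(climb_components("directional_adverbial", particles=("up",)))   # {'CLAMBER', 'UP'}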

7. Semantic Indeterminacy and Remote Clues

Let us now look at some examples where the meaning cannot be determined from the phraseology of the immediate context. These must be distinguished from errors and other unclassifiables. The examples are taken from a corpus-based study of check. Check is a word of considerable syntactic complexity. Disregarding (for current purposes) an adjectival homograph denoting a kind of pattern (a check shirt), and turning off many other noises, we can zero in on the transitive verb check. This has two main sense components: INSPECT and CAUSE TO PAUSE/SLOW DOWN. Surely, as a transitive verb, check cannot mean both ‘inspect’ and ‘cause to pause or slow down’ at the same time? 14 and 15 are obviously quite different meanings.

14. It is not possible to check the accuracy of the figures.
15. The DPK said that Kurdish guerrillas had checked the advance of government troops north of Sulaimaniya.

But then we come to sentences such as 16–18.

16. Then the boat began to slow down. She saw that the man who owned it was hanging on to the side and checking it each time it swung.

Was the man inspecting it or was he stopping it? What is ‘it’? The boat or something else? The difficulty is only resolved by looking back through the story leading up to this sentence – looking back, in fact, to the first mention of ‘boat’ (16′).


16′. “Work it out for yourself,” she said, and then turned and ran. She heard him call after her and got into one of the swing boats with a pale, freckled little boy. . .

Now it is clear that the boat in this story has nothing to do with vessels on water; it is a swinging ride at a fairground. The man, it turns out, is trying to cause it to slow down (‘checking’ it) because of a frightened child. This is a case where the relevant contextual clues are not in the immediate context. If we pay proper attention to textual cohesion, we are less likely to perceive ambiguity where there is none.

17. The Parliamentary Assembly and the Economic and Social Committee were primarily or wholly advisory in nature, with very little checking power.

In 17, the meaning is perfectly clear: the bodies mentioned had very little power to INSPECT and CAUSE TO PAUSE. Perhaps an expert on European bureaucracy might be able to say whether one component or the other of check was more activated, but the ordinary reader cannot be expected to make this choice, and the wider context is no help. The two senses of check, apparently in competition, here coexist in a single use, as indeed they do in the cliché checks and balances. By relying too heavily on examples such as 14 and 15, dictionaries have set up a false dichotomy.

18. Corporals checked kitbags and wooden crates and boxes. . .

What were the corporals doing? It sounds as if they were inspecting something. But as we read on, the picture changes.

18′. Sergeants rapped out indecipherable commands, corporals checked kitbags and wooden crates and boxes into the luggage vans.

The word into activates a preference for a different component of the meaning potential of check, identifiable loosely as CONSIGN, and associated with the cognitive prototype outlined in 19.

19. PERSON check BAGGAGE into TRANSPORT

No doubt INSPECT is present too, but the full sentence activates an image of corporals with checklists. Which is more or less where we came in.

8. Where Computational Analysis Runs Out Finally, consider the following citation: 20. He soon returned to the Western Desert, where, between May and September, he was involved in desperate rearguard actions – the battle of Gazala, followed by Alamein in July, when Auchinleck checked Rommel, who was then within striking distance of Alexandria. Without encyclopedic world knowledge, the fragment . . . Alamein in July, when Auchinleck checked Rommel is profoundly ambiguous. I tried it on some English teenagers, and they were baffled. How do we know that Auchinleck was not checking Rommel for fleas or for contraband goods? Common sense may tell us


that this is unlikely, but what textual clues are there to support common sense? Where does the assignment of meaning come from?
• From internal text evidence, in particular the collocates? Relevant are the rather distant collocates battle, rearguard actions, and perhaps striking distance. These hardly seem close enough to be conclusive, and it is easy enough to construct a counterexample in the context of the same collocates (e.g. *before the battle, Auchinleck checked the deployment of his infantry).
• From the domain? If this citation were from a military history textbook, that might be a helpful clue. Unfortunately, the extract actually comes from an obituary in the Daily Telegraph, which the BNC very sensibly does not attempt to subclassify. But anyway, domain is only a weak clue. Lesk (1986) observed that the sort of texts which talk about pine cones rarely also talk about ice-cream cones, but in this case domain classification is unlikely to produce the desired result, since military texts do talk about both checking equipment and checking the enemy’s advance.
• From real-world knowledge? Auchinleck and Rommel were generals on opposing sides; the name of a general may be used metonymically for the army that he commands, and real-world knowledge tells us that armies check each other in the sense of halting an advance. This is probably close to psychological reality, but if it is all we have to go on, the difficulties of computing real-world knowledge satisfactorily start to seem insuperable.
• By assigning Auchinleck and Rommel to the lexical set [GENERAL]? This is similarly promising, but it relies on the existence of a metonymic exploitation rule of the following form: [GENERALi] checked [GENERALj] = [GENERALi]’s army checked (= halted the advance of) [GENERALj]’s army.
We are left with the uncomfortable conclusion that what seems perfectly obvious to a human being is deeply ambiguous to the more literal-minded computer, and that there is no easy way of resolving the ambiguity.

9. Conclusion

Do word meanings exist? The answer proposed in this discussion is “Yes, but . . . ” Yes, word meanings do exist, but traditional descriptions are misleading. Outside the context of a meaning event, in which there is participation of utterer and audience, words have meaning potentials, rather than just meaning. The meaning potential of each word is made up of a number of components, which may be activated cognitively by other words in the context in which it is used. These cognitive components are linked in a network which provides the whole semantic base of the language, with enormous dynamic potential for saying new things and relating the unknown to the known. The target of ‘disambiguation’ presupposes competition among different components or sets of components. And sometimes this is true. But we also find


that the different components coexist in a single use, and that different uses activate a kaleidoscope of different combinations of components. So rather than asking questions about disambiguation and sense discrimination (“Which sense does this word have in this text?”), a better sort of question would be “What is the unique contribution of this word to the meaning of this text?” A word’s unique contribution is some combination of the components that make up its meaning potential, activated by contextual triggers. Components that are not triggered do not even enter the lists in the hypothetical disambiguation tournament. They do not even get started, because the context has already set a semantic frame into which only certain components will fit. A major future task for computational lexicography will be to identify meaning components, the ways in which they combine, relations with the meaning components of semantically related words, and the phraseological circumstances in which they are activated. The difficulty of identifying meaning components, plotting their hierarchies and relationships, and identifying the conditions under which they are activated should not blind us to the possibility that they may at heart be quite simple structures: much simpler, in fact, than anything found in a standard dictionary. But different.

References Church, K.W. and P. Hanks. “Word Association Norms, Mutual Information, and Lexicography”, in Computational Linguistics 16:1, 1990. Fillmore, C.J. “An alternative to checklist theories of meaning” in Papers from the First Annual Meeting of the Berkeley Linguistics Society, 1975, pp. 123–132. Fillmore, C.J. “Towards a Descriptive Framework for Spatial Deixis” in Speech, Place, and Action, R.J. Jarvella and W. Klein, Eds. New York: John Wiley and Sons, 1982. Hanks, P. “Linguistic Norms and Pragmatic Exploitations, Or Why Lexicographers need Prototype Theory, and Vice Versa” in Papers in Computational Lexicography: Complex ’94, Eds. F. Kiefer, G. Kiss, and J. Pajzs, Budapest: Research Institute for Linguistics, 1994. Pustejovsky, J. The Generative Lexicon. Cambridge, MA: MIT Press, 1995. Wierzbicka, A. Lexicography and Conceptual Analysis. Ann Arbor, MI: Karoma, 1985. Wierzbicka, A. English Speech Act Verbs: A Semantic Dictionary. Sydney: Academic Press, 1987. Wittgenstein, L. Philosophical Investigations. Oxford: Basil Blackwell, 1953.


Consistent Criteria for Sense Distinctions
MARTHA PALMER
Department of Computer Science, IRCS, University of Pennsylvania, Philadelphia, PA 19104, USA (E-mail: [email protected])

Abstract. This paper specifically addresses the question of polysemy with respect to verbs, and whether or not the sense distinctions that are made in on-line lexical resources such as WordNet are appropriate for computational lexicons. The use of sets of related syntactic frames and verb classes is examined as a means of simplifying the task of defining different senses, and the importance of concrete criteria such as different predicate argument structures, semantic class constraints and lexical co-occurrences is emphasized.

1. Introduction The difficulty of achieving adequate hand-crafted semantic representations has limited the field of natural language processing to applications that can be contained within well-defined subdomains. The only escape from this limitation will be through the use of automated or semi-automated methods of lexical acquisition. However, the field has yet to develop a clear consensus on guidelines for a computational lexicon that could provide a springboard for such methods, in spite of all of the effort on different lexicon development approaches (Mel’cuk, 1988; Pustejovsky, 1991; Nirenburg et al., 1992; Copestake and Sanfilippo, 1993; Lowe et al., 1997; Dorr, 1997). One of the most controversial areas has to do with polysemy. What constitutes a clear separation into senses for any one verb or noun, and how can these senses be computationally characterized and distinguished? The answer to this question is the key to breaking the bottleneck of semantic representation that is currently the single greatest limitation on the general application of natural language processing techniques. In this paper we specifically address the question of polysemy with respect to verbs, and whether or not the sense distinctions that are made in on-line dictionary resources such as WordNet (Miller, 1990; Miller and Fellbaum, 1991), are appropriate for computational lexicons. We examine the use of sets of related syntactic frames and verb classes as a means of simplifying the task of defining different senses, and we focus on the mismatches between these types of distinctions and some of the distinctions that occur in WordNet.


2. Challenges in Building Large-Scale Lexicons

Computational lexicons are an integral part of any natural language processing system, and perform many essential tasks. Machine Translation (MT) and Information Retrieval (IR) both rely to a large degree on isolating the relevant senses of words in a particular phrase, and there is wide-spread interest in whether or not word sense disambiguation (WSD) can be performed as a separate self-contained task that would assist these applications.1 Information retrieval mismatches such as the retrieval of an article on plea bargaining (speedier trials and lighter sentences), given speed of light as a query, are caused by inadequate word sense disambiguation. These are clearly not the same senses of light (or even the same parts of speech), but a system would have to distinguish between WordNet light1, involving visible light, and WordNet light2, having to do with quantity or degree, in order to rule out this retrieval. However, it is possible that the lexically based statistical techniques currently employed in the best IR systems are already accomplishing a major portion of the WSD task, and a separate WSD stage would have little to add (Voorhees, 1999). Clear sense distinctions have a more obvious payoff in MT. For instance, in Korean, there are two different translations for the English verb lose, depending on whether it is a competition that has been lost or an object that has been misplaced: lose1, lose the battle – ci-ess-ta, and lose2, lose the report – ilepeli-ess-ta (Palmer et al., 1998). Whether or not WSD is a useful separate stage of processing for MT or part of an integrated approach, selecting the appropriate entry in a bilingual lexicon is critical to the success of the translation. The lose sense distinctions can be made by placing semantic class constraints on the object positions, i.e., +competition and +solid object respectively. The second constraint corresponds directly to a WordNet hypernym, but the first one does not. The closest correlate in WordNet would be +abstract activity, which is the common hypernym for both hostile military engagement and game, and which may discriminate sufficiently. Computational lexicons can most readily make sense distinctions based on concrete criteria such as:
− different predicate argument structures
− different semantic class constraints on verb arguments
− different lexical co-occurrences, such as prepositions
This seems straightforward enough, and traditional dictionaries usually have separate entries for transitive (two argument) and intransitive (one argument) verbs, as well as for verb particle constructions (with specific prepositions, as in break off). However, semantic class constraints are never made explicit in dictionaries, and lexicographers often refer to even more inaccessible implicit criteria. For instance, out of the ten senses that WordNet 1.6 gives for lose, we find one, WN2, that corresponds to our lose1 from above, the lose the battle sense, but two, WN1 and WN5, that correspond to our lose2, misplace an item.
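Before turning to the WordNet glosses for lose listed below, the kind of constraint checking just described can be sketched as follows. The toy hypernym chains and constraint labels stand in for a real taxonomy such as WordNet and are our own invention; the sketch shows only the shape of the test, not an actual lexicon entry.

# Invented toy hypernym chains standing in for a real taxonomy such as WordNet.
HYPERNYMS = {
    "battle": ["military_action", "competition", "abstract_activity"],
    "game":   ["activity", "competition", "abstract_activity"],
    "report": ["document", "solid_object", "physical_object"],
    "purse":  ["container", "solid_object", "physical_object"],
}

# Selectional constraints on the object position for the two senses of `lose`.
LOSE = [
    ("lose1_fail_to_win", "competition"),
    ("lose2_misplace",    "solid_object"),
]

def sense_of_lose(object_noun):
    classes = set(HYPERNYMS.get(object_noun, []))
    return [sense for sense, constraint in LOSE if constraint in classes]

print(sense_of_lose("battle"))   # ['lose1_fail_to_win']  (Korean ci-ess-ta)
print(sense_of_lose("report"))   # ['lose2_misplace']     (Korean ilepeli-ess-ta)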


− lose1 – WN2. lose – (fail to win; “We lost the battle but we won the war”)
− lose2 – WN1. (fail to keep or to maintain; cease to have, either physically or in an abstract sense; fail to keep in mind or in sight; “She lost her purse when she left it unattended on her seat”; “She lost her husband a year ago”)
− lose2 – WN5. (miss from one’s possessions; lose sight of; “I’ve lost my glasses again!”)
When we try to establish concrete criteria for distinguishing between WN1 (lost her purse) and WN5 (lost my glasses), we realize that these two WordNet senses are not distinguished because of anything to do with semantic class constraints on the verb arguments (an +animate Agent and a +solid object possessed by the Agent in both cases), but rather are distinguished by possible future events – namely the likelihood of the object being found. It is not reasonable to expect a computational lexicon to characterize all possible worlds in which an event can take place, and then distinguish between all possible outcomes. A more practical sense division for a computational lexicon would be [lose1 (losing competitions), lose2 (misplacing objects), lose3 (being bereft of loved ones)].2 We are not denying that a computational lexicon should include particular changes in the state of the world that are entailed by specific actions, quite the contrary (Palmer, 1990). However, the characterizations of these changes should be generally applicable, and cannot be so dependent on a single world context that they change with every new situation. Other areas of difference between computational lexicons and more traditional lexical resources have to do with the flexibility of the representation. Computational lexicons are particularly well suited to capturing hierarchical relationships and regular sense extensions based on verb class membership. For instance, the following two senses are among the 63 sense distinctions WordNet lists for break.
− break – WN2. break, separate, split up, fall apart, come apart – (become separated into pieces or fragments; “The figurine broke”; “The freshly baked loaf fell apart”)
− break – WN5. (destroy the integrity of; usually by force; cause to separate into pieces or fragments; “He broke the glass plate”; “She broke the match”)
They are shown as being related senses in WordNet 1.6, but the relationship is not made explicit. It is a simple task for a computational lexicon to specify the type of relationship, i.e., the transitive frame in WN5 is the causative form of WN2, with explicit inclusion of an Agent as an additional argument. In the XTAG English lexicon (Joshi et al., 1975; Vijay-Shanker, 1987), this is currently handled by associating both the intransitive/ergative and transitive tree families3 with the same syntactic database entry for break. In the transitive form the NP1 (Patient) becomes the Object and an NP0 (Agent) is added as the Subject. The +causative


Figure 1. An ergative verb and its causative sense extension.

semantic feature can be added as well, as illustrated in Figure 1.4 We are currently adding syntactic frames to the two related entries in WordNet 1.6, which, as well as making the definitions more consistent, helps to explicitly capture the sense extension. This resource, called VerbNet, will be available soon (Dang et al., 1998). In addition to regular extensions in meaning that derive from systematic changes in subcategorization frames, there are also regular extensions occasioned by the adjunction of optional prepositions, adverbials and prepositional phrases. For example, the basic meaning of push, He pushed the next boy, can be extended to explicitly indicate accompanied motion by the adjunction of a path prepositional phrase, as in He pushed the boxes across the room (Palmer et al., 1997; Dang et al., 1998), which corresponds to WN1 below. The possibility of motion of the object can be explicitly denied through the use of the conative, as in He pushed at the box, which is captured by WN5. Finally, the basic sense can also be extended to indicate a change of state of the object by the adjunction of apart, as in He pushed the boxes apart. There is no WordNet sense that corresponds to this, nor should there be. What is important is for the lexicon to provide the capability of recognizing and generating these usages where appropriate. If they are general enough to apply to entire classes of verbs, then they can be captured through regular adjunctions rather than being listed explicitly (for more details, see Bleam et al., 1998). − WN1. push, force – (move with force, “He pushed the table into a corner”; “She pushed her chin out”) − WN5. push – (press against forcefully without being able to move)
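The regular extensions discussed in this section lend themselves to a rule-like encoding. The sketch below derives a causative (transitive) entry from an ergative base and a motion reading from the adjunction of a path prepositional phrase; the entry format and feature names are our own illustration, not the XTAG or VerbNet representation.

# A base lexical entry and two regular extensions, applied as general rules.
def causative(entry):
    # Intransitive/ergative -> transitive: add an Agent subject, mark +causative.
    out = dict(entry)
    out["args"] = ["Agent"] + entry["args"]
    out["features"] = entry.get("features", []) + ["+causative"]
    return out

def with_path_pp(entry, pp):
    # Adjoined path PP ("across the room") adds accompanied motion to the basic sense.
    out = dict(entry)
    out["args"] = entry["args"] + [f"Path({pp})"]
    out["features"] = entry.get("features", []) + ["+motion"]
    return out

break_ergative = {"verb": "break", "args": ["Patient"], "features": []}
push_basic     = {"verb": "push",  "args": ["Agent", "Patient"], "features": []}

print(causative(break_ergative))                     # transitive break: Agent + Patient, +causative
print(with_path_pp(push_basic, "across the room"))   # push with accompanied motion of the object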

3. Conclusion It has been suggested that WordNet sense distinctions are too fine-grained and coarser senses are needed to drive the word sense disambiguation task. For instance, in defining cut, WordNet distinguishes between WN1, separating into pieces of a concrete object, WN29, cutting grain, WN30, cutting trees, and WN33,


cutting hair. For many purposes, the three more specialized senses, WN29, WN30 and WN33, which all involve separation into pieces of concrete objects, could be collapsed into the more coarse-grained WN1. However, when searching for articles on recent changes in hair styles, the more fine-grained WN33 would still be useful. Computational lexicons actually lend themselves readily to moving back and forth between elements of a hierarchical representation based on concrete criteria, and this type of structuring should become more prevalent. The point is that they operate most effectively in the realm of concrete criteria for sense distinctions, such as changes in argument structure, changes in sets of syntactic frames and/or changes in semantic class constraints, and lexical co-occurrences. Distinctions that are based on world knowledge, no matter how diverse, are much more problematic. We must bear this in mind in order to design a word sense disambiguation task that will also encourage rational, incremental development of computational lexicons.

Acknowledgements

We thank Aravind Joshi, the members of the XTAG group, Christiane Fellbaum and our reviewers. This work has been supported in part by NSF grants SBR 8920230 and IIS-9800658 and Darpa grant #N66001-94C-6043.

Notes

1 For a discussion of WSD and IR, see Krovetz and Croft (1992) and Sanderson (1994). 2 Obviously, semantic class constraints on the object would fail to distinguish between losing one’s

husband in the supermarket versus losing one’s spouse to cancer, and additional information such as adjuncts would have to be considered as well. 3 A tree family contains all of the syntactic realizations associated with a particular subcategorization frame, such as subject and object extraction and passive (XTAG-Group, 1995; Xia et al., 1999). 4 All of Levin’s break and bend verbs are given the same type of entry, as well as many other verbs (Levin, 1993; Dang et al., 1998).

References Bleam, T., M. Palmer and V. Shanker. “Motion Verbs and Semantic Features in Tag”. In Proceedings of the TAG+98 Workshop. Philadelphia, PA, 1998. Copestake, A. and A. Sanfilippo. ‘Multilingual Lexical Representation”. In Proceedings of the AAAI Spring Symposium: Building Lexicons for Machine Translation. Stanford, California, 1993. Dang, H.T., K. Kipper, M. Palmer and J. Rosenzweig. “Investigating Regular Sense Extensions Based on Intersective Levin Classes”. In Proceedings of Coling-ACL98. Montreal, CA, 1998. Dorr, B.J. “Large-Scale Dictionary Construction for Foreign Language Tutoring and Interlingual Machine Translation”. Machine Translation, 12 (1997), 1–55. Joshi, A.K., L. Levy and M. Takahashi. “Tree Adjunct Grammars”. Journal of Computer and System Sciences (1975). Krovetz, R. and W. Croft. “Lexical Ambiguity and Information Retrieval”. ACM Transactions on Information Systems, 10(2) (1992), 115–141. Levin, B. English Verb Classes and Alternations: A Preliminary Investigation. Chicago, IL: The University of Chicago Press, 1993.


Lowe, J., C. Baker and C. Fillmore. “A Frame-Semantic Approach to Semantic Annotation”. In Proceedings 1997 Siglex Workshop/ANLP97. Washington, D.C., 1997. Mel’cuk, I.A. “Semantic Description of Lexical Units in an Explanatory Combinatorial Dictionary: Basic Principles and Heuristic Criteria”. International Journal of Lexicography, 1(3) (1988), 165–188. Miller, G.A. “Wordnet: An On-Line Lexical Database”. International Journal of Lexicography, 3 (1990), 235–312. Miller, G.A. and C. Fellbaum (1991). “Semantic Networks of English”. Lexical and Conceptual Semantics, Cognition Special Issue. 1991, pp. 197–229. Nirenburg, S., J. Carbonell, M. Tomita and K. Goodman Machine Translation: A Knowledge-Based Approach. San Mateo, California, USA: Morgan Kaufmann, 1992. Palmer, M. “Customizing Verb Definitions for Specific Semantic Domains”. Machine Translation, 5 (1990). Palmer, M., C. Han, F. Xia, D. Egedi and J. Rosenzweig. “Constraining Lexical Selection Across Languages Using Tags”. In Tree Adjoining Grammars. Ed. A. Abeille and O. Rambow. Palo Alto, CA: CSLI, 1998. Palmer, M., J. Rosenzweig, H. Dang and F. Xia. “Capturing Syntactic/Semantic Generalizations in a Lexicalized Grammar”. Presentation in Working Session of Semantic Tagging Workshop, ANLP-97. 1997. Pustejovsky, J. “The Generative Lexicon”. Computational Linguistics, 17(4) (1991). Sanderson, M. “Word Sense Disambiguation and Information Retrieval”. In Proceedings of the 17th ACM SIGIR Conference. 1994, pp. 142–151. Vijay-Shanker, K. (1987). A Study of Tree Adjoining Grammars. PhD thesis, Department of Computer and Information Science, University of Pennsylvania. Voorhees, E.M. “Natural Language Processing and Information Retrieval”. In Proceedings of Second Summer School on Information Extraction, Lecture Notes in Artificial Intelligence. SpringerVerlag, 1999. Xia, F., M. Palmer and K. Vijay-Shanker. “Towards Semi-Automating Grammar Development”. In Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS-99). Beijing, China, 1999. XTAG-Group, A Lexicalized Tree Adjoining Grammar for English, Technical Report IRCS 95-03. University of Pennsylvania, 1995.

Computers and the Humanities 34: 223–234, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.

Cross-Lingual Sense Determination: Can It Work?
NANCY IDE
Department of Computer Science, Vassar College, 124 Raymond Avenue, Poughkeepsie, NY 12604-0520, USA (E-mail: [email protected])

Abstract. This article reports the results of a preliminary analysis of translation equivalents in four languages from different language families, extracted from an on-line parallel corpus of George Orwell's Nineteen Eighty-Four. The goal of the study is to determine the degree to which translation equivalents for different meanings of a polysemous word in English are lexicalized differently across a variety of languages, and to determine whether this information can be used to structure or create a set of sense distinctions useful in natural language processing applications. A coherence index is computed that measures the tendency for different senses of the same English word to be lexicalized differently, and from this data a clustering algorithm is used to create sense hierarchies.
Key words: parallel corpora, sense disambiguation, translation

1. Introduction
It is well known that the most nagging issue for word sense disambiguation (WSD) is the definition of just what a word sense is. At its base, the problem is a philosophical and linguistic one that is far from being resolved. However, work in automated language processing has led to efforts to find practical means to distinguish word senses, at least to the degree that they are useful for natural language processing tasks such as summarization, document retrieval, and machine translation. Several criteria have been suggested and exploited to automatically determine the sense of a word in context (see Ide and Véronis, 1998), including syntactic behavior, semantic and pragmatic knowledge, and especially in more recent empirical studies, word co-occurrence within syntactic relations (e.g., Hearst, 1991; Yarowsky, 1993), words co-occurring in global context (e.g., Gale et al., 1993; Yarowsky, 1992; Schütze, 1992, 1993), etc. No clear criteria have emerged, however, and the problem continues to loom large for WSD work.
The notion that cross-lingual comparison can be useful for sense disambiguation has served as a basis for some recent work on WSD. For example, Brown et al. (1991) and Gale et al. (1992a, 1993) used the parallel, aligned Hansard Corpus of Canadian Parliamentary debates for WSD, and Dagan et al. (1991) and Dagan and Itai (1994) used monolingual corpora of Hebrew and German and a bilingual dictionary. These studies rely on the assumption that the mapping between words and word senses varies significantly among languages. For example, the word duty

in English translates into French as devoir in its obligation sense, and impôt in its tax sense. By determining the translation equivalent of duty in a parallel French text, the correct sense of the English word is identified. These studies exploit this information in order to gather co-occurrence data for the different senses, which is then used to disambiguate new texts. In related work, Dyvik (1998) used patterns of translational relations in an English-Norwegian parallel corpus (ENPC, Oslo University) to define semantic properties such as synonymy, ambiguity, vagueness, and semantic fields and suggested a derivation of semantic representations for signs (e.g., lexemes), capturing semantic relationships such as hyponymy etc., from such translational relations. Recently, Resnik and Yarowsky (1997) suggested that for the purposes of WSD, the different senses of a word could be determined by considering only sense distinctions that are lexicalized cross-linguistically. In particular, they proposed that some set of target languages be identified, and that the sense distinctions to be considered for language processing applications and evaluation be restricted to those that are realized lexically in some minimum subset of those languages. This idea would seem to provide an answer, at least in part, to the problem of determining different senses of a word: intuitively, one assumes that if another language lexicalizes a word in two or more ways, there must be a conceptual motivation. If we look at enough languages, we would be likely to find the significant lexical differences that delimit different senses of a word. However, this suggestion raises several questions. For instance, it is well known that many ambiguities are preserved across languages (for example, the French intérêt and the English interest), especially languages that are relatively closely related. Assuming this problem can be overcome, should differences found in closely related languages be given lesser (or greater) weight than those found in more distantly related languages? More generally, which languages should be considered for this exercise? All languages? Closely related languages? Languages from different language families? A mixture of the two? How many languages, and of which types, would be “enough” to provide adequate information for this purpose? There is also the question of the criteria that would be used to establish that a sense distinction is “lexicalized cross-linguistically”. How consistent must the distinction be? Does it mean that two concepts are expressed by mutually noninterchangeable lexical items in some significant number of other languages, or need it only be the case that the option of a different lexicalization exists in a certain percentage of cases? Another consideration is where the cross-lingual information to answer these questions would come from. Using bilingual dictionaries would be extremely tedious and error-prone, given the substantial divergence among dictionaries in terms of the kinds and degree of sense distinctions they make. Resnik and Yarowsky (1997) suggest EuroWordNet (Vossen, 1998) as a possible source of information,

but, given that EuroWordNet is primarily a lexicon and not a corpus, it is subject to many of the same objections as for bilingual dictionaries. An alternative would be to gather the information from parallel, aligned corpora. Unlike bilingual and multi-lingual dictionaries, translation equivalents in parallel texts are determined by experienced translators, who evaluate each instance of a word's use in context rather than as a part of the meta-linguistic activity of classifying senses for inclusion in a dictionary. However, at present very few parallel aligned corpora exist. The vast majority of these are bi-texts, involving only two languages, one of which is very often English. Ideally, a serious evaluation of Resnik and Yarowsky's proposal would include parallel texts in languages from several different language families, and, to maximally ensure that the word in question is used in the exact same sense across languages, it would be preferable that the same text were used over all languages in the study. The only currently available parallel corpora for more than two languages are Orwell's Nineteen Eighty-Four (Erjavec and Ide, 1998), Plato's Republic (Erjavec et al., 1998), the MULTEXT Journal of the Commission corpus (Ide and Véronis, 1994), and the Bible (Resnik et al., in press). It is likely that these corpora do not provide enough appropriate data to reliably determine sense distinctions. Also, it is not clear how the lexicalization of sense distinctions across languages is affected by genre, domain, style, etc.
This paper attempts to provide some preliminary answers to the questions outlined above, in order to eventually determine the degree to which the use of parallel data is viable to determine sense distinctions, and if so, the ways in which this information might be used. Given the lack of large parallel texts across multiple languages, the study is necessarily limited; however, close examination of a small sample of parallel data can, as a first step, provide the basis and direction for more extensive studies.

2. Methodology
I have conducted a small study using parallel, aligned versions of George Orwell's Nineteen Eighty-Four (Erjavec and Ide, 1998) in five languages: English, Slovene, Estonian, Romanian, and Czech.1 The study therefore involves languages from four language families (Germanic, Slavic, Finno-Ugric, and Romance), two languages from the same family (Czech and Slovene), as well as one non-Indo-European language (Estonian). Nineteen Eighty-Four is a text of about 100,000 words, translated directly from the original English into each of the other languages. The parallel versions of the text are sentence-aligned to the English and tagged for part of speech. Although Nineteen Eighty-Four is a work of fiction, Orwell's prose is not highly stylized and, as such, it provides a reasonable sample of modern, ordinary language that is not tied to a given topic or sub-domain (such as newspapers, technical reports, etc.). Furthermore, the translations of the text seem to be relatively faithful to the original: for instance, over 95% of the sentence alignments in the full parallel corpus of seven languages are one-to-one (Priest-Dorman et al., 1997).

Four ambiguous English words were considered in this study: hard, line, country and head. Line and hard were chosen because they have served in various WSD studies to date (e.g., Leacock et al., 1993) and a corpus of occurrences of these words from the Wall Street Journal corpus was generously made available for comparison.2 Serve, another word frequently used in these studies, did not appear frequently enough in the Orwell text to be considered, nor did any other suitable ambiguous verb.3 Country and head were chosen as substitutes because they appeared frequently enough for consideration.
All sentences containing an occurrence or occurrences (including morphological variants) of each of the four words were extracted from the English text, together with the parallel sentences in which they occur in the texts of the four comparison languages (Czech, Estonian, Romanian, Slovene). The English occurrences were first separated according to part of speech, retaining the noun senses of line, country, and head, and the adjective and adverb senses of hard. As Wilks and Stevenson (1998) have pointed out, part-of-speech tagging accomplishes a good portion of the work of semantic disambiguation; therefore only occurrences with the same part of speech have been considered.4 The selected English occurrences were then grouped using the sense distinctions in WordNet (version 1.6) (Miller et al., 1990; Fellbaum, 1998). The sense categorization was performed by the author and two student assistants; results from the three were compared and a final, mutually agreeable grouping was established. The occurrence data for each sense of each of the four words is given in Table I.5
For each of the four comparison languages, the corpus of sense-grouped parallel sentences for English and that language was sent to a linguist and native speaker of the comparison language. The linguists were asked to provide the lexical item in each parallel sentence that corresponds to the ambiguous English word; if inflected, they were asked to provide both the inflected form and the root form. In addition, the linguists were asked to indicate the type of translation, according to the distinctions given in Table II. Additional information about possible synonyms, etc., was also asked for.
For over 85% of the English word occurrences (corresponding to types 1 and 2 in Table II), a specific lexical item or items could be identified as the translation equivalent for the corresponding English word. Translations of type 5, involving phrases whose meaning encompassed a larger phrase in the English, were considered to be translation equivalents on a case-by-case basis. For example, the Czech rendering of "grow[n] hard" is a single verb (closer in meaning to the English "harden") and as such was judged not to be an equivalent for "hard", whereas the translation of "stretch of country" in all four comparison languages by a single lexical word was considered to be equivalent, since the translation does not combine two (necessarily) separable concepts.6
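As a concrete illustration of this extraction and grouping step, the sketch below assumes the aligned corpus has already been loaded as parallel lists of English and target-language sentences and that the WordNet sense labels are supplied by hand, as in the study; the function names, the regular expression, and the toy data are illustrative and do not describe the actual MULTEXT-East processing.

import re
from collections import defaultdict

def extract_occurrences(english_sents, target_sents, lemma_pattern):
    """Return (index, English sentence, aligned sentence) triples for
    sentences containing the target word or a morphological variant."""
    pattern = re.compile(lemma_pattern, re.IGNORECASE)
    return [(i, en, tr)
            for i, (en, tr) in enumerate(zip(english_sents, target_sents))
            if pattern.search(en)]

def group_by_sense(occurrences, sense_of):
    """Group occurrences by the manually assigned WordNet sense;
    `sense_of` maps a sentence index to a sense tag such as '1.1'."""
    groups = defaultdict(list)
    for i, en, tr in occurrences:
        groups[sense_of.get(i, "unassigned")].append((en, tr))
    return dict(groups)

# Toy data standing in for one English/Slovene bitext:
en = ["It was a hard decision.", "The bed was too hard."]
sl = ["<aligned Slovene sentence 1>", "<aligned Slovene sentence 2>"]
occs = extract_occurrences(en, sl, r"\bhard(er|est)?\b")
print(group_by_sense(occs, {0: "1.1", 1: "1.3"}))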

Table I. Corpus statistics for parallel data from Orwell's Nineteen Eighty-Four

Word      Sense description (WordNet)                      WordNet sense #   # of OCC   Total OCC
hard      difficult                                        1.1               4
          metaphorically hard                              1.2               2
          not yielding to pressure; vs. "soft"             1.3               3
          very strong or vigorous, arduous                 1.4               1
          with force or vigor (adv.)                       2.1               2
          earnestly, intently (adv.)                       2.3               1          13
line      direction, course                                1.10              3
          acting in conformity                             1.16              1
          a linear string of words                         1.5               8
          contour, outline                                 1.4               3
          formation of people/things beside one another    1.1               1
          wrinkle, furrow, crease                          1.12              3
          logical argument                                 1.8               1
          something long, thin, flexible                   1.18              4
          fortified position                               1.7               1
          spatial location                                 1.11              2
          formation of people/things behind one another    1.3               1          28
country   a politically organized body of people           1.2               16
          area outside cities and towns                    1.5               3          19
head      part of the body                                 1.1               50
          intellect                                        1.3               12
          ruler, chief                                     1.4               2
          front, front part                                1.7               1          65

TOTAL NUMBER OF OCCURRENCES OF ALL WORDS: 125
TOTAL NUMBER OF SAMPLES (TOTAL OCC × 4 LANGUAGES): 500

Each translation equivalent was represented by its lemma (or the lemma of the root form in the case of derivatives), for comparison purposes, and associated with the WordNet sense to which it corresponds.7 In order to determine the degree to which the assigned sense distinctions correspond to translation equivalents, a coherence index (CI) was computed that measures the degree to which each pair of senses is translated using the same word as well as the consistency with which a given sense is translated with the same word.8

Table II. Translation types and their frequencies

Type   Meaning                                                                # OCC   % OCC
1      A single lexical item is used to translate the English equivalent      395     86%
       (possibly a different part of speech)
2      The English word is translated by a phrase of two or more words or     5       1%
       a compound, which has the same meaning as the single English word
3      The English word is not lexicalized in the translation                 29      6%
4      A pronoun is substituted for the English word in the translation       3       0.6%
5      An English phrase containing the ambiguous word is translated by a     28      6%
       single word in the comparison language which has a broader or more
       specific meaning, or by a phrase in which the specific concept
       corresponding to the English word is not explicitly lexicalized

Table III. Number of words used to translate the test words

WORD      # Senses   RO   ES   SL   CS
hard      6          8    7    5    6
country   3          2    4    3    4
line      11         9    14   12   11
head      4          9    6    9    4

Note that the CIs do not determine whether or not a sense distinction can be lexicalized in the target language, but only the degree to which they are lexicalized differently in the translated text. However, it can be assumed that the CIs provide a measure of the tendency to lexicalize different WordNet senses differently, which can in turn be seen as an indication of the degree to which the distinction is valid. For each ambiguous word, the CI is computed for each pair of senses, as follows:

CI(s_q, s_r) = \frac{\sum_{i=1}^{n} s^{(i)}}{m_{s_q} \cdot m_{s_r} \cdot n}

where:
• n is the number of comparison languages under consideration;
• m_{s_q} and m_{s_r} are the number of occurrences of sense s_q and sense s_r in the English corpus, respectively, including occurrences which have no identifiable translation;
• s^{(i)} is the number of times that senses q and r are translated by the same lexical item in language i, i.e., s^{(i)} = \sum_{x \in \mathrm{trans}(q),\ y \in \mathrm{trans}(r)} [x = y].

The CI is a value between 0 and 1, computed by examining clusters of occurrences translated by the same word in the other languages. If sense i and sense j are consistently translated with the same word in each comparison language, then CI(s_i, s_j) = 1; if they are translated with a different word in every occurrence, CI(s_i, s_j) = 0. In general, the CI for pairs of different senses provides an index of their relatedness, i.e., the greater the value of CI(s_i, s_j), the more frequently occurrences of sense i and sense j are translated with the same lexical item. When i = j, we obtain a measure of the coherence of a given sense.

The CIs were computed over four sets of comparison languages, in order to determine the effects of language-relatedness on the results:
• Estonian (Finno-Ugric), Romanian (Romance), and Czech and Slovene (Slavic);
• Estonian, Romanian, and Slovene (three different language families);
• Czech and Slovene (same language family);
• Romanian, Czech, and Slovene (Indo-European) for comparison with Estonian (non-Indo-European).
CIs were also computed for each language individually.

To better visualize the relationship between senses, a hierarchical clustering algorithm was applied to the CI data to generate trees reflecting sense proximity.9 Finally, in order to determine the degree to which the linguistic relation between languages may affect coherence, a correlation was run among CIs for all pairs of the four target languages.

Table IV. CIs for hard and head

Hard
WordNet Sense No   2.1    2.3    1.4    1.3    1.1    1.2
2.1                0.50   0.13   0.00   0.04   0.19   0.00
2.3                       1.00   0.25   0.50   0.00   0.00
1.4                              1.00   0.17   0.00   0.25
1.3                                     0.63   0.21   0.00
1.1                                            0.56   0.00
1.2                                                   0.50

Head
WordNet Sense No   1.1    1.3    1.4    1.7
1.1                0.69   0.53   0.12   0.40
1.3                       0.50   0.00   0.00
1.4                              0.45   0.07
1.7                                     1.00
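A minimal sketch of the CI computation defined above, in Python, assuming the translation equivalents for each sense have already been collected per language as described in the methodology; the data structures, sense labels, and toy lemmas are illustrative and this is not the code actually used in the study.

def coherence_index(trans_q, trans_r, m_q, m_r):
    """Compute CI(s_q, s_r) as defined in the text.

    trans_q, trans_r: dicts mapping each comparison language to the list of
    translation lemmas observed for occurrences of sense q (resp. r);
    occurrences with no identifiable equivalent are absent from the lists
    but still counted in m_q / m_r.
    m_q, m_r: number of occurrences of the two senses in the English text.
    """
    langs = set(trans_q) | set(trans_r)
    n = len(langs)
    total = 0
    for lang in langs:
        xs = trans_q.get(lang, [])
        ys = trans_r.get(lang, [])
        # s(i): pairs (x, y) with x from trans(q), y from trans(r) and x == y
        total += sum(1 for x in xs for y in ys if x == y)
    return total / (m_q * m_r * n) if n else 0.0

# Toy example for two senses of "head" in two hypothetical comparison languages:
trans_body = {"ro": ["cap", "cap", "cap"], "sl": ["glava", "glava", "glava"]}
trans_mind = {"ro": ["cap", "minte"], "sl": ["glava", "um"]}
print(coherence_index(trans_body, trans_mind, 3, 2))   # cross-sense CI: 0.5
print(coherence_index(trans_body, trans_body, 3, 3))   # internal consistency: 1.0
# For the clustering step, 1 - CI can serve as a distance between senses.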

Figure 1. Cluster tree and distance measures for the six senses of hard.

Figure 2. Cluster tree and distance measures for the four senses of head.

3. Results
Although the data sample is small, it gives some insight into ways in which a larger sample might contribute to sense discrimination. The CI data for hard and head are given in Table IV. CIs measuring the affinity of a sense with itself – that is, the tendency for all occurrences of that sense to be translated with the same word – show that all six senses of hard have greater internal consistency than affinity with other senses, with senses 1.1 ("difficult" – CI = 0.56) and 1.3 ("not soft" – CI = 0.63) registering the highest internal consistency.10 The same holds true for three of the four senses of head, while the CI for senses 1.3 ("intellect") and 1.1 ("part of the body") is higher than the CI for 1.3/1.3.
Figure 1 shows the sense clusters for hard generated from the CI data.11 The senses fall into two main clusters, with the two most internally consistent senses (1.1 and 1.3) at the deepest level of each of the respective groups. The two adverbial forms12 are placed in separate groups, reflecting their semantic proximity to the different adjectival meanings of hard. The clusters for head (Figure 2) similarly show two distinct groupings, each anchored in the two senses with the highest internal consistency and the lowest mutual CI ("part of the body" (1.1) and "ruler, chief" (1.4)).
The hierarchies apparent in the cluster graphs make intuitive sense. Structured like dictionary entries, the clusters for hard and head might appear as in Figure 3.

Figure 3. Clusters for hard and head structured as dictionary entries.

This is not dissimilar to actual dictionary entries for hard and head; for example, the entries for hard in four differently constructed dictionaries (Collins English (CED), Longman's (LDOCE), Oxford Advanced Learner's (OALD), and COBUILD) all list the "difficult" and "not soft" senses first and second, which, since most dictionaries list the most common or frequently used senses first, reflects the gross division apparent in the clusters. Beyond this, it is difficult to assess the correspondence between the senses in the dictionary entries and the clusters. The remaining WordNet senses are scattered at various places within the entries or, in some cases, split across various senses. The hierarchical relations apparent in the clusters are not reflected in the dictionary entries, since the senses are for the most part presented in flat, linear lists. However, it is interesting to note that the first five senses of hard in the COBUILD dictionary, which was constructed on the basis of corpus examples and presents senses in order of frequency, correspond to five of the six WordNet senses in this study; WordNet's "metaphorically hard" is spread over multiple senses in the COBUILD, as it is in the other dictionaries.
The results for different language groupings show that the tendency to lexicalize senses differently is not affected by language distance (Table V). The mean CI for Estonian, the only non-Indo-European language in the study, is lower than that for any other group, indicating that WordNet sense distinctions are slightly less likely to be clearly distinguished in Estonian. However, the difference (z = –1.43) is not statistically significant. Correlations of CIs for each language pair (Table VI) also show no relationship between the degree to which sense distinctions are lexicalized differently and language distance. This is contrary to results obtained by Resnik and Yarowsky (submitted), who found that non-Indo-European languages tended to lexicalize English sense distinctions, especially at finer-grained levels, more than Indo-European languages. However, their translation data was generated by native speakers presented with isolated sentences in English who were asked to provide the translation for a given word in the sentence. It is not clear how this data compares to translations generated by trained translators working with full context.
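For readers who want to reproduce the kind of pairwise comparison summarized in Table VI, a small sketch follows; it assumes the CI values computed from each language alone have been collected into equal-length lists, and the numbers below are made up for illustration only.

import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of CI values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# ci_ro, ci_sl, ... would hold, for one language each, the CIs of every sense
# pair of every test word; two short made-up vectors stand in for them here.
ci_ro = [0.50, 0.13, 0.00, 0.63, 0.21]
ci_sl = [0.45, 0.10, 0.05, 0.70, 0.25]
print(round(pearson(ci_ro, ci_sl), 2))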

4. Summary
The small sample in this study suggests that cross-lingual lexicalization can be used to define and structure sense distinctions. The cluster graphs above provide information about relations among WordNet senses that could be used, for example, to determine the granularity of sense differences, which in turn could be used in tasks such as machine translation, information retrieval, etc.

Table V. Average CI values for language groupings

Language group   Average CI
ALL              0.27
RO/ES/SL         0.28
SL/CS            0.28
RO/SL/CS         0.27
ES               0.26

Table VI. Correlation among CIs for the four target languages

Language pair   Correlation
ES/CS           0.74
RO/SL           0.80
RO/CS           0.72
SL/CS           0.71
RO/ES           0.73
ES/SL           0.80

For example, it is likely that as sense distinctions become finer, the degree of error is less severe. Resnik and Yarowsky (1997) suggest that confusing finer-grained sense distinctions should be penalized less severely than confusing grosser distinctions when evaluating the performance of sense disambiguation systems. The clusters also provide insight into the lexicalization of sense distinctions related by various semantic relations (metonymy, meronymy, etc.) across languages; for instance, the "part of the body" and "intellect" senses of head are lexicalized with the same item a significant portion of the time across all languages, information that could be used in machine translation. In addition, cluster data such as that presented here could be used in lexicography, to determine a more detailed hierarchy of relations among senses in dictionary entries.
It is less clear how cross-lingual information could be used to determine sense distinctions independent of a pre-defined set, such as the WordNet senses used here. More work needs to be done on this topic utilizing substantially larger parallel corpora that include a variety of language types. We are currently experimenting with clustering occurrences rather than senses (similar to Schütze, 1992),

as well as using WordNet synsets and "back translations" (i.e., additional translations in the original language of the translations in the target language) to create semantic groupings, which could provide additional information for determining sense distinctions.

Acknowledgements
The author would like to gratefully acknowledge the contribution of those who provided the translation information: Tomaz Erjavec (Slovene), Vladimir Petkevic (Czech), Dan Tufis (Romanian), and Kadri Muischnek (Estonian); as well as Dana Fleur and Daniel Kline, who helped to transcribe and evaluate the data. Special thanks to Dan Melamed and Hinrich Schütze for their helpful comments on earlier drafts of the paper.

Notes
1 The Orwell parallel corpus also includes versions of Nineteen Eighty-Four in Hungarian, Bulgarian, Latvian, Lithuanian, Serbian, and Russian.
2 Claudia Leacock provided samples of hard and line from the Wall Street Journal corpus.
3 The verb sense of line does not occur in the English Orwell.
4 Both the adjective and adverb senses of hard were retained because the distinction is not consistent across the translations used in the study.
5 The sense inventories and parallel corpus extracts used in this analysis are available at http://www.cs.vassar.edu/∼ide/wsd/.
6 That all four languages use a single lexical item to express this concept itself provides some basis to regard "stretch of country" as a collocation expressing a single concept.
7 The number of translation equivalents for each word in the analysis is given in Table III.
8 Note that the CI is similar in concept to semantic entropy (Melamed, 1997). However, Melamed computes entropy for word types, rather than word senses.
9 Developed by Andreas Stolcke.
10 Senses 2.3 and 1.4 have CIs of 1 because each of these senses exists in a single occurrence in the corpus; they have therefore been discarded from consideration of CIs for individual senses. We are currently investigating the use of the Kappa statistic (Carletta, 1996) to normalize these sparse data.
11 For the purposes of the cluster analysis, CIs of 1.00 resulting from a single occurrence were normalized to 0.5.
12 Because root forms were used in the analysis, no distinction in translation equivalents was made for part of speech.

References
Carletta, J. "Assessing Agreement on Classification Tasks: The Kappa Statistic". Computational Linguistics, 22(2) (1996), 249–254.
Dagan, I. and A. Itai. "Word Sense Disambiguation Using a Second Language Monolingual Corpus". Computational Linguistics, 20(4) (1994), 563–596.
Dagan, I., A. Itai and U. Schwall. "Two Languages Are More Informative Than One". Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, 18–21 June 1991, Berkeley, California, 1991, pp. 130–137.

Dyvik, H. "Translations as Semantic Mirrors". Proceedings of Workshop W13: Multilinguality in the Lexicon II, The 13th Biennial European Conference on Artificial Intelligence (ECAI 98), Brighton, UK, 1998, pp. 24–44.
Erjavec, T. and N. Ide. "The MULTEXT-EAST Corpus". Proceedings of the First International Conference on Language Resources and Evaluation, 27–30 May 1998, Granada, 1998, pp. 971–974.
Erjavec, T., A. Lawson and L. Romary. "East Meets West: Producing Multilingual Resources in a European Context". Proceedings of the First International Conference on Language Resources and Evaluation, 27–30 May 1998, Granada, 1998, pp. 981–986.
Fellbaum, C. (ed.). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
Gale, W. A., K. W. Church and D. Yarowsky. "A Method for Disambiguating Word Senses in a Large Corpus". Computers and the Humanities, 26 (1992), 415–439.
Hearst, M. A. "Noun Homograph Disambiguation Using Local Context in Large Corpora". Proceedings of the 7th Annual Conference of the University of Waterloo Centre for the New OED and Text Research, Oxford, United Kingdom, 1991, pp. 1–19.
Ide, N. and J. Véronis. "Word Sense Disambiguation: The State of the Art". Computational Linguistics, 24(1) (1998), 1–40.
Leacock, C., G. Towell and E. Voorhees. "Corpus-based Statistical Sense Resolution". Proceedings of the ARPA Human Language Technology Workshop, Morgan Kaufmann: San Francisco, 1993.
Melamed, I. D. "Measuring Semantic Entropy". ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How?, April 4–5, 1997, Washington, D.C., 1997, pp. 41–46.
Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross and K. J. Miller. "WordNet: An On-line Lexical Database". International Journal of Lexicography, 3(4) (1990), 235–244.
Priest-Dorman, G., T. Erjavec, N. Ide and V. Petkevic. Corpus Markup. COP Project 106 MULTEXT-East Deliverable D2.3 F. Available at http://nl.ijs.si/ME/CD/docs/mte-d23f/mte-D23F.html, 1997.
Resnik, P., M. Broman Olsen and M. Diab (in press). "Creating a Parallel Corpus from the Book of 2000 Tongues". Computers and the Humanities.
Resnik, P. and D. Yarowsky (submitted). "Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation". Submitted to Natural Language Engineering.
Resnik, P. and D. Yarowsky. "A Perspective on Word Sense Disambiguation Methods and Their Evaluation". ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How?, April 4–5, 1997, Washington, D.C., 1997, pp. 79–86.
Schütze, H. "Dimensions of Meaning". Proceedings of Supercomputing '92. Los Alamitos, California: IEEE Computer Society Press, 1992, pp. 787–796.
Schütze, H. "Word Space". In Advances in Neural Information Processing Systems 5. Eds. S. J. Hanson, J. D. Cowan and C. L. Giles. San Mateo, California: Morgan Kaufmann, 1993, pp. 895–902.
Vossen, P. (ed.). "EuroWordNet: A Multilingual Database with Lexical Semantic Networks". Computers and the Humanities, 32 (1998), 2–3.
Wilks, Y. and M. Stevenson. "Word Sense Disambiguation Using Optimized Combinations of Knowledge Sources". Proceedings of COLING/ACL-98, Montreal, August 1998.
Yarowsky, D. "Word Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora". Proceedings of the 14th International Conference on Computational Linguistics, COLING'92, 23–28 August, Nantes, France, 1992, pp. 454–460.
Yarowsky, D. "One Sense per Collocation". Proceedings of the ARPA Human Language Technology Workshop, Princeton, New Jersey, 1993, pp. 266–271.

Computers and the Humanities 34: 235–243, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.

Is Word Sense Disambiguation Just One More NLP Task?
YORICK WILKS
Department of Computer Science, University of Sheffield, Sheffield, UK (E-mail: [email protected])

Abstract. The paper examines the task of Word Sense Disambiguation (WSD) critically and compares it with Part of Speech (POS) tagging, arguing that the ability of a writer to create new senses distinguishes the tasks and makes it more problematic to test WSD by the mark-up-and-model paradigm, because new senses cannot be marked up against dictionaries. This serves to set WSD apart and puts limits on its effectiveness as an independent NLP task. Moreover, it is argued that current WSD methods based on very small word samples are also potentially misleading because they may or may not scale up. Since all-word WSD methods are now available and are producing figures comparable to the smaller scale tasks, it is argued that we should concentrate on the former and find ways of bootstrapping test materials for such tests in the future.
Key words: Word Sense Disambiguation, lexical tuning, part of speech tagging, lexical rules, vagueness

I want to make clear right away that I am not writing as a sceptic about word-sense disambiguation (WSD), let alone as a recent convert. On the contrary, my PhD thesis was on the topic thirty years ago (Wilks, 1968) and was what we would now call a classic AI toy system approach, one that used techniques later called Preference Semantics, but applied to real newspaper texts. But it did attach single sense representations to words drawn from a polysemous lexicon of 800 words. If Boguraev was right, in his informal survey twelve years ago, that the average NLP lexicon was under fifty words, then that work was ahead of its time and I do therefore have a longer commitment to, and perspective on, the topic than most, for whatever that may be worth!

1. Part-of-speech and Word-Sense Tagging Contrasted
I want to raise some general questions in this paper about WSD as a task, aside from all the busy work in SENSEVAL: questions that should make us wary about what we are doing here, but definitely not stop doing it. I can start by reminding us all of the ways in which WSD is not like part-of-speech (POS) tagging, even though they are plainly connected in information terms, as Stevenson and I pointed out in (Wilks and Stevenson, 1998a), and were widely misunderstood for doing so.

From these differences between POS tagging and WSD, I will conclude that WSD is not just one more partial task to be hacked off the body of NLP and solved. What follows acknowledges that Resnik and Yarowsky made a similar comparison in 1997 (Resnik and Yarowsky, 1997), though this list is a little different from theirs:
1. There is broad agreement about POS tags in that, even among those who advocate differing sets, there is little or no dispute that the sets can be put into one-many correspondence. That is not generally accepted for alternative sets of senses for the same words from different lexicons.
2. There is little dispute that humans can POS tag to a high degree of consistency, but again this is not universally agreed for WS tagging. I shall return to this issue below, but its importance cannot be exaggerated: if humans do not have this skill then we are wasting our time trying to automate it. I assume that fact is clear to everyone: whatever may be the case in robotics or fast arithmetic, in the NLP parts of AI there is no point modelling or training for skills that humans do not have!
3. I do not know the genesis of the phrase "lexical tuning," but the phenomenon has been remarked, and worked on, for thirty years and everyone seems agreed that it happens, in the sense that human generators create, and human analysers understand, words in quite new senses, ungenerated before or, at least, not contained in a point-of-reference lexicon, whether that be thought of as in the head or in the computer. Only this view is consistent with the evident expansion of sense lists in dictionaries with time; these new additions cannot be simply established usages not noticed before.
If this is the case it seems to mark an absolute difference between WSD and POS tagging (where extension does not occur in the same way), and that should radically alter our view of what we are doing in SENSEVAL, because we cannot apply the standard empirical modelling method to that kind of novelty. The now standard empirical paradigm of [mark-up, model/train, and test] assumes prior markup, as in point (2) above. But we cannot, by definition, mark up for new senses, that is, those not in the list we were initially given because the text analysed creates them, or that were left out of the source from which the mark up list came. If this phenomenon is real, and I assume it is, it sets a limit to phenomenon (2), the human ability to pre-tag with senses, and therefore sets an upper bound on the percentage results we can expect from WSD, a fact that marks WSD off quite clearly from POS tagging.
The contrast here is in fact quite subtle as can be seen from the interesting intermediate case of semantic tagging: attaching semantic, rather than POS, tags to words automatically, a task which can then be used to do more of the WSD task (as in Dini et al., 1998) than POS tagging can, since the ANIMAL or BIRD versus MACHINE tags can then separate the main senses of "crane". In this case, as with POS, one need not assume novelty in the tag set, but must allow for novel assignments from it to corpus words, e.g. when a word like "dog" or "pig" was first

used in a human sense. It is just this sense of novelty that POS tagging does have, of course, since a POS tag like VERB can be applied to what was once only a noun, e.g. “ticket”. This kind of novelty, in POS and semantic tagging, can be pre-marked up with a fixed tag inventory, on the basis of lexical rules and corpora; hence both these techniques differ from genuine sense novelty which cannot be premarked. As I wrote earlier, the thrust of these remarks is not intended sceptically, either about WSD in particular, or about the empirical linguistic agenda of the last ten years more generally. I assume the latter has done a great deal of good to NLP/CL: it has freed us from toy systems and fatuous example-mongering, and shown that more could be done with superficial knowledge-free methods than the whole AI knowledge-based-NLP tradition ever conceded: the tradition in which every example, every sentence, had in principle to be subjected to the deepest methods. Minsky and McCarthy always argued for that, but it seemed to some even then an implausible route for any least-effort-driven theory of language evolution to have taken. The caveman would have stood paralysed in the path of the dinosaur as he downloaded deeper analysis modules, trying to disprove he was only having a nightmare. However, with that said, it may be time for some corrective: time to ask not only how we can continue to slice off more fragments of partial NLP as tasks to model and evaluate, but also how to reintegrate them for real tasks that humans undoubtedly can evaluate reliably, like MT and IE, and which are therefore unlike any of the partial tasks we have grown used to (like syntactic parsing) but about which normal language users have no views at all, for they are expert-created tasks, of dubious significance outside a wider framework. It is easy to forget this because it is easier to keep busy, always moving on. But there are few places left to go after WSD; empirical pragmatics has surely started but may turn out to be the final leg of the journey. Given the successes of empirical NLP at such a wide range of tasks, it is not too soon to ask what it is all for, and to remember that, just because machine translation (MT) researchers complained long ago that WSD was one of their main problems, it does not follow that high level percentage success at WSD will advance MT. It may do so, and it is worth a try, but we should remember that Martin Kay warned years ago that no set of individual solutions to computational semantics, syntax, morphology etc. would necessarily advance MT. However, unless we put more thought into reintegrating the new techniques developed in the last decade we shall never find out.

2. WS Tagging as a Human Task
It seems obvious to me that, aside from the problems of tuning and other phenomena that go under names like vagueness, humans can, after training, sense-tag texts at reasonably high levels and reasonable inter-annotator consistency. They can do this with alternative sets of senses for words for the same text, although it may be a task where some degree of training and prior literacy are essential, since some senses in a list are not widely known to the public.

The last question should not be shocking: teams of lexicographers in major publishing houses constitute such literate, trained, teams and they can normally achieve agreement sufficient for a large printed dictionary to be published (agreement about sense sets, that is, a closely related skill to sense-tagging). Those averse to claims about training and expertise here should remember that most native speakers cannot POS tag either, though there seems substantial and uncontentious consistency among the trained. There is strong evidence for this position on tagging ability, which includes Green (1989, see also Jorgensen, 1990) and indeed the high figures obtained for small word sets by the techniques pioneered by Yarowsky (1995). Many of those figures rest on forms of annotation (e.g. assignment of words to thesaurus head sets in Roget), and the general plausibility of the methodology serves to confirm the reality of human annotation (as a consistent task) as a side effect.
The counterarguments to this have come explicitly from the writings of Kilgarriff (1993), and sometimes implicitly from the work of those who argue from the primacy of lexical rules or of notions like vagueness in relationship to WSD. In Kilgarriff's case I have argued elsewhere (Wilks, 1997) that the figures he produced on human annotation are actually consistent with very high levels of human ability to sense-tag and are not counter-arguments at all, even though he seems to remain sceptical about the task in his papers. He showed only that for most words there are some contexts for which humans cannot assign a sense, which is of course not an argument against the human skill being generally successful.
Kilgarriff is also, of course, the organiser of this SENSEVAL workshop. There need be no contradiction here, but a fascinating question about motive lingers in the air. Has he set all this up so that WSD can destroy itself when rigorously tested? One does not have to be a student of double-blind tests, and the role of intention in experimental design, to take these questions seriously, particularly as he has designed the methodology and the use of the data himself. The motive question here is not mere ad hominem argument but a serious question needing an answer, and I have no doubt he will supply it in this volume. These are not idle questions, in my view, but go to the heart of what the SENSEVAL workshop is FOR: is it to show how to do better at WSD, or is it to say something about word sense itself (which might involve saying that you cannot do WSD by computer at all, or cannot do it well enough to be of interest)?
In all this discussion, we should remember that, if we take the improvement of (assessable) real tasks as paramount, tasks like MT, Information Retrieval and Information Extraction (IE), then it may not in the end matter whether humans are ever shown psycholinguistically to need POS tagging or WSD for their own language performance – there is much evidence that they do not. But that issue is wholly separate from what concerns us here; it may still be useful to advance MT/IE via partial tasks like WSD, if they can be shown performable, assessable, and modelable by computers, no matter how humans ultimately turn out to work.

3. Criticisms of WSD in Terms of Vagueness and Lexical Rules
Critiques of the broadly positive position above (i.e. that WSD can be done by people and machines and we should keep at it) sometimes seem to come as well from those who argue (a) for the inadequacy of lexical sense sets over productive lexical rules as well as those who argue (b) for the inherently vague quality of the difference between the senses of a given word. I believe both these approaches are muddled if their proponents conclude that WSD is therefore fatally flawed as a task.
Lexical rules go back at least to Givon's (1967) sense-extension rules but they are in no way incompatible with a sense-set approach. Such sense sets are normally structured in dictionaries (often by part of speech and by general and specific senses) and the rules are, in some sense, no more than a compression device for predicting that structuring. But the set produced by any lexical rules is still a set, just as a dictionary list of senses is a set, albeit structured. It is mere confusion to think one is a set and one not: Nirenburg and Raskin (1997) have pointed out that those who argue against lists of senses (in favour of rules, e.g. Pustejovsky, 1995) still produce and use such lists, for what else could they do? I cannot myself get much clarity on this from advocates of the lexical rule approach: whatever its faults or virtues, what has it to do with WSD? If their case is that rules can predict or generate new senses then their position is no different (with regard to WSD) from that of anyone else who thinks new senses important, however modelled or described. The rule/compression issue itself has nothing essential to do with WSD: it is simply one variant of the novelty/tuning/new-sense/metonymy problem, however that is described.
The vagueness issue is again an old observation, one that, if taken seriously, must surely result in a statistical or fuzzy-logic approach to sense discrimination, since only probabilistic (or at least quantitative) methods can capture real vagueness. That, surely, is the point of the Sorites paradox: there can be no plausible or rational qualitatively-based criterion (which would include any quantitative system with clear limits: e.g. tall = over 6 feet) for demarcating "tall", "green" or any inherently vague concept. If, however, sense sets/lists/inventories are to continue to play a role, then vagueness can mean no more than highlighting what all systems of WSD must have, namely some parameter, or threshold, for the assignment of usage to one of a list of senses versus another, or for setting up a new sense in the list. Talk of vagueness adds nothing to help that process for those who want to assign, on some quantitative basis, to one sense rather than another; the only heuristic solution is one of tuning to see what works and fits our intuitions.

Vagueness would be a serious concept only if the whole sense list for a word (in rule form or not) was abandoned in favour of statistically-based clusters of usages or contexts. There have been just such approaches to WSD in recent years (e.g. Bruce and Wiebe, 1994; Pedersen and Bruce, 1997; Schuetze and Pederson, 1995) and the essence of the idea goes back to Sparck Jones (1964/1986) but such an approach would find it impossible to take part in any competition like SENSEVAL because it would inevitably deal in nameless entities which cannot be marked up for. Vagueness and Lexical Rule-based approaches also have the consequence that all lexicographic practice is, in some sense, misguided: on such theories dictionaries are fraudulent documents that could not help users, whom they systematically mislead by listing senses. Fortunately, the market decides this issue, and it is a plainly false claim. Vagueness in WSD is either false (the last position) or trivial, and known and utilised within all methodologies. This issue owes something to the systematic ignorance of its own history, so often noted in AI. A discussion email preceding this workshop referred to the purported benefits of underspecification in lexical entries, and how recent formalisms had made that possible. How could anyone write such a thing in ignorance of the 1970s and 80s work on incremental semantic interpretation of Hirst, Mellish and Small (Hirst, 1987; Mellish, 1983; Small et al., 1988) among others?

4. Symbolic-Statistical Hybrids for WSD?
None of this is a surprise to those with AI memories more than a few weeks long: in our field people read little outside their own notational clique, and constantly "rediscover" old work with a new notation. This leads me to my final point, which has to do, as I noted above, with the need for a fresh look at technique integration for real tasks. We all pay lip service to this while we spend years on fragmentary activity, arguing that that is the method of science. Well, yes and no, and anyway this is not science: what we are doing is engineering and the fragmentation method does not generally work there, since engineering is essentially integrative, not analytical.
We often write or read of "hybrid" systems in NLP, which is certainly an integrative notion, but we again have little clear idea of what it means. If statistical or knowledge-free methods are to solve some or most cases of any linguistic phenomenon, like WSD, how do we then locate that subclass of the phenomena that require other, deeper, techniques like AI and knowledge-based reasoning? Conversely, how can we know which cases the deeper techniques cannot or need not deal with? If there is an upper bound to empirical method – and I have argued that that will be lower for WSD than for some other NLP tasks – then how can we pull in other techniques smoothly and seamlessly for the "hard" examples?
The experience of POS tagging, to return to where we started, suggests that rule-driven taggers can do as well as purely ML-based taggers, which, if true, suggests that symbolic methods, in a broad sense, might still be the right approach for the whole task. Are we yet sure this is not the case for WSD? I simply raise the question.

Ten years ago, it was taken for granted in most of the AI/NLP community that knowledge-based methods were essential for serious NLP. Some of the successes of the empirical program (and especially the MUC and TIPSTER programs) have caused many to reevaluate that assumption. But where are we now, if a real ceiling is already in sight? Information Retrieval languished for years, and maybe still does, as a technique with a use but an obvious ceiling, and no way of breaking through it; there was really nowhere for its researchers to go. But that is not quite true for us, because the claims of AI/NLP to offer high quality at NLP tasks have never been really tested. They have certainly not failed, just got left behind in the rush towards what could be most easily tested.

5. General versus Small-Scale WSD
Which brings me to my final point: general versus small-scale WSD. Our NLP group at Sheffield is one of the few that has insisted on continuing with general WSD: the tagging and test of all content words in a text, a group that includes CUP, XRCE-Grenoble and CRL-NMSU. We currently claim about 95% correct sense assignment (Wilks and Stevenson, 1998b) and do not expect to be able to improve on that for the reasons set out above; we believe the rest is AI or lexical tuning! The general argument for continuing with the all-word paradigm, rather than the highly successful small-scale paradigm of Yarowsky et al., is that that is the real task, and there is no firm evidence that the small scale will scale up, because much of sense disambiguation is mutual between the words of the text, which, I believe, cannot be used by the small-scale approach.
Logically, if you claim to do all the words you ought, in principle, to be able to enter a contest like SENSEVAL that does only some of the words with an unmodified system. This is true, but you will also expect to do worse, as you will not have had as much training data for the chosen word set. Moreover, you will have to do far more preparation to enter if you insist, as we would, on bringing the engines and data into play for all the training and test set words; the effort is that much greater and it makes such an entry self-penalising in terms of both effort and likely outcome, which is why we decided not to enter in the first round, regretfully, but just to mope and wail on the sidelines.
The methodology chosen for SENSEVAL was a natural reaction to the lack of training and test data for the WSD task, as we all know, and that is where I would personally like to see effort put in the future, so that everyone can enter all the words. I assume that would be universally agreed to if the data were there. It is a pity, surely, to base the whole structure of a competition on the paucity of the data. What we would like to suggest positively is that we cooperate to produce more data, and use existing all-word systems, like Grenoble, CUP, our own and others willing to join, possibly in combination, so as to create large-scale tagged data

quasi-automatically, in rather the same fashion that produced the Penn Tree Bank with the aid of parsers, not just people. We have some concrete suggestions as to how this can be done, and done consistently, using not only multiple WSD systems but also by cross comparing the lexical resources available, e.g. WordNet (or EuroWordNet) and a major monolingual dictionary. We developed our own test/training set with the WordNet-LDOCE sense translation table (SENSUS, Knight and Luk, 1994) from ISI. Some sort of organised effort along those lines, before the next SENSEVAL, would enable us all to play on a field not only level, but much larger.
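One way to prototype the quasi-automatic tagging proposed here is agreement-based bootstrapping: several all-words WSD systems tag the same corpus, their inventories are mapped into a common one (for example via a WordNet-LDOCE translation table such as SENSUS), and only the instances on which enough systems agree are kept as training data. The sketch below is purely illustrative; the system outputs, the mapping table, and the token identifiers are hypothetical and do not describe any actual SENSEVAL infrastructure.

from collections import Counter

def bootstrap_tags(system_outputs, sense_map, min_agree=3):
    """system_outputs: list of dicts, one per WSD system, mapping a token
    identifier (e.g. "doc1:17") to that system's sense label.
    sense_map: maps system-specific labels into a common sense inventory.
    Returns token -> common sense for tokens on which at least `min_agree`
    systems agree after mapping; everything else is left for hand checking.
    """
    agreed = {}
    tokens = set().union(*(set(out) for out in system_outputs))
    for tok in tokens:
        votes = Counter(
            sense_map.get(out[tok], out[tok])
            for out in system_outputs if tok in out
        )
        label, count = votes.most_common(1)[0]
        if count >= min_agree:
            agreed[tok] = label
    return agreed

# Toy run with three hypothetical systems and an identity sense map:
outs = [{"d1:17": "bank%1"}, {"d1:17": "bank%1"}, {"d1:17": "bank%2"}]
print(bootstrap_tags(outs, {}, min_agree=2))   # {'d1:17': 'bank%1'}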

References
Bruce, R. and J. Wiebe. "Word-sense Disambiguation Using Decomposable Models". In Proceedings of the 32nd Meeting of the Association for Computational Linguistics, ACL-94, 1994.
Dini, L., V. di Tommaso and F. Segond. "Error-driven Word Sense Disambiguation". In Proceedings of COLING-ACL98, Montreal, 1998.
Givon, T. Transformations of Ellipsis, Sense Development and Rules of Lexical Derivation. SP-2896, Systems Development Corp., Santa Monica, CA, 1967.
Green, G. Pragmatics and Natural Language Understanding. Erlbaum: Hillsdale, NJ, 1989.
Hirst, G. Semantic Interpretation and the Resolution of Ambiguity. Cambridge: CUP, 1987.
Jorgensen, J. "The Psychological Reality of Word Senses". Journal of Psycholinguistic Research, 19 (1990).
Kilgarriff, A. "Dictionary Word-sense Distinctions: An Enquiry into Their Nature". Computers and the Humanities, 26 (1993).
Knight, K. and S. Luk. "Building a Large Knowledge Base for Machine Translation". In Proceedings of the American Association for Artificial Intelligence Conference AAAI-94. Seattle, WA, 1994, pp. 185–109.
Mellish, C. "Incremental Semantic Interpretation in a Modular Parsing System". In Automatic Natural Language Parsing. Eds. Sparck Jones and Wilks. Ellis Horwood/Wiley, Chichester/New York, 1983.
Nirenburg, S. and V. Raskin. Ten Choices for Lexical Semantics. Research Memorandum, Computing Research Laboratory, Las Cruces, NM, 1997.
Pedersen, T. and R. Bruce. "Distinguishing Word Senses in Untagged Text". In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing. Providence, RI, 1997, pp. 197–207.
Pustejovsky, J. The Generative Lexicon. Cambridge, MA: MIT Press, 1995.
Resnik, P. and D. Yarowsky. "A Perspective on Word Sense Disambiguation Techniques and their Evaluation". In Proceedings of the SIGLEX Workshop Tagging Text with Lexical Semantics: What, Why and How?, Washington, DC, 1997, pp. 79–86.
Schuetze, H. "Dimensions of Meaning". In Proceedings of Supercomputing '92. Minneapolis, MN, 1992, pp. 787–796.
Schuetze, H. and J. Pederson. "Information Retrieval Based on Word Sense". In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval. Las Vegas, NV, 1995.
Small, S., G. Cottrell and M. Tanenhaus (eds.). Lexical Ambiguity Resolution. San Mateo, CA: Morgan Kaufmann, 1988.
Sparck Jones, K. Synonymy and Semantic Classification. Edinburgh: Edinburgh UP, 1964/1986.
Wilks, Y. Argument and Proof. PhD thesis, Cambridge University, 1968.
Wilks, Y. "Senses and Texts". Computers and the Humanities, 1997.

Wilks, Y. and M. Stevenson. "The Grammar of Sense: Using Part-of-speech Tags as a First Step in Semantic Disambiguation". Journal of Natural Language Engineering, 4(1) (1998a), 1–9.
Wilks, Y. and M. Stevenson. "Optimising Combinations of Knowledge Sources for Word Sense Disambiguation". In Proceedings of the 36th Meeting of the Association for Computational Linguistics (COLING-ACL-98). Montreal, Canada, 1998b.
Yarowsky, D. "Unsupervised Word-Sense Disambiguation Rivaling Supervised Methods". In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95). Cambridge, MA, 1995, pp. 189–196.