Evaluation of Lexical Acquisition Algorithms
Stefan Bordag, Hans Friedrich Witschel, Thomas Wittig
Universität Leipzig, Institut für Informatik, Augustusplatz 10-11, 04109 Leipzig
{sbordag, witschel, wittig}@informatik.uni-leipzig.de

Type of the paper: Lecture

Since there is a continuously growing number of lexical acquisition algorithms for all kinds of lexical knowledge, be it morphologic, syntactic, semantic, domain-specific or other information, it is necessary to be able to compare their quality. At the same time, an evaluation method is needed that is easy to grasp and as uniform as possible, so that it can be applied to all kinds of algorithms and yields comparable and, most importantly, reproducible results. We propose an extension of the gold standard evaluation method: most lexical knowledge is generalized to binary relations, and an evaluation framework based on precision and recall is built on this view. To show the effectiveness of this approach we compare the well-known mutual information measure with the Dice coefficient.

Introduction

Lexical acquisition is a broad topic concerned with extracting as much lexical knowledge as possible with a minimum of manual work. The methods range from completely unsupervised clustering techniques, bootstrapping methods, data mining and rule learning up to supervised high-precision annotation. The goals comprise the extraction of morphologic and syntactic knowledge about lexical items such as lemmas, word forms and phrases (Resnik 93), the annotation of semantic relations between lexical items (Hearst 92, Baroni & Bisi 04) and of collocational knowledge (Church et al. 89), as well as translation correspondences. Different evaluations such as psycholinguistic tests (Miller 85), vocabulary tests (Rapp 02), gold standards or artificial items, the latter two covered by (Grefenstette 94), have been employed with strongly varying quality. But even when using a gold standard and testing against a lexical knowledge source like WordNet, different methodologies have been used, so the results are not comparable. We propose a generalized evaluation framework which allows testing against an arbitrary knowledge source in a way comparable to (Curran 03) and (Budanitsky & Hirst 01). We model an abstract view of the output of an algorithm as well as an abstract view of the knowledge source. This makes it possible to test almost any lexical acquisition algorithm against any lexical knowledge source, as long as the desired output of the algorithm is present in the knowledge source and the knowledge in the source can be represented using binary relations. The two main goals of this framework are:
• The results should be expressed in easily understandable precision and recall terms. This removes the need for indirect evaluations which only indicate whether some algorithms performed well, as for example in the framework of (Weeds & McCarthy 04).
• Reproducibility of the results. If the corpus used for the reported evaluation is available, as well as the knowledge source, then it should be easy to reproduce the evaluation. In fact, using a different corpus of the same domain should yield similar results.
We provide a freely available prototype implementation of this framework, which can already test against well-known knowledge sources such as WordNet and Roget's Thesaurus. For German it is possible to use GermaNet and Dornseiff (Dornseiff 04).

Related work

Existing approaches for evaluating lexical acquisition algorithms can be divided into direct approaches (psycholinguistic evidence, standard vocabulary tests, gold standards) and indirect approaches (application-based evaluation, artificial synonyms).
• Psycholinguistic evidence: Evaluations that use psycholinguistic evidence compare the results of lexical acquisition algorithms to ranked lists consisting of the responses of free word-association experiments (Grefenstette 94, Finkelstein et al. 02). Because of the expense of psycholinguistic experiments, such evaluations are usually done on small samples, which makes the results less representative.
• Standard vocabulary tests: Standard vocabulary tests, such as parts of the Test of English as a Foreign Language (TOEFL) or of English as a Second Language (ESL), use multiple-choice questions to find synonyms of given words. These questions together with their possible answers can be used for evaluating lexical acquisition algorithms (Landauer & Dumais 97, Turney 01). Again, the small samples and also the small set of alternatives given for each multiple-choice question are limitations of this method.
• Artificial synonyms: Artificial synonyms are perfect synonyms produced by randomly choosing a part of all occurrences of a word and substituting the word by a pseudo-word. Lexical acquisition algorithms can then be evaluated by checking the ranking of the pseudo-word in the ranked list of the original word and vice versa (Grefenstette 94). This method requires only little human effort and works well for the synonymy relation; other relations cannot be treated this way.
• Application-based: Application-based evaluations measure the improvement that an application gains when it uses the results of a lexical acquisition algorithm (Dagan et al. 97, Lee 99, Budanitsky & Hirst 01). Because of the indirect way of evaluating, it is hard to decide what is responsible for poor improvements: the lexical acquisition algorithm or the application that is supposed to be improved. The results can also differ from application to application.
• Gold standard: Gold standards are resources like thesauri or (machine-readable) dictionaries. Using a gold standard means comparing the results of a lexical acquisition algorithm with the knowledge contained in such a knowledge base. It is then possible to measure the overlap in terms of precision and recall.
o Global approach: The global approach consists in extracting a number of most similar pairs of words with a lexical acquisition algorithm and measuring the overlap with a knowledge base (Grefenstette 94). However, testing only the globally most similar pairs of words does not permit statements about the performance of the lexical acquisition algorithm for a larger variety of words. Again, there is the problem of the representativity of small sample sets.
o Local approach: To proceed locally means to choose a balanced sample set of words (in terms of frequency of occurrence, abstractness/concreteness, specificity/generality and so on) and then measure the overlap between the results of a lexical acquisition algorithm and the knowledge base for this sample set. Usually, evaluations that make use of gold standards only check the synonymy relation (Curran 03), but this method works well for most other semantic relations, too.
The Evaluation Framework

The approach taken in our evaluation framework is to view lexical acquisition (LA) algorithms in terms of Information Retrieval (IR). In IR applications, for a given search query, which usually consists of one or more words, a number of documents ranked according to their relevance to the query (as judged by the search engine) is returned. Precision measures how many of the retrieved documents were really relevant, and recall measures how many of a known number of relevant documents have been retrieved. Equivalently, it is possible to treat an input word (or several) as a search query and the LA algorithm to be evaluated as a search engine which retrieves other words that might be considered relevant. The knowledge whether a retrieved word is relevant or not can be taken from an electronically available lexical knowledge source such as WordNet or Roget's Thesaurus for English, or GermaNet and Dornseiff for German. Cases where this analogy to Information Retrieval is problematic will be discussed at the end of this section. For now we define:

Definition: A lexical acquisition algorithm is a process which receives one or more words as input and returns a ranked list of words or features which are supposedly relevant to the input words.

In order to use various knowledge sources uniformly as evaluation instances, a logical view is needed which allows us to treat them equivalently. Representing all the knowledge of such a source as relations between words is one possible view and will be explained in detail.

Definition: A lexical knowledge source contains relations between pairs of words.

It follows directly from this definition that it is possible to take a word A and view all other words which stand in any relation to A as relevant words. According to the notation in (Lin 98) this is the set (A,*,*), where (A,R,B) means that A stands in relation R to B. Another possibility is to take a word A and a specific relation R and to view all words which stand in this relation to A as relevant, which in Lin's notation is the set (A,R,*). Therefore it is possible to measure precision and recall for all relations at once as well as for each relation separately.
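To make this relational view concrete, the following minimal Python sketch (our own illustration, not the authors' prototype; all names are hypothetical) stores a knowledge source as a set of (A, R, B) triples and answers the two queries just described, (A,*,*) and (A,R,*).

```python
from collections import defaultdict

class RelationalKnowledgeSource:
    """Toy relational view of a lexical knowledge source: a set of (A, R, B) triples."""

    def __init__(self, triples):
        # index the triples by first word, and by (first word, relation)
        self.by_word = defaultdict(set)
        self.by_word_rel = defaultdict(set)
        for a, r, b in triples:
            self.by_word[a].add(b)
            self.by_word_rel[(a, r)].add(b)

    def relevant(self, word):
        """All words related to `word` by any relation: the set (word, *, *)."""
        return self.by_word[word]

    def relevant_by_relation(self, word, relation):
        """All words related to `word` by `relation`: the set (word, relation, *)."""
        return self.by_word_rel[(word, relation)]

# usage, with relation pairs in the style of the 'neglect' example given below
kb = RelationalKnowledgeSource([
    ("neglect", "synonym", "negligence"),
    ("neglect", "synonym", "neglectfulness"),
    ("neglect", "hypernym", "carelessness"),
])
print(kb.relevant("neglect"))                         # (neglect, *, *)
print(kb.relevant_by_relation("neglect", "synonym"))  # (neglect, synonym, *)
```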
Precision and Recall

Usually a lexical acquisition algorithm is applied to a number of lexical items (or simply to all words of a corpus); deviations from this rule will be discussed below. As defined above, the result for each word is a ranked list of supposedly relevant words (or other items such as POS tags, morphologic information etc.) standing in one of the desired relations to the input word. In order to measure the precision of such an algorithm, all words should be weighted uniformly, which means taking the mean of the precision values of the individual words. Therefore we introduce a variable x (for x = 5, 10, 50, ...), which denotes the size of the retrieved sets. When comparing against the chosen knowledge source, every one of the x extracted words (x = 5 in the example) can be tagged as either correct (c) or wrong (w). By looking up in the knowledge base which other words are related to the input word, it is possible to obtain the set of relevant words and thus to compute recall. In the example given below, word1 was assumed to have 50 other words standing in some relation to it. The mean over all words gives a global recall measure. Note that for global recall, correctly extracting 3 out of 10 possible words is much more valuable than extracting 3 out of 50. Because of this, an algorithm can have a higher recall and at the same time a lower precision than some other algorithm.
          1  2  3  4  5   Precision               Recall
word1     c  c  w  w  c   3/5 = 60%               3/50 = 6%
word2     c  w  w  c  c   3/5 = 60%               3/10 = 30%
word3     w  w  c  w  w   1/5 = 20%               1/150 = 0.7%
overall                   (60+60+20)/3 = 46.6%    (6+30+0.7)/3 = 12.2%

Table 1: Example table depicting how precision and recall can be measured.
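A minimal sketch of how the values in Table 1 can be computed (our own illustration with hypothetical helper names, not the prototype's code): each evaluated word contributes a precision@x value and a recall value, and the overall scores are the means over all evaluated words.

```python
def precision_recall_at_x(retrieved, relevant, x):
    """Precision and recall for a single input word.

    retrieved: ranked candidate list returned by the LA algorithm
    relevant:  set of words related to the input word in the knowledge source
    x:         number of top-ranked candidates taken into account
    """
    top = retrieved[:x]
    correct = sum(1 for w in top if w in relevant)
    precision = correct / x
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

def evaluate(results, gold, x=5):
    """Macro-averaged (global) precision and recall over all evaluated words.

    results: dict mapping each input word to its ranked candidate list
    gold:    dict mapping each input word to its set of relevant words
    Input words missing from the knowledge source are skipped here, which
    corresponds to the 'soft' variant discussed below.
    """
    words = [w for w in results if w in gold]
    scores = [precision_recall_at_x(results[w], gold[w], x) for w in words]
    precision = sum(p for p, _ in scores) / len(scores)
    recall = sum(r for _, r in scores) / len(scores)
    return precision, recall

# Table 1: 3, 3 and 1 correct candidates out of x = 5, with 50, 10 and 150
# relevant words respectively -> precision ~46.6%, recall ~12.2%
```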

If, for example, the goal is to measure the performance of some abstract mathematical similarity measure, it is interesting to know which kinds of relations it 'favors'. Therefore, a distinct precision value for each relation can be measured in the same way. Note that since several different relations can hold between a word pair, the sum of the per-relation precision percentages need not equal the overall precision score: for the overall precision an extracted word can only be correct once, whereas when summing precision over different relations the same word can be counted as correct twice or more.

It may further be necessary to introduce the notions of 'hard' and 'soft' precision/recall. Usually the corpus contains an order of magnitude more distinct lexical items than even the largest manually crafted knowledge source. Therefore, in many cases the input word will not be present in the knowledge source at all. This implies that all words extracted for such an input word are counted as wrong, because the first word of the pair (A,R,B) can never match anything. Calculating 'soft' precision and recall means ignoring all these cases. For some purposes it may still be interesting how many words had enough data, according to the algorithm, to be included in the evaluation; therefore we did not exclude the 'hard' precision and recall measurements from our framework.

Relations

For the described method of measuring precision and recall, it is important to be able to represent lexical knowledge using binary relations. We have chosen WordNet, as the most widely known knowledge source, to give examples of how to produce binary relation views on sets and on links between sets. WordNet (Miller 85) is organized around so-called synsets (synonym sets); relations hold either between whole synsets or between single lexical items within these synsets. Without any loss of information, this can be transformed into a relational scheme:
• A synset breaks up into pairs of words which stand in the synonymy relation.
• A relation between two synsets breaks up into a complete mapping between all words of one synset and all words of the other synset, standing in the given relation.
• A relation between two lexical items can be taken over directly.
Thus the word 'neglect' appears in four noun synsets together with 'negligence, neglectfulness, disregard, carelessness, nonperformance' (for simplicity the distinction between its meanings is omitted here). This can be reformulated into the relation pairs (using Lin's notation): (neglect, synonym, negligence), (neglect, synonym, neglectfulness) and so on. In one of the synsets of 'neglect' there is a lexical relation of the type 'derivationally related form' between the noun 'neglect' and the verb 'neglect'. This translates directly into the pair (noun:neglect, deriv, verb:neglect). Furthermore, for the same synset a hypernym relation is defined to the synset of the word 'carelessness'. Therefore all words of the initial synset have the hypernym 'carelessness': (neglect, hypernym, carelessness), (negligence, hypernym, carelessness) and (neglectfulness, hypernym, carelessness).
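The transformation just described can be sketched as follows; this is our own illustration under the assumption that NLTK's WordNet interface is used (wn.synsets, lemmas(), hypernyms(), derivationally_related_forms()), and it is not the evaluation prototype itself.

```python
# Sketch: turn the WordNet entries for 'neglect' into (A, R, B) triples.
from nltk.corpus import wordnet as wn

triples = set()
for synset in wn.synsets("neglect", pos=wn.NOUN):
    words = [lemma.name() for lemma in synset.lemmas()]
    # a synset breaks up into synonym pairs
    for a in words:
        for b in words:
            if a != b:
                triples.add((a, "synonym", b))
    # a relation between two synsets becomes a complete mapping of their members
    for hyper in synset.hypernyms():
        for a in words:
            for b in (lemma.name() for lemma in hyper.lemmas()):
                triples.add((a, "hypernym", b))
    # a relation between single lexical items is taken over directly
    for lemma in synset.lemmas():
        for related in lemma.derivationally_related_forms():
            triples.add((lemma.name(), "deriv", related.name()))

print(sorted(triples)[:10])
```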
One linguistic relation, cohyponymy, is not explicitly annotated in WordNet, but it can be derived from hypernyms: if two words have the same hypernym, then the cohyponymy relation holds between them. For this purpose a 'search depth' variable has been introduced which lets the user specify whether these words must be direct hyponyms of the same hypernym or may be farther away. Usually, however, distances greater than one step cause highly unrelated words to be counted as cohyponyms; therefore, in our example evaluation below we have used only direct hypernyms to find cohyponyms.

Problematic cases

As mentioned above, our approach is strictly relational in that it assumes that all data from knowledge sources can be represented using binary relations between words. This seems a strong restriction at first, but on closer inspection many representations of knowledge can be viewed as, or transformed into, relational ones. This does not only apply to semantic information but extends to other forms of lexical knowledge: many aspects of morphologic, syntactic or translational information can be represented using binary relations. Let us consider some examples that support this claim:
• Inflection: From the morphologic perspective, lemma information constitutes an important lexical feature of words. All knowledge about lemmatization can be encoded using the binary relation lemma, e.g. (cars, lemma, car) or (went, lemma, go). This may be somewhat redundant for some languages, e.g. for English, which has quite simple rules for lemmatization, but the relational representation can always be inferred from these rules.
• Part-of-speech: An important lexical feature of a word is its syntactic category (part-of-speech). This, too, is a relational problem, which can be described using a binary relation pos that associates words with their part-of-speech: (car, pos, noun) or (young, pos, adjective).
• Translation: Encoding the translation correspondences of a word can trivially be done by relational means, namely using a relation translation for each of the different meanings of a word: (Bank, translation, bank) and (Bank, translation, bench).
• Subject areas: The assignment of subject areas to words can be encoded using a relation sa, e.g. (bicycle, sa, vehicles). Note that a subject area need not be a hypernym of the word it is assigned to: (meningitis, sa, medicine).
• Word sense induction: Besides the obvious (paradigmatic) relations, there are other forms of semantic information associated with words. Many thesauri contain information about words related to a specific meaning of a given term (e.g. {table1 → sit, chair, ...} and {table2 → column, row, ...}), which can be used for word sense disambiguation. These associations can be captured via a binary relation sense, e.g. (table1, sense, sit), (table1, sense, chair).
• Collocations: Representing collocational knowledge is somewhat more challenging. For collocations X consisting of more than two words, we either need relations with three or more arguments (which our evaluation tool cannot handle) or we define a new binary relation x. All pairs of words that appear in X are then assumed to be related according to x. We could, for instance, represent X = ''kick the bucket'' as {(kick, x, the), (kick, x, bucket), (the, x, bucket)}. This is admittedly quite intricate, so another way of dealing with collocational data should be preferred.
Despite some problems, these representations cover a large portion of all possible lexical information that may be associated with words. This means that the overall quality of (new) lexical entries (comprising morphologic, syntactic, translational and semantic features) can be measured as a whole using a relational approach, which may be useful when extending or bootstrapping a lexicon.

Example analysis using the framework

One of the most investigated issues in lexical acquisition is the extraction of significant co-occurrences in order to gain information about the general context of a word. The results of such measurements are then used to extract collocations (Church et al. 89), word similarity (Terra & Clarke 03) or other semantic relations such as hypernymy (Hearst 92). The basic approach is to measure which other words co-occur significantly within a given window or within sentences. This leads to four variables:
• n: the size of the corpus.
• $p(A) = \frac{n_A}{n}$ : the estimated probability of word A occurring in a sentence.
• $p(B) = \frac{n_B}{n}$ : the estimated probability of word B occurring in a sentence.
• $p(A,B) = \frac{n_{AB}}{n}$ : the estimated probability of A and B co-occurring in a sentence.

Baseline

The simplest way to obtain information about which words frequently co-occur with each other is, of course, to simply count these co-occurrences. This count is taken as the baseline measure:

$sig_{baseline}(A,B) = n_{AB}$

However, this measure does not take the frequencies of the involved words into account. Furthermore, it considers neither the independence assumption (i.e. the assumption that occurrences of A and B are statistically independent: $p(A,B) = p(A) \cdot p(B)$) nor the corpus size.

Mutual Information

The most successful measure, in terms of papers published, is no doubt mutual information (Church et al. 89). The idea is to compare the probability of observing word A and word B together (the joint probability) with the probabilities of observing A and B independently.

$sig_{MI}(A,B) = \log \frac{p(A,B)}{p(A) \cdot p(B)} = \log \frac{n \cdot n_{AB}}{n_A \cdot n_B}$

If these probabilities are nearly equal (which means total independence), then the ratio is close to 1 and the logarithm close to 0. If the probability of joint appearance is larger than the product of the probabilities of A and B occurring independently, that is $p(A,B) \geq p(A) \cdot p(B)$, then the ratio will be much larger than 1 and $sig_{MI}(A,B) \gg 1$. On the contrary, if the probability of joint co-occurrence is smaller than the product of the probabilities of A and B occurring independently, that is $p(A,B) \leq p(A) \cdot p(B)$, then, due to taking the logarithm, $sig_{MI}(A,B)$ falls below 0.
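As a small illustration of how these scores are computed from corpus counts, the sketch below (our own, not part of the described prototype) implements the baseline and mutual information measures and, since the Dice coefficient is the second measure compared in this paper, the standard Dice formula as well; n, n_A, n_B and n_AB are assumed to be sentence counts taken from the corpus.

```python
import math

def sig_baseline(n_ab):
    # raw co-occurrence count of A and B
    return n_ab

def sig_mi(n, n_a, n_b, n_ab):
    # mutual information: log( p(A,B) / (p(A) * p(B)) ) = log( n * n_AB / (n_A * n_B) )
    # natural logarithm; the choice of base only rescales the scores
    return math.log((n * n_ab) / (n_a * n_b))

def sig_dice(n_a, n_b, n_ab):
    # standard Dice coefficient over the two occurrence counts
    return 2 * n_ab / (n_a + n_b)

# e.g. a corpus of 100,000 sentences in which A occurs in 200 sentences,
# B in 300 and both together in 50
print(sig_baseline(50))
print(sig_mi(100_000, 200, 300, 50))   # > 0: A and B co-occur more often than chance
print(sig_dice(200, 300, 50))
```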