The SEMEVAL English Lexical Substitution Task: Results

Diana McCarthy and Roberto Navigli

1 Results and Baselines

In this document we show precision (P), recall (R), mode precision (Mode P) and mode recall (Mode R) as described in our scoring documentation [1]. In Tables 1 to 4 systems are ordered by recall on the best task, and in Tables 5 to 8 by recall on oot. In Tables 3, 4, 7 and 8 we show further analysis of results on four subsets of the items: (i) items NOT identified as multiwords (NMWT); (ii) items scored only on non-multiword substitutes (i.e. no spaces) from both annotators and systems (NMWS); (iii) items whose sentences were selected randomly (RAND); and (iv) items whose sentences were selected manually (MAN). We retain the same ordering of systems for this further analysis. Although there are further differences between the systems that would warrant reranking on an individual analysis, since we combine the subanalyses in one table we keep the order of Tables 1 and 5 respectively for ease of comparison.

We produced baselines using WordNet 2.1 (Miller et al., 1993) and a number of distributional similarity measures. For the WordNet best baseline we found the best-ranked synonym using criteria 1 to 4 below, in order. For WordNet oot we found up to 10 synonyms using criteria 1 to 4 in order until 10 were found:

1. Synonyms from the first synset of the target word, ranked with frequency data obtained from the BNC (Leech, 1992).

2. Synonyms from the hypernyms (verbs and nouns) or closely related classes (adjectives) of that first synset, ranked with the BNC frequency data.

3. Synonyms from all synsets of the target word, ranked with the BNC frequency data.

4. Synonyms from the hypernyms (verbs and nouns) or closely related classes (adjectives) of all synsets of the target word, ranked with the BNC frequency data.

We also produced best and oot baselines using the distributional similarity measures l1, jaccard, cosine, lin (Lin, 1998) and αSD (Lee, 1999) [2]. We took the word with the largest similarity (or smallest distance, for αSD and l1) for best, and the top 10 for oot. For multiword (mw) detection and identification we used WordNet to detect whether a multiword in WordNet that includes the target word occurs within a window of two words before and two words after the target word.

Systems  KU     UNT    MELB   HIT    USYD   IRST 1  IRST 2  TOR
P        12.90  12.77  12.68  11.35  11.23   8.06    6.95   2.98
R        12.90  12.77  12.68  11.35  10.88   8.06    6.94   2.98
Mode P   20.65  20.73  20.41  18.86  18.22  13.09   20.33   4.72
Mode R   20.65  20.73  20.41  18.86  17.64  13.09   20.33   4.72

Table 1: best results
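As an illustration, the ranked lookup behind the WordNet best baseline can be sketched as below. This sketch covers only criteria 1 and 3 (synonyms of the first synset, then of all synsets); the hypernym-based criteria 2 and 4 are omitted, and the SYNSETS and BNC_FREQ dictionaries are toy stand-ins invented for the example, not WordNet 2.1 or the actual BNC counts.

```python
# Toy stand-in for WordNet: each target maps to an ordered list of
# synsets, each synset being a list of synonym lemmas.
SYNSETS = {
    "bright": [["brilliant", "vivid"], ["smart", "intelligent"]],
}
# Toy stand-in for BNC frequency counts (Leech, 1992).
BNC_FREQ = {"brilliant": 320, "vivid": 150, "smart": 900, "intelligent": 410}

def best_baseline(target):
    """Return the best-ranked substitute for `target`.

    Criterion 1: synonyms from the first synset, ranked by frequency.
    Criterion 3: synonyms from all synsets, ranked by frequency
    (tried only if criterion 1 yields no candidate).
    """
    synsets = SYNSETS.get(target, [])
    candidate_pools = []
    if synsets:
        candidate_pools.append(synsets[0])                       # criterion 1
        candidate_pools.append([w for s in synsets for w in s])  # criterion 3
    for pool in candidate_pools:
        ranked = sorted((w for w in pool if w != target),
                        key=lambda w: BNC_FREQ.get(w, 0), reverse=True)
        if ranked:
            return ranked[0]
    return None

print(best_baseline("bright"))  # -> "brilliant"
```

For the oot variant one would keep collecting candidates from the criteria in order until 10 substitutes were found, rather than stopping at the first.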

[1] Available at http://nlp.cs.swarthmore.edu/semeval/tasks/task10/task10documentation.pdf

[2] We used 0.99 as the parameter for α for this measure.
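For illustration, the selection step shared by the distributional-similarity baselines (the single highest-scoring word for best, the top 10 for oot, with the distance-like measures αSD and l1 ranked low-to-high) might be sketched as follows; the candidate scores here are invented for the example.

```python
# Invented similarity scores between candidates and a hypothetical
# target word; a real run would compute these with lin, jaccard,
# cosine, l1 or alpha-SD over corpus co-occurrence data.
SCORES = {"happy": 0.81, "glad": 0.77, "merry": 0.60, "sad": 0.12}

def rank_candidates(scores, distance=False):
    """Rank candidates best-first; flip the ordering for distance measures."""
    key = (lambda w: -scores[w]) if distance else (lambda w: scores[w])
    return sorted(scores, key=key, reverse=True)

def best(scores, distance=False):
    # "best" task: the single top-ranked substitute.
    return rank_candidates(scores, distance)[0]

def oot(scores, distance=False, k=10):
    # "oot" task: up to 10 ranked substitutes.
    return rank_candidates(scores, distance)[:k]

print(best(SCORES))      # -> "happy"
print(oot(SCORES, k=3))  # -> ['happy', 'glad', 'merry']
```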

Systems   KU     UNT    MELB   HIT    USYD   IRST 1  IRST 2  TOR
NMWT  P   13.39  13.46  13.35  11.97  11.68   8.44    7.25   3.22
      R   13.39  13.46  13.35  11.97  11.34   8.44    7.24   3.22
NMWS  P   14.33  13.79  14.19  12.55  12.48   8.98    7.67   3.32
      R   13.98  13.79  13.82  12.38  12.10   8.92    7.66   3.32
RAND  P   12.67  12.85  12.50  11.81  11.47   8.65    6.71   3.10
      R   12.67  12.85  12.50  11.81  11.01   8.64    6.68   3.10
MAN   P   13.16  12.69  12.89  10.81  10.95   7.38    7.23   2.84
      R   13.16  12.69  12.89  10.81  10.73   7.38    7.23   2.84

Table 3: Further analysis for best

Systems        KU     UNT    MELB   HIT    USYD   IRST 1  IRST 2  TOR
NMWT  Mode P   21.20  21.63  21.29  19.81  18.46  13.38   20.76   5.04
      Mode R   21.20  21.63  21.29  19.81  17.90  13.38   20.76   5.04
NMWS  Mode P   21.88  21.59  21.74  19.93  19.25  13.85   21.50   4.90
      Mode R   21.42  21.59  21.33  19.65  18.63  13.74   21.50   4.89
RAND  Mode P   20.34  20.18  19.72  20.03  19.14  13.76   22.17   5.20
      Mode R   20.34  20.18  19.72  20.03  18.35  13.76   22.17   5.20
MAN   Mode P   21.01  21.35  21.18  17.53  17.20  12.33   18.23   4.17
      Mode R   21.01  21.35  21.18  17.53  16.84  12.33   18.23   4.17

Table 4: Further analysis for best: finding the mode

Systems   IRST 2  UNT    KU     IRST 1  USYD   SWAG 2  HIT    SWAG 1  TOR
NMWT  P   72.04   51.13  48.43  43.11   37.26  39.95   35.60  37.49   11.77
      R   71.90   51.13  48.43  43.08   36.17  36.51   35.60  34.64   11.77
NMWS  P   76.19   54.01  49.72  45.13   40.13  40.97   36.63  38.36   12.22
      R   76.06   54.01  49.72  45.11   38.89  37.75   36.63  35.67   12.22
RAND  P   66.94   51.71  47.80  42.14   35.67  39.74   33.95  36.94    9.98
      R   66.72   51.71  47.80  42.09   34.26  36.26   33.95  34.52    9.98
MAN   P   71.46   46.26  44.23  40.17   36.52  35.56   33.81  33.83   12.61
      R   71.46   46.26  44.23  40.17   35.78  32.79   33.81  30.85   12.61

Table 7: Further analysis for oot

Systems        IRST 2  UNT    KU     IRST 1  USYD   SWAG 2  HIT    SWAG 1  TOR
NMWT  Mode P   60.38   68.03  63.42  56.82   44.71  52.28   48.48  49.11   15.03
      Mode R   60.38   68.03  63.42  56.82   43.35  47.78   48.48  45.35   15.03
NMWS  Mode P   61.97   70.15  63.74  58.26   46.25  52.25   49.33  49.41   15.26
      Mode R   61.97   70.15  63.74  58.26   44.77  47.98   49.33  45.70   15.26
RAND  Mode P   58.26   68.04  62.84  55.50   42.90  53.61   47.25  48.94   13.00
      Mode R   58.26   68.04  62.84  55.50   41.13  48.78   47.25  45.72   13.00
MAN   Mode P   58.85   64.24  59.55  55.03   44.50  46.34   46.53  45.63   16.49
      Mode R   58.85   64.24  59.55  55.03   43.58  42.88   46.53  41.67   16.49

Table 8: Further analysis for oot: finding the mode

Systems  WordNet  lin    l1     lee    jaccard  cos
P         9.95     8.84   8.11   6.99   6.84    5.07
R         9.95     8.53   7.82   6.74   6.60    4.89
Mode P   15.28    14.69  13.35  11.34  11.17    7.64
Mode R   15.28    14.23  12.93  10.98  10.81    7.40

Table 2: best baseline results

             detection       identification
System       P      R        P      R
HIT          45.34  56.15    41.61  51.54
WordNet BL   43.64  36.92    40.00  33.85

Table 9: MW results
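The multiword detection step behind Table 9 could be sketched roughly as follows, assuming a precompiled set of WordNet multiwords (here a toy stand-in): any known multiword that includes the target word and appears within two words on either side of it counts as a detection.

```python
# Toy stand-in for the multiwords listed in WordNet 2.1.
MULTIWORDS = {"take off", "take over", "once in a while"}

def detect_multiword(tokens, target_idx, multiwords=MULTIWORDS):
    """Return the multiword found around tokens[target_idx], or None.

    Scans every contiguous span that (i) contains the target token and
    (ii) stays within two tokens on either side of it, as described in
    the text.
    """
    lo = max(0, target_idx - 2)
    hi = min(len(tokens), target_idx + 3)
    for start in range(lo, target_idx + 1):
        for end in range(target_idx + 1, hi + 1):
            if end - start < 2:
                continue  # a multiword has at least two tokens
            if " ".join(tokens[start:end]) in multiwords:
                return " ".join(tokens[start:end])
    return None

tokens = "the plane will take off shortly".split()
print(detect_multiword(tokens, tokens.index("take")))  # -> "take off"
```

Identification would additionally require the system to supply a substitute for the detected multiword rather than for the single target token.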

Systems  IRST 2  UNT    KU     IRST 1  USYD   SWAG 2  HIT    SWAG 1  TOR
P        69.03   49.19  46.15  41.23   36.07  37.80   33.88  35.53   11.19
R        68.90   49.19  46.15  41.20   34.96  34.66   33.88  32.83   11.19
Mode P   58.54   66.26  61.30  55.28   43.66  50.18   46.91  47.41   14.63
Mode R   58.54   66.26  61.30  55.28   42.28  46.02   46.91  43.82   14.63

Table 5: oot results

Systems  WordNet  lin    l1     lee    jaccard  cos
P        29.70    27.70  24.09  20.09  18.23   14.07
R        29.35    26.72  23.23  19.38  17.58   13.58
Mode P   40.57    40.47  36.10  29.81  26.87   20.82
Mode R   40.57    39.19  34.96  28.86  26.02   20.16

Table 6: oot baseline

References

Lillian Lee. 1999. Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 25–32.

Geoffrey Leech. 1992. 100 million words of English: the British National Corpus. Language Research, 28(1):1–13.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI.

George Miller, Richard Beckwith, Christine Fellbaum, David Gross, and Katherine Miller. 1993. Introduction to WordNet: an On-Line Lexical Database. ftp://clarity.princeton.edu/pub/WordNet/5papers.ps.