Word-pair extraction for lexicography


Chris Brew & David McKelvie*
Language Technology Group
Human Communication Research Centre
2 Buccleuch Place
EDINBURGH EH8 9LW
SCOTLAND

Abstract. We describe an application of sentence alignment techniques and approximate string matching to the problem of extracting lexicographically interesting word-word pairs from multilingual corpora. Since our interest is in support systems for lexicographers rather than in fully automatic construction of lexicons, we would like to provide access to parameters allowing a tunable trade-off between precision and recall. We evaluate two techniques for doing this. Since sentence alignment tends to associate semantically similar words, while approximate string matching draws attention to orthographic similarities, the two can be used to serve different lexicographic purposes, as can their combination, which amounts, inter alia, to a tool for uncovering faux amis. We conclude by sketching a simple and flexible means for allowing lexicographers to provide information which has the potential to improve system performance.

1 Introduction One of the central challenges of computational lexicography is to design computational tools which allow lexicographers to do what they have always done, only better. This has received far less attention than the almost unrelated task of taking machine-readable dictionaries and pressing them into service for computational linguistics, which is well described by Wilks, Slator and Guthrie [19]. While the use of corpora is a common thread, the priorities of lexicographers are, unsurprisingly, different from those of computationalists (see [18] for an account of how frequency information is used, and [3] for a description of the use of mutual information and other statistics to uncover sense information). We present techniques which are applicable to multilingual corpora, describing our evaluation of them as information sources for those involved in the construction of bilingual dictionaries. 2 At minimum, a bilingual dictionary needs to indicate which words and phrases of one language are possible translations of words and phrases in the other language. A helpful bilingual dictionary will also draw the reader's attention to information which goes beyond this, such as facts about which words and phrases are commonly mistranslated by reason of orthographic similarity (i.e. faux amis). It is for the lexicographer to determine whether information about faux amis is important enough to include in a dictionary, but several dictionaries do, including the Cambridge International Dictionary of English [14], which we will abbreviate as CIDE.

* We are grateful to the UK EPSRC's Integrated Lexical Database project for funding this work, to the European Commission and the MULTEXT project for providing access to tools and corpora, and to the UK ESRC and the Universities of Edinburgh, Glasgow and Durham for providing infrastructure and support.
2 We are grateful to Diane Nicholls of Cambridge Language Services for annotating some of the data.


As computationalists, we want to aid the lexicographer by filtering out pairs of words and phrases which are extremely unlikely to be worthy of comment, presenting only those pairs for which human intuition is needed. In information retrieval (IR) terminology this is an attempt to maximise the precision of the retrieval process. Of course, since the primary purpose is to augment the capability of the lexicographer, it is important to present some pairs which would not otherwise have been considered. In IR terms this requires us to maximise the recall of the retrieval process. As always, there is a trade-off between precision and recall. Increased recall can always be obtained by accepting a larger percentage of false positives, and the attendant responsibility of checking many screenfuls of largely spurious output.

2 Plan of the paper Our experiments use the English and French components of a multilingual corpus of parliamentary questions taken from the MLCC Multilingual Corpus (not yet published, but to be made available publicly in the near future). In this paper we restrict our attention to verbs, largely because these receive a finely articulated classification in the Cambridge International Dictionary of English [14], as well as elsewhere [12]. A crucial secondary reason is a remark made by Briscoe and Carroll: over half the analysis failures for unseen corpus material were caused by incorrect subcategorisation for predicate valency [4]. One consequence of this, pointed out by Briscoe and Carroll, is that NLP systems would benefit greatly from any techniques which can extract useful argument structure information from corpora. We think that the same is true for lexicography, and are currently carrying out experiments to test the coverage of the verb-frame annotation scheme provided in CIDE.

2.1 A corpus of word-pairs We begin with an unannotated corpus of English and French. The French part contains 1,221,579 tokens (including punctuation) and the English part 1,046,420 tokens (again including punctuation). Our first step was to pass the French and English texts through an HMM-based part-of-speech tagger [2], using lexicons and transition tables prepared under the MULTEXT project. Different tag sets were used for French and English. For each tag-set we identified the verbal categories and lemmatised the verbs to their base forms. This process identified 98,713 verb tokens in the French part of the corpus and 96,202 verb tokens in the English part. This corresponds to 1818 verb types for French and 2065 verb types for English. Having identified verbs on both sides of the presumed translation equivalence relation, we use sentence alignment [8]. We used Afzal Ballim's implementation of Gale and Church's charalign algorithm, yielding a sentence-level alignment of the texts. However, the performance of the Gale-Church algorithm is highly dependent on cross-language regularities in the use of punctuation and spacing [6, pp. 67-74]. For some documents, notably those for which formatting conventions are different in the two languages, or where there are substantial blocks of extraneous text in one of the language pairs, the unmodified Gale-Church algorithm performs abysmally. Davis et al. suggest a scheme for weighting different sources of evidence to improve alignment. Among the factors which they consider is the matching of Arabic numerals and English phrases, commenting that these increasingly appear in texts in non-European languages. Since the Gale-Church

algorithm works on the lengths of the sentences, one can modify the algorithm by filtering the sentences before aligning, e.g. by replacing words with their POS tags, or by removing all words except for some subset of the words. Our heuristic was to remove all words except punctuation and proper names, and to use just these to align. This seems to be best for the corpus of parliamentary questions which we used. This is an interesting corollary of Davis's result, indicating that when consistent formatting information is present, it can pay to throw away everything else. Having established sentence alignments, we generate a set of word-word pairs by simply enumerating all possible pairings (fr, eng) which do not cross the boundaries of an alignment unit. If we assume that the alignment process is infallible, and that verbs in French are always translated by verbs in English, then we can assume that the relation generated by this procedure is a superset of the true translation relation about which we want to inform ourselves. We make these assumptions for their heuristic value only, since there is ample evidence that both are incorrect. If we were to make the rash claim that this method were appropriate for machine translation, we would be open to the charge that we are simply using a grotesquely oversimplified variant of the IBM approach [5]. For our purposes the criticism has less force, because:

- We are interested in uncovering relations of potential inter-translatability between word types, whereas machine translation requires that we hypothesise translation relations between word tokens, which is a much more demanding task.

- We can afford to tolerate considerable error in the assessment of token alignments as long as the type alignments are useful.
- Because of the presence of a human supervisor, there is no pressing need for even the type alignments to be uniformly correct.
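The pipeline described above (filter the sentences for alignment, then enumerate all verb pairings within each alignment unit) can be sketched roughly as follows. This is our illustration, not the authors' code; the tag names and the toy data are hypothetical:

```python
from collections import Counter

def alignment_filter(tokens, tags):
    """Sketch of the pre-alignment filter: keep only punctuation and
    proper names, discarding all other words. The tag names here are
    hypothetical; the real MULTEXT tagsets differ."""
    return [t for t, tag in zip(tokens, tags) if tag in ("PUNCT", "PROPN")]

def enumerate_verb_pairs(aligned_units):
    """Count every (French verb, English verb) pairing that does not
    cross the boundary of a sentence-alignment unit."""
    pairs = Counter()
    for fr_verbs, en_verbs in aligned_units:
        for fr in fr_verbs:
            for en in en_verbs:
                pairs[(fr, en)] += 1
    return pairs

# Toy alignment units: lemmatised verbs on each side of the alignment.
units = [(["demander"], ["ask", "request"]),
         (["demander", "interdire"], ["ask", "ban"])]
counts = enumerate_verb_pairs(units)
print(counts[("demander", "ask")])  # co-occurs in both units
```

The resulting counts feed directly into the collocation-style statistics of section 3.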

Note that the relation of potential inter-translatability between word types is similar in flavour to the relationship between the components of lexical collocations in monolingual work. Our reduction of the corpus to a collection of token-token pairs makes it possible for us to press the technology of collocation analysis into the service of our current goals. Potential inter-translatability is a symmetric relation, since the criterion for admission is that sufficient sentences can be found in which the English word seems like the translation of the French word. The corpus of word-pairs uncovered by sentence alignment contains 161,430 distinct types, corresponding to 430,132 tokens (a mean of 2.668 occurrences per type, with maximum 995, minimum 1, standard deviation 7.61). English verbs align with a mean of 78.62 different French verbs (maximum 888, minimum 1, standard deviation 118.00). French verbs align with a mean of 90.14 different English verbs (maximum 1255, minimum 1, standard deviation 145.235). Along with sentence alignment we chose a second means of uncovering relations between English and French words, namely approximate string matching [13]. The relationship we are really interested in is that which holds between words which look as if they might be translations. In lieu of that we fall back on various heuristic measures of orthographic similarity. Both techniques pick out (and order) a subset of the Cartesian product of all the words in the corpus, so each word pair falls into one of the categories shown in Table 1. The rest of the paper is concerned with the task of assigning word-pairs to the appropriate cells.

           Cognate +        Cognate -
Aligned +  Cognates         Translations
Aligned -  False Friends    Unrelated

Table 1. Overall design of the paper

2.2 Evaluation Since there are 1818 × 2065 = 3,754,170 pairs, it was not feasible to evaluate every pair manually. The evaluation was carried out by one of the authors, relying on a good reading knowledge of French, the Collins Pocket English-French dictionary [10] and an appreciation of the likely topics of discussion in the European parliament. For even this relatively small portion of the space of verbs it was too time-consuming to consult the corpus for every pair, although this would have been ideal given unlimited time and patience. We evaluated all pairs which gave a likelihood ratio score of greater than 7, and some of the other pairs thrown up by the various methods of approximate string matching. This made a total of 16,051 distinct verb-verb pairs, of which we judged 14,067 to be impossible as translation pairs, and 1983 to be possible. The evaluation effort took the best part of six working days. In a pilot study, working with a smaller corpus, we checked for the possibility that there were large numbers of potentially correct translations which we had not considered, and found nothing new in a sample of 1000 random translations. It is certain that some correct translations have been missed, and we don't have data on the reproducibility of the judgements. Not all verbs will have an alignment with another verb, since some will align with semantically null support verbs. We evaluated on the basis that the alignment of "inform" with "faire" is incorrect, although the alignment of "faire part" with "inform" is correct. Equally, many verbs have more than one correct alignment. From French to English "interdire" can be "ban", "bar", "forbid", "outlaw" or "prohibit", while from English to French "show" can be "attester", "démontrer", "manifester", "montrer", "prouver", "révéler" or "témoigner".

3 Search for inter-translatability in alignment data In this section we report results of an effort to find inter-translatable words in the corpus of word-pairs. In section 4 we cover the use of approximate string matching to uncover effects of orthographic similarity. In either case, we conceptualise the task as using some statistic (or, in some cases, quasi-statistic) as an ordering heuristic for a set of pairs too large to examine in its entirety. We rely on the statistics for illumination rather than support. The most effective ordering heuristic for this data turned out to be the binomial likelihood ratio statistic that has been used for collocation analysis by Dunning [7]. The first strategy which we investigated was that of simply taking the best match for each verb in each direction. From French to English this yields 864 correct (48.2%), 920 incorrect (51.3%) and 7 unevaluated (0.3%). From English to French the corresponding figures are 953 correct (47.25%), 1063 incorrect (52.7%) and 1 unevaluated (0.05%). Recall is difficult to evaluate definitively, because of the number of pairs that would need to be scanned. It certainly cannot be better than 864/1983 = 0.43 for French to English or 954/1982 = 0.48 for English to French.
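The ordering statistic can be sketched as follows. This is the standard entropy formulation of Dunning's −2 log λ for a 2×2 contingency table of co-occurrence counts; the cell naming is ours, not taken from the paper:

```python
from math import log

def llr(k11, k12, k21, k22):
    """Dunning's -2 log lambda for a 2x2 contingency table:
    k11 = the (fr, en) pair share an alignment unit,
    k12 = fr occurs without en, k21 = en without fr, k22 = neither."""
    def h(*ks):
        # sum of k * ln(k / n) over the non-zero cells
        n = sum(ks)
        return sum(k * log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)   # row marginals
                - h(k11 + k21, k12 + k22))  # column marginals
```

The score is zero when the observed counts match the independence assumption exactly, and grows as the pair co-occurs more often than chance predicts, which is what makes it usable as an ordering heuristic.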

3.1 Improving Precision It is for lexicographers to decide whether these levels of precision and recall are appropriate for their preferred working style; if the hypotheses thrown up in the correct 50% are interesting enough, the 50% of noise may well be tolerable. Nevertheless, we assume that this level of noise is not adequate, and investigate ways of reducing it. The first tactic which we investigate is that of imposing a threshold on the likelihood figure. 3

min(−2 log λ)   N      Correct   Precision
612             33     33        100%
144             256    243       95.3%
53              502    453       90.2%
24              816    659       80.8%
13              1363   831       61.0%
0               1791   864       48.2%

Table 2. The effect of thresholding (French to English)

min(−2 log λ)   N      Correct   Precision
613             32     32        100%
91              354    337       95.2%
43              590    532       90.2%
23              878    705       80.3%
13              1454   898       61.76%
0               2017   953       47.24%

Table 3. The effect of thresholding (English to French)

Tables 2 and 3 show the result of thresholding the data at various points in the scale defined by the likelihood ratio statistic. Over half the correct translation pairs are available by cutting off at a level which gives 90% precision. Another simple approach is to consider as equivalent only those word pairs which satisfy the condition that each is the other's most preferred partner. This is the Best Match condition used by Gaussier, Lange and Meunier [9]. There are 826 pairs which meet this condition in the current data, of which 631 (76.4%) are correct and 195 (23.6%) incorrect. Thresholding this data at a likelihood ratio of −2 log λ > 24 produces 571 answers of which 519 are correct (90.9% precision). In order to attain 90% precision on the complete French-to-English data we had to threshold at −2 log λ > 53, yielding only 453 correct answers. However, for English to French we had to threshold at −2 log λ > 43, yielding 532 correct answers. If we require 95% precision we can threshold the symmetrical part of the data at −2 log λ > 33, yielding 460 correct answers out of 483. To achieve the 95% level on the French-to-English data we had to threshold at −2 log λ > 144, yielding only 243 correct answers. For English to French the 95% threshold is −2 log λ > 91, yielding 337 correct answers.
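Gaussier's Best Match condition can be sketched as follows, assuming a dictionary of likelihood-ratio scores over candidate pairs has already been computed; the function name and toy scores are our own illustration:

```python
def symmetric_best_matches(scores):
    """Gaussier-style Best Match filter: keep (fr, en) only when en is
    fr's top-scoring partner AND fr is en's top-scoring partner.
    scores maps (fr, en) pairs to likelihood-ratio values."""
    best_fr, best_en = {}, {}
    for (fr, en), s in scores.items():
        if s > best_fr.get(fr, ("", float("-inf")))[1]:
            best_fr[fr] = (en, s)
        if s > best_en.get(en, ("", float("-inf")))[1]:
            best_en[en] = (fr, s)
    # keep only the mutually-best pairs
    return sorted((fr, en) for fr, (en, _) in best_fr.items()
                  if best_en[en][0] == fr)

pairs = symmetric_best_matches({("interdire", "ban"): 40.0,
                                ("interdire", "make"): 5.0,
                                ("faire", "make"): 30.0})
print(pairs)  # [('faire', 'make'), ('interdire', 'ban')]
```

Thresholding can then be applied on top of the symmetric subset, as in the figures above.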

3 In the work on cognate-hood, McEnery and Oakes [13] proceed analogously, giving data on the way in which the estimated probability of cognate-hood varies with threshold levels of a summary figure of merit.

3.2 Increasing recall at the expense of precision We tried increasing recall by relaxing the condition that the pairs considered are the best matches for some word. Allowing second-best matches in the French-to-English data produces 1183 matches from 3517 attempts (33% precision) compared with 864 from 1791 (48% precision) using only the best matches. For English to French we get 1258/4001 (31% precision) rather than 953/2017 (47% precision). Adding Best (or rather K-Best) Match to this gets us 1008/1873 (53.8% precision). The threshold for 90% precision on this data is −2 log λ > 64, yielding 482 correct answers from 532 guesses. This is inferior to the corresponding result using best matches only. We conclude that relaxation of the best match criterion is useful mainly to those lexicographers who are prepared to tolerate a relatively low level of precision (in other words, those who do not find the scanning of lists of word-pairs too onerous). For those, it may be of interest to know that 1364 word pairs can be had by allowing the best 3 matches from French to English, finding 177 more than by using the best two matches, at the cost of scanning 5145 rather than 3517 pairs. This is a drop in overall precision from 33% to 24%. For English to French a further 148 pairs can be found by scanning 5845 rather than 4001 pairs, a drop in precision from 31% to 24%. It is a matter of taste whether one finds this prospect attractive.

3.3 Conclusions on the use of alignment information The use of the log-likelihood statistic in order to derive word-word relations from sentence alignments seems to work satisfactorily in diagnosing potential inter-translatability. There is the usual trade-off between precision and recall, with the best results obtained from a combination of thresholding and the use of Gaussier's symmetry criterion. About 30% of the known translation pairs can be found with a precision of close to 90%. These techniques have so far been developed on a single corpus, and for one particular language pair. We intend to remedy this by applying the techniques to other language pairs within the nine-language parallel corpus. It should be straightforward to extend this work to parts-of-speech other than verbs. We plan to carry out the same analysis for nouns and adjectives. We expect to use the heuristic that interesting verb-noun equivalences probably occur for verbs and nouns which do not have high-scoring verb-verb or noun-noun equivalences. This may make it possible to uncover an interesting range of phrasal and support verbs without the need to enumerate pairs and combinations of words. 4

4 Approximate string matching for cognate extraction The next task is to identify cognates and false friends. Our strategy is to use the translation pair information to classify orthographically similar words. We therefore need a way of detecting orthographic similarity. We began from the work of McEnery and Oakes [13], who use a variant of Dice's coefficient originally due to Adamson and Boreham [1]. We report results from six variants of this method.

4 But see the discussion of false friends later for evidence that it is hard to detect the absence of translation equivalence.

1. dice The original Adamson and Boreham method. This counts the number of shared character bigrams. The formula is:

   2 × |bigrams(x) ∩ bigrams(y)| / (|bigrams(x)| + |bigrams(y)|)

   where bigrams is the function which reduces a word to a multi-set of character bigrams. The order of the bigrams is insignificant.

2. xdice A variant of dice which allows "extended bigrams". By this we mean ordered letter pairs which are formed by deleting the middle letter from a three-letter substring of the word. The formula is:

   2 × |xbigrams(x) ∩ xbigrams(y)| / (|xbigrams(x)| + |xbigrams(y)|)

   where xbigrams denotes the operation of forming a set of bigrams (as before) and adding the extended bigrams.

3. wdice As dice, except that bigrams are weighted in inverse proportion to their frequency. The weights were assigned by the formula:

   weight(bigram_i) = (Ntokens + Ntypes) / (freq(bigram_i) + 1)

   where Ntokens stands for the number of bigram tokens seen in the corpus of English and French, and Ntypes stands for the number of distinct bigram types. This is a simple variant of a standard frequency weighting scheme from information retrieval. We changed only the numerator of the dice formula, so the coefficient is no longer certain to fall between 0 and 1. Applying the same transformation to the denominator would have fixed the problem.

4. wxdice The algorithm which is to xdice as wdice is to dice.

5. xxdice Another transformation of the numerator, this time using string positions as well as extended bigrams. When a bigram is found to be shared, its contribution to the numerator is

   2 / (1 + (pos(x) − pos(y))²)

   where pos returns the string position at which the bigram was found, rather than simply 2. This penalizes matches between bigrams from different parts of the word.

6. lcs Longest common subsequence. The formula is 5:

   2 × |lcs(x, y)| / (|x| + |y|)
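For concreteness, here is one possible Python rendering of the dice, xdice and xxdice measures. This is our sketch, not the authors' code; in particular, the greedy bigram-pairing used in xxdice is our own reading of the position-penalty idea:

```python
def bigrams(w):
    """Multi-set of character bigrams of a word."""
    return [w[i:i + 2] for i in range(len(w) - 1)]

def xbigrams(w):
    """Bigrams plus 'extended bigrams': letter pairs formed by
    deleting the middle letter of each three-letter substring."""
    return bigrams(w) + [w[i] + w[i + 2] for i in range(len(w) - 2)]

def dice(x, y, grams=bigrams):
    """2 * |shared grams| / (|grams(x)| + |grams(y)|), multiset-style."""
    bx, by = grams(x), grams(y)
    shared = sum(min(bx.count(b), by.count(b)) for b in set(bx) & set(by))
    return 2 * shared / (len(bx) + len(by))

def xdice(x, y):
    return dice(x, y, grams=xbigrams)

def xxdice(x, y):
    """Position-sensitive variant: a shared bigram contributes
    2 / (1 + (pos_x - pos_y)**2) instead of a flat 2. Shared bigrams
    are paired greedily, left to right."""
    bx, by = xbigrams(x), xbigrams(y)
    num, used = 0.0, set()
    for i, b in enumerate(bx):
        for j, c in enumerate(by):
            if b == c and j not in used:
                num += 2 / (1 + (i - j) ** 2)
                used.add(j)
                break
    return num / (len(bx) + len(by))

print(round(dice("government", "gouvernement"), 3))  # 0.7
```

The weighted variants (wdice, wxdice) follow the same shape, with each shared bigram's contribution scaled by the corpus-derived weight given above.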

4.1 Results Using Gaussier's Best Match Filter Using Gaussier's best match filter as before, lcs found 553 translation pairs out of 767 (72.1% precision). The dice coefficient found 581 from 860 (67.6% precision). wdice found 582 from 862 guesses (67.5%, not much help). xdice found 602 from 867 (69.4%). wxdice found 602 from 866 (again a small benefit). xxdice finds 587 from 902 (65% precision). The move from bigrams (dice) to extended bigrams (xdice) is useful, but the term-weighting strategies have not paid off much for the symmetrical part of the data.

5 We used Hunt and Szymanski's algorithm [11], relying on the description given by Graham Stephen [17]. An earlier version of this paper described a more elaborate algorithm of our own devising which worked on tries. This is potentially time-efficient, but involves large data structures and proved impractical for our application.
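A minimal sketch of the lcs measure, using the classic quadratic dynamic programme rather than Hunt and Szymanski's algorithm (which is what the paper actually used); for words of typical length the difference is negligible:

```python
def lcs_len(x, y):
    """Length of the longest common subsequence, via the standard
    O(len(x) * len(y)) dynamic programme with a rolling row."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_sim(x, y):
    """The similarity coefficient from the text: 2*|lcs| / (|x|+|y|)."""
    return 2 * lcs_len(x, y) / (len(x) + len(y))
```

Identical words score 1.0, and words with no letters in common order score 0.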

4.2 Results using thresholding Tables 4, 5 and 6 indicate how the precision and recall of the various string similarity metrics can be manipulated by use of thresholding. In every case we restrict our attention to the top 20,000 pairs thrown up by the ordering. The data suggest that dice works better than wdice at the high end of the scales, but note that wdice recovers 38 more correct pairs if one is prepared to tolerate 50% precision. The results for dice are not directly comparable with those of McEnery and Oakes [13] because of our use of part-of-speech tagging to pick out only verbs. xdice is marginally inferior to wxdice throughout. The introduction of gappy bigrams seems to be advantageous, since xdice outperforms dice and wxdice outperforms wdice. However, Table 6 shows that xxdice outperforms all the other methods on this data, yielding to wdice only in yield at very low thresholds. It is unlikely that this notional superiority of wdice would ever matter, while the high precision of xxdice in more accessible regions of the space is clearly important.

min(dice)  N      Correct  Precision    min(wdice)  N      Correct  Precision
0.90       114    110      96.5%        5.35        87     82       94.3%
0.85       219    193      88.12%       5.00        159    136      85.5%
0.80       299    255      85.3%        4.65        284    227      79.9%
0.70       860    450      52.3%        3.8         932    488      52.3%
0.43       20000  763      3.81%        2.03        20000  784      3.92%

Table 4. The effect of thresholding on dice and wdice

min(xdice)  N      Correct  Precision    min(wxdice)  N      Correct  Precision
0.90        73     70       95.9%        5.3          127    119      93.7%
0.85        169    153      90.5%        5.07         192    173      90.1%
0.80        300    263      87.7%        4.7          333    272      81.7%
0.65        1021   496      48.8%        3.7          1088   558      51.3%
0.4         20000  772      3.86%        0.4          20000  799      4.0%

Table 5. The effect of thresholding on xdice and wxdice

5 Combining Sentence Alignment and Approximate String Matching We now return to the original purpose of the paper: finding translations and false friends for lexicographers. Since we are interested in finding pairs which look like translations, we fix on the string comparison method which has the best success in finding translations, namely xxdice. We collected all the pairs which have either a likelihood ratio greater than 13 or an xxdice coefficient greater than 0.5 (thresholding close to 50% precision in each case). There are 1278 of these: 339 are over threshold on both criteria, 778 are orthographically similar but below threshold on alignment, and 151 are aligned but not orthographically similar. There are 880 distinct English verbs and 908 distinct French ones.

min(xxdice)  N      Correct  Precision    min(lcs)  N      Correct  Precision
0.813        83     82       98.7%        0.924     74     72       97.3%
0.778        145    143      98.6%        0.90      74     72       97.3%
0.715        332    306      92%          0.86      278    232      83.45%
0.50         1127   576      51.1%        0.79      1100   526      47.8%
0.22         20000  749      3.74%        0.615     20000  787      3.94%

Table 6. The effect of thresholding on xxdice and lcs

A reliable method for finding translation pairs The pairs for which both sentence alignment and xxdice are above threshold are highly likely to be translations. Of the 339 pairs falling into this category, 338 are translations.
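The combination just described can be sketched as a simple classifier over precomputed scores. The function, its name and the toy (unaccented) data below are our own illustration; only the threshold defaults follow the paper's figures:

```python
def classify(pairs, llr_scores, xx_scores, llr_t=13.0, xx_t=0.5):
    """Assign each (fr, en) pair to a cell of the Table 1 design,
    using thresholds at roughly 50% precision on each criterion.
    llr_scores / xx_scores map pairs to precomputed likelihood-ratio
    and xxdice values; missing pairs default to zero."""
    out = {"translations": [], "false_friend_candidates": [], "aligned_only": []}
    for p in pairs:
        aligned = llr_scores.get(p, 0.0) > llr_t
        similar = xx_scores.get(p, 0.0) > xx_t
        if aligned and similar:
            out["translations"].append(p)             # very high precision
        elif similar:
            out["false_friend_candidates"].append(p)  # needs human vetting
        elif aligned:
            out["aligned_only"].append(p)             # non-cognate translations
    return out

# Toy illustration (real scores come from the corpus).
result = classify(
    pairs=[("alleger", "allege"), ("alleger", "alleviate"), ("interdire", "ban")],
    llr_scores={("alleger", "alleviate"): 40.0, ("interdire", "ban"): 40.0},
    xx_scores={("alleger", "allege"): 0.8, ("alleger", "alleviate"): 0.55})
print(result["false_friend_candidates"])  # [('alleger', 'allege')]
```

The "false friend candidates" cell is exactly where the human monitoring discussed next becomes indispensable, since absence of alignment evidence may only be a sparse-data effect.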

A method for finding false friends We now turn to a means of finding false friends. In principle, all that is required is to look for pairs which have good orthographic similarity, but which are not matched by the likelihood ratio technique. Unfortunately, there are two possible reasons why pairs should not appear in the aligned corpus: 1. They are not in fact translations. 2. They are potential translations, but the translation was not used in the corpus at hand. This is a sparse data effect. The first results which we show are for French-to-English. There are 45 cases where the best orthographic match is better than the orthographic match for the pair having the highest alignment score. There are useful suggestions in the 45, notably that "contenir" might mean "contain" rather than "contend", that "confondre" might mean "confuse" rather than "conform", that "commander" might mean "commission" rather than "commandeer", and that "alléger" means "alleviate" rather than "allege". All of this needs careful monitoring by the lexicographer, because along with the successful corrections come pairs of words both of which are translations (for example that "allouer" can be translated by "allot" or by "allocate", and that "stocker" is either "stock" or "store"). More dubiously, we are told that "réduire" is not "require" but "reduce", that "saturer" is not "mature" but "saturate", and that "saper" is not "paper" but "sap". None of these confusions would ever have arisen for a human reader: they are unwelcome artefacts of the approximate string matching. xxdice is not noticeably better or worse than any of the others in throwing up this sort of nonsense. Finally, we are informed that "inverser" means "notice" rather than "reverse", which is counter-productive. Going from English to French, there are 48 suggestions.
Good ones include the suggestion that "comporter" is a better translation of "comprise" than "comprimer" (which actually means "compress"), and that "allege" is to be translated by "alléguer" rather than "alléger". We are also told that "refléter" (which means "reflect as in a mirror") is a less good translation of "reflect" than "réfléchir" (to reflect on some matter), which, while untrue in general, is undoubtedly a faithful reflection of what one expects to find in a corpus of parliamentary language.

Viewed solely as techniques for detecting false friends, the methods in this paper are at an early stage of development. It looks as if the main limiting factor is the unselectivity of the methods for detecting orthographic similarity, which throw up very many "confusions" that would never trouble a human reader. If this is fixed, the next limiting factor will be the sparsity of the data available in bilingual corpora, closely followed by the limited linguistic range of the corpora currently available.

6 Conclusions The research in this paper has all been aimed towards the generation of suitably ordered lists of word pairs which a lexicographer can use as input. The main features of the effort have been: 1. The use of likelihood ratio techniques for picking out word-pairs whose alignment in a corpus is significant. 2. The use of k-best matching to get back some of the recall lost by using Gaussier's Best Match criterion to increase precision. As far as we know the combination of k-best and the symmetry criterion is new. 3. An evaluation of several potentially useful variants of Dice's coefficient. 4. The xxdice coefficient, combined with the use of likelihood ratio, forms an extremely high precision tool for the detection of translation pairs in our corpus. 5. A preliminary technique, needing extensive human supervision, for pulling out false friend information from corpus data. An emphasis on the needs of the lexicographer is the driving force behind the work reported, so we are hoping to carry the same principle into extensions of the work. We are aware of various arbitrary choices which we have made, including the use of likelihood ratio as an ordering and thresholding criterion. While we are provisionally satisfied with this criterion, we would prefer to avoid hard-wiring any criterion. An obvious option is to tune an initial presentation order by allowing the lexicographer to mark word pairs which are deemed of particular interest, adjusting the presentation order in such a way as to minimise the number of distractors which are presented before the items of interest, and maximise the probability that interesting pairs are found in the early part of the ordering. Such techniques are common in document retrieval applications [15]. Their primary advantage is an ability to provide a powerful tunable mechanism from whose complexities the user is almost entirely sheltered, since they need do nothing more than mark interesting pairs. Adding a "More pairs" button would provide analogous feedback relevant to the precision/recall trade-off discussed above. It might even be useful to generate new variations of the ordering criteria on the fly by means of some suitable adaptive process. With or without the fulfilment of this last, highly speculative possibility, the aim is to provide a system capable of attuning itself to the lexicographer's idea of lexicographic interest.

References

1. G.W. Adamson and J. Boreham. The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Information Storage and Retrieval (10), pp. 253-60, 1974.
2. Susan Armstrong, Graham Russell, Dominique Petitpierre and Gilbert Robert. An Open Architecture for Multilingual Text Processing. In From Texts to Tags: Issues in Multilingual Language Analysis: Proceedings of the ACL SIGDAT Workshop, Dublin, Ireland, March 27, 1995.
3. Simon Baugh. ILD Corpus and Lexicon Search System User Manual. Unpublished manuscript, Cambridge Language Services, 1996.
4. Edward Briscoe and John Carroll. Toward Automatic Extraction of Argument Structure From Corpora. Rank Xerox Research Centre Technical Report MLTT-006, Meylan, 1994.
5. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics 19(2), pp. 263-312, 1995.
6. Mark Davis, Ted Dunning and Bill Ogden. Aligning noisy corpora. Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pp. 67-74, March 27-31 1995, University College Dublin, Belfield, Dublin, Ireland.
7. Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), pp. 75-102, 1992.
8. William A. Gale and Kenneth W. Church. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics 19(1), pp. 61-74, 1993.
9. E. Gaussier, J-M. Lange and F. Meunier. Towards Bilingual Terminology. Proceedings of the ALLC/ACH Conference, pp. 121-24, OUP, Oxford, 1992.
10. Pierre Henri-Cousin. Collins Pocket French Dictionary. Collins, London and Glasgow, 1987.
11. J.W. Hunt and T.G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM 20(5), pp. 350-3, May 1977.
12. Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, 1993.
13. Tony McEnery and Michael Oakes. Sentence and word alignment in the CRATER Project. In Thomas & Short [16], Using Corpora for Language Research, pp. 211-231.
14. Paul Procter (ed). Cambridge International Dictionary of English. Cambridge University Press, 1995.
15. G. Salton and C. Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science 41(4), pp. 288-297, 1990.
16. Jenny Thomas and Mick Short (eds). Using Corpora for Language Research: Studies in honour of Geoffrey Leech. Longman, London and New York, 1996.
17. Graham A. Stephen. String Search. Technical Report TR-92-gas-01, School of Electronic Engineering Science, University College North Wales, Bangor, UK, 1992.
18. Della Summers. Computer Lexicography: the importance of representativeness in relation to frequency. In Thomas & Short [16], pp. 260-266.
19. Yorick A. Wilks, Brian M. Slator and Louise M. Guthrie. Electric Words: Dictionaries, Computers and Meanings. MIT Press, Cambridge, 1996.