Can Bilingual Word Alignment improve Monolingual Phrasal Term Extraction?

Jörg Tiedemann
Department of Linguistics
Uppsala University
Sweden

Abstract: This paper focuses on the improvement of statistically extracted phrase lists by applying word alignment approaches to bitext (footnote 1). Such phrase lists serve several tasks such as the compilation of terminology or translation databases. Our investigations are based on the assumption that word alignment favors well-formed phrase structures rather than irregular text segments. If this is the case, word alignment will filter out irregular structures from automatically generated phrase lists. As a result, a phrase list with improved precision may be compiled. Furthermore, word alignment approaches can be used to identify additional multi-word units, e.g. multi-word cognates. Our investigations are focused on a Swedish/English text corpus that has been aligned with the Uppsala Word Aligner (UWA). Finally, we describe and apply three approaches to evaluate the automatically generated phrase lists: an evaluation by comparing results with existing reference data (prior reference), an evaluation against given syntactic patterns (prior reference patterns), and a manual evaluation of sample data (posterior reference). The evaluations of the extraction of phrasal terms in English substantiate the assumption: precision has been significantly improved with insignificant loss in recall.

1. Introduction

Domain-specific terminology is one of the most important language resources in document production. Consistency of terms and their usage has to be assured, especially in public documents such as political and technical texts. Authors of such documents need to be supported by efficient tools and comprehensive information about terms and their appropriate usage. Furthermore, an increasing number of documents have to be translated into several languages. However, translation is often carried out by external partners, which is time- and cost-consuming. In addition, external translation partners are usually not familiar with the terminology used for specific purposes. Therefore, the need for term databases becomes even more important for document translation than for production.

In recent years much effort has been devoted to compiling terminology databases automatically by analyzing document collections. Existing written documents contain large amounts of valuable monolingual and multilingual information. Such information can be used for building term databases that support document production as well as human and (semi-)automatic translation. However, the task of identifying and extracting terminology information is not trivial. Usually, single-word approaches produce reliable results with little effort. Difficulties arise with phrasal terms in both monolingual and multilingual approaches. The problem can be defined as the problem of identifying appropriate multi-word units (MWUs) that are part of the terminology. However, the definition of phrasal terms is not straightforward and may vary for different applications. For translation in general, a wider set of phrasal structures has to be included, whereas terminology databases usually focus on specific constructions such as proper names and non-compositional compounds.

Another demanding task is the evaluation of automatically extracted phrases. Phrasal term extraction can be seen as a retrieval problem in which the most relevant phrasal terms in a corpus are to be found. The performance of retrieval systems is usually described in terms of recall and precision, which can be estimated by evaluating representative samples of the result (posterior reference) or by comparing results with a "gold standard" (prior reference) that has been created in advance. Both techniques have their drawbacks, as discussed in e.g. [Ahrenberg et al. 2000]: "Gold standards" have to be created in advance and they have to fit the retrieval problem. Once created, gold standards can be re-used for further automatic evaluations, whereas posterior evaluations have to be re-done from scratch for each evaluation. However, creating reference data for evaluating phrase extraction is very time-consuming and requires detailed guidelines. Yet another approach to evaluation is to compare results with other phrase extraction tools by e.g. calculating cross-entropies [Pantel and Lin 2001].

In this study, we applied three approaches for evaluating our results: we used manually extracted lists of phrasal terms for prior reference evaluation; we evaluated our results against previously defined syntactic patterns for typical phrasal terms; and, finally, we performed manual evaluations of random samples. A detailed description of our approaches and results is presented in section 5.

2. Automatic Recognition of Phrasal Terms

There are generally two types of approaches to automatic multi-word term recognition: linguistic approaches and statistical approaches. A summary of term extraction methods can be found in [Kageura and Umino 1996] and, focused on multi-word recognition, in [EAGLES 1999]. Linguistic approaches concentrate on the recognition of pre-defined syntactic patterns in the text. These patterns are usually defined as sequences of part-of-speech tags that represent phrasal structures, typically noun phrases [Daille et al. 1994, Justeson and Katz 1995, Arppe 1995, Bennett et al. 1999]. Statistical approaches apply frequency-based measures for ranking word N-grams in order to identify phrasal structures [Smadja 1993, Dagan and Pereira 1994, Pantel and Lin 2001]. Several statistical metrics have been proposed, e.g. mutual information [Church and Hanks 1989], the Dice coefficient [Smadja et al. 1996], the loglike coefficient [Dunning 1993], and entropy [Merkel and Andersson 2000].

The main advantage of statistical methods compared to linguistic approaches is their independence of the language under consideration. However, purely statistical methods typically over-generate candidate collocations. Linguistic approaches, in turn, require language-specific tools for tagging the corpus with the necessary information and a careful definition of syntactic phrase patterns. Combined techniques in hybrid systems may be used to reduce over-generation. In [Merkel and Andersson 2000, Tiedemann 1999b], the authors propose language-specific filters in terms of classified phrase boundary words to improve precision.

There are several approaches that combine the extraction of bilingual terminology with statistical word alignment. In [van der Eijk 1993], the author focuses on noun phrases and proposes a frequency-based measure for linking them to their correspondences in the target language. A system for semi-automatic extraction of bilingual terminology has been presented in [Dagan and Church 1994]. In [Smadja et al. 1996], another tool for extracting bilingual lexicons is presented, which focuses on finding translations of previously extracted collocations. In the approaches above, phrasal terms have to be compiled in advance and co-occurrence measures are used to find correspondences in bitexts. The use of parallel text for the identification of multi-word terms in both languages as a result of bilingual word alignment has been investigated in [Melamed 1997]; the author uses statistical translation models for discovering non-compositional compounds.

In this article, we focus on the use of knowledge-lite word alignment for monolingual lexicography. Unlike Melamed's approach, we start with statistically extracted word collocations for each of the two languages. Then, we apply bilingual word alignment in order to reduce over-generation in both phrase lists such that the quality of the monolingual lists of phrasal terms is improved.

3. The basic assumption

Terminology extraction is commonly based on monolingual text. Phrasal terms are particularly hard to detect. Phrase boundaries have to be defined in order to identify terms in context. This is a difficult task. Consider, e.g., the sentence 'A sufficient number of dwellings to meet housing needs are to be constructed.' (footnote 2): There are several possibilities to divide the sentence into phrasal units. 'A sufficient number', 'number of dwellings', 'dwellings to meet housing needs', and 'housing needs' are some examples of phrases that might be of interest. They all overlap, and the phrase boundaries are hard to define. However, assuming that phrasal terms are translated rather consistently into other languages, we expect that bilingual word alignment can be used to significantly improve the precision of automatic term extraction from text for each of the two languages. Furthermore, we assume that the loss in recall will be rather small if the word aligner covers most of the translation relations in the text.

Statistical approaches heavily over-generate phrase candidates by considering all possible word combinations. Among them, there are well-formed phrases of general language that are not relevant for terminology databases. Furthermore, many overlapping, incomplete and malformed constructions are to be found in the list of phrase candidates. Incomplete and overlapping phrases are particularly hard to detect because they may seem to be well-formed out of context. Word alignment, however, works well for recurrent multi-word terms with consistent translations, whereas phrase structures of general language are less likely to be aligned according to our data. It is also assumed that equivalents to malformed phrase constructions are less likely to be linked.

Monolingual lists of phrasal terms can be extracted from the word alignment results. We assume that these lists have a higher precision, at a reasonable loss of recall, than the lists of phrasal terms that have been extracted statistically from monolingual text. Furthermore, we argue that word alignment can produce additional phrasal terms due to certain alignment techniques. This assumption will be investigated in the following study of a Swedish/English bitext.

4. Handling phrases with the Uppsala Word Aligner - UWA

A prerequisite for this investigation is a word alignment system that handles multi-word units. The Uppsala Word Aligner (UWA) [Tiedemann 1998, Tiedemann 1999b] is such a system that aligns textual units from bitext below the sentence level. It was developed within the co-operative project on parallel corpora, PLUG [Sågvall Hein 1999]. UWA combines knowledge-lite approaches to word alignment, mainly using frequency-based co-occurrence measures and string similarity measures. Furthermore, UWA applies simple stemming functions and heuristic parameters such as position weights. UWA extracts token links iteratively from sentence-aligned bitext and produces a bilingual lexicon from the alignments.

In this study, we apply our approach to the Swedish/English portion of the Scania corpus, a multilingual collection of technical text provided by the Swedish truck and bus manufacturer Scania. This corpus has been used for the definition of a controlled vocabulary for Scania and a corresponding language checker for document production [Sågvall Hein and Almqvist 2000].

Within the PLUG project, we experienced the complexity of word alignment approaches and their evaluation when phrasal structures are involved [Ahrenberg et al. 2000]. UWA handles the bilingual alignment of phrasal structures in the following way:

1. Generate all possible phrases for both languages from the bitext.
2. Annotate phrases in the text using the generated phrases from the previous step. Consider phrases as single tokens for co-occurrence measures.
3. Align multi-word units if they fulfill the alignment constraints.

4.1 Phrase generation

The automatic generation of phrases in UWA is based on word association scores. Similarly to [Smadja 1993], an iterative process is applied in which the size of word N-grams is increased step by step. We apply mutual information scores in order to determine the significance of the current unit, i.e. the relation between the current (N-1)-gram X and its direct successor Y:

$$MI = \log_2 \frac{\mathrm{prob}(X_{N-1}, Y)}{\mathrm{prob}(X_{N-1})\,\mathrm{prob}(Y)}$$

Currently, only contiguous bigrams and trigrams are considered in UWA for efficiency reasons.
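The score above can be illustrated with a minimal Python sketch for the bigram case. It is not UWA code: the function name `mi_scores`, the `min_freq` parameter, the estimation of probabilities by relative frequency, and the toy token sequence are all assumptions made for illustration.

```python
import math
from collections import Counter

def mi_scores(tokens, min_freq=2):
    """Score contiguous bigrams (x, y) with MI = log2(p(x, y) / (p(x) * p(y))).

    Probabilities are relative frequencies (an assumption of this sketch).
    Bigrams occurring fewer than min_freq times are skipped, because
    frequency-based association scores are unreliable for rare units.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), freq in bigrams.items():
        if freq < min_freq:
            continue
        p_xy = freq / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy input (invented, not taken from the Scania corpus).
tokens = "check the warning lamp and the warning lamp relay then check the relay".split()
scores = mi_scores(tokens)
```

On this toy input, 'warning lamp' receives a higher score than 'the warning' because 'the' is more frequent overall. A trigram step would apply the same function with an accepted bigram in the role of X.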

Furthermore, we use classified stop word lists, as proposed in [Merkel and Andersson 2000], in order to exclude certain function words from certain positions in phrase candidates and to define break points for phrasal structures. Thus, certain words can mark the end or the beginning of phrases, while other words are not allowed at the beginning, at the end, or within multi-word units.

Besides co-occurrence measures, we use the overall frequency of the phrase within the complete text. Co-occurrence measures based on frequency do not work for low-frequency units. Therefore, such units are excluded.

Thanks to the combination of association scores, classified stop word lists, and frequency thresholds, this technique produces valuable phrase collections within reasonable processing time. However, the process produces many inclusions, incomplete units and overlapping units. Consider the following phrase candidates that have been extracted from the Scania material:

ABS warning
ABS warning lamp
axle raised
front axle

Here, 'ABS warning' is included in the second phrase. It might be a correct phrase on its own if it occurs as an independent phrase in the corpus; otherwise it is incomplete. The phrase candidate 'axle raised' seems to be malformed, whereas 'front axle' might be a correct term. Both candidates include the word 'axle' and might overlap in some contexts such as 'the front axle raised'. The number of both incomplete and overlapping phrases should be minimized; phrase lists should include only those phrasal terms that occur as independent units in the text.

4.2 Annotating phrases

The next step in the UWA alignment process is an annotation step. Here, UWA uses the phrase lists that were generated in the previous step and tries to identify and annotate them in the text. The UWA annotation module applies a simple left-to-right segmentation: it annotates the longest unit that can be found in the text and continues the annotation process with the word that follows the last annotation. With this heuristic, overlapping phrases and inclusions are excluded from the list of candidate phrases. Consider the following examples that have been annotated using the phrases from the example above (annotated units in brackets):

Without [front axle] raised ...
and the [ABS warning lamp] lights.

Now, new monolingual phrase lists can be extracted from the annotated material for both languages. In these lists, inclusions and overlapping phrases are excluded. According to the example above, the phrase candidate 'ABS warning' will be removed if it does not occur as an independent instance in some other context. Furthermore, 'axle raised' is excluded as well because it overlaps with the annotated phrase 'front axle'. In other words, the longest left-most phrase is favored according to our strategy.

Still, phrases of general language and incorrect phrases remain in the extracted lists of phrase candidates after the annotation step. In the next step we apply bilingual word alignment in order to get rid of them.
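The longest-match, left-to-right strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the UWA annotation module: the function name `annotate` and the `max_len` parameter are hypothetical, and phrases are matched as exact token sequences.

```python
def annotate(tokens, phrases, max_len=3):
    """Greedy left-to-right segmentation.

    At each position the longest known phrase is taken as one unit;
    the scan then continues with the word after that unit.
    """
    phrase_set = {tuple(p.split()) for p in phrases}
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first (max_len words), down to 2 words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrase_set:
                units.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single word.
            units.append(tokens[i])
            i += 1
    return units

candidates = ["ABS warning", "ABS warning lamp", "axle raised", "front axle"]
text = "Without front axle raised and the ABS warning lamp lights".split()
units = annotate(text, candidates)
# Multi-word units that were actually annotated in the text.
surviving = {u for u in units if " " in u}
```

For the two example fragments, only 'front axle' and 'ABS warning lamp' survive: 'axle raised' is never reached because 'front axle' has already consumed 'axle', and 'ABS warning' is always absorbed by the longer match.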

4.3 Aligning phrases

All processing so far has concerned monolingual texts. Now, we will take advantage of the parallel character of the corpus and perform word alignment according to the basic assumption. Besides co-occurrence measures, UWA applies additional extraction strategies such as string similarity measures for the extraction of cognates [Tiedemann 1999a, Tiedemann 1999b]. This technique can be applied to multi-word units, so that phrasal cognates can be identified. Consider the following Swedish/English examples:

varningslampor          warning lamps
se exempel              see example
retardens oljefilter    retarder oil filter

First, UWA collects alignment candidates, rated by their association score (including string similarity scores). Secondly, all possible N-grams are generated for both languages within the current bitext segment, and pairs of such word combinations from corresponding sentences are combined within a certain link window. The maximum length of N-grams can be adjusted by corresponding parameters. Currently, the default length is set to three. Then, the system searches for possible alignments in the list of translation candidates. All pairs that could be identified are stored in the link list sorted by their score. Finally, the system aligns all pairs, starting with the pair with the highest score. In this way, the most confident pairs are linked first. All linked words are removed immediately from the text segment such that no word can be aligned twice.

The results of the UWA alignment are a list of token links and an extracted bilingual lexicon (type links). Phrasal terms are to be found both in the source and the target language. Aligned multi-word units are phrases with a certain consistency in translations. Furthermore, a fair number of additional phrases are identified by means of the string similarity criterion.
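The final linking step (highest score first, each word linked at most once) can be sketched as follows. The function name `link_best_first`, the representation of units as tuples of token positions, and the scores in the example are assumptions for illustration only; the sketch does not reproduce how UWA computes its scores.

```python
def link_best_first(candidates):
    """Link candidate pairs in order of decreasing score.

    candidates: iterable of (score, source_unit, target_unit), where each
    unit is a tuple of token positions in the current bitext segment.
    A pair is linked only if none of its words has been linked before.
    """
    used_src, used_trg = set(), set()
    links = []
    for score, src, trg in sorted(candidates, key=lambda c: c[0], reverse=True):
        if used_src.isdisjoint(src) and used_trg.isdisjoint(trg):
            links.append((src, trg, score))
            used_src.update(src)
            used_trg.update(trg)
    return links

# Toy segment: sv = ["se", "varningslampor"], en = ["see", "warning", "lamps"].
# Scores are invented for this example.
candidates = [
    (0.4, (1,), (2,)),    # varningslampor <-> lamps
    (0.9, (1,), (1, 2)),  # varningslampor <-> warning lamps
    (0.8, (0,), (0,)),    # se <-> see
]
links = link_best_first(candidates)
```

In this toy segment the multi-word link 'varningslampor' to 'warning lamps' is made first, which blocks the weaker partial link to 'lamps' alone.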

5. Evaluating phrase extraction results

Phrasal terms have been produced for two languages at three different levels as the result of the previously described process:

• automatically generated phrases
• phrases that have been annotated
• phrases that have been aligned

Each step discards a number of phrase candidates. Additional phrase candidates such as cross-lingual multi-word cognates are retrieved in the alignment phase.

Evaluating precision and recall of such lists is far from trivial. Evaluation depends on the purpose of the application, i.e. on the type of result that is aimed at. This investigation focuses on the extraction of phrasal terms for building terminology databases.

In order to evaluate the proposed strategy, we aligned the complete Swedish/English part (about 2.7 million words) of the Scania corpus. Table 1 summarizes the number of extracted phrases at each of the three levels: automatically generated phrases, annotated phrases, and bilingually aligned phrases. Furthermore, the number of aligned phrases that had been annotated in the previous step, as well as the number of additional phrases from the alignment step, are presented in the table.

Table 1: Extracted phrases from the Scania corpus.

                         Swedish   English
generated phrases         16,496    19,529
annotated phrases         12,188    15,079
aligned phrases            4,138     9,843
annotated and aligned      3,148     6,865
added in alignment           990     2,978

Precision and recall are hard to estimate as regards phrases. Recall is to be measured by comparing retrieved phrases with the total number of phrases in the corpus. However, the total number of phrases is impossible to calculate automatically even if detailed syntactic rules are defined. This is primarily due to syntactic ambiguity. The analysis into phrases is very hard even for manual annotators who are supported with detailed guidelines. Manual annotations will vary for each investigation. Another drawback is the very time-consuming effort that is necessary for such tasks.

5.1 Evaluation using a Gold Standard

One possibility for estimating recall is to compare the outcome with a set of reference data, a gold standard, which has been defined in advance (prior reference). The gold standard should include a representative subset of relevant data from the data collection under consideration. Appropriate gold standards are difficult to produce and require tools and guidelines.

In a previous project, multi-word units were collected for text from the same domain (Scania manuals). The result of the project is MULTERM, a multilingual term database [Brännström and Dahlqvist 1994]. It includes multi-word terms that are typical for these texts. We assume that this collection of manually collected phrases is representative to a certain extent for the corpus we were looking at. It includes 2,153 English phrases. Each extracted phrase list was compared with this "gold standard" by taking the intersection of both sets. We assume that the sizes of these intersections can be used to estimate recall values for the phrase extraction process. Certainly, we cannot expect to find all phrases from the gold standard in the extracted material because both sets are based on different texts. However, the recall values with respect to this gold standard describe differences between the extracted phrase lists quite well. Table 2 summarizes "recall" estimations with respect to the gold standard.

Table 2: Identification of MULTERM terms.

English               phrases   portion
MULTERM                 2,153   100.00%
generated               1,158    53.79%
annotated               1,129    52.44%
aligned                 1,135    52.72%
lost in annotation         29     1.35%
lost in alignment         171     7.94%
added in alignment        177     8.22%
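The intersection-based estimate can be sketched in Python as follows. This is a hedged illustration, not the evaluation script used in the study: the function name `relative_recall` and the toy phrase sets are invented, and the sketch compares a generated list directly with an aligned list, skipping the annotation step for brevity.

```python
def relative_recall(extracted, gold):
    """Share of gold-standard phrases found by exact match in `extracted`."""
    gold = set(gold)
    return len(gold & set(extracted)) / len(gold)

# Toy data (invented for illustration).
gold = {"warning lamp", "oil filter", "front axle", "exhaust brake"}
generated = {"warning lamp", "oil filter", "front axle", "axle raised"}
aligned = {"warning lamp", "oil filter", "exhaust brake"}

# Gold-standard phrases lost and gained between the two lists.
lost = (gold & generated) - aligned
added = (gold & aligned) - generated

# The 'generated' portion reported in Table 2, recomputed from its counts.
generated_portion = round(100 * 1158 / 2153, 2)
```

In the toy data one gold-standard phrase is lost and one is gained, so the relative recall stays at 0.75 while the aligned list is shorter. This is the same pattern that Table 2 shows at corpus scale.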

Table 2 lists the number of phrasal terms from the gold standard that have been found among the extracted phrase candidates at each processing step. Note that only exact matches are counted. There are, for example, about 140 additional phrases in each step that are inclusions. Furthermore, the table shows the number of "gold standard phrases" that have been excluded in each step and how many such phrasal terms have been found in the alignment step.

The values in table 2 clearly show a nearly constant number of relevant terms with respect to the gold standard in the extracted phrase lists. The loss of relevant terms is insignificant in the annotation phase, and the loss in the alignment phase is more than compensated by the gain in terms. This indicates an improvement of precision, considering the total number of phrases that have been retrieved as shown in table 1 and assuming that the values in table 2 describe an estimation of recall in a relative manner.

5.2 Evaluation using Syntactic Phrase Patterns

An indication as described above is not sufficient for evaluating precision. The problem in measuring precision is to find an appropriate definition of validity for possible phrase constructions. Validity depends on the language model that is being used to describe the syntactic structure of a language. Typically, phrases are defined by sequences of morphosyntactic tags. Extracted phrase candidates can be evaluated by matching them against such patterns.

Justeson and Katz [1995] investigated technical terms from different English domains in order to find typical term patterns. They show that about 96% of all terms are noun phrases and about 4% adjectives. Furthermore, they looked at phrasal terms and found that about 97% of the multi-word noun phrases included only nouns and adjectives, and about 3% additionally contained a preposition. They defined the following pattern for extracting English noun phrases (footnote 3):

((Adj|N)+|(Adj|N)*(Prep)?(Adj|N)*)(N)

According to the investigations in [Justeson and Katz 1995], adjectives are the next most frequent term element. David Quinn [1997] specified three main types of English multi-word patterns for adjectives:

(Adv)(Adj)
(N)(Adj)
(N)(N)(Adj)
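Matching tagged candidates against these patterns can be sketched with Python's `re` module. This is only an illustration under stated assumptions: the tag names `Adj`, `N`, `Prep`, `Adv` follow the patterns above, the one-letter encoding and the function name `matches_term_pattern` are hypothetical, and the tag sequences in the usage example are assigned by hand rather than by a tagger.

```python
import re

# Each tag is encoded as one letter so a phrase becomes a short string.
CODES = {"Adj": "A", "N": "N", "Prep": "P", "Adv": "R"}

# Noun-phrase pattern as printed above: ((Adj|N)+|(Adj|N)*(Prep)?(Adj|N)*)(N)
NP2 = re.compile(r"((A|N)+|(A|N)*P?(A|N)*)N")
# The three adjective patterns: (Adv)(Adj), (N)(Adj), (N)(N)(Adj)
ADJ = re.compile(r"RA|NA|NNA")

def matches_term_pattern(tags):
    """True if a multi-word tag sequence matches NP2 or an adjective pattern."""
    if len(tags) < 2:
        return False  # only multi-word candidates are of interest here
    # Tags outside the four classes are encoded as 'x' and never match.
    encoded = "".join(CODES.get(t, "x") for t in tags)
    return bool(NP2.fullmatch(encoded) or ADJ.fullmatch(encoded))

# Hand-tagged examples (tags are assumptions, not tagger output).
ok_compound = matches_term_pattern(["N", "N", "N"])   # e.g. 'ABS warning lamp'
ok_prep = matches_term_pattern(["N", "Prep", "N"])    # e.g. 'number of dwellings'
bad_verb = matches_term_pattern(["N", "V"])           # e.g. 'axle raised'
```

Using `fullmatch` ensures that the whole candidate, not just a part of it, has to fit the pattern.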

Linguistic analysis of extracted phrases as described above presupposes tagging. The extraction approaches discussed in this paper are based on knowledge-lite methods, i.e. we assume that no linguistic tools are available for the extraction. However, for this evaluation we used the publicly available part-of-speech tagger from IMS Stuttgart [Schmid 1994], which is based on decision trees. We applied the tagger to each list of extracted phrases. We found the tagging results to be reasonable even though the available context is minimal (one or two words).

Using the tagged phrase lists, we extracted all candidates that match the multi-word term patterns described above. The results of this extraction are summarized in table 3.

Table 3: Matching term patterns. NP1 refers to a simple noun-phrase pattern: 0 or more adjectives/nouns followed by a noun. NP2 refers to the noun-phrase pattern defined by Justeson/Katz. NP2,ADJ refers to the NP2 pattern plus Quinn's three adjective patterns.

English              NP1       NP2       NP2,ADJ
(1) generated        36.79%    49.09%    50.14%
(2) annotated        39.23%    50.03%    51.06%
(3) aligned          54.74%    65.78%    66.47%
excluded in (2)      28.41%    45.64%    46.99%
excluded in (3)      28.43%    38.69%    39.96%
(2) and (3)          52.15%    63.48%    64.21%
added in (3)         61.05%    71.59%    72.16%

The portion of candidates that match the syntactic term patterns increases at each extraction level. Different sets of term patterns result in similar improvements. Phrase candidates that have been excluded in annotation and alignment do not match these general patterns in most cases, which indicates that mainly malformed phrase candidates have been excluded. On the other hand, most of the aligned phrases match the given patterns, both for annotated phrases and for newly discovered phrases.

The values in table 3 give some indication of the quality of the extracted terms at different stages. However, they should not be confused with precision measures for the extracted phrasal terms. A matching collocation does not necessarily represent a correct and complete phrase. Consider, e.g., that most of the phrasal terms that have been excluded in annotation are inclusions and are therefore possibly not complete even though they might match one of the term patterns. Furthermore, the general term patterns do not cover all phrase types that we are interested in. For example, verb phrases are completely excluded from this consideration. Finally, we have to consider failures of the tagger that may occur when very limited context is provided and text from a very specific domain is processed.

5.3 Manual Evaluation

Another possibility to estimate precision is to check a sample of the outcome manually. Evaluation will be based on intuition and may be supported by appropriate guidelines with explanatory examples. Guidelines for phrase validity evaluation were not available for our investigations. The following evaluations are based on random samples of 300 English collocations, which were examined after each of the three steps. Concordances were used in cases of uncertainty. Table 4 summarizes the obtained values for each manual evaluation.

Table 4: Manual evaluation of extracted samples (300 units).

English                   accepted   portion
generated                      220    73.33%
annotated                      248    82.67%
aligned                        271    90.33%

                           invalid   portion
excluded in annotation         155    51.67%
excluded in alignment           98    32.67%
added in alignment              28     9.33%

Table 4 shows significant improvements in precision after annotation and alignment. Furthermore, a large portion of the phrases that had been excluded in annotation and alignment was not accepted by the human evaluator, whereas most of the newly discovered phrases from the alignment step could be accepted. Still, the loss of valid phrases seems to be quite high. Many of these phrases are syntactically well-formed out of context. However, a large portion of them are inclusions and incomplete in the actual context. Furthermore, we accepted well-formed phrases of general language in the manual evaluations even though terminology extraction is usually not aimed at such phrases.

All the evaluations above concerned the English phrases. The approach as described above gains from the alignment to Swedish correspondences because of a different system of compounding in Swedish as compared to English. Concatenated compounds in Swedish often correspond to non-compositional equivalents in English. Many links between compositional compounds and their non-compositional equivalents can be established by word alignment. However, we expect a gain in precision for Swedish as well, even though we assume that the loss in recall will be larger than for the English part.

6. Conclusions

The basic assumption made in section 3 was substantiated as regards the extraction of English phrases from the Swedish/English test corpus. The quality of the phrase lists could be improved significantly by applying automatic word alignment approaches. Even though the number of extracted phrases was reduced drastically, the loss in recall seems to be rather small. The precision of the phrasal terms that could be extracted was improved significantly. Word alignment can contribute to filtering out invalid phrase constructions, as shown in this paper.

Our results were evaluated in three ways: we applied automatic evaluation by comparing extracted phrases to previously defined reference data, we evaluated our results against typical syntactic term patterns, and, finally, we evaluated samples of extracted phrase candidates manually. The results of our evaluations as regards the extraction of English phrases confirm our assumption: the precision of the extracted phrasal terms was improved significantly with a reasonable loss in recall. However, it has to be noted that the success of our approach depends on the language pair under consideration and on the performance of the word aligner as regards phrases. Nevertheless, this investigation clearly demonstrates the possibility of improving knowledge-lite phrase extraction by consulting parallel sources.

Footnotes

1 This article is based on [Tiedemann 2000] from the Workshop on Terminology Resources and Computation, held in conjunction with LREC 2000, Athens.

2 This sentence is taken from the Declarations from the Swedish Government corpus, 1988.

3 The pattern is specified as a regular expression including the quantifiers '*' (0 or more), '+' (1 or more), and '?' (0 or 1). Elements separated by '|' define alternatives.

Bibliography

Lars Ahrenberg, Magnus Merkel, Anna Sågvall Hein, and Jörg Tiedemann. 2000. Evaluation of Word Alignment Systems. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, LREC-2000, Athens, Greece. European Language Resources Association.
Antti Arppe. 1995. Term extraction from unrestricted text. In Proceedings of NODALIDA, Helsinki, Finland. NODALI.
Nuala A. Bennett, Qin He, Kevin Powell, and Bruce R. Schatz. 1999. Extracting noun phrases for all of medline. In Proceedings of the American Medical Informatics Association AMIA, pages 671-675, Washington, DC, November. AMIA.
Berith Brännström and Bengt Dahlqvist. 1994. Multerm - Projektrapport. Technical report, Department of Linguistics, Uppsala University, Sweden.
Kenneth W. Church and Patrick Hanks. 1989. Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th ACL, pages 76-83.
Ido Dagan and Kenneth W. Church. 1994. Termight: Identifying and Translating Technical Terminology. In Proceedings of the 4th Conference on Applied Natural Language Processing, Stuttgart, Germany. Association for Computational Linguistics.
Ido Dagan and Fernando Pereira. 1994. Similarity-Based Estimation of Word Cooccurrence Probabilities. In Proceedings of the 32nd Annual Meeting of the ACL, New Mexico State University. Association for Computational Linguistics.
Béatrice Daille, Eric Gaussier, and Jean-Marc Lange. 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of COLING 94, pages 515-521.
Ted Dunning. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74.
The Lexicon Interest Group EAGLES. 1999. Preliminary Recommendations on Lexical Semantic Encoding. Technical Report. EAGLES.
Pim van der Eijk. 1993. Automating the Acquisition of Bilingual Terminology. In Proceedings of the 6th Conference of the European Chapter of the ACL, Utrecht, The Netherlands. Association for Computational Linguistics.
John S. Justeson and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9-27.
Kyo Kageura and Bin Umino. 1996. Methods of Automatic Term Recognition. A Review. Terminology, 3(2):259-289.
I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, Providence.
Magnus Merkel and Mikael Andersson. 2000. Knowledge-lite extraction of multi-word units with language filters and entropy thresholds. In Proceedings from RIAO, Paris, France.
Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In Proceedings of the Canadian Conference on Artificial Intelligence.
David Quinn. 1997. Terminology for Machine Translation: a Study. Machine Translation Review, No. 6, October 1997:9-21.
Anna Sågvall Hein and Ingrid Almqvist. 2000. A Language Checker of Controlled Language and its Integration in a Documentation and Translation Workflow. In Proceedings of the 22nd Conference on Translating and the Computer 22, Athens, Greece, November. Association for Information Management.
Anna Sågvall Hein. 1999. The PLUG Project: Parallel corpora in Linköping, Uppsala, and Göteborg: Aims and achievements. In Lars Borin, editor, Proceedings of the Symposium on Parallel Corpora, to appear, Department of Linguistics, Uppsala University, Sweden.
Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, September.
Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics, 22(1).
Frank Smadja. 1993. Retrieving Collocations from Text: XTRACT. Computational Linguistics.
Jörg Tiedemann. 1998. Extraction of Translation Equivalents from Parallel Corpora. In Proceedings of the 11th Nordic Conference on Computational Linguistics NODALI98, Center for Sprogteknologi and Department of General and Applied Linguistics, University of Copenhagen.
Jörg Tiedemann. 1999a. Automatic Construction of Weighted String Similarity Measures. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC), Department of Linguistics, Uppsala University, Sweden.
Jörg Tiedemann. 1999b. Word Alignment - Step by Step. In Proceedings of the 12th Nordic Conference on Computational Linguistics NODALIDA99, University of Trondheim, Norway.
Jörg Tiedemann. 2000. Extracting Phrasal Terms using Bitexts. In Proceedings of the Workshop on Terminology Resources and Computation, in connection with LREC-2000, pages 57-63, Athens, Greece. European Language Resources Association.


Appendix A. Example alignments of phrasal terms

Swedish                   English
11-motor                  11-series engine
ABS informationslampa     ABS information lamp
ABS-reglerventil          ABS control valve
ABS-varningslampa         ABS warning lamp
ABS-varningslampan        the ABS warning lamp
Användarinstruktion       User instructions
Arbetsbeskrivning         Job description
Arbetsbeskrivning         Work Description
Arbetstryck               Operating pressure
Arbetstryck               Working pressure
avgasbroms                exhaust brake
avgasledning              exhaust pipe
avgasrör kpl              exhaust pipe complete
avgasläckor               exhaust leakage
avgassystem               exhaust systems
avgasutsläppen            exhaust emissions
Batterifrånskiljare       Battery disconnector
Batterifrånskiljare       Battery isolator
Batterifrånskiljare       Battery master switch
Bosch P4 ABS              Bosch P4 ABS
Bromsljuskontakt          Brake lamp switch
Bromsljuskontakt          Brake lights switch
blinkande punkt           a flashing dot
Låt motorn gå             Run the engine
Magnetskiva               Magnetic disc
Magnetspolar              Magnetic coils
Magnetstativ              Magnetic foot
Magnetventil              Solenoid valve
Manuell nivåreglering     Manual level control
Matningsspänningen        The power supply
Matningsspänningen        The supply voltage
Motorbromsprogram         Engine brake program
Sätt dit växellådan       Fit the gearbox
Sätt dit växelstången     Attach the gearshift
Sätt ihop kontaktstycket  Assemble the connector
Sätt tillbaka fläkten     Refit the fan
Sätt upp avdragare        Fit puller
Sätt upp navet            Mount the hub
Sätt upp navet            Secure the hub
Ta fram felkoder          Accessing the fault
Ta fram felkoder          Reading fault codes
Ta försiktigt             Carefully remove
Ta hjälp                  Get help
Ta isär elkontakten       Dismantle the connector
Ta loss kardanaxeln       Detach the propeller
Tryck lätt                Lightly press