Identifying Contextual Information for Multi-Word Term Extraction

Diana Maynard and Sophia Ananiadou
Dept. of Computing & Mathematics, Manchester Metropolitan University, Manchester, M1 5GD, U.K.

{D.Maynard, [email protected]}

March 24, 1999

Abstract

Methods for multi-word term extraction have traditionally involved statistical techniques. More recently, hybrid techniques have been evolving which incorporate some linguistic knowledge. This information is generally very shallow, and researchers have tended to ignore any real understanding of either terms or the context in which they appear. We adopt an approach which uses a variety of knowledge sources - syntactic, semantic and statistical - and attempt to both enlighten and make use of the theoretical foundations of terminology in a practical application. We seek particularly to identify those parts of the context which are most relevant to the terms. We incorporate contextual weights, based on a new similarity measure, into a method for term recognition, and thereby improve the ranking of terms and enable disambiguation to be achieved.

1 Introduction

Techniques for term extraction are becoming increasingly important, due to the existence of large volumes of textual data, coupled with the constant dynamism of special languages. Applications such as machine translation, automatic indexing, information extraction and information retrieval all benefit from automatic term extraction. In view of the fact that terms can be highly ambiguous, they also benefit from techniques for disambiguation. Whilst the use of contextual information is clearly beneficial for multi-word term extraction, it is difficult to identify which parts of the context are most useful and what type of information is relevant. The context is generally considered simply as a "bag of words", and its structure and relationship to the term in question are largely ignored. In order to identify relevant parts of the context, we apply several measures, considering the terminological status of each context word, its frequency, and semantic and syntactic information. This information

can also be used to cluster the contexts according to their properties, so that we can identify similar contexts for use as term indicators or for other applications of practical terminology.

Methods for multi-word term extraction generally involve statistical and/or linguistic techniques, but the linguistic knowledge used is mainly in the form of simple syntactic information. Our approach makes use of semantic information from UMLS [11], a domain-specific thesaurus, and linguistic information gleaned from a medical corpus of eye pathology records. We improve on a measure called NC-Value developed by Frantzi [6], which produces a ranking of candidate terms. Instead of considering every occurrence of a termform as identical, we differentiate between instances of terms according to the concepts they represent, such that each meaning of an ambiguous term may be accorded a different ranking. By making use of contextual information in an appropriate way, and adding deeper forms of linguistic information, we can make better distinctions about terms, which has both theoretical and practical implications.

2 Background

Whilst many term recognition approaches are largely statistical, some comprehension of the nature and behaviour of terms is nevertheless important for any practical application, as well as for theoretical advances in the field of terminology. Our belief is that a deeper understanding of the conceptual aspect of terms is necessary [3], and to this extent we propose to make more use of linguistic, and in particular, semantic information.

Methods of automatic term recognition involving purely linguistic information have been fairly limited. They fall into two main categories: those which use extrinsic information and those which use intrinsic information relative to the term. The types of information used tend to be syntactic for the former, e.g. LEXTER [2], and syntactic or morphological for the latter, e.g. [1]. More recently, approaches to ATR have veered towards using both statistical and linguistic information [4], [7], [6], but most are heavily biased towards the statistical part. Shallow linguistic information is then incorporated in the form of a syntactic filter which only permits certain combinations of syntactic categories to be considered as candidate terms.

Little use has been made of semantic information for the purpose of term recognition, largely because, in contrast to morphological and syntactic information, it is hard both to identify and to manipulate. On the other hand, semantic information has been used fairly extensively for other purposes such as knowledge acquisition and word sense disambiguation. Because they do not deal specifically with terminology, these methods tend to incorporate general language dictionaries or thesauri. One common method is to compare the content of the dictionary definition of a word with the words in the surrounding context. Lesk [8] used an online dictionary to calculate the overlap between dictionary definitions of word senses; Smeaton [13] calculated word-word similarity between related words in a thesaurus, while Yarowsky

[15] captured the meanings of word senses from semantic categories in Roget's Thesaurus. The application of these methods to word sense disambiguation has been largely successful. There are, however, two main reasons why they are not entirely suitable for our work. Firstly, they are used for word sense rather than term sense disambiguation. Secondly, they are designed to deal with general rather than domain-specific language. Technical terms are not covered sufficiently in any general language dictionary or thesaurus, and we propose that they really require a more specialised source of information. To this extent, we use information from a domain-specific thesaurus to uncover semantic relations concerning candidate terms in the text. We propose a new similarity measure, based on methods more commonly found in machine translation [14] [17], to calculate semantic distance between terms and their context, as will be detailed in Section 3.2.

3 Extracting Contextual Information

It is well known among terminologists that linguistic contexts provide a rich and valuable source of information. The problem lies in identifying those parts of the context which are actually relevant. Although a domain-specific corpus is used, it is clear that not all parts of the context are equally useful. Traditionally, a KWIC index is used to find a window of words surrounding candidate terms, but these contexts then have to be manually investigated - a time-consuming and laborious process. Other approaches simply regard the context as a set of unordered, unstructured objects - a "bag of words". We attempt to make some distinction between those parts of the context which are most useful and those which are not.

Our approach is built on the NC-Value method for automatic term recognition [5], [6], which uses statistical and linguistic information from the context to identify terms. The linguistic information is in the form of a syntactic filter which restricts context words to nouns, verbs and adjectives. The context words are extracted and assigned a weight based on how frequently they appear with terms. Whilst restricting the context words to particular syntactic categories goes some way towards this aim, it is largely insufficient, since even after filtering has taken place, there are many distinctions that can be drawn. In particular, we consider the meaning of the context words and their relationship with the candidate term. We consider two main ideas:

1. A context word which is a term is more likely to be significant as a term indicator than one which is a non-term.

2. Of these context terms, we propose that those which have similar semantic properties to the candidate term are better indicators.

These are applied by assigning weights to candidate terms based on the properties of the context words associated with them. The weights are combined to form a context factor for each candidate term, described formally as follows:

$$CF(a) = \sum_{b \in C_a} f_a(b) \cdot weight(b) \;+\; \sum_{d \in T_a} f_a(d) \cdot sim(d, a) \qquad (1)$$

where $a$ is the candidate term, $C_a$ is the set of context words of $a$, $b$ is a word from $C_a$, $f_a(b)$ is the frequency of $b$ as a context word of $a$, $weight(b)$ is the weight of $b$ as a context word, $T_a$ is the set of context terms of $a$, $d$ is a word from $T_a$, $f_a(d)$ is the frequency of $d$ as a context term of $a$, and $sim(d, a)$ is the similarity weight of $d$ occurring with $a$.
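Read procedurally, equation (1) is two frequency-weighted sums over the observed context of a candidate term. A minimal sketch, assuming a flat list of observed occurrences and lookup functions for the two weights (the function names and data layout here are our own, not part of the NC-Value method):

```python
from collections import Counter

def context_factor(term, context_words, context_terms, weight, sim):
    """Context factor CF(a) of equation (1).

    context_words: list of C_a occurrences (words seen in windows around `term`)
    context_terms: list of T_a occurrences (context words that are themselves terms)
    weight, sim:   callables giving weight(b) and sim(d, term)
    """
    f_b = Counter(context_words)   # f_a(b): frequency of b as a context word of a
    f_d = Counter(context_terms)   # f_a(d): frequency of d as a context term of a
    word_part = sum(n * weight(b) for b, n in f_b.items())
    term_part = sum(n * sim(d, term) for d, n in f_d.items())
    return word_part + term_part
```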

3.1 Context Term Weight

Since we do not know in advance which context words are terms, the calculation of the context term (CT) weight can only be undertaken once we have a preliminary list of candidate terms. For this we use the top of the list of terms extracted by the NC-Value approach [6], since this should contain the "best" terms (or, at least, those candidate terms which behave in the most term-like fashion). A context term weight is assigned to each candidate term based on how many times it appears with a context term, as described formally below:

$$CT(a) = \sum_{d \in T_a} f_a(d) \qquad (2)$$

where $a$ is the candidate term, $T_a$ is the set of context terms of $a$, $d$ is a word from $T_a$, and $f_a(d)$ is the frequency of $d$ as a context term of $a$.
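With the same flat list-of-occurrences layout assumed above, equation (2) reduces to a simple count; a sketch (names are again ours):

```python
from collections import Counter

def ct_weight(context_term_occurrences):
    """CT(a) of equation (2): the summed frequencies f_a(d) over T_a.

    With a flat list of observed occurrences this is just the number of
    times the candidate term appears with any context term.
    """
    return sum(Counter(context_term_occurrences).values())

# e.g. ct_weight(["conjunctivitis", "lesions", "conjunctivitis"]) -> 3
```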

3.2 Similarity Weight

Unlike other context similarity weights, e.g. [16], [12], our weight is based not on statistical but on semantic properties of the context terms [9]. Similarity is calculated using a thesaurus hierarchy, similar to the approach of Sumita and Iida [14]. For this we use the UMLS Metathesaurus to assign semantic categories to each word in question. The similarity between a context term and a candidate term is then measured by calculating the distance between their semantic categories in the hierarchy provided by the UMLS Semantic Network, a section of which is depicted in Figure 1.

[TA] ENTITY
    [TA1] PHYSICAL OBJECT
        [TA11] ORGANISM
            [TA111] PLANT
                [TA1111] ALGA
            [TA112] FUNGUS
        [TA12] ANATOMICAL STRUCTURE
            [TA121] EMBRYONIC STRUCTURE
            [TA122] ANATOMICAL ABNORMALITY
    [TA2] CONCEPTUAL ENTITY

Figure 1: Fragment of the Semantic Network

The semantic distance is defined in terms of a positional weight (the vertical position of the Most Specific Common Abstraction of the two nodes) and a commonality weight (the number of shared common ancestors of the two nodes). Similarity between two nodes is calculated by dividing the commonality weight by the positional weight, to produce a figure between 0 and 1, 1 being the case where the two nodes are identical, and 0 being the case where there is no common ancestor (which would only occur if there were no unique root node in the hierarchy).

$$sim(w_1 \ldots w_n) = \frac{com(w_1 \ldots w_n)}{pos(w_1 \ldots w_n)} \qquad (3)$$

where $com(w_1 \ldots w_n)$ is the commonality weight of words $1 \ldots n$ and $pos(w_1 \ldots w_n)$ is the positional weight of words $1 \ldots n$.

As an example, the similarity between nodes TA111 and TA112 in the above network would be 0.8, calculated as follows:

com(TA111, TA112) = 4 * 2 = 8
pos(TA111, TA112) = 5 + 5 = 10
sim(TA111, TA112) = 8/10 = 0.8

It becomes clear that the further down in the hierarchy the two items are, the greater their similarity, which is intuitive since they are more specific: for example, there would be greater similarity between apple and banana than between fruit and vegetable. It is also intuitive that the greater the horizontal and/or vertical distance between words in the network, the less similar they are.
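The node codes in Figure 1 appear to spell out the path from the root, one character per level, so both weights can be read directly off the codes. A sketch under that assumption (our own reading, not stated explicitly in the paper), which reproduces the worked example above:

```python
def similarity(code1, code2, n_words=2):
    """Similarity of two Semantic Network nodes from their codes (e.g. 'TA111'),
    assuming each character of a code corresponds to one level of the hierarchy."""
    shared = 0                              # shared common ancestors of the two nodes
    for c1, c2 in zip(code1, code2):
        if c1 != c2:
            break
        shared += 1
    com = shared * n_words                  # commonality weight: com(w1..wn)
    pos = len(code1) + len(code2)           # positional weight:   pos(w1..wn)
    return com / pos if pos else 0.0

assert similarity("TA111", "TA112") == 0.8  # com = 4*2 = 8, pos = 5+5 = 10
```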

Term                  | SNC-Value | NC-Value | Similarity Weight | CT Weight
plane of section      | 1752.71   | 1107.11  | 27.5429           | 645.6
descemet's membrane   | 1671.6    | 1328.8   | 17.8571           | 342.8
optic nerve           | 1670.75   | 1652.75  | 162.114           | 230.2
basal cell carcinoma  | 1485.31   | 1257.39  | 108.407           | 216.6
fibrous tissue        | 1359.73   | 1121.73  | 158.15            | 238
anterior chamber      | 1322.34   | 1015.74  | 236.4             | 306.6
bowman's membrane     | 1154.09   | 863.693  | 10.4643           | 290.4
corneal diameters     | 947.777   | 874.577  | 4.88571           | 73.2
cell carcinoma        | 892.491   | 760.691  | 55.7857           | 131.8
malignant melanoma    | 715.904   | 589.389  | 57.19             | 117.8

Table 1: Weights for the 10 top-ranked terms

4 Results and Evaluation

The context factor is incorporated into the term extraction method by adding it to the C-Value weights (the statistical bottom layer of the NC-Value), to form a new set of weights we call the SNC-Value. These weights determine the overall ranking of the candidate terms. The SNC-Value measure is defined as follows:

$$SNCvalue(a) = 0.8 \cdot Cvalue(a) + 0.2 \cdot CF(a) \qquad (4)$$

where $a$ is the candidate term, $Cvalue(a)$ is the C-Value of the candidate term, and $CF(a)$ is the context factor of the candidate term.
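Equation (4) is a fixed linear blend of statistical and contextual evidence; a sketch (the dictionary layout is our own assumption):

```python
def snc_value(c_value, cf):
    """SNC-Value of equation (4): 0.8/0.2 combination of the statistical
    C-Value and the context factor for a candidate term."""
    return 0.8 * c_value + 0.2 * cf

def rank(candidates):
    """Rank candidates by SNC-Value; candidates: {term: (c_value, cf)}."""
    return sorted(candidates, key=lambda t: snc_value(*candidates[t]), reverse=True)
```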

4.1 Improving the Ranking of Candidate Terms

Applying the SNC-Value to our corpus of eye pathology records has resulted in improvements in the ranking, thereby drawing more of a distinction between terms and non-terms, with most valid terms receiving a higher score and thus moving further up the ranking. Table 1 shows the 10 top-ranked terms and their new ranking, along with their old NC-Value ranking and the two weights we have introduced - the Similarity and Context Term Weights.

4.2 Disambiguation of Terms

The similarity weight is also used to separate the meanings of ambiguous terms. If either the candidate term or the context term is ambiguous, it will usually have more than one semantic category, and each meaning of the term will thus receive a different similarity weight. This is described more fully in [10]. Table 2 shows some ambiguous terms found in the corpus, together with their associated weights.

Term                    | Position | Context Term   | Position | Similarity
allergic conjunctivitis | TB222    | conjunctivitis | TA22     | 0.2222
allergic conjunctivitis | TB222    | conjunctivitis | TB222    | 1
connective tissue       | TA123    | lying          | TB11     | 0.2222
connective tissue       | TA123    | lying          | TA125    | 0.8
epithelial cyst         | TA12     | lesions        | TB222    | 0.2
epithelial cyst         | TA12     | lesions        | TA214    | 0.4444
epithelial cyst         | TA12     | lesions        | TA22     | 0.5

Table 2: Similarity between some ambiguous context terms and their respective terms

For example, we find the term allergic conjunctivitis occurring with the context term conjunctivitis. This context term has two possible positions in the hierarchy, as shown in the table. One of these belongs to the same node as the term, producing a similarity of 1, whilst the other is in a completely different section of the hierarchy and has a much lower similarity. It turns out that the meaning of the context term which is more similar is the one which is most relevant and thus the one which should be taken into consideration.
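As a sketch of this selection step, reusing similarity() from the Section 3.2 sketch (the function name is ours): given the candidate positions of an ambiguous term and of its context term, keep the pairing with the highest similarity.

```python
def disambiguate(term_codes, context_codes):
    """Pick the (term sense, context sense) pair with the highest similarity,
    reusing similarity() from the Section 3.2 sketch."""
    return max(((t, c, similarity(t, c))
                for t in term_codes for c in context_codes),
               key=lambda pair: pair[2])

# 'allergic conjunctivitis' with ambiguous context term 'conjunctivitis':
print(disambiguate(["TB222"], ["TA22", "TB222"]))  # -> ('TB222', 'TB222', 1.0)
```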

5 Evaluation

5.1 The Context Term Weight

We have proposed the hypothesis that context words which are terms are useful indicators of other terms, because terms do not tend to appear in isolation but in groups. This was incorporated into our measure by considering the number of times a candidate term appears with a context term in the text. The more frequently they appear together, the more useful the context term is considered to be. To evaluate the effectiveness of the context term weight, we performed an experiment to compare the frequency of occurrence of each context term with terms and with non-terms. This was achieved by taking each context term individually and calculating the proportion of terms to non-terms each occurred with. The results depicted in Table 3 show that 90% of context terms appear more frequently with a term than with a non-term, and only 4% appear less frequently. Note that the figure in the second column indicates not the total number of co-occurrences, but the number of context terms occurring in the proportion indicated by the first column (more, less or equally frequently with terms and non-terms).
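A sketch of this experiment, assuming co-occurrences are available as a flat list of pairs and validity judgements as a set (both layouts, and the names, are our own):

```python
from collections import Counter

def cooccurrence_breakdown(pairs, valid_terms):
    """Classify each context term by whether it co-occurs more often with
    terms or with non-terms (cf. Table 3).

    pairs:       list of (context_term, candidate_term) co-occurrences
    valid_terms: set of candidates judged to be valid terms
    """
    with_term, with_nonterm = Counter(), Counter()
    for ctx, cand in pairs:
        (with_term if cand in valid_terms else with_nonterm)[ctx] += 1
    breakdown = Counter()
    for ctx in set(with_term) | set(with_nonterm):
        t, n = with_term[ctx], with_nonterm[ctx]
        breakdown["term" if t > n else ("non-term" if n > t else "equal")] += 1
    return breakdown    # number of context terms in each category
```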


Occurrence                   | No. of Context Terms | Percentage
more frequent with term      | 419                  | 90%
equally frequent with either | 28                   | 6%
more frequent with non-term  | 17                   | 4%

Table 3: Occurrences of context terms with terms and non-terms

           | Term | Non-Term
top set    | 76%  | 24%
middle set | 56%  | 44%
bottom set | 49%  | 51%

Table 4: Semantic weights of terms and non-terms

5.2 Evaluation of the Semantic Weight

One of the problems with our similarity calculation is that it relies on a pre-existing thesaurus, which by its nature is likely to have discrepancies and omissions. Methods relying solely on corpus data also have their drawbacks, however, particularly when dealing with relatively small amounts of data which can lead to skewed statistics. Bearing in mind its innate inadequacies, we can nevertheless evaluate the expected theoretical performance of the measure by assuming completeness and concerning ourselves only with what is actually covered.

The semantic weight is based on the premise that the more similar a context term is to the candidate term it occurs with, the better an indicator it will be. A higher total semantic weight for a candidate term will result in a higher term ranking and a better chance of validity. To test the performance of the semantic weight, the terms were sorted in descending order of their semantic weights and the list divided into three, with the top third containing the highest semantic weights and the bottom third containing the lowest. The proportion of valid and non-valid terms was then calculated for each section of the list. The decision as to whether a term was valid or not was made by manual assessment, according to two domain experts. Since this judgement is naturally subjective, we do not assume it has perfect correctness, but simply use it to provide a baseline against which to measure performance of our system and to test improvements as they are made.

The results depicted in Table 4 reveal that most of the valid terms are contained in the top third of the list, and the fewest valid terms are contained in the bottom third. Also, in the top of the list there are more terms than non-terms, whereas in the bottom of the list there are more non-terms than terms. The evaluation demonstrates that terms with high semantic weights tend to occur at the top of the list, and that terms at the top of the list tend to have high semantic weights, thereby indicating that the semantic weights are a good indication of term validity.
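A sketch of the thirds-based evaluation, under the same assumed data layouts as above (hypothetical names; the expert judgements are taken as a given set):

```python
def validity_by_third(semantic_weight, valid_terms):
    """Split candidates into thirds by descending semantic weight and return
    the proportion of valid terms in each third (cf. Table 4).

    semantic_weight: {candidate: total semantic weight}
    valid_terms:     set of candidates judged valid by the domain experts
    """
    ranked = sorted(semantic_weight, key=semantic_weight.get, reverse=True)
    k = len(ranked) // 3
    thirds = [ranked[:k], ranked[k:2 * k], ranked[2 * k:]]
    return [sum(t in valid_terms for t in third) / len(third)
            for third in thirds if third]
```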

6 Conclusion

This paper has focused on the use of a variety of sources of information to improve multi-word term extraction, and in particular on identifying the most relevant parts of the context. Making use of contextual information is not sufficient unless we have some means of distinguishing the relative contribution that context words can make. Our aim is for a deeper comprehension of the nature of terms and their relationship with the concepts they represent, which should contribute not only towards methods of automatic extraction and disambiguation, but also towards theoretical understanding of terminology.

By incorporating the contextual weights into our measure, we have been able both to improve the ranking of candidate terms from the NC-Value, and to distinguish between cases of term ambiguity. This means that there is more of a distinction drawn between terms and non-terms, with most valid terms receiving a higher score and thus moving further up the ranking. For the first time, occurrences of terms are considered individually according to their meaning and context, so that polysemous terms may be given different rankings according to their meaning. This is of crucial importance to many applications of terminology such as information retrieval and machine translation, where distinguishing between different meanings of terms can be vital.

References

[1] S. Ananiadou. Towards a methodology for automatic term recognition. PhD thesis, University of Manchester, UK, 1988.

[2] D. Bourigault. Surface grammatical analysis for the extraction of terminological noun phrases. In Proc. of COLING, pages 977-981, 1992.

[3] L. Bowker. A multidimensional approach to classification in Terminology: Working with a computational framework. PhD thesis, University of Manchester, England, 1995.

[4] B. Daille, E. Gaussier, and J.M. Lange. Towards automatic extraction of monolingual and bilingual terminology. In Proc. of COLING 94, pages 515-521, 1994.

[5] K.T. Frantzi. Incorporating context information for the extraction of terms. In Proceedings of ACL/EACL '97, pages 501-503, Madrid, Spain, July 1997.

[6] K.T. Frantzi. Automatic Recognition of Multi-Word Terms. PhD thesis, Manchester Metropolitan University, England, 1998.

[7] J.S. Justeson and S.M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9-27, 1995.

[8] M. Lesk. Automatic sense disambiguation: how to tell a pine cone from an ice cream cone. In Proc. of the SIGDOC Conference, pages 24-26, 1986.

[9] D.G. Maynard and S. Ananiadou. Acquiring contextual information for term disambiguation. In Proc. of 1st Workshop on Computational Terminology, Computerm '98, Montreal, Canada, 1998.

[10] D.G. Maynard and S. Ananiadou. Term sense disambiguation using a domain-specific thesaurus. In Proc. of 1st International Conference on Language Resources and Evaluation (LREC), Granada, Spain, 1998.

[11] National Library of Medicine, U.S. Dept. of Health and Human Services. UMLS Knowledge Sources, 8th edition, January 1997.

[12] H. Schütze. Dimensions of meaning. In Proc. of Supercomputing '92, 1992.

[13] A. Smeaton and I. Quigley. Experiments on using semantic distances between words in image caption retrieval. In Proc. of 19th International Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 1996.

[14] E. Sumita and H. Iida. Experiments and prospects of example-based machine translation. In Proc. of 29th Annual Meeting of the Association for Computational Linguistics, pages 185-192, Berkeley, California, 1991.

[15] D. Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proc. of 14th International Conference on Computational Linguistics, pages 454-460, 1992.

[16] Chengxiang Zhai. Exploiting context to identify lexical atoms - a statistical view of linguistic context. In Proc. of International & Interdisciplinary Conference on Modeling and Using Context (CONTEXT-97), pages 119-129, Rio de Janeiro, Brazil, 1997.

[17] Gang Zhao. Analogical Translator: Experience-Guided Transfer in Machine Translation. PhD thesis, Dept. of Language Engineering, UMIST, Manchester, England, 1996.

