Automatic Acquisition and Expansion of Hypernym Links - CiteSeerX

Automatic Acquisition and Expansion of Hypernym Links Emmanuel Morin ([email protected]) LINA-CNRS 2 chemin de la Houssinière - BP 92208 44322 Nantes Cedex 3, France

Christian Jacquemin ([email protected]) LIMSI-CNRS BP 133 91403 Orsay Cedex, France Abstract. Recent developments in computational terminology call for the design of multiple and complementary tools for the acquisition, the structuring and the exploitation of terminological data. This paper proposes to bridge the gap between term acquisition and thesaurus construction by offering a framework for automatic structuring of multi-word candidate terms with the help of corpus-based links between single-word terms. Hypernym links acquired through an information extraction procedure are projected on multi-word terms through the recognition of semantic variations. The induced hierarchy is incomplete but provides an automatic generalization of singleword terms relations to multi-word terms that are pervasive in technical thesauri and corpora. Keywords: Corpus, hypernymy, pattern, terminology, thesaurus, semantic variation.

1. Introduction Recent developments in computational terminology call for the design of multiple and complementary tools for the acquisition, the structuring and the exploitation of terminological data. Terminology Acquisition On the one hand, some tools are developed for automatic acquisition of candidate terms from corpora (Bourigault, 1995; Daille, 1996; Justeson and Katz, 1995). Terminology Structuring On the other hand, contributions to automatic construction of thesauri provide classes or links between single words. Classes are produced by clustering techniques based on similar word contexts (Sch¨ utze, 1993) or similar distributional contexts (Grefenstette, 1994). Links result from automatic acquisition of relevant predicative or discursive patterns (Hearst, 1992; c 2004 Kluwer Academic Publishers. Printed in the Netherlands.

morin04.tex; 13/04/2004; 20:51; p.1

2

Emmanuel Morin and Christian Jacquemin

Basili et al., 1993; Riloff, 1993). Predicative patterns yield predicative relations such as cause or effect whereas discursive patterns yield non-predicative relations such as hypernym, meronym or synonymy links. Terminology Exploitation Lastly, some tools exploit such automatically collected data for the purpose of automatic indexing (Jacquemin and Tzoukermann, 1999) or information extraction (Riloff, 1993). The main contribution of this article is to bridge the gap between term acquisition and thesaurus construction by offering a framework for automatic structuring of multi-word candidate terms with the help of corpus-based links between single-word terms. First, we present a system for corpus-based acquisition of terminological relationships through discursive patterns. This system is built on previous work on automatic extraction of hyponymy links through shallow parsing (Hearst, 1992; Hearst, 1998). In addition to this previous study, the system is CJ:complemented with a classifier for the purpose of discovering new lexico-syntactic patterns through corpus exploration. Second, we show how such semantic links between single-word terms can be extended to semantic links between multi-word terms through corpus-based extraction of semantic variants.1 1.1. Method and Architecture In our approach, layered hierarchies of multi-word terms are constructed through corpus-based conflation of semantic variants. 2 Variant conflation is performed by rules that take as input semantic relationships between single-word terms and lists of multi-word terms. They produce as output semantic relationships between multi-word terms. For instance, from the corpus-based link between orge (barley) and céréale (cereal) and from the term orge de brasserie (brewers barley), the textual sequence céréale de brasserie (brewers cereal) is recognized as a variant of the initial term through automatic variant recognition. Thus the initial link between the two single-word terms is extended to a link between these two multi-word terms. The overall architecture of the application is organized as follows: A tool for terminology structuring (the Prométhée system of Morin 1

This article is an extended version of (Morin and Jacquemin, 1999), a paper published at the 37th Annual Meeting of the Association for Computational Linguistics (ACL’99.) 2 Even though we focus here on generic/specific relations, the method would apply similarly to any other type of semantic relations.

morin04.tex; 13/04/2004; 20:51; p.2

Automatic Acquisition and Expansion of Hypernym Links

3

Termer

ACABIT

Multi-word candidate terms Term normalizer

Corpus

FASTR

Information extractor

PROMÉTHÉE

Single-word term hierarchies

Hierarchies of multi-word terms

Figure 1. Overview of the system for hierarchy projection

(1999a) for the acquisition of semantic links) is associated with a tool for terminology acquisition (the ACABIT system of Daille (1996) for the acquisition of multi-word candidate terms) and a tool for terminology exploitation (the FASTR system of Jacquemin (1996) for variant recognition). The system builds an efficient term structuring tool from three existing tools (see Figure 1). The use of ACABIT is intended to restrain the search space of variant recognition to the set of corpus-based candidate terms produced by this termer. The output of the system is a corpus-based multi-word thesaurus. It results from an extension of semantic relationships between single-word terms to multi-word terms. 1.2. Overview The remainder of this article is organized as follows. In Section 2, the method for corpus-based acquisition of semantic links is presented and evaluated for acquisition of lexico-syntactic patterns associated with the hypernym relationships in corpora. In Section 3, another evaluation is described, which CJ:measured the quality of the lexico-syntactic patterns relative to the merge and the produce relationships extracted by Prométhée. In Section 4, the tool for semantic term normalization is described and an evaluation of term variant conflation is presented. Section 5 describes the application of this technique to semantic link projection. Section 6 analyzes the results on an agricultural corpus and evaluates the quality of the induced semantic links. Section 7 discusses works related to this study, and section 8 presents our conclusions.

2. Acquisition of Semantic Relationships through Lexico-syntactic Patterns 2.1. Background A number of techniques have been developed for automatically extracting semantic information from text corpora. Usually these techniques

morin04.tex; 13/04/2004; 20:51; p.3

4


use bottom-up or top-down methods to extract relationships between terms (Condamines and Rebeyrolle, 2001). 2.1.1. Bottom-up Method Bottom-up methods extract information from texts without an a priori knowledge about data to be extracted. Classes are produced by clustering techniques based on similar word contexts — which describe words that are likely to be found in the immediate vicinity of a given word (Church and Hanks, 1990; Smadja, 1993) — or on similar distributional contexts — which reveal words that share the same syntactic environments (Hindle, 1990; Grefenstette, 1994). Thus Ruge (1991) extracts words with similar meanings by using modifier-head relations in noun phrases from a large corpus. In the same way, Hindle (1990) examines nouns which are subjects and objects of the same verb by combining syntactic analysis and statistical measures. With this technique, Hindle (1990) determines, for example, that the words most similar to boat are ship, plane, bus, jet, vessel, truck, car, helicopter, ferry and man. Grefenstette (1994) has developed a system called SEXTANT for extracting classes of semantically close words. In a first step, SEXTANT extracts syntactic dependencies: adjectivesnouns, nouns-nouns, subjects-verbs and verbs-objects (these relations provide context for each term in a corpus). SEXTANT then compares the contexts of each term and determines what words are most similar to a given word. These techniques require no hand-coded domain knowledge, and are more effective than co-occurrence approaches (Church and Hanks, 1990). Grefenstette’s approach has two main drawbacks: on the one hand, a word and its antonym can be included in the same class, and on the other hand, semantic relations are not labeled. Generally speaking, bottom-up methods are robust in extracting classes between single words but have some disadvantages: 1. Clusters obtained with such techniques are not a priori significant. 2. Clusters contain heterogeneous linguistic entities. 3. Conceptual similarity is a “neutral” link which does not yield semantic predicates. 2.1.2. Top-down Method Most top-down techniques for the acquisition of semantic relationships rely on hand-coded rules about the data to be extracted. For instance, Basili et al. (1993) used word associations, syntactic markers and hand-coded semantic tags. These features are mixed with selection

morin04.tex; 13/04/2004; 20:51; p.4


5

restrictions and conceptual relationships for the automatic acquisition of a case-based semantic lexicon. Similarly, Grishman and Sterling (1992) acquire selectional patterns in which functional relations are syntactic and not semantic. In contrast, Hearst (1992) reports a method dedicated to the acquisition of semantic relationships signaled by lexico-syntactic patterns. Here, lexico-syntactic patterns are used to extract lexical relations between words from unrestricted text. For instance, the pattern NP, especially {NP,}* {or|and} NP (where NP is a noun phrase), and the sentence (...) most European countries, especially France, England and Spain yield three lexical relations: (1) HYPONYM (France, European country), (2) HYPONYM (England, European country), and (3) HYPONYM (Spain, European country). These relations can then be included in a hierarchical thesaurus. Here, only a single instance of a lexico-syntactic pattern needs to be encountered to extract the corresponding conceptual relation. Links between words that result from automatic acquisition of relevant predicative or discursive patterns (Hearst, 1992; Basili et al., 1993; Riloff, 1993) are fine and accurate, but the acquisition of these patterns is a tedious task that requires substantial manual work. For our work, we have developed a system which extracts and uses lexico-syntactic patterns to acquire semantic relations between terms. 2.2. Overview of the Prométhée system We first present the Prométhée system for corpus-based information extraction of semantic relationships between terms through lexico-syntactic patterns. As illustrated by Figure 2, the Prométhée system has two functionalities: 1. The corpus-based acquisition of lexico-syntactic patterns with respect to a specific conceptual relation. 2. The extraction of pairs of conceptually related terms through a database of lexico-syntactic patterns. These functionalities are implemented in three main modules: Lexical Preprocessor The lexical preprocessor receives as input raw texts. First the text is tokenized (recognition of words and sentences boundaries), tagged and lemmatized. Noun phrases, acronyms and sequences of noun phrases are detected through regular expressions. The output of the lexical preprocessor is an enriched text with sgml tags.

morin04.tex; 13/04/2004; 20:51; p.5

6

Emmanuel Morin and Christian Jacquemin Bootstrap: initial pairs of terms

Corpus

Lexical Preprocessor

Shallow parser + Classifier

Lexico-syntactic patterns

Lemmatized and tagged corpus Information Extractor

Database of lexico-syntactic patterns

Partial hierarchies of terms

Figure 2. The Prométhée System

Shallow Parser and Classifier This module extracts lexico-syntactic patterns relative to a semantic relationships. This phase is inspired by the works of Hearst (1992; 1998) and implemented by a shallow parser associated with a classifier. Information Extractor The information extractor acquires pairs of conceptually related terms by using a database of lexico-syntactic patterns. This database can be the output of the shallow parser and classifier, or manually specified. 2.3. Lexico-syntactic Pattern Discovery Process The shallow parser is complemented with a classifier for the purpose of discovering new patterns through corpus exploration. This procedure is inspired by Hearst (1992; 1998) and consists of the following 7 steps: 3 1. Select manually a representative conceptual relation, e.g. the hypernym relation. 2. Collect a list of pairs of terms linked by the previous relation. This list of pairs of terms can be extracted from a thesaurus, a knowledge base or manually specified. For instance, the hypernym relation “neocortex is-a-kind-of vulnerable area” is used. 3. Find sentences in which conceptually related lemmatized terms occur. These sentences are lemmatized, and noun phrases are identified. They are represented as lexico-syntactic expressions. For instance, the previous relation HYPERNYM (vulnerable area,neocortex ) is used to extract the sentence: Neuronal damage was found 3 For expository purposes, some examples are taken from [MEDIC], a 1.56-million word English corpus of scientific abstracts in the medical domain, CJ:but evaluations are undertaken on the agricultural corpus [AGRO-ALIM].

morin04.tex; 13/04/2004; 20:51; p.6

7


in the selectively vulnerable areas such as neocortex, striatum, hippocampus and thalamus from the [MEDIC] corpus. The sentence is then transformed into the following lexico-syntactic expression: 4 NP find in NP such as LIST

(1)

4. Find a common environment that generalizes the lexico-syntactic expressions extracted at the third step. This environment is calculated with the help of a function of similarity and a procedure of generalization that produce candidate lexico-syntactic patterns. For instance, from the previous expression, and at least one other similar one, the following candidate lexico-syntactic pattern is deduced: (2) NP such as LIST This step is detailed in the next section. 5. Validate candidate lexico-syntactic patterns by an expert. 6. Use these validated patterns to extract additional candidate pairs of terms. 7. Validate candidate pairs of terms by an expert, and go to step 3. At this level, two significant points make our technique different from Hearst’s (1992; 1998) methodology: 1. A common environment relative to a set of sentences is extracted automatically by the previous method and manually by Hearst. 2. The expert evaluation of candidate lexico-syntactic patterns or pairs of terms is absolutely necessary since all candidate patterns or pairs of terms do not denote the target relationship. This evaluation is not mentioned by Hearst. 2.3.1. Automatic Classification of Lexico-syntactic Patterns Let us detail the fourth step of the preceding algorithm in which lexicosyntactic patterns are automatically acquired through the clustering of similar lexico-syntactic expressions. As described in the third step above, expression (1) is acquired from the relation HYPERNYM (vulnerable area,neocortex ). Similarly, from the relation HYPERNYM (complication,infection), the sentence: Therapeutic complications such as infection, recurrence, and loss of 4

NP stands for a noun phrase, and LIST for a sequence of noun phrases.

morin04.tex; 13/04/2004; 20:51; p.7

8

Emmanuel Morin and Christian Jacquemin W in1 (A)

W in2 (A)

W in3 (A)

A = A 1 A2 · · · · · · Aj · · · Ak · · · · · · An

B = B 1 B 2 · · · B j 0 · · · · · · B k 0 · · · B n0 W in1 (B)

W in2 (B)

W in3 (B)

Figure 3. Comparison of two expressions

support of the articular surface have continued to plague the treatment of giant cell tumor is extracted through corpus exploration. A second lexico-syntactic expression is inferred: NP such as LIST continue to plague NP

(3)

The lexico-syntactic expressions (1) and (2) can be abstracted as: 5 A = A 1 A2 · · · Aj · · · Ak · · · An HY P ERN Y M (Ak , Aj ), k > j + 1

(4)

and B = B 1 B 2 · · · B j 0 · · · B k 0 · · · B n0 0

0

HY P ERN Y M (Bk0 , Bj 0 ), k > j + 1

(5)

Let Sim(A, B) be a function measuring the similarity of lexicosyntactic expressions A and B that relies on the following hypothesis: HYPOTHESIS 2.1 (Syntactic isomorphy). If two lexico-syntactic expressions A and B represent the same pattern then, the items A j and Bj 0 , and the items Ak and Bk0 have the same syntactic function in the sentence. Let W in1 (A) be the window built from the first through j-1 words, W in2 (A) be the window built from words ranking from j+1 th through 5

Ai is the ith item of the lexico-syntactic expression A, and n is the number of items in A. An item can be either a lemma, a punctuation mark, a symbol, or a tag (NP, LIST...). Since we extract relationships between noun phrases, the relation k > j + 1 states that there is at least one item between Aj and Ak (this condition is necessary but not sufficient).

morin04.tex; 13/04/2004; 20:51; p.8


9

k-1 th words, and W in3 (A) be the window built from k+1 th through nth words (see Figure 3). The similarity function is defined as follows: Sim(A, B) =

3 X

Sim(W ini (A), W ini (B))

i=1

with    W in1 (B) = B1 B2 · · · Bj 0 −1  W in1 (A) = A1 A2 · · · Aj−1 W in2 (A) = Aj+1 · · · Ak−1 and W in2 (B) = Bj 0 +1 · · · Bk0 −1   W in3 (A) = Ak+1 · · · An W in3 (B) = Bk0 +1 · · · Bn0

(6)

The function of similarity Sim(W in i (A), W ini (B)) between lexicosyntactic patterns is defined as a function of the longest common subsequence (LCS)6 . In order to compute the LCS of two strings X (for instance W in i (A)) and Y (for instance W ini (B)), we use the notation X[1 · · · k] to denote the prefix of length k in the string X[1 · · · m] and Y [1 · · · l] to denote the prefix of length l in the string Y [1 · · · n]. Now, for two strings X[1 · · · m] and Y [1 · · · n], let c[k, l] be the length of an LCS of the sequences X[1 · · · k] and Y [1 · · · l]. The c[k, l] value can be obtained from the following recursive formula: c[k, l] =

   0

c[k − 1, l − 1] + 1

  max(c[k, l − 1], c[k − 1, l])

if k = 0 or l = 0 if k, l > 0 and Xk = Yl if k, l > 0 and X k 6= Yl

(7)

By using this function of similarity, all lexico-syntactic expressions are compared two by two (CJ:p is the number of expressions). Thus, we defined a matrix of similarity M = (mij )p×p . From this matrix, similar expressions are clustered by identifying connected components. Then, each cluster is associated with a candidate pattern. This candidate pattern corresponds to CJ:the lexico-syntactic expression which has the smallest standard deviation with the other expressions in the cluster. 7 NP such as LIST

(8)

6

The function of similarity has been experimentally verified and compared with other well know measure of similarity such as Salton, Cosinus and Jaccard (Salton and McGill, 1983) in Morin (1999a). 7 For more information on the Prométhée system, in particular a complete description of the pattern generalization process, see Morin (1999b).

morin04.tex; 13/04/2004; 20:51; p.9

10


2.4. Evaluation of the Discovery Process The Prométhée system is now evaluated for the semantic acquisition of hypernym relationships from the French [AGRO-ALIM] corpus. In order to bootstrap the system, we CJ:define manually 40 pairs of hyper/hyponym terms in the agricultural domain (see Table I). Table I. Bootstrap of the hypernym relation Hypernyms

Hyponyms

fruits tropicaux (tropical fruits) cations (cations) arbres (trees) arbres (trees) céréales (cereals) céréales (cereals) fruits (fruits) fruits (fruits) fruits (fruits) fruits (fruits) légume (vegetables) légume (vegetables) sucre (sugar) sucre (sugar) huiles (oils) huiles (oils) ...

bananes (bananas) sodium (sodium) chênes (oaks) pin (pine) blé (wheats) orge (barley) orange (orange) kiwi (kiwifruits) fraise (strawberry) cerise (cherry) carotte (carrots) concombre (cucumber) saccharose (sucrose) glucose (glucose) huile de soja (soybean oil) huile de tournesol (sunflower oil) ...

From these 40 instances, the Prométhée system incrementally acquires the following eleven lexico-syntactic patterns: 1. {deux|trois...|2|3|4...} NP1 ( LIST2 ) (...) analyse foliaire de quatre espéces ligneuses (chêne, frêne, lierre et cornouiller) dans l’ensemble des sites étudiés. (...) foliar analysis of four woody species (oak, ash tree, ivy and dogwood) in all the sites studied.

2. {certain|quelque|de autre...} NP1 ( LIST2 ) Après cinq années de résultats sur les principales cultures vivrières (sorgho, ma¨ıs, mil), il apparaˆıt qu’il existe un grand nombre de combinaisons possibles. After five years of results for the principal food crops (sorghum, corn, millet), it appears that there is a great number of possible combinations.

morin04.tex; 13/04/2004; 20:51; p.10


11

3. {deux|trois...|2|3|4...} NP1 : LIST2 Comportement hydrique au cours de la saison sèche et place dans la succession de trois arbres guyanais : Trema micrantha, Goupia glabra et Eperua grandiflora. Hydrous behavior during the dry season and place in the succession of three Guyanese trees: Trema micrantha, Kabukalli and Eperua grandiflora.

4. {certain|quelque|de autre...} NP1 : LIST2 L’objet de cette thèse est la recherche de marqueurs moléculaires liés a ` des gènes de résistance aux principales maladies du pois : Fusariose, O¨ıdium, Anthracnose et Mosa¨ıque commune du pois. The purpose of this thesis is the identification of molecular markers related to genes of resitance to the major pea diseases: Fusarium wilt, Powdery mildew, Anthracnose, common mosaic pea.

5. {de autre}? NP1 tel que LIST2 Des cations tels que le sodium, le potassium, le calcium et le magnésium peuvent être dosés par des méthodes de routine. Cations such as sodium, potassium, calcium and magnesium can be measured by routine methods.

6. NP1 , particulièrement NP2 , En parallèle, on a étudié la densité des espèces de phlébotomes anthropophiles, particulièrement Lutzomyia trapidoi, dans le milieu domestique et les caféières adjacentes, ainsi qu’en sous-bois. In parallel, we have studied the density of anthropophilic sandfly species, particularly Lutzomyia trapidoi, in the domestic environment and the adjacent coffee plantations, as well as in the undergrowth

7. {de autre}? NP1 comme LIST2 Des polysaccharides comme l’amidon ou la cellulose sont également assimilés. Polysaccharides such as starch or cellulose are also assimilated.

8. NP1 tel LIST2 Les caractéristiques du site telles la pente, le sous-bois et la distance des usines ne sont pas apparues importantes dans la détermination des valeurs. Site characteristics such as the slope, the underwood and the distance between the factories did not appear significant in the determination of the values.

9. NP2 {et|ou} de autre NP1 Des réactions croisées (différentes suivant les techniques et les anticorps) ont été mises en évidence avec des protéines de la même famille, issues des deux blés ou d’autres céréales. Cross reactions (different according to the techniques and the antibodies) were highlighted with proteins of the same family, resulting from two wheat species or other cereals.

morin04.tex; 13/04/2004; 20:51; p.11

12


10. NP1 et notamment NP2 La fermentation alcoolique industrielle de sous-produits de sucrerie, se déroule en conditions non stériles, ce qui entraˆıne le développement de bactéries et notamment de germes lactiques. The industrial alcoholic fermentation of the by-products of sugar refining, proceeds in non-sterile conditions, which involve the development of bacteria and in particular of lactic germs.

11. chez le NP2 , NP1 , Chez les Phalaenopsis, Orchidées monopodiales, l’excision de méristèmes en vue d’une micropropagation détruit ou lèse gravement la plante souche. Among Phalaenopsis, monopodial orchids, the meristem excision for a micropropagation destroys or seriously injured the plant stock.

In table II, each pattern is associated with the number of term pairs (after validation by an expert), and corresponding values of Precision, Recall, and F-Measure (Rijsbergen van, 1979; Salton and McGill, 1983). These measures, initially defined for the evaluation of information retrieval procedures are adapted to the evaluation of sematic relationship acquisition as follows: Let us define: − Ntotal : the total number of hyper-/hyponym pairs of terms CJ:instantiated in the corpus. − Ncorrect : the number of correct hyper-/hyponym pairs of terms extracted by the Prométhée system. − Nincorrect : the number of incorrect hyper-/hyponym pairs of terms extracted by the Prométhée system. Precision is used to evaluate the ratio of incorrectly extracted relationships: Ncorrect P recision = Ncorrect + Nincorrect Recall is used to evaluate the ratio of correct relationships extracted by the system: Ncorrect Recall = Ntotal F-Measure is a score combining Precision and Recall: F -M easure =

2 × P recision × Recall P recision + Recall

morin04.tex; 13/04/2004; 20:51; p.12


13

The average Precision on the [AGRO-ALIM] corpus is 82% and average Recall 8 is 56%. Although the quality of the relations CJ:produced by Prométhée is high, some partial or spurious relations are extracted: − Sometimes the CJ:relationship is under-specified like in HYPERNYM (caractéristique,dureté ) (HYPERNYM (characteristic,hardness)). Here, the kind of characteristic is not known. This information, namely the grain, is mentioned in the previous sentence. This metonymy problem is often met with generic nouns such as élement (element), espèce (species), facteur (factor)... − CJ:More often the relationships depents too much on the context to be accepted as correct like in HYPERNYM (environnement idéal,Malaisie) (HYPERNYM (ideal environment,Malaysia)) or HYPERNYM (échantillon,Allemagne) (HYPERNYM (sample,Germany)). − Lastly, CJ:some hypernyms found in extracted relations are very close to meronymys, as seen in HYPERNYM (axe du corps,membre) (HYPERNYM (body axis,limb)). The low coverage of the hypernym relation can be explained by several phenomena: − Although the quality of part-of-speech tagging is CJ:globally high (94%), some words are mistagged. We often find part-of-speech ambiguity between adjective and past participle. Consequently, noun phrases and sequences of noun phrases are partially identified and some relationships are not extracted. For example, in the sentence (...) trois types de boues (boue thermique, boue floculée et boue aérobie prolongée) (lit. (...) three types of sludges (thermic sludge, flocculated sludge, extended aerobic sludge)) the word prolongée, which is an adjective, is tagged as past participle. Thus, the noun phrase identified is boue aérobie and the sequence of noun phrases identified is boue thermique, boue floculée et boue aérobie. Consequently, the lexico-syntactic pattern {deux|trois...|2|3|4...} NP 1 ( LIST2 ) cannot be instantiated in the previous sentence. Because of one CJ:single tagging error in the lexical preprocessor three hyponym relations are CJ:missed. − An additional difficulty is CJ:the detection of sequential noun phrases. These sequences involve complex syntactic structures such as 1) a sequence of noun phrases can be CJ:embedded in another 8

The Recall has been manually calculated from a subset of the corpus (between 10% and 100% of the corpus according to the pattern productivity).

morin04.tex; 13/04/2004; 20:51; p.13

14


sequence of noun phrases, 2) an apposition can be extracted as a sequence of noun phrases, 3) a coordination CJ:overlap another coordination in a sequence of noun phrases... Table II. Evaluation of Lexico-syntactic Patterns Pattern

{deux|trois...|2|3|4...} NP1 ( LIST2 ) {certain|quelque|de autre...} NP1 ( LIST2 ) {deux|trois...|2|3|4...} NP1 : LIST2 {certain|quelque|de autre...} NP1 : LIST2 {de autre}? NP1 tel que LIST2 NP1 , particulièrement NP2 , {de autre}? NP1 comme LIST2 NP1 tel LIST2 NP2 {et|ou} de autre NP1 NP1 et notamment NP2 chez le NP2 , NP1 , Total

# pairs of terms

P.

R.

F-M.

270 212 241 116 210 4 90 36 17 6 14

84% 87% 79% 84% 86% 100% 69% 90% 59% 70% 62%

56% 52% 51% 47% 70% 36% 64% 67% 65% 43% 66%

68% 65% 62% 60% 76% 53% 67% 76% 62% 53% 64%

1216

82%

56%

66%

3. Other Experiments In the previous experiment, we have CJ:evaluated the quality of the hyponymy relations acquired by the Prométhée system. Obviously, CJ:this specific case of semantic relationship is not really representative of more difficult problems in discovering semantic relations from discursive patterns such as specific semantic structure. From this viewpoint, we present another evaluation of the Prométhée system: CJ:the extraction of lexico-syntactic patterns relative to the merge and the produce relationships from a subset of the [REUTERS] corpus (a 0.9-million word English corpus made of 5,770 newswires). 3.1. Merge Relation A pair of terms belonging to the merge relation is of the form MERGE (CN1 ,CN2 ), where CN1 and CN2 are both Company-Name terms that participates in some CJ:merging event (CJ:merge in progress, actual CJ:merge...).

morin04.tex; 13/04/2004; 20:51; p.14


15

In order to bootstrap Prométhée, we have manually defined two lexico-syntactic patterns: 1 merger of CN1 with CN2 Dixons Group Plc said shareholders at a special meeting of Cyclops Corp approve the previously announced merger of Cyclops with Dixons.

2 merger of CN1 and CN2 Hoechst Celanese was formed Feb 27 by the merger of Celanese Corp and Ameri can Hoechst Corp.

All CJ:the instances of these patterns CJ:are then extracted from the corpus [REUTERS]. Prométhée incrementally CJ:learns additional patterns for the merge relation, namely: 3 CN1 said it complete * acquisition of CN2 Chubb Corp said it completed the previously announced acquisition of Sovereign Corp.

4 CN1 said its shareholder * CN2 approve * merger of the two company INTERCO Inc said its shareholders and shareholders of the Lane Co approved the merger of the two companies.

5 CN1 said its shareholder approve * merger with CN2 Fair Lanes Inc said its shareholders approved the previously announced merger with Maricorp Inc a unit of Northern Pacific Corp.

6 CN1 said it agree * to {acquire|buy|merge with} CN2 Datron Corp said it agreed to merge with GGFH Inc a Florida-based company formed by the four top officers of the company.

7 CN1 ’s {proposed}? acquisition of CN2 Fujitsu’s acquisition of Fairchild would have given the Japanese computer maker control of a comprehensive North American sales and distribution system and access to microprocessor technology an area where Fujitsu is weak analysts said.

Using these patterns, 101 pairs of CJ:conceptually related terms are extracted from the CJ:[REUTERS] corpus. 9 The precision of CJ:this extraction is high: 92%.

9 Here, several pairs of terms express the acquisition relation (i.e. the acquisition of a Company-Name term by another Company-Name term) which is close to the merge relation.

morin04.tex; 13/04/2004; 20:51; p.15

16


3.2. Produce Relation A pair of terms belonging to the produce relation is of the form PRODUCE (CN1 ,NP2 ), where CN1 is a Company-Name term and NP2 is a noun phrase describing a product. The semantic interpretation is that CN1 produces NP2 , but it can also mean that CN1 {distributes|sells|provides|supplies} NP2 . In order to bootstrap Prométhée, we have used the previous interpretation to extract initial pairs of terms. In contrast to the merge relation, all instances of the previous pattern are not sufficient or relevant for bootstrapping the pattern discovery process. The Prométhée system cannot be used to extract patterns relative to the CJ:produce relation. 3.3. Synthesis The experiments CJ:reported in this section show that the merge relation is well identified by discursive patterns whereas the produce relation is not identified by discursive patterns. In general, this relation is usually implicit i.e. it is not clearly represented by discursive patterns, except for the trivial pattern: CN 1 produces NP2 like in the sentence: Vismara primarily produces a variety of pork products. In fact, the most important pieces of news about company events are typically reported in the beginning of a news story, while more detailed facts are described later, sometimes with no reference to the name of the company (see (Riloff, 1993) for the same observation). In the [REUTERS] corpus, a story begins with important events (merger, acquisition, bankruptcy, etc.), and secondary information like products and people appears later, usually with company CJ:name anaphoras. An additional difficulty is that in some cases the name of the company includes the name of its product, like in National Computer System Inc expects fiscal year earnings to improve (...). This relation cannot be extracted by the Prométhée system. 4. Expansion of Hypernym Links The links extracted by the Prométhée system for the hypernym relation are limited to links used in the same sentence. Therefore, these links cover only a part of the hypernym links CJ:expressed in the corpus. In order to increase the number of relationships extracted from the corpus, this section presents a technique for incremental acquisition through corpus exploration. From the iterative algorithm described in section 2, we have extracted eleven lexico-syntactic patterns relative to the hypernym rela-

morin04.tex; 13/04/2004; 20:51; p.16


17

tion. These patterns are used by the information extractor to extract 1216 pairs of conceptual related terms. The hypernym links are thus distributed: 26.2% between two multiword terms, 23.5% between two single-word terms, and the remaining ones between a single-word term and a multi-word term. Our purpose is to design a technique for the expansion of links between single-word terms to links between multi-word terms. Given a link between fruit and apple, similar links between multi-word terms are extracted such as apple juice / fruit juice (apple N / fruit N) and apple juice / fruit nectar (apple N1 / fruit N2 with N1 semantically related to N2 ). We now turn to the description of semantic variations through which semantic links between single-word terms are extended to semantic links between multi-word terms. 4.1. The Three Types of Variations The extraction of semantic variations is performed by FASTR, a transformational parser that relies on a metagrammar and a feature-based representation of linguistic data. Terms and variations are represented in a two level description (Jacquemin, 1999): Syntagmatic The syntagmatic level represents variations by a source and a target structure. Paradigmatic The paradigmatic level denotes morphological and/or semantic relations between lexical items in the source and target structures of the variation. Before focusing on the extraction of semantic variants, we first present the three major linguistic families of term variations. According to Jacquemin (2001), variations can be classified into three major categories: syntactic, morpho-syntactic and semantic. The structures of variants depend on the syntactic structures of the original terms. In French, there are four main structures of binary terms: Adj Noun, Noun Adj, Noun Prep Noun and Noun Prep Det Noun. 4.1.1. Syntactic Variants The content words of the original term are found in the variant but the syntactic structure of the term is modified. CJ:In the French language, syntactic variants CJ:can be classified in three classes: Coordinations A coordination is the combination of two terms with a common head/argument word. For example, fruits secs ou frais (dried or fresh fruits) is a coordination variant of the term fruits frais (fresh fruits).

morin04.tex; 13/04/2004; 20:51; p.17

18


Modifications A modification is the insertion of a modifier without reference to another term. For example, résistance mécanique de la graine (lit. mechanized grain resistance) is a modification variant of résistance de la graine (lit. grain resistance). Synapsies A synapsy has a structure similar to the structure of the controlled term where some of the empty words 10 may be changed. For example, maturation du fruit (fruit maturation) is a synaptic variant of maturation des fruits (maturation of the fruits). 4.1.2. Morpho-syntactic Variants The content word of the original term or one of their morphologically related words are found in the variant. The syntactic structure of the term is also modified. Morpho-syntactic variants CJ:can be divided in three classes: Noun-Noun Variations graine de cotonnier (lit. seed of cotton plant) /graine de coton (cottonseed). Noun-Verb Variations production d’enzymes(production of enzymes) /produire des enzymes(lit. to produce enzymes). Noun-Adjective Variations production de fruit(lit. production of fruits)/production fruitière (lit. fruit production). 4.1.3. Semantic Variants Semantic relations (synonyms, hypernyms...) are found between words in the original term and words in the variant. Semantic variants CJ:belong to three classes: Pure Semantic Variants Pure semantic variations result from synaptic variants enriched with a link between head words or argument words. For example, farines de ma¨ıs (maize-flour) is a variant of farine de blé (wheat flour) with a semantic link between argument words; and potassium en solution (lit. potassium in solution) is a variant of sodium en solution (lit. sodium in solution) with a semantic link between head words. Syntactico-semantic Variants Syntactic transformations (modification and coordination) are enriched with additional semantic links between the head words or between the argument words. For instance, grains durs de ma¨ıs (lit. hard maize grains) is a variant of 10

Empty words (or functional words) are non-lexical words: mainly determiners, prepositions, or coordinating conjunctions.

morin04.tex; 13/04/2004; 20:51; p.18


19

grain de blé (lit. grain of wheat) through a modification variation with a semantic link between argument words. Fruits secs ou frais (dried or fresh fruits) is a variant of raisin frais (fresh grapes) through a coordination variation with a semantic link between head words. Morpho-syntactico-semantic Variants Morpho-syntactic transformations are doubled by adding a semantic link between the words that are morphologically related. For instance, sucre résiduel (lit. residual sugar) is deduced from résidu de glucose (lit. residue of glucose) through an Adjective-Noun transformation by adding a semantic link between head words. 4.2. Constraints on Semantic Variations A semantic variation is a linguistic occurrence of a term in which one of the content words of the original term is replaced by a semantically related word. However, the fact that two multi-word terms w 1 w2 and w10 w20 contain two semantically-related word pairs (w 1 ,w10 ) and (w2 ,w20 ) does not necessarily entail that w1 w2 and w10 w20 are semantically close. The three following requirements should be met: Syntactic isomorphy The correlated words must occupy similar syntactic positions: both must be head words or both must be arguments with similar thematic roles. For example, procédé d’élaboration (process of elaboration) is not a variant of élaboration d’une méthode (elaboration of a method) even though procédé (process) and méthode (method) are synonymous, because procédé is the head word of the first term while méthode is the argument in the second term. Semantic isomorphy The correlated words must have similar meanings in both terms. For example, analyse du rayonnement (analysis of the radiation) is not semantically related with analyse de l’influence (analysis of the influence) even though rayonnement and influence are semantically related. The loss of semantic relationship is due to the polysemy of rayonnement in French which means influence when it concerns a culture or a civilization and radiation in physic field. Since we are in a technical domain, rayonnement has its second meaning and is not related with influence. Holistic semantic relationship The third criterion verifies that the compound terms are synonymous. For example, the terms inspection des aliments (food inspection) and contrˆ ole alimentaire (food

morin04.tex; 13/04/2004; 20:51; p.19

20


control) are not actually semantically related even though the components are synonymous. The first one is related to the quality of food and the second one with respect to standards. The discrepancy is due to the fact that contrˆ ole and inspection have closer meanings than their hyponyms concerning food. The three preceding constraints can be translated into a general scheme representing two semantically-related multi-word terms: DEFINITION 4.1 (Semantic variants). Two multi-word terms w 1 w2 and w10 w20 are semantic variants of each other if the three following constraints are satisfied:11 1. Some type of semantic relation S holds between w 1 and w10 and/or between w2 and w20 (synonymy, hypernymy, etc.). The non semantically related words are either identical or morphologically related. 2. w1 and w10 are head words and w2 and w20 are arguments with similar thematic roles. 3. w1 w2 and w10 w20 share the same type S of semantic relation. The definition above is used to design a technique for acquiring links between multi-word terms through text mining. It relies on the hypothesis that if items 1 and 2 of Definition 4.1 are satisfied (the relation between single-word terms and the similar thematic roles) then the third property should be satisfied as well (the multi-word terms will be semantically related). This hypothesis is corroborated by the results obtained in section 6. 4.3. Corpus-based Semantic Normalization The preceding constraints on semantic variations and similar constraints on syntactic and morpho-syntactic variations are used to design the metagrammar that is used by FASTR for corpus-based term variant recognition. Tables III and IV describes the three types of variations for N1 Prep N2 terms; similar data exist for other term structures (Jacquemin and Tzoukermann, 1999). 12 The first three columns indicate the type of syntactic, morphological and semantic relations that 11 w1 w2 is an abbreviated notation for a phrase that contains the two content words w1 and w2 such that one of both is the head word and the other one an argument. For the sake of simplicity, only binary terms are considered, but our techniques would straightforwardly extend to n-ary terms with n ≥ 3. 12 The symbols for part of speech categories are N (Noun), A (Adjective), Art (Article), C (coordinating conjunction), Prep (Preposition), Punc (Punctuation), Adv (Adverb), NP = Art? A? N A? and PP = Prep NP.

morin04.tex; 13/04/2004; 20:51; p.20


21

hold between the content words of the original term and the corresponding variant. There are two types of syntactic relations: coordination (Coor) and addition of a modifier (Modif). There are as many types of morphological relations as pairs of syntactic categories of content words (adjective, noun, or verb). The second column shows the type of morphological link together with the position of the word(s) concerned by the link. For instance, N/V Head is a noun to verb morphological link on the head noun. The third column gives the position of the semantically related word(s). For instance, Head in the third column indicates a semantic relationship between the head noun in the original term and the head verb in the variant. For each candidate term w1 w2 produced by the termer, the set of its semantic variants satisfying the constraints of Definition 4.1 is extracted from a corpus. In other words, a semantic normalization of the corpus is performed. It relies on corpus-based semantic links between single-word terms and variation patterns defined as all the licensed combinations of morphological, syntactic and semantic links. 13 In order to illustrate the type of variation that this technique is likely to account for, let us detail how a variant extracted by the CJ:twelfth pattern in Table IV CJ:(in bold): N1 Prep N2 → M(N1 , N) Adv? Adj? Prep Art? Adj? S(N2 ) Through this pattern, a semantic variation is found between composition du fruit (fruit composition) and composés chimiques de la graine (chemical compounds of the seed). It relies on the morphological relation between the nouns composé (compound) and composition (composition). It is also based on the semantic relation (part/whole relation) between fruit (fruit) and graine (seed). The morphological link is noted N/N Head in the second column of the 12th line of Table IV, while the semantic link is noted Arg in the third column.

M composition du fruit → composées chimiques de la graine S In addition to the morphological and semantic relations, the categories of the words in the semantic variant composés N chimiques A 13

Identical words are considered as a special case of morphological relation for the purpose of concision.

morin04.tex; 13/04/2004; 20:51; p.21

22


Table III. Patterns of semantic variation for a term of structure N1 Prep N2 (1/2) Synt.

Morph.

Sem.

— — — — Coor

— — — — —

— Arg Head Head & Arg —

Coor

—

Arg

Coor

—

Head

Coor

—

Head & Arg

Modif

—

—

Modif

—

Arg

Modif

—

Head

Modif

—

Head & Arg

Modif

—

—

Modif

—

Arg

Modif

—

Head

Modif

—

Head & Arg

Pattern of semantic variation N1 (Prep? Art? ) N2 N1 (Prep? Art? ) S(N2 ) S(N1 ) (Prep? Art? ) N2 S(N1 ) (Prep? Art? ) S(N2 ) N1 (((Punc? C Adv? Prep? NP Prep)|(A C Adv? Prep)|(PP C Adv? Prep? )) Art? A? ) N2 N1 (((Punc? C Adv? Prep? NP Prep)|(A C Adv? Prep)|(PP C Adv? Prep? ))Art? A? ) S(N2 ) S(N1 ) (((Punc? C Adv? Prep? NP Prep)|(A C Adv? Prep)|(PP C Adv? Prep? ))Art? A? ) N2 S(N1 ) (((Punc? C Adv? Prep? NP Prep)|(A C Adv? Prep)|(PP C Adv? Prep? ))Art? A? ) S(N2 ) N1 ((A ((Punc A)? C? A)? |Adv|V|(A? PP)) Prep Art? Adv? A? ) N2 N1 ((A ((Punc A)? C? A)? |Adv|V|(A? PP)) Prep Art? Adv? A? ) S(N2 ) S(N1 ) ((A ((Punc A)? C? ? ? A) |Adv|V|(A PP)) Prep Art? Adv? A? ) N2 S(N1 ) ((A ((Punc A)? C? A)? |Adv|V|(A? PP)) Prep Art? Adv? A? ) S(N2 ) N1 ((A ((Punc A)? C? A)? |Adv|V|(A? PP))? Prep Art? Adv? A) N2 N1 ((A ((Punc A)? C? A)? |Adv|V|(A? PP))? Prep Art? Adv? A) S(N2 ) S(N1 ) ((A ((Punc A)? C? ? ? ? A) |Adv|V|(A PP)) Prep Art? ? Adv A) N2 S(N1 ) ((A ((Punc A)? C? ? ? ? A) |Adv|V|(A PP)) Prep Art? Adv? A) S(N2 )

morin04.tex; 13/04/2004; 20:51; p.22


23

Table IV. Patterns of semantic variation for a term of structure N1 Prep N2 (2/2) Synt.

Morph.

Sem.

—

N/V Head

—

—

N/V Head

Arg

—

N/V Head/Arg

—

—

N/V Head/Arg

Arg

—

N/V Arg/Head

—

—

N/V Arg/Head

Head

—

N/V Arg

—

—

N/V Arg

Head

—

N/V Arg

—

—

N/V Arg

Head

—

N/N Head

—

—

N/N Head

Arg

— — — — — — — — — — — —

N/N Arg N/N Arg N/N Arg/Head N/N Arg/Head N/A Head/Arg N/A Head/Arg N/A Head N/A Head N/A Arg N/A Arg N/A Arg/Head N/A Arg/Head

— Head — Head — Arg — Arg — Head — Head

Pattern of semantic variation M(N1 ,V) (Adv? (Prep? Art| Prep) A? ) N2 M(N1 ,V) (Adv? (Prep? Art| Prep) A? ) S(N2 ) N2 (A? (Prep Art? A? N (A (C A)? )? )? (C Art? Adv? A? N A? )? V? V? Adv?) M(N1 ,V) S(N2 ) (A? (Prep Art? A? N (A (C A)? )? )? (C Art? Adv? A? N A? )? V? V? Adv? ) M(N1 ,V) M(N2 ,V) ((Prep? Art|Prep Art? ) A? ) N1 M(N2 ,V) ((Prep? Art|Prep Art? ) A? ) S(N1 ) N1 ((Adv? A (C Adv? A)? )? V? Prep) M(N2 ,V) S(N1 ) ((Adv? A (C Adv? A)? )? V? Prep) M(N2 ,V) N1 (A? (V? |(Prep Art? (Adv? A)? N)? ) (Adv? A)? Adv? ) M(N2 ,V) S(N1 ) (A? (V? |(Prep Art? (Adv? A)? N)? ) (Adv? A)? Adv? ) M(N2 ,V) M(N1 ,N) (((Adv? A? )|V? ) ((C|Punc) Prep? Art? A? N (PP)? )? Prep Art? A? ) N 2 M(N1 ,N) (((Adv? A? )|V? ) ((C|Punc) Prep? Art? A? N (PP)? )? Prep Art? A? ) S(N2 ) N1 (Prep Art? A? ) M(N2 ,N) S(N1 ) (Prep Art? A? ) M(N2 ,N) M(N2 ,N) (Prep Art? A? ) N1 M(N2 ,N) (Prep Art? A? ) S(N1 ) N2 (V? ) M(N1 ,A) S(N2 ) (V? ) M(N1 ,A) M(N1 ,A) N2 M(N1 ,A) S(N2 ) N1 M(N2 ,A) S(N1 ) M(N2 ,A) M(N2 ,A) (Prep Art? A? ) N1 M(N2 ,A) (Prep Art? A? ) S(N1 )

morin04.tex; 13/04/2004; 20:51; p.23

24


de Prep la Art graine N satisfy the regular expression in bold: the categories that are realized are underlined. In the target pattern, M(N1 ,N) stands for a noun N (composé ) that belongs to the morphological family of the head noun N 1 (composition). S(N2 ) denotes a noun (graine) that is semantically related with the argument noun N2 of the source term (fruit). The metagrammar for French has been developed through a collaboration with a French documentation center for scientific information: INIST-CNRS (Jacquemin and Royauté, 1994). It has been evaluated through several applications that require term identification: mainly automatic indexing, term extraction, information retrieval, and question answering. The metagrammar for French contains 16 metarules for Adjective Noun terms, 22 for Noun Adjective terms, and 72 rules for Noun Preposition Noun terms. The metagrammar has also been developed for several other languages. Vivaldi Palatresi and Rodr´ıguez Hontoria (2002) have designed metagrammars for Catalan and Spanish. These grammars are relatively close to the French metagrammar because all these languages are Romance languages. More different are the rules for the English language, described and evaluated on several domains in (Jacquemin, 2001). This publication also provides a detailed presentation of the methodology used for tuning metarules based on corpus processing. Metarules were also developed for Japanese (Yoshikane et al., 1998), and for German in collaboration with Antje Schmidt-Wigger (IAI). The experience gained from developing metagrammars for various languages has shown us that the notion of variation and its transformational description can apply to a wide range of languages. Languages in the same family tend to share similar transformational patterns. The development from scratch of a basic set of metarules requires only a few weeks of human work. FASTR is even used for teaching information retrieval and automatic indexing to undergraduate students in Computer Science and CJ:Comutational Linguistics. Students can cope with the formalism and design their own set of metarules in a few hours. 4.4. Evaluation of Term Variants The projection of corpus-based links produces 1,143 variations. 14 After human inspection, 981 of these variants are judged as correct (see Table V). The production of syntactic and semantic variants is higher (495 and 584) whereas the production of morpho-syntactic variants is lower (64). 14

In this step pairs of terms manually rejected are not included in the expansion and projection phases.

morin04.tex; 13/04/2004; 20:51; p.24

25


Table V. Production of term variants Corpus-based links # Variants

# Correct Variants

Precision of Variant Extraction

56 158 281 495

54 139 272 465

92.9% 88.0% 96.8% 93.9%

0 0 3 1 17 32 11 64

0 0 3 0 14 22 7 46

0.0% 0.0% 100.0% 0.0% 82.4% 68.8% 63.6% 71.2%

Semantic Variants Synap + Sem link

340

293

86.2%

Syntactico-semantic Variants Coor + Sem link Modif + Sem link

49 132

37 98

75.5% 74.2%

28 0 1 0 4 14 16

25 0 1 0 2 7 7

89.3% 0.0% 100.0% 0.0% 50.0% 50.0% 43.8%

584

470

80.5%

1,143

981

85.8%

Coord Modif Synap Total: Syntactic Variants A to A A to Adv A to N A to V N to A N to N N to V Total: Morpho-syntactic Variants

Morpho-syntactico-semantic Variants A to A + Sem link A to Adv + Sem link A to N + Sem link A to V + Sem link N to A + Sem link N to N + Sem link N to V + Sem link Total: Semantic Variants Total: Variants

The precision of these variations shows two levels of quality. On the one hand, syntactic and “pure” semantic variations are extracted

morin04.tex; 13/04/2004; 20:51; p.25

26


with a high level of precision (93.9% and 86.2%). On the other hand, morpho-syntactic variations and semantic variations involving a syntactic and/or morphological link have a less significant precision(71.2% and 73.8%). The lower precision of morpho-syntactic variations generally corresponds to a semantic discrepancy of meaning between both derivationally related words. For example, séchage de ma¨ıs (drying of maize) is not a correct variant of séchoir a ` ma¨ıs (maize drying shed) through a Noun-to-Noun variation. Although both séchage (drying) and séchoir (drying shed) are built of the verb root sécher (to dry), they have two different meanings. In séchage de ma¨ıs, the meaning is related to a agricultural processing while, in séchoir a ` ma¨ıs the meaning is related to a agricultural equipment. Consequently, the lower precision of semantic variations involving a syntactic and/or morphological link is due a cumulative effect of semantic shift through combined variations. For example, séchage de riz (rice drying) is incorrectly extracted as a variant of séchoir a ` ma¨ıs (maize drying shed) through a Noun-toNoun variation with a semantic link between argument words. However, séchage de riz (rice drying) is a correct variant of séchage de ma¨ıs (maize drying). Since syntactic and pure semantic variations are almost three times more numerous than hybrid variations, the average precision of term conflation is high: 85.8%. Because of such high precision rates, semantic variants can be used fruitfully for projecting semantic relationships between single words to semantic relationships between multi-word terms as shown in the next CJ:section.

5. Projection of a Single Hierarchy on Multi-word Terms CJ:The extraction of semantic variants by FASTR relies on a list of semantic relationships between single words. Depending on the semantic data, two modes of semantic relationships CJ:between single words are considered: a link mode in which each semantic relation between two words is expressed separately, and a class mode in which semantically related words are grouped into classes. The first mode corresponds to synonymy links in a dictionary or to generic/specific links in a thesaurus such as [AGROVOC]15 . The second mode corresponds to the synsets in WordNet (Fellbaum, 1998) or to the semantic data provided by the information extractor Prométhée used in this work. Each class 15 [AGROVOC] is a multilingual thesaurus for indexing and retrieving data in agricultural information systems managed by FAO (Food and Agriculture Organization of the United Nations).

morin04.tex; 13/04/2004; 20:51; p.26


27

is composed of hyponyms sharing a common hypernym—named cohyponyms—and all their common hypernyms. The list of classes is given in Table VI and two classes are detailed in Figure 4. Table VI. The twelve semantic classes acquired from the [AGRO-ALIM] corpus Classes

Hypernyms and co-hyponyms

arbres (trees)

arbre, bouleau, chêne, érable, hêtre, orme, peuplier, pin, poirier, pommier, sapin, épicéa élément, calcium, potassium, magnésium, manganése, sodium, arsenic, chrome, mercure, sélénium, étain, aluminium, fer, cadium, cuivre céréale, ma¨ıs, mil, sorgho, blé, orge, riz, avoine enzyme, aspartate, lipase, protéase fruit, banane, cerise, citron, figue, fraise, kiwi, noix, olive, orange, poire, pomme, pêche, raisin fruit, olive, Amellau, Chemlali, Chétoui, Lucques, Picholine, Sevillana, Sigoise fruit, pomme, Cartland, Délicious, Empire, McIntoch, Spartan légume, asperge, carotte, concombre, haricot, pois, tomate polyol, glycérol, sorbitol polysaccharide, amidon, cellulose, styrène, éthylbenzène protéine, chitinase, glucanase, thaumatin-like, fibronectine, glucanase sucre, lactose, maltose, raffinose, glucose, saccharose

éléments chimiques (chemical elements) céréales (cereals) enzymes (enzymes) fruits (fruits) olives (olives) pommes (apples) légumes (vegetables) polyols (polyols) polysaccharides (polysaccharides) protéines (proteins) sucres (sugars)

fruit (fruit)

orange (orange)

Cartland

Class ‘fruits’: fruit, orange, pomme

pomme (apple)

Délicious

McIntoch

Class ‘apples’: fruit, pomme, Cartland, Délicious, McIntoch

Figure 4. Two classes induced from corpus-based semantic links

morin04.tex; 13/04/2004; 20:51; p.27

28


5.1. Analysis of the Projection Through the projection of single-word hierarchies on multi-word terms, the semantic relation can be modified in two ways: Transfer The links between concepts represented by single-word terms (such as fruits) are transferred to multi-word terms in another conceptual domain (such as juices) located at a different place in the taxonomy. Thus the link between fruit and apple is transferred to a link between fruit juice and apple juice, two hyponyms of juice. This modification results from a semantic normalization of argument words. Specialization The links between concepts represented by single-word terms (such as fruits) are specialized into parallel relations between multi-word terms associated with more specific concepts located lower in the hierarchy (such as dried fruits). Thus the link between fruit and apple is specialized as a link between dried fruits and dried apples. This modification is obtained through semantic normalization of head words. Although the number of polysemous terms is very low in a technical domain, some spurious links can result from ambiguous meanings. For instance, the French word pêche has two homographs which mean either a fruit (peach) or an activity (fishing). Because of this polysemy and the lack of semantic disambiguation, an incorrect link between produit de la pêche (fishery products) and produits a ` partir de fruits (products from fruits) is inferred from the link between pêche (peach) and fruit (fruit) through semantic normalization of head words. As will be shown in the evaluation section, the ratio of spurious semantic relations is however very low because we are operating in a specialized domain and because we are dealing with non-polysemous multi-word terms. The transfer or the specialization of a given hierarchy between single words to a hierarchy between multi-word terms generally does not preserve the full set of links. In Figure 5, the initial hierarchy between plant products (topmost hierarchy in Figure 5) is only partially projected through transfer on juices or dryings (the two bottom leftmost hierarchies in Figure 5) and through specialization on fresh and dried plant products (the bottom rightmost hierarchy in Figure 5). Since multi-word terms are more specific than single-word terms, they tend to occur less frequently in a corpus. Thus only a subset of the possible projected links are observed through corpus exploration. Because of the non-systematic transfer or specialization of singleword links, this technique for automatic projection of semantic links

morin04.tex; 13/04/2004; 20:51; p.28

29

Automatic Acquisition and Expansion of Hypernym Links produit végétal (plant products)

céréale

épice

fruit

légume

(cereal)

(spice)

(fruit)

(vegetable)

blé

riz

(wheat)

maïs

figue

(rice)

(maize)

(figs)

orge

ananas

banane

(ananas)

(bananas)

carotte (carrots)

(barley)

fruit à noyau

fruit à pépins

petit fruit

(stone fruits)

(pome fruits)

(soft fruits)

cerise

olive

(cherries)

(olives)

abricot

pomme

poire

(apples)

(pears)

raisin

endive

(tomatoes)

(chicory)

fraise

(grapes)

(apricots)

tomate

(strawberries)

cassis (black currants)

Head link Argument link

Specialization

Transfer

Specialization Transfer jus de fruit

séchage de céréale

séchage de légume

(cereal drying)

(vegetable drying)

(fruit juice)

fruit frais

légume frais

(fresh fruits)

(fresh vegetables)

séchage de fruit (fruit drying)

figue séche

jus de ananas

(carrot drying)

(maize drying)

(ananas juice)

jus de raisin

(dried figs)

séchage de carotte

séchage de maïs

jus de pomme

fruit sec (dried fruits)

séchage de riz (rice drying)

séchage de la banane (banana drying)

raisin frais

raisin sec

(fresh grapes)

(dried grapes)

(grape juice)

(apple juice)

jus de poire (pear juice)

séchage de l’abricot (apricot drying)

Figure 5. Projected links on multi-word terms (the hierarchy of single words is extracted from [AGROVOC])

should be considered as a context-based assistance to thesaurus extension. The user is provided with a set of partial candidate sub-hierarchies that she may decide to generalize or not. Thus the partial hierarchy of dryings proposed by Figure 5 can be generalized to all the plant products unless an expert decides that some plant products cannot be dried. We now present an evaluation of our technique for automatically projecting semantic links.

6. Evaluation of the Projection In this section, we evaluate the projection of corpus-based links and we then compare the results of this projection with thesaurus-based links. 6.1. Projection of Corpus-based Links From 1216 pairs of conceptual related terms extracted by the Prométhée system from [AGRO-ALIM] corpus (see section 2), 89 links be-

morin04.tex; 13/04/2004; 20:51; p.29

30


tween single-word terms are selected for this experiment (see table VI above). Table VII shows the results of the projection of corpus-based links. The first column indicates the semantic class from Table VI. The next three columns indicate the number of multi-word links projected through the Specialization mode of projection, the number of correct links and the corresponding value of precision. The same values are provided for the Transfer mode of projection in the following three columns. Table VII. Precision of the projection of corpus-based links Classes

Specialization # Occ. Correct occ.

P.

# Occ.

Transfer Correct occ.

P.

trees chemical elements cereals enzymes fruits olives apples vegetables polyols polysaccharides proteins sugars

1 8

1 4

100.0% 50.0%

3 101

3 99

100.0% 98.0%

6 3 32 4 4 3 0 3 0 13

1 3 20 1 1 2 1 11

16.7% 100.0% 62.5% 25.0% 25.0% 66.7% 33.3% 84.6%

76 29 214 10 16 3 0 13 8 34

65 20 172 8 12 3 11 6 26

85.5% 69.0% 80.4% 80.0% 75.0% 100.0% 84.6% 75.0% 76.5%

Total

77

45

58.4%

507

425

83.8%

Transfer projections are more frequent (507 links) than specializations (77 links). Some classes, such as chemical elements, cereals and fruits are very productive because they are composed of generic terms. Other classes, such as trees, vegetables, polyols or proteins, yield fewer semantic variations. The low productivity of these term classes can be explained by (1) the specificity of terms (such as polyols and proteins) or (2) the low frequency of terms in the corpus (as for trees and vegetables). These results take into account possibly repeated relationships because some classes produce several occurrences of the same variant. For instance, the class cereals composed of 8 single-word terms infers 195 tokens of relations between multi-word terms, but only 79 differ-

morin04.tex; 13/04/2004; 20:51; p.30

31


ent types of relations are actually discovered that connect 68 different multi-word terms. The quality of the semantic links acquired between multi-word terms is judged by manual inspection. A link between a multi-word term and the semantic variant is considered as correct if there is an actual semantic relationship between these two terms. For instance, the link between activité de protéase (lit. protease activity) and activité d’enzymes (enzymes activity) is correct because protease is a kind of enzyme. On the contrary, centre de production (production center) and milieu de production (environment of production) are not semantically related because both centre and milieu are polysemous. They are semantically related only if they both mean middle. However, in this context, milieu means environment and centre means center. The average precision of Specializations is relatively low (58.4% on average) with a high standard deviation (between 16.7% and 100%). Conversely, the precision of Transfers is higher (83.8% on average) with a smaller standard deviation (between 69.0% and 100%). Since Transfers are almost ten times more numerous than Specializations, the overall precision of projections is high: 80.5%. Table VII above presents the total production of each class. Here, the projection of single-word term hierarchies on multi-word terms can yield several occurrences of the same variant. In order to evaluate the production of new terms (hypernyms and co-hyponyms) and relations between these terms, we only count new links yielded through the projection of corpus-based links (see Table VIII). Once more, the production of multi-word terms is higher with Transfers (72 multi-word terms) than Specializations (345 multi-word terms). Here, 427 relevant multi-word terms are inferred through projection.

Table VIII. Production of new terms and new correct links through the projection of corpus-based links Terms (hypernyms/ co-hyponyms)

Hypernym relations

Corpus-based links

96 (14/82)

94

Specialization proj. Transfer proj.

72 (30/42) 345 (89/256)

30 167

427 (119/298)

197

Total projections

morin04.tex; 13/04/2004; 20:51; p.31

32


6.2. Comparison with Thesaurus-based Links In the preceding section, extracted semantic relationships are based on corpus-based semantic links between single words. Since these links result from automatic corpus-based acquisition, they are prone to error. In order to evaluate the influence of the quality of these links on the quality of the induced relationships between multi-word terms, we now compare the projection of corpus-based links with the projection of links extracted from the [AGROVOC] thesaurus. [AGROVOC] is composed of 15,800 descriptors, and only single-word terms found in the [AGRO-ALIM] corpus are used in this evaluation (1,580 descriptors). From these descriptors, 168 terms representing 4 topics: cultivation, plant anatomy, plant products and flavorings are selected for the purpose of evaluation.16 The results of this second experiment are very similar to the first one (see Table IX). Here, the precision of Specializations is similar (57.8% for 45 links inferred), while the precision of Transfers is slightly lower (72.4% for 326 links inferred). Interestingly, these results show that links resulting from the projection of a thesaurus have a significantly lower precision (70.6%) than projected corpus-based links (80.5%). The use of a thesaurus provides deeper hierarchies with sparser links between co-hyponyms. For instance, fruits, vegetables, cereals and spices produce only four relevant relations between fruits and vegetables and no relation between other co-hyponyms. CJ:There are only very few variants extracted in which both argument and head words are semantically related. Thus the projection of the hierarchy produced by the information extractor does not infer this variation, and [AGROVOC] infers only one relation between parenchyme de pomme (apple parenchyma) and tissu de fruits (fruit tissue) A comparison of Tables VIII and X shows that, while 197 projected links are produced from 94 corpus-based links (ratio 2.1), only 88 such projected links are obtained through the projection of 159 links from [AGROVOC] (ratio 0.6). In fact the ratio of projected links is higher with corpus-based links than thesaurus links, because corpus-based links represent better the ontology embodied in the corpus. They are more easily associated with other single words to produce projected hierarchies.

16

These topics are selected because clearly mentioned in [AGRO-ALIM] and composed of hierarchies with at least of three levels.

morin04.tex; 13/04/2004; 20:51; p.32

33


Table IX. Precision of the projection of thesaurus-based links Classes

cultivation harvesting pruning plant anatomy plant reproductive organs inflorescences flowers leaves stems tissues plant products cereals spices (plant products) fruits (plant products) stone fruits pome fruits soft fruits oilseeds vegetables flavourings condiments spices (flavourings) Total

Specialization # Occ. Correct occ.

P.

# Occ.

Transfer Correct occ.

P.

0 0 0 0 5

1

20%

0 3 0 0 26

1 20

33.3% 76.9%

0 0 0 2 1 8 5 0

1 6 1 -

100.0% 75.0% 20.0% -

0 0 3 0 2 26 78 1

3 2 14 65 1

100.0% 100.0% 53.8% 83.3% 100.0%

3

3

100.0%

53

41

77.4%

2 7 8 0 4 0 0 0

1 3 7 3 -

50.0% 42.9% 87.5% 75.0% -

19 32 45 31 3 0 0 1

14 17 30 25 2 1

73.7% 53.1% 66.6% 80.6% 66.7% 100.0%

45

26

57.8%

326

236

72.4%

6.3. Synthesis This section CJ:has reported an evaluation of the technique used to expand links between single-word terms to links between multi-word terms. In the first experiment, the precision measures show that the production of multi-word terms from corpus-based semantic links is high and accurate (with a significant difference of quality between Specialization and Transfer projections).

morin04.tex; 13/04/2004; 20:51; p.33

34


Table X. Production of new terms and new correct links through the projection of [AGROVOC] links Terms (hypernyms/ co-hyponyms)

Hypernym relations

Thesaurus-based links

162 (27/135)

159

Specialization proj. Transfer proj.

49 (18/31) 256 (65/191)

18 70

Total projections

305 (83/222)

88

In order to evaluate the influence of the quality of links on the quality of the induced relationships between multi-word terms, we have performed the same experiment with links extracted from a thesaurus. The results of this second experiment confirm the relevance of the projection technique even though projection of thesaurus-based links have a lower precision (70.6%) than projection of corpus-based links (80.5%).

7. Comparison with Related Work Semantic normalization is presented as semantic variation in Hamon et al. (1998) and consists of an extraction of relations between multiword terms based on semantic relations between single-word terms. Our approach differs from this preceding work in that we CJ:use domain specific corpus-based links instead of general purpose dictionary synonymy relationships. Another original contribution of our approach is that we exploit simultaneously morphological, syntactic, and semantic links in the detection of semantic variation in a single and cohesive framework. We thus cover a larger spectrum of linguistic phenomena: morpho-semantic variations such as contenu en isotope (isotopic content) a variant of teneur isotopique (isotopic composition), syntactico-semantic variants such as contenu en isotope a variant of teneur en isotope (isotopic content), and morpho-syntactico-semantic variants such as dureté de la viande (toughness of the meat) a variant of résistance et la rigidité de la chair (lit. resistance and stiffness of the flesh). Thus the combination of FASTR and Prométhée shows how CJ:the acquisition of semantic relationships and their use in variant recognition can contribute to semantic knowledge enrichment. This article proposes some kind of second order semantic acquisition from (1) first order links

morin04.tex; 13/04/2004; 20:51; p.34


35

(semantic relationships between single words directly acquired from corpora) and (2) second order variants (semantic variants of two-word terms). One could think of extending further the process of acquisition by extracting third order semantic acquisition: the discovery of semantic links between three-word terms from second-order semantic relationships and third order variants. For instance, the semantic link between jus de fruits (fruit juice) and jus de pomme (apple juice) can yield additional relationships such as the link between vente de jus de fruits (sale of fruit juices) and vente de jus de pommes (sale of apple juices).

8. Conclusion

This study has described and evaluated a method for the automatic acquisition and expansion of hypernym links from large text corpora. The method used for this task combines (1) automatic CJ:acquisition of semantic links through an information extraction procedure and (2) expansion of links between single-word terms to multi-word terms through semantic variant recognition. Hypernym links extracted by the Prométhée system have a high precision, but cover only a part of the hypernym links present in the corpus since Prométhée only extracts pairs of terms occurring in the same sentence. In order to increase the coverage of the hypernym relation, we project a partial ontology between single-word terms on a set of multi-word terms through the extraction of semantic variations. The projection of a single-word terms hierarchy does not preserve the original full set of links. However, the evaluation shows that the level of precision is high. Both the Transfer and the Specialization mode yield high quality hierarchies of multi-word terms, whatever the source links used for semantic conflation. Even though we focus here on generic/specific relations, the methodology could be likely applied to other conceptual relations such as synonymy or meronymy. One of the most important conclusions of this study is that automatic acquisition of hypernym links from large text corpora is a difficult task which can be achieved by combining several techniques. By bridging the gap between term acquisition and term structuration, this study offers new perspectives for the concurrent use of text-mining tools (such as termers and information extractors) and automatic indexers, in the process of automatic thesaurus enrichment.

morin04.tex; 13/04/2004; 20:51; p.35

36


References Basili, R., M. T. Pazienza, and P. Velardi: 1993, ‘Acquisition of Selectional Patterns in Sublanguages’. Machine Tranlation 8, 175–201. Bourigault, D.: 1995, ‘LEXTER, A Terminology Extraction Software for Knowledge Acquisition from Texts’. In: Proceedings, 9th Banff Knowedge Acquisition for Knowledge-Based Systems Workshop. Banff, pp. 1–17 (vol. 5). Church, K. W. and P. Hanks: 1990, ‘Word Association Norms, Mutual Information, and Lexicography’. Computational Linguistics 16(1), 22–29. Condamines, A. and J. Rebeyrolle: 2001, ‘Searching for and identifying conceptual relationships via a corpus-based approach to a Terminological Knowledge Base (CTKB): Method and Results’. In: D. Bourigault, C. Jacquemin, and M.-C. L’Homme (eds.): Recent Advances in Computational Terminology. Amsterdam: John Benjamins Publishing Company, pp. 127–148. Daille, B.: 1996, ‘Study and Implementation of Combined Techniques for Automatic Extraction of Terminology.’. In: J. L. Klavans and P. Resnik (eds.): The Balancing Act: Combining Symbolic and Statistical Approaches to Language. Cambridge, MA: MIT Press, pp. 49–66. Fellbaum, C. (ed.): 1998, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press. Grefenstette, G.: 1994, Explorations in Automatic Thesaurus Discovery. Boston, MA: Kluwer Academic Publisher. Grishman, R. and J. Sterling: 1992, ‘Acquisition of Selectional Patterns’. In: Proceedings, 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, pp. 658–664. Hamon, T., A. Nazarenko, and C. Gros: 1998, ‘A Step Towards the Detection of Semantic Variants of Terms in Technical Documents’. In: Proceedings, 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL’98). Montreal, pp. 498–504. Hearst, M. A.: 1992, ‘Automatic Acquisition of Hyponyms from Large Text Corpora’. In: Proceedings, 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, pp. 539–545. Hearst, M. A.: 1998, ‘Automated Discovery of WordNet Relations’. In: C. Fellbaum (ed.): WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, pp. 131–151. Hindle, D.: 1990, ‘Noun classification from predicate argument structures’. In: Proceedings, 28th Annual Meeting of the Association for Computational Linguistics (ACL’90). Berkeley, CA, USA, pp. 268–275. Jacquemin, C. and E. Tzoukermann: 1999, ‘NLP for Term Variant Extraction: A Synergy of Morphology, Lexicon, and Syntax’. In: T. Strzalkowski (ed.): Natural Language Information Retrieval. Boston, MA: Kluwer, pp. 25–74. Jacquemin, C.: 1996, ‘A symbolic and surgical acquisition of terms through variation’. In: S. Wermter, E. Riloff, and G. Scheler (eds.): Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. Heidelberg: Springer, pp. 425–438. Jacquemin, C.: 1999, ‘Syntagmatic and Paradigmatic Representation of Term Variation’. In: Proceedings, 37th Annual Meeting of the Association for Computational Linguistics (ACL’99). University of Maryland, pp. 341–348. Jacquemin, C.: 2001, Spotting and Discovering Terms through NLP. Cambridge, MA: MIT Press.

morin04.tex; 13/04/2004; 20:51; p.36


37

Jacquemin, C. and J. Royauté: 1994, ‘Retrieving Terms and their Variants in a Lexicalized Unification-Based Framework’. In: Proceedings, 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’94). Dublin, pp. 132–41, Springer Verlag, New York. Justeson, J. S. and S. M. Katz: 1995, ‘Technical terminology: some linguistic properties and an algorithm for identification in text’. Natural Language Engineering 1(1), 9–27. Morin, E.: 1999a, ‘Des patrons lexico-syntaxiques pour aider au dépouillement terminologique’. Traitement Automatique des Langues 40(1), 143–166. Morin, E.: 1999b, ‘Extraction de liens sémantiques entre termes a ` partir de corpus de textes techniques’. PhD thesis in Computer Science, University of Nantes. Morin, E. and C. Jacquemin: 1999, ‘Projecting Corpus-Based Semantic Links on a Thesaurus’. In: Proceedings, 37th Annual Meeting of the Association for Computational Linguistics (ACL’99). University of Maryland, pp. 389–396, ACL. Rijsbergen van, C. J.: 1979, Information Retrieval. Butterworths. Riloff, E.: 1993, ‘Automatically Constructing a Dictionary for Information Extraction Tasks’. In: Proceedings, 11th National Conference on Artificial Intelligence (AAAI’93). Washington, DC, pp. 811–816. Ruge, G.: 1991, ‘Experiments on Linguistically Based Term Associations’. In: Proceedings, Intelligent Multimedia Information Retrieval Systems and Management (RIAO’91). Barcelone, Espagne, pp. 528–545. Salton, G. and M. J. McGill: 1983, Introduction to Modern Information Retrieval. New York: McGraw-Hill. Sch¨ utze, H.: 1993, ‘Word Space’. In: S. J. Hanson, J. D. Cowan, and L. Giles (eds.): Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kauffmann, pp. 895–902. Smadja, F.: 1993, ‘Retrieving Collocations from Text: Xtract’. Computational Linguistics 19(1), 143–177. Vivaldi Palatresi, J. and H. Rodr´ıguez Hontoria: 2002, ‘Medical Term Extraction using EWN ontology’. In: Proceedings, Terminology and Knowledge Engineering (TKE’2002). Nancy, France, Gesellschaft f¨ ur Terminologie und Wissenstransfer. Yoshikane, F., K. Tsuji, K. Kageura, and C. Jacquemin: 1998, ‘Detecting Japanese Term Variation in Textual Corpus.’. In: Proceedings, Fourth International Workshop on Information Retrieval with Asian Languages (IRAL’99). Academia Sinica, Taipei, Taiwan, pp. 97–108.

morin04.tex; 13/04/2004; 20:51; p.37

morin04.tex; 13/04/2004; 20:51; p.38

Automatic Acquisition and Expansion of Hypernym Links - CiteSeerX

Automatic Acquisition and Expansion of Hypernym Links - CiteSeerX

Suggest Documents

Automatic Acquisition and Expansion of Hypernym Links - limsi

Automatic Acquisition and Initialization of Articulated ... - CiteSeerX

Learning syntactic patterns for automatic hypernym discovery

Automatic Acquisition of Tactical Go Rules - CiteSeerX

Automatic acquisition of semantic-based question ... - CiteSeerX

Automatic Acquisition of Tactical Go Rules - CiteSeerX

Automatic Acquisition of Fuzzy Footprints - CiteSeerX

AUTOMATIC ACQUISITION OF LEXICAL

Automatic Paraphrase Acquisition from News Articles - CiteSeerX

Tools for Automatic Lexicon Maintenance: Acquisition ... - CiteSeerX

Automatic Lexical Acquisition Based on Statistical ... - CiteSeerX

Automatic Acquisition of Noun and Verb Meanings - CiteSeerX

AUTOMATIC CONCEPT ACQUISITION FROM

Automatic Acquisition of Image Filtering and Object

Automatic Acquisition and Initialization of ... - Semantic Scholar

Automatic Data Acquisition and Visualization for Usability ... - CiteSeerX

Semi-automatic Acquisition of Semantic Descriptions of ... - CiteSeerX

Automatic Image-Based Scene Model Acquisition and ... - CiteSeerX

Automatic Image-Based Scene Model Acquisition and ... - CiteSeerX

A Survey of Automatic Query Expansion in Information ... - CiteSeerX

A Survey of Automatic Query Expansion in Information ... - CiteSeerX

Ontology Acquisition for Automatic Building of Scientific ... - CiteSeerX

AUTOMATIC TARGET ACQUISITION AND TRACKING WITH ...

Automatic Acronym Acquisition and Term Variation Management ...