Extracting Semantic Relationships between Terms - CiteSeerX

Extracting Semantic Relationships between Terms: Supervised vs. Unsupervised Methods

Michal Finkelstein-Landau
Math & Computer Science Department
Bar Ilan University
Ramat Gan 52900
ISRAEL
[email protected]

Emmanuel Morin
IRIN
2, rue de la Housinière - BP 92208
44322 Nantes Cedex 3
FRANCE
[email protected]

May 13, 1999

1 Introduction

As the volume of electronic documents (corpora, dictionaries, newspapers, newswires, etc.) grows and diversifies, there is a need to extract information automatically from these texts. Two methods can be used to extract terms and the relations between them.

The first method is the unsupervised approach, which requires a term extraction module and a few predefined types, especially term types, in order to find relationships between terms and to assign appropriate types to those relationships. Work on automatic term recognition usually involves predefining a set of term patterns, an extraction procedure, and a scoring mechanism to filter out non-relevant candidates. Smadja (1993) describes a set of techniques based on statistical methods for retrieving collocations from large text collections. Daille (1996) presents a combination of linguistic filters and statistical methods to extract two-word terms: finite automata are implemented for each term pattern, and various statistical scores for ranking the extracted terms are then compared. Unsupervised identification of term relationships is a more complicated task, reported in work from various fields including Computational Linguistics and Knowledge Discovery in Texts. A keyword-based model for text mining is described in Feldman and Dagan (1995). That work suggests applying a wide range of KDD (Knowledge Discovery in Databases) operations to collections of textual documents, including association discovery among keywords within the documents. Cooper and Byrd (1997) report the TALENT extraction tools, designed to extract and organize lexical networks from named and unnamed relations in the text. The tools are not necessarily primed with the specific relation names they are looking for; instead, named relations are discovered by exploiting text patterns in which such relationships are typically expressed.
The second method is the supervised relation classification system, which requires predefined lexico-syntactic patterns as well as manual review of the output by terminologists, in order to find pairs that belong to the predefined relations. Hearst (1992, 1998) reports a method that uses lexico-syntactic patterns to extract lexical relations between words from unrestricted text. For example, the pattern NP {,} especially {NP,}* {or | and} NP (where NP is a noun phrase), applied to the sentence (...) most European countries, especially France, England and Spain, yields three lexical relations: (1) HYPONYM(France, European country), (2) HYPONYM(England, European country), and (3) HYPONYM(Spain, European country). These relations can then be included in a hierarchical thesaurus. Here, only a single instance of a lexico-syntactic pattern needs to be encountered to extract the corresponding conceptual relation. Other supervised systems rely on partial syntactic structures, using local information to extract specific relations. LIEP (Huffman, 1995) learns information extraction patterns from example texts containing events. A user can choose which combinations of entities signify events to be extracted. These positive examples are used by LIEP to build a set of extraction patterns. These supervised systems perform well on information extraction tasks in a limited domain, but the cost of adapting an information extraction system to a new domain can be prohibitive.

To evaluate the complementarity of these methods, we compared a supervised method with an unsupervised method for the extraction of semantic relationships between terms. This paper is the result of that study. The remainder of the paper is organized as follows. Section 2 presents the supervised system PROMÉTHÉE and describes the methodology for acquiring lexico-syntactic patterns. Section 3 presents an unsupervised method that combines ideas of term identification and term relationship extraction for term-level text mining. Section 4 presents the integrated system and experimentation. Finally, Section 5 concludes this study.

The experiments presented in this paper were performed on a subset of the [REUTERS] corpus, a 0.9-million word English corpus including 5770 news stories.

Figure 1: The information extraction system PROMÉTHÉE. Components shown: Corpus; Lexical preprocessor; Lemmatized and tagged corpus; Bootstrap: initial pairs of terms; Shallow parser + classifier; Lexico-syntactic patterns; Database of lexico-syntactic patterns; Information extractor; Partial hierarchies of single-word terms.
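The Hearst-style "especially" pattern discussed in the introduction can be illustrated with a small sketch. This is not Hearst's implementation: it assumes the input fragment starts at the hypernym noun phrase, approximates noun phrases with plain comma-free text instead of a parser, and skips lemmatization (so "countries" stays plural). The names `extract_especially`, `strip_determiner`, and the determiner list are hypothetical.

```python
import re

# Assumption: a noun phrase is approximated by a run of text without
# punctuation; a real system would use a tagger or shallow parser.
PATTERN = re.compile(r"(?P<hyper>[^,.;]+),? especially (?P<items>[^.;]+)")

# Separators inside the enumerated list: commas and "and" / "or".
SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")

# Hypothetical, deliberately tiny list of leading words to drop.
DETERMINERS = {"most", "the", "a", "an", "some", "many"}


def strip_determiner(noun_phrase):
    """Drop leading determiners or quantifiers from a noun phrase."""
    words = noun_phrase.split()
    while words and words[0].lower() in DETERMINERS:
        words = words[1:]
    return " ".join(words)


def extract_especially(fragment):
    """Return (hyponym, hypernym) pairs found by the 'especially' pattern."""
    match = PATTERN.search(fragment)
    if match is None:
        return []
    hypernym = strip_determiner(match.group("hyper").strip())
    items = [item.strip() for item in SEPARATOR.split(match.group("items"))]
    return [(item, hypernym) for item in items if item]


if __name__ == "__main__":
    text = "most European countries, especially France, England and Spain"
    for hyponym, hypernym in extract_especially(text):
        print(f"HYPONYM({hyponym}, {hypernym})")
```

A single matching sentence is enough to produce the relations, which mirrors the point made above that one instance of a pattern suffices.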

2 Iterative Acquisition of Lexico-syntactic Patterns

We first present PROMÉTHÉE, a supervised system for corpus-based information extraction that extracts semantic relations between terms.1 This system builds on previous work on the automatic extraction of hypernym links through shallow parsing (Hearst, 1992, 1998). Beyond that work, the system incorporates a technique for the automatic generalization of lexico-syntactic patterns that relies on a syntactically-motivated distance between patterns. As illustrated in Figure 1, the PROMÉTHÉE system has two functionalities:

1. The corpus-based acquisition of lexico-syntactic patterns with respect to a specific conceptual relation.
2. The extraction of pairs of conceptually related terms through a database of lexico-syntactic patterns.

1 For expository purposes of this section, some examples are taken from [MEDIC], a 1.56-million word English corpus of scientific abstracts in the medical domain.

Shallow Parser and Classifier

A shallow parser is complemented with a classifier for the purpose of discovering new patterns through corpus exploration. This procedure, inspired by Hearst (1992, 1998), is composed of 7 steps:

1. Manually select a representative conceptual relation, for instance the hypernym relation.

2. Collect a list of pairs of terms linked by the selected relation. The list can be extracted from a thesaurus or a knowledge base, or can be specified manually. For instance, the hypernym relation neocortex IS-A vulnerable area is used.

3. Find sentences in which conceptually related terms occur. These sentences are lemmatized, and noun phrases are identified. Sentences are thus represented as lexico-syntactic expressions.2 This simplification gives a more generic representation of the relevant sentences and makes them easier to compare. For instance, the previous relation HYPERNYM(vulnerable area, neocortex) is used to extract from the corpus [MEDIC] the sentence: Neuronal damage were found in the selectively vulnerable areas such as neocortex, striatum, hippocampus and thalamus. The sentence is then transformed into the following lexico-syntactic expression:

   NP find in NP such as LIST   (1)

4. Find a common environment that generalizes the lexico-syntactic expressions extracted at the third step. This environment is calculated with the help of a similarity measure and a generalization procedure that together produce a candidate lexico-syntactic pattern. For instance, from the previous expression and at least one other similar expression, the following candidate lexico-syntactic pattern is deduced:

   NP such as LIST   (2)

5. Have an expert validate the candidate lexico-syntactic patterns.

6. Use the new patterns to extract more pairs of candidate terms.

7. Have an expert validate the candidate pairs of terms, and go to step 3.

Through this technique, lexico-syntactic patterns are extracted from a technical corpus. These patterns are then exploited by the information extractor, which produces pairs of conceptually related terms.
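Steps 3 and 6 of the procedure above can be sketched as follows. The sketch assumes that lemmatization and noun phrase identification have already been done, so a sentence arrives as a hand-built list of chunks. The chunk representation and the names `to_expression` and `apply_such_as` are illustrative assumptions, not part of PROMÉTHÉE.

```python
from typing import List, Tuple, Union

# Assumed representation: a chunk is (kind, value), where kind is
# "NP" (noun phrase), "LIST" (enumeration of noun phrases) or
# "W" (a plain lemma). LIST chunks carry a list of strings.
Chunk = Tuple[str, Union[str, List[str]]]


def to_expression(chunks: List[Chunk]) -> str:
    """Step 3: abstract a chunked sentence into a lexico-syntactic expression."""
    return " ".join(
        kind if kind in ("NP", "LIST") else str(value) for kind, value in chunks
    )


def apply_such_as(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Step 6: apply the pattern 'NP such as LIST' and return
    (hypernym, hyponym) pairs."""
    pairs = []
    for i in range(len(chunks) - 3):
        (k0, v0), (k1, v1), (k2, v2), (k3, v3) = chunks[i:i + 4]
        if (k0 == "NP" and (k1, v1) == ("W", "such")
                and (k2, v2) == ("W", "as") and k3 == "LIST"):
            pairs.extend((v0, item) for item in v3)
    return pairs


# Hand-chunked version of the [MEDIC] example sentence (an assumption:
# the chunking is written by hand here, not produced by a parser).
SENTENCE: List[Chunk] = [
    ("NP", "neuronal damage"),
    ("W", "find"),
    ("W", "in"),
    ("NP", "vulnerable area"),
    ("W", "such"),
    ("W", "as"),
    ("LIST", ["neocortex", "striatum", "hippocampus", "thalamus"]),
]

if __name__ == "__main__":
    print(to_expression(SENTENCE))
    for hypernym, hyponym in apply_such_as(SENTENCE):
        print(f"HYPERNYM({hypernym}, {hyponym})")
```

The pairs returned by `apply_such_as` are only candidates; in the procedure described above they would still go to an expert for validation in step 7.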

Automatic Classification of Lexico-syntactic Patterns

Let us illustrate in more detail the fourth step of the algorithm described above, which automatically acquires lexico-syntactic patterns by clustering similar patterns. As indicated in step 3 above, the relation HYPERNYM(vulnerable area, neocortex) instantiates the pattern:

NP find in NP such as LIST   (3)

Similarly, from the relation HYPERNYM(complication, infection), the sentence: Therapeutic complications such as infection, recurrence, and loss of support of the articular surface have continued to plague the treatment of giant cell tumor is extracted through corpus exploration. A second lexico-syntactic expression is produced:

NP such as LIST continue to plague NP   (4)

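Lexico-syntactic expressions (3) and (4) can be abstracted as:3

$A = A_1 A_2 \cdots A_j \cdots A_k \cdots A_n$ with HYPERNYM$(A_j, A_k)$, $k > j + 1$

and

$B = B_1 B_2 \cdots B_{j'} \cdots B_{k'} \cdots B_{n'}$ with HYPERNYM$(B_{j'}, B_{k'})$, $k' > j' + 1$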
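To give a feel for how expressions (3) and (4) might be reduced to a shared candidate pattern, here is a minimal sketch. It is a stand-in, not the syntactically-motivated distance used by the system: it simply takes the longest contiguous run of tokens common to both expressions, using Python's standard `difflib`. The function name `common_environment` is hypothetical.

```python
from difflib import SequenceMatcher


def common_environment(expr_a: str, expr_b: str) -> str:
    """Return the longest contiguous token sequence shared by two
    lexico-syntactic expressions (a simplifying assumption, used here
    in place of a real similarity measure)."""
    a, b = expr_a.split(), expr_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


if __name__ == "__main__":
    expression_3 = "NP find in NP such as LIST"
    expression_4 = "NP such as LIST continue to plague NP"
    print(common_environment(expression_3, expression_4))
```

On these two expressions the shared run is the candidate pattern (2), NP such as LIST. A plain longest-common-run heuristic ignores which positions hold the related terms, which is the information the abstraction above makes explicit.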
