Discovering and Organizing Noun-Verb Collocations in Specialized Corpora Using Inductive Logic Programming

Vincent Claveau and Marie-Claude L’Homme
Observatoire de linguistique Sens-Texte (OLST), Université de Montréal
C.P. 6128, succ. Centre-ville, Montréal (Québec) H3C 3J7
[email protected] [email protected]

Abstract: This article presents an automatic method for discovering and organizing noun-verb (N-V) combinations found in a French corpus on computing. Our aim is to find N-V combinations in which verbs convey a “realization meaning” as defined in the framework of lexical functions, LFs (Mel’čuk 1996, 1998). Our approach, chiefly corpus-based, uses a machine learning technique, namely Inductive Logic Programming (ILP), and is designed as a generic syntagmatic relationship acquisition system. The whole acquisition process is divided into three steps: 1) isolating contexts in which N-V pairs sharing a realization relationship occur; 2) from these contexts, inferring linguistically-motivated rules that reflect the behaviour of realization N-V pairs; 3) projecting these rules on corpora to find other valid N-V pairs. This technique is evaluated in terms of the relevance of the rules inferred and in terms of the quality (recall and precision) of the results, i.e., the set of N-V pairs retrieved, with the help of a large test set. The results obtained show that our approach is able to find these very specific semantic relationships (the realization N-V pairs) with very good success rates, making this acquisition method attractive and helpful for terminologists.

Key words: collocation, verbs of realization, machine learning, Inductive Logic Programming, specialized corpus, lexical function

1. Introduction

The objective of this research is to devise and evaluate an automatic method for capturing and organizing noun + verb (N-V) combinations found in specialized corpora. The work is undertaken in order to assist terminologists in the enrichment of a specialized dictionary, more precisely a French dictionary of computing, the DiCoInfo (L’Homme 2004). The dictionary includes rich collocational information under each head word. The information is extracted from a corpus of computing that amounts to approximately 600,000 words. The encoding in the dictionary is based on lexical functions (LFs) (Mel’čuk 1996, 1998; Mel’čuk et al. 1984-1999), which are used to organize collocates according to syntactic and semantic parameters (LFs are presented in Section 3). (An example of this encoding for the term ordinateur (Eng. computer) is provided in Appendix A.) Apart from the Explanatory Combinatorial Dictionary (ECD) and our dictionary on computing, other lexicographical and terminological projects have applied LFs or similar formalisms to encode collocations (Binon et al. 2000; Cohen 1986; Fontenelle 1997).

When browsing the corpus to find relevant collocations, terminologists face two different problems. First, they must make a selection among all combinations that appear in the corpus in order to retain only those that are relevant. As far as automation is concerned, this problem has been addressed by collocation extractors (e.g., Grefenstette 1994; Lin 1998; Smadja 1993). Secondly, they must classify collocations according to the syntactic and semantic relationships of their components. This type of classification is much more difficult to automate and has not been dealt with extensively (notable exceptions are Wanner 2004 and Wanner et al. 2005). The work reported in this article addresses the latter issue. Our aim is to find N-V combinations in which verbs convey realization. In line with the work by Wanner, we carried out this task using a machine learning technique. However, our approach relies on Inductive Logic Programming (ILP) and is chiefly corpus-based. It can be divided into three steps: 1) isolating contexts in which N-V pairs sharing a realization relationship occur; 2) from these contexts, inferring linguistically-motivated rules that reflect the behaviour of realization N-V pairs; and 3) projecting these rules on corpora to find other valid N-V pairs.

The article is organized as follows. Section 2 presents some related work and explains how our own approach distances itself from others. We will focus on collocation extraction and attempts at classifying these collocations semantically. Also, since this work is an extension of previous experiments (Bouillon et al. 2001; Claveau et al. 2003), we will say a few words on these. Section 3 presents the specific objectives of the experiments reported in this article and describes the concept of “lexical function” on which our classification of N-V combinations is based. In Section 4, we describe the different steps of our methodology (the composition of the corpus and the basic principles of Inductive Logic Programming). Finally, Section 5 gives the details and the results of the evaluation we carried out to validate our approach.

2. Related work and positioning of our own approach

2.1 Collocation extraction techniques

Numerous studies have been dedicated to the acquisition of semantic relations from corpora using numerical techniques. Grefenstette (1994) presents the state of the art in the domain, and Manning and Schütze (1999) describe a wide range of statistical methods that have been used for that purpose. With numerical approaches, relations between lexical units are first acquired by studying word cooccurrences in a text window (or in a specific syntactic structure). Then, the strength of the association is evaluated with a statistical score (association coefficient) that detects words appearing together in a statistically significant way (see for example the seminal work by Church and Hanks (1989)). Among the best-known statistical coefficients, let us cite Mutual Information, the Log-likelihood Coefficient (Dunning 1993) and the Dice coefficient (Smadja 1993). A comparison between different statistical coefficients can be found in Pearce (2002). In general, these numerical approaches prove quite efficient, giving good results in terms of collocations retrieved, and do not require any human intervention. Nonetheless, when used without refinement, these methods are unable to distinguish between different kinds of collocations and are thus unable to focus on a precise semantic relation.

In addition to numerical techniques, many systems use patterns to retrieve collocations from corpora. These patterns are usually based on lemmas, Part-of-Speech tags or syntactic information and are applied prior to the numerical analysis (inter alia, Kilgarriff and Tugwell 2001; Goldman et al. 2001) or afterwards (Smadja 1993). These hybrid systems yield more precise results and, contrary to purely numerical ones, they can be used to focus on special kinds of collocations by retrieving collocates that are in a specific syntactic relation with the collocation base. Most of the time, the list of patterns used is created manually and then fed into the extraction system. The patterns can be fixed once and for all, or tuned with respect to the particularities of the corpus studied. In this respect, the originality of the work presented here is to automatically produce patterns for extracting collocations conveying a realization meaning by positioning ourselves in a machine learning framework.

2.2 Semantic classification of collocations

To our knowledge, very few attempts have been made to organize collocations automatically according to the semantics of their components. Hence, in this section, we will focus on the work by Wanner (2004) and Wanner et al. (2005), who carried out two series of experiments to classify N-V pairs found in Spanish corpora. It is worth noting that many projects for finding specific semantic relationships have been carried out using approaches based on linguistic patterns. These patterns are used to retrieve words sharing relationships such as causality (Garcia et al. 2000) or hyperonymy (Hearst 1998; Morin 1999). However, the aim of these projects is not to find collocations but rather to devise semantic information extraction systems.

Wanner’s first series of experiments (Wanner 2004) focused on classifying combinations of a verb with a noun denoting an emotion (e.g., admiración (Eng. admiration), alegría (Eng. joy), entusiasmo (Eng. enthusiasm)) in a Spanish general-language corpus (Wanner 2004: 109); the second series (Wanner et al. 2005) was carried out on a law corpus and aimed at classifying collocations in this specialized corpus. In both series of experiments, N-V combinations were linked syntactically and were classified in terms of lexical functions (LFs) using machine learning techniques (two techniques were tested: nearest neighbour classification and a variant of Bayesian networks), leading to very good results. An external resource (namely, Spanish EuroWordNet) was used in order to disambiguate the components of the collocations.

Our work bears a number of similarities to that of Wanner (2004) and Wanner et al. (2005). We also try to distinguish between different types of N-V combinations using a machine learning technique, and we use LFs to organize them semantically. However, our approach differs in a number of ways. First, we do not try to classify syntactically linked N-V pairs once they have been acquired. We identify, in a corpus, N-V pairs that potentially belong to a subset of LFs, i.e., realization meanings. The syntactic links between N-V pairs, although necessary for a proper classification, are discovered during the learning process. Also, since we identify valid pairs in a technical corpus, i.e., a French corpus of computing, we cannot rely on an external resource to access information on the semantics of the nouns and verbs that form the combinations. Finally, we use a different machine learning technique, namely Inductive Logic Programming, which is particularly well suited to our task (see Section 4.2.2) and can produce interpretable results (see the discussion on the patterns it infers, Section 5.1).

2.3 ASARES and previous uses

The corpus-based acquisition technique we use to retrieve N-V pairs sharing a realization semantic link, called ASARES, is a symbolic extraction system: the N-V pairs are actually found in a (Part-of-Speech tagged) corpus with the use of patterns. However, the originality of this approach lies in the fact that the patterns we need to acquire the N-V combinations in which we are interested are learned automatically from examples. The machine learning technique on which ASARES relies to infer these patterns is Inductive Logic Programming (further details on the technique implemented in ASARES are provided in Section 4). Roughly, the acquisition process can be divided into three steps. First, relevant examples of what we want to extract are gathered. In our case, these examples reflect the behaviour of realization N-V pairs in context. Then, with the help of ILP, ASARES infers general patterns corresponding to the examples and describing, in terms of Part-of-Speech tags, lemmas and distances between N and V, what distinguishes realization N-V pairs from other N-V pairs. The set of patterns produced is called a classifier. Finally, the last step consists of applying the classifier (the patterns) to the corpus to retrieve new realization N-V pairs.

This acquisition technique has already been used in a corpus-based acquisition task. In these experiments (Bouillon et al. 2001; Claveau et al. 2003), the combinations sought were N-V pairs in which the V expresses one of the qualia roles of N as defined in the Generative Lexicon framework (Pustejovsky 1995). These qualia roles group different semantic relationships that can be associated with N. For example, the agentive role expresses the way the entity denoted by N is created (e.g., write for book); the telic role expresses the usual way N is used (e.g., read for book); the formal role expresses the way N can be described from the semantic classes it inherits (e.g., contain for book since, according to Pustejovsky, a book can be described as a physical object that contains information). It is important to note that no assumption is made about the syntactic relation between a noun and its qualia verbs. Claveau et al. (2003) use ASARES to acquire qualia verbs of nouns from corpora. However, in these experiments, the distinction between qualia roles is not considered. The acquisition task we propose in this article, though similar in terms of the acquisition technique used, differs with respect to the N-V semantic relationships sought. Indeed, while the N-V pairs sought by Claveau et al. (2003) can share a broad range of semantic relationships and any syntactic link, the pairs we are looking for must convey a realization meaning, and the LFs are acquired according to the syntactic relation between N and V (see Section 3.1).

3. Realization verbs according to the lexical function framework

The notion of “realization verb” upon which our experiments rely is based on the lexical function (LF) formalism. In this section, we give a brief description of the concept of “lexical function”, and provide further details on the subset in which we are interested, i.e., realization LFs. We also show how the formalism serves as a backbone for conducting the experiments and interpreting the semantic links between N-V pairs found in a specialized corpus of computing.

3.1 Lexical functions (LFs)

A lexical function is designed to capture a general and abstract meaning that can correspond to a high number of different linguistic values. For example, Magn is a function that expresses intensification. It can be applied to different lexical units and produce a large set of values (e.g., Magn(smoker) = heavy; Magn(bachelor) = confirmed, etc.) (Mel’čuk et al. 1995: 126-127). A lexical function is written f(x) = y: f represents the function, x, the argument, and y, the value expressed by the function when applied to a given argument.
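Although LFs are a lexicographic formalism rather than a program, their f(x) = y structure maps naturally onto Prolog-style facts. The following fragment is our own illustration only; the lf/3 predicate and its encoding are assumptions, not the DiCoInfo’s actual format:

% Our own sketch of the f(x) = y structure of lexical functions as Prolog
% facts lf(F, X, Y); the lf/3 encoding is an assumption, not from the paper.
lf(magn, smoker, heavy).        % Magn(smoker) = heavy
lf(magn, bachelor, confirmed).  % Magn(bachelor) = confirmed

% Querying all values of Magn applied to smoker:
% ?- lf(magn, smoker, Y).
% Y = heavy.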

In this article, we will be concerned with syntagmatic LFs, i.e., LFs designed to express the relationship between a key word and a collocate in a collocation. Collocations are defined as combinations of lexical units (LUs) which are lexically restricted, i.e., arbitrary and unpredictable (this characterization is based on Hausmann (1979) and Mel’čuk et al. (1995)). For example, there is no way to predict, based on regular syntactic and semantic rules, that the adjective used to express intensification for smoker is heavy. The same applies to bachelor (intensifier: confirmed) or, to use an example in the field of computing, to density (intensifier: high).

3.2 Realization verbs

In this experiment, we focus on a specific set of verbs that could be encoded with LFs that indicate realization meanings. Realization verbs are those that express the fulfilment of the requirement of the noun, i.e., that express what one is supposed to do with the noun (Mel’čuk 1998: 40). The “requirement” differs according to the semantics of the noun: for example, the realization verbs found with artefacts (which are very frequent in the field of computing) express the use that is made of them according to their specific function (Mel’čuk 1998: 40). Realization meanings are represented with three basic LFs according to the actantial[1] position of the key word:
· Fact0 / Facti: expresses a realization meaning when the key word is the first actant (Fact0 when no other actant is involved, Facti when a second actant is involved);
· Reali: expresses a realization meaning when the key word is the second actant;
· Labrealij: expresses a realization meaning when the key word is the third actant.

This first set of LFs is often opposed to another series that represent support verbs, or verbs that convey a meaning related to the creation or the existence of the key word, namely Func0 or Funci, Operi, Laborij. Examples of verbs in the field of computing that belong to both series are provided in Table 1. (LFs cited in this article are explained in Appendix B.)

Realization verbs:
  Fact0      système tourne (system runs)
  Fact1      commande s’exécute (command runs); hébergeur héberge un site (host hosts a Web site)
  Real1      naviguer dans Internet (surf the Internet)
  Real2      s’afficher à l’écran (appear on screen)
  Labreal12  écrire un programme en langage (write a program in a language); stocker les données sur disque (store data on disk)

“Support” verbs:
  Func0      panne survient (crash happens)
  Func2      configuration comporte (configuration includes)
  Oper1      tomber en panne (to crash)
  Oper2      offrir une résolution (offer a resolution)
  Labor12    mettre des ordinateurs en réseau (connect computers to a network)

Table 1: Realization and support verbs in the field of computing

Lexical functions that represent realization verbs are often used in combination with other lexical functions in order to capture more complex meanings (cause, process, etc.). Table 2 lists the combinations that can be found in the current version of our dictionary of computing and which will serve as positive examples in our experiment. Lexical functions are listed according to the actantial position of the key word. As we will see further in the article, our experiments take this actantial position into account.

  LF                         Example                                        English translation

The key word is first actant:
  IncepFact0                 ordinateur : ~ démarrer                        computer starts
  Fact0                      ordinateur : ~ tourne                          computer runs
  Fact1                      virus : ~ contamine le logiciel, le matériel   a virus infects the hardware or the software
  IncepFact2                 internaute : ~ se connecte à Internet          the Web user connects to the Internet
  Fact2                      internaute : ~ consulte une page, un site      the Web user visits a Web page, a Web site
  FinFact0                   ordinateur : ~ plante                          computer crashes
  De_NouveauIncepFact0       PC : ~ redémarre                               PC restarts

The key word is second actant:
  Prepar1Fact0               programme : charger ~                          load a program
  Caus1Fact0                 traitement de texte : lancer ~                 launch/open the word processor
  Liqu1Fact0                 système : éteindre ~                           turn off/shut down the system
  Caus1De_NouveauFact0       ordinateur : redémarrer ~                      restart the computer
  Caus1Able1Fact0            menu : activer ~                               activate (?) a menu
  Liqu1Able1Fact0            fonction : désactiver ~                        deactivate a function
  Caus1De_NouveauAble1Fact0  fichier : réparer ~                            repair a file
  Caus1Non-Able1Fact0        disque : endommager ~                          damage a disk
  Prepar1Real1               menu : dérouler ~                              open a menu
  Real1                      page : consulter ~                             visit a Web page
  FinReal1                   logiciel : quitter ~                           quit/exit an application

The key word is third actant:
  Labreal12                  mémoire : charger un logiciel en ~             load an application into memory
  Labreal12                  souris : cliquer sur l’icône avec la souris    click on the icon with the mouse

Table 2: Realization verbs encoded in the dictionary on computing

[1] In accordance with Explanatory Combinatorial Lexicology (ECL), we use actant and actantial to refer to the participants of a predicate. However, other frameworks use argument.

It is important to specify at this point that the assignment of LFs is based on a prior disambiguation of the LUs they involve. We will come back to this later on when describing some choices that were made in the experiments. To illustrate the problem, we will examine the case of ergative verbs (i.e., verbs that have an intransitive use and a transitive use which implies a causative meaning). The difficulty lies in the fact that these verbs can combine with the same LU in both cases. However, since the two uses correspond to two different senses, the LFs will differ. The problem is illustrated below with the verb démarrer (Eng. start):

  l’ordinateur démarre (Eng. the computer starts): IncepFact0(ordinateur) = ~ démarre
  l’utilisateur démarre l’ordinateur (Eng. the user starts the computer): Caus1Fact0(ordinateur) = démarrer ~

3.3 Realization verbs and problems posed by their identification in specialized corpora

By attempting to discover N-V collocations in which verbs convey a realization meaning (one of those listed in Table 2 and potentially others that have not been encoded yet in the dictionary), we must distinguish them from other, non-valid N-V pairs. By non-valid N-V pairs, we mean those:
· in which the noun and the verb do not share a syntactic link (not even an indirect one; we will come back to this later), even though they appear in the same sentence;
· in which the verb and the noun are syntactically related, but that would not be encoded in our dictionary as a relevant collocation;
· in which the verb and the noun are syntactically related and would be considered as candidates for inclusion in our dictionary, but would be described with an LF that does not include Fact0 or Facti, Reali or Labrealij.

Realization verbs were chosen in this experiment because they are believed to be frequent and important in corpora on computing. They combine with terms that denote hardware, software, graphical entities, etc., which abound in this sort of corpus. Since we do not know beforehand which pairs in our specialized corpus convey the semantic link for which we are looking, the experiments have been designed to identify any combination in which the verb can qualify as a realization verb. The general idea is to provide terminologists with a list of potentially interesting pairs. However, not all pairs will necessarily be encoded in a dictionary. For instance, in our corpus on computing, both naviguer (Eng. surf) and utiliser (Eng. use) were found in combination with Internet as potential realization candidates. However, terminologists would probably retain only naviguer as an instance of Real1 for Internet, even though utiliser means, in this context, almost the same thing and appears frequently in the corpus. Our automatic method should be able to extract both candidates.

N-V pairs sharing a specific semantic link can appear in different syntactic structures. For example, the pair exécuter + commande (Eng. execute + command), which is a pair we aim at capturing in these experiments, can be found in the following sentences. In all instances, exécuter means “to cause the command to function” (Caus1Fact0). Our method should be able to identify all these syntactic patterns.

· Commande is the first complement of exécuter:
  Si aucun de vos fichiers ne s’appelle LOTUS.COM, exécutez cette commande avec le nom de vos fichiers.
  Pour exécuter cette commande correctement et pour passer au répertoire Lotus.

· Commande is subject of exécuter (in a passive construction):
  Avec l’opérateur, la commande command2 n’est exécutée que si le code d’erreur de la commande command est non nul, ou, autrement dit, si cette commande ne s’est pas exécutée correctement.
  La commande depmod a est exécutée automatiquement lors de l’installation des modules du noyau.

· Commande shares an indirect syntactic link with exécuter:
  o Commande referred to by a relative pronoun:
    Le fichier AUTOEXEC.BAT comprenant un ensemble de programmes ou commandes qui seront exécutées séquentiellement d’une façon automatique.
  o Commande referred to by a personal pronoun:
    Cette commande a pour conséquence de créer l’arborescence des répertoires des sources de GCC dans le répertoire où elle a été exécutée.
    Pour qu’elle soit exécutée, une commande doit se terminer par un retour [Enter].
  o Commande is the head of a prepositional phrase containing exécuter:
    display type [commande] où display est le nom du display à utiliser, type est le type de serveur X (serveur local ou distant), et commande est la commande à exécuter pour lancer le serveur s’il est local.

In order to capture all valid N-V pairs (given the syntactic diversity mentioned above and other reasons such as the large number of proper nouns and command names in the corpus), no syntactic analysis is performed on the corpus. We believe a prior syntactic analysis would reduce the number of cases identified by ASARES. For instance, the indirect syntactic links would be discarded, although such links are very frequent, especially in realization pairs in which N is the third actant of V.

4. Methodology

4.1 The corpus

The French corpus used in our experiments is composed of more than 50 articles from books or Web sites specialized in computer science; all of them were published between 1988 and 2003. It covers different computer science sub-domains and comprises 600,000 words. Table 3 describes the size and the distribution of the texts of the corpus according to these sub-domains.

  Corpora                     Number of texts    Size of corpora (number of occurrences)
  Computer Science Basics            8                 116,821
  Internet                          12                 102,972
  Software                           4                  78,412
  Hardware                           5                  41,816
  Programming and Networks          11                  38,909
  Operating Systems                 13                 221,104
  Total                             53                 600,034

Table 3: Size and composition of the corpus

Segmentation, Part-of-Speech tagging and lemmatization have been carried out using the tool CORDIAL, a commercial product of Synapse-Développement. Each word is accompanied by its lemma and Part-of-Speech tag (noun, verb, adjective). Also, the tool indicates inflection (gender and number for nouns, tense and person for verbs) and gives partial syntactic information (head-modifier) for noun phrases. For the reasons mentioned in the preceding sub-section, no further syntactic analysis is performed.

4.2 Inductive Logic Programming and ASARES

As was previously said, our extraction system ASARES is essentially based on Inductive Logic Programming (ILP) (Muggleton and De Raedt 1994).[2] This learning technique allows us to infer rules that can be used afterwards as extraction patterns. In this section, we first give readers some basic background on ILP, and then we argue that this machine learning technique is well suited to our corpus-based acquisition task. For a more detailed presentation of ASARES, in particular for all aspects concerning computational efficiency, logical background and expressiveness of the inferred patterns, refer to Claveau et al. (2003).

4.2.1 General background on ILP and notations

ILP is a technique at the intersection of two domains: logic programming (of which Prolog is the best-known representative language) and machine learning. It produces (that is, infers) general rules or hypotheses (Horn clauses) that explain a concept using, on the one hand, sets of positive and negative examples of this concept and, on the other, known information called Background Knowledge. Hereafter, the set of positive examples is called E+, the set of negative ones E-, and the Background Knowledge B. The rules are obtained by generalizing the positive examples with respect to the Background Knowledge. Negative examples are used to prevent an over-generalization: the rules produced must not cover (i.e., explain) these negative examples, or at least must not cover most of them (some noise can be allowed). A hypothesis language LH is also provided to the ILP algorithm; it is used to define precisely the form of the inferred rules. Thus, this language ensures that users obtain only well-formed rules that are relevant to a specific learning task. The set of rules produced by ILP, which forms the classifier, is indicated by H.

The main advantage of ILP is its relational capabilities. Indeed, this learning technique is often used to solve problems in which examples cannot be described by sets of standard attribute-value tuples. The classifier H produced by ILP also benefits from this relational nature since the rules inferred are Prolog clauses. The expressiveness in input (example description) and in output (inferred rules) makes ILP suitable and attractive for many real learning problems.

Some conditions are imposed on this learning process; they represent the logical framework of this learning technique. The first two, listed below, deal with the data (⊨ represents logical entailment, ⊭ its negation, and □ means false):
· a priori consistency ensures that negative examples do not contradict information from the Background Knowledge, that is: B ∧ E- ⊭ □;
· a priori necessity expresses the fact that the information known is not sufficient to explain the examples, which is noted B ⊭ E+.
With these preconditions, the ILP algorithm tries to produce the set of rules H such that the two following conditions are verified:
· a posteriori consistency imposes that no contradiction exists between B, H and E-, which is noted: B ∧ H ∧ E- ⊭ □;
· completeness ensures that the produced hypotheses plus the Background Knowledge actually explain the positive examples, that is: B ∧ H ⊨ E+.

With respect to these settings, ILP aims at inferring rules that comply with LH and that cover most of the positive examples and no or only a few negative ones. These rules are searched for within hypothesis spaces containing all the possible rules. Although restricted by LH, the search space remains huge and even infinite in some cases. Fortunately, hypotheses can be organized by a generality relation that makes it possible to scan the spaces efficiently by means of an operator called a refinement operator. The choice of a hypothesis in these search spaces is based on a score function Sc, which usually depends on the number of positive and negative examples that are covered by the hypothesis.

[2] The use of symbolic machine learning in natural language processing (NLP) is becoming widespread. ILP, thanks to its expressiveness and its flexibility, has been used in various tasks such as Part-of-Speech tagging, syntactic analysis, or natural language querying of specialized databases (see Cussens and Džeroski (2000) for a more complete overview of the research in this domain).

4.2.2 Relevance of ILP for our application

For the task we are dealing with in this article, the concept we want to learn is what distinguishes, according to the context in which it appears, a realization N-V pair from other pairs. Thus, our ILP learning process should provide us with a classifier able to detect realization N-V pairs within sentences, but it should also be able to give terminologists some insights about general and corpus-specific linguistic patterns conveying realization links. ILP appears to be a relevant technique in this context, since its rules can be directly interpreted as linguistic extraction patterns and offer the expressiveness of the first-order logic on which it relies. This expressiveness is also exploited to describe the examples, that is, the sentences in which realization N-V pairs appear (for the positive examples) and other N-V pairs (for the negative examples). Indeed, this description step would be impossible to carry out with a less expressive learning technique, mainly because the number of attributes is variable, depending on the number and nature of words occurring in the context of an example. Moreover, the possibility of adding information via the Background Knowledge also allows us to easily make the most of the hierarchical structure (see Section 4.3.2) of the Part-of-Speech tags. Finally, ILP makes it possible to manage noisy data, which is essential for our task since a few tagging errors cannot be avoided.

The ILP software we use in our experiments is ALEPH, a Prolog implementation written by Srinivasan (2001). The way it operates can be roughly described by the following algorithm.

ALEPH algorithm. Until E+ is empty, iterate:
· pick randomly a positive example e+ from E+;
· define a hypothesis search space EH from e+ with respect to LH;
· search EH for the clause h that maximizes the score function Sc;
· add h to the set H and prune the positive examples covered by h from E+.
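As a rough Prolog rendering of this covering loop, the sketch below is our own illustration, not ALEPH’s actual code; best_clause/2 (the search of EH for the best-scoring clause) and covers/2 (the coverage test) are hypothetical stand-ins for ALEPH’s internals.

:- use_module(library(apply)).  % for exclude/3

% Greedy covering loop sketched from the ALEPH algorithm above; our own
% illustration. best_clause/2 and covers/2 are assumed, not ALEPH's code.
induce([], []).                              % E+ empty: learning is over
induce([Seed|Rest0], [H|Hs]) :-
    best_clause(Seed, H),                    % best-scoring clause built from the seed
    exclude(covers(H), [Seed|Rest0], Rest),  % prune the positives covered by H
    induce(Rest, Hs).

For simplicity the sketch always picks the first remaining positive example as the seed, whereas ALEPH, as described above, picks one randomly.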

4.3 Learning extraction patterns for realization pairs with ASARES

ILP makes it possible to produce rules that characterize the positive examples of a concept as opposed to the negative ones. In our case, we want to be able to discriminate realization N-V pairs from non-realization ones with respect to their contexts in our corpus. Nonetheless, as explained in Sections 3.2 and 3.3, realization relationships actually cover different semantic and syntactic configurations. Our first task (presented in sub-section 4.3.1) is to decide what concepts we want to learn and to gather the necessary (positive and negative) examples. The next sub-section describes how these examples and their contexts are encoded in ASARES. Sub-section 4.3.3 describes the format of the inferred rules as they are used in the remainder of this paper.

4.3.1 Acquisition process: three experiments

To infer extraction patterns, ASARES needs a set of positive examples (E+) and a set of negative examples (E-) of the elements we want to retrieve. In our case, E+ must thus be composed of (PoS-tagged) sentences containing valid N-V pairs; conversely, E- must be composed of sentences containing non-valid N-V pairs. While the acquisition of such example sets is tedious and usually carried out manually, the originality of our work lies in the fact that E+ and E- are obtained automatically. To produce these examples, we use the existing entries of the DiCoInfo we are currently developing. These entries are thus a kind of bootstrap in our acquisition process. More precisely, every N-V pair in which V is described in the database as a realization verb for N is included. Then, all sentences in our corpus containing this N-V pair are considered as positive examples and added to E+. Note that we do not check if each occurrence of the N-V pair actually shares the target semantic link, or even a syntactic link, in the sentences that are extracted. Thus, we assume that most sentences containing the N-V pair are valid. Actually, some of the examples in E+ might be incorrect, but, as we have said in Section 4.2.1, ASARES tolerates a certain amount of noise in the data. Moreover, in order to make the most of our pattern inference system, we divided our task of retrieving realization pairs into three sub-tasks. Each sub-task focuses on the retrieval of a specific kind of realization N-V pair: in each sub-task, N is a different actant (1st, 2nd or 3rd) of V. Thus, Experiment 1 aims at inferring and using patterns describing realization N-V pairs in which N is the first actant of V. The LFs identified in this experiment are given in the first part of Table 2. Similarly, Experiment 2 focuses on realization N-V pairs in which N is the second actant of V, that is, the LFs described in the second part of Table 2. Finally, the realization pairs in which N is the third actant are studied in Experiment 3. By learning the realization N-V pair patterns separately with respect to the actantial role of N, we expect to provide ASARES with regular examples and thus obtain homogeneous and meaningful patterns. In practice, the three experiments only differ according to the sets E+ and E- fed into ASARES. Indeed, in Experiment 1, E+ is only composed of sentences containing N-V pairs that correspond to entries in the DiCoInfo in which the verb is encoded as a realization verb having the noun as its first actant. As was said in Section 3.3, the actual syntactic link between N and V can vary: N can be the subject of V in the active form or its complement in the passive form; it can also be found in other syntactic positions (including indirect syntactic links). The negative examples are simply sentences containing N-V pairs that are noted as realization pairs but with N being the second or third actant of V. A sketch of this bootstrapping step is given below.
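The following Prolog fragment is a minimal sketch of the bootstrapping just described; dico_realization/3 (a DiCoInfo entry), sentence_words/2 (a PoS-tagged corpus sentence) and has_lemma/2 (a lemma test) are hypothetical predicates of our own, not the paper’s actual implementation.

% Hypothetical sketch of the example-gathering step; the three auxiliary
% predicates below are assumed representations, not from the paper.
positive_example(Actant, SentId, N, V) :-
    dico_realization(N, V, Actant),   % DiCoInfo: V realizes N, role Actant
    sentence_words(SentId, Words),    % a PoS-tagged sentence of the corpus
    has_lemma(Words, N),              % the sentence contains the noun...
    has_lemma(Words, V).              % ...and the verb (no syntactic check)

% Negative examples: realization pairs with a different actantial role.
negative_example(Actant, SentId, N, V) :-
    dico_realization(N, V, Other),
    Other \= Actant,
    sentence_words(SentId, Words),
    has_lemma(Words, N),
    has_lemma(Words, V).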

It is worth noting that our methodology does not allow us to take into account the different senses of lexical units when retrieving sentences containing N-V pairs taken from the DiCoInfo. Therefore, from the point of view of our learning process, some N-V pairs could be encoded with several LFs. In particular, in some cases, a certain pair is encoded with two LFs belonging to two different groups of realization LFs (as defined in Table 2) due to the polysemy of N or V. A quite regular case of alternation in French is due to ergative verbs (see the example with démarrer in Section 3.2). In such cases, we cannot decide whether pairs should be used as positive or negative examples; thus, they are not used in our experiments.

Of course, the entire process described in the previous paragraphs is applied in the two other experiments: sentences containing N-V pairs encoded in the DiCoInfo as being realization pairs in which N has the desired actantial role are used as positive examples, while sentences with realization pairs but not the right actantial role for N count as negative examples. Finally, Table 4 sums up the number of positive and negative examples in each experiment.

                                      Number of positive examples    Number of negative examples
  Experiment 1 (N 1st actant of V)                81                           2188
  Experiment 2 (N 2nd actant of V)              1197                           2665
  Experiment 3 (N 3rd actant of V)               142                           1661

Table 4: Number of positive and negative examples for the 3 experiments

4.3.2 Encoding of examples

The sentences containing pairs considered as positive or negative examples in each of the three learning processes are transformed into Prolog facts in order to be used by ASARES. These facts are part of the Background Knowledge and will be used to infer the rules. Positive and negative examples are described in the same way. For example, the sentence containing the realization N-V pair message-envoyer, “ce service permet d'envoyer des messages HTML” (Eng. “this service makes it possible to send HTML messages”), is transformed into realization(m1029_7,m1029_5) (in which m1029_7 and m1029_5 are unique identifiers of N and V), which is added to the set E+. The following information is then added to B (for reading convenience, the word described by each group of facts is given in a comment on the right):

tags(m1029_1,tc_pds). lemma(m1029_1,"ce"). sentence_beginning(m1029_1).   % ce
tags(m1029_2,tc_ncmp). noun_ph_modifier(m1029_2). pred(m1029_2,m1029_1).  % service
tags(m1029_3,tc_vindp3s). pred(m1029_3,m1029_2).                          % permet
tags(m1029_4,tc_prep). lemma(m1029_4,"de"). pred(m1029_4,m1029_3).        % d'
tags(m1029_5,tc_vinf). pred(m1029_5,m1029_4).                             % envoyer
tags(m1029_7,tc_ncmp). noun_ph_head(m1029_7). pred(m1029_7,m1029_5).      % messages
tags(m1029_8,tc_ncms). noun_ph_modifier(m1029_8). pred(m1029_8,m1029_7).
sentence_end(m1029_8).                                                    % HTML
precedes(m1029_5,m1029_7).
distances(m1029_7,m1029_5,0,0).

In this example, pred(x,y) indicates that the word y appears just before the word x in the sentence, the predicate tags/2 gives the PoS tag of a word, sentence_beginning/1 and sentence_end/1 identify the words occurring at the beginning and the end of the sentence, precedes/2 indicates whether N occurs before V in the sentence or the contrary, the predicate lemma/2 gives the lemma of grammatical words, and distances/4 specifies the number of words and of verbs occurring between N and V in the sentence. During this process, some word categories are not taken into account. This is the case for determiners and some adjectives, which are considered as irrelevant for characterizing patterns detecting realization N-V pairs.

In addition, information about the PoS tags of the corpus is given in the Background Knowledge B. In particular, the hierarchical organization of these PoS tags is retranscribed through Prolog predicates. For example, the fact that a word tagged with tc_verb_pl is a verb in the active plural form, and can be considered more simply as a conjugated verb, or even more generally as a verb or as a verb in the active form, is noted in B by:

conjugated_plural_verb(W) :- tags(W, tc_verb_pl).
conjugated_verb(W) :- conjugated_plural_verb(W).
verb(W) :- conjugated_verb(W).
active_verb(W) :- tags(W, tc_verb_pl).
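To make the effect of this hierarchy concrete, here is a small runnable usage sketch (our own illustration; the word identifier w42 is hypothetical, not from the paper’s corpus):

% Our own illustration: one tags/2 fact makes queries succeed at every
% level of generality of the hierarchy above (w42 is hypothetical).
tags(w42, tc_verb_pl).

check :-
    conjugated_plural_verb(w42),  % most specific reading
    conjugated_verb(w42),
    verb(w42),                    % most general reading
    active_verb(w42),
    write('w42 is recognized at every level of the tag hierarchy'), nl.

An inferred rule can therefore mention a word at whatever level of abstraction (verb, active_verb, conjugated_verb, ...) best fits the training examples.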

These predicates (along with many others describing each possible way to consider a tagged word) are the basic elements that will compose our patterns, in which the variables refer to words (see the next section for an example of a pattern and Section 5.1 for a presentation of the patterns inferred in our experiments).

4.3.3 Form of the inferred rules

The rules inferred, which will now be used as extraction patterns, are similar to the one presented below:

realization(N,V) :-
    precedes(V,N),
    active_verb(V),
    no_verb_between(N,V),
    suc(N,P),
    proper_noun(P).

This rule specifies that a pair, composed of a noun N and a verb V, is considered as a realization pair if:
· V precedes N in the sentence;
· V is in the active form;
· there is no verb between N and V; and, finally,
· N is followed by a proper noun.

The rules inferred during the learning process form our classifier, and can be used as extraction patterns to retrieve new N-V pairs from our corpus. For example, the rule presented above allows us to retrieve realization N-V pairs such as case-cocher in the sentence “l'utilisateur coche la case TCP/IP” (Eng. “The user checks the TCP/IP box”) or serveur-démarrer in “la commande /etc/rc.d/init.d/smb start démarre le serveur Samba” (Eng. “The /etc/rc.d/init.d/smb start command starts the Samba server”).
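The body predicates of such rules are not defined in this article; as a rough indication of how they could be realized on top of the pred/2 encoding of Section 4.3.2, here is a minimal sketch. These definitions are our own assumptions, not ASARES’s actual code.

% Hypothetical definitions of the rule's body predicates on top of the
% pred/2 encoding (pred(X,Y): Y appears just before X). Our own sketch.
suc(X, P) :- pred(P, X).           % P immediately follows X

after(X, Y) :- pred(Y, X).         % Y occurs somewhere after X
after(X, Y) :- pred(Z, X), after(Z, Y).

% No verb lies strictly between A and B, whatever their relative order.
no_verb_between(A, B) :-
    \+ ( ( after(A, Z), after(Z, B)
         ; after(B, Z), after(Z, A) ),
         verb(Z) ).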

In the remainder of the article, the rules will be expressed using standard linguistic terminology. For example, the rule presented above will be written thus: V active form + (any token but a verb)* + N + proper noun. Note that in this representation, a token means any word or punctuation symbol; the superscript near the parentheses indicates quantifiers: ? means 0 or 1 occurrence, * means any number of occurrences (including 0), and 3-6 means at least 3 and at most 6 occurrences.
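To show how such a linguistic notation relates back to executable Prolog, the DCG below is our own minimal sketch of the pattern V active form + (any token but a verb)* + N + proper noun over a list of word identifiers (under the encoding assumptions of Section 4.3.2; noun/1 is assumed to exist analogously to verb/1). It is an illustration, not the form ASARES actually manipulates.

% Our own DCG sketch of:
%   V active form + (any token but a verb)* + N + proper noun
realization_pattern(N, V) -->
    [V], { active_verb(V) },    % a verb in the active form...
    non_verbs,                  % ...any tokens except verbs...
    [N], { noun(N) },           % ...the candidate noun...
    [P], { proper_noun(P) }.    % ...immediately followed by a proper noun

non_verbs --> [].
non_verbs --> [W], { \+ verb(W) }, non_verbs.

anything --> [].
anything --> [_], anything.

% A pair is retrieved if the pattern matches anywhere in a sentence:
% ?- phrase((anything, realization_pattern(N, V), anything), Words).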

5. Evaluation

This section is dedicated to the evaluation of our task of extracting realization N-V pairs with ASARES following the modus operandi described above. This evaluation is carried out with a view to achieving two different objectives. The first one, presented in the next sub-section, is to verify that the inferred patterns are meaningful and that they possibly highlight interesting corpus-based structures conveying realization relationships. The second part of this evaluation focuses on the efficiency of the acquisition process, that is, the quality of the N-V pairs actually retrieved by the patterns. This quality is measured in terms of recall (i.e., most of the N-V pairs that should have been found are retrieved) and in terms of precision (most of the N-V pairs retrieved are actually realization pairs with the desired actantial relationship).

5.1 Inferred patterns

The patterns that have been inferred during the learning process are listed in the following three tables. Each table presents some general and alternative patterns produced by each experiment and illustrates them with examples. General patterns are defined as the most basic patterns that were discovered (for example, the most basic pattern for a realization link when the noun is first actant and surface syntactic subject is: N + V conjugated active form). Alternative patterns are variations that have been found for basic ones.

Table 5 lists the general and alternative patterns inferred when the noun in the pair is first actant. The table is divided into three parts according to the surface syntactic relationship between N and V. The first part presents the general and alternative patterns found when N is the subject of the verb, which is the typical surface position in which we expect the noun as first actant. The second part shows other general and alternative patterns discovered for another regular syntactic structure: in this case, N is the first complement, but V is in the passive form. Finally, the third part lists the patterns found for indirect syntactic links.

As can be seen in Table 5, a relatively small number of patterns (11) have been inferred by the first experiment, especially when compared to those produced by the other experiments. Most rules discovered are regular and can probably apply to other corpora. In addition to rules describing standard active structures, some rules covering passive structures have been generated by ASARES, which reflects the important use of passive sentences in technical corpora. Only two different patterns have been produced for indirect syntactic links. Finally, one pattern is corpus-specific (i.e., N head of phrase + noun + V active form). This sort of pattern, showing the frequent use of appositive nouns in our corpus (e.g., environnement système, fichier texte), has also been found in the other experiments.

N is the subject of V
  General pattern: N + V conjugated active form
  Examples:
    Étant donné que les serveurs tournent sur des plateformes UNIX...
    Un système démarrant avec une version précédente ne verra pas ce type de partition.
  Examples of alternatives:
    · N head of phrase + noun + V active form
    · subordination conjunction + N + (any token but a verb)* + V active form
    · verb infinitive form + N + any token but a verb + V active form
    · N head of phrase + V present participle active form
    · “permettre” + “à” + N head of phrase + “de” + (any token but a verb)* + V active form

N is the first complement of V (passive form)
  General pattern: V + (any token)? + “par” + N
  Examples:
    C'est le cas pour les graveurs, qui seront pilotés par les logiciels de gravage.
    Ce fichier contient la liste de tous les symboles du noyau, il est utilisé par le système.
  Examples of alternatives:
    · “il” + V passive form + any token but a verb + any token but a verb + N
    · noun + V + any token + “par” + N head of noun phrase

Indirect syntactic links
  General pattern: N head of noun phrase + any token + “qui” + V active form
  Example:
    Vous avez un système qui plante à tout instant... il est peut-être temps de passer à Linux.
  Example of alternative:
    · N head of phrase + (any token but a verb)* + “qui” + V active form

Table 5: Inferred patterns for N as actant 1

Our second experiment yielded 36 different rules. Table 6 gives some of the patterns produced when the noun is second actant. As in the previous table, this one is organized according to the surface syntactic relationships between the verb and the noun. The first part presents the general and alternative patterns found for the typical surface structure in which N as second actant is expected, i.e., as first complement. The second part lists general and alternative patterns discovered for another regular syntactic structure: N is the subject, but V is in the passive form. Finally, the third part lists the patterns found for an indirect syntactic link. Our second experiment yielded an impressive number of rules in which the noun and the verb share a syntactic link, as well as a few rules for indirect syntactic links. Again, most rules apply to regular structures that we expected to find for nouns as second actant. Hence, they would probably apply to other corpora. As in the previous experiment, a number of rules cover passive structures. In this experiment, however, the number of rules for active and passive structures is equivalent, which shows that realization relationships for nouns as second actants are frequently found in technical corpora in passive structures. Also, a number of rules cover cases where the noun is modified by an appositive noun.

N is the first complement of V
  General patterns: V active form + N
                    pronoun + V + spatial or temporal preposition + N
  Examples:
    Il faut évidemment choisir l’option French
    xf86config demande [sic] alors de saisir un nom de variante pour le clavier choisi.
    Les requêtes de DNS font partie des requêtes les plus courantes lorsque l’on navigue sur Internet.
  Examples of alternatives:
    · V active form + (any token but a verb)* + N + proper noun
    · preposition + V + (any token but a verb)* + N + participle
    · V + “sur” + (any token but a verb)? + N
    · V + any token but a verb + N + noun
    · “en” + V participle + (any token but a verb)* + N head of phrase

N is the subject of V (passive form)
  General patterns: N + V passive form
                    N + V participle
  Examples:
    Ainsi, si l’on trouve l’expression /3 pour les heures, la commande sera exécutée toutes les trois heures.
    D’ailleurs la règle essentielle, même pour un ordinateur utilisé par une seule personne, est de toujours créer un compte utilisateur normal et de ne jamais travailler sous le compte root.
  Examples of alternatives:
    · N head of phrase + V passive form
    · “de” + N + (any token but a verb)* + V past participle
    · N + proper noun + V passive form
    · V + “à” + (any token but a verb)? + N
    · N + noun + (any token but a verb)* + V passive form
    · N head of phrase + any token + V past participle
    · proper noun + V + (any token)? + preposition + N

Indirect syntactic links
  General patterns: N + “à” + (any token but a verb)* + V
                    N + “que” + any token + V
                    N + relative pronoun + (any token)? + V passive form
  Examples:
    timidity fichier où fichier est le nom du fichier MIDI à lire.
    En_résumé, le shell utilise les variables d’environnement du système pour gérer ses propres variables, et permet de les exporter vers l’environnement d’exécution qu’il communique aux commandes qu’il lance.
    Le fichier AUTOEXEC.BAT comprenant un ensemble de programmes ou commandes qui seront exécutés séquentiellement d’une façon automatique.
  Examples of alternatives:
    · preposition + N + preposition + V active form
    · V + “à” + (any token but a verb)* + N head of phrase
    · N + (any token but a verb)* + “qui” + V passive form
    · N + “à” + (any token)? + V

Table 6: Inferred patterns for N as actant 2

Finally, our third experiment produced 39 different patterns. Table 7 lists some of the patterns discovered when the noun is third actant. This table is organized in two parts. The first part displays the rules inferred when the noun is the second complement of the verb in active and passive structures. The second part shows the rules produced for indirect syntactic links. The number of rules produced in this experiment is very high, especially considering that they were inferred from a small number of examples (142). This highlights the diversity of syntactic structures in which realization links can be found for nouns as actant 3, but it also indicates that our learning process was unable to detect strong regularities in the examples. This may lead to patterns with poorer predictive quality and to lower extraction results.

N is the second complement of V (active and passive forms)
  General patterns: V + noun + preposition of manner + N
                    V passive form + (any token but a verb)* + preposition of manner + N
  Examples:
    Vous pourrez alors redémarrer la machine avec la commande suivante.
    Les fichiers portant des suffixes réservés DOS (.EXE, .COM, .BIN, etc.) ne peuvent être traités avec les commandes TYPE. SYNTAXE :
  Examples of alternatives:
    · V + (any token but a verb)* + “sur” + N head of phrase
    · V + any token + noun + (any token)* + preposition of manner + N
    · noun + V + (any token but a verb)* + “en” + N head of phrase
    · V past participle + “de” + N
    · V past participle + “à” + N
    · V + (any token)? + preposition + N head of phrase + spatial or temporal preposition
    · V past participle + preposition of manner + N
    · V passive form + (any token but a verb)* + “de” + N head of phrase
    · V passive form + (any token but a verb)* + “à” + N head of phrase
    · V + “à” + N head of phrase + noun head of phrase
    · V + “avec” + (any token but a verb)* + N + spatial or temporal preposition
    · V + preposition + (any token but a verb)* + N + “dans”
    · V + (any token)? + “en” + N
    · V + noun head of phrase + (any token but a verb)* + preposition + N head of phrase
    · pronoun + V + “à l'aide de” + N
    · V + noun head of phrase + any token but a verb + N head of phrase + spatial or temporal preposition

Indirect syntactic links
  General patterns: N + (permet|servir|concourir|contribuer|aider|empêcher) + preposition + V
                    N head of phrase + (any token but a verb)* + “faire” + V infinitive
  Example:
    Cette commande permet de lancer le démon pppd sur le port_série COM1
  Examples of alternatives:
    · N head of phrase + (any token but a verb)* + (concevoir|employer|faire|mettre en oeuvre|utiliser) + “pour” + V
    · N head of phrase + (any token but a verb)* + (permet|servir|concourir|contribuer|aider|empêcher) + preposition + V
    · N head of phrase + (permet|servir|concourir|contribuer|aider|empêcher) + (any token but a verb)* + V

Table 7: Inferred patterns for N as actant 3

5.2 Methodology for evaluation

In order to evaluate the quality of the extracted N-V pairs, we are interested in two different measures. The first one, the recall rate (formally defined below), expresses the completeness of the set of retrieved N-V pairs, that is, how many valid pairs are found with respect to the total number of pairs which should have been found. The second measure, the precision rate (also defined below), indicates the reliability of the set of retrieved N-V pairs, that is, how many valid pairs are found with respect to the total number of retrieved pairs. These two rates were evaluated using a test set containing the information presented below.

5.2.1 The test set: 10 specific terms

We constructed a test set with ten domain-specific terms: commande (Eng. command), configuration, fichier (Eng. file), Internet, logiciel (Eng. application), option, ordinateur (Eng. computer), serveur (Eng. server), système (Eng. system), and utilisateur (Eng. user). The terms have been identified as the most specific to our corpus by a terminology extraction system called TermoStat, developed by Drouin (2003). The ten most specific nouns have been produced by comparing our corpus of computing to the French corpus Le Monde, composed of newspaper articles (Lemay et al. 2005). Note that, to prevent any bias in the results, none of these terms were used as positive examples during the pattern inference step. (They were removed from the example sets.)

It is well worth noting that the terms selected belong to different semantic classes. Some terms denote artefacts (computer, server, and system in some contexts); one denotes an animate entity (user); configuration denotes an activity in some contexts and a result in others, and so on and so forth. Some terms will more easily combine with realization verbs. In addition, the verbs associated with the terms differ from one semantic class to another.

5.2.2 Context annotation

For each of these 10 nouns, a manual identification of every N-V pair occurring in the corpus was carried out. Terminologists were asked to analyze the highlighted N-V pairs in the sentences in terms of semantic and syntactic relationships. First, the syntactic link was analyzed. The syntactic annotation involves checking if there is a syntactic link (even an indirect one, refer to Section 3.3) between N and V. If no syntactic link was observed, then the contexts did not undergo further analysis. If a syntactic link was observed, terminologists indicated the actantial role of N with respect to V (actant 1, 2 or 3). Contexts with valid syntactic links were then subjected to a semantic analysis. The analysis entailed distinguishing realization meanings from others. Hence, three different annotations were performed: a) potential realization meaning; b) another type of interesting semantic link that could be encoded in a specialized dictionary; c) no interesting link. Examples of the annotated sentences are given in Table 8. In the evaluation, a pair is considered as valid if at least one of its occurrences has the desired semantic relationship (cf. Section 3.1) and the actantial role of N with respect to V is pertinent for the experiment.

Example: MemCheckBoxInRunDlg autorise les utilisateurs à exécuter un programme 16 bits dans un processus VDM (Virtual DOS Machine) dédié (non partagé).
Comment: Realization relationship: “user runs a program”; N is 1st actant of V.

Example: Ensuite, le fichier AUTOEXEC.BAT sera exécuté s’il existe.
Comment: Realization relationship: “execute a file”; N is 2nd actant of V.

Example: Les autres options permettent d’arrêter le travail en cours, de le suspendre, de désactiver l’imprimante pour les travaux suivants, et inversement de relancer les travaux d’impression sur cette file.
Comment: Realization relationship: “deactivate the printer with an option”; N is 3rd actant of V.

Example: Exécutée sous le compte root, la commande suivante permet d'ajouter un utilisateur et de définir son mot de passe.
Comment: Other semantic relationship: “agent adds users”; N is 2nd actant of V.

Example: Ces commandes sont essentielles pour utiliser le système, mais elles sont rébarbatives et peu d'utilisateurs acceptent de s'en contenter.
Comment: No (interesting) semantic relationship; N is 1st actant of V.

Example: Mais côté utilisateur, plus on a entouré son Macintosh de périphériques, plus grands sont les risques de rencontrer des blocages.
Comment: No (interesting) semantic relationship; no syntactic relationship.

Table 8: Examples of test set annotations

5.3 Results

5.3.1 Evaluation metrics

To compare the results obtained by our technique with the analysis carried out manually, we use the traditional precision/recall approach. Thus, we applied the patterns to the corpus and kept all the pairs retrieved when N is one of the ten specific nouns. The results of the comparison are summarized with the help of confusion matrices like the one presented in Table 9 (A means actual, Pr predicted, T true, F false, P positive and N negative; S is the total).

                                          LF sought    Not the LF sought    Total
  Predicted in the relation sought         TP(s)         FP(s)              PrP(s)
  Predicted not in the relation sought     FN(s)         TN(s)              PrN(s)
  Total                                    AP            AN                 S

Table 9: Generic confusion matrix

It is important to note that the values in this confusion matrix depend on a parameter: a detection threshold, noted s. Indeed, a single occurrence of an N-V pair matched by the patterns is not sufficient for the pair to be considered valid; the threshold s represents the minimal number of occurrences that must be detected for a pair to be considered valid. The recall and precision rates (respectively R and P), measured on our test set, are thus defined according to s:

Recall rate: R(s) = TP(s)/AP
Precision rate: P(s) = TP(s)/PrP(s)

As a baseline, the graphs we present also give the density, computed as AP/S, which represents the precision that would be obtained by a system deciding randomly whether a pair is valid or not. As far as we know, no existing method can extract these precise LFs with respect to the actantial role of N; therefore, no comparison can be made with other methods for the first three experiments. It would have been particularly interesting to compare our results with those obtained by Wanner (2004) and Wanner et al. (2005); however, the LFs sought, the domains chosen for the experiments and the methodologies devised (which, in Wanner's case, involve an external resource) differ too much for such comparisons to be meaningful. Also, for every experiment, we indicate the recall and precision rates that maximize the f-measure, defined by f(s) = 2R(s)P(s)/(R(s)+P(s)). This measure is the harmonic mean of R and P; it can be viewed as a way to choose the threshold s giving the best compromise between recall and precision.
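As an illustration, the sketch below (our own, not the evaluation code actually used in the experiments; the input names detected and gold are hypothetical) computes R(s), P(s) and f(s) for every observed threshold, together with the density baseline:

    def evaluate_by_threshold(detected, gold):
        """Return {s: (R(s), P(s), f(s))} for every observed threshold s.

        `detected` maps each retrieved N-V pair to the number of its
        occurrences matched by the patterns; `gold` is the set of pairs
        annotated as valid in the test set.
        """
        ap = len(gold)  # AP: actual positives in the test set
        results = {}
        for s in sorted(set(detected.values())):
            predicted = {p for p, n in detected.items() if n >= s}     # PrP(s)
            tp = len(predicted & gold)                                 # TP(s)
            r = tp / ap if ap else 0.0                                 # R(s) = TP(s)/AP
            p = tp / len(predicted) if predicted else 0.0              # P(s) = TP(s)/PrP(s)
            f = 2 * r * p / (r + p) if r + p else 0.0                  # harmonic mean of R and P
            results[s] = (r, p, f)
        return results

    def density(gold, all_pairs):
        """AP/S: the precision of a system deciding randomly."""
        return len(gold) / len(all_pairs) if all_pairs else 0.0

    # The optimal compromise reported below is the threshold maximizing f(s):
    # best_s = max(results, key=lambda s: results[s][2])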

5.3.2. Results of Experiments 1, 2 and 3

Each experiment is evaluated separately. For example, let us consider Experiment 1. ASARES is used to extract realization pairs in which N is the 1st actant of V by applying to our corpus the inferred patterns presented in Table 5. The pairs covered by the patterns in which N is one of the ten terms sought are retrieved, along with the number of occurrences detected. These results are compared with our test set: all the pairs annotated as valid in which N is the 1st actant are considered valid pairs, while the others are not (thus, even realization pairs with another actantial role for N are not considered valid; see the sketch below). The same method holds for Experiments 2 and 3. Figures 1, 2 and 3 present the recall-precision graphs obtained from these comparisons. Table 10 indicates the optimal recall/precision compromise (i.e., the one maximizing the f-measure) reached in each experiment.
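A minimal sketch of this per-experiment validity test, reusing the hypothetical annotation records shown after Table 8, could look as follows:

    def is_valid(occurrences, actant_wanted):
        """True iff at least one annotated occurrence of the pair conveys
        realization with N as the actant targeted by the experiment."""
        return any(
            occ.semantic_label == "realization" and occ.actant == actant_wanted
            for occ in occurrences
        )

    # For Experiment 1 (hypothetical `test_set`: pair -> list of occurrences):
    # gold_1 = {pair for pair, occs in test_set.items() if is_valid(occs, 1)}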

Figure 1: Recall-Precision graph for Experiment 1

Figure 2: Recall-Precision graph for Experiment 2

Figure 3: Recall-Precision graph for Experiment 3

As indicated by the Density line in each experiment, the proportion of realization pairs in which N is the 1st, 2nd or 3rd actant of V among all the N-V pairs is low. This gives a good indication of the difficulty of the task. In this context, our system performs quite well in all three experiments; nonetheless, results vary from one experiment to another. In particular, results in Experiment 2 are very good, which is confirmed by the significant gap between its optimal f-measure and the two others reported in Table 10. Indeed, one quarter of all the realization pairs in which N is the 2nd actant of V is retrieved without a single non-valid pair. Conversely, the recall-precision curves show that results in Experiment 3 are lower than in the other two. This can be explained by the fact that a third actant of a verb may appear in a wide range of structures, sometimes very far from the verb, which makes its automatic detection harder than for actants 1 and 2. This hypothesis is indirectly confirmed by the number of patterns that were inferred, presented in Section 5.1.

                    Recall     Precision    f-measure
    Experiment 1    37.70%     52.27%       0.4381
    Experiment 2    55.13%     63.24%       0.5890
    Experiment 3    63.04%     36.71%       0.4640

Table 10: Optimal Recall/Precision compromises

5.3.3. All realization links

The goal of the last experiment is to evaluate the results of ASARES in extracting realization pairs when no distinction is made between the actantial roles of N. Thus, all the rules obtained in the three preceding experiments are pooled and then applied to the corpus. The N-V pairs retrieved are then compared with our test set; in this case, every N-V pair annotated as indicating realization, regardless of the actantial role of N with respect to V, is considered valid. Figure 4 reports the recall-precision graph obtained, as well as the density. For comparison purposes, the figure also contains the recall-precision graph obtained with a common numerical technique used for collocation acquisition: the Loglike coefficient (Dunning 1993).

Figure 4: Loglike and ASARES recall-precision graphs

The Density line, as in the other experiments, is very low. That this line represents the worst-case precision of an extraction system is clear, since the Loglike method tends towards it when R is close to 100%. In comparison, ASARES's results are much better; indeed, for a given recall rate, the difference in precision between the numerical approach and ASARES can reach up to 46%. As a matter of fact, the optimal compromise between recall and precision is R = 75.45% and P = 52.20% for ASARES (f-measure = 0.6171), but only R = 29.09% and P = 72.73% (f-measure = 0.4156) for the Loglike-based system.
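For reference, the baseline score is Dunning's (1993) log-likelihood ratio, computed over the 2x2 contingency table of an N-V pair. The sketch below is our own illustration of that statistic in its usual formulation, not the exact implementation used here:

    import math

    def _k_log_term(k, n):
        # k * log(k / n), with the convention 0 * log 0 = 0
        return k * math.log(k / n) if k > 0 else 0.0

    def loglike(k11, k12, k21, k22):
        """Log-likelihood ratio (G2) for the 2x2 contingency table of an
        N-V pair: k11 = contexts with both N and V, k12 = N without V,
        k21 = V without N, k22 = neither."""
        n = k11 + k12 + k21 + k22
        # G2 = 2 * (sum of cell terms - row terms - column terms)
        return 2 * (_k_log_term(k11, n) + _k_log_term(k12, n)
                    + _k_log_term(k21, n) + _k_log_term(k22, n)
                    - _k_log_term(k11 + k12, n) - _k_log_term(k21 + k22, n)
                    - _k_log_term(k11 + k21, n) - _k_log_term(k12 + k22, n))

Ranking the N-V pairs by this score and varying a threshold over it produces a recall-precision curve such as the Loglike curve of Figure 4.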

5.4. Discussion of the results

In this sub-section, we take a closer look at the extraction results. We first detail, in a qualitative way, the different causes of errors, that is, the N-V pairs wrongly retrieved. We then examine the special case of N-V pairs sharing a semantic link other than realization.

5.4.1. Causes of errors

a) Tagging errors: First, some errors are due to tagging mistakes. For example, in the sentence "la première solution est d'utiliser la commande date" (Eng. "the first solution is to use the date command"), the noun date was incorrectly tagged as a verb, and the N-V pair commande-dater was retrieved by one of the inferred patterns. Even if this kind of error does not stem directly from our acquisition technique and does not call our approach into question, it is still a factor to take into account, especially when choosing the tagger and assessing the quality of the texts composing our corpus.

b) N and V are not syntactically related: In a few cases, there is no (direct or indirect) syntactic link between N and V, as is the case with logiciel-garantir (software-guarantee) in "cette phase garantit la vérification du logiciel" ("this step guarantees the verification of the software"). These errors are quite rare in the retrieved pairs examined, especially in Experiments 2 and 3. Nonetheless, our symbolic extraction system could certainly be enhanced with information about the syntactic function of nouns (subject, direct object, etc.). The learning algorithm could then incorporate this information and produce more relevant patterns.

c) N has an actantial position other than the one desired: In very rare cases in Experiments 1, 2 and 3, ASARES retrieves N-V pairs sharing a syntactic relation, but in which N does not have the actantial role desired. In such cases, even if the pair actually conveys realization, it is considered an error. Thus, here again, adding syntactic information to our acquisition process could help prevent this kind of error.

d) N and V are syntactically related but there is no semantic link: Some N-V pairs are retrieved although there is no semantic link between N and V, or at least no semantic link that a terminologist would encode in a dictionary. This is the case for ordinateur-savoir (computer-know) in "l'ordinateur ne sait pas où chercher" ("the computer does not know where to look"). Again, Part-of-Speech information is not always sufficient to distinguish semantically related pairs from non-semantically (but syntactically) related ones.

e) There is an interesting link between N and V, but not the realization link sought in these experiments: Other very frequent errors are caused by the fact that there actually is an interesting semantic link between N and V in a retrieved pair, but not the realization link we are looking for. Indeed, some nouns belonging to specific semantic classes often co-occur with verbs expressing realization meanings (and thus often appear in valid N-V pairs), while others do not. For example, nouns like ordinateur (computer) and utilisateur (user) often appear in valid N-V pairs, whereas other nouns clearly do not appear in combinations with realization verbs (e.g., configuration, Internet).

The last errors listed above clearly illustrate some limitations of our symbolic approach.[3] In fact, they tend to show that our method could be enhanced if it incorporated richer linguistic information: morpho-syntactic information is not always sufficient to make fine-grained sense distinctions. For example, the following two sentences have the same sequence of Part-of-Speech tags: "vous pouvez utiliser la commande exit" (Eng. "you can use the exit command") and "vous devez corriger le fichier lilo.conf" (Eng. "you must correct the lilo.conf file"); in the first, the underlined N-V pair is valid, whereas this is not the case in the second. Realization verbs for fichier would be activer (Eng. activate) and exécuter (Eng. execute), for example. Here again, these subtle distinctions could be handled by our symbolic method provided that some semantic information is given about nouns. This information could be supplied by a semantic tagger.

[3] But these errors are also very frequent in numerical approaches, since cooccurrence alone is not enough to capture subtle semantic distinctions.

5.4.2. Other interesting collocations found and related work

As stated previously, many errors (item e) in the previous section), that is, N-V pairs wrongly extracted by ASARES, are actually semantically related pairs. Thus, even if they do not convey a realization meaning, many of these pairs are interesting for the terminologist and can be encoded in a terminological database with the help of lexical functions. Table 11 contrasts realization relationships with some examples of other interesting semantic relationships between nouns and verbs encoded in our dictionary under the term fichier (Eng. file).

Realization meanings:
    Prepar1Fact0    charger un fichier en mémoire
    Prepar1Real1    ouvrir un fichier
    Real1           manipuler, modifier, éditer un fichier
    FinReal1        fermer un fichier
    Caus1Fact0      exécuter un fichier

Other interesting relationships:
    Caus1Func0                           créer un fichier
    Caus1Func2                           stocker un fichier dans un répertoire
    Caus1Func0 +                         copier, dupliquer un fichier
    Caus1Func0 de manière automatique    générer un fichier
    Liqu1Func0                           supprimer, détruire, effacer un fichier

Table 11: Realization and other relationships encoded under fichier (Eng. file)

In order to illustrate the importance of these non-realization semantically related pairs in our results, we recomputed the recall and precision rates for each of the previous experiments, this time considering as valid every semantically related pair (while still taking into account the actantial role of N for the first three experiments). These new results represent the performance of ASARES when used as a standard collocation extraction system. Figures 5, 6, 7 and 8 give, for each of the preceding experiments, the new recall-precision graph and density obtained; the previous graphs, which consider only the realization N-V link, are also reproduced. These graphs clearly show that, in most of our experiments, several N-V pairs wrongly considered realization pairs are actually semantically related pairs that terminologists would regard as interesting candidates for inclusion in a dictionary. In this respect, the relatively poor results of Experiment 3, mentioned previously, seem to be explained by the fact that ASARES cannot efficiently distinguish realization N-V pairs from other semantically related pairs. The last figure, where no distinction is made in the actantial position of N, is also telling: ASARES extracts 60% of all the realization N-V pairs in the corpus with a precision of about 50%, while it extracts 60% of all the semantically related pairs with a precision of about 90%.

Figures 5-6-7-8: Recall-precision graphs considering every semantically related pair

From a practical point of view, this kind of error is not harmful, since the retrieved pairs exhibit relationships that terminologists may well want to collect and encode in specialized dictionaries. From a technical point of view, it is quite clear that our approach, chiefly based on PoS tags, cannot, without refinement, completely capture the particularities of a realization link. Semantic information has to be added to the patterns to allow ASARES to distinguish between realization pairs and other semantically related pairs. Such semantic resources could easily be added to the learning process and thus be used in the patterns. Ideally, these resources would come directly from the corpus, but the experiments reported in Wanner (2004) and Wanner et al. (2005) suggest that external general semantic resources could also be used. Finally, it is worth noting that the very good results obtained by considering every semantically related pair valid are very similar to those obtained with ASARES on the qualia acquisition task (see Section 2.2). This is not surprising since, as stated previously, qualia relationships actually cover a broad range of semantic relationships.

6. Conclusion

We have presented an original corpus-based method for acquiring noun-verb collocations and classifying them according to the semantic link between their components. We focused on N-V pairs in which verbs convey a realization meaning and based our classification on lexical functions (LFs). In the experiments presented in this paper, the noun-verb pairs were acquired from a French domain-specific corpus of computing. Our acquisition method, which relies on ASARES, an extraction pattern inference technique, produces results that are quite good and would be useful in a practical terminological setting. It can even be considered an improvement on classical collocation extraction techniques (chiefly based on statistics), since it outperforms them.

In addition to the overall good performance of our method in terms of recall and precision, one of its strengths lies in the fact that it provides users (in this case, terminologists) with interpretable patterns. These patterns are interesting clues to how verbs and nouns that share a specific semantic link actually combine in corpora. Most patterns are general, and we believe they could apply to corpora of different natures, but the method also highlights corpus specificities that would be overlooked by general rules. Finally, patterns are learned with respect to the actantial role of the noun, allowing terminologists to work on smaller sets of related collocations.

These experiments suggest many possibilities for future work. Concerning the acquisition process, some adaptations could certainly improve the results, which are currently limited by the sole use of Part-of-Speech tags and noun-phrase information. As mentioned previously, syntactic and semantic information could be added to the corpus through tagging and parsing. These additions could help overcome some limitations of our symbolic approach to capturing the nature of N-V relationships. However, such improvements, and especially the use of semantic information, are costly and could lead to a less portable and flexible method. In addition, the use of syntactic analyses could result in a failure to identify some valid pairs, especially those in which the syntactic link is indirect.

It would also be interesting to apply our technique to different corpora for finding realization pairs or other semantically related N-V pairs, and to compare, on the one hand, the rules inferred during the learning process and, on the other, the results yielded by applying them to a different corpus. In terms of applications, a similar technique could be used to acquire other, more specific semantic links between nouns and verbs, and even between nouns and nouns or other categories of words: in other terms, to find other subsets of lexical functions. These semantic relationships would allow us to complete the description of the terminological units contained in our dictionary of computing. The comparison of the acquisition results and of the inferred patterns could lead to interesting insights.

7. References

Binon, J., S. Verlinde, J. Van Dick and A. Bertels (2004). Dictionnaire d’apprentissage du français des affaires. Dictionnaire de compréhension et de production de la langue des affaires. Paris: Didier.

Bouillon, P., V. Claveau, C. Fabre and P. Sébillot (2001). Acquisition of Qualia Elements from Corpora – Evaluation of a Symbolic Learning Method. In Proceedings of the 1st International Workshop on Generative Approaches to the Lexicon, GL'01, Geneva, Switzerland.
Church, K. and P. Hanks (1989). Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16(1), pp. 22-29.
Claveau, V., P. Sébillot, C. Fabre and P. Bouillon (2003). Learning Semantic Lexicons from a Part-of-Speech and Semantically Tagged Corpus Using Inductive Logic Programming. Journal of Machine Learning Research 4 (special issue on ILP), pp. 493-525.
Cohen, B. (1986). Lexique de cooccurrents. Brossard (Québec): Linguatech.
Cussens, J. and S. Džeroski (eds.) (2000). Learning Language in Logic. Lecture Notes in Artificial Intelligence. Berlin: Springer-Verlag.
Dunning, T. E. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), pp. 61-74.
Fontenelle, T. (1997). Turning a Bilingual Dictionary into a Lexical-Semantic Database. Tübingen: Max Niemeyer.
Garcia, D., N. Aussenac-Gilles and A. Courcelles (2000). Exploitation pour la modélisation des connaissances causales repérées par COATIS dans les textes. In Ingénierie des connaissances : évolutions récentes et nouveaux défis. Paris: Eyrolles.
Goldman, J.-P., L. Nerima and E. Wehrli (2001). Collocation Extraction Using a Syntactic Parser. In Proceedings of the ACL'01 Collocations Workshop. Toulouse, France, pp. 61-66.
Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discovery. Dordrecht: Kluwer Academic Publishers.
Grimes, J. (1990). Inverse Lexical Functions. In J. Steele (ed.), Meaning-Text Theory: Linguistics, Lexicography and Implications. Ottawa: Ottawa University Press, pp. 350-364.
Hausmann, F. J. (1979). Un dictionnaire des collocations est-il possible ? Travaux de linguistique et de littérature 17(1), pp. 187-195.
Hearst, M. (1998). Automatic Discovery of WordNet Relations. In WordNet: An Electronic Lexical Database. Cambridge (USA): MIT Press, pp. 131-151.
Kilgarriff, A. and D. Tugwell (2001). WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography. In Proceedings of the ACL'01 Collocations Workshop. Toulouse, France, pp. 32-38.
L’Homme, M.-C. (2004). Sélection des termes dans un dictionnaire d’informatique : comparaison de corpus et critères lexico-sémantiques. In Proceedings of the Euralex International Congress. Lorient, France, pp. 583-593.
Lemay, C., M.-C. L’Homme and P. Drouin (2005). Two Methods for Extracting Specific Single-word Terms from Specialized Corpora: Experimentation and Evaluation. International Journal of Corpus Linguistics 10(1).
Lin, D. (1998). Extracting Collocations from Text Corpora. In First Workshop on Computational Terminology, Computerm 1998, Montréal, Canada, pp. 57-63.
Manning, C. D. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge (USA): The MIT Press.
Mel’čuk, I. (1996). Lexical Functions: A Tool for the Description of Lexical Relations in a Lexicon. In Wanner, L. (ed.), Lexical Functions in Lexicography and Natural Language Processing. Amsterdam/Philadelphia: John Benjamins, pp. 37-102.

Mel’čuk, I. (1998). Collocations and Lexical Functions. In Cowie, A. P. (ed.), Phraseology: Theory, Analysis and Applications. Oxford: Clarendon Press, pp. 23-53.
Mel’čuk, I. et al. (1984-1999). Dictionnaire explicatif et combinatoire du français contemporain. Recherches lexico-sémantiques I-IV. Montréal: Les Presses de l’Université de Montréal.
Mel’čuk, I., A. Clas and A. Polguère (1995). Introduction à la lexicologie explicative et combinatoire. Louvain-la-Neuve (Belgique): Duculot / Aupelf-UREF.
Morin, E. (1998). PROMÉTHÉE, un outil d’acquisition de relations sémantiques entre termes. In Actes de la conférence Traitement Automatique du Langage Naturel. Paris, France.
Muggleton, S. and L. De Raedt (1994). Inductive Logic Programming: Theory and Methods. Journal of Logic Programming 19-20, pp. 629-679.
Pearce, D. (2002). A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC'02, Las Palmas de Gran Canaria, Spain.
Polguère, A. (2003). Collocations et fonctions lexicales : pour un modèle d’apprentissage. In F. Grossmann and A. Tutin (eds.), Les collocations. Analyse et traitement. Coll. Travaux et recherches en linguistique appliquée. Paris: Éditions De Werelt, pp. 117-142.
Pustejovsky, J. (1995). The Generative Lexicon. Cambridge (USA): The MIT Press.
Smadja, F. (1993). Retrieving Collocations from Text: Xtract. Computational Linguistics 19(1), pp. 143-197.
Srinivasan, A. (2001). The ALEPH Manual.
Wanner, L. (ed.) (1996). Lexical Functions in Lexicography and Natural Language Processing. Amsterdam/Philadelphia: John Benjamins.
Wanner, L. (2004). Towards Automatic Fine-Grained Semantic Classification of Verb-Noun Collocations. Natural Language Engineering 10(2), pp. 95-143.
Wanner, L., B. Bohnet, M. Giereth and V. Vidal (2005, forthcoming). The First Steps Towards the Automatic Compilation of Specialized Collocation Dictionaries. Terminology 11(1).

Appendix A. Entry in the DiCoInfo

ORDINATEUR 1 (Eng. COMPUTER)

Actantial structure: ordinateur utilisé par agent{utilisateur 1} pour intervenir sur patient{tâche 1; ressource 1}

Terme plus général (Syn)* [More general term]: APPAREIL, SYSTÈME, MACHINE
Spécifique (Spec)** [Hyponym]: CLIENT, SERVEUR, MICRO-ORDINATEUR, MINI-ORDINATEUR, PC, PORTABLE, ~ DE BUREAU
Intersection de sens (Syn) [Co-hyponym]: PÉRIPHÉRIQUE
Partie du mot clé (Part)*** [Part of the key word]: CARTE, DISQUE DUR, UNITÉ DE STOCKAGE, BUS, LECTEUR, MÉMOIRE, PROCESSEUR, UNITÉ CENTRALE DE TRAITEMENT, UNITÉ CENTRALE
Nom de l’agent (S1) [Name of the typical agent]: UTILISATEUR
Nom du patient (S2) [Name of the typical patient]: DONNÉES, APPLICATION, LOGICIEL, TÂCHE
On crée le mot clé (CausFunc0) [Someone creates the key word]: On CONÇOIT l’~
L’agent prépare le mot clé (Prepar1) [The first actant prepares the key word]: l’agent CONFIGURE l’~
Nom (S0Prepar1): CONFIGURATION de l’~ par l’agent
Le mot clé commence à fonctionner (IncepFact0) [The key word starts to function]: l’~ DÉMARRE
Nom (S0IncepFact0): DÉMARRAGE de l’~
Le mot clé fonctionne (Fact0) [The key word functions]: l’~ TOURNE
Le mot clé intervient sur le patient (Fact2) [The key word operates on the second actant]: l’~ TRAITE, EXÉCUTE, LANCE le patient
Nom (S0Fact2): TRAITEMENT, EXÉCUTION, LANCEMENT du patient par l’~
Le mot clé cesse de fonctionner (FinFact0) [The key word stops functioning]: l’~ PLANTE, tombe en panne
Nom (S0FinFact0): PLANTAGE de l’~
L’agent fait fonctionner le mot clé (Caus1Fact0) [The first actant causes the key word to function]: l’agent DÉMARRE l’~, l’agent INITIALISE l’~
Nom (S0Caus1Fact0): DÉMARRAGE de l’~ par l’agent, INITIALISATION de l’~ par l’agent
L’agent utilise le mot clé (Real1) [The first actant uses the key word]: l’agent UTILISE l’~
Nom (S0Real1): UTILISATION de l’~ par l’agent
Le patient utilise le mot clé (Real2) [The second actant uses the key word]: le patient TOURNE sur l’~
L’agent met fin au fonctionnement du mot clé (Liqu1Fact0) [The first actant stops the functioning of the key word]: l’agent ÉTEINT l’~
Le mot clé a ce qu’il faut pour bien fonctionner (Able1BonFact0) [The key word respects all conditions to function properly]: ~ PERFORMANT

* Two explanations of the relationship between the term and the collocate are provided. The first, formal one (based on lexical functions, in parentheses) is used by terminologists when adding the entries to the dictionary. The second (in square brackets) is written in natural language and based on a proposal by Polguère (2003).
** This function was proposed by Grimes (1990).
*** This function was proposed by Fontenelle (1997).

Appendix B. Lexical functions cited in this article and their explanations

Each entry gives the lexical function, an English example in parentheses, and its explanation.

CausiAble1Fact0 (activate an icon): Cause the key word to be able to function.
CausiDe_nouveauAble1Fact0 (repair a file): Cause the key word to be able to function properly again.
CausiDe_nouveauFact0 (restart a PC): Cause the key word to function again.
CausiFact0 (open a session): Cause the key word to function.
Caus1Func0 (develop a program): Cause the key word to exist.
Caus1Func2 (copy a file into a directory): Cause the key word to be in its second actant.
Caus1Non-Able1Fact0 (damage a computer): Cause the key word not to be able to function properly.
De_nouveauIncepFact0 (a PC restarts): The key word starts functioning again.
Fact0 (a program runs): The key word functions.
Facti (a programmer writes a program): The key word performs a typical action on its first actant.
FinFact0 (a computer crashes): The key word stops functioning.
FinReal1 (quit/exit an application): Stop using the key word.
Func0 (a crash happens): Support verb when the key word is first actant.
Funci (a configuration includes …): The key word has one of its actants.
IncepFact0 (a computer starts): The key word starts functioning.
IncepFacti (the Web user connects to the Internet): The key word starts performing a typical action on one of its actants.
LiquiAble1Fact0 (deactivate a function): Cause the key word to be no longer able to function.
LiquiFact0 (cancel a command): Cause the key word to stop functioning.
Liqu1Func0 (delete a file): Cause the key word to stop existing.
Magn (high-density): Intensifier.
Labrealij (load a program into memory): An actant of the key word uses another actant of the key word to perform a typical action on the key word.
Operi, Reali (surf the Internet): One of the actants of the key word uses the key word.
PrepariFact0 (type a command): Prepare the key word to function.
PrepariReali (connect to the Internet): Prepare to use the key word.