Adding syntactico-semantic information to specialized dictionaries: an application of the FrameNet methodology Marie-Claude L’Homme Observatoire de linguistique Sens-Texte (OLST) Université de Montréal C.P. 6128, succ. Centre-ville Montréal (Québec) H3C 3J7 [email protected]

Abstract. This contribution presents a corpus-based methodology for incorporating syntactico-semantic information in specialized dictionaries. It is argued that the annotation of predicative units (especially verbs) and their participants helps lexicographers and terminologists validate intuitions about the meaning and syntactic behaviour of lexical units or terms. The first part of the article presents these arguments based on the literature and selected annotation projects. The second part of the article focuses on our specific method for collecting contexts in which terms appear and their annotation. A short description of the two databases in which the annotation is implemented (a dictionary of computing and the Internet, DiCoInfo; and a dictionary on the environment, DiCoEnviro) is also given. Our method, which is largely based on the one developed within the FrameNet project (Ruppenhofer et al. 2010), is then presented. Finally, a few lines for future work are listed. Keywords: syntactico-semantic annotation, terms, actantial structure, actant, semantic role, DiCoInfo, DiCoEnviro, FrameNet

1. Introduction

An increasing number of lexical databases provide detailed information on the syntactic and semantic properties of lexical units (LUs) (e.g. DiCouèbe 2012; FrameNet 2012; PropBank 2012; VerbNet 2012; cf. Section 2.2). Unfortunately, this does not apply to terminological databases, as few include rich linguistic information. Since the focus has usually been placed on concepts and their explanation, terminological databases and specialized dictionaries usually limit themselves to providing definitions and links to regular conceptual relationships such as hyperonymy and meronymy.

The project described in this paper [1] aims to fill this gap and proposes a methodology for annotating terms in sentences based on that developed within the FrameNet project (Ruppenhofer et al. 2010). Predicative units (especially verbs) and their participants (actants and circumstants) are characterized both syntactically (syntactic groups and functions are specified) and semantically (each participant is labelled with a semantic role). The method is implemented in two online resources: the first one – DiCoInfo – contains terms related to the fields of computing and the Internet; the second one – DiCoEnviro – lists terms in the field of the environment (mainly climate change). Both resources include English, French and Spanish terms. [2]

This contribution summarizes parts of our annotation project, namely its objectives and methodology. In Section 2, assumptions found in the literature and similar projects are briefly reviewed. Section 3 gives a short description of the two specialized resources to which an annotation module was added. Section 4 is dedicated to the objectives our annotation is designed to fulfil. Section 5 gives the details of the methodology we devised and lists some differences that can be found when comparing it with some of the choices made within the FrameNet project. Finally, a few concluding remarks and lines for future work are given in Section 6.

2. Annotation and semantic roles

Our annotation method and other projects to which I refer throughout this contribution are based on the assumption that the relation between a predicative lexical unit (i.e. a lexical unit that requires actants in order to account for its meaning) and its participants [3] can be

[1] The project is carried out at the Observatoire de linguistique Sens-Texte (OLST), Université de Montréal.
[2] The DiCoInfo can be accessed at the following URL: http://olst.ling.umontreal.ca/cgi-bin/dicoinfo/search.cgi. The DiCoEnviro can be found at the following URL: http://olst.ling.umontreal.ca/cgi-bin/dicoenviro/search_enviro.cgi
[3] Participants is used as a generic term that covers actants (or arguments) and circumstants (or adjuncts). The terminological distinction between actant and circumstant is based on Mel’čuk (2004). Actants are obligatory and are a part of an LU’s meaning; circumstants can appear in sentences but are

characterized – on the semantic level – in terms of semantic roles or other compatible labeling systems. [4] This section presents some assumptions that have been stated in the literature about semantic roles and a few implementations in lexicography as well as in natural language processing.

2.1 Assumptions about semantic roles

Semantic roles have been debated in linguistic circles for several decades. One of the most quoted contributions in the area is that of Fillmore (1968) who devised a “Case grammar” to account for the deep relation between a predicate and its arguments, thereby capturing generalizations that surface syntax representations can miss and providing a mapping between surface and deep syntactic structures. For instance, in the following sentences, the “deep” relation between John and break, on the one hand, and between a hammer and break, on the other, remains the same regardless of the syntactic functions of the noun phrases:

John (Agent) broke the window.
A hammer (Instrument) broke the window.
John (Agent) broke the window with a hammer (Instrument) (Fillmore 1968: 42).

The difference between the cases instantiated by John and a hammer can be validated by the fact that the two noun phrases cannot be combined in a normal sentence: *John and hammer broke the window (Fillmore 1968: 42). Case grammar is also assumed to represent relations that are valid across languages unlike case systems that are unique to certain languages (e.g. Greek, Russian). Fillmore’s (1968) original list consisted of six cases (Agentive, Instrumental, Dative, Factitive, Locative and Objective). However, up to now, there is no general agreement on the necessity and linguistic validity of the notion of “semantic role”. In addition, even among those who assume that roles are essential, there is considerable variation between what is believed to be a finite number of roles and the labels used to identify them.

optional and are not an integral part of the LU’s meaning.
[4] I use semantic role, but other expressions appear in the literature and are mentioned in this contribution: thematic roles, argumental roles. In Frame Semantics (cf. Section 2.1), Frame Elements (FEs) is preferred.

In the late 1970s, Fillmore himself proposed a new framework called Frame Semantics that – although not completely disconnected from his original Case grammar – does display some important differences. I will not go into the details of the theory here since they are outside the scope of the article. However, I will mention some of the aspects that are relevant for the annotation method presented herein.

Central to Frame Semantics (Fillmore 1982) is the notion of “Frame”, defined as a conceptual structure against which lexical units sharing conceptual properties can be defined. A Frame describes a situation (e.g. that of a commercial transaction in which a person acquires something from another person in exchange for something, usually money) and lexical units are said to evoke that Frame (the Commercial_transaction frame comprises LUs such as buy, charge, cost, pay, sell, spend and each evokes this frame from a different perspective). A Frame is defined with a set of Frame elements (FEs), which in turn are defined as participants in the situation denoted by the Frame (some FEs are obligatory, others are optional). For instance, the frame labeled Transfer comprises the following Frame elements: Donor, Theme, and Recipient (Fillmore et al. 2003a). The Frame contains LUs, such as give and receive, where the said FEs can be instantiated in different syntactic positions (e.g. in give, the prototypical syntactic function of the Donor will be subject; in receive, the same FE will be instantiated as a complement linked to the verb with the preposition from).

According to Fillmore et al. (2003a), FEs can capture generalizations about meaning that representation systems based on semantic roles, including Case grammar, would miss. If give and receive were described in terms of semantic roles, we would obtain the following:

Give: John (Agent) gives a book (Theme) to Bob (Recipient)
Receive: Bob (Agent) received a book (Theme) from John (Source)

This labeling fails to capture the converse relationship between give and receive. FEs, on the other hand, show explicitly the opposing perspectives on the Frame Transfer (give focuses on the Donor and backgrounds the Recipient; conversely, receive focuses on the Recipient and backgrounds the Donor).

Give: John (Donor) gives a book (Theme) to Bob (Recipient)
Receive: Bob (Recipient) received a book (Theme) from John (Donor)
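The contrast between the two perspectives on the Transfer frame can be sketched as a small data structure. This is only an illustrative sketch: the class `LexicalUnit`, the slot labels and the helper functions are hypothetical names introduced here, not part of FrameNet or of its data format.

```python
from dataclasses import dataclass


@dataclass
class LexicalUnit:
    lemma: str
    frame: str
    # Hypothetical mapping: frame element -> prototypical syntactic slot.
    realization: dict


# Frame elements of the Transfer frame, as listed in the text.
TRANSFER_FES = {"Donor", "Theme", "Recipient"}

give = LexicalUnit("give", "Transfer", {
    "Donor": "subject", "Theme": "object", "Recipient": "complement (to)"})
receive = LexicalUnit("receive", "Transfer", {
    "Recipient": "subject", "Theme": "object", "Donor": "complement (from)"})


def subject_fe(lu):
    """Return the frame element that the LU realizes as its subject."""
    return next(fe for fe, slot in lu.realization.items() if slot == "subject")


def differ_in_perspective(a, b):
    """True when two LUs share a frame and its FEs but foreground different FEs."""
    same_frame = a.frame == b.frame and set(a.realization) == set(b.realization)
    return same_frame and subject_fe(a) != subject_fe(b)


print(subject_fe(give), subject_fe(receive), differ_in_perspective(give, receive))
```

Because both verbs are described with the same three FEs, the converse relationship shows up as a simple difference in which FE occupies the subject slot, which a generic Agent label on both subjects would hide.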

Frame Semantics has been implemented in a lexical resource called FrameNet in which a very detailed annotation method shows how FEs are realized in sentences extracted from corpora and how they interact with the LU evoking a frame. FrameNet is described briefly in the next section.

The argument structures of LUs and their realizations in syntax are central to many other approaches. I mention only one here that can be considered to be compatible with Case grammar and Frame Semantics, namely Corpus Pattern Analysis (Hanks and Pustejovsky 2005). The approach proposes a labelling of the valence of verbs observed in corpora in terms of types and roles. It is assumed that this approach is an efficient means to represent semantic distinctions between verbal meanings. For example, the difference between two of the meanings of grasp, ‘to seize hold of something’ and ‘to understand something’, can be accounted for with the following descriptions:

[[Person=Animate]] ~ [[PhysObj]] (one of the possible valencies)
[[Person 1=Cognitive]] ~ {[[Abstract=Concept]] | [N-clause]} (one of the possible valencies).

The approach is said to be complementary to others: for example, it complements the approach taken in Frame Semantics in the sense that it aims to cover all “normal senses” of a verb whereas Frame Semantics describes Frames and one cannot be sure that all senses of verbs are covered until all frames are described.

2.2 Representation of syntactico-semantic properties of lexical units

Although the notion of “semantic role” was first introduced in syntax, it raised interest in lexicography and natural language processing. In these areas, as well as in syntax, it is increasingly assumed that a formal or semi-formal description of the interface between syntax and semantics is necessary in linguistic description. This section describes some of the projects that implement such representations. [5]

[5] This section focuses on resources that implement the notion of “role” to some extent. Other resources use different labeling systems. For example, PropBank (2012) does not give explicit information on the meaning of arguments and uses a numbering system (Arg1, Arg2, Argn). The DiCouèbe (2012) labels arguments with variables (X, Y, Z) on the semantic level, and with Roman numerals on the deep syntactic level (I, II, III).

One lexical database that is reminiscent of Case grammar as far as semantic roles are concerned is VerbNet (2012). The resource describes verb classes (based on an extension of Levin 1993; Kipper et al. 2006) and details sets of syntactic descriptions that are designed to map the argument structure of verbs. VerbNet uses a set of over 20 roles, called thematic roles (such as Actor, Instrument, Patient), and states for each role selectional restrictions (e.g. animate, substance, place) and a set of syntactic realizations.

Another lexical database that provides detailed information on the mapping between the argument structure (in fact, other participants are also described in detail) and its syntactic realizations is FrameNet. [6] As was said above, FrameNet is an implementation of Frame Semantics (Fillmore 1982) and thus lists within each frame regular participants called core frame elements (FEs) (most of them being frame-specific) and other optional participants, called non-core FEs. For example, the frame Cause_temperature_change (evoked by the following LUs: chill, cool down, heat, warm) is described with the following core FEs: Agent (or Cause), Hot/Cold Source, Item. Non-core FEs include Means, Place, etc. Some of the FEs are further specified with semantic types: the Agent is said to be sentient (FrameNet 2012). Each LU within a frame undergoes further analysis that includes an annotation of sentences in which it appears (Figure 1). The sentences are extracted from the British National Corpus. Two different summaries are generated from these annotations: a table of syntactic realizations (Figure 2) and valence patterns (Figure 3).

[6] There have been few extensive applications of the FrameNet annotation method to specialized resources. Some notable exceptions are the Kicktionary (Schmidt 2009) and JuriDiCo (Pimentel 2012).

429-s20-rcoll-air
1. As [Cause the sun] WARMS (Target) [Item their walls], the air inside becomes hotter than that in the centre of the nest. [Hot_Cold_source INI]
2. Winds will remain light and variable, and this will allow [Cause all that sunshine] to WARM (Target) [Item the air] nicely, temperatures getting up to a comfortable ten or eleven celsius, that's fifty or fifty fahrenheit. [Hot_Cold_source INI]

429-s20-rcoll-hand
1. But each morning, when [Agent the entire team] WARM (Target) [Item their hands] [Instrument on steaming mugs of tea] [Place in the kitchen at Foulrice Farm], spirits will be lifted by thoughts of a date with destiny on March 18. [Hot_Cold_source INI]
2. [Agent He] pinched himself and WARMED (Target) [Item his hands] [Means on the fire], shuddering against the extremes of the cold night air and the warmth of the signal box. [Hot_Cold_source INI]
3. [Agent He] tried to WARM (Target) [Item them] [Instrument with his hands]. [Hot_Cold_source INI]
4. [Agent Donna] WARMED (Target) [Item her hands] [Instrument round her tea] and glanced at her watch again. [Hot_Cold_source INI]
5. [Agent The three of the family] WARMED (Target) [Item their hands]. [Hot_Cold_source INI]

Figure 1: Part of the annotations provided for warm in FrameNet

Figure 2: Part of the syntactic realizations of warm in FrameNet

Figure 3: Part of the valence patterns of warm in FrameNet

Frames are also linked to other frames according to different types of relations. For instance, the Cause_temperature_change frame we just discussed is related to two other frames as shown in Figure 4. The relation between the Cause_temperature_change frame and the Transitive_action frame is one of “inheritance”. The Cause_temperature_change frame is also linked to another frame, i.e. Inchoative_change_of_temperature, by virtue of an “inchoative of” relation.

Figure 4: Relations between the Cause_temperature_change frame and other frames viewed in the Frame Grapher

We will see further (Sections 4.1 and 4.2) that our projects share many assumptions with the theoretical frameworks and resources presented in this section. First, Section 3 describes

the specialized resources to which I refer in the remainder of the contribution.

3. DiCoInfo and DiCoEnviro

Two specialized databases are used in this project: DiCoInfo contains terms related to the subject fields of computing and the Internet; in DiCoEnviro, terms are related to the field of the environment. Terms belong to four parts of speech: noun, verb, adjective and adverb (e.g. DiCoInfo: computer, to browse, virtual, remotely; DiCoEnviro: biodiversity, to warm, human-induced, seasonally). The databases are compiled according to the theoretical and methodological principles of Explanatory Combinatorial Lexicology, ECL (Mel’čuk et al. 1984-1999).

The DiCoInfo currently consists of three language versions: French (approximately 1,000 entries, including 15,000 lexical relationships), English (approximately 800 entries, including more than 4,500 relationships) and Spanish (approximately 100 entries are currently online). In DiCoEnviro, between 150 and 200 entries are online in each language. Entries are written in an XML editor and transformed into HTML when posted on the Web. Entries contain the following data categories (Figures 5 to 7):

• Headword (the lemma and a number used to identify a specific meaning).

• Grammatical information (part of speech, gender for nouns in French and Spanish, transitivity for verbs).

• Status (this number indicates how advanced the writing of the entry is: 2 is the first stage; 0 indicates that the entry is complete).

• In DiCoEnviro, some entries are accompanied by a subject field label. Most terms are related to the topic of climate change, but terms linked to renewable energy were recently added and it appeared necessary to distinguish them from the others.

• Actantial structure: the number of actants; a first label that states the semantic role and a second label (between curly brackets) that indicates the typical term(s) likely to appear in that position.

Figure 5: Entry download in DiCoInfo

• Linguistic realizations of actants (Figure 6): forms in which actants can appear in running text (in entries labelled with a status 0, the list is complete; in others, realizations are still in the process of being added).

• Synonyms and variants.

• Equivalents in other languages (hyperlinked when the entries are available online).

• Contexts (Figure 7): a sample of sentences extracted from corpora is displayed (these are selected among those – between 15 and 20 – that are placed by the lexicographer in the entry). Some of them are annotated (cf. Section 4).

• Lexical relations (Figure 5): a list of terms that are semantically related to the headword along with a short explanation of the relationship (in entries labelled with a status 0, the list is complete; in others, related terms are still in the process of being added). The database provides lists of paradigmatic relationships (near synonymy, antonymy, other parts of speech, etc.) and syntagmatic relationships (i.e. collocations).

Figure 6: Linguistic realizations of actants in DiCoEnviro

Figure 7: Display of contexts in DiCoEnviro

4. Why annotate terms in specialized resources?

Up until recently, our resources DiCoInfo and DiCoEnviro, although stating explicitly the actantial structure of terms with semantic roles and typical terms, lacked specific details on the syntactico-semantic properties of predicative terms and their participants. [7] A module containing annotated contexts has been added to the entries and in an increasing number of them annotations can be visualized by users (Figure 8).

[7] This is due to a decision made by the DiCoInfo and DiCoEnviro team and not to a gap in ECL. The model does specify the syntactic properties of lexical units and their actants (in a data category called Régime). However, in ECL no annotation of sentences is provided and the Régime is an abstract representation of the interface between the actantial structure and deep syntactic realizations, on the one hand, and between deep and surface syntactic realizations, on the other.

Figure 8: Annotated contexts for boot in DiCoInfo

By adding this module, we wanted to:

• Provide users of specialized lexical databases with complete descriptions of the syntactico-semantic properties of terms (displaying combinatorial possibilities between predicative terms, especially verbs, and their actants).

• Take into account, not only actants, but also circumstants, thereby providing a complete picture of the syntactic behavior of the terms recorded in our databases.

• Build a semi-formal resource that can be integrated into NLP applications (these applications could use the syntactic description to find relevant information in running text).

Although the annotation module was first designed to meet specific user needs, we soon realized that annotations would also be extremely useful for lexicographers writing the entries since they provide them with additional data to validate their intuitions about the meanings of terms.

4.2 The annotation module

In our resources annotated contexts indicate (Figure 9):

• The predicative term in capital letters (DOWNLOAD);

• The participants and their nature (actants: you, file, directory; or circumstants: easily);

• The semantic roles of participants (Agent, Patient, Destination, Source, Manner);

• The syntactic function of the participant (subject, object, complement, modifier);

• The syntactic group of the participant (NP, PP, AdvP).

By now you should be able to locate websites on the net, DOWNLOAD files from these websites.
DOWNLOAD the file to your download directory.
Alternatively, the person in charge of the printer server could DOWNLOAD these files and install them using the Additional drivers.
The best services let you upload and DOWNLOAD files easily.

DOWNLOAD (semantic roles: Agent, Patient, Destination, Source, Manner)

Actants
  Subject (NP) (1): You
  Indirect link (NP): Person (2)
  Object (NP) (3): file (3)
  Complement (PP-to): Directory
  Complement (PP-from): Website
Circumstants
  Modifier (AdvP): Easily

Figure 9: Annotated sentences and summary table in DiCoInfo
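The way a summary table can be derived from annotated contexts is sketched below. The sketch assumes a simplified record per participant; the class `Participant`, the function `summarize` and the specific annotation values given to each sentence are illustrative readings introduced here, not the actual DiCoInfo data or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Participant:
    head: str       # lemma of the head word of the phrase
    kind: str       # "actant" or "circumstant"
    role: str       # semantic role, e.g. Agent, Patient
    function: str   # syntactic function, e.g. Subject, Object
    group: str      # syntactic group, e.g. NP, PP-to, AdvP


# Three of the contexts shown for DOWNLOAD, with hypothetical annotations.
CONTEXTS = [
    ("DOWNLOAD the file to your download directory.", [
        Participant("file", "actant", "Patient", "Object", "NP"),
        Participant("directory", "actant", "Destination", "Complement", "PP-to"),
    ]),
    ("Alternatively, the person in charge of the printer server could "
     "DOWNLOAD these files and install them using the Additional drivers.", [
        Participant("person", "actant", "Agent", "Subject", "NP"),
        Participant("file", "actant", "Patient", "Object", "NP"),
    ]),
    ("The best services let you upload and DOWNLOAD files easily.", [
        Participant("you", "actant", "Agent", "Indirect link", "NP"),
        Participant("file", "actant", "Patient", "Object", "NP"),
        Participant("easily", "circumstant", "Manner", "Modifier", "AdvP"),
    ]),
]


def summarize(contexts):
    """Group heads by (kind, function and group), then by semantic role."""
    table = defaultdict(lambda: defaultdict(list))
    for _sentence, participants in contexts:
        for p in participants:
            row = (p.kind, f"{p.function} ({p.group})")
            table[row][p.role].append(p.head)
    return table


summary = summarize(CONTEXTS)
for (kind, realization), roles in sorted(summary.items()):
    for role, heads in roles.items():
        print(f"{kind:12} {realization:22} {role:12} {', '.join(heads)}")
```

Each row of the printed summary pairs a syntactic realization with the role it instantiates and the heads observed in that position, which is the kind of information the table under the contexts is described as listing.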

In the resources, the list of annotated contexts is displayed if users request it under relevant entries. A link labelled “Annotated contexts” appears in the entries where they are available. Different colors and fonts are used to distinguish between the term and its participants, on the one hand, and between actants and circumstants, on the other. A table appears under the contexts: it lists the syntactic functions and groups in which participants can be realized.

4.1 Semantic roles in DiCoInfo and DiCoEnviro

The representation system implemented in both resources is based on semantic roles. Semantic roles allow us to capture regularities and differences between terms that belong to the same semantic field. More specifically, we wanted to represent in a systematic way some phenomena that were recurrent in our databases. I list some of these phenomena below (see also L’Homme and Pimentel 2012).



• Capture the similarities between terms that belong to different parts of speech and share all or parts of their actantial structures as shown in the examples below (Figure 9 shows the relationships between the actantial structures of terms related to the verb download):

pollute1b, vt: Agent{human} or Cause{activity} ~ Patient{area} with Means{gas} (Mining or harvesting these resources can also pollute the soil, water and atmosphere)
polluting1, adj: ~ Means{gas} (Emissions of other polluting gases and particles into the atmosphere can also have large effects)
polluter1, n: Agent{human} is a ~ of Patient{area} (China and India, the ringleaders in this dispute, are the second and the sixth greatest polluters respectively in terms of CO2)
pollutant1, n: Means{gas} is a ~ Patient{air} (Effects of climate change on other air pollutants are less well established)
pollution1b.1, n: ~ of Patient{area} by Agent{human} or Cause{activity} by Means{gas} (Key human influences include changes in greenhouse gas concentrations, stratospheric ozone depletion, local air pollution and alterations in land use)

Figure 9: Graphical visualization of actants with DiCoInfo visuel (Robichaud 2011)



• Capture similarities between synonyms, near synonyms and certain types of antonyms:

connect1: Agent{user} ~ to Destination{network} (A user connects to the Internet)
log on1: Agent{user} ~ to Destination{network} (A new person can log on to the PC)
burn1: Agent{user} ~ Patient{data} on Destination{CD} with Instrument{burner} (You could download the distribution from the Website, burn it on to a CD and install Linux)
read1: Agent{computer, drive} ~ Patient{data} from Destination{storage device} (data is being read from the first disk)
write1: Agent{drive, program, processor} ~ Patient{data} to Destination{storage device} (the optical disk drive is able to both read and write to this disk many times)

• Show semantic distinctions in a specific subject field:

connect1: Agent{user} ~ to Destination{network} (A user connects to the Internet)
connect2: Agent{user} ~ Patient{computer, hardware} to Destination{computer, hardware} (Connect the computer to a hub)
program3: Agent{programmer} ~ Patient{program} in Material{language} (The application was programmed in Java)
program4: Agent{user} ~ Patient{component} (The Interface Card is programmed to read…)



• Represent regular alternations (Levin 1993) in an explicit way. For instance, in causative-inchoative or agent-instrument alternations one actant is omitted and another one changes position.

melt1a: Patient{ice} ~ (Experts predict smaller mountain glaciers could melt)
melt1b: Cause{temperature} ~ Patient{ice} (The heating will further melt ice)
print1a: Instrument{printer} ~ Patient{data} (The printer is printing the file)
print1b: Agent{user} ~ Patient{data} with Instrument{printer} (You can print the document on this laser printer)



• Cross-linguistic relationships: In our resources, we assume that terms with identical actantial structures in different languages are quite probably equivalents, even though some differences may exist at the syntactic level (e.g. inversion of actants, different choices of prepositions in each language). For example, the terms connecter1 (Fr), connect1 (En) and conectar1 (Es) are defined as equivalents and have the same actantial structure.

Fr. connecter1: Agent ~ à Destination (L’internaute se connecte à l’Internet)
En. connect1: Agent ~ to Destination (A user connects to the Internet)
Es. conectar1: Agente ~ a Destino (El usuario se conecta a las gigantescas bases de datos)

The system also allows us to better represent cross-linguistic relationships between polysemous terms (for example, between connect1 and connect2 and their French equivalents). In addition, since French and English terms have near synonyms, the equivalence relationship can be extended to them.

En. connect1: Agent ~ to Destination (A user connects to the Internet)
En. log on1: Agent ~ to Destination (A new person can log on to the PC)
Fr. connecter1: Agent ~ à Destination (L’internaute se connecte à l’Internet)

En. connect2: Agent ~ Patient to Destination (Connect the computer to a hub)
Fr. brancher1: Agent ~ Patient à Destination (Il faut brancher le modem au serveur)
Fr. connecter2: Agent ~ Patient à Destination (Connecter l’imprimante au serveur)
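The idea that identical actantial structures point to probable equivalents can be illustrated with a short sketch. It assumes that role labels have already been normalized to one metalanguage (e.g. Es. Agente mapped to Agent); the dictionary `ENTRIES` and the function `candidate_equivalents` are hypothetical names introduced here, and matching structures only yield candidates that a lexicographer would still need to validate.

```python
# Actantial structures taken from the examples above, keyed by (language, term).
ENTRIES = {
    ("en", "connect1"): ("Agent", "Destination"),
    ("en", "log on1"): ("Agent", "Destination"),
    ("fr", "connecter1"): ("Agent", "Destination"),
    ("en", "connect2"): ("Agent", "Patient", "Destination"),
    ("fr", "brancher1"): ("Agent", "Patient", "Destination"),
    ("fr", "connecter2"): ("Agent", "Patient", "Destination"),
}


def candidate_equivalents(term, entries):
    """Return terms in other languages whose actantial structure is identical."""
    language, _lemma = term
    structure = entries[term]
    return sorted(
        other for other, other_structure in entries.items()
        if other_structure == structure and other[0] != language
    )


print(candidate_equivalents(("en", "connect1"), ENTRIES))
print(candidate_equivalents(("en", "connect2"), ENTRIES))
```

In a full database many unrelated terms would share a structure such as Agent ~ Patient, so a comparison of this kind would presumably be restricted to terms already related semantically; the sketch only shows how the two senses of connect separate once their structures are compared.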

The previous examples show that we share a number of assumptions mentioned in Sections 2.1 and 2.2. However, it is worth pointing out that our list of labels is reminiscent of Case grammar rather than Frame Semantics. There are a number of reasons for this:

• As will be seen in the next section, our method is bottom-up and does not rely on a previous definition of conceptual frames. Hence, our roles attempt to represent the semantic relationship between the predicative unit and the actant and should be used in the actantial structures of several terms throughout the databases (and hence not be frame-specific).

• It can be argued (as it was in Fillmore et al. 2003a, cf. Section 2.1) that semantic roles are not fine-grained enough to capture some semantic relationships between terms. Although this is true, our semantic role labeling is complemented by a typical term and this should help represent in a relatively straightforward manner similarities between semantically related terms. Typical terms are chosen according to the following criteria (L’Homme 2010):

1. they should correspond to lexical units which have been defined as terms in each resource (ideally, they should appear in the word list);
2. they are those most frequently found in sentences extracted from corpora in which both predicative units and actants appear;
3. they correspond to generic terms; this allows us to state only one typical term (or a small set of typical terms) instead of listing a series of possibilities;
4. in the explanations of lexical relationships or in the definition, typical terms should be the most “natural” ones to appear in these contexts.

5. Annotation methodology

Our annotation methodology is based on that devised in the FrameNet project (FrameNet 2012; Ruppenhofer et al. 2010), but is adapted and simplified in order to meet our specific needs. More specifically, it comprises the following steps:

1. Select predicative units that lend themselves to the annotation. During the first stages of the project, we concentrate on verbs.
2. Collect contexts from specialized corpora that were compiled for the purpose of the DiCoInfo and DiCoEnviro projects.
3. Within contexts, annotate the verb and the phrases that are syntactically linked to it. A few indirect syntactic relationships are also taken into account.
4. Distinguish phrases that instantiate actants from those that instantiate circumstants. As was mentioned previously, circumstants were added to our descriptions at a later stage in order to give a complete picture of the combinatorial patterns of predicative terms.
5. Annotate the semantic role of the participant. In many entries, actants were already stated in the actantial structure and lexicographers could use a series of predefined roles (e.g. Agent, Patient, Destination, Instrument) in the annotations. In some cases, however, the annotation led to a slight redefinition of the actantial structure (e.g. the addition of an actant that had not been considered when first creating the entry). In contrast, circumstants were an addition and new roles needed to be created (e.g. Condition, Degree, Direction, Mode).
6. Annotate the syntactic function, syntactic group and head of the phrases.

We provide further details on some of these steps below. Our corpora comprise specialized texts in each field covered by our resources (computing and the Internet; climate change). We prioritize texts written by experts and those that target laymen or students. The size of the corpora varies from one language to the other: it ranges between 1 and 3 million words. [8] The corpora are then queried with an in-house concordancer allowing lexicographers to view KWIC as well as full text concordances. Lexicographers select 15 to 20 contexts for each term and note their origin. When selecting contexts, lexicographers try to find those that reflect the different syntactic and combinatorial patterns of the term together with its actants: active vs. passive; different positions of actants; actants realized or not, etc. It is assumed that this number of contexts is sufficient to capture most combinatorial patterns of terms and their participants in specialized corpora.
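A keyword-in-context (KWIC) view of the kind lexicographers consult when collecting contexts can be approximated with a few lines of code. This is a minimal sketch and not the in-house concordancer mentioned above: the function `kwic`, the regular expression `FORMS` and the sample text are assumptions made for illustration.

```python
import re


def kwic(text, pattern, width=30):
    """Return (left context, keyword, right context) for every match."""
    rows = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that keywords line up in a column when printed.
        rows.append((left.rjust(width), match.group(0), right.ljust(width)))
    return rows


# Hypothetical mini-corpus built from two of the DOWNLOAD contexts.
SAMPLE = ("DOWNLOAD the file to your download directory. "
          "The best services let you upload and DOWNLOAD files easily.")
# Inflected forms of the verb; the pattern is an illustrative simplification.
FORMS = r"\bdownload(?:s|ed|ing)?\b"

hits = kwic(SAMPLE, FORMS)
for left, keyword, right in hits:
    print(f"{left} [{keyword}] {right}")
```

The pattern also retrieves download in "download directory", where it is not a verb, which illustrates why the selection of 15 to 20 contexts remains a manual step carried out by lexicographers.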

[8] Specialized corpora are notably smaller in size than general-language corpora (Bowker and Pearson 2002). It is assumed that the semantic, syntactic and combinatorial behaviour of terms can be captured in corpora of smaller size since terms are likely to occur frequently in specialized texts.

The annotation is carried out in the XML version of the database as shown in Figure 10. A specific schema was designed in order to assist lexicographers when selecting actantial roles, syntactic functions and syntactic groups. In many cases, predefined lists of values appear when annotators enter a specific attribute (for instance, the following list pops up when annotating the name of the syntactic group: Modifier, Prop, Pro, SA, SADV, SN, SP, SV). The schema is designed in French, since it was the first language described in our resources, and is used in English and Spanish. However, a number of adaptations to our annotation rules were necessary (the web versions of the resources were also localized):

• The linguistic metalanguage was translated (e.g. Eng. Patient -> Es. Paciente; Eng. Destination -> Es. Destino; Fr. Modificateur -> Eng. Modifier).

• Auxiliary verbs were added in English and Spanish (e.g. Eng. do; Es. estar, ser).

• We needed to take into account some modals that are specific to some languages (e.g. Es. tener que).

Figure 10: XML annotation of a context with connect
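To give an idea of how predefined lists of values can assist annotators, the sketch below checks the syntactic group of each participant against the list quoted above (Modifier, Prop, Pro, SA, SADV, SN, SP, SV). It is a minimal illustration in Python using the standard `xml.etree.ElementTree` module. The element and attribute names (`context`, `participant`, `type`, `role`, `function`, `group`) and the sample sentence are invented for the example; they do not reproduce the actual schema or the content of Figure 10.

```python
import xml.etree.ElementTree as ET

# Values quoted in the text for the syntactic group attribute.
SYNTACTIC_GROUPS = {"Modifier", "Prop", "Pro", "SA", "SADV", "SN", "SP", "SV"}

# Hypothetical markup: tag and attribute names are assumptions, and the
# annotated sentence ("the user connects the printer to the computer")
# is invented, not taken from the corpora.
SAMPLE = """
<context term="connect">
  <participant type="actant" role="Agent" function="Subject" group="SN">the user</participant>
  <participant type="actant" role="Patient" function="Object" group="SN">the printer</participant>
  <participant type="actant" role="Destination" function="Complement" group="XP">to the computer</participant>
</context>
"""


def invalid_groups(xml_text):
    """Return (phrase, group) pairs whose group is not in the predefined list."""
    root = ET.fromstring(xml_text.strip())
    problems = []
    for participant in root.iter("participant"):
        group = participant.get("group")
        if group not in SYNTACTIC_GROUPS:
            problems.append((participant.text, group))
    return problems


print(invalid_groups(SAMPLE))  # [('to the computer', 'XP')]
```

In an actual XML workflow, this kind of constraint would more naturally be expressed in the schema itself (as an enumeration), which is what makes the list of values appear in the editor.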

When annotating, lexicographers also have at their disposal a guide with a series of annotation rules (L’Homme and Pimentel 2010, still under construction). The guide also contains a table that provides definitions and examples of semantic roles. An example is given in Figure 11 with the role Instrument.

Semantic role: Instrument (Fr. Instrument) (Es. Instrumento)
Definition: The participant used by an agent to carry out the activity expressed by the predicative unit
Actant or circumstant: Actant or circumstant
Most frequent semantic classes of participant: Concrete object (computer, peripheral, aircraft), Program
Most frequent syntactic function:
• Complement of a verb (introduced by through, with, in)
• Subject of a verb (in the case of an agent-instrument alternation with another meaning of this verb)
• Indirect syntactic relationships (Instrument is used to verb)

Terms and examples:
access 2: Then start the PC that (Agent) will be accessing the Internet (Location) through the other one's modem (Instrument)
edit 1: Shoot, I bet if you (Agent) booted up only in the command line and just edited some documents (Patient) with VI (Instrument) you could stretch it out to 2.5 hours.
edit 1: You (Agent) may edit the current page (Patient) in the HTML editor (Instrument) of your choice.
print 1a: Ink jets (Instrument) now print great color photographic images (Patient) and rival laser speed.

Definitions and examples found in the literature:
Instrumental (Fillmore 1968): The case of the inanimate force or object causally involved in the action or state identified by the verb (1968: 24). Example: John (Agentive) opened the door (Objective) with a chisel (Instrumental).
Instrument (FrameNet 2012): An entity directed by the agent that allows them to achieve their goal. Example: With a simple hammer (Instrument) and saw (Instrument), you can accomplish almost anything (Goal).
Instrument (VerbNet 2006): used for objects (or forces) that come in contact with an object and cause some change in them. Generally introduced by a `with' prepositional phrase. Also used as a subject in the Instrument Subject Alternation and as a direct object in the Poke-19 class for the Through/With Alternation and in the Hit-18.1 class for the With/Against Alternation.

Figure 11: Definition of the role Instrument in DiCoInfo and DiCoEnviro

The annotation is still performed manually, but an automatic method was developed for annotation in French (Hadouche et al. 2011a). First results for French are quite encouraging and show that a lot of time can be saved when using the automated method (Hadouche et al. 2011b).

A number of differences can be noticed between our annotation method and the one developed within the FrameNet project. We already mentioned our choice of resorting to semantic roles rather than labels for Frame Elements (cf. Section 4.2). The other main difference is the starting point for each method. We apply a strictly bottom-up approach. Our lexicographers first select terms by means of a technique of corpus comparison (Drouin 2003) and then apply lexico-semantic criteria (L’Homme 2004) to select specialized meanings (meanings that are not relevant for the fields under examination are not taken into consideration). Then, lexicographers make semantic distinctions and collect and annotate contexts. In contrast, FrameNet lexicographers use a chiefly top-down approach. First, frames are identified and defined along with their frame elements (FEs); then, a list of relevant LUs evoking the frames is compiled; and, finally, intuitions about frames, frame elements and LUs are validated based on corpora (Fillmore et al. 2003b).

Furthermore, FrameNet lexicographers make some distinctions that are not taken into consideration in the DiCoInfo and DiCoEnviro projects. In FrameNet, non-core FEs are subdivided into peripheral and extra-thematic. In our resources, circumstants are not subdivided in this way.

At the syntactic level, other differences can be mentioned. FrameNet lexicographers introduce a label for a certain number of non-instantiated FEs. This is not done in our resources. However, some indirect syntactic relationships are taken into account (for instance, actants realized in the form of subjects of modal verbs). Finally, our resources do not provide tables illustrating the valence patterns of terms, but the information on the semantic and syntactic properties of the terms and their participants shows all the syntactic functions and groups in which participants can be found in sentences.

6. Concluding remarks and future work

In this contribution I described a methodology designed to better characterize the syntactico-semantic properties of predicative terms and their participants.
The project has been undertaken in order to provide users with richer syntactico-semantic information about terms. Up to now, in DiCoInfo, approximately 500 verbs have been annotated in French and 300 in English, and we have started annotating Spanish verbs. The DiCoEnviro contains about 200 French entries with annotations, and we have just started the annotation work in English.

I emphasized the fact that a sound methodology based on the annotation of terms in their contexts allows lexicographers to capture semantic regularities and better characterize differences between terms within a language as well as across languages. Although manual annotation is time-consuming, it is extremely useful and leads to more systematic descriptions. The usefulness of annotations for lexicography has already been underlined; our projects show that they are also extremely helpful in terminology work.

Of course, some improvements to this method from a user’s perspective could be considered. First, in the Web versions of the databases, the summary of syntactic functions and groups is not presented in the most intuitive way: a more user-friendly display could easily be envisaged. Secondly, in a bilingual version, an automatic comparison of the syntactic behaviour of equivalents could be envisaged, allowing users to visualize them on a single screen.

I argued that semantic relationships and distinctions (near synonymy, antonymy, alternations, etc.) within specialized fields such as computing and the Internet as well as climate change could be captured elegantly with a system based on semantic roles (ideally characterized with typical terms). The representation system could be extended to the analysis of semantic differences between the two fields. For example, the two databases describe different meanings of environment, vulnerability and stocker (Fr) and could be used to better characterize the differences. The analysis can also be useful to define differences between meanings found in general-language resources and those recorded in a specialized one (this was explored in Pimentel et al. 2012).

I also argued very briefly that a representation system based on semantic roles and the systematic annotation of contexts could assist lexicographers when defining cross-linguistic relationships. This was also discussed in Pimentel and L’Homme (2011), who proposed a typology of equivalence relationships based on the examination of annotated contexts.
However, the procedure described here would benefit from an automated method for assisting lexicographers when annotating. Some experiments were carried out for French and are quite promising. The automated method has not yet been completely implemented, however, and adaptations still need to be made for English and Spanish. An automated method could also be developed in order to check the consistency of actantial structures within a specific language and across languages.
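As a rough illustration of what such a consistency check might look like, the Python sketch below compares the sets of roles recorded in the actantial structures of terms that are listed as equivalents and reports the pairs whose role sets differ. This is not a method implemented in our resources: the function name, the data layout and the toy entries are all invented for the example, and a difference in roles between equivalents is not necessarily an error, only a point a lexicographer may want to review.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (language, term), a hypothetical way to identify entries


def inconsistent_equivalents(
    structures: Dict[Entry, Tuple[str, ...]],
    equivalents: List[Tuple[Entry, Entry]],
) -> List[Tuple[Entry, Entry, List[str]]]:
    """Return equivalent pairs whose actantial structures use different roles."""
    mismatches = []
    for first, second in equivalents:
        roles_first = set(structures[first])
        roles_second = set(structures[second])
        if roles_first != roles_second:
            # Roles present in one structure but not in the other.
            difference = sorted(roles_first ^ roles_second)
            mismatches.append((first, second, difference))
    return mismatches


# Toy data, invented for illustration; not the actual database entries.
structures = {
    ("en", "print"): ("Agent", "Patient", "Instrument"),
    ("fr", "imprimer"): ("Agent", "Patient", "Instrument"),
    ("es", "imprimir"): ("Agent", "Patient"),
}
equivalents = [
    (("en", "print"), ("fr", "imprimer")),
    (("en", "print"), ("es", "imprimir")),
]

for first, second, difference in inconsistent_equivalents(structures, equivalents):
    print(first, second, difference)
```

The same comparison could be run within a single language, for instance between near synonyms, by listing those pairs instead of cross-linguistic equivalents.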

On a more theoretical level, a sounder definition of semantic roles for terminological purposes could be provided based on the work carried out in our databases. In the more recent extension of our method to DiCoEnviro (the annotation work had started in DiCoInfo), we noticed that some semantic roles were used much more frequently (e.g. Cause) and others needed to be introduced (e.g. Degree). These observations could lead to the assumption that some roles might be dependent on the nature of terms found in certain specialized domains. But this assumption still needs to be verified.

Our method could also lead to the discovery of frames or frame-like structures after annotation (and not prior to lexicographical descriptions as in FrameNet (Fillmore et al. 2003b)). For instance, DiCoInfo contains several lexical units related to the idea of “creating” (e.g. write, create, program, develop) and these and others could be grouped into frames. DiCoEnviro contains several verbs that convey a meaning of “reduction” or “increase”. These groups could also help discover distinctions between fields of knowledge and perhaps help identify the main semantic fields in specialized domains.

Finally, another interesting aspect of our project is that, although the specialized databases have been compiled based on the theoretical principles of Explanatory Combinatorial Lexicology (Mel’čuk et al. 1995), we managed to implement, rather elegantly, an annotation methodology taken from a different framework, namely Frame Semantics. This seems to show that, as far as participants are concerned (actants and circumstants, on the one hand, and core and non-core frame elements, on the other), the frameworks are compatible.

Acknowledgments

The work described in this paper is supported by the Social Sciences and Humanities Research Council of Canada (SSHRC). I would like to thank the team of annotators at the Observatoire de linguistique Sens-Texte (OLST) who participated actively in the projects. Special thanks to Janine Pimentel and Carlos Subirats, with whom I had many discussions regarding this project, and to Fadila Hadouche and Guy Lapalme, who collaborated on the definition of the XML schema and worked on automating the method.

References

DiCouèbe 2012: Le DiCouèbe. Dictionnaire en ligne de combinatoire du français. (http://olst.ling.umontreal.ca/dicouebe/). Accessed 8 March 2012.
Drouin 2003: Drouin, P. Term Extraction Using Non-technical Corpora as a Point of Leverage. In Terminology 9(1). 2003, 99-115.
Fillmore 1968: Fillmore, C.J. The case for case. In E. Bach / R.T. Harms (edd). Universals in Linguistic Theory. New York: Holt, Rinehart & Winston. 1968, 1-88.
Fillmore 1982: Fillmore, C.J. Frame Semantics. In The Linguistic Society of Korea (ed). Linguistics in the Morning Calm. Seoul: Hanshin. 1982, 111-137.
Fillmore et al. 2003a: Fillmore, C.J. / Johnson, C.R. / Petruck, M. Background to FrameNet. In International Journal of Lexicography 16(3). 2003, 235-250.
Fillmore et al. 2003b: Fillmore, C. / Petruck, M. / Ruppenhofer, J. / Wright, A. FrameNet in Action: The Case of Attaching. In International Journal of Lexicography 16(3). 2003, 297-332.
FrameNet 2012: FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/). Accessed 8 March 2012.
Hadouche et al. 2011a: Hadouche, F. / Lapalme, G. / L’Homme, M.C. Attribution de rôles sémantiques aux actants des lexies verbales. In Traitement automatique des langues TALN 2011, 27 juin au 1er juillet 2011, Avignon.
Hadouche et al. 2011b: Hadouche, F. / Desgroseilliers, S. / Pimentel, J. / L’Homme, M.C. / Lapalme, G. Identification des participants de lexies prédicatives : évaluation en performance et en temps d’un système automatique. In Actes de la 9e conférence internationale Terminology and Artificial Intelligence (TIA’11). 2011, Paris, France.
Hanks / Pustejovsky 2005: Hanks, P. / Pustejovsky, J. A pattern dictionary for natural language processing. In Revue française de linguistique appliquée 10(2). 2005, 63-82.
Kipper et al. 2006: Kipper, K. / Korhonen, A. / Ryant, N. / Palmer, M. Extending VerbNet with Novel Verb Classes. In Fifth International Conference on Language Resources and Evaluation (LREC 2006). 2006, Genoa, Italy.
L’Homme 2008: L’Homme, M.C. Le DiCoInfo. Méthodologie pour une nouvelle génération de dictionnaires spécialisés. In Traduire 217. 2008, 78-103.
L’Homme 2010: L’Homme, M.C. Designing Terminological Dictionaries for Learners based on Lexical Semantics: The representation of actants. In Fuertes-Olivera, P. (ed). Specialised Dictionaries for Learners, Berlin/New York: De Gruyter. 2010, 141-153.
L’Homme / Pimentel 2010: L’Homme, M.C. / Pimentel, J. Guide d’annotation des contextes anglais du DiCoInfo, Montréal : Observatoire de linguistique Sens-Texte. 2010.
L’Homme / Pimentel 2012: L’Homme, M.C. / Pimentel, J. Capturing Syntactico-semantic Regularities among Terms: An Application of the FrameNet Methodology to Terminology. In Language Resources and Evaluation (LREC 2012). 2012, Istanbul, Turkey.
Levin 1993: Levin, B. English Verb Classes and Alternations: A Preliminary Investigation. Chicago: The University of Chicago Press. 1993.
Mel’čuk et al. 1984-1999: Mel’čuk, I. et al. Dictionnaire explicatif et combinatoire du français contemporain. Recherches lexico-sémantiques I, II, III, IV, Montréal : Les Presses de l’Université de Montréal. 1984-1999.
Mel’čuk 2004: Mel’čuk, I. Actants in semantics and syntax. I: actants in semantics. In Linguistics 42(1). 2004, 1-66.
Pimentel 2012: Pimentel, J. Criteria for the Validation of Specialized Verb Equivalents: An Application in Bilingual Terminography. Thesis presented at the University of Montreal. 2012.
Pimentel / L’Homme 2011: Pimentel, J. / L’Homme, M.C. Annotation syntaxico-sémantique de contextes spécialisés : application à la terminographie bilingue. In M. van Campenhoudt / T. Lino / R. Costa (edd). Passeurs de mots, passeurs d’espoir : lexicologie, terminologie et traduction face au défi de la diversité, Paris : Édition des archives contemporaines/Agence universitaire de la francophonie. 2011, 651-670.
Pimentel et al. 2012: Pimentel, J. / L’Homme, M.C. / Laneville, M.E. General and Specialized Lexical Resources: a Study on the Potential of Combining Efforts to Enrich Formal Lexicons. In International Journal of Lexicography. 2012, forthcoming.
PropBank 2012: PropBank (http://verbs.colorado.edu/propbank/). Accessed 21 March 2012.
Robichaud 2011: Robichaud, B. A Graph Visualization Tool for Terminology Discovery and Assessment. In Boguslavsky, I. / Wanner, L. (edd). Proceedings of the 5th International Conference on Meaning-Text Theory (MTT’11). 2011, Barcelona, Spain.
Ruppenhofer et al. 2010: Ruppenhofer, J. / Ellsworth, M. / Petruck, M. / Johnson, C. / Scheffczyk, J. FrameNet II: Extended Theory and Practice. 2010. http://framenet.icsi.berkeley.edu/. Accessed 8 March 2012.
Schmidt 2009: Schmidt, T. The Kicktionary – A Multilingual Lexical Resource of Football Language. In Boas, H.C. (ed). Multilingual FrameNets in Computational Lexicography. Methods and Applications, Berlin/New York: Mouton de Gruyter. 2009, 101-134.
VerbNet 2006: VerbNet (http://verbs.colorado.edu/~mpalmer/projects/verbnet/downloads.html). Accessed 1 November 2006.
VerbNet 2012: VerbNet (http://verbs.colorado.edu/~mpalmer/projects/verbnet/downloads.html). Accessed 21 March 2012.