Lang Resources & Evaluation (2009) 43:329–354 DOI 10.1007/s10579-009-9103-2
MEDIA: a semantically annotated corpus of task oriented dialogs in French
Results of the French MEDIA evaluation campaign
Hélène Bonneau-Maynard · Matthieu Quignard · Alexandre Denis
Published online: 19 September 2009
© Springer Science+Business Media B.V. 2009
Abstract  The aim of the French MEDIA project was to define a protocol for the evaluation of speech understanding modules for dialog systems. Accordingly, a corpus of 1,257 real spoken dialogs related to hotel reservation and tourist information was recorded, transcribed and semantically annotated, and a semantic attribute-value representation was defined in which conceptual relationships are represented by the names of the attributes. Two semantic annotation levels are distinguished in this approach. At the first level, each utterance is considered separately and the annotation represents the meaning of the statement without taking the dialog context into account. The second level of annotation corresponds to the interpretation of the meaning of the statement in the light of the dialog context; in this way a semantic representation of the dialog context is defined. This paper discusses the data collection, the detailed definition of both annotation levels, and the annotation scheme. The paper then comments on the two evaluation campaigns carried out during the project and discusses some of their results.

Keywords  Dialog system · Speech understanding · Corpus · Annotation · Evaluation
H. Bonneau-Maynard (corresponding author)
LIMSI–CNRS, Université Paris-Sud 11, Bât. 508, BP 133, 91403 Orsay Cedex, France
e-mail: [email protected]

M. Quignard · A. Denis
LORIA, Campus Scientifique, BP 239, 54506 Vandoeuvre-lès-Nancy Cedex, France
M. Quignard, e-mail: [email protected]
A. Denis, e-mail: [email protected]
1 Introduction

The assessment of a dialog system is complex. This is partly due to the high integration factor and tight coupling between the various modules present in any spoken language dialog system (SLDS), for which no commonly accepted reference architecture exists today. The other major difficulty stems from the dynamic nature of dialog. Hence most SLDS evaluations up to now have either tackled the system as a whole, or have relied on measurements based on dialog-context-free information.

The European DISC project (Giachim et al. 1997) collected a systematic list of bottom-up evaluation criteria, each corresponding to a partially ordered list of properties likely to be encountered in any SLDS. Although the DISC project results are quite extensive and are presented in a homogeneous way, they do not provide a direct answer to the problems posed by SLDS evaluation; their contribution lies more at the specification level. Although the approach and goals of the European EAGLES project (King et al. 1996) were different, much the same remark can be made about the results of its speech evaluation work group (Gibbon et al. 1997).

The MADCOW (Multi-Site Data Collection Working) group, set up in the USA by ARPA to coordinate corpus collection in the context of the ATIS (Air Travel Information Services) task, was the first to propose a common infrastructure for automatic SLDS evaluation (Hirschman 1992), which also addressed the problem of language understanding evaluation. The evaluation paradigm is based on comparing the system answer (a list of possible flights satisfying the user constraints) to a pair of minimal and maximal reference answers. Unfortunately no direct diagnostic information can be produced, since understanding is estimated by gauging the distance from the system answer to the pair of reference answers. In ATIS, the protocol was only applied to context-free sentences. It is relatively objective and generic because it relies on counts of explicit information and allows for a certain variation in the answers.

PARADISE (Walker et al. 1998) can be seen as a sort of meta-paradigm that correlates objective and subjective measurements. Its grounding hypothesis states that the goal of any SLDS is to achieve user satisfaction, which in turn can be predicted through task success and various interaction costs. With the help of the kappa coefficient, Carletta (1996) proposes to represent dialog success independently of the intrinsic task complexity, thus opening the way to generic comparative evaluation across tasks. PARADISE has been used in the COMMUNICATOR project (Walker et al. 2001, 2002), and has made it possible to evaluate SLDS performances with a series of domain-independent global measures which can be automatically extracted from the log files of the dialogs.

The MEDIA project addresses only a part of the SLDS evaluation problem, using a paradigm for evaluating the context-sensitive understanding capability of any SLDS. The paradigm is based on test sets extracted from real corpora, and has three main advantages: it is generic and contextual, and it offers diagnostic capabilities. Here, genericity is envisaged in the context of information-access dialogs. The diagnostic aspect is important in order to determine the different qualities of the systems under test. The contextual aspect of the evaluation is a crucial point since dialog is dynamic by nature.
The first step (Sect. 3) was dedicated to the definition and the collection of the MEDIA corpus of French dialogs for the chosen task (tourist information). During the second step, the common semantic representation was defined, and a dedicated annotation tool (http://www.limsi.fr/Individu/hbm/) was developed to support the semantic annotation of the corpus. The literal annotation of the corpus is described in Sect. 4. The definition of a semantic representation of the context is then given in Sect. 5. Two evaluation campaigns were performed in the project using the proposed paradigm; Sects. 6.1 and 6.2 discuss them in detail.
2 The MEDIA project and consortium

2.1 Motivations

In broad outline, SLDSs are composed of different modules for speech recognition, for natural language understanding, and for dialog management and generation. They usually include an explicit understanding model to represent the semantic level. The semantic interpretation can be decomposed into two steps. The first step consists of providing a semantic representation of an utterance (the literal semantic representation) without taking into account the rest of the dialog (see the ATIS project). The literal representation is then reconsidered in a second step by taking into account the dialog context, thereby making it possible to solve inter-query references and providing the contextual semantic representation of the utterance. Previous experiments with the PARADISE paradigm (Bonneau-Maynard et al. 2000) have shown that contextual understanding is strongly connected to user satisfaction and therefore to the overall quality of the dialog system.

The aim of the French Technolangue EVALDA-MEDIA project (referred to as MEDIA) was to focus the quality evaluation on SLDS interpretation modules, for both the literal and contextual understanding tasks. The evaluation paradigm is based on the use of test suites from real-world corpora, a common semantic representation and common metrics. The evaluation environment relies on the assumption that, for database query systems, it is possible to construct a common semantic representation to which each system is capable of converting its own internal representation. The chosen semantic representation is generic: most attributes are domain-independent, so that the representation has already been used for other domains (Lefèvre et al. 2002) and for other languages (Bonneau-Maynard et al. 2003) in the case of the IST-AMITIÉS project. Thanks to the precision of the semantic representation (which notably includes an explicit representation of references), selective evaluation on utterances containing particular linguistic difficulties can be performed, as described in Sect. 6.1.3.

In a way, the MEDIA evaluation paradigm complements evaluation programs centered on performance evaluation with global measures. Such global evaluations compare systems on logs of dialogs, which is obviously of great interest. However, specific recordings are needed to perform the evaluation of each
system, which is known to be costly. On the other hand, the MEDIA paradigm compares the systems on the same data and enables evaluation on specific difficulties. New approaches can also be tested without recording new dialogs. Finally, the objective of the MEDIA project is not only to give the scientific community the means to perform comparative evaluations of understanding modules, but also to offer the possibility of sharing corpora and of defining generic common representations and metrics.

2.2 The MEDIA consortium

Participants from both academic (IRIT, LIA, LIMSI, LORIA, VALORIA, CLIPS) and industrial sites (France Telecom R&D) took part in the project. The initiator of the project, the LIMSI Spoken Language Processing group, was responsible for coordinating the scientific aspects of the project. To ensure impartiality, the campaign was coordinated and managed by ELDA, which did not participate in the evaluation campaign. ELDA was also in charge of creating the corpus necessary for the project and responsible for creating or providing the software and tools necessary for the evaluation campaign itself. The company VECSYS provided the recording platform for the corpus (hardware and software, including the 'Wizard of Oz' system, see below). All partners were involved in the discussions concerning the choice of the task, the recording protocol of the corpus, and the common semantic representation. Only academic partners participated in the evaluation campaigns. This paradigm was used within two evaluation campaigns involving several sites, carrying out the task of querying information from a database.
3 Data collection

The dialogs are attempts to make hotel reservations using tourist information, with data obtained from a web-based database. The corpus was recorded with the vocal tourist information server simulated by a Wizard of Oz (WOZ) system (Devillers et al. 2003). In this way, each user believes he or she is talking to a machine whereas he or she is actually talking to a human being (a 'wizard') who simulates the behavior of a tourist information server. This enabled a corpus of varied dialogs to be obtained, thanks in part to the flexible behavior of the wizard. The operator (wizard) used a graphical interface, developed by VECSYS, which assisted him in generating the responses communicated to the user. The generated replies were obtained by completing a sentence template with information obtained by consulting a tourist information website, taking the user's request into account. The signal was recorded in digital format.

The callers followed pre-defined tourist information and hotel reservation scenarios (generated from a set of templates in such a way as to obtain varied dialogs), which were given to the callers by telephone. Several starting points were possible for the dialogs (for example, choice of town, itinerary, tourist event, festival, price, date,
and so on). Eight scenario categories were defined, each with a different level of complexity. An example of a simple scenario is given in Table 1. A complex scenario could consist of reserving several hotels in several locations according to an itinerary.

Table 1 A simple scenario

  Date:           Second weekend of May
  Town:           Marseille
  Situation:      Near the harbor
  No. of rooms:   1 single
  Price:          50–60 euros per night

In addition to the variety of scenarios given to the callers, a set of instructions for the wizard was defined in order to vary the type of dialogs. There are three categories of instructions. The first concerns speech recognition or comprehension errors: the wizard produces a response as if he had 'misunderstood' the user request. The second involves explicit or implicit feedback to the user. The final type concerns the level of cooperation on the part of the wizard. At one end of the spectrum, the wizard returns all the information requested by the user; at the other end, he is unable to reply to any of the user's requests. Between these two extremes, the wizard may provide partial information to the user, and here we may expect to observe the misunderstandings, clarification requests, and so on, that are frequent in spoken dialogs. The most interesting phenomena (such as reference, negotiation, negation) were observed with complex scenarios and a non-cooperative wizard.

3.1 Corpus characteristics

The main dialog characteristics are given in Table 2. A total of 1,257 dialogs were recorded from 250 different speakers, each caller carrying out five different hotel reservation scenarios. The final corpus is on the order of 70 h of dialogs, which have been transcribed and semantically annotated by ELDA (only client utterances were annotated). The total vocabulary size is 3,203 words including hotel and city names, with a mean number of words per utterance of around six for user requests. Although the wizards speak almost twice as much (283 k words) as the users (155 k words), the lexicon size is much lower for the wizards (1,932) than for the users (2,715). This is because the wizards pronounce automatically generated sentences while the users have no restrictions on their replies.

Table 2 Main characteristics of the MEDIA corpus
                              Wizard    User      Total
  No. of words                283 k     155 k     438 k
  No. of utterances           19.6 k    18.8 k    37 k
  Mean words per utterance    14.4      8.3       11.8
  Lexicon size                1,932     2,715     3,203
  No. of dialogs                                  1,257
  Average dialog duration                         3,30
4 Literal semantic representation and annotation scheme

4.1 Attribute/value representation

In order to provide a diagnostic evaluation, the evaluation paradigm relies on a common generic semantic representation. The formalism was agreed upon by all project partners and chosen so that a large corpus could be annotated with semantic tags. The selected common semantic representation, inspired by Bonneau-Maynard et al. (2003), is based on an attribute-value structure in which conceptual relationships are implicitly represented by the names of the attributes. This formalism enables communicative acts as well as the semantic content of an utterance to be coded in a two-level attribute-value representation. Each turn of a dialog is segmented into one or more dialogic segments, and each dialogic segment is segmented into one or more semantic segments, with the assumption that a semantic segment corresponds to a single attribute.

The communicative acts associated with each dialogic segment are derived from FIPA (FIPA 2002). Six dialog acts were agreed to by all participants: Inform, Query, Accept (Confirm), Reject (Disconfirm), Opening and Close, corresponding roughly to the DAMSL backward-looking functions (http://www.cs.rochester.edu/research/speech/damsl/RevisedManual/). This reduced list makes it possible to obtain a high level of inter-annotator agreement. However, since the project focussed on semantic evaluation, the partners involved in the campaigns were not expected to provide the dialogic segmentation and the corresponding communicative acts.

An example of the literal semantic representation of a client utterance is given in Table 3. An example of a whole dialog is given as an Appendix at the end of the paper. A semantic segment is represented by a triplet which contains the mode (affirmative '+', negative '-', interrogative '?' or optional '*'), the name of the attribute representing the meaning of the word sequence, and the value of the attribute. The order of the triplets in the semantic representation follows their order in the utterance. The values of the attributes are either numeric units, proper names or semantic classes merging lexical units which are synonyms for the task. Modes are assigned on a per-segment basis, which makes it possible to disambiguate sentences like "not in Paris in Nancy" that would otherwise be misleading for the dialog manager. This attribute-value representation (AVR) allows for a simple annotation process.

The semantic representation relies on a hierarchy of attributes representing the task and domain ontology, defined in a semantic dictionary which was jointly developed by the MEDIA consortium. The basic attributes are divided into several classes. The database attributes correspond to the attributes of the database tables (e.g. DBObject or payment-amount). They are classified in packages (e.g. time or payment), which are domain-independent, and hotel, which is domain-dependent. Each package is defined as a hierarchy of attributes (e.g. the payment package involves a sub-attribute amount, which in turn involves a sub-attribute int). The modifier attributes (e.g. comparative) are linked to database attributes and used to modify the meaning of the underlying database attribute (e.g. in Table 3, the comparative attribute, whose value is less-than, is associated with the payment-amount attribute).
Table 3 Example of the literal semantic attribute/value representation for the sentence "hum yes the hotel whose price doesn't exceed one hundred euros"

  Word seq.        Mode/attribute name          Attribute value
  hum              +/null
  yes              +/response                   yes
  the              +/refLink-coRef              singular
  hotel            +/DBObject                   hotel
  whose            +/null
  price            +/object                     payment-amount-room
  doesn't exceed   +/comparative-payment        less-than
  one hundred      +/payment-amount-int-room    100
  euros            +/payment-unit               euro

The relations between attributes are given by their order in the representation and by the composed attribute names. The segments are aligned on the sentence.
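For illustration, the flat representation can be stored directly as an ordered list of segments, each carrying its (mode, attribute, value) triplet. The sketch below encodes the example of Table 3; the class and helper names are ours and are not part of the MEDIA tools, and splitting a composed attribute name on '-' is only a rough view of the package/specifier decomposition described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SemanticSegment:
    """One semantic segment: a word sequence and its (mode, attribute, value) triplet."""
    words: str
    mode: str                    # '+', '-', '?' or '*'
    attribute: str               # possibly composed, e.g. "payment-amount-int-room"
    value: Optional[str] = None  # numeric unit, proper name or semantic class


# Literal annotation of "hum yes the hotel whose price doesn't exceed one hundred euros"
# (the triplets of Table 3, in utterance order).
utterance: List[SemanticSegment] = [
    SemanticSegment("hum", "+", "null"),
    SemanticSegment("yes", "+", "response", "yes"),
    SemanticSegment("the", "+", "refLink-coRef", "singular"),
    SemanticSegment("hotel", "+", "DBObject", "hotel"),
    SemanticSegment("whose", "+", "null"),
    SemanticSegment("price", "+", "object", "payment-amount-room"),
    SemanticSegment("doesn't exceed", "+", "comparative-payment", "less-than"),
    SemanticSegment("one hundred", "+", "payment-amount-int-room", "100"),
    SemanticSegment("euros", "+", "payment-unit", "euro"),
]


def components(attribute: str) -> List[str]:
    """Split a composed attribute name into its basic attributes and specifier."""
    return attribute.split("-")


print(components("payment-amount-int-room"))  # ['payment', 'amount', 'int', 'room']
```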
General attributes are also defined, such as command-task (cf. segments 1, 26, 47 and 61 in the Appendix), which covers the different actions that can be performed on objects of the task (e.g. reservation, information), or command-dial, with values such as cancellation, correction, etc. One of the general attributes, refLink, is dedicated to reference annotation (cf. segments 24, 27, 37, 41, 44 and 45 in the Appendix). Three kinds of references are represented: co-references (as in "in that hotel"), co-domain (as in "another hotel"), and element/set (as in "the first hotel"). The general and modifier attributes are domain-independent and were directly derived from other applications (Bonneau-Maynard et al. 2003), whereas most of the database attributes were derived from the database linked to the system.

Two types of connectors are also defined: connectAttr and connectProp, which represent, respectively, logical relations between attributes of the same object (with the default value and) and relations between parts of complex queries (with values explanation, consequence or opposition). A connectAttr attribute indicates a semantic dependence between two attributes, as in the following example:

  Word seq.                          Mode/attribute name    Attribute value
  hum I'd like to know if there is   null
  a swimming pool                    ?/hotel-services       swimming pool
  or                                 ?/connectAttr          alternative
  a jacuzzi                          ?/hotel-services       jacuzzi
A connectProp attribute indicates a semantic dependence between two parts of a statement, each composed of several semantic segments (e.g. utterance C2, attribute 5 in the Appendix). In the following utterance:

"alors à ce moment-là j' aimerais réserver donc à au à l' hôtel du champ de mars euh mais par contre j' aimerais connaître le prix des chambres parce que mon budget serait inférieur de 150 fr(ancs) 150 euros pardon"
Table 4 Hierarchical representation derived from the attribute/value representation of Table 3

  response: yes
  refLink: coRef singular
  DBObject: hotel
    room
      payment
        amount
          comparative: less
          integer: 100
          unit: euro
("then I'd like to reserve then hmm at a at the Champs de Mars hotel hmm but on the contrary I'd like to know the price of the rooms because I can't pay more than 150 fr(ancs) 150 euros sorry")

the connectProp attribute has to be assigned to the semantic segment "mais par contre" ("but on the contrary") with the value opposition, and to the segment "parce que" ("because") with the value explanation.

A hierarchical semantic representation is powerful, as it makes it possible to explicitly represent relationships between segments that are possibly non-adjacent in the transcription of the statement. On the other hand, a flat representation facilitates manual annotation. A set of specifiers, which are combined with database or modifier attributes, is therefore defined to preserve these relationships. Their combination with the database attributes specifies the exact relations between segments. The combination of the attributes and the specifiers, together with the connectors, allows one to derive a hierarchical representation from the flat attribute/value representation. In the example of Table 3, the attribute name payment-amount-int-room results from the combination of a hierarchy of attributes from the payment package (payment-amount-int) and the specifier room. The attribute comparative-payment is likewise derived from the combination of the comparative attribute and the payment specifier. The example of Table 3 can thus be turned into the hierarchical representation given in Table 4.

4.2 Corpus annotation

Semantic annotation is done on the dialog transcriptions. In order to decrease the annotation cost, the annotation tool described in Bonneau-Maynard et al. (2003) was used. It helps with both the definition of the semantic representation and the annotation process. Semantic disambiguation may require listening to the signal. The Semantizer annotation tool (http://www.limsi.fr/Individu/hbm/) provides compatibility with Transcriber
(Barras and Geoffrois 2001), which is becoming a standard for speech transcription. The formalization of the semantic dictionary and the assistance provided by the tool to the annotators increase the consistency of the annotations. For literal annotation, dialog turns are presented in random order to prevent the use of the dialog context. The attribute name is selected from the list generated from the semantic dictionary. Automatic completion of attribute names speeds up the process and is greatly appreciated by the annotators. An on-line verification is performed on the attribute value constraints, and the tool ensures that the provided annotation respects the semantic representation defined in the semantic dictionary. Usually, semantic annotation is keyword-based: the attributes are associated with the words which determine their value. In the chosen annotation scheme, a statement is segmented into semantic segments: the attributes are associated with sequences of words (the segments) which best disambiguate their semantic role.

Based on the semantic representation described above, the literal semantic annotation of the user utterances was performed by two annotators. The semantic dictionary includes 83 basic attributes and 19 specifiers. The combination of the basic attributes and the specifiers, automatically generated by the annotation tool, results in a total of 1,121 attributes that can be used during the annotation process. The 83 basic attributes include 73 database attributes, 4 modifiers, and 6 general attributes. The MEDIA consortium decided not to use semi-automatic annotation techniques in order not to bias the evaluation process in favor of any participating system. The MEDIA corpus was split by ELDA into randomly generated packages of 200 dialogs. The mean annotation time is about 5 times real time.

In order to verify the quality of the annotations, periodic evaluations were performed by computing the kappa statistic (Carletta 1996) for mode and attribute identification. An alignment on a per-segment basis is performed using the MEDIA scoring tool (see Sect. 6.1.2) in order to deal with cases where the annotators do not assign the same number of segments to an utterance. In the last inter-annotator experiment, the kappa is almost 0.9, which indicates a good quality of annotation (usually, a kappa value greater than 0.8 is considered good). The most common sources of disagreement between the annotators are the connectors (14% of the errors, with a 0.7 agreement), the identification of the mode (14% of the errors, with a 0.97 agreement), and the reference links (12.5% of the errors, with a 0.8 agreement). A further 14% of the errors are due to specifiers. The most frequent attributes are the yes/no response (17%), followed by reference attributes (6.9%) and command-task (6.8%); these are task-independent. Task-dependent attributes (hotel, room, ...) represent only 14.1% of the observed attributes. The semantic dictionary ensures a good coverage of the task, considering that only 0.1% of the segments are annotated with the unknown attribute.

Given that the objective of the project is to perform system evaluations, the client utterances have been divided into three corpora: the adaptation corpus, which is necessary for the adaptation of the systems to the domain and to the task, the development corpus, which is used to test the evaluation procedure, and the test corpus itself. Table 5 gives their main characteristics.
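The kappa values reported above are chance-corrected agreement scores in the sense of Carletta (1996). A minimal two-annotator sketch (the function name is ours; the actual MEDIA quality checks also involve the segment alignment mentioned above):

```python
from collections import Counter
from typing import Sequence


def kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement over aligned segments:
    kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement and
    P(E) the agreement expected by chance from each annotator's label distribution."""
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum((counts_a[lab] / n) * (counts_b[lab] / n)
                   for lab in set(labels_a) | set(labels_b))
    return (p_observed - p_chance) / (1 - p_chance)


# Example: agreement on the attribute assigned to four aligned segments.
print(kappa(["response", "refLink-coRef", "DBObject", "command-task"],
            ["response", "refLink-coRef", "DBObject", "command-dial"]))
```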
Table 5 Adaptation, development and test corpus characteristics

                                        Adaptation   Dev.     Test
  No. of dialogs                        727          79       208
  No. of client utterances              11,010       1,009    3,003
  Mean number of words per utterance    4.8          5.4      6.2
  Vocabulary size                       2,115        794      900
  No. of observed attributes            31,677       3,363    8,788
  Mean number of attributes per utt.    2.7          3.1      3.9
  No. of distinct attributes            145          105      126
5 The contextual semantic annotation and annotation scheme

The evaluation of understanding abilities that rely on context is a very difficult task, because it depends on the contextual models of each system. We propose here a methodology for evaluating the final product of these abilities without considering the method actually used to build it.

5.1 Representation of the context

First, we had to agree on what would be called the context during the evaluation. We studied four ways to represent it. The first representation, called "ecological", contains only the preceding transcribed utterances (the dialog history). This representation is very close to real situations, in which a system has no external information other than the utterances. However, such an evaluation would not distinguish errors that take place in the course of interpretation (like parsing or semantic building) from pure contextual understanding errors. The second representation, called the abstract representation, contains only the literal and contextual representations of the preceding utterances, but requires that all systems be able to take these representations as (unique) input. The third representation, called the mixed representation, contains both the transcribed utterances and their literal and contextual representations. Finally, the context could have been encoded as a paraphrase, a small text summing up the preceding dialog, which is more difficult to construct but usable by all systems. Each participant had to choose a preferred way to represent the context for the evaluation of their system, and it was then decided to evaluate the systems according to the ecological and mixed representations.

5.2 Contextual representation and annotation scheme

In the MEDIA framework, we define the contextual semantic representation as the product of the re-interpretation of the current utterance according to the previous dialog context. The process of re-interpretation of the context according to the current utterance has been excluded from the evaluation, because it is too dependent on the particular strategies and internal representations of each system. The contextual understanding abilities of the systems have been evaluated in the same way as the literal
ones; the evaluation focuses on two facets of understanding: contextual meaning refinement, which consists of modifying the semantic representation of an utterance according to the previous dialog history, and reference resolution, which consists of representing the entities that are referred to by a referring expression.

The contextual annotation has to respect some practical constraints. First, it should not introduce a new segmentation of the utterances with respect to the literal annotation; we thereby avoid the problem of comparing different segmentations. Second, it is necessary to keep the same dictionary of features, considering that the contextual meaning of an utterance can be reformulated with literal semantic features. Third, for reference resolution, contrary to literal understanding, the system utterances also need to be annotated. And fourth, reference annotation has to be done using descriptions instead of relationships (like coreference chains, see Sect. 5.2.2).

5.2.1 Contextual meaning refinement

The refinement of the meaning of an utterance is only required if, once the context is considered, this meaning differs from the literal interpretation. The contextual semantic specification consists of modifying the literal annotation using the same vocabulary: the set of concepts and their corresponding attributes cannot be altered. The following example (Table 6) shows how the meaning can be refined using the context (the revised meaning is in bold).

Table 6 Contextual meaning refinement

  Utterances    S: In which district do you want to reserve?
                U: uh I'd like in midtown
  Literal       +/location-relative = midtown
  Contextual    +/location-relative-hotel = midtown

5.2.2 Reference resolution

In the MEDIA project, reference resolution was restricted to the resolution of intra-linguistic anaphora, and more precisely coreference, that is, when two referring expressions refer to the same individual (van Deemter and Kibble 2000). Most approaches evaluate the relationships between referring expressions (Popescu-Belis et al. 2004) and rely on annotation schemes focused on relations, like the MUC-6 and MUC-7 campaigns, based on coreferences (Chinchor et al. 1997; van Deemter and Kibble 2000), or the Reference Annotation Framework, RAF (Salmon-Alt et al. 2004), in which referring expressions are annotated by markables and relationships by referential links. These approaches are well designed for identifying the relationships but are less suited to dealing with particular types of references (like "I take some", where "some" quantifies over a type of objects and is elliptical). In addition, they require adding a new level, completely different from the semantic level, which entails developing new measures. We preferred instead to evaluate the semantic description of referents: first, it allows us to deal with a larger scope of phenomena, and second, it does not require the development of new measures.
However, globally evaluating the semantic description of referents is not very accurate, because some semantic features are more important than others for identifying objects (the city in a description of a room seems much more important than knowing whether it has a bathroom). But as the systems were able to produce a semantic description, the evaluation of reference resolution is limited to this representation and to the description of referring expressions, with a taxonomy close to RAF (identity, co-domain, or part-of).

The literal annotation of reference has been limited to the referring expression as such, using a refLink feature refined by the expression category. The different categories are very close to those used in RAF (see Table 7), but without the part-of relation, for which there was no agreement.

Table 7 Refinement categories for literal reference annotation

  coRef  (coreference): the expression directly denotes its referent.
         Referring expressions: pronouns, definite articles, demonstratives.
  eltSet (element-set): the expression denotes the referent thanks to properties that oppose it to other entities in a set.
         Referring expressions: some demonstrative pronouns, ordinals, superlatives, relatives.
  coDom  (co-domain): the expression denotes the referents thanks to an alterity expression.
         Referring expressions: alterity expressions (e.g. the other one).

Contrary to RAF markables, only the determiners of noun phrases are associated with a refLink feature, because the rest of the noun phrase is already annotated by the literal semantic annotation. The value of the refLink feature equals the expected number of referents: singular ("this hotel"), plural ("those hotels") or undetermined when no number information is given ("there" can refer to one or more hotels). To keep the annotation cost low while focusing on interesting phenomena, only referring expressions whose scope goes beyond the utterance have been annotated. This excludes any referring expression whose antecedent is located in the same utterance, but also named entities and indefinite expressions. (One exception: indefinite alterity expressions, e.g. "another N", are annotated; in this case, the excluded entity is annotated instead of the actual referent, which is undetermined. This is observed in turn C16 of the dialog given in the Appendix.) Finally, only entities of the task were annotated (hotel, room, ...). The Appendix shows room annotation (turn C10), hotel annotation (turn C12) and price annotation (turn C22).

The contextual representation of a reference is based on the literal annotation of the referents. A reference is represented by a set of referents, each one described by a set of semantic features. We do so by adding a reference field to the refLink features; for instance, "t1,t2; t3" would be the annotation of a referring expression that refers to one entity described by two features and another one described by only one feature. An example (turn C10 from the dialog in the Appendix) is given below. The reference field of feature 24, the determiner "les" (the) in "les chambres" (the rooms), contains three referents described by preceding features: the city (13), the name of the hotel (14, 15, 18, 21), and the price (16, 17, 19, 20, 22, 23).
C10 "je veux dire je voudrais savoir si les chambres que je vais réserver les chambres six chambres individuelles donnent sur une cour et est-ce qu' il y a un parking privé"
    ("I mean I'd like to know if the rooms I'm going to book the rooms six single rooms overlook a courtyard and if there is a private parking")

  24  les        ?/refLink-coRef: plural
                 reference="13,14,15,16,17; 13,18,19,20; 13,21,22,23"
  25  chambres   ?/object: room
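The reference field is thus a flat string in which referents are separated by semicolons and the features describing each referent by commas. A small parsing sketch (a hypothetical helper, not part of the MEDIA tools, assuming numeric segment identifiers as in the example above):

```python
from typing import List


def parse_reference(field: str) -> List[List[int]]:
    """Parse a refLink reference field: referents are separated by ';' and each
    referent is described by a comma-separated list of segment numbers."""
    return [[int(segment) for segment in referent.split(",")]
            for referent in field.split(";") if referent.strip()]


# The determiner "les" (segment 24) in turn C10 refers to three rooms, each
# described by the city, the hotel name and the price segments:
print(parse_reference("13,14,15,16,17; 13,18,19,20; 13,21,22,23"))
# -> [[13, 14, 15, 16, 17], [13, 18, 19, 20], [13, 21, 22, 23]]
```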
The main limitation of this formalism is how the ambiguity phenomenon is approximated. An additional level would have been needed to represent ambiguities of plural groups; without this level, ambiguity is encoded as a plural group with an additional specifier (ambiguous) on the refLink concept. For example, an ambiguous expression like "the other hotel" should be annotated as refLink-coDom-ambiguous.

We collectively designed annotation rules following three constraints: a low annotation cost, a large set of interesting phenomena taken into account, and a high inter-annotator agreement (see below). The most important rule concerns how to describe referents, and especially referents described by other referents. For instance, because "room" is the reserved object, its description could gather all the features of the reservation and would, as such, imply a high annotation cost. Several solutions were studied: the maximal annotation, constituted by all the features describing a referent (accurate but too costly); the discriminating annotation, defined by the smallest description of the preceding context that can identify the referent without ambiguity (uninteresting to evaluate if there is no ambiguity); and a recency-based annotation, composed of the descriptive features contained in the utterance containing the most recent antecedent (useless for pronouns or demonstratives). Since none of these solutions is fully satisfactory, we made a compromise between the maximal and the discriminating annotation which relies on the type of entity: named entities (or equivalents, like a named hotel, a date, a price, a city, etc.) are only described by a very small set of features which are discriminating by definition (the name or the value), whereas other entities (an unnamed hotel, or a room) are annotated with the largest set of features, including other referents' features. Other annotation rules define the scope of a referent's description, which can contain all the semantic features present in preceding utterances. Finally, we constrain the referent description to be normalized, that is, to be in a non-redundant, non-contradictory and fully specified semantic form.

During this second campaign, the corpus was again split into three subsets for adaptation, development (dry run) and the final test (Table 8). The manual annotation of referring expressions was controlled three times by measuring the inter-annotator agreement using the three-level evaluation measure presented in Sect. 6.2.2. The agreement, evaluated on 31 dialogs (taken from the 814-dialog training corpus), is very good with respect to the description of referring expressions (DRE, 95%) and referent identification (IREF, 95%). The agreement on the full description of referents (DREF, 82%), although still good, is weaker than the former two, showing the difficulty, even for human annotators, of providing the unique complete description of the referents.
Table 8 Adaptation, development and test corpus characteristics for contextual evaluation

                                 Adaptation   Development   Test
  No. of dialogs                 814          79            173
  No. of client utterances       11,800       1,009         2,816
  No. of segments                38,800       4,532         9,528
  No. of referring expressions   2,294        207           447
6 The MEDIA evaluation campaigns

6.1 Evaluation of literal understanding

6.1.1 Systems presentation

Five systems participated in the evaluation. LIMSI-1 and LIA use corpus-based automatic training techniques, the LORIA and VALORIA systems rely on hand-crafted symbolic approaches, and the LIMSI-2 system is mixed.

The Spoken Language Understanding module developed at LIA (Raymond et al. 2006) starts with a translation process in which stochastic language models are implemented by finite state machines (FSM). The result of the translation process is a structured n-best list of interpretations. The last step in this interpretation process consists of a decision module, based on classifiers, which chooses a hypothesis from this n-best list.

The LIMSI-1 system (Bonneau-Maynard et al. 2005) is founded on a corpus-based stochastic formulation. It is composed of two stages: a first step of conceptual decoding produces the modality and attribute sequences associated with word segments, and a final step translates the word segments into the values expected by the representation. Basically, the understanding process consists of finding the best sequence of concepts given the sequence of words in the user statement, under the maximum likelihood framework.

The LIMSI-2 system is based on previous work on automatic detection of dialog acts (Rosset et al. 2005) and consists of three modules: a symbolic approach is used for specific entity detection, utterance semantic segmentation is done using a 4-gram language model representation, and automatic semantic annotation is then performed using a memory-based learning approach.

The approach of the LORIA system (Denis et al. 2008) is based on deep parsing and description logics. Derivation trees (even partial ones) are used to build a semantic graph by matching TAG elementary trees with logical predicates. The resulting conceptual graph is tested against an internal ontology to remove inconsistencies. Projection into the final representation is carried out by means of an external ontology, the MEDIA one, and description logics.

The VALORIA system, called LOGUS, implements a logical approach to the understanding of spoken French (Villaneau et al. 2004), according to the illocutionary logic of Vanderveken (1990). Concepts and conceptual structures are used so that the logical formula can be converted into a conceptual graph.
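The conceptual decoding step of the corpus-based systems above (LIMSI-1 in particular) can be summarized by the usual maximum-likelihood formulation; this is a generic sketch rather than the exact model of any participant:

  C* = argmax_C P(C | W) = argmax_C P(W | C) P(C)

where W is the word sequence of the user statement, C a candidate sequence of concepts (mode/attribute segments), and P(W | C) and P(C) are estimated on the annotated training corpus.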
6.1.2 Evaluation protocol

The scoring tool developed for the MEDIA project allows the alignment of two semantic representations and their comparison in terms of deletions, insertions, and substitutions. It is able to handle alternative representations for each statement. The scoring is done on the whole triplet [mode, attribute name, attribute value]. Different scoring methods were applied to the system results: the Full scoring uses the whole set of attributes, whereas in the Relax scoring the specifiers are no longer considered. Another simplification consists of applying a projection on the modes, resulting in a mode distinction limited to affirmative and negative (two modes).

Each participant benefited from the same semantically annotated 11k-utterance training corpus, to enable the adaptation of its models to the task and the domain, as well as from the semantic dictionary and the annotation manual. The 3,203-word lexicon of the MEDIA corpus and the list of the 667 values appearing in the corpus for the open-value attributes (such as hotel or city names, as opposed to attributes such as comparative, whose values are given by the representation) were also given to the participants. Following a dry run on a 1k-utterance set, which enabled the definition of the test protocol, the literal evaluation campaign was performed on a test set of 3k utterances.

As observed in the inter-annotator experiments, some variability should be allowed in the semantic representation of a statement. In a post-result adjudication phase, the participants were asked to propose either modifications or alternatives for the test set annotation, and a consensus vote was then carried out. In the end, only 179 queries were associated with several alternative annotations, i.e. less than 6% of the whole test corpus, with approximately two alternatives per statement.

6.1.3 Results

Table 9 gives the results obtained by the five participating systems in terms of understanding error rates (Bonneau-Maynard et al. 2006). First, it can be observed that the corpus-based training systems (LIMSI-1, LIMSI-2 and LIA) obtain better results than the others. Concerning the performance of the symbolic systems, a significant part of the errors comes from an incorrect projection (or translation) into the expected annotation format, and not only from understanding errors.
Table 9 Results in terms of understanding error rates (best results in bold)

             Full                       Relax
             Four modes   Two modes     Four modes   Two modes
  LIA        41.3         36.4          29.8         24.1
  LIMSI-1    29.0         23.8          27.0         21.6
  LIMSI-2    30.3         23.2          27.2         19.6
  LORIA      36.3         28.9          32.3         24.6
  VALORIA    37.8         30.6          35.1         27.6
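For concreteness, the sketch below shows one way an understanding error rate of this kind could be computed, using a standard edit-distance alignment over whole (mode, attribute, value) triplets. It is only an approximation of the actual MEDIA scoring tool (which also handles alternative reference annotations and mode projections); in particular, the Relax projection is approximated here by dropping the trailing component of composed attribute names.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (mode, attribute name, attribute value)


def relax(triplet: Triplet) -> Triplet:
    """Approximate Relax scoring: drop the trailing specifier of a composed name."""
    mode, attribute, value = triplet
    return mode, attribute.rsplit("-", 1)[0], value


def edit_errors(ref: List[Triplet], hyp: List[Triplet]) -> int:
    """Minimum number of substitutions, deletions and insertions needed to turn
    the hypothesis triplet sequence into the reference one (standard DP)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)]


def understanding_error_rate(refs: List[List[Triplet]], hyps: List[List[Triplet]],
                             relax_scoring: bool = False) -> float:
    """Total errors divided by the total number of reference triplets."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        if relax_scoring:
            ref, hyp = [relax(t) for t in ref], [relax(t) for t in hyp]
        errors += edit_errors(ref, hyp)
        total += len(ref)
    return errors / total
```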
Given the number of attributes present in the test set (8,788), the 95% precision of the results is good (p = 0.000 114). The understanding error rates are nevertheless relatively high: 29% for the best system in Full scoring with four modes, and 19.6% for the best system in Relax scoring with two modes. This last result may be compared with the understanding error rate on the ARISE task (Lamel et al. 1999), which, with a similar evaluation protocol, was around 10% on exact transcriptions (Lefèvre et al. 2002). The gap in performance between the ARISE and MEDIA tasks may be explained by the number of attributes involved in the models, which is much higher for the MEDIA task (83 attributes, 19 specifiers) than for the ARISE task (53 attributes, no specifiers).

The performance improvement between the results obtained with and without the specifiers (Full vs. Relax) is significant for all the systems. It is worth noting that no significant difference in performance is observed between systems using such a hierarchical representation internally and systems implementing a tagging approach: the lowest relative increase in error rate (around 7%) is obtained by two systems (VALORIA and LIMSI-1) representing both approaches. Using four modes instead of two is also a major difficulty for all the systems. This can be partially explained by the fact that the signal, which was listened to by the human annotators, is often necessary to disambiguate between the interrogative and affirmative modes.

The attribute on which errors occur most frequently is the reference link attribute (refLink). Obviously, the annotation of references represents the most difficult problem, on which research teams may have to focus their efforts. This is also true for connector identification. Apart from these two points, the nature of the errors is rather different among the systems. ROVER tests (Fiscus 1997) have been performed to exploit the differences in the nature of the errors made by the various systems: in an Oracle mode, the best combination of the five systems could reduce the error rate to 10%.

A meta-annotation of the test corpus has been performed in terms of linguistic difficulties, semi-automatically derived from the semantic annotation. Table 10 gives the systems' error rates for the subsets of statements containing the most significant difficulties, in the Full scoring mode with four modes. The first line gives the number of utterances in which each difficulty is observed in the test set and the corresponding 95% precision of the results (p). Complex requests correspond either to multiple requests or to requests which are on the borderline of the MEDIA domain. Repetition is tagged when a concept is repeated in the utterance several times with the same value (as in "the second the second week-end of March"), whereas
Correction is used when the concept is repeated with different values (as in "the second the third week-end of March").

Table 10 Selective error rates in Full scoring mode with four modes on the main linguistic difficulties (best results in bold)

                       Complex (%)        Repetition (%)     Correction (%)
  No. of occurrences   136 (p = 0.039)    117 (p = 0.044)    47 (p = 0.069)
  LIA                  54                 54                 58
  LIMSI-1              33                 38                 37
  LIMSI-2              35                 40                 41
  LORIA                47                 42                 46
  VALORIA              46                 46                 53

The understanding error rates become significantly higher for sentences including these difficulties. The systems which obtained the best results on the whole test set also obtain the best results on the difficulties. From a relative point of view, the LIMSI-1 and LIMSI-2 systems are more resistant to errors on complex utterances (respectively 14 and 17% relative increase in error rate) than the other systems (around 30%).

6.2 Evaluation of contextual understanding

6.2.1 Systems presentation

LORIA symbolic approach. LORIA's system focused on the processing of referring expressions, leaving aside the problem of meaning specification in dialog context. The reference solver developed in LORIA's system (Denis et al. 2006) is based on Reference Domains Theory (Salmon-Alt and Romary 2001). This theory assumes that referring expressions require the identification of a domain in which the expression isolates the referent. Although the theory was originally designed for multimodal reference processing, the MEDIA campaign was an opportunity to evaluate its relevance for anaphora resolution. In this framework, a reference domain consists of a support, a set of objects defined either in intension or in extension, and a set of differentiation criteria which discriminate its elements. Each designation activates the corresponding domain, in which the element is extracted and focalized, thereby enhancing the salience of this element for later designation. An alterity expression (e.g. "the other hotel") looks for a domain having a focalized partition, from which the other part will be extracted. The projection into the MEDIA formalism is carried out by collecting and merging, along the dialog history, the literal semantic representations of the referents. In the mixed evaluation (see Sect. 5.1), the provided literal semantic representations are integrated at this step of the process only; they are not used for solving the referring expressions.

LIA probabilistic approach. As mentioned in the presentation of the LIA system for the literal understanding campaign, the contextual meaning refinement and reference resolution processes are carried out in a second stage, on the basis of the n-best concept chains produced at the earlier stage. Contextual meaning refinement is processed as a tagging task: specifiers are attributed by a probabilistic tagger based on conditional random fields (CRF). CRFs (Lafferty et al. 2001) have been successfully used for many tagging tasks and provide the ability to predict a tag from a sequence of observations occurring in the past or in the future. This ability is very helpful for specifiers, since the refinement of a given concept may be triggered by elements occurring before or after the concept in a broader context. Once the tagging is over, reference resolution is done according to the following algorithm: all concepts in the recent dialog history (limited to the n previous utterances) which hold the same specifier as the object pointed to by the referential link are associated with this link. Each object is described by a given
number of features (for example, the town, the trademark, the name or the services associated with a hotel). The association algorithm keeps in the referential link all the concepts describing those features. More information on this approach is given in (Denis et al. 2006).

6.2.2 Evaluation protocol

The evaluation of reference resolution is carried out by comparing the semantic features describing each referent. Before describing a referent, this referent needs to be identified, and this identification in turn requires that the system correctly identifies the referring expression. Since these tasks rely on potentially different abilities, we found it necessary to evaluate the process of reference resolution at three levels, each giving rise to classical scores such as recall, precision and F-measure:

DRE: the ability to describe referring expressions, i.e. to provide the correct specifiers (coRef, coDom, eltSet, but also inclusion, exclusion, and ambiguous).
IREF: the ability to identify the referents, i.e. to provide enough correct features for each referent for it to be matched with the correct one.
DREF: the ability to describe the referents in extenso. This evaluation only applies to referents correctly identified (IREF). A maximal matching between the features of each referent is carried out.

6.2.3 Results and discussion

Table 11 shows the results of the systems for both the ecological and mixed evaluation conditions (see Sect. 5.1). The confidence intervals are given with respect to a precision of 95%. In DRE, the LORIA system obtains a rather average score in the ecological condition, which improves notably when the correct literal description is available in the mixed condition. Concerning referent identification (IREF), the symbolic system has the same low recall score in both conditions. This lack of improvement is explained by the fact that the additional information provided in the mixed protocol can only be integrated in the LORIA system after the referents have been identified, and would
Table 11 Results of reference resolution evaluation for the LIA and LORIA systems

                  LIA                          LORIA
                  Precision     Recall         Precision     Recall
  Ecological
    DRE           72.2 ± 4.7    72.2 ± 4.7     50.9 ± 5.0    50.9 ± 5.0
    IREF          74.1 ± 4.6    61.9 ± 4.6     65.2 ± 5.3    44.3 ± 4.6
    DREF          67.3 ± 3.4    55.2 ± 3.3     68.9 ± 4.2    48.3 ± 3.8
  Mixed
    DRE           86.5 ± 3.2    86.5 ± 3.2     85.4 ± 3.2    85.4 ± 3.2
    IREF          77.1 ± 3.7    73.8 ± 3.8     75.2 ± 5.1    44.7 ± 4.2
    DREF          74.1 ± 2.6    64.0 ± 2.6     75.0 ± 4.2    56.8 ± 3.7
only help to better describe the referents (DREF). Finally, both systems improve their DREF score equally when the mixed condition is compared to the ecological one.

We scrutinized LORIA's results according to its IREF errors. First, we noticed that only 57% of the errors come from the reference resolution algorithm, while 43% come from upstream or downstream modules (literal projection, semantic form building, or syntactic or lexical analysis). The reference resolution errors have been classified into two groups: phenomena that were not handled at all (35%) and phenomena that were wrongly processed (65%). The first group contains complex cases like the generic use of "the room", while the second group gathers, for instance, incorrect use of semantic constraints or erroneous domain management. This evaluation shows that the LORIA model is fine-grained but error-prone: one missed referent at the beginning of the dialog can lead to many other reference errors in the following utterances.

The results of the LIA system in the mixed condition show which errors are produced by the contextual meaning refinement and reference resolution processes alone. Contextual meaning refinement and referential links are rather correctly tagged, but the scores vary greatly according to the tag values (the specifiers). The low occurrence of some phenomena is problematic for probabilistic methods, which require a large number of examples to learn models. Reference identification (IREF) is performed with rather good precision, considering the very simple heuristics designed for this task. Finer analyses show that the LIA system is quite good at resolving direct references, which are the most common. Many errors concentrate on ambiguities and alterities. Finally, we note only a limited drop in score in the ecological condition with respect to the mixed one; this approach is therefore rather robust.
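To make the three-level protocol of Sect. 6.2.2 concrete, the sketch below shows how feature-level precision and recall (the DREF scores) could be computed once each predicted referent has been matched to a gold referent; the matching policy and the exact feature-equality test are simplifying assumptions on our part.

```python
from typing import FrozenSet, List, Tuple

# A referent is described by a set of semantic features, e.g. "name-hotel=bastille".
Referent = FrozenSet[str]


def dref_scores(gold: List[Referent], pred: List[Referent]) -> Tuple[float, float]:
    """Precision/recall of referent descriptions, assuming pred[i] has already been
    matched to gold[i] by the IREF step; features are compared by exact equality."""
    correct = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Example: one referent fully described, one only partially.
gold = [frozenset({"location-city-hotel=paris", "name-hotel=bastille"}),
        frozenset({"payment-amount-integer-room=60", "payment-unit=euro"})]
pred = [frozenset({"location-city-hotel=paris", "name-hotel=bastille"}),
        frozenset({"payment-amount-integer-room=60"})]
print(dref_scores(gold, pred))  # (1.0, 0.75)
```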
7 Conclusion

This paper has described in detail the MEDIA annotation scheme for the semantic annotation of spoken dialogs. Its main characteristics and advantages are that:

• The representation is generic and provides compatibility with Transcriber.
• It includes both literal and contextual annotation levels.
• It enables a good level of precision (including an explicit representation of references).
• The reduced annotation time enables the annotation of large corpora.
The very good inter-annotator agreement validates the choice of the annotation formalism and the development of the corresponding annotation tool. The MEDIA project provides a large dialog corpus to the community: more than 1,200 real dialogs with their corresponding semantic annotations. Because of the large size of the corpus, systems which require supervised learning have enough data to train on. Furthermore, the MEDIA consortium has designed a common framework for evaluating the understanding modules of dialog systems, including the possibility of evaluating the ability of understanding modules to take the local context into account. Specific evaluation tools have been developed, enabling cross-system
comparison and detailed analyses of literal understanding, contextual meaning refinement and reference resolution. The corpus also includes the speech signal, so that experiments going from the speech signal to speech understanding are possible. An evaluation package which includes the corpus along with the protocols, scoring tools, and evaluation results is available and distributed by ELDA (http://catalog.elra.info/product_info.php?products_id=998&language=fr). The documents (in particular the annotation instruction manuals) and the tools (both the annotation and the evaluation tools) provided by the project make it possible to apply the methods to other domains. For example, the European project LUNA (IST 33549) has used the semantic representation and the dedicated annotation tool for the annotation of its multi-lingual corpus of customer-operator dialogs. The MEDIA corpus has also been acquired by the Université du Maine to perform studies on dialog systems.

The wide availability of these resources (corpus and evaluation tools) will support the development of robust dialog studies. In pursuit of this goal, two PhD theses have been carried out within this project. Both propose to exploit the MEDIA corpus (dialogs and semantic annotations) for evaluating the ability of a system to overcome either difficulties (simulated user behaviors) (Allemandou 2007) or reference resolution errors through a grounding process (Denis et al. 2007).

Acknowledgments Thanks to Christelle Ayache, Frédéric Béchet, Laurence Devillers, Anne Kuhn, Fabrice Lefèvre, Djamel Mostefa, Sophie Rosset and Jeanne Villaneau for their participation in the project.
Appendix

We give a full annotated dialog (#1037) from the MEDIA corpus, where W is the wizard and C the client. Below each utterance, the sequence of segments with their corresponding contextual annotation is given. The segment numbers (1–85) may be referred to by the referring expression annotations.

W1  ‘‘...quelle information désirez-vous’’
    ‘‘...which information would you like’’

C2  ‘‘je voudrais faire une réservation pour le trente et un mai deux jours deux nuits à Paris mais dans un hôtel qui se trouverait près de la place de la Bastille s' il vous plaît pour six chambres individuelles’’
    ‘‘I'd like to book for may the 31st 2 days 2 nights in Paris but in an hotel which is near place de la Bastille please six single rooms’’

    1   je voudrais faire une réservation   ?/command-task: reservation
    2   pour le trente et un mai            ?/time-date: 05/31
    3   deux jours deux nuits               ?/stay-nbNight-reservation: 2
    4   à Paris                             ?/location-city: paris
    5   mais                                ?/connectProp: addition
    6   dans un hôtel              ?/DBObject: hotel
    7   près de                    ?/location-relativeDistance-hotel: near
    8   la place de la Bastille    ?/location-street-hotel: bastille
    9   pour six                   ?/number-room-reservation: 6
    10  chambres individuelles     ?/room-type: single
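To make the layout of these segment tables concrete, the sketch below shows one possible in-memory representation of an annotated segment (mode, attribute, value and, for referring expressions, the ids of the antecedent's segments). The field names are illustrative and not the distribution format of the corpus.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    """One annotated segment, mirroring the columns used in this appendix."""
    sid: int                 # segment number (1-85 in this dialog)
    words: str               # transcribed word sequence
    mode: str                # '?' affirmative, '-' negative
    attribute: str           # e.g. "command-task", "location-city"
    value: str               # e.g. "reservation", "paris"
    reference: List[int] = field(default_factory=list)  # antecedent segment ids

utterance_c2 = [
    Segment(1, "je voudrais faire une réservation", "?", "command-task", "reservation"),
    Segment(2, "pour le trente et un mai", "?", "time-date", "05/31"),
    Segment(4, "à Paris", "?", "location-city", "paris"),
]

for seg in utterance_c2:
    print(f"{seg.sid:>2}  {seg.words:<35} {seg.mode}/{seg.attribute}: {seg.value}")
```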
W3  ‘‘vous souhaitez faire une réservation à Paris’’
    ‘‘you'd like to book in Paris’’

C4  ‘‘oui {yes}’’

    11  oui   ?/response: yes
W5  ‘‘près de la place de la Bastille’’
    ‘‘near the place de la Bastille’’

C6  ‘‘oui madame {yes}’’

    12  oui   ?/response: yes
W7  ‘‘veuillez patienter je recherche vos informations’’
    ‘‘please wait I'm looking for your information’’

C8  ‘‘merci bien {thanks}’’
W9  ‘‘à Paris je vous propose trois hôtels le Méridien Bastille la chambre est à soixante euros l' athanor hôtel la chambre est à quatre-vingt-cinq euros l' hôtel Richard Lenoir la chambre est à cinquante-cinq euros voulez-vous réserver dans l' un de ces hôtels ou obtenir plus d' informations’’
    ‘‘in Paris I propose you 3 hotels the Bastille Méridien the room is 60 euros the Athanor hotel the room is 85 euros the Richard Lenoir hotel the room is 55 euros do you want to book in one of those hotels or ask for more information’’

    13  à Paris                   ?/location-city-hotel: paris
    14  le Méridien               ?/hotel-trademark: Méridien
    15  Bastille                  ?/name-hotel: bastille
    16  soixante                  ?/payment-amount-integer-room: 60
    17  euros                     ?/payment-unit: euro
    18  l' athanor hôtel          ?/name-hotel: athanor
    19  quatre-vingt-cinq         ?/payment-amount-integer-room: 85
    20  euros                     ?/payment-unit: euro
    21  l' hôtel Richard Lenoir   ?/name-hotel: richard lenoir
    22  cinquante-cinq            ?/payment-amount-integer-room: 55
    23  euros                     ?/payment-unit: euro
C10 ‘‘je veux dire je voudrais savoir si les chambres que je vais réserver les chambres six chambres individuelles donnent sur une cour et est-ce qu' il y a un parking privé’’
    ‘‘I mean I'd like to know if the rooms I'm going to book the rooms six single rooms overlook a courtyard and if there is a private parking’’

    24  les   ?/refLink-coRef: plural   reference=‘‘13,14,15,16,17; 13,18,19,20; 13,21,22,23’’
    25  chambres                 ?/object: room
    26  que je vais réserver     ?/command-task: reservation
    27  les                      ?/refLink-coRef: plural   reference=‘‘13,14,15,16,17; 13,18,19,20; 13,21,22,23’’
    28  chambres                 ?/object: room
    29  six                      ?/number-room-reservation: 6
    30  chambres individuelles   ?/room-type: single
    31  donnent sur              ?/location-relativeDistance-hotel: near
    32  une cour                 ?/location-relativePlace-general-hotel: unknown
    33  et                       ?/connectProp: addition
    34  un parking privé         ?/hotel-parking: private
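The reference attribute of a plural referring expression, such as the one on segments 24 and 27 above, lists one group of segment ids per member of the plural referent, the groups being separated by semicolons (here, the rooms of the three hotels proposed in W9). The parsing sketch below follows that convention as it appears in this appendix; it is an illustration, not code from the annotation tool.

```python
def parse_reference(ref):
    """Turn a reference string such as
    '13,14,15,16,17; 13,18,19,20; 13,21,22,23'
    into one list of segment ids per member of the (plural) referent."""
    return [
        sorted(int(sid) for sid in group.split(",") if sid.strip())
        for group in ref.split(";")
        if group.strip()
    ]

groups = parse_reference("13,14,15,16,17; 13,18,19,20; 13,21,22,23")
print(groups)  # [[13, 14, 15, 16, 17], [13, 18, 19, 20], [13, 21, 22, 23]]
```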
W11 ‘‘veuillez patienter je recherche cette information je vous propose l' hôtel Richard Lenoir cet hôtel se situe dans un endroit calme près de la place de la Bastille l' hôtel est équipé d' un parking privé surveillé souhaitez-vous faire une réservation dans cet hôtel’’
    ‘‘please wait I'm looking for your information I propose the Richard Lenoir hotel this hotel is located in a quiet place near the place de la Bastille and has got a private parking do you want to book in this hotel’’

C12 ‘‘euh j(e) il y a le parking privé mais c'est un hôtel vous me dites qui est très calme donc il ne donne pas sur une cour il donne sur un boulevard ou pouvez-vous me le situer s' il vous plaît’’
    ‘‘euh I there is a private parking but you tell me it is a very quiet hotel so it does not overlook a courtyard it overlooks a boulevard or can you locate it for me please’’

    35  le parking privé   ?/hotel-parking: private
    36  mais               ?/connectProp: opposition
    37  c'est              ?/refLink-coRef: singular   reference=‘‘13,21’’
    38  un hôtel           ?/DBObject: hotel
    39  très calme         -/location-relativePlace-general-hotel: livelyDistrict
    40  donc               ?/connectProp: implies
    41  il                 ?/refLink-coRef: singular   reference=‘‘13,21’’
    42  donne pas sur      -/location-relativeDistance-hotel: near
    43  une cour           -/location-relativePlace-general-hotel: unknown
    44  il                 ?/refLink-coRef: singular   reference=‘‘13,21’’
    45  donne sur          ?/location-relativeDistance-hotel: near
    46  un boulevard       ?/location-relativePlace-general-hotel: unknown
    47  le                 ?/refLink-coRef: singular   reference=‘‘13,21’’
    48  situer             ?/object: location-hotel
W13 ‘‘je suis désolée je n' ai pas ce type d' informations’’
    ‘‘sorry I don't have that kind of information’’

C14 ‘‘bon ben écoutez je vais réserver dans cet hôtel hôtel Richard Lenoir donc six chambres individuelles pour le trente et un mai deux jours et deux nuits hein’’
    ‘‘well listen I'll book in this hotel hotel Richard Lenoir so 6 single rooms on the 31st of may 2 days and 2 nights OK’’

    49  je vais réserver                      ?/command-task: reservation
    50  dans cet hôtel hôtel Richard Lenoir   ?/name-hotel: richard lenoir
    51  six                                   ?/number-room-reservation: 6
    52  chambres individuelles                ?/room-type: single
    53  pour le trente et un mai              ?/time-date-reservation: 05/31
    54  deux jours et deux nuits              ?/stay-nbNight-reservation: 2

W15 ‘‘merci de patienter je vérifie les disponibilités cet hôtel est complet il n' y a plus de chambres libres correspondant à vos critères souhaitez-vous changer de dates ou réserver dans un autre hôtel’’
    ‘‘please wait I'm checking the availability this hotel is full there are no more free rooms corresponding to your choices do you wish to change the date or book in another hotel’’
C16 ‘‘alors je réserve dans un autre hôtel qui a les mêmes critères hein’’
    ‘‘so I book in another hotel with the same conditions OK’’

    55  je réserve           ?/command-task: reservation
    56  un                   ?/number-hotel: 1
    57  autre                ?/refLink-coDom-exclusion: singular   reference=‘‘13,21’’
    58  hôtel                ?/DBObject: hotel
    59  les mêmes critères   ?/object: undetermined
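Segment 57 carries a coDom-exclusion link: ‘‘un autre hôtel’’ refers to a hotel other than the one described by segments 13 and 21 (the Richard Lenoir). A minimal resolution step for such a link simply removes the excluded referent from the pool of candidates, as in the hypothetical sketch below; the candidate pool and its keys are invented for illustration.

```python
# Hypothetical candidate pool: hotels introduced so far in the dialog, keyed by
# the segment ids that describe them (cf. segments 13-23 above).
candidates = {
    frozenset({13, 14, 15}): "Méridien Bastille",
    frozenset({13, 18}): "Athanor",
    frozenset({13, 21}): "Richard Lenoir",
}

def resolve_exclusion(candidates, excluded_segments):
    """Return the candidates left once the referent described by
    excluded_segments is ruled out, as a coDom-exclusion link requires."""
    excluded = frozenset(excluded_segments)
    return {segs: name for segs, name in candidates.items() if segs != excluded}

# Segment 57 ("autre") carries reference="13,21", i.e. the Richard Lenoir
remaining = resolve_exclusion(candidates, {13, 21})
print(sorted(remaining.values()))  # ['Athanor', 'Méridien Bastille']
```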
W17 ‘‘merci de patienter je vous propose le Méridien Bastille la chambre est à soixante euros souhaitez-vous faire une réservation dans cet hôtel’’
    ‘‘please wait I propose the Méridien Bastille the room is 60 euros do you wish to book in this hotel’’

C18 ‘‘mais écoutez je vais faire la réservation dans cet hôtel il y a bien un parking privé et ça donne s() est-ce que ça donne sur une cour ou sur une rue tranquille’’
    ‘‘but listen I will book in this hotel there is indeed a private parking and it overlooks does it overlook a courtyard or a quiet road’’

    60  je vais faire la réservation   ?/command-task: reservation
    61  cet                            ?/refLink-coRef: singular   reference=‘‘13,14,15’’
    62  hôtel                          ?/DBObject: hotel
    63  il y a bien                    ?/command-dial: confirmation-request
    64  un parking privé               ?/hotel-parking: private
    65  et                             ?/connectProp: addition
    66  donne sur                      ?/location-relativeDistance-hotel: near
    67  une cour                       ?/location-relativePlace-general-hotel: unknown
    68  ou                   ?/connectProp: alternative
    69  sur                  ?/location-relativeDistance-hotel: near
    70  une rue tranquille   ?/location-relativePlace-general-hotel: livelyDistrict
W19 ‘‘cet hôtel se situe dans un endroit calme près de la place de la Bastille l' hôtel est équipé d' un parking privé souhaitez-vous faire une réservation dans cet hôtel’’
    ‘‘this hotel is located in a quiet place near the place de la Bastille the hotel has got a private parking do you want to book in this hotel’’

C20 ‘‘ben écoutez je vais faire une réservation dans cet hôtel pour six chambres individuelles hein’’
    ‘‘well listen I'm going to book in this hotel six single rooms OK’’

    71  je vais faire une réservation   ?/command-task: reservation
    72  cet                             ?/refLink-coRef: singular   reference=‘‘13,14,15’’
    73  hôtel                           ?/DBObject: hotel
    74  pour six                        ?/number-room-reservation: 6
    75  chambres individuelles          ?/room-type: single
W21 ‘‘j' effectue votre réservation le montant de votre séjour s' élève à sept cent vingt euros le numéro de dossier correspondant est le zéro soixante-neuf cent quatre-vingts désirez-vous une autre information’’
    ‘‘I'm doing your reservation the amount of your stay will be 720 euros the file number is 069180 would you like any other information’’

C22 ‘‘oui euh j' aimerais savoir est-ce que le petit déjeuner est compris dans la réservation enfin de la réservation dans le prix de la chambre’’
    ‘‘euh yes I'd like to know if breakfast is included in the reservation well the reservation in the price of the room’’

    76  oui                             ?/response: oui
    77  le petit déjeuner est compris   ?/hotel-services: breakfastInclude
    78  dans la réservation             ?/command-task: reservation
    79  le                              ?/refLink-coRef: singular   reference=‘‘16,17’’
    80  prix                            ?/object: payment-amount-reservation-room
    81  la                              ?/refLink-coRef: singular   reference=‘‘13,14,15,16,17,10’’
    82  chambre                         ?/object: room
W23 ‘‘il vous sera demandé cinq euros supplémentaires pour une formule petit déjeuner’’
    ‘‘breakfast is 5 euros more’’

C24 ‘‘bon ben écoutez je vous remercie de tous ces renseignements donc je confirme et je réserve’’
    ‘‘well listen I thank you for this information so I confirm and I book’’

    83  je confirme   ?/command-dial: confirmation-notice
    84  et            ?/connectProp: addition
    85  je réserve    ?/command-task: reservation
W25 ‘‘merci d' avoir utilisé le serveur vocal MEDIA au revoir’’
    ‘‘thank you for having called the MEDIA vocal server goodbye’’

C26 ‘‘au revoir madame et à bientôt au revoir’’
    ‘‘goodbye madam and see you soon goodbye’’
References

Allemandou, J. (2007). SIMDIAL, un paradigme d'évaluation automatique de systèmes de dialogue homme-machine par simulation déterministe d'utilisateurs. Ph.D. thesis, Université Paris XI, Orsay.
Barras, C., Geoffrois, E., et al. (2001). Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication, 33(1–2), 5–22.
Bonneau-Maynard, H., Ayache, C., Béchet, F., et al. (2006). Results of the French Evalda-Media evaluation campaign for literal understanding. In Proceedings of the international conference on language resources and evaluation (LREC), Genoa (pp. 2054–2059).
Bonneau-Maynard, H., Devillers, L., & Rosset, S. (2000). Predictive performance of dialog systems. In Proceedings of the international conference on language resources and evaluation (LREC), Athens (pp. 177–181).
Bonneau-Maynard, H., & Lefèvre, F. (2005). A 2+1-level stochastic understanding model. In Proceedings of the IEEE automatic speech recognition and understanding workshop (ASRU), San Juan (pp. 256–261).
Bonneau-Maynard, H., & Rosset, S. (2003). Semantic representation for spoken dialog. In Proceedings of the European conference on speech communication and technology (Eurospeech), Geneva (pp. 253–256).
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249–254.
Chinchor, N., & Hirschman, L. (1997). MUC-7 coreference task definition (version 3.0). In Proceedings of the message understanding conference (MUC-7).
Denis, A. (2008). Robustesse dans les systèmes de dialogue finalisés: Modélisation et évaluation du processus d'ancrage pour la gestion de l'incompréhension. Ph.D. thesis, Université Henri Poincaré, Nancy.
Denis, A., Béchet, F., & Quignard, M. (2007). Résolution de la référence dans des dialogues homme-machine : évaluation sur corpus de deux approches symbolique et probabiliste. In Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Toulouse (pp. 261–270).
Denis, A., Quignard, M., & Pitel, G. (2006). A deep-parsing approach to natural language understanding in dialogue system: Results of a corpus-based evaluation. In Proceedings of the international conference on language resources and evaluation (LREC) (pp. 339–344).
Devillers, L., Bonneau-Maynard, H., et al. (2003). The PEACE SLDS understanding evaluation paradigm of the French MEDIA campaign. In EACL workshop on evaluation initiatives in natural language processing, Budapest (pp. 11–18).
FIPA. (2002). Communicative act library specification. Technical report SC00037J, Foundations for Intelligent Physical Agents. http://www.fipa.org/specs/fipa00037/.
Fiscus, J. (1997). A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER). In Proceedings of the IEEE automatic speech recognition and understanding workshop (ASRU), Santa Barbara, CA (pp. 347–352).
Giachim, E., & McGlashan, S. (1997). Spoken language dialog systems. In S. Young & G. Bloothooft (Eds.), Corpus based methods in language and speech processing (pp. 69–117). Dordrecht: Kluwer.
Gibbon, D., Moore, P., & Winski, R. (1997). Handbook of standards and resources for spoken language resources. New York: Mouton de Gruyter.
Hirschman, L. (1992). Multi-site data collection for a spoken language corpus. In Proceedings of the DARPA speech and natural language workshop (pp. 7–14).
King, M., Maegaard, B., Schutz, J., et al. (1996). EAGLES: Evaluation of natural language processing systems. Technical report EAG-EWG-PR.2, Centre for Language Technology, University of Copenhagen.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th international conference on machine learning (ICML), Williamstown, MA (pp. 282–289).
Lamel, L., Rosset, S., et al. (1999). The LIMSI ARISE system for train travel information. In IEEE conference on acoustics, speech, and signal processing (pp. 501–504).
Lefèvre, F., & Bonneau-Maynard, H. (2002). Issues in the development of a stochastic speech understanding system. In Proceedings of the international conference on spoken language processing (ICSLP), Denver (pp. 365–368).
Popescu-Belis, A., Rigouste, L., Salmon-Alt, S., & Romary, L. (2004). Online evaluation of coreference resolution. In Proceedings of the international conference on language resources and evaluation (LREC), Lisbon (pp. 1507–1510).
Raymond, C., Béchet, F., De Mori, R., & Damnati, G. (2006). On the use of finite state transducers for semantic interpretation. Speech Communication, 48(3–4), 288–304.
Rosset, S., & Tribout, D. (2005). Multi-level information and automatic dialog acts detection in human–human spoken dialogs. In Proceedings of ISCA InterSpeech 2005, Lisbon (pp. 2789–2792).
Salmon-Alt, S. (2001). Référence et Dialogue finalisé : de la linguistique à un modèle opérationnel. Ph.D. thesis, Université Henri Poincaré, Nancy.
Salmon-Alt, S., & Romary, L. (2004). Towards a reference annotation framework. In Proceedings of the international conference on language resources and evaluation (LREC), Lisbon.
van Deemter, K., & Kibble, R. (2000). On coreferring: Coreference in MUC and related annotation schemes. Computational Linguistics, 26(4), 629–637.
Vanderveken, D. (1990). Meaning and speech acts. Cambridge: Cambridge University Press.
Villaneau, J., Antoine, J.-Y., & Ridoux, O. (2004). Logical approach to natural language understanding in a spoken dialogue system. In Proceedings of the 7th international conference on text, speech and dialogue (TSD), Brno (pp. 637–644).
Walker, M., Litman, D., et al. (1998). Evaluating spoken dialogue agents with PARADISE: Two case studies. Computer Speech and Language, 12(3), 317–347.
Walker, M., Passonneau, R., & Boland, J. (2001). Quantitative and qualitative evaluation of DARPA Communicator spoken dialog systems. In Proceedings of the annual meeting of the association for computational linguistics (ACL), Toulouse (pp. 515–522).
Walker, M., Rudnicky, A., et al. (2002). DARPA Communicator: Cross-system results for the 2001 evaluation. In Proceedings of the international conference on spoken language processing (ICSLP), Denver (pp. 269–272).