VoiceUNL: a proposal to represent speech control mechanisms within the Universal Networking Digital Language

Mutsuko Tomokiyo (1) and Gérard Chollet (2)

(1) GETA-CLIPS-IMAG, CNRS-INPG-UJF, BP 53-38041 Grenoble 9, France, [email protected]
(2) ENST, CNRS-URA-820, 46 rue Barrault, 75634 Paris cedex 13, France, [email protected]

Abstract – The Universal Networking Language (UNL) is a formal representation of written text which can be used to generate an equivalent text in a different language. The VoiceUNL proposal is an extension to orality. This paper focusses on the encoding of emotional expressions for oral dialogues. The work is done in the framework of the "Normalangue" project (3), which supports standardization activities in natural language processing. Applications could be found in Speech-to-Speech Machine Translation (SSMT) and in language-independent vocal information encoding for interactive voice servers. One of the key objectives of the "TECHNOVOC-NORMAVOX" project (4) resides in the standardization and formalization of speech dialogues, in order to participate in the normalization activities of spoken language processing, to make progress in multilingual vocal application development, and to respond to users' needs where information is exchanged in a multimodal environment. It is against this background that we establish the new approach proposed here. From a technical point of view, a strong point of the UNL resides in its Universal Word (UW) dictionary, which enables us to specify word meaning at a deep level and to perform lexical disambiguation in a semantically oriented formalism; another is its worldwide partnership as a working group, which enables us to propose joint research on some major difficulties of multilingual MT. The UNL is, however, a language completely oriented towards written texts, so it cannot keep up with the recent trend towards SSMT in spite of its robust multilingual directivity. In this paper we explore the possibility of extending the UNL "Attributes" tags to suit SSMT processing in such contexts as the "Lingtour" (5) or "TECHNOVOC-NORMAVOX" projects. Such an extension is possible since the UNL is based on a pivot transfer method using syntactico-semantic tags, and it is devoted to multilingual generation. This is the key reason why we propose to cope with SSMT by including voice, speech and gesture control based on the UNL.

Index terms – emotion representation, non-verbal language, Speech-to-Speech Machine Translation, UNL, UW

3 The Normalangue project concerns the normalization of linguistic resources. It regroups two sub-projects, RNIL and TECHNOVOC, which concern the normalization of technologies applied to the engineering of written and spoken language. RNIL focusses on language engineering, whereas the Technolangue project is interested in vocal technologies in the framework of man-machine dialogues [13].
4 TECHNOVOC-NORMAVOX was launched in 2002 as a regrouped project and a branch of the Normalangue project [22], by a consortium of eight partners (universities and industries): SIEMENS, TELISMA, IDYLIC, DIALOCA, ELAN Speech, ST Microelectronics, LORIA, ENST Paris.
5 The Lingtour project was launched in 2002 by a partnership consisting of TsingHua University (China), Paris 8 University (France), INT (France), ENST Paris and Bretagne (France) and CLIPS (France). One of the objectives of the project resides in R&D to enable multilingual-multimedia MT with user-friendly tools [9].

INTRODUCTION

The UNL is expected to display its power in multilingual machine translation, but it completely ignores voice, speech, and image processing, which are indispensable for multimodal treatment such as SSMT or human-machine dialogue systems. Our idea is to apply the UNL formalism to the translation step of SSMT, and to capture the speech dialogue representation by adding dialogue property tags to the current UNL tag set. The UNL consists of "UWs", "Relations", "Attributes" and the UNL knowledge base, represented in the form of tags. The UWs are the vocabulary of the UNL, "Relations" and "Attributes" make up its syntax, and the knowledge base covers its semantics [24]. Here is an example:

eg.1
- May I smoke?
- No! You may not, Victor [23].

[S:01]
{org:e1} - May I smoke? {/org}
{unl}
agt(smoke(icl>do).@entry.@present.@may.@interrogative, I)
{/unl}
[/S]


[S:02]
{org:e2} - No! You may not, Victor (6) {/org}
{unl}
agt(smoke(icl>do).@entry.@present.@may.@not, you)
mod(smoke(icl>do).@entry.@present.@may.@not, no)
mod(no, !(icl>symbol).@interjection)
mod(smoke(icl>do).@entry.@present.@may.@not, Victor(icl>name).@vocative)
{/unl}
[/S]

6 How to deal with this utterance is open to further discussion: as one sentence, or as two sentences like "No!" and "You may not, Victor."

UWs are character strings that can be given specifications, attributes and an Instance-ID in the UNL semantic representation, namely the graphs. The UW dictionary includes basic UWs (bare English words, called Head words), restricted UWs (English words with a constraint list), and extra UWs, which are a special type of restricted UWs [24]. "icl" in the constraint list makes it possible to define a subconcept of the Head word.

eg.
smoke(icl>do)
sad(icl>sentiment>unhappy)
state(icl>abstract thing)
state(icl>country)

The graphs in eg.1 do not contain any tags apart from the UNL tags; additional tags will therefore be merged with them, and they will be embedded in another format, for the purposes of SSMT. This paper focusses on the semantic representation of emotional expressions in dialogues. In section 2, we analyze emotional representations in speech dialogues using a small visual corpus; in section 3, we present our approach to parameterizing emotion-eliciting factors; finally, in section 4, we propose a semantic representation for emotional expressions in dialogues. This is on-going work, and we are developing a visual corpus from TV programs.

EMOTIONAL EXPRESSIONS IN CORPUS

Emotional expressions obviously play an important role in human-to-human communication. In human-to-machine interaction, therefore, their interpretation and generation enrich the dialogue on an interactive system, and for SSMT in a multimodal setting they are quite helpful for conveying the speaker's intention to the hearer more correctly and for disambiguating polysemic expressions in dialogues. This is why we are interested in emotional representation.

How many, and what kind of, emotional expressions are used in general dialogues? In this work, we adopt the 24 categories of human emotions defined by OCC [14] (7) [2], and we simplify them into the following 10 categories [27] (8) [15] (9) [6] (10) [20] (11):

(1) happiness, (2) sadness, (3) disgust, (4) surprise, (5) fear, (6) anger, (7) irritation, (8) hesitation, (9) uncertainty and (10) neutral

For example, "No!" in eg.1 expresses the surprise of Victor's father, because Victor is a small boy. As for the emotion-eliciting factors, they are an essential question to be addressed when one copes with emotion representation. In our corpus, the major ones are the following:

• lexicon (sad, happy, etc.)
• phatics (ah, hein, etc.)
• prosodies (fast, slow, strong, etc.)
• voice (noisy, soft, etc.)
• gestures (movement of hands, mouth, eyes, etc.)

In the example, note that the surprise can be represented by the lexicon "No!". At the same time, however, on TV the father also made a grimace while speaking "No!" sharply, so the surprise is also expressed by the movement of his eyebrows and by his voice tone. The table below illustrates the potential representation of the different emotions in terms of the above-mentioned emotion-eliciting factor attributes. The column numbers correspond to the emotion types defined above.

7 Emotions are analyzed from various points of view in the theory: basic emotions, emotions as reactions to events, self-pity and related states, prospect-based emotions, etc. [14].
8 The classification of emotions is still open to suggestions. Morita classified Japanese emotional words into 40 categories, covering negative or positive judgment, sense of color or sound, psychological reactions, etc. [27].
9 Paul Ekman mentioned that a second emotion may be linked with a first emotion, and that emotions rarely occur simply or in pure form. We have not yet considered such complex emotions [15].
10 Plutchik's emotion wheel consists of eight primary emotions: fear, surprise, sadness, disgust, anger, anticipation, joy and acceptance [6].
11 The emotion study group in Southern Kings Consolidated school classifies the emotions as follows: thankfulness, envy, disgust, worry, kindheartedness, stress, boredom, sadness, loneliness, bravery, paranoia, optimism, stubbornness, fear, anxiety [20].

TABLE: POTENTIAL REPRESENTATION OF EMOTIONS


[Table body: the rows are the emotion-eliciting factors (lexicon, phatics, prosodies, voice, hands, mouth, eyes, eyebrows, head, shoulders) and the columns are the emotion categories (1) to (10). An asterisk marks that a factor can express the corresponding emotion; the lexicon row is marked under all ten categories, while each of the other factors covers only a subset.]
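To make the table concrete, the "No!" utterance of eg.1, interpreted as surprise, could be annotated with its eliciting factors along the following lines. This sketch is purely illustrative: the element and attribute names are hypothetical and merely anticipate the XML encoding proposed in the last section.

  <emotion type="surprise">
    <factor kind="lexicon" value="No!"/>
    <factor kind="prosody" value="sharp"/>
    <factor kind="eyebrows" value="frown"/>
    <factor kind="voice" value="loud"/>
  </emotion>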

EXTENSION OF UNL AND PROPOSED TAGS

TRANSFER OF EMOTIONAL EXPRESSIONS AND UNL

The task facets of SSMT could be simplified as follows, when we apply the UNL formalism to it (12) [8] [19]:

1. Speaker recognition (W3C type)
2. Gesture and face movement recognition, speech recognition (W3C type)
3. Transcription and text transfer (UNL)
4. Target language generation (Ariane-G5 tree structures)
5. Voice, speech, and gesture synthesis (W3C type)

FIGURE 1 (13): TASK FACETS OF SSMT

Throughout the whole process from 1. to 5., voice recognition, speech recognition, and speech synthesis are dealt with monolingually, while the source gesture and source language are transferred into the target gesture and target language, respectively, as the MT process. The speaker's emotions are expressed in various ways: lexicon, phatics, prosodies, voice, and gesture (movement of hands, mouth, eyes, eyebrows, head and shoulders), as mentioned above. Emotions expressed by lexicon and phatics are recognized at 2., interpreted, and generated at 4., as illustrated in Fig. 1. Expressions by voice, gestures and prosodies are recognized at 2. and generated at 5. All task facets defined in Fig. 1 concern the interpretation, generation or control of emotions.

12 Among the tasks defined in Fig. 1 [8], task 2 (speech recognition and gesture control) and task 5 (speech synthesis) have already been achieved for English and Italian by Pélachaud's group, a research partner in "Lingtour", in the framework of the MagiCstar project [3]. A complete model of face movement has also been proposed for English and French speakers at the ICP labs (Grenoble) and GRAVIR labs (Grenoble) [10]. We show processes 3, 4 and 5 of Fig. 1 (from the UNL formalism to speech synthesis) in XML formalism, taking a simple example from an educational TV program on Arte.
13 In Fig. 1 [8], "W3C rec. type" at 1, 2, and 5 refers to an available markup language for each task indicated. Ariane-G5 at 4 is an environment used as the UNL-French generator, developed by GETA (CLIPS, Grenoble) for UNL graph-to-tree conversion [4] [5] [18].

Semantic representation in the UNL is currently designed with a set of 113 tags, grouped into Relation tags and Attribute tags [24]. In order to cope with SSMT, some tags which cover spoken-language properties are merged with the UNL tag set: prosody tags for speech recognition and synthesis, behaviour tags for gesture control processing, and in particular speech act tags for utterance semantics [11].

The prosody elements make it possible to control speaking speed and volume, by interpreting the pitch contour, range, rate, duration, and volume for the contained text. In the W3C recommendation, the values "x-high", "high", "medium", "low", "x-low", or "default" are given for the pitch-contour and range attributes; "x-fast", "fast", "medium", "slow", "x-slow", or "default" for the rate; and "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default" for the volume.

The body movement elements make it possible to control gestures or facial expressions by interpreting the movement of the eyebrows, mouth, hands, shoulders, and head. The MPEG-4 standard includes face and body animation tools which allow sending parameters that can define, calibrate and animate synthetic faces and bodies [12]. So we can analyze and synthesize an emotion by interpreting the given MPEG-4 Facial Animation Parameters, comparing them with the face object that contains a generic face with a neutral expression [3] [10].

eg. with regard to the neutral expression of the face
eyebrows: left- or right-raised eyebrow
head: left- or right-leaned head, up-and-down shaken head, side-to-side shaken head
mouth: open mouth, left- or right-raised mouth

The speech act tags make it possible to identify and disambiguate the utterance goal. At present, 28 types of speech act tags are proposed [11], e.g. "request", "question", "acknowledge", "invite", "offer", "vocative", "confirm", "thanks", etc. All tags proposed for the purposes of SSMT appear in the Appendix.
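As an illustration of the prosody values above, here is a minimal sketch of a W3C-style prosody annotation for the father's "No!". The prosody element and its pitch, rate and volume attributes follow the W3C speech synthesis markup referred to above; the particular values chosen for this utterance are our own assumption.

  <speak>
    <prosody pitch="x-high" rate="fast" volume="loud">
      No! You may not, Victor.
    </prosody>
  </speak>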

FORMAT TRANSFORMATION AND EMOTION REPRESENTATION


The UNL is a language of graphs and, as such, is not easy to encode in a linear data stream. It is, however, feasible to project it onto a description format like XML, which authorizes the definition of elements and attributes. The resulting representation consequently offers the same expressive power as the graphs, but in the form of tags, which are easy to transmit and can be processed by any software able to interpret the DTD (Document Type Definition) of the XML norm [22]. We have therefore transformed UNL graphs into XML format, because this evidently facilitates the processing of speech synthesis information after generation of the target language [26]. Here is a transformation example from UNL to XML format for the same utterances from TV [23]; it includes the speech act tag and the prosody tag in eg.2, and the emotion tag in eg.3. In the semantic representation of the dialogues, the father's emotion is interpreted as surprise by looking up his prosody and eyebrow movements when he pronounced "No!".

eg.2
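As a rough illustration of the kind of encoding intended, the UNL content of [S:02] might be wrapped as follows. The element and attribute names in this sketch are hypothetical and do not reproduce the actual tag set of eg.2; the emotion value follows the interpretation given above (surprise, elicited by prosody and eyebrow movements).

  <utterance id="S:02">
    <speech-act type="refuse"/>
    <emotion type="surprise" factors="prosody eyebrows"/>
    <prosody pitch="x-high" rate="fast" volume="loud">
      <org>No! You may not, Victor.</org>
      <unl>
        agt(smoke(icl>do).@entry.@present.@may.@not, you)
        mod(smoke(icl>do).@entry.@present.@may.@not, no)
        mod(no, !(icl>symbol).@interjection)
        mod(smoke(icl>do).@entry.@present.@may.@not, Victor(icl>name).@vocative)
      </unl>
    </prosody>
  </utterance>

In such a form, a speech synthesizer can read the prosody values directly at step 5 of Fig. 1, while the embedded UNL content feeds the target language generator at step 4.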