An Approach to Multimodal Input Interpretation in Human-Computer Interaction

Fernando Ferri, Patrizia Grifoni, Stefano Paolozzi
Institute of Research on Population and Social Policies, National Research Council, Via Nizza 128, 00198 Rome, Italy
{fernando.ferri, patrizia.grifoni, stefano.paolozzi}@irpps.cnr.it

Abstract— Human-to-human conversation remains a significant part of our working activities because of its naturalness. Multimodal interaction systems combine visual information with voice, gestures and other modalities to provide flexible and powerful dialogue approaches. The use of integrated multiple input modes enables users to benefit from the natural approach used in human communication. However, natural interaction approaches introduce interpretation problems. In this paper we present an approach to interpret the user's multimodal input. Starting from the analysis of the different types of cooperation among modalities, we take into account the user's input behavior in order to better approximate the resulting multimodal input sentence to the user's intention. The multimodal sentence is transformed into a natural language sentence, and an algorithm computes its exact or approximate interpretation according to its similarity with sentence templates stored in a predefined knowledge base.

Index Terms— Human-computer interaction, multimodal interface, multimodality, natural language processing, sentence similarity.

I. INTRODUCTION

People can communicate with great efficiency and expressiveness by adopting natural interaction approaches. Perhaps this is the main reason why face-to-face conversation remains such a significant part of our working activities despite the availability of a great number of communication technologies. Nevertheless, there is great research interest in natural interaction approaches, both to develop technologies that facilitate them [1], [2], [3] and to improve their interpretation on the computer side. From this perspective, multimodality has the potential to greatly improve human-computer interaction by harmoniously combining the different communication methods: a user can employ voice, handwriting, sketching and gesture to input information, while the system can use icons, text, sound and voice to present its output. This paper uses the concept of multimodal language, defined as a set of multimodal sentences [4] by extending the definition of visual language given in [5]. A multimodal sentence contains atomic elements (glyphs/graphemes, phonemes and so on) that form the Characteristic Structure

(CS). The CS is given by the elements that form functional or perceptual units for the user. A multimodal sentence is defined, similarly to [6], as a function of: 1) the multimodal message; 2) the multimodal description, which assigns the meaning to the sentence; 3) the interpretation function, which maps the message to the description; and 4) the materialization function, which maps the description to the message. One of the most relevant problems of multimodal dialog is that the naturalness of communication is directly proportional to the complexity of interpreting the messages. The goal of this paper is to propose a new approach to understand how the different modalities cooperate with each other, taking into account the user's behavior, which can alter the multimodal input recognized by the system. The resulting multimodal input sentence is matched against the templates stored in a knowledge base to provide an interpretation of the sentence; the sentence can match a template either exactly or approximately. We consider speech the prevalent modality because, generally, users explain their intentions by speech and use the other modalities to "support" the speech and, eventually, to resolve ambiguities. For this reason the system maps the concepts involved in the multimodal sentence onto a natural language sentence: each multimodal sentence corresponds to a natural language one. The system returns the interpretation of the multimodal sentence if a template that exactly matches the corresponding natural language sentence is available; when the user interacts with the system, the corresponding multimodal sentence must refer to a template stored in the knowledge base in order to be interpreted. If the corresponding natural language sentence does not match any of the stored templates, the system can still interpret it by considering similar templates and approximating the matching. Because multimodal sentences with different templates can have very close interpretations (sometimes they have the same meaning), this paper proposes to compute the templates' similarity starting from the semantic similarity of the natural language sentences corresponding to the multimodal ones. For this purpose it is possible to associate a sentence

and its template with the most similar template by computing the semantic similarity between natural language sentences. In this way we can provide an interpretation of the multimodal sentence also in the case of non-perfect matching. The natural language sentence can be represented as a semantic network of objects and binary relations among them, where each object corresponds to one node.
The rest of the paper is structured as follows: Section II briefly reviews related work; Section III provides a running example; Section IV illustrates our approach to interpret the multimodal input based on the type of modality interaction and on the user's behavior; Section V is devoted to the evaluation of sentence similarity in the multimodal context; Section VI concludes and describes some future research.


II. RELATED WORKS

The problem of multimodal system interaction has been studied for several years and still remains an active research field. Several studies have examined multimodal interaction and interpretation, addressing in particular sketch-speech interaction. For example, [7] discusses an integration technique to solve the problem of interpreting multimodal queries constituted by multiple speech and sketch inputs. In [8] the importance of studying the user's behavior in order to correctly interpret multimodal inputs that combine sketching with speech is underlined, enabling a more natural form of communication. In [9] and [4] interesting approaches for studying multimodal interaction are proposed, showing how the use of integrated multiple input modes enables users to benefit from the natural approach used in human communication.

III. A MOTIVATING AND RUNNING EXAMPLE

In order to explain our approach we consider the following example: the user draws an Entity-Relationship scheme, assigning a label to each construct. Without loss of generality we assume that the input is given by only two modalities: speech and sketch. This scenario is represented in Fig. 1.

Fig. 1. An example of multimodal interaction: (a) what the user sketches; (b) what the user says ("The first rectangle is the entity Professor", "Second rectangle is the entity Course", "The rhombus is the relationship Teaching"); (c) what the user wants (the entities Professor and Course connected by the relationship Teaching).

Suppose that the user sketches the diagram as shown in Fig. 1a and, at the same time, utters the sentences shown in Fig. 1b. The system has to interpret the multimodal sentence and then has to materialize it, as shown in Fig. 1c. For the sake of simplicity we only consider the creation of the Professor entity. The user says "The first rectangle is the entity Professor" and, at the same time, sketches the figure of a rectangle (in his/her intention). If the figure is properly interpreted by the sketch recognizer as a rectangle (Fig. 2a), we are in the typical situation of redundant information given by different modalities (i.e., the concept "rectangle" is expressed by both the speech and the sketch modalities). The natural language sentence associated with the multimodal one will be "The first rectangle is the entity Professor", and it represents all the concepts expressed by speech and sketch. Given such a sentence, a corresponding template must be found in the knowledge base. The associated template (in lexical form) is:

DT + JJ + rectangle + VB + DT + entity + NN

where DT refers to a determiner, VB to a verb, JJ to an adjective, and NN to a singular noun (a small illustrative matcher is sketched at the end of this section). The situation is more complex if there are ambiguities, for example if the user draws a figure that in his/her intention is a rectangle but is recognized by the system as a square. In this case the system is unable to identify that the given input is redundant and can produce an incorrect interpretation of the multimodal sentence. Both situations are illustrated, with the corresponding timelines, in Fig. 2.

Fig. 2. Example of multimodal input (by speech and sketch) with timeline: in (a) the sketched figure is recognized as a rectangle, in (b) as a square, while the voice channel carries "The first rectangle is the entity Professor" in both cases.

From the user's point of view both situations should produce the same multimodal sentence, but due to ambiguities in sketch recognition the system does not produce the same sentence in the two cases, and the sentences are not associated with the same template. In order to avoid these problems we propose an approach that takes into account the user's behavior to reproduce the multimodal sentence intended by the user.
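To make the template matching concrete, the following Python sketch checks a POS-tagged sentence against a template in the lexical form shown above. The tagger output, the wildcard tag set and the function name are illustrative assumptions, not the system's actual implementation.

```python
# Illustrative template matcher (not the paper's implementation).
# A template slot is either a POS tag (wildcard) or a literal word.
POS_TAGS = {"DT", "JJ", "VB", "NN"}

def matches(template, tagged_sentence):
    """template: list of POS tags / literal words;
    tagged_sentence: list of (word, POS tag) pairs from an external tagger."""
    if len(template) != len(tagged_sentence):
        return False
    for slot, (word, tag) in zip(template, tagged_sentence):
        if slot in POS_TAGS:
            if tag != slot:                     # wildcard slot: POS tag must agree
                return False
        elif word.lower() != slot.lower():      # literal slot: exact word match
            return False
    return True

template = ["DT", "JJ", "rectangle", "VB", "DT", "entity", "NN"]
sentence = [("The", "DT"), ("first", "JJ"), ("rectangle", "NN"),
            ("is", "VB"), ("the", "DT"), ("entity", "NN"), ("Professor", "NN")]
print(matches(template, sentence))  # True: the sentence fits the template
```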

A measure of similarity between the user's sketch and the required figure (in this case a rectangle) is considered. This measure represents a fundamental element for the computation of concept similarity. More formally, if C_i is the i-th sample drawn by the user and n is the total number of samples, the user behaviour approximation A_ss (for sketch-speech modality interaction) is:

A_{ss} = \frac{1}{n} \sum_{i=1}^{n} C_i, \qquad A_{ss} \in (0,1)    (1)
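As a minimal illustration of eq. (1), the Python sketch below averages the per-sample scores; it assumes each C_i is a sketch-recognition similarity score in (0, 1), and the function name is ours, not the paper's.

```python
def user_behaviour_approximation(scores):
    """Eq. (1): A_ss as the mean of the per-sample scores C_i."""
    if not scores:
        raise ValueError("at least one sample is required")
    return sum(scores) / len(scores)

# e.g. three sketched samples recognized with varying confidence
print(user_behaviour_approximation([0.92, 0.85, 0.78]))  # 0.85
```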

Fig. 4. The system's architectural flow.

Fig. 5. Time intervals and type of interaction.

The concepts are stored in the concepts database as natural language terms. We adopt an extended word-similarity algorithm to calculate the semantic similarity between concepts, using the WordNet lexical database [10].

B. Evaluating concept similarity
The taxonomical structure of the WordNet knowledge base is important in determining the semantic distance between words. In WordNet, terms are organized into synonym sets (synsets), with semantics and relation pointers to other synsets. One direct method for similarity computation is to find the minimum length of the path connecting the two words [11]. For example, the shortest path between student and schoolmate in Fig. 6 is student-enrollee-person-acquaintance-schoolmate, so the minimum path length is 4 and the synset of person is called the subsumer of student and schoolmate; the minimum path length between student and professor is instead 7.
Thus, we could say that schoolmate is more similar to student than professor is to student. However, this method may not be sufficiently accurate when it is applied to a large and general semantic net such as WordNet. For example, the minimum path length from student to animal is 4 (see Fig. 6), less than that from student to professor; yet, intuitively, student is more similar to professor than to animal. To address this weakness, the direct path length method must be modified as proposed in [12], exploiting more information from the hierarchical semantic net. Concepts at the upper layers of the WordNet hierarchy have more general semantics and less similarity between them, while words at the lower layers have more concrete semantics and a higher similarity; therefore, the depth of a word in the hierarchy should also be considered. In summary, the similarity between words is determined not only by path length but also by depth (the level in the hierarchy). Moreover, we take the user behaviour into account through the user behaviour approximation A_ss. The proposed algorithm is an extension of the one proposed in [12] for word similarity.

Fig. 6. A portion of WordNet's semantic network, with nodes such as entity; organism, being; animal, beast; person, human; adult, grownup; professional, professional person; acquaintance; enrollee; student, pupil, educatee; educator, pedagogue; academician, faculty member; and professor, prof.

Given two concepts, c1 and c2, we need to find their semantic similarity s(c1, c2). Let l be the shortest path length between c1 and c2, and h the depth of their subsumer in the hierarchical semantic net; the semantic similarity can then be written as:

s(c_1, c_2) = f_1(l) \cdot f_2(h) \cdot B_{ss}    (2)

B_ss represents the contribution of the user's behaviour and its value depends on the product of f_1(l) and f_2(h): if this product is not lower than a threshold value δ, B_ss is equal to (A_ss)^{-1}, otherwise B_ss is equal to 0. More formally:

B_{ss} = \begin{cases} 0 & \text{if } f_1(l) \cdot f_2(h) < \delta \\ 1/A_{ss} & \text{if } f_1(l) \cdot f_2(h) \ge \delta \end{cases}    (3)

The threshold value used in our experiments is 0.4. The path length between two concepts c1 and c2 is computed according to one of the following cases:
1. c1 and c2 belong to the same synset;
2. c1 and c2 do not belong to the same synset, but their synsets contain one or more common words;
3. c1 and c2 neither belong to the same synset nor do their synsets contain any common word.
The first case implies that c1 and c2 have the same meaning, so we assign the value 0 to the semantic path length between c1 and c2. In the second case c1 and c2 partially share the same features, so we set the semantic path length between c1 and c2 to 1. In the third case we count the actual path length between c1 and c2. Following these considerations, f_1 is taken to be a monotonically decreasing function of l:

f_1(l) = e^{-\alpha l}    (4)

where α ∈ [0,1]. As for the depth contribution, since words at the upper layers of the WordNet hierarchy have more general semantics and less similarity between them, while words at the lower layers have more concrete semantics and a higher similarity, f_2(h) should be a monotonically increasing function of the depth h:

f_2(h) = \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}    (5)

where β ∈ (0,1]. The complete word-similarity formula, combining (4) and (5) with the user-behaviour contribution, is:

s(c_1, c_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}} \cdot B_{ss}    (6)

The values of α and β depend on the knowledge base used; for WordNet the optimal parameters are α = 0.2 and β = 0.45, as explained in [13].
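As a concrete illustration of eqs. (2)-(6), the sketch below computes the concept similarity once the shortest path length l and the subsumer depth h are known (e.g., obtained from WordNet). The parameter values follow the text; the function names and the example values of l, h and A_ss are hypothetical, not taken from the paper's experiments.

```python
import math

# Parameters from the text: alpha, beta for WordNet, delta as the B_ss threshold.
ALPHA, BETA, DELTA = 0.2, 0.45, 0.4

def f1(l):
    """Path-length contribution, eq. (4): decreases monotonically with l."""
    return math.exp(-ALPHA * l)

def f2(h):
    """Depth contribution, eq. (5); tanh(beta*h) equals (e^bh - e^-bh)/(e^bh + e^-bh)."""
    return math.tanh(BETA * h)

def b_ss(l, h, a_ss):
    """User-behaviour factor, eq. (3): 1/A_ss when f1*f2 reaches the threshold, else 0."""
    return 1.0 / a_ss if f1(l) * f2(h) >= DELTA else 0.0

def concept_similarity(l, h, a_ss):
    """Semantic similarity of eq. (6)."""
    return f1(l) * f2(h) * b_ss(l, h, a_ss)

# Hypothetical values: path length 2, subsumer depth 5, A_ss = 0.85.
print(concept_similarity(2, 5, 0.85))  # ~0.77
```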

V. EVALUATING MULTIMODAL SENTENCE SIMILARITY

To compare the user's input multimodal sentence with the stored templates, the system proceeds as follows: let T1 be the template associated with the user's input sentence, denoted T1^S, and T2 the template associated with the stored sentence T2^S.


According to [12], we construct a vector V that contains all the distinct words of the associated sentences T1^S and T2^S (the joint word set). The semantic similarity between the two sentences is then defined as:

S_s = \frac{s_1 \cdot s_2}{\|s_1\| \, \|s_2\|}    (7)

where s_i (i = 1, 2) are the semantic vectors of the two sentences. The value of an element of a semantic vector is defined as follows:

\sigma_j = \hat{s}_j \cdot I(w_j) \cdot I(\hat{w}_j)    (8)

where w_j is a word in the joint word set, \hat{w}_j is its associated word in the sentence, and \hat{s}_j is the j-th element of the lexical vector \hat{s} (introduced below). The function I ∈ [0,1] is based on the notion of information content introduced by Resnik: the information content of a concept quantifies its informativeness, which decreases as the probability of the concept increases [14]. This approach has been chosen because various results in the literature show that it correlates with human judgment better than the traditional edge-counting approach [15]. It has been shown that words that occur with a higher frequency (in a corpus) contain less information than those that occur with lower frequencies:

I(w) = 1 - \frac{\log(n+1)}{\log(N+1)}    (9)

where N is the total number of words in the sentence and n is the frequency of the word w in the sentence (increased by 1 in (9) to avoid presenting an undefined value to the logarithm). The lexical vector \hat{s} is determined by the semantic similarity of each word in the joint word set to the words in the sentence. Each entry of the semantic vector corresponds to a word in the joint word set, so its dimension is equal to the number of words in V. Taking T1^S as an example:
- if w_j appears in the sentence, \hat{s}_j is set to 1;
- if w_j is not contained in T1^S, a semantic similarity score is computed between w_j and each word of T1^S, using a method similar to the one illustrated for concept similarity in the previous section; the most similar word in T1^S to w_j is the one with the highest similarity score, and \hat{s}_j is set to that score.
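To make the procedure of this section concrete, the following sketch combines eqs. (7)-(9). It is only an illustration under stated assumptions: tokenization is naive whitespace splitting, the word-level similarity is passed in as a function (e.g., the concept similarity of the previous section), and names such as semantic_vector and sentence_similarity are ours, not part of the described system.

```python
import math

def information_content(word, sentence_words):
    """Eq. (9): I(w) = 1 - log(n+1)/log(N+1), with n the frequency of w in the
    sentence and N the number of words in the sentence."""
    n = sentence_words.count(word)
    N = len(sentence_words)
    return 1.0 - math.log(n + 1) / math.log(N + 1)

def semantic_vector(sentence_words, joint_words, word_similarity):
    """Eq. (8): each entry combines the best lexical match s_hat with the
    information content of the joint word and of its associated word."""
    vec = []
    for w in joint_words:
        if w in sentence_words:
            s_hat, w_hat = 1.0, w
        else:
            # most similar word of the sentence to the joint word w
            w_hat = max(sentence_words, key=lambda x: word_similarity(w, x))
            s_hat = word_similarity(w, w_hat)
        vec.append(s_hat * information_content(w, sentence_words)
                         * information_content(w_hat, sentence_words))
    return vec

def sentence_similarity(s1_words, s2_words, word_similarity):
    """Eq. (7): cosine similarity of the semantic vectors over the joint word set V."""
    joint = sorted(set(s1_words) | set(s2_words))
    v1 = semantic_vector(s1_words, joint, word_similarity)
    v2 = semantic_vector(s2_words, joint, word_similarity)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```

With a word_similarity function that returns 1 for identical words and 0 otherwise, the measure reduces to a lexical-overlap cosine weighted by information content; richer WordNet-based measures make it sensitive to near-synonyms as well.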

VI. CONCLUSION

Multimodal human-computer interaction, in which the computer accepts input from multiple channels or modalities, is more flexible, natural, and powerful than unimodal interaction with input from a single modality. However, the naturalness of communication is directly proportional to the complexity of interpreting the messages. In this paper we have presented an approach to interpret multimodal input sentences, and we have implemented a multimodal system based on the speech and sketch modalities that uses this methodology to better understand the user's input, also taking into account the user's way of providing it. Future work will address other input modalities, such as gesture, which could also help disambiguate the sketches and better capture the user's intentions.

REFERENCES

[1] J.-L. Binot, P. Falzon, D. Sedlock, M. D. Wilson, "Architecture of a multimodal dialogue interface for knowledge-based systems", in Proc. of the Esprit'90 Conference, Kluwer Academic Publishers, Dordrecht, 1990, pp. 412-433.
[2] M. Johnston, P. R. Cohen, D. McGee, S. L. Oviatt, J. A. Pittman, I. Smith, "Unification-based Multimodal Integration", in Proc. of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, 1997, pp. 281-288.
[3] W. Wahlster, "User and discourse models for multimodal communication", in Readings in Intelligent User Interfaces, Morgan Kaufmann Publishers Inc., San Francisco, 1998, pp. 359-371.
[4] M. C. Caschera, F. Ferri, P. Grifoni, "Multimodal interaction systems: information and time features", International Journal of Web and Grid Services, vol. 3, no. 1, 2007, pp. 82-99, to be published.
[5] P. Bottoni, M. F. Costabile, S. Levialdi, P. Mussio, "Formalizing visual languages", in Proc. of the IEEE Symposium on Visual Languages '95, IEEE CS Press, 1995, pp. 334-341.
[6] A. Celentano, D. Fogli, P. Mussio, F. Pittarello, "Model-based Specification of Virtual Interaction Environments", in Proc. of the 2004 IEEE Symposium on Visual Languages - Human Centric Computing (VLHCC'04), 2004, pp. 257-260.
[7] B.-W. Lee, A. W. Yeo, "Integrating sketch and speech inputs using spatial information", in Proc. of the 7th Int. Conf. on Multimodal Interfaces, ACM Press, 2005, pp. 2-9.
[8] A. Adler, R. Davis, "Speech and sketching for multimodal design", in Proc. of the 9th Int. Conf. on Intelligent User Interfaces, ACM Press, 2004, pp. 214-216.
[9] J. C. Martin, "Toward intelligent cooperation between modalities: the example of a system enabling multimodal interaction with a map", in Proc. of the Int. Joint Conf. on Artificial Intelligence (IJCAI'97) Workshop on "Intelligent Multimodal Systems", Nagoya, Japan, 1997.
[10] WordNet 2.1: A lexical database for the English language. http://www.cogsci.princeton.edu/cgi-bin/webwn, 2005.
[11] R. Rada, H. Mili, E. Bichnell, M. Blettner, "Development and Application of a Metric on Semantic Nets", IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, 1989, pp. 17-30.
[12] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea, K. Crockett, "Sentence Similarity Based on Semantic Nets and Corpus Statistics", IEEE Trans. Knowledge and Data Engineering, vol. 18, no. 8, 2006, pp. 1138-1150.
[13] Y. Li, D. McLean, Z. A. Bandar, "An Approach for Measuring Semantic Similarity Using Multiple Information Sources", IEEE Trans. Knowledge and Data Engineering, vol. 15, no. 4, 2003, pp. 871-882.
[14] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", in Proc. of the International Joint Conference on Artificial Intelligence, 1995, pp. 448-453.
[15] J. H. Lee, M. H. Kim, Y. J. Lee, "Information retrieval based on conceptual distance in is-a hierarchies", Journal of Documentation, vol. 49, no. 2, 1989, pp. 188-207.
