Exploiting Correlation between Body Gestures and Spoken Sentences for Real-time Emotion Recognition

Fabrizio Milazzo∗ Università degli Studi di Palermo, Dipartimento dell’Innovazione Industriale e Digitale (DIID), Palermo, Italy [email protected]

Agnese Augello Consiglio Nazionale delle Ricerche Istituto di Calcolo ad Alte Prestazioni Palermo, Italy [email protected]

Giovanni Pilato Consiglio Nazionale delle Ricerche Istituto di Calcolo ad Alte Prestazioni Palermo, Italy [email protected]

Vito Gentile∗ Università degli Studi di Palermo, Dipartimento dell’Innovazione Industriale e Digitale (DIID), Palermo, Italy [email protected]

Antonio Gentile∗† Università degli Studi di Palermo, Dipartimento dell’Innovazione Industriale e Digitale (DIID), Palermo, Italy [email protected]

Salvatore Sorce∗† Brunel University London College of Engineering, Design and Physical Sciences Department of Computer Science Uxbridge, United Kingdom [email protected]

ABSTRACT

Humans communicate their affective states through different media, both verbal and non-verbal, often used at the same time. Knowledge of the emotional state plays a key role in providing personalized and context-related information and services. This is the main reason why several algorithms have been proposed in the last few years for automatic emotion recognition. In this work we exploit the correlation between one's affective state and the simultaneous body expressions in terms of speech and gestures, and we propose a system for real-time emotion recognition from gestures. In a first step, the system builds a trusted dataset of association pairs (motion data → emotion pattern), also based on textual information. Such dataset is the ground truth for a further step, where emotion patterns can be extracted from new unclassified gestures. Experimental results demonstrate good recognition accuracy and real-time capabilities of the proposed system.

∗ Member of the Ubiquitous Systems and Interfaces (USI) research group, https://usi.unipa.it
† Also with InformAmuse s.r.l., Via Nunzio Morello 20, 90144 Palermo, Italy, http://www.informamuse.com

CCS CONCEPTS
• Information systems → Nearest-neighbor search; Similarity measures; • Computing methodologies → Ranking; • Computer systems organization → Real-time system specification;

KEYWORDS
Emotion Recognition, Dynamic Time Warping, K-nearest neighbor

ACM Reference Format:
Fabrizio Milazzo, Agnese Augello, Giovanni Pilato, Vito Gentile, Antonio Gentile, and Salvatore Sorce. 2017. Exploiting Correlation between Body Gestures and Spoken Sentences for Real-time Emotion Recognition. In Proceedings of CHItaly ’17, Cagliari, Italy, September 18–20, 2017, 6 pages. https://doi.org/10.1145/3125571.3125590

CHItaly ’17, September 18–20, 2017, Cagliari, Italy. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-5237-6/17/09. https://doi.org/10.1145/3125571.3125590

1 INTRODUCTION

In the last decade, interest in emotion recognition (ER) has grown significantly, driven by several applications, e.g. health monitoring [35], automatic movie subtitling [28], and e-tutoring systems [1]. Text/speech and human facial expressions have long been the most widespread channels used by ER algorithms [10], [13]. However, many researchers have recently started to focus on body gesture analysis [6] as a new data input for ER. This is motivated by the proven effectiveness of body movements and gestures in revealing the human affective state, sometimes even more effectively than other communication channels (e.g. face and speech) [21], [9], [24].


Figure 1: Flow diagram of the proposed system for emotion recognition.

Many authors have analyzed the motion cues that mainly convey affective information [2], [20]. Among them, multimodal approaches integrate body gestures with the information obtained from other expression modes; they have been proposed to increase the performance of emotion classifiers, as in [8], [23].

The foundation of this work is the existence of a "correlation" between gesture, speech, and the performer's affective state. The correlation among such channels was clarified by Ekman [11] and Yang et al. [38], who stated that gesture classes like emblems (non-verbal movements with a verbal equivalent) and illustrators (movements creating an image to support a spoken message) must be performed during speech and can be associated with a distinct emotion.

Starting from this evidence, we developed a system able to classify and detect such a correlation, and to automatically recognize emotions from body gestures. The system has been trained and validated with a dataset of skeletal data and sentence pairs, where each sentence is intended as spoken while the gesture is performed. It exploits a sentiment analysis algorithm to learn emotions from the sentences associated with body gestures. After that, each new (unclassified) gesture is compared with the gestures and the related emotions learned in the previous stage. The system then outputs the probability distribution of the emotions conveyed by the unclassified gesture. The Dynamic Time Warping (DTW) recognition technique [4] is used to compare training and unclassified gestures; this makes the recognition process independent of the length and size of input gestures. The system also includes a training-set reduction module (a procedure to speed up classification [22], [31]), which makes it compliant with real-time constraints.

2 RELATED WORKS

The ER task is carried out by analyzing the human's expressions, to provide emotional labels according to an emotion categorization approach. The best-known categorization was proposed by Ekman [12], who defined six basic emotions, namely anger, disgust, fear, joy, sadness and surprise.

Emotion recognition from text is a subfield of Natural Language Processing (NLP) which studies how words reveal emotions. It consists of a set of language processing techniques to classify text according to its emotional content [27], [29], relying on emotion lexicons such as SentiWordNet [15], the WordNet Affect Lexicon [36] and the Dictionary of Affect in Language [37].

Gesture recognition is a field of pattern recognition aimed at the classification of motion patterns. The growing interest in this field has boosted the development of novel devices, recognition algorithms and acquisition systems [26]. The state of the art in gestural input acquisition is mostly based on Kinect-like devices [18], i.e. RGB-D cameras enabling efficient, real-time skeletal tracking algorithms [33]. The skeletal channel provides the positions of the skeletal joints, which can be used, along with velocity and acceleration, as features to recognize static/dynamic gestures.

In this work, we focus on emotion recognition from dynamic gestures. The involved gestures, such as the aforementioned illustrators and emblems, are connected to a spoken sentence. To this aim, we translate the emotion recognition task into an equivalent gesture recognition problem. The most widely used mathematical frameworks to recognize dynamic gestures are Hidden Markov Models (HMMs) and Dynamic Time Warping (DTW). Carmona and Climent compared their performance in [7] and concluded that DTW requires fewer training examples and achieves a higher classification accuracy, at the cost of slightly higher execution times. For this reason we resort to the DTW framework as the means to perform emotion recognition. Moreover, we propose a training-set reduction technique to reduce the training set size (and thus the recognition time) by fixing a constraint on the minimum acceptable classification accuracy [25].

3 SYSTEM DESCRIPTION

This section presents the proposed ER system, which works on gesture-sentence input pairs. It produces a gesture → emotion pattern mapping by exploiting the sub-symbolic coding of the sentence → emotion pattern association.


Each gesture G is coded as a temporal sequence of length T of the 3D positions of the N skeletal joints. Thus, it is arranged as a matrix belonging to $\mathbb{R}^{T \times 3N}$:

$$ G = \begin{bmatrix} (x_{1,1}, y_{1,1}, z_{1,1}) & \cdots & (x_{1,N}, y_{1,N}, z_{1,N}) \\ \vdots & \ddots & \vdots \\ (x_{T,1}, y_{T,1}, z_{T,1}) & \cdots & (x_{T,N}, y_{T,N}, z_{T,N}) \end{bmatrix} \quad (1) $$

A sentence S represents the textual description of the gesture and is modeled as an ordered sequence of W words:

$$ S = (s_1, s_2, \ldots, s_W) \quad (2) $$
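For illustration only, a minimal Python sketch of how such a pair could be held in memory (NumPy is assumed; the names, sizes and values below are placeholders, not part of the system described here):

```python
import numpy as np

# A gesture is a T x 3N matrix: one row per frame, each row stacking the
# (x, y, z) coordinates of the N tracked skeletal joints (Equation 1).
# A sentence is simply the ordered list of its W words (Equation 2).

N_JOINTS = 20          # Kinect-like skeletons expose 20 joints
T_FRAMES = 45          # example duration; gestures have variable length

G = np.random.rand(T_FRAMES, 3 * N_JOINTS)   # placeholder gesture matrix
S = ["perfetto", "va", "bene"]               # placeholder sentence

print(G.shape)   # (45, 60)
print(len(S))    # 3
```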

The system is composed of three modules, namely Training, Training set reduction and Recognition, as depicted in Figure 1. The Training module exploits a Gesture-Sentence input dataset GS, where each entry has the form <G_i, S_i>. It outputs a Gesture-Emotion dataset GE, which contains gesture-emotion pairs of the form <G_i, e_i>. Each gesture G_i is associated with a pattern of emotions, coded as a six-dimensional vector e_i. The values of the six components of e_i represent the strengths (normalized in the interval [0, 1]) of the six emotions of the Ekman model associated with the description S_i of the gesture G_i. The set of basic emotions defined by Ekman is:

$$ X = \{anger, disgust, fear, joy, sadness, surprise\} \quad (3) $$

The vector containing the strengths of the six basic emotions is defined as:

$$ e = \langle e_{anger}, e_{disgust}, e_{fear}, e_{joy}, e_{sadness}, e_{surprise} \rangle \quad (4) $$

It will be called emonoxel [17], in analogy with the knoxel of the conceptual space paradigm [16].

The Training set reduction module applies a constrained optimization technique, based on gradient descent, to discard redundant samples from GE. This reduces the processing time of the recognition module.

The Recognition module accepts as input a new (unclassified) gesture G_new and the filtered emotional dataset GE, and outputs a six-dimensional emonoxel e_new. It implements a K-nearest neighbor classifier that exploits the Dynamic Time Warping similarity measure to discover the most similar examples in the filtered dataset.

Training module

The Training module translates a sentence S_i, spoken while the gesture G_i is performed, into an emonoxel e_i. It is applied to the whole GS dataset and outputs a new dataset, GE, associating emonoxels e_i with gestures G_i. Each emonoxel is derived from the affective content of the sentence associated with the gesture. Strapparava et al. [36] pointed out that some parts of speech, the so-called emotional words, reveal the affective state of the speaker.

Figure 2: Naive Bayesian classifier used to reveal emotions from textual descriptions of gestures.

If a gesture is performed while a sentence is spoken, we may therefore associate the body movement with the emotional state revealed by the spoken sentence. Let a sentence be an ordered sequence of words, as in Equation 2. We model the influence between the performer's emotional state and the sentence by means of a Naive Bayesian classifier, shown in Figure 2. The variable e is the unknown emonoxel, while the words composing the sentence S are the evidence for e. The components of the emonoxel vector are computed as the probability of the whole sentence given a specific emotion:

$$ e_x = p(S \mid Emotion = x) = \prod_{i=1}^{W} p(s_i \mid Emotion = x), \quad \forall x \in X \quad (5) $$

Non-emotional words have no influence on the output emonoxel e, i.e. for each non-emotional word s_i it holds that p(s_i | Emotion = x) = 1.
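A minimal sketch of Equation 5, assuming a hypothetical lexicon that returns p(s_i | Emotion = x) for emotional words (in the real system these probabilities would come from an affective lexicon such as WordNet Affect; the numbers below are placeholders) and is simply absent for neutral words, which then contribute a factor of 1:

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

# Hypothetical emotion lexicon: word -> p(word | emotion), placeholder values.
LEXICON = {
    "happy":  {"anger": 0.01, "disgust": 0.01, "fear": 0.01,
               "joy": 0.90, "sadness": 0.02, "surprise": 0.05},
    "scared": {"anger": 0.02, "disgust": 0.02, "fear": 0.90,
               "joy": 0.01, "sadness": 0.03, "surprise": 0.02},
}

def emonoxel(sentence):
    """e = <e_anger, ..., e_surprise>: product over the words of the
    sentence of p(s_i | Emotion = x), as in Equation 5."""
    e = np.ones(len(EMOTIONS))
    for word in sentence:
        probs = LEXICON.get(word)
        if probs is None:
            continue  # non-emotional word: contributes a factor of 1
        for k, emotion in enumerate(EMOTIONS):
            e[k] *= probs[emotion]
    return e

print(emonoxel(["i", "am", "so", "happy"]))
```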

Recognition module

During the recognition phase, the unclassified gesture G_new is analyzed, compared to the entries of GE, and the emotional pattern e_new is provided as output. This is done by the K-nearest neighbor classifier, which classifies inputs according to the K most similar samples found in the training set. In this work we resort to the Dynamic Time Warping similarity DTW(·, ·) to compare gestures [7]. Dynamic Time Warping computes the cost of warping the first input signal onto the second by means of insertion, deletion and match operations. The warping cost is the minimum cumulative sum of the costs of the single warping operations, where the cost per operation is computed with a distance metric, typically the Cosine, Manhattan or Euclidean distance. If the two input signals last T samples and C is the average time to perform one warping operation, then O(CT^2) is the time complexity of a DTW computation [32]. The K-nearest neighbor classifier computes DTW(G_new, G_i) for the whole training set; samples are then sorted according to their distance from G_new, and the output emonoxel e_new is finally computed as the average of the K closest training emonoxels.
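A compact sketch of this recognition step: a plain dynamic-programming DTW with the Euclidean per-frame cost, followed by a K-nearest-neighbor average of the stored emonoxels. Function and variable names are illustrative; the paper does not prescribe this exact implementation:

```python
import numpy as np

def dtw_distance(g1, g2):
    """DTW cost between two gestures (T1 x 3N and T2 x 3N matrices),
    using the Euclidean distance between frames as per-operation cost."""
    t1, t2 = len(g1), len(g2)
    D = np.full((t1 + 1, t2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(g1[i - 1] - g2[j - 1])
            # match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[t1, t2]

def recognize(g_new, dataset, k=5):
    """dataset: list of (gesture, emonoxel) pairs. Returns the average
    emonoxel of the K training gestures closest to g_new under DTW."""
    dists = [(dtw_distance(g_new, g), e) for g, e in dataset]
    dists.sort(key=lambda pair: pair[0])
    neighbours = [e for _, e in dists[:k]]
    return np.mean(neighbours, axis=0)
```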


Training set reduction module

The heaviest operation of the recognition module is the computation of the Dynamic Time Warping distance DTW(G_new, G_i) for all the samples of GE. The training set reduction module reduces the cardinality of GE, with the aim of lowering the recognition processing time as much as possible while keeping accuracy as high as possible. The original dataset is filtered so that only the most representative gestures, the "prototypes", are used for recognition. The original training set GE is partitioned into two subsets, namely P (the prototype gestures) and NP (the non-prototype gestures). To this end, we define an evaluation function A(P, NP) ∈ [0, 1] as the recognition accuracy obtained on the elements of NP when P is used as the training set. A constrained optimization procedure is then applied to GE to maximize A(P, NP) while minimizing the cardinality |P|:

$$ \begin{cases} \max A(P, NP) \\ \min |P| \\ \theta \le A(P, NP) \end{cases} \quad (6) $$

where θ is a lower bound on the minimum acceptable accuracy. The initial state of the procedure is:

$$ \begin{cases} P = GE \\ NP = \emptyset \\ A(P, NP) = 1 \end{cases} \quad (7) $$

A while-loop reduces the cardinality of P by one element at a time, until the constraint is satisfied at equality. At each iteration, every sample in P is tentatively moved, one at a time, into NP. Each such move produces a variation in the accuracy A(P, NP), whose gradient is

$$ \nabla A(P, NP) = A(P_i^-, NP_i^+) - A(P, NP) \quad (8) $$

where P_i^- and NP_i^+ denote the sets obtained by moving the i-th sample gesture from P into NP. Gestures in P are sorted according to the gradient of A(P, NP), and the one with the maximum value is definitively moved into NP. The loop iterates as long as A(P, NP) remains at or above the threshold θ. The module outputs the set P, which replaces GE for the recognition task.
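A sketch of this greedy reduction loop under the constraint θ ≤ A(P, NP). It reuses the `recognize` helper from the previous sketch; `accuracy` stands in for A(P, NP) and uses an illustrative dominant-emotion comparison, not necessarily the exact evaluation used by the authors:

```python
import numpy as np

def accuracy(P, NP, k=5):
    """A(P, NP): fraction of NP samples whose dominant emotion is recovered
    when P is used as the training set (illustrative definition)."""
    if not NP:
        return 1.0  # initial state: A(P, NP) = 1 (Equation 7)
    hits = 0
    for g, e_true in NP:
        e_pred = recognize(g, P, k)          # `recognize` from the sketch above
        hits += int(np.argmax(e_pred) == np.argmax(e_true))
    return hits / len(NP)

def reduce_training_set(GE, theta=0.75, k=5):
    """Greedily move samples from P to NP, always picking the move with the
    largest gradient of A(P, NP), while A(P, NP) stays at or above theta."""
    P, NP = list(GE), []
    while len(P) > 1:
        base = accuracy(P, NP, k)
        # gradient of A for every candidate move (Equation 8)
        gains = [accuracy(P[:i] + P[i + 1:], NP + [P[i]], k) - base
                 for i in range(len(P))]
        best = int(np.argmax(gains))
        if base + gains[best] < theta:
            break                            # the constraint would be violated
        NP.append(P.pop(best))               # commit the best move
    return P                                 # the prototype set replacing GE
```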

4 EXPERIMENTAL ASSESSMENT

This section describes the experimental results obtained by applying our ER system to the Chalearn 2013 gestural dataset [14]. The dataset contains 7663 gestures acquired with a Microsoft Kinect. Each entry contains an RGB-D video, the skeletal joint sequence, and a spoken sentence (represented as raw text). Skeletal sequences are acquired at 20 fps, have variable time length, and contain 20 joints represented in world 3D coordinates. The performed gestures are typical of the Italian culture and are a mix of illustrators and emblems.

Figure 3: Accuracy obtained by varying the training set size and the distance metric for Dynamic Time Warping. The accuracy of the two other compared methods is shown for x = 70%.

The system was implemented in Python 2.7 and tested on a MacBook Air model 6,2 equipped with an Intel Core [email protected] and 4 GB of DDR3 RAM. Based on some preliminary tests, we selected K = 5 as the best neighborhood size and kept it fixed in the following experiments.

Evaluating different cost functions for DTW

This experiment assesses the impact of different Dynamic Time Warping cost functions, namely the Manhattan, Euclidean and Cosine distances, on the recognition accuracy. No training-set reduction was applied here. First, we ran the training module on the skeletal-sentence pairs of the Chalearn dataset and produced the output dataset GE, which was used as ground truth to assess the recognition accuracy. Then, we split the GE dataset into training/validation sets with different mixing percentages, ranging from 50% up to Leave-One-Out (LOO). Finally, we trained the recognizer and evaluated its accuracy on the validation set. In particular, we compared the maximum component of the ground-truth emonoxel with the maximum component of the output provided by the recognition module. Thus, for each run, we computed a 6-by-6 confusion matrix, which was used to compute the recognition accuracy as described in [34].
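As an illustration of this evaluation protocol, a minimal sketch that compares dominant emotions and reads the plain accuracy off the confusion matrix ([34] also covers more refined measures; the function name is illustrative):

```python
import numpy as np

def confusion_and_accuracy(true_emonoxels, predicted_emonoxels):
    """Build the 6x6 confusion matrix over dominant emotions and derive the
    overall accuracy as the fraction of samples on the diagonal."""
    cm = np.zeros((6, 6), dtype=int)
    for e_true, e_pred in zip(true_emonoxels, predicted_emonoxels):
        cm[np.argmax(e_true), np.argmax(e_pred)] += 1
    return cm, np.trace(cm) / cm.sum()
```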


Figure 4: Accuracy and recognition time obtained for different accuracy thresholds and training/validation mixing percentages.

Figure 3 shows the achieved results. As the training set percentage increases, the resulting accuracy also increases: more training samples increase the chance of obtaining the right classification for the validation samples. In any case, the performance range is very narrow; both the Euclidean and Cosine distances perform closely, between 77% and 80%, while the Manhattan distance performs slightly worse, in the 76%-77.5% range. These results lead us to conclude that the Euclidean distance is the best cost metric for Dynamic Time Warping. Two other unimodal skeletal-based gesture recognition methods, namely OS-ELM [5] and SUMO (the best performer in the Chalearn 2013 challenge), reach an average accuracy of about 70% for a training/validation mixing percentage of 70/30. Compared to such methods, our system performs much better, reaching 78% accuracy for the same mixing percentage.

Evaluating real-time performance

This experiment shows that the training set reduction module allows the proposed ER system to run in real time. Figure 4 reports the accuracy and recognition time obtained on the validation sets (y-axes) for different training/validation mixing percentages (x-axis). The GE dataset was split in the same way as in the previous experiment. The reduction module was run on the training split of GE with different values of the training accuracy threshold: we set θ to 75% and 90%, and to 101% to force the system not to apply any training-set reduction.

First, we note that the accuracy on the validation set remains quite stable, in a narrow range between 70% and 80%, across both accuracy thresholds and split percentages. This means that the training set reduction module does not significantly affect the recognition accuracy, even when non-prototype elements are removed from the training set, and leads us to conclude that prototypes largely drive recognition while non-prototypes provide a very small contribution. The best recognition performance is achieved when no training-set reduction is applied, but the difference with the remaining cases is negligible.

The second row of Figure 4 clearly shows that the reduction module dramatically decreases the recognition time. In particular, we measured a reduction of about 10x and 5x for the threshold values 75% and 90%, respectively, against the case where no training-set reduction was applied. We set a real-time limit of 33.3 ms, since in many real-world scenarios the frame rate of Kinect-like devices is set to 30 fps [19]. Based on the obtained results, we can conclude that setting the reduction threshold to θ = 75% leads the recognition module to achieve good accuracy (75.2% in the LOO case) while allowing real-time execution (3 ms per sample).

5 CONCLUSIONS

In this paper we presented an ER system able to compute emotion patterns from body gestures associated with spoken sentences. The system learns emotions from the sentences by combining WordNet and a Naive Bayesian classifier, and outputs the so-called emonoxel, i.e. a vector containing the strengths of Ekman's six basic emotions. The training set reduction module discards redundant samples from the training set by means of constrained optimization. The recognition module combines the Dynamic Time Warping distance and a K-nearest neighbor classifier to compute an output emonoxel for unclassified gestures.

Experimental results showed that the best cost metric for Dynamic Time Warping is the Euclidean distance. Moreover, different threshold values for the reduction module, i.e. 75%, 90% and 101%, lead to similar accuracy values, while the recognition time is largely affected: for θ = 75%, recognition is performed in real time.

In the near future, we plan to develop a multimodal ER system integrating the face/gaze channel and the intonation, prosody and intensity of the audio signal with the channels used in the system proposed here [3]. Sentiment analysis may be improved by means of SenticNet, which may be combined with the WordNet Affect Lexicon to improve emotion detection [30].

REFERENCES
[1] Oryina Kingsley Akputu, Kah Phooi Seng, and Yun Li Lee. 2016. Affect Recognition for Web 2.0 Intelligent E-Tutoring Systems: Exploration of Students' Emotional Feedback. In Psychology and Mental Health: Concepts, Methodologies, Tools, and Applications. IGI Global, 818–848.
[2] Anthony P Atkinson, Mary L Tunstall, and Winand H Dittrich. 2007. Evidence for Distinct Contributions of Form and Motion Information to the Recognition of Emotions from Body Gestures. Cognition 104, 1 (2007), 59–72.
[3] Tobias Baur, Dominik Schiller, and Elisabeth André. 2016. Modeling User's Social Attitude in a Conversational System. In Emotions and Personality in Personalized Services. Springer, 181–199.

[4] Donald J Berndt and James Clifford. 1994. Using Dynamic Time Warping to Find Patterns in Time Series. In KDD Workshop, Vol. 10. Seattle, WA, 359–370.
[5] Arif Budiman, Mohamad Ivan Fanany, and Chan Basaruddin. 2014. Constructive, Robust and Adaptive OS-ELM in Human Action Recognition. In 2014 International Conference on Industrial Automation, Information and Communications Technology (IAICT). IEEE, 39–45.
[6] Lea Canales and Patricio Martínez-Barco. 2014. Emotion Detection from Text: A Survey. Processing in the 5th Information Systems Research Working Days (JISIC 2014) (2014), 37.
[7] Josep Maria Carmona and Joan Climent. 2012. A Performance Evaluation of HMM and DTW for Gesture Recognition. In Iberoamerican Congress on Pattern Recognition. Springer, 236–243.
[8] Ginevra Castellano, Loic Kessous, and George Caridakis. 2008. Emotion Recognition through Multiple Modalities: Face, Body Gesture, Speech. Affect and Emotion in Human-Computer Interaction (2008), 92–103.
[9] Berardina De Carolis, Marco de Gemmis, Pasquale Lops, and Giuseppe Palestra. 2017. Recognizing Users Feedback from Non-Verbal Communicative Acts in Conversational Recommender Systems. Pattern Recognition Letters (2017).
[10] Marilena Ditta, Fabrizio Milazzo, Valentina Raví, Giovanni Pilato, and Agnese Augello. 2015. Data-driven Relation Discovery from Unstructured Texts. In 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Vol. 1. IEEE, 597–602.
[11] Paul Ekman. 2004. Emotional and Conversational Nonverbal Signals. In Language, Knowledge, and Representation. Springer, 39–50.
[12] Paul Ekman. 2005. Basic Emotions. John Wiley & Sons, Ltd, 45–60.
[13] Paul Ekman and Erika L Rosenberg. 1997. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, USA.
[14] Sergio Escalera, Jordi Gonzàlez, Xavier Baró, Miguel Reyes, Oscar Lopes, Isabelle Guyon, Vassilis Athitsos, and Hugo Escalante. 2013. Multi-Modal Gesture Recognition Challenge 2013: Dataset and Results. In Proceedings of the 15th ACM International Conference on Multimodal Interaction. ACM, 445–452.
[15] Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In Proceedings of LREC, Vol. 6. Citeseer, 417–422.
[16] Peter Gärdenfors. 2004. Conceptual Spaces: The Geometry of Thought. MIT Press.
[17] Vito Gentile, Fabrizio Milazzo, Salvatore Sorce, Antonio Gentile, Agnese Augello, and Giovanni Pilato. 2017. Body Gestures and Spoken Sentences: A Novel Approach for Revealing User's Emotions. In IEEE 11th International Conference on Semantic Computing (ICSC). IEEE, 69–72.
[18] Vito Gentile, Salvatore Sorce, and Antonio Gentile. 2014. Continuous Hand Openness Detection Using a Kinect-like Device. In Eighth International Conference on Complex, Intelligent and Software Intensive Systems (CISIS). IEEE, 553–557.
[19] Vito Gentile, Salvatore Sorce, Alessio Malizia, and Antonio Gentile. 2016. Gesture Recognition Using Low-cost Devices: Techniques, Applications, Perspectives [Riconoscimento di gesti mediante dispositivi a basso costo: Tecniche, applicazioni, prospettive]. Mondo Digitale 15, 63 (2016). http://mondodigitale.aicanet.net/2016-2/articoli/02 Riconoscimento di gesti mediante dispositivi a basso costo.pdf
[20] Donald Glowinski, Antonio Camurri, Gualtiero Volpe, Nele Dael, and Klaus Scherer. 2008. Technique for Automatic Emotion Recognition by Body Gesture Analysis. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'08). IEEE, 1–6.
[21] Asha Kapur, Ajay Kapur, Naznin Virji-Babul, George Tzanetakis, and Peter F. Driessen. 2005. Gesture-based Affective Computing on Motion Capture Data. In International Conference on Affective Computing and Intelligent Interaction. Springer, 1–7.
[22] Chatchai Kasemtaweechok and Worasait Suwannik. 2015. Training Set Reduction Using Geometric Median. In 15th International Symposium on Communications and Information Technologies (ISCIT). IEEE, 153–156.
[23] Loic Kessous, Ginevra Castellano, and George Caridakis. 2010. Multimodal Emotion Recognition in Speech-based Interaction Using Facial Expression, Body Gesture and Acoustic Analysis. Journal on Multimodal User Interfaces 3, 1-2 (2010), 33–48.
[24] Andrea Kleinsmith and Nadia Bianchi-Berthouze. 2013. Affective Body Expression Perception and Recognition: A Survey. IEEE Transactions on Affective Computing 4, 1 (2013), 15–33.
[25] Fabrizio Milazzo, Vito Gentile, Antonio Gentile, and Salvatore Sorce. 2017. Real-Time Body Gestures Recognition Using Training Set Constrained Reduction. In Conference on Complex, Intelligent, and Software Intensive Systems. Springer, 216–224.
[26] Fabrizio Milazzo, Vito Gentile, Giuseppe Vitello, Antonio Gentile, and Salvatore Sorce. 2017. Modular Middleware for Gestural Data and Devices Management. Journal of Sensors 2017 (2017).
[27] Saif Mohammad. 2011. From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Association for Computational Linguistics, 105–114.
[28] Seung-Bo Park, Eunsoon Yoo, Hyunsik Kim, and Geun-Sik Jo. 2011. Automatic Emotion Annotation of Movie Dialogue Using WordNet. In Asian Conference on Intelligent Information and Database Systems. Springer, 130–139.
[29] Giovanni Pilato and Umberto Maniscalco. 2015. Soft Sensors for Social Sensing in Cultural Heritage. In 2015 Digital Heritage, Vol. 2. IEEE, 749–750.
[30] Soujanya Poria, Alexander Gelbukh, Erik Cambria, Peipei Yang, Amir Hussain, and Tariq Durrani. 2012. Merging SenticNet and WordNet-Affect Emotion Lists for Sentiment Analysis. In IEEE 11th International Conference on Signal Processing (ICSP), Vol. 2. IEEE, 1251–1255.
[31] José Salvador Sánchez. 2004. High Training Set Size Reduction by Space Partitioning and Prototype Abstraction. Pattern Recognition 37, 7 (2004), 1561–1564.
[32] Pavel Senin. 2008. Dynamic Time Warping Algorithm Review. Information and Computer Science Department, University of Hawaii at Manoa, Honolulu, USA 855 (2008), 1–23.
[33] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore. 2013. Real-Time Human Pose Recognition in Parts From Single Depth Images. Commun. ACM 56, 1 (2013), 116–124.
[34] Marina Sokolova and Guy Lapalme. 2009. A Systematic Analysis of Performance Measures for Classification Tasks. Information Processing & Management 45, 4 (2009), 427–437.
[35] Thad Starner, Jake Auxier, Daniel Ashbrook, and Maribeth Gandy. 2000. The Gesture Pendant: A Self-illuminating, Wearable, Infrared Computer Vision System for Home Automation Control and Medical Monitoring. In The Fourth International Symposium on Wearable Computers. IEEE, 87–94.
[36] Carlo Strapparava, Alessandro Valitutti, et al. 2004. WordNet Affect: An Affective Extension of WordNet. In LREC, Vol. 4. 1083–1086.
[37] Cynthia Whissell. 1989. The Dictionary of Affect in Language. Emotion: Theory, Research, and Experience 4, 113-131 (1989), 94.
[38] Zhaojun Yang and Shrikanth S Narayanan. 2014. Analysis of Emotional Effect on Speech-Body Gesture Interplay. In INTERSPEECH. 1934–1938.