Cogn Comput DOI 10.1007/s12559-010-9072-1

Extracting and Associating Meta-features for Understanding People's Emotional Behaviour: Face and Speech

Nikolaus Bourbakis · Anna Esposito · Despina Kavraki



Received: 6 May 2010 / Accepted: 1 September 2010
© Springer Science+Business Media, LLC 2010

Abstract Emotion is a research area that has received much attention during the last 10 years, both in the context of speech synthesis and image understanding and in automatic speech recognition, interactive dialogue systems, and wearable computing. There are promising studies on the emotional behaviour of people, mainly based on human observations. Only a few are based on automatic machine detection, owing to the lack of Information Technology and Engineering (ITE) techniques that can provide a deep, large-scale, noninvasive analysis and evaluation of people's emotional behaviour and offer tools and support for helping them to overcome social barriers. The present paper reports a study on extracting and associating emotional meta-features to support the development of emotionally rich man–machine interfaces (interactive dialogue systems and intelligent avatars).

Keywords Vocal and facial emotional expressions · Speech and image processing · Communicative signals · Mathematical models

N. Bourbakis
Wright State University, Dayton, OH, USA
e-mail: [email protected]

N. Bourbakis · D. Kavraki
AIIS Inc, Columbus, OH, USA

A. Esposito (corresponding author)
Department of Psychology and IIASS, Second University of Naples, Via Vivaldi 43, 81100 Caserta, Italy
e-mail: [email protected]; [email protected]

Introduction

Emotion is a topic that has received much attention during the last few years, both in the context of speech synthesis and image understanding and in automatic speech recognition, interactive dialogue systems, and wearable computing. The goal in this paper is to design ITE-based systems that contribute to understanding and responding to human emotional behaviour. It is evident that the same words may be used as a joke, as a genuine question seeking an answer, or as an aggressive challenge. Knowing what is an appropriate continuation of the interaction depends on detecting the register that the speaker is using, and a machine communicator that is unable to acknowledge the difference will have difficulty in managing a natural-like conversation.

Emotional Behaviour in People

People manifest several different emotional behaviours on the basis of specific environmental and interactional conditions, observable through verbal and nonverbal expressions. There are several Information Technology issues dealing with emotions and their encoding procedures in order to improve the machine's ability to recognize and synthesize emotional states. However, it has become apparent that useful and effective machine representations of emotions can be developed by exploiting information and algorithms inspired by human psychological responses to emotional states and by the human ability to recognize them. A number of different emotional theories have been proposed, and a number of different approaches to their understanding have been suggested. The existence, for example, of a discrete number of emotional states (exploited in current IT applications) is just one aspect of the search for understanding emotions.


Other theories include several dimensions of the emotional responses or attribute emotions to social learning abilities [1–4]. This is because changes in emotional states involve a number of different responses that can be neural, physiological, and behavioural. Humans react to emotional states not only by changing their expressive behaviour, making these changes visible in the face, the voice, and the body, but also by changing their physiology, their actions, and their neural activity. The answer to the question "What is an emotion?", posed by W. James a century ago, is still open, even though a lot has been discovered about emotional states. Along this research line, studies attempting to clarify the basic mechanisms of emotional processes in disabilities and health disorders follow, in some aspects, the same paths exploited in typical conditions and differ in others [5]. Impairments and developmental disorders may change both emotional responses and the related expressions and needs. Emotional reactions may be different. Questions on how these differences are expressed and felt, and on their relevance during social interaction, are still open and must be taken into account in studying emotions. Comparing normative and disordered expressions of emotional states will not only shed light on our understanding of emotional behaviour but will also be useful for implementing effective intelligent systems able to interact with typical and disordered people.

Recognizing Emotions in Speech

From an acoustic point of view, the approach commonly applied in creating automatic emotional speech synthesis and analysis systems is to start with an emotional speech database annotated with emotional tags by a panel of listeners. The next step is to perform an acoustic analysis of these data and correlate statistics of certain acoustic features (mainly related to F0 and some of its derived measures) with the emotional tags. In the last step, the robustness of the acquired acoustic parameters is verified and adapted by checking the performance of an automatic dialogue system [6].

Most of the approaches devoted to the study of emotional expressions focus on a small set of six basic emotions (some authors identified more than six [2, 4]): happiness, sadness, fear, anger, surprise, and disgust. These emotions are considered key points of reference, or primary emotions, from which the remaining, secondary, emotions are derived. In speech, syntheses of these primary emotions have been created through the modification of acoustic parameters such as F0, spectral energy distribution, and harmonics-to-noise ratio. For example, the mean F0 value has been shown to be lower for sadness than for happiness. In general, preliminary results indicate that, based on this approach, it is possible to synthesize some primary emotions very effectively.
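As an aside, the feature-extraction step just described can be made concrete with a minimal sketch: extract F0 statistics per utterance and summarise them by listener-assigned emotion tag. The corpus layout and the use of librosa's pyin pitch tracker are illustrative assumptions, not the tools used in this study.

```python
# Sketch: extract simple F0 statistics per emotion-tagged utterance and
# summarise them by emotion label. Corpus layout and the use of librosa's
# pyin tracker are assumptions for illustration only.
from collections import defaultdict

import librosa
import numpy as np

def f0_statistics(wav_path: str) -> dict:
    """Return mean/std/range of F0 (Hz) over the voiced frames of one file."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[voiced & ~np.isnan(f0)]
    return {"f0_mean": float(np.mean(f0)),
            "f0_std": float(np.std(f0)),
            "f0_range": float(np.ptp(f0))}

def summarise(corpus: list[tuple[str, str]]) -> dict:
    """corpus: (wav_path, emotion_tag) pairs annotated by a listener panel."""
    stats = defaultdict(list)
    for path, tag in corpus:
        stats[tag].append(f0_statistics(path)["f0_mean"])
    # e.g. mean F0 is typically reported lower for sadness than for happiness
    return {tag: float(np.mean(values)) for tag, values in stats.items()}
```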


However, this synthesis mostly exploits simulated emotional data, i.e. data based on recordings of actors simulating emotional states through the reading of emotionally rich texts. This is due to the difficulty of gathering spontaneous emotional data. This approach has raised doubts about the effectiveness of the controlled parameters: it is questionable, in fact, whether actors reproduce a genuine or a stylized idealization of the emotional characteristics exploited by ordinary people when they spontaneously experience emotions. The automatic recognition of emotional vocal expressions is a much harder problem than synthesis because, acoustically, emotions overlap to various degrees among themselves and are affected by other sources of speech variability such as the phonetic, intra- and inter-speaker variability, as well as environmental noise. Therefore, the acoustic realization of affective speech depends on several factors other than emotional states [7–11].

Recognizing Emotions in Faces

The salient issues in emotion recognition from faces are parallel in some respects to those associated with voices, but divergent in others [6, 12–14]. As in speech, a long-established tradition attempts to define the facial expressions of emotion in qualitative terms, i.e. as static positions capable of being displayed in a still photograph. The still image usually captures the apex of the expression, i.e. the instant at which the emotion indicators are most marked, whereas the natural recognition of emotions is based on continuous variations of facial expressions. The analysis of human emotional facial expressions requires a number of pre-processing steps that attempt to detect or track the face, locate characteristic facial regions (such as the eyes, mouth, and forehead), extract and follow the movement of facial features (such as characteristic points in these regions), and model facial gestures using anatomic information about the face. These techniques are based on the Facial Action Coding System (FACS) [2, 15], a system for describing visually distinguishable facial movements through a set of basic units, the Action Units (AUs), i.e. visible movements of a single facial muscle or of a small group of facial muscles [15]. The AUs can be combined to define a mapping between changes in facial musculature and visible facial emotional expressions. There are 46 independent AUs, and up to 7,000 distinct combinations of them have been observed in spontaneous interactions. Most of the implemented automatic systems attempt to recognize only the expressions corresponding to the above-mentioned six basic emotional states (happiness, sadness, surprise, disgust, fear, and anger) [16, 17] or to recognize AUs.
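To illustrate the AU-to-emotion mapping idea, the sketch below matches a set of detected AUs against commonly cited prototypical AU combinations for the six basic emotions. The specific AU sets and the overlap-based matching rule are illustrative assumptions, not taken from this paper; published FACS-based systems use finer-grained, intensity-aware rules.

```python
# Sketch: map detected FACS Action Units to one of the six basic emotions by
# comparing against commonly cited prototypical AU combinations. The exact AU
# sets and the overlap-based rule are illustrative assumptions only.
PROTOTYPES = {
    "happiness": {6, 12},
    "sadness":   {1, 4, 15},
    "surprise":  {1, 2, 5, 26},
    "fear":      {1, 2, 4, 5, 7, 20, 26},
    "anger":     {4, 5, 7, 23},
    "disgust":   {9, 15, 16},
}

def classify_aus(detected_aus: set[int]) -> str:
    """Pick the emotion whose prototype overlaps most with the detected AUs."""
    def score(label: str) -> float:
        proto = PROTOTYPES[label]
        return len(proto & detected_aus) / len(proto)
    best = max(PROTOTYPES, key=score)
    return best if score(best) > 0 else "neutral"

print(classify_aus({6, 12, 25}))   # -> happiness
```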


In general, emotional facial expressions are recognized by tracking AUs and measuring the intensity of their changes. Approaches to the classification of facial expressions can thus be divided into spatial and spatio-temporal. In spatial approaches, the facial features obtained from still images are used to classify emotional facial expressions. Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Linear Discriminant Analysis, and Gaussian Mixture Models (among many others) are the mathematical models mostly exploited for such classifications [18–21]. In spatio-temporal approaches, dynamic features extracted from video sequences are also taken into account, and therefore the more appropriate mathematical tools are Hidden Markov Models (HMMs) and Stochastic Petri-Nets (SPNs) [22–25], since they are able to consider variations of the face over time.

The Proposed Approach

During the past three decades, there has been significant interest in developing vision-based intelligent Human Computer Interaction (HCI) systems. The research in building such devices is based on the assumption that information about a person's identity, state, and intention can be extracted from the cross-modal analysis of her/his verbal and nonverbal behaviour, and that computers can be made to react accordingly, e.g. by observing a person's behaviour. A system that could automatically detect the presence of humans, determine their identity, and interpret their behaviour (by means of a cross-modal analysis and recognition of speech, gestures, gaze, and facial expressions) in order to respond accordingly would significantly improve the performance of HCI systems. What we propose here is to address these problems by exploiting audio–video processing/analysis models. From an image processing point of view, we use the Local–Global (LG) graph methodology [26] for detecting faces and recognizing the facial expressions associated with the six basic emotions specified earlier. Specifically, each facial feature composing a particular facial expression is represented by a detailed LG graph, which allows even the most subtle expressions to be represented. Moreover, since all measurements and computations need to be validated by a mathematical model to be effective, we propose to use the Hausdorff metric [27, 28], which is able to describe a digital topology for speech and image processing. Recently, a theoretical result has shown that the lower Hausdorff metric (uniform) topology is generated by families of uniformly discrete sets as hit sets [29, 30], suggesting its use for clustering different emotional expressions as the Vietoris prototype topology. Because of its low computational complexity, this metric has already been used for measuring distances among sound spectra and for identifying musical similarities [28].
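For reference, the (symmetric) Hausdorff distance between two finite point sets can be sketched as below. This is a brute-force illustration of the set-to-set measure mentioned above, not the paper's implementation; scipy offers an optimised directed variant.

```python
# Sketch: symmetric Hausdorff distance between two finite point sets, the kind
# of set-to-set measure the paper proposes for comparing feature sets.
# Brute-force O(n*m) version; scipy.spatial.distance.directed_hausdorff offers
# an optimised equivalent.
import numpy as np

def directed_hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    """Max over points of `a` of the distance to the nearest point of `b`."""
    # pairwise Euclidean distances, shape (len(a), len(b))
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).max())

def hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 0.1], [1.0, 0.0], [3.0, 0.0]])
print(hausdorff(a, b))   # 2.0: the isolated point (3, 0) dominates
```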

Detecting Emotion in Faces

Face Detection via Skin Detection

The primary steps for skin detection in an image using colour information are (1) to represent the image pixels in a suitable colour space, (2) to model the skin and nonskin pixels using a suitable colour distribution, and (3) to classify the modelled distribution [31]. However, the accuracy of such models is limited, as skin colour is affected by several factors such as illumination and background conditions, camera characteristics, and ethnicity. Figure 1 shows our proposed skin and face detection methodology based on colour constancy.

Image Registration

Image registration is the process of geometrically aligning two images of the same scene that have been acquired under varying conditions. It is usually employed at the first stage of other computer vision schemes and finds application in medical imaging, remote sensing, robot vision, motion detection, guidance systems, stereo vision, and pattern recognition. In the remote sensing field, it is frequently used for change detection, topographic mapping, and urban planning, among others. In the case of change detection, the objective is to find the differences between two images of the same scene that have been taken from variable viewpoints, at different times, using different sensors. The proposed in-house registration method [32] is based on a registration scheme that uses Neural Networks (NNs) and Genetic Algorithms (GAs) to reduce the search space defined by the affine transform parameters, more specifically by translation, rotation, and isometric scaling variations. In addition, different data representations and similarity measures are employed and tested for their applicability to the proposed scheme.


Fig. 1 The colour constancy approach to skin detection for face detection
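To give a concrete flavour of the colour-based skin-detection step summarised in Fig. 1, the sketch below produces a candidate skin mask by thresholding in the YCrCb colour space. The paper uses a skin-colour-adapted neural network with colour constancy; the fixed thresholds and OpenCV calls below are a rough, illustrative stand-in only.

```python
# Sketch: a minimal colour-based skin mask as a stand-in for the Fig. 1
# pipeline (the paper uses a skin-colour-adapted neural network; the fixed
# YCrCb thresholds below are a rough, illustrative assumption only).
import cv2
import numpy as np

def skin_mask(bgr: np.ndarray) -> np.ndarray:
    """Return a uint8 mask (255 = candidate skin pixel) for a BGR image."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 135, 85], dtype=np.uint8)     # Y, Cr, Cb lower bounds
    upper = np.array([255, 180, 135], dtype=np.uint8)  # Y, Cr, Cb upper bounds
    mask = cv2.inRange(ycrcb, lower, upper)
    # clean up isolated pixels before looking for face-sized regions
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

image = cv2.imread("frame.jpg")          # hypothetical input frame
face_candidates = skin_mask(image)
```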


Region-based Image Segmentation

Segmentation is a very important technique in the fields of computer vision, image analysis, and image processing, and many algorithms have been proposed for its effective implementation [33–36]. Segmenting an image entails the identification of image regions clustered together according to certain similarity features, such as RGB (Red, Green, Blue) colours, greyness, and texture. Segmentation algorithms are employed in the analysis and synthesis of image regions and represent the front-end processing for the subsequent extraction of relevant image features and/or for the recognition of objects. Here, we plan to use the in-house Fuzzy-like Reasoning Segmentation (FRS) [34, 37] method, with additional capabilities for eliminating shadows and highlights that confuse the extraction of objects from colour images. The results obtained by the proposed method are more accurate in terms of a psychological theory of colour perception and more suitable for object reconstruction. The FRS method is based on three processing stages: smoothing, edge detection, and segmentation. The initial smoothing operation aims to remove noise from the image. The subsequent edge detector extracts edge information that is exploited by the segmentation algorithm to appropriately detect image segments.

The Local–Global (LG) Graph

Graph theory is a very powerful methodology used on a great variety of computer science problems [22, 38, 39]. Graph models are mostly used for the relational representation and recognition of images, objects, and their features and attributes [31, 40]. In order to recognize patterns or models in images, different graph techniques have been proposed, for example to measure image similarity [38, 39]. The proposed in-house LG graph method deploys region information in each graph node and interrelates the nodes with each other to express all possible relationships [26]. In particular, it adds local information into graphs, making them more accurate for representing an object.

By combining the FRS and the LG graph methods, it is possible to improve the object recognition accuracy without increasing the computational complexity. The main components of an LG graph are: (i) the local graph, which encodes size and shape information, and (ii) the skeleton graph, which provides information about the internal shape of each segmented region. The global graph represents the relationships among the segmented regions of the entire image. The nodes Pi of the global graph include the LG graph, the colour, the texture, and the skeleton graph. The local–global image graph components are briefly described below. The LG graph attempts to emulate human-like perceptual understanding of an image by developing global topological relationships among regions and objects within the image.

Model Driven Region Synthesis Recognition

Many object recognition methods use region synthesis based on similar features (colour, texture, shape, etc.) [26, 38–40]. In particular, shape similarity is used to model the deformation energy needed to align two different shapes. Shape similarity can match an object with deformations other than rigid transformations, assuming that the shape has been segmented from the background and that the mathematical shape representation is sensitive to some kinds of deformations. The process used here for object recognition is based on the synthesis of segmented adjacent regions, using the LG graph, and on the association (through comparisons) with the existing object models available in an LG graph database. The method is exemplified through Fig. 2 below. The LG graph method can search for objects in a given image based upon the provided object model database. In the model database, all objects are represented by their multi-view structure: every object has been modelled from six views (top, down, left, right, front, rear). Intermediate views can be recognized by combining invariant features from these six views using the LG graph. In other words, any view of an object not available in the database can be generated as a synthesis of the other views existing in the database. Figure 2 graphically presents facial regions and their synthesis for extracting a face and modelling it with an LG graph [14].
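As a rough idea of how the LG graph components described above might be organised in memory, consider the sketch below: each segmented region becomes a node carrying local information (centroid, colour, texture, boundary/skeleton data), and the global part records spatial relations among regions. Class and field names are assumptions for illustration; the paper's implementation is richer than shown here.

```python
# Sketch: a possible in-memory layout for the LG graph described above.
# Field and class names are assumptions; skeleton graphs and relation
# encodings in the actual system carry more detail than this.
from dataclasses import dataclass, field

@dataclass
class RegionNode:
    centroid: tuple[float, float]           # (x, y) of the segmented region
    colour: tuple[int, int, int]            # mean RGB after segmentation
    texture: float                          # a texture descriptor (placeholder)
    local_graph: list[tuple[float, float]]  # boundary points encoding size/shape
    skeleton: list[tuple[float, float]]     # internal-shape (skeleton) points

@dataclass
class LGGraph:
    nodes: dict[str, RegionNode] = field(default_factory=dict)
    # global part: spatial relations between regions, e.g. ("left_eye", "mouth", "above")
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_region(self, name: str, node: RegionNode) -> None:
        self.nodes[name] = node

    def relate(self, a: str, b: str, relation: str) -> None:
        self.relations.append((a, b, relation))

face = LGGraph()
face.add_region("left_eye", RegionNode((52.0, 40.0), (90, 70, 60), 0.3, [], []))
face.add_region("mouth", RegionNode((60.0, 95.0), (150, 80, 80), 0.5, [], []))
face.relate("left_eye", "mouth", "above")
```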

Fig. 2 The sequence of steps that extract the LG graph views for the face: (a) original face image, (b) segmented regions, (c) synthesis of regions for face detection, (d) local–global graph views


The LG Graph for the Recognition of Emotional Facial Expressions

For recognizing emotional facial expressions, it is necessary to detect the face, identify the emotional facial features it contains, and finally recognize the emotional facial expression associated with them. Using the proposed LG graph method, after detecting the face, the various facial features can be retrieved through the local–global graph node correspondences. The extra step needed for recognizing facial expressions is to compare the image LG graph with the existing Expression LG graphs stored in the LG database. It should be noted that each facial expression is characterized by a specific configuration of the facial features. This particular configuration is represented by the Expression LG graphs, similar to the Face LG graph previously defined (see Fig. 3 below). The advantage of the proposed method is that each individual facial configuration is represented by a more detailed graph representation, which allows even the more subtle expressions to be represented. For recognizing expressions, each node in the LG graph is modified according to Eq. 1:

node = {(x, y), colour/texture, L, size, border, LG_EXPR_1, ..., LG_EXPR_i}    (1)

where LG_EXPR_i represents an Expression LG graph, (x, y) are the coordinates of the region's centroid, colour represents the RGB values of the region after segmentation, texture represents the texture patterns of the region before segmentation, size is the number of pixels in the region, and border is the region contour. At the present stage, only five facial features are used to represent each expression.
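A minimal sketch of the comparison step implied by Eq. 1 is given below: the descriptors of the facial features extracted from an image are compared against stored Expression LG graph templates and the closest expression is returned. The per-feature distance and descriptor layout are placeholder assumptions; the paper compares full LG graphs rather than flat vectors.

```python
# Sketch: recognise an expression by comparing the feature graphs extracted
# from an image with stored Expression LG graphs (Eq. 1). The per-feature
# distance used here is a placeholder; the paper compares full LG graphs.
import numpy as np

# hypothetical per-feature descriptors (e.g. flattened local-graph parameters)
ExpressionTemplates = dict[str, dict[str, np.ndarray]]

def recognise(features: dict[str, np.ndarray],
              templates: ExpressionTemplates) -> str:
    """Return the expression whose templates are, on average, closest."""
    def distance(expr: str) -> float:
        per_feature = [np.linalg.norm(features[name] - vec)
                       for name, vec in templates[expr].items()
                       if name in features]
        return float(np.mean(per_feature)) if per_feature else np.inf
    return min(templates, key=distance)

templates = {
    "happy":   {"mouth": np.array([1.0, 0.8]), "left_eyebrow": np.array([0.2, 0.1])},
    "neutral": {"mouth": np.array([0.0, 0.0]), "left_eyebrow": np.array([0.0, 0.0])},
}
observed = {"mouth": np.array([0.9, 0.7]), "left_eyebrow": np.array([0.1, 0.1])}
print(recognise(observed, templates))   # -> happy
```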

Fig. 3 The LG Expression graphs for the features considered, for the neutral, happy, angry, and scream expressions, respectively

Features Considered

The features used are the left and right eyebrows, the eyes, and the mouth. Figure 3 shows the corresponding Expression LG graphs for the neutral, happy, angry, and scream expressions. The images used are taken from the AR face database [41].

Experimental Results

In order to evaluate the effectiveness of the method, a set of 160 images (40 × 40) containing four emotional facial expressions (neutral, happy, angry, and scream) from the AR database was used. Clearly, the method is able to distinguish and recognize different expressions as defined by the LG models. In general, it is observed that an angry face is blended with a neutral expression and hence is less effectively recognized than the other emotional facial expressions. Figure 4 shows how the proposed method recognizes realistic emotional facial expressions. Here, the method is tested on the six basic emotions mentioned earlier. In general, the system is able to accurately recognize facial expressions for which there exists a corresponding expression template (model) in the database. In addition, it assumes that the size of the face is no larger than a 70 × 70 pixel matrix, with a frontal pose. These restrictions are not a limitation in the application scenario for which the proposed system is envisaged, namely recognizing the emotional facial expression of a person seated in front of the user. In general, in these interactional situations, the user's gaze is directed towards the person in front of her/him (or towards a monitor, for example). The recognition of expressions from other views is not considered in this study. Future work will take into consideration other facial features for emotional expression recognition, such as the cheeks and forehead furrows.


Fig. 4 Example of facial expression recognition on real-world images: (a) original image, (b) detected face, (c) average error per emotion, with the recognized expression

Detecting Emotion in Speech

LG Graph-based Speech Processing

Here, we present a speech signal processing methodology based on the Local–Global (LG) graph method [22, 42]. The main reason for selecting LG graphs as the medium of speech representation is their compatibility with the LG graph representation of facial expressions. Thus, the LG graph represents the speech patterns, their features, and their relationships. Note that reference and test signals are required to have identical wording. Intra-speaker variations are induced by the emotional state of the speaker, health, and ageing, whereas inter-speaker variations are caused by different speaking habits, such as rhythm and intonation, or dialects.

Pre-processing the Speech Signal

In general, before processing and recognizing a speech signal, it is necessary to align it in time in order to delete those parts that do not convey relevant information (i.e. silence before or after the speech is uttered). Such segments are eliminated by computing the standard deviation of the speech samples contained in an appropriately sized moving window. If the window size is too small, the window will not contain enough signal information to measure the desired features. If the window size is too large, the feature values will be averaged over too long and variable a speech interval, and the extracted features will lose precision.
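A minimal sketch of this windowed-standard-deviation silence trimming is given below. The window size and the mean-based threshold follow the values and procedure detailed in the next paragraph; the padding details are simplified and the concrete function name is an assumption.

```python
# Sketch: remove low-activity portions of a speech signal using the standard
# deviation of a short moving window, as described above. Window size and the
# mean-of-std threshold follow the next paragraph; mean-padding is simplified
# (both ends are padded with the mean of the first window for brevity).
import numpy as np

def trim_silence(signal: np.ndarray, window: int = 100) -> np.ndarray:
    """Keep only the samples whose local (windowed) std exceeds the mean std."""
    # pad with a window mean rather than zeros to avoid abrupt boundary jumps
    pad = np.full(window // 2, signal[:window].mean())
    padded = np.concatenate([pad, signal, pad])
    # standard deviation of the window around each sample
    stds = np.array([padded[i:i + window].std()
                     for i in range(len(signal))])
    keep = stds >= stds.mean()          # values below the mean are discarded
    return signal[keep]

# usage: signal sampled at 16 kHz, window of ~100 samples (75-125 works well)
# trimmed = trim_silence(speech_wave, window=100)
```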

By a trial and error process, the optimum window size was found to range from 75 to 125 samples at a sampling frequency of 16 kHz. The speech signal was padded with the mean value of the windowed signal at the onset and offset in order to handle boundary problems. Mean padding was preferred to the traditional zero padding procedure because the latter introduces abrupt variations that are not suitable for the subsequent standard deviation method applied to eliminate sections of no interest in the speech signal. Afterwards, we calculate the mean of the computed standard deviation values (stored in an array), and the values less than this mean are forced to zero. Finally, the speech wave is re-sampled according to the new stored standard deviation values, eliminating all the samples whose standard deviation is zero. The procedure is illustrated in Figs. 5 and 6, which show an uttered speech sentence and its pre-processed version, respectively.

Fig. 5 Uttered speech sentence. On the x-axis is the time in ms and on the y-axis the amplitude in dB. The speech wave was sampled at 16 kHz

Fig. 6 Pre-processed speech sentence. On the x-axis is the time in ms and on the y-axis the amplitude in dB. The speech wave was sampled at 16 kHz

LG Graph Representation for Speech Recognition

Once the alignment is done, the LG graphs of the speech signals can be extracted and compared to the speech signal LG graph database. First, for each test signal, a local distance measure from all the signals in the LG database is computed. The global distance is then computed by averaging the local distances over the length of the signal. This procedure allows the speech signal to be represented as an LG graph and therefore compared against the ones in the LG graph database. The LG graph embeds both local information (the shape of the signal, i.e. the parameters of a triangle, stored within the local graph at each node) and global information (the topology of the signal). For a given signal, relevant segmental features are extracted and mapped to other extracted features by finding relations among them. The LG graph describes the segmental features in relation to the entire wave, thus offering a robust methodology for describing, comparing, and recognizing speech signals represented as LG graphs.
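The comparison step described above can be sketched as follows: per-node local distances are averaged into a global distance, and the closest reference LG graph in the database is selected. The per-node feature vector and the index-based alignment are simplifying assumptions for illustration.

```python
# Sketch: compare a test speech LG graph against a database of reference LG
# graphs by averaging per-node local distances into a global distance, as
# described above. Each node is summarised here by a small feature vector
# (an assumption); reference and test signals share the same wording.
import numpy as np

def global_distance(test_nodes: np.ndarray, ref_nodes: np.ndarray) -> float:
    """Average local (node-to-node) distance over the aligned node sequence."""
    n = min(len(test_nodes), len(ref_nodes))      # crude alignment by index
    local = np.linalg.norm(test_nodes[:n] - ref_nodes[:n], axis=1)
    return float(local.mean())

def best_match(test_nodes: np.ndarray,
               database: dict[str, np.ndarray]) -> str:
    """Return the label of the reference LG graph closest to the test graph."""
    return min(database,
               key=lambda label: global_distance(test_nodes, database[label]))
```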

Maxima–Minima Detection

The first step is to extract the local features by detecting the maximum and minimum points of the signal. This is done by finding the slope of the straight-line segments that form the signal and identifying the points where the slope of the signal changes. A maximum is defined as the point on the curve where the tangent changes from positive to negative; a minimum is defined as the point where the tangent changes from negative to positive. Figure 7 shows the maxima and minima detection in a short segment of the speech wave illustrated in Fig. 5. Such maxima and minima are then computed on the entire speech signal.

Fig. 7 Maxima–minima detection in a short speech window of approximately 5 ms (about 80 samples). The speech wave was sampled at 16 kHz
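A sketch of this slope-change detection on a sampled signal is shown below; plateau and noise handling are omitted for brevity.

```python
# Sketch: locate maxima and minima by looking for sign changes in the local
# slope of the sampled signal, as described above. Plateaus and noise handling
# are ignored for brevity.
import numpy as np

def extrema(signal: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return indices of local maxima and minima of a 1-D signal."""
    slope = np.sign(np.diff(signal))              # +1 rising, -1 falling
    change = np.diff(slope)                       # nonzero where the slope flips
    maxima = np.where(change < 0)[0] + 1          # + to - : a maximum
    minima = np.where(change > 0)[0] + 1          # - to + : a minimum
    return maxima, minima

t = np.linspace(0, 0.005, 80)                     # ~5 ms at 16 kHz
wave = np.sin(2 * np.pi * 1000 * t)               # 1 kHz test tone
peaks, valleys = extrema(wave)
```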

LG Graph Representation of the Signal

Once all the maxima and minima in the signal have been identified, the next step is to construct triangles between two minima and one maximum (or two maxima and one minimum) of the given signal. Triangles above the zero x-axis are formed by two minima and one maximum, while triangles below the zero x-axis are formed by two maxima and one minimum. This makes it possible to describe the signal through a set of triangles of varying sizes. It should be observed that, between the two minima and one maximum (or two maxima and one minimum) that form an approximate triangle, there are points where the slope changes. At these points, the slope either continues to increase towards the maximum (if the point lies on the part of the signal that is approaching a maximum from a minimum) or decreases towards a minimum (if the point lies on the part of the signal that is approaching a minimum from a maximum). The points (along with the maxima and minima) where the slope of the signal changes are termed nodes. Nodes occur between the two minima and the one maximum that form a triangle, as illustrated in Fig. 8. Thus, the triangles are approximations of the signal at given points. To make these approximations precisely match the entire signal, we extract from each segment of the signal, approximated by a triangle, a set of node features.


Cogn Comput Fig. 8 Examples of nodes identified in a triangular approximation of a short segment of speech

Fig. 9 Formation of Local Graphs from triangular approximations of a speech wave

These features are explained in the following. A given node N_i is represented by:

N_i = {SP_i, EP_i, θ_i, L_i}    (2)

Here, N_i stands for the ith node, SP_i for the starting point of the segment joining N_i, EP_i for the ending point of the segment joining N_i, θ_i for the angle between the segment and the horizontal axis (the slope), and L_i for the length of the segment joining N_i. The relationship between two nodes is given by a_ij = {N_i, θ_ij, N_j}, where N_i and N_j are two consecutive nodes occurring in the signal and θ_ij is the angle "connecting" N_i and N_j. Figure 8 shows the occurrence of the nodes in a triangle approximating a speech segment. A Local Graph LG_i, for the ith node, is described by Eq. 3 and illustrated in Fig. 9:

LG_i = N_1 a_{1,2} N_2 a_{2,3} N_3 a_{3,4} ... N_{k-1} a_{k-1,k} N_k a_{k,1}    (3)
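A compact sketch of Eqs. 2 and 3 follows: each node stores the start point, end point, slope angle, and length of its segment, and consecutive nodes are chained into a local graph (the relations a_ij are implicit in the ordering). The concrete containers are assumptions for illustration only.

```python
# Sketch of Eqs. (2)-(3): each node N_i records the start point, end point,
# slope angle and length of its line segment; chaining consecutive nodes
# (relations a_ij are implicit in the list order) yields the Local Graph LG_i
# of one triangular approximation. Containers are illustrative assumptions.
import math
from dataclasses import dataclass

@dataclass
class Node:                      # N_i = {SP_i, EP_i, theta_i, L_i}
    start: tuple[float, float]
    end: tuple[float, float]
    theta: float                 # angle of the segment w.r.t. the horizontal
    length: float

def make_node(p: tuple[float, float], q: tuple[float, float]) -> Node:
    dx, dy = q[0] - p[0], q[1] - p[1]
    return Node(p, q, math.atan2(dy, dx), math.hypot(dx, dy))

def local_graph(points: list[tuple[float, float]]) -> list[Node]:
    """Chain the slope-change points of one triangle into LG_i (Eq. 3)."""
    return [make_node(points[i], points[i + 1]) for i in range(len(points) - 1)]

# slope-change points between two minima and a maximum of one triangle
triangle = [(0.0, 0.0), (1.0, 0.4), (2.0, 0.9), (3.0, 0.3), (4.0, 0.0)]
lg = local_graph(triangle)
```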

From the set of Local Graphs generated according to the procedure previously described, it is possible to generate the global maps M_i (as illustrated in Fig. 10) and calculate their distances d_ij, which can be used for the recognition (graph matching) process:

M_i = {LG_i, Area_i, x–y of Peak_i, Width of Base_i, ...}
d_i,j = {distance between M_i and M_j}

Emotional Behaviour and Language Impairments

There is considerable evidence that emotion produces changes in respiration, phonation, and articulation, all of which determine the parameters of the acoustic signal.


Fig. 10 Formation of Global Graphs from triangular approximations of a speech wave

Therefore, emotional speech is described as a complex of acoustic attributes, which include F0 values, the durations of the time intervals related to the closing and opening of the glottis and of the time intervals during which the glottis remains closed, formant bandwidths, energy values in different frequency bands, energy ratios, the F0 contour, sentence, syllable, and phoneme durations, spectral tilt, and others [11, 43]. Yet, so far, there is little systematic knowledge about the details of the decoding process (i.e. the precise acoustic cues listeners use to infer the speaker's emotional state). Since the acoustic attributes that seem to play a role in signalling emotions are the same acoustic attributes that are modified by the phonetic context, by inter- and intra-speaker variability, and by environmental noise, it would be interesting to investigate whether there are reliable algorithms able to detect differences in emotional and phonetic attributes, and how these change in the light of disabilities related to language impairments.


Furthermore, once the acoustic attributes of the emotional state have been detected, we should find a computational architecture able to integrate emotional (prosodic) cues into speech understanding systems. This is because language-impaired individuals may not exploit the same acoustic cues for detecting emotional states, since their perceptual behaviour could be different, as has been suggested for timing in the case of specific-language-impaired and reading-disabled individuals. The interest in speech timing in the area of Specific Language Impairment (SLI) and Specific Reading Disability (SRD) is supported by a number of studies evaluating the perceptual performance of SLI/SRD subjects on Temporal Order Judgment (TOJ) and Inter Stimulus Interval (ISI) tasks [44–47]. SLI/SRD subjects made significantly more TOJ errors than the control group when the length of the stimulus was short and/or when the ISI was short (≤150 ms) rather than long (≥300 ms). These data were explained by hypothesizing a general auditory temporal processing deficit that causes a poor aptitude for processing brief and/or rapidly changing acoustic events, whether speech or nonspeech. However, in several other studies, SLI/SRD subjects did not exhibit such an auditory deficit, and the genesis of their disorder was attributed to purely linguistic and specific speech deficits related to a poor ability to transform the linguistic input into the corresponding phonological code [48, 49]. The criticism against the general auditory deficit hypothesis was mainly focused on the stimulus types used to test SLI/SRD performance [49, 50]. The counterexamples challenging the general auditory hypothesis were still based on synthetic speech and/or nonspeech stimuli showing rapid temporal variations in both frequency and time simultaneously, therefore making it difficult to evaluate the effects of one variable on the other. To take into account the objections reported above, we decided to test the auditory performance of a group of SRD children on natural speech and synthetic stimuli manipulated only along the temporal dimension. To this aim, we exploited, as speech stimuli, the single/geminate contrast in Italian, which has been shown to carry only temporal information [51]. As nonspeech stimuli, paired one-tone stimuli separated by different ISI lengths were used.

Results

The results are reported in Fig. 11, which displays the identification curves for both the control (filled squares) and the SRD (filled circles) children. The x-axis reports the time (in ms) and the y-axis the percentage of correct identifications for the three sets of continua: papa/pappa (pope/baby soup), pala/palla (shovel/ball), and the single/double paired sine waves.

Fig. 11 Identification Curves for the continua papa/pappa (pope/baby soup), pala/palla (shovel/ball), and single/double sine wave sounds, respectively. The control group is indicated with filled squares and the SRD group with filled circles

For the sine wave stimuli, pairs with an ISI longer than 90 ms were not reported since, above this threshold, both control and SRD children correctly identified two different sounds. An analysis of variance (ANOVA) was performed on the proportion of single responses, with group (control and SRD children) as a between-subjects variable and stimulus type as a within-subjects variable. Group effects did not reach significance (F(1, 18) = 0.07, p = .78), suggesting that the discrimination between single and geminate consonants and between different ISI lengths was reliable in SRD as well as in control children. Stimulus duration effects were, as expected, significant (F(9, 36) = 12.916, p < .0001). No interaction was found between the groups (SRD and control) and the different stimulus durations. SRD children were less categorical in the identification of the stimuli at the extremes of the /l, p/ continua but more categorical in the identification of the single/double sine wave continuum. These results suggest that SRD children (and, in general, SRD people) may not be affected by a temporal processing problem and therefore that timing does not play a role in their language disorder. It may, however, play a role in their language performance, since, as pointed out above, timing plays a great role in the detection of emotional states. Therefore, the analysis of the emotional states expressed by language-impaired individuals can be useful for understanding how changes in vocal expressions can change the perception of emotional states.


Stochastic Petri-Net Modelling of Human Subjects

A Petri-net model is a methodology exploited by numerous applications. Here, the SPN model is transformed into a graph in order to take advantage of the SPN properties (timing, parallelism, concurrency, synchronization of events) within our LG graph methodology [22, 24, 42, 52]. The LG graph models described earlier have the capability of holding structural information about objects/subjects. The functional behaviour of an object/subject is described by the states into which the particular object/subject can move once an appropriate trigger state is satisfied. A successful and powerful model capable of describing (or modelling) this functional behaviour is the SPN. Thus, in order to maintain the structural features of the graph model and the functional features of the SPN model, a mapping is presented here, where the LG graph model is transformed into the SPN graph model by the transformation m : LG → SPN, where a graph node N_i of the LG graph corresponds to a number of SPN places {P_i}, N_i → {P_i}, and the relationships {a_ij} correspond to SPN transitions {t_ij}, {a_ij} → {t_ij}. Thus, the structural properties of a graph node are transformed into a set of SPN places such that different states of the same object/subject correspond to an equal number of different SPN places, based on the SPN definition. For instance, an SPN place P_k carries the structural properties of the object/subject at state k and all the structural deformations associated with state k. In addition, when the structural properties of an object/subject are the same for all its corresponding SPN places, its functional behaviour may change while its structural properties remain the same. Furthermore, the relationships among graph nodes, once represented as SPN transitions, can represent both functional (or behavioural) states that transfer the object/subject from state j to state k and transitions of the relationships among the objects'/subjects' structural features. Consequently, the SPN graph carries not only the functional properties of the SPN but also the structural properties of the LG graph. The SPN graph methodology has been successfully applied to describe facial muscle movements for the recognition of emotional facial expressions, as illustrated in the following.
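Before the examples, the mapping m : LG → SPN just described can be sketched minimally: every LG node contributes one place per state of the object/subject, and every relationship a_ij becomes a transition t_ij whose firing moves the token (e.g. a voice or emotion trigger) to a new state. Class and field names are assumptions; timing and stochastic firing rates are omitted.

```python
# Sketch of the mapping m : LG -> SPN described above: every LG graph node
# N_i contributes one place per state of the object/subject, and every
# relationship a_ij becomes a transition t_ij. Names are assumptions;
# timing and stochastic firing rates are omitted.
from dataclasses import dataclass, field

@dataclass
class SPNGraph:
    places: dict[str, set[str]] = field(default_factory=dict)        # node -> {places}
    transitions: dict[tuple[str, str], str] = field(default_factory=dict)
    marking: set[str] = field(default_factory=set)                    # places holding a token

    def map_node(self, node: str, states: list[str]) -> None:
        self.places[node] = {f"P_{node}_{s}" for s in states}

    def map_relation(self, a: str, b: str) -> None:
        self.transitions[(a, b)] = f"t_{a}_{b}"

    def fire(self, a: str, b: str, state: str) -> None:
        """Move the token of node `b` to `state` when transition (a, b) fires."""
        self.marking -= self.places[b]
        self.marking.add(f"P_{b}_{state}")

spn = SPNGraph()
spn.map_node("mouth", ["neutral", "smile", "laugh"])
spn.map_node("eyes", ["open", "narrowed"])
spn.map_relation("eyes", "mouth")
spn.fire("eyes", "mouth", "smile")     # token = voice or emotion trigger
```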

Illustrative Examples and Results

Tracking and Associating Facial Emotion in Sequences of Images

In this example, we present a sequence of three facial features extracted from different image frames and associated together to provide information related to the emotional behaviour of a person (see Figs. 12 and 13).

Fig. 12 A sequence of facial features (lips-mouth) extracted from different image frames, activated and associated with a voice or an emotional token

SPNG Associations

Here we present the results of the SPN graph associations for facial features (mouth-lips) extracted from different image frames (see Fig. 13). More specifically, when the facial features are detected and tracked in different frames, the LG graph method is used to establish a local graph for each region. These region graphs are then associated to develop the global graph that connects all the region graphs. This association pattern represents a sequence of facial expressions associated with a particular emotion. By tracking and associating these sequences, the SPN creates patterns that allow the emotions associated with the human behaviour taking place in the sequences of frames to be analysed and understood. In the SPN graphical representations, colours are used to illustrate the transitions and the flow, using tokens that represent the cause of these transitions, for instance token = voice or emotion.

Fig. 13 The SPN association of the LG graphs for the facial features (mouth-lips) extracted from different image frames, and their activation via tokens (orange, green). The displayed figure refers to a sequence of facial expressions related to happiness and laughter; the colour tokens could be a joke, a happy thought, or else

Associating Facial Expressions with Spoken Language Patterns

Here we attempt to associate facial expressions with spoken patterns in an integrated model, in order to represent emotional behaviour completely with regard to both facial and spoken emotional status. The following example represents a happy emotion within a sequence of facial frames and the extracted LG graph representation (see Fig. 14). The frames used as exemplification, reported below, contain only the speech utterance "sta" extracted from the Italian phrase "Come stai?" (How are you?).

Fig. 14 A happy emotional state (in five video frames) for a woman who pronounces the word "sta…". The figure also illustrates the LG graph associations among the facial features (mouth, eyes, and eyebrows) that compose a happy emotion; the major feature characterizing this particular emotion is the morphology of the mouth

Conclusions

Human language, gesture, facial expressions, and emotions are not entities amenable to study in pristine isolation. There is a link of converging and interweaving cognitive processes that cannot be totally untangled. The definition and comprehension of this link can be grasped only by identifying features involving mental processes more complex than those devoted to the simple peripheral pre-processing of the received signals. To understand how human communication exploits information that, although arriving through several channels, all potentially contributes to the communicative act and supports the speaker's communicative goal, we propose a multi-modal database structure comprising verbal and nonverbal (speech, gaze, gesture, and facial expression) data. Gathering multi-modal data will enable us to develop new mathematical models describing their functioning and will raise new questions concerning the psychology of the communicative act. The scientific study of speech, gesture, emotion, and gaze as an integrated system is still at its beginning, and both the difficulty of determining the relationships between gestures, speech, gaze, and facial expressions and the time-consuming nature of the perceptual analysis slow down the investigation process. The proposed methodologies may be able to provide better and easier tools for automating this process. A major initiative will be to compare the human perceptual analysis of verbal and nonverbal features with their automated detection and analysis, exploiting time-linked measures of gaze and eye direction together with the alignment of fundamental frequency and perceptual transcriptions of speech prosody (both time-linked to motion measures).

A genuine emotion may exhibit different relationships to speech and other nonverbal entities. The appropriate phasing of speech, gestures, and facial expressions is then critical to identify the real meaning of the conveyed message and the emotional states of the speaker, as well as to solve ambiguities in communication. We propose to synchronize the phasing of gestural motions and emotional states through dynamic information related to the motion displacement (for example, with respect to a referential point) and through speech prosodic and facial expression analyses. The idea is then to identify, for each communicative expression, the physical and psychological features able, once combined together, to describe the structure of the whole set of actions involved in the transmission of a message, and to extract a mathematical model allowing its computational implementation. A further research step will be to compare normative and disabled emotional behaviours in order to adapt our mathematical model to the specific needs of the final user. The proposed LG graph methodology and its association with the SPN model could be a first step towards a global mathematical representation of communicative signals.

Acknowledgments This work has been partially supported by an AIIS Inc. grant and by the European projects COST 2102 "Cross Modal Analysis of Verbal and Nonverbal Communication" (http://cost2102.cs.stir.ac.uk/) and COST ISCH TD0904 "TMELY: Time in MEntal activitY" (http://w3.cost.eu/index.php?id=233&action_number=TD0904).

References

1. Caccioppo JT, Klein DJ, Bernston GC, Hatfield E. The psychophysiology of emotion. In: Haviland M, Lewis JM, editors. Handbook of emotion. New York: Guilford Press; 1993. p. 119–42.


2. Ekman P. Facial expression of emotion: new findings, new questions. Psychol Sci. 1992;3:34–8.
3. Fridlund AJ. The new ethnology of human facial expressions. In: Russell JA, Fernandez-Dols J, editors. The psychology of facial expression. Cambridge: Cambridge University Press; 1997. p. 103–29.
4. Izard CE. Basic emotions, relations among emotions, and emotion–cognition relations. Psychol Rev. 1992;99:561–5.
5. Ortony A, Clore GL, Collins A. The cognitive structure of emotions. Cambridge: Cambridge University Press; 1988.
6. Esposito A. The amount of information on emotional states conveyed by the verbal and nonverbal channels: some perceptual data. In: Stilianou Y, Faundez-Zanuy M, Esposito A, editors. Progress in nonlinear speech processing, vol. 4392. Berlin: LNCS Springer; 2007. p. 245–6.
7. Atassi H, Riviello MT, Smékal Z, Hussain A, Esposito A. Emotional vocal expressions recognition using the COST 2102 Italian database of emotional speech. In: Esposito A, Campbell N, Vogel C, Hussain A, Nijholt A, editors. Development of multimodal interfaces: active listening and synchrony, vol. 5967. Berlin: LNCS Springer; 2010. p. 255–67.
8. Atassi H, Esposito A. Speaker independent approach to the classification of emotional vocal expressions. In: Proceedings of the IEEE international conference on tools with artificial intelligence (ICTAI), Dayton, OH, USA; 2008. vol. 1, p. 487–494.
9. Bachorowski JA. Vocal expression and perception of emotion. Curr Dir Psychol Sci. 1999;8:53–7.
10. Breitenstein C, Van Lancker D, Daum I. The contribution of speech rate and pitch variation to the perception of vocal emotions in a German and an American sample. Cogn Emot. 2001;15:57–79.
11. Scherer KR, Banse R, Wallbott HG. Emotion inferences from vocal expression correlate across languages and cultures. J Cross Cult Psychol. 2001;32:76–92.
12. Esposito A, Riviello MT, Bourbakis N. Cultural specific effects on the recognition of basic emotions: a study on Italian subjects. In: Holzinger A, Miesenberger K, editors. LNCS Springer; 2009. vol. 5889, p. 135–148.
13. Morishima S. Face analysis and synthesis. IEEE Signal Process Mag. 2001;18(3):26–34.
14. Kakumanu P, Bourbakis N. A local-global graph approach for facial expressions recognition. In: Proceedings of the IEEE international conference on tools in artificial intelligence; 2006. p. 685–92.
15. Ekman P, Friesen WV, Hager JC. The facial action coding system, eBook. 2nd ed. London: Weidenfeld & Nicolson; 2002.
16. Kanade T, Cohn J, Tian Y. Comprehensive database for facial expression analysis. In: Proceedings of the 4th IEEE international conference on automatic face and gesture recognition (FG'00); 2000. p. 46–53.
17. Pantic M, Patras I, Rothkrantz LJM. Facial action recognition in face profile image sequences. In: Proceedings of the IEEE international conference on multimedia and expo. http://citeseer.ist.psu.edu/pantic02facial.html; 2002. p. 37–40.
18. Goldenberg R, Kimmel R, Rivlin E, Rudzsky M. Behaviour classification by Eigen decomposition of periodic motions. Pattern Recognit. 2005;38(7):1033–44.
19. Granlund G. A cognitive vision architecture integrating neural networks with symbolic processing. Kunstliche Intelligenz. 2005;2:18–24.
20. Keogh EJ, Pazzani MJ. Learning the structure of augmented Bayesian classifiers. IJAIT. 2002;11(4):587–603.
21. Maurer A. Bounds for linear multi-task learning. J Mach Learn. 2006;7:117–39.


22. Bourbakis N, Gattiker J, Bebis G. A synergistic model for representing and interpreting human activity and events from video. Int J Artif Intell Tools. 2003;12(1):1–16.
23. Littlewort G, Bartlett MS, Fasel I, Susskind J, Movellan JR. Dynamics of facial expression extracted automatically from video. In: Proceedings of the IEEE conference on computer vision and pattern recognition, workshop on face processing in video. http://citeseer.ist.psu.edu/711804.html; 2004.
24. Gattiker J, Bourbakis N. Representation of structural and functional knowledge using SPN graphs. In: Proceedings of the IEEE international conference on SEKE, MD; 1995. p. 47–54.
25. Yamato J, Ohya J, Kenichiro. Recognizing human action in time-sequential images using Hidden Markov Model. In: Proceedings of the IEEE international conference on computer vision and pattern recognition; 1992. p. 379–385.
26. Bourbakis N. Emulating human visual perception for measuring differences in images using an SPN graph approach. IEEE Trans Syst Man Cybern. 2002;32(2):191–201.
27. Braun D, Mayberry J, Powers A, Schlicker S. A singular introduction to the Hausdorff metric geometry. Pi Mu Epsilon J. 2005;12(3):129–38.
28. Busiello S. An algorithm to approximate discontinuous functions through fractals and its application to the synthesis and compression of sounds and images (in Italian). PhD thesis in Applied Mathematics, Università di Napoli "Federico II", Italy; 2004.
29. Naimpally S. All hyper topologies are hit-and-miss. Appl Gen Topol. 2002;3:45–53.
30. Sendov B. Hausdorff distance and image processing. Russ Math Surv. 2001;59(2):319–28.
31. Bourbakis N, Kakumanu P, Makrogiannis S, Bryll R, Panchanathan S. Neural network approach for image chromatic adaptation for skin colour detection. Int J Neural Syst. 2007;17(1):1–12.
32. Bourbakis N, Makrogiannis S. Stochastic optimization scheme for automatic registration of aerial images. In: Proceedings of the IEEE international conference on tools with artificial intelligence; 2004. p. 328–336.
33. Fu KS, Mu JK. A survey on image segmentation. Pattern Recogn. 1981;13:3–16.
34. Moghaddamzadeh A, Bourbakis N. A fuzzy region growing approach for segmentation of colour images. PR Soc J Pattern Recognit. 1997;30(6):867–81.
35. Pavlidis T. Structural pattern recognition. Berlin: Springer; 1997.
36. Schettini R. Low-level segmentation of complex colour images. Signal Process VI Theor Appl; 1992. p. 535–538.
37. Yaun P, Moghaddamzadeh A, Goldman D, Bourbakis N. A fuzzy-like approach to edge detection in coloured images. IAPR Pattern Anal Appl. 2001;4(4):272–82.
38. Kubicka E, Kubicki G, Vakalis I. Using graph distance in object recognition. In: Proceedings of the ACM eighteenth annual computer science conference. New York, NY: ACM; 1990. p. 43–8.
39. Sanfeliu A, Fu KS. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans Syst Man Cybern. 1983;13(3):353–62.
40. Ahuja N, An B, Schachter B. Image representation using Voronoi tessellation. Comput Vis Graph Image Process. 1985;29:286–95.
41. Martinez A, Benavente R. The AR face database. CVC Tech. Rep. #24. Universitat Autònoma de Barcelona, Edifici O, 08193, Barcelona. http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html or http://RVL.www.ecn.purdue.edu. Accessed Feb 1999.
42. Bourbakis N. Triangular representation of speech signals using LG graphs. ITRI-TR-11 internal report, Wright State University, Dayton, OH, USA; 2002.
43. Esposito A, Bourbakis N. The role of timing on the speech perception and production processes and its effects on language impaired individuals. In: Proceedings of the international IEEE symposium on BIBE, 0-7695-2727-2; 2006. p. 348–356.
44. Ahissar M, Protopapas A, Reid M, Merzenich M. Auditory processing parallels reading abilities in adults. Proc Natl Acad Sci. 2000;97:6832–7.
45. Tallal P. Temporal or phonetic processing deficit in dyslexia? That is the question. Appl Psycholinguistics. 1984;5:13–24.
46. Tallal P. Fine-grained discrimination deficits in language-learning impaired children are specific neither to the auditory modality nor to speech perception. J Speech Hear Res. 1990;33:616–7.
47. Tallal P, et al. Language comprehension in language-learning impaired children improved with acoustically modified speech. Science. 1996;272:81–4.
48. Breier JI, Fletcher JM, Foorman BR, Klaas P, Gray LC. Auditory temporal processing in children with specific reading disability with and without attention deficit/hyperactivity disorder. J Speech Lang Hear Res. 2003;46:31–42.
49. Heath SM, Hogben JH. The reliability and validity of tasks measuring perception of rapid sequences in children with dyslexia. J Child Psychol Psychiatry. 2004;45(7):1275–87.
50. Mody M. Phonological basis in reading disability: a review and analysis of evidence. Read Writ. 2003;16:21–39.
51. Esposito A, Di Benedetto G. Acoustical and perceptual study of gemination in Italian stops. JASA. 1999;106(4):2051–61.
52. Murata T. Petri nets: properties, analysis and applications. Proc IEEE. 1989;77(4):541–80.
