This is a preprint of the paper presented at the 11th International Conference Beyond Databases, Architectures and Structures (BDAS 2015), May 26-29, 2015, in Ustroń, Poland, and published in Communications in Computer and Information Science, vol. 521, 2015, pp. 585-597. DOI: 10.1007/978-3-319-18422-7_52
Evaluation Criteria for Affect-Annotated Databases

Agata Kołakowska, Agnieszka Landowska, Mariusz Szwoch, Wioleta Szwoch, Michał R. Wróbel

Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland
[email protected]

Abstract. In this paper a set of comprehensive evaluation criteria for affect-annotated databases is proposed. These criteria can be used to evaluate the quality of a database at the stage of its creation, as well as to evaluate and compare existing databases. The usefulness of the criteria is demonstrated on several databases selected from the affective computing domain. The databases contain different kinds of data: video and still images presenting facial expressions, speech recordings, and affect-annotated words.

Keywords. Affective computing, database quality, affect-annotated databases
1 Introduction
Emotions play an important role in human interaction, often influencing our behavior in a more or less predictable way. This fact is usually neglected by standard human-computer interfaces, which are based on sets of rigid rules. Although such interfaces might be effective and user-friendly in the sense of ease of use, they do not resemble human interaction in any way. On the contrary, affect-aware software may not only recognize and interpret users' emotions, but also adapt its behavior and functionality to users' needs according to the detected states. Such systems may be applied in many fields such as healthcare [1], entertainment [2], education [3], software engineering [4], etc.

A great many methods have been developed to identify users' emotional states. They are based on different data sources, such as visual (facial expressions, gestures) [5], audio (speech, voice) [6], textual (semantics) [7], physiological (heart rate, temperature, skin conductance, etc.) [8], input devices (keyboard, mouse, touch screen) [9], or a combination of them, which can significantly improve emotion recognition abilities [6, 10]. Most of these approaches rely on supervised training to create a classifier of emotions. Its effectiveness and generalization ability depend highly on the set of data used for training. Depending on the data source, the datasets may differ significantly: for visual recognition these would be videos or images, and for the audio channel voice samples. Furthermore, sentiment analysis techniques utilize dictionaries of affect-annotated words. The quality of these data has a great impact on the effectiveness of the recognition algorithms.

There are many databases and datasets that can be useful in the process of creating a specific affect recognition system. Many of them are publicly available for non-commercial purposes, such as research or education. Unfortunately, they differ significantly in terms of size, content and quality. Therefore, algorithms that use different datasets for training are difficult to compare. Moreover, datasets often consider narrow sets of recognized emotions, acquire training data in perfect conditions, or use experiment participants with very homogeneous characteristics. All these factors limit the usefulness of such datasets in real-life applications.

The problems mentioned above may cause a real dilemma when choosing a proper dataset for an application, so there is a need for a set of reliable criteria to facilitate making the right choice. Some evaluations of the available datasets can be found in the literature [10, 11], but none of them proposes a set of general factors to be taken into account during the process of designing any affect recognition system. In this paper, seven universal criteria are proposed to evaluate the quality and usefulness of affect-annotated databases regardless of their data type and field of application.

The paper is organized as follows: section 2 presents the database evaluation criteria, section 3 describes the proposed criteria for databases containing data of different types and gives some examples of available datasets, and section 4 provides final conclusions.
2 Evaluation criteria for affect-annotated databases
Affect-annotated databases are created for different purposes. Therefore, they can differ significantly in quality, size, completeness, formats, etc. Unfortunately, there are no commonly accepted standards for the creation and evaluation of such databases. In this section we propose seven general criteria that cover most aspects of affect-annotated database quality and the usefulness of their content for different applications. Verifying any database against these criteria can give an answer about its generality and completeness, which are required for its common and wide usage.
When creating or evaluating a set of emotion-labeled data, several important questions must be answered, taking into account different objectives that should be fulfilled, at least to some extent, for the data to be useful for classifier training and validation. These aspects cover the following topics:

1. Representativeness is the key attribute of a training set in each supervised training process. The way people show emotions differs strongly among individuals, so it is essential to gather data as diverse as possible. In general, a database should contain samples from a large set of people who differ in age, gender, race, culture, occupation, and other aspects. Although it is possible to narrow some aspect of the participant group to fit the assumed field of application, such as children's emotions, the obtained results cannot then be generalized to other cases, which limits the database usefulness.

2. Granularity of an affective database informs about the set of emotions or affective states that are distinguished and represented. This criterion seems crucial from the usability point of view. Unfortunately, there is no commonly accepted set of expressions that could be used in all applications. In fact, the choice of the set of expressions of interest depends on the field of application and on the assumed model of emotions. When a discrete model is used, the granularity criterion is represented by the number of different labels assigned. In the case of a dimensional model, the number of possible values of its components determines the number of distinguishable affective states.

3. Size and completeness inform about the total number of fully or partially assembled samples for each human subject and each class of emotions. To train a classifier, it is essential to have a large enough number of examples per class (emotion). In general, datasets with a greater number of items are more useful for training and testing purposes. Size and completeness are partially related to the representativeness factor; however, a database might be small and yet representative and complete.

4. Labeling quality is an important usability factor that can be considered on different levels of detail. At the minimum level it is required to label the emotional state of each sample and basic metadata about the human subject. Such labeling can be done by simply naming an affective state or by expressing it according to one or more commonly accepted affect models, such as PAD [12]. No matter whether the gathered data are artificial or natural, an independent label evaluation made by a number of judges should be performed. The second description level concerns additional data labeling that could facilitate data preprocessing and the creation of recognition algorithms.

5. Procedural quality describes a set of factors that assure proper conditions of data acquisition. These factors include, among others, the way of emotion expression, which can be spontaneous or posed, where the latter can be performed by amateurs or professional actors. The second important factor refers to acquisition conditions, which can be more or less natural. A common problem is that many collections of affective data contain posed expressions that come from studio recordings, which can differ greatly from those observed in everyday life. Unfortunately, it is not easy to create a database of well-indexed, spontaneous data from a natural environment.

6. Data-level quality refers to all technical aspects of data acquisition, such as the number of channels, sampling rate, resolution, quantization error, data representation model, and others. On the one hand, assuring high-quality data is very important for the data processing and further reasoning stages; on the other hand, data gathered with top-quality professional hardware may need additional quality degradation procedures to adapt them for consumer applications.

7. Database popularity and availability are important factors that cannot be neglected in the process of choosing a database for algorithm validation and evaluation. The popularity level, in the case of scientific purposes, can be evaluated by the number of citations of the first article that introduced the dataset. In many cases the database popularity is highly related to its quality expressed by the previously given criteria. Database availability determines its usefulness for other researchers and covers the license and publication policy, standards conformance, and additional supporting tools for exploring, complementing and extending the database. Databases that are paid, or that use custom proprietary formats with no additional access tools, cannot become popular and are practically useless.

Almost all of the proposed criteria are general and should be adjusted to the specific aspects of a database in a particular research field; a simple sketch of combining per-criterion ratings into an overall score is given below. In the following section we describe three important types of affect-annotated databases, give specific descriptions of the evaluation criteria and use them to evaluate some publicly available databases.
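How the individual criteria are traded off against each other is left to the designer of a given system (see also section 4). As a minimal sketch of one possible approach, the following Python snippet combines per-criterion ratings into a single weighted score; the 1-5 ratings, the weights and the helper names are hypothetical and serve only to illustrate that the weighting is application-specific, not to prescribe a scoring method.

```python
# Minimal sketch: combining per-criterion ratings into one weighted score.
# The 1-5 ratings and the weights below are hypothetical values chosen only
# for illustration, not measurements of any real database.

CRITERIA = [
    "representativeness", "granularity", "size_and_completeness",
    "labeling_quality", "procedural_quality", "data_level_quality",
    "popularity_and_availability",
]

def weighted_score(ratings: dict, weights: dict) -> float:
    """Return the weighted average of 1-5 ratings over the seven criteria."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(ratings[c] * weights[c] for c in CRITERIA) / total_weight

# A designer of, e.g., a children-oriented application might weight
# representativeness and procedural quality higher than popularity.
ratings = {c: 3 for c in CRITERIA}      # placeholder ratings for some database
ratings["representativeness"] = 4
weights = {c: 1.0 for c in CRITERIA}    # equal weights by default
weights["representativeness"] = 2.0
weights["procedural_quality"] = 1.5

print(f"overall score: {weighted_score(ratings, weights):.2f}")
```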
3 Evaluation of representative affect-annotated databases
Affect recognition methods can use different information channels, such as video [1, 5], audio [2, 13], standard input devices [9, 14], physiological signals [8, 15], depth information [16], and others. These channels can also be combined in multimodal systems, which typically achieve higher recognition rates due to the fusion of different information sources. For each information channel a specific database is needed, both for supervised training of algorithms and for validation of a developed approach. In this section three different kinds of databases, i.e. visual, audio and textual, are considered according to the evaluation criteria proposed in the previous section. Three affect-annotated databases of each kind were chosen to be evaluated in detail with the predefined criteria. Their choice was justified by the diversity of these datasets, in order to show how the proposed criteria apply to databases with various characteristics.
3.1 Still images and video databases of facial expressions and emotions
The visual channel carries the most information about the world surrounding humans, which is why it plays such an important role in interpersonal communication. People send a lot of visual (non-verbal) information, especially through facial expressions, but also with gestures, body movements, and so on. Analysis of video camera data is also the most popular non-invasive and non-intrusive source of information about human emotions. There are a number of algorithms based on these data which can recognize users' emotions. Some of them concentrate only on facial expression recognition (FER), as this is how humans usually analyze the affect of other people [17].

Still images are the most popular way to represent visual information. However, as facial expressions and human emotions have a temporal character, video databases containing short recordings can be more useful. Visual databases usually contain two-dimensional (2D) images or video sequences. Unfortunately, algorithms using a video camera are very sensitive to face illumination conditions, which causes great problems with dark or unevenly illuminated scenes. Possible solutions include storing pairs of stereoscopic images or data with an additional channel carrying scene depth information. As depth sensors use an additional light source (e.g. infrared), this information is generally insensitive to varying ambient light conditions. The rapid development and increasing availability of relatively cheap consumer RGB-D sensors allows for the creation of on-line systems for the recognition of facial expressions and, going further, human emotions and moods.

Evaluation of affect-annotated databases containing visual information requires taking into account some additional aspects of the criteria given in the previous section. Because the relevant features of image and video databases are very similar, we provide common evaluation criteria for both. The representativeness criterion should reflect the diversity of people participating in the sessions with respect to gender, race, age, hairstyle, and other attributes. The diversity requirements strongly depend on the purpose the images are to be used for. Other features worth considering are wardrobe accessories such as glasses or hats. In addition to diversity, it may be important to consider whether expressions were spontaneous or posed. In the latter case, the participation of professional actors may affect the assessment of representativeness.

Size and completeness in the case of a facial expression database are mainly determined by the number of video or image files and the number of participants. Other important factors are the total number of expressions recorded and the number of recordings per participant.
For labeling quality, two additional indexing levels should be considered for facial expressions. The first is the Facial Action Coding System (FACS), a human-observer-based system that uses so-called action units (AUs) to describe even subtle changes in facial features [18]. Apart from AU labeling, images or recordings of faces should also include additional metadata such as acquisition parameters, scene environment, and the locations of characteristic facial points (landmarks), as illustrated by the sketch below.

The procedural quality criterion covers both proper conditions of image or video acquisition and conditions for proper emotion expression. Scene composition describes face location and orientation in an image and plays an important role in face recognition as well as in the recognition of facial expressions and emotions. In the facial expression recognition field it is often assumed that a face stays oriented towards the camera. However, this may not be true, especially when natural emotions are expressed; in that case the human pose often changes dynamically, which should be reflected in a database. Other important factors that can influence face location or expression recognition are scene illumination, background and the presence of other people. Generally, recordings of human facial expressions and emotions may be divided into two categories: posed and spontaneous sessions. Posing allows for recording extreme expressions such as crying, envy, and so on. The drawbacks of posed recordings are the difficulty of acting such emotions and their artificial character. On the other hand, recording spontaneous emotions is much harder and requires a lot of time to capture the desired emotion or expression. Moreover, it is almost impossible to record the extreme emotions mentioned earlier. That is why databases of facial expressions and emotions should contain images or recordings of both types, possibly performed by professional actors or actresses.

Data-level quality strongly depends on the hardware used for data acquisition and on the technical properties of the files, such as resolution, number of colors, and, in the case of video recordings, sampling rate. The effective resolution of a face in an image is an important factor of image quality and can influence recognition efficiency. For three-dimensional representations, depth resolution (Z axis) should correspond to the scene resolution in the image plane (X and Y axes), which means that the face should be located at a certain "optimal" distance from the camera. Additionally, such aspects as lighting conditions and scene exposure, as well as any auto-adjustment functions such as automatic illumination level or automatic white balance, should be considered.

Table 1 contains a detailed description of the selected image and video datasets according to the proposed criteria. It shows the great variety of the sample databases with respect to particular criteria, allowing readers to form their own preferences when choosing learning and testing datasets.
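To make the labeling levels discussed above more concrete, the following minimal sketch shows what a per-sample record of a facial expression database could look like; the class and field names are hypothetical and do not correspond to the schema of any of the reviewed databases.

```python
# Hypothetical per-sample record for a facial expression database. It gathers
# the labeling levels mentioned above: a discrete emotion label, FACS action
# units, facial landmarks and basic acquisition metadata. All names are
# illustrative only.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FacialExpressionSample:
    subject_id: str
    emotion_label: str                     # discrete model, e.g. "happiness"
    action_units: List[int]                # FACS AU codes, e.g. [6, 12]
    landmarks: List[Tuple[float, float]]   # (x, y) facial points in pixels
    resolution: Tuple[int, int]            # frame width and height
    posed: bool                            # posed vs. spontaneous expression
    illumination: str = "studio"           # acquisition conditions

sample = FacialExpressionSample(
    subject_id="S001",
    emotion_label="happiness",
    action_units=[6, 12],                  # cheek raiser + lip corner puller
    landmarks=[(120.5, 210.0), (180.2, 212.4)],
    resolution=(640, 480),
    posed=True,
)
print(sample.emotion_label, sample.action_units)
```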
Table 1. Selected image and video databases described with the proposed criteria

Criteria | Extended Cohn-Kanade [19] | MMI Facial Expression Database [20] | FEEDB [21]
Representativeness | Diversity of races and ages | White, in the age of 20-33 years | European people, in the age of 19-26 years
Granularity | 6 basic emotions + 1 neutral | 8 emotions | 30 emotions and facial expressions
Size and completeness | 128 persons, 593 expressions, 10708 images | 5 persons, 493 expressions | 50 persons, 1550 recordings
Labeling quality | FACS, Action Units, Emotions (only for 327 expressions) | Action Units | Action Units, SAM
Procedural quality | Posed expressions, professional lighting and background | Posed expressions, professional lighting and background | Posed and natural expressions, standard lighting conditions, people in the background
Data-level quality | PNG files: 104 files in BW, 640x480, poor quality; 13 files in color, 640x480, poor quality; 6 files in color, 720x480, good quality | JPEG files: color, 1200x1600, very good quality | XED and AVI video files, BMP frames, standard VGA resolution
Popularity | high (345 citations in Google Scholar) | high (334 citations in Google Scholar) | low
3.2 Speech databases (audio)
Emotions affect our utterances: not only is it important what the speaker says, but also how he or she says it. Features such as pitch and volume may be used to discriminate between some emotional states [11, 13]. The description below covers the issues that should be taken into account when choosing or designing a speech database.

The representativeness of a speech database is determined by the number of speakers, their profile and the type of texts uttered. Essential parameters of the database are the language of the recordings and the type of utterances. They can be either natural, taken for example from a TV program, or artificial, recorded by actors in a studio. In the case of artificial utterances, it is important to know whether or not the experiment participants are professional actors, whose reactions in speech may be exaggerated [11]. The artificial texts may or may not have an emotional context; the latter option might entail difficulties with acted uttering, especially for non-professionals. The advantage of artificial utterances is that it is possible to record the same sentences from all actors. Moreover, from the point of view of future data processing, it is also important whether these are short utterances containing single sentences or words, or long speeches, which require searching through to match specified moments. Other relevant features, which may affect the degree to which emotions are manifested in speech, are the speakers' sex, age and nationality.

Size and completeness are determined by the number of samples per speaker, which depends on the number of represented emotions. In the case of speech databases, their size may also be measured in the number of hours recorded.
Labeling quality is one of the main factors determining the quality of training data. The simplest, though quite subjective, way of assigning labels applies to acted utterances, where the actors are asked to demonstrate a specified emotional state. Usually the labels are then judged with respect to their recognizability or naturalness in order to reject the samples which do not fulfill pre-specified requirements.

Procedural quality depends on the way an experiment is designed and then conducted. Although professional actors are able to act without a special stimulus, some elicitation is usually performed; sometimes short descriptions of situations are provided, and the speech may also be guided by the experimenter. It is important whether or not the speaker is allowed to repeat the sentences he or she is not satisfied with. Finally, the database may contain either all recordings or only the ones selected by judges.

Data-level quality of recordings is determined by factors such as the number of channels, sampling rate and coding resolution (a minimal inspection of such attributes is sketched below). Moreover, any background noise may affect the eventual analysis, which may be especially troublesome in the case of natural recordings. Another unfavorable attribute is long recordings, in contrast to separated utterances, as they require manual splitting. The file format is also an important parameter. Finally, some databases contain only speech, whereas others are multimodal and usually contain video as well.

Table 2 contains a detailed description of three selected speech datasets, which confirms the usefulness of the proposed criteria in the task of speech database evaluation.
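As a minimal illustration of the data-level quality attributes listed above, the following sketch uses only Python's standard wave module to report the number of channels, sampling rate, bit depth and duration of a WAV recording; the file name is hypothetical.

```python
# Minimal data-level quality check for speech recordings using the standard
# wave module; it reports channels, sampling rate, bit depth and duration.
import wave

def describe_recording(path: str) -> dict:
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sampling_rate_hz": rate,
            "bits_per_sample": 8 * w.getsampwidth(),
            "duration_s": frames / float(rate),
        }

if __name__ == "__main__":
    print(describe_recording("sample_utterance.wav"))  # hypothetical file
```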
Table 2. Selected speech databases described with the proposed criteria

Criteria | Emotional Prosody Speech and Transcripts [22] | Berlin Database of Emotional Speech [23] | Belfast Naturalistic Database [24]
Representativeness | 8 professional actors (5 females, 3 males); short (2 or 3 words) semantically neutral utterances (dates and numbers) in English | 10 non-professional actors (5 females, 5 males) in the age of 21-35; sentences in German from everyday life, semantically neutral | 125 speakers (31 males, 94 females); natural (clips from TV programs) and artificial (studio interactions) recordings
Granularity | 15 emotional states: disgust, panic, anxiety, hot anger, cold anger, despair, sadness, elation, happiness, interest, boredom, shame, pride, contempt, neutral | 7 emotional states: anger, boredom, disgust, anxiety/fear, happiness, sadness, neutral | dimensional model (two dimensions: activation, evaluation)
Size and completeness | over 9 hours recorded in 15 files; 15-30 samples per emotion for each user | about 500 utterances; 40.5 MB; each sentence read by every user with each of the 7 emotional states | 298 clips lasting from 10 to 60 s
Labeling quality | labels assigned according to the emotional states acted by the actors after being provided with the situational context | 500 out of 800 selected by 20 judges according to their recognizability and naturalness | 2D labelling: activation + evaluation; rated by 5 judges in regular intervals
Procedural quality | emotions induced by short descriptions of situational context; possibility of repeating the utterances | 10 from 40 people selected by three judges; emotions induced by short descriptions of example situations; possibility of repeating sentences | clips selected to cover a wide range of emotional states; each speaker recorded in both an emotional and a neutral state (according to judges)
Technical quality | sampling rate 22.05 kHz; 16 bits per sample; two channels; WAV files; some background noise between utterances; transcripts available | sampling rate 48 kHz, downsampled to 16 kHz; WAV files; the recording level adjusted between the loudest and the most quiet utterances | clips saved in MPEG files, audio data extracted in WAV files
Popularity | high | very high, easy access | high
3.3 Affect-annotated dictionaries
Sentiment analysis of textual information derived from websites, forums or dialogues, frequently used in opinion mining, requires a dictionary of affect-annotated words. There are several such dictionaries, which differ in size, method of construction and labeling technique, including Ortony's Affective Lexicon [25], General Inquirer [26], SentiWordNet [27], WordNet-Affect [28], Subjectivity Lexicon [29] and Affective Norms for English Words (ANEW) [30], to name just a few. The lexicons of affect-annotated words can be evaluated using the proposed criteria; however, detailed definitions and metrics must be provided.

The representativeness factor is an important issue in the evaluation of a dictionary, as a selected word subset might be useful in some application areas only. To evaluate a dictionary the following characteristics should be considered: language, number of words and the inclusion criteria, especially the inclusion of the most commonly used words, colloquialisms and perhaps offensive phrases. It is important to emphasize that the lexicon word set should be matched to the characteristics of the application area.
Labeling quality of a lexicon is the most important criterion, as people may differ significantly in the meanings they assign to words, and some words are memes that might have a special meaning in a specific culture. The representativeness of manually created dictionaries can be defined as a function of the size and diversity of the group of people who annotated the words. Evaluating the representativeness of automatically generated lexicons depends both on the quality of the source material and on the generation method; in some cases it might be difficult to evaluate this factor due to the lack of data regarding the representativeness of the source. Although representativeness and labeling quality are the most important characteristics and should be evaluated separately, they are strongly correlated with the procedural quality of a lexicon. Procedural quality of a manually annotated lexicon depends on the repeatability of the construction process: how many people annotated one word, and was there any kind of reliability check performed. For automatically annotated lexicons, the annotation algorithm should also be checked for repeatability: would we obtain the same labels with a different order of annotation or a dissimilar starting point.

Affective lexicons are usually distributed as dictionary files (one or multiple) containing word-label pairs. Technical quality of a dictionary can be decomposed into the following criteria: even/uneven labeling of all words (are all words labeled with the same precision and in the same way, or is there an uneven distribution of labels, i.e. 'empty' labels); group/subgroup annotation (some dictionaries contain overall annotations as well as annotations provided by subgroups, e.g. male/female); the existence of process metrics (e.g. number of evaluations, standard deviation) that allow the uncertainty related to the actual word-label assignment to be estimated (see the sketch at the end of this subsection); and application support.

Table 3 contains a detailed description of the selected lexicons according to the proposed criteria. The most important criteria for the evaluation of affect-annotated lexicons are representativeness, granularity and labeling quality. Manually created lexicons contain a limited number of words, but they are the most reliable and popular. Automatically annotated lexicons, although containing a large number of words, are usually assigned a limited set of labels, which does not conform to the requirements of some applications.
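The following minimal sketch shows how an ANEW-style lexicon (word, valence mean, standard deviation across annotators) might be used to score a short text; the tiny in-line lexicon and its numeric values are made up for illustration, and a real application would load the distributed dictionary file instead.

```python
# Minimal sketch of lexicon-based scoring with an ANEW-style dictionary.
# The in-line lexicon values are invented for illustration only.
from statistics import mean

LEXICON = {
    # word: (valence mean on a 1-9 scale, standard deviation of the ratings)
    "happy":  (8.2, 1.0),
    "gloomy": (2.8, 1.2),
    "table":  (5.0, 1.5),   # affectively neutral filler word
}

def valence(text: str):
    """Average valence of the covered words; None if no word is covered."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return None
    score = mean(v for v, _ in hits)
    spread = mean(sd for _, sd in hits)   # rough proxy for label uncertainty
    return score, spread

print(valence("a happy but gloomy day"))   # -> roughly (5.5, 1.1)
```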
4 Conclusions and future work
For many emotion recognition algorithms, the selection of a suitable training set is the key factor for proper operation. This paper presents criteria which may be used in the evaluation of affect-annotated databases. Some of the seven defined criteria are general, while others have to be specified in detail for different databases. The paper describes all the criteria from the perspective of three different database types: visual, speech, and dictionaries. The evaluation performed for three databases of each type has demonstrated the suitability of the developed criteria.

The development of comprehensive and reliable affect-annotated databases is a difficult task. Therefore, the provided criteria may also be treated as guidelines in the process of their development. None of the proposed factors may be regarded as the primary one: assigning weights to the criteria is the task of the system's designer and is closely related to a given application.
Table 3. Selected dictionaries described with the proposed criteria

Criteria | ANEW [30] | WordNet-Affect [28] | SentiWordNet [27]
Representativeness | English (+ multiple national adaptations), the most commonly used words; no colloquialisms; no swear words | English, only terms related to affect | English, (almost) all words
Granularity | Annotation based on PAD (continuous 3D scale) | Labels: emotion (positive, negative, ambiguous, neutral), attitude, feeling | Labels: positive, negative, neutral (summing to 1)
Size and completeness | 1040, the most frequently used words only | Manual (1903), automatic (4787) | 115,000
Labeling quality | Manual, based on the referential SAM questionnaire | Manual-automatic | Automatic (semi-supervised learning and random-walk process)
Procedural quality | Each word annotated many times by different people, no reliability check, one combination of word sets only | For the manual part no information on the labeling process is provided; automatically annotated words were derived from the manual core using WordNet | Derived automatically from WordNet, post-checked against manual results (p-normalized Kendall tau distance of 0.23-0.28)
Technical quality | Single downloadable file: word annotation with P, A, D means and SD | Multiple downloadable files; version 1.0 provided information on manual or automatic creation, version 1.1 does not | Single downloadable file, no information on reliability
Popularity | very high - referential lexicon | Moderate | High
Acknowledgments

The research leading to these results has received funding from the Polish-Norwegian Research Programme operated by the National Centre for Research and Development under the Norwegian Financial Mechanism 2009-2014 in the frame of Project Contract No Pol-Nor/210629/51/2013.
References

1. Tivatansakul, S., Chalumporn, G., Puangpontip, S., Kankanokkul, Y., Achalaku, T., Ohkura, M.: Healthcare System Focusing on Emotional Aspect Using Augmented Reality: Emotion Detection by Facial Expression. In: Advances in Human Aspects of Healthcare, pp. 375-384 (2014)
2. Jones, C., Sutherland, J.: Acoustic Emotion Recognition for Affective Computer Gaming. In: Affect and Emotion in Human-Computer Interaction, pp. 209-219. Springer, Berlin Heidelberg (2008)
3. Landowska, A.: Affect-awareness Framework for Intelligent Tutoring Systems. In: 6th International Conference on Human System Interaction, Gdańsk (2013)
4. Wróbel, M.R.: Emotions in the software development process. In: 6th International Conference on Human System Interaction, Gdańsk (2013)
5. Gunes, H., Piccardi, M.: Affect Recognition from Face and Body: Early Fusion vs. Late Fusion. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3437-3443 (2005)
6. Schuller, B., Lang, M., Rigoll, G.: Multimodal emotion recognition in audiovisual communication. In: IEEE International Conference on Multimedia and Expo, Lausanne (2002)
7. Gill, A.J., French, R.M., Gergle, D., Oberlander, J.: Identifying Emotional Characteristics from Short Blog Texts. In: 30th Annual Conference of the Cognitive Science Society, pp. 2237-2242 (2008)
8. Szwoch, W.: Using Physiological Signals for Emotion Recognition. In: 6th International Conference on Human System Interaction, Gdańsk (2013)
9. Kołakowska, A.: A review of emotion recognition methods based on keystroke dynamics and mouse movements. In: 6th International Conference on Human System Interaction, Gdańsk (2013)
10. Zeng, Z., Pantic, M., Roisman, G., Huang, T.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(1), 39-58 (2009)
11. Pittermann, J., Pittermann, A., Minker, W.: Handling Emotions in Human-Computer Dialog. Springer, Dordrecht (2010)
12. Russell, J., Mehrabian, A.: Evidence for a Three-Factor Theory of Emotions. Journal of Research in Personality 11, 273-294 (1977)
13. Yacoub, S., Simske, S., Lin, X., Burns, J.: Recognition of Emotions in Interactive Voice Response System. In: 8th European Conference on Speech Communication and Technology, Geneva (2003)
14. Epp, C., Lippold, M., Mandryk, R.L.: Identifying emotional states using keystroke dynamics. In: Conference on Human Factors in Computing Systems, pp. 715-724, Vancouver (2011)
15. Bailenson, J., Pontikakis, E., Mauss, I., Gross, J., Jabon, M., Hutcherson, C., Nass, C., John, O.: Real-time classification of evoked emotions using facial feature tracking and physiological responses. International Journal of Human-Computer Studies 66(5), 303-317 (2008)
16. Burgin, W., Pantofaru, C., Smart, W.D.: Using depth information to improve face detection. In: 6th International Conference on Human-Robot Interaction, pp. 119-120 (2011)
17. Castrillón-Santana, M., Déniz-Suárez, O., Antón-Canalís, L., Lorenzo-Navarro, J.: Face and Facial Feature Detection Evaluation. In: 3rd International Conference on Computer Vision Theory and Applications (2008)
18. Ekman, P., Friesen, W.V.: Facial Action Coding System. Consulting Psychologists Press, Palo Alto, CA (1978)
19. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended Cohn-Kanade (CK+): A complete dataset for action unit and emotion-specified expression. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 94-101 (2010)
20. Pantic, M., Valstar, M.F., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: IEEE International Conference on Multimedia and Expo, Amsterdam (2005)
21. Szwoch, M.: FEEDB: a Multimodal Database of Facial Expressions and Emotions. In: 6th International Conference on Human System Interaction, Gdańsk (2013)
22. Liberman, M., et al.: Emotional Prosody Speech and Transcripts LDC2002S28. Linguistic Data Consortium, Philadelphia, https://catalog.ldc.upenn.edu/LDC2002S28 (2002)
23. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A Database of German Emotional Speech. In: 9th European Conference on Speech Communication and Technology, Lisbon (2005)
24. Douglas-Cowie, E., Cowie, R., Schroeder, M.: The description of naturally occurring emotional speech. In: 15th International Congress of Phonetic Sciences, Barcelona (2003)
25. Ortony, A., Clore, G.L., Foss, M.A.: The referential structure of the affective lexicon. Cognitive Science 11, 341-364 (1987)
26. General Inquirer, http://www.wjh.harvard.edu/~inquirer/, accessed 21.02.2013
27. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In: 7th Conference on Language Resources and Evaluation, pp. 2200-2204, Valletta (2010)
28. Strapparava, C., Valitutti, A.: WordNet-Affect: an Affective Extension of WordNet. In: 4th International Conference on Language Resources and Evaluation, pp. 1083-1086, Lisbon (2004)
29. Banea, C., Mihalcea, R., Wiebe, J.: A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources. In: 6th International Conference on Language Resources and Evaluation, Marrakech (2008)
30. Bradley, M.M., Lang, P.J.: Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida (1999)