Design and Implementation of Czech Database of Speech Emotions

D. Uhrin, P. Partila, M. Voznak, Member, IEEE, Z. Chmelikova, M. Hlozak and L. Orcik
Abstract — This paper deals with the design and implementation of a Czech database of emotion samples in the form of short sound recordings, and of a tool that evaluates the database samples by subjective methods. To create the database of emotion samples, it was necessary to record the broadcasting of three radio channels and one television channel. The emotion database was created after a manual analysis of the recordings and cropping of the samples that passed a subjective test. All records in the emotion database were then evaluated using our evaluation tool and the results were analyzed. Based on these results, the samples were assessed on how credible and reliable they are and how well they represent the different emotional states.

Keywords — Emotion, Emotion samples, Emotion state, Evaluation, Samples, Speech, Voice.
I. INTRODUCTION

The development of technology, applications and services makes it possible to achieve natural interaction between humans and digital technologies such as smartphones, televisions, computers, etc. Nowadays, digitized voice samples and gestures are used both for classical Voice over IP (VoIP) applications [1], [2] and to command different appliances. Human speech offers a lot of information that can be obtained with quite good precision.
The research leading to these results has received funding from the VSB-Technical University of Ostrava under grant SGS No. SP2014/72 and was partially supported by the project OPVK No. CZ.1.07/2.2.00/28.0062 "Joint activities of BUT and TUO while creating the content of accredited technical courses in ICT".

D. Uhrin is a PhD student with the Department of Telecommunications, VSB-Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava-Poruba, Czech Republic (phone: +420596991641, e-mail: [email protected]).

P. Partila is a PhD student with the Department of Telecommunications, VSB-Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava-Poruba, Czech Republic (phone: +420597321651, e-mail: [email protected]).

M. Voznak is an Associate Professor with the Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava-Poruba, Czech Republic (phone: +420-596991699, e-mail: [email protected]).

Z. Chmelikova is a Vice Dean for Bachelor Science with the Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava-Poruba, Czech Republic (phone: +420-596991632, e-mail: [email protected]).

M. Hlozak and L. Orcik are PhD students with the Department of Telecommunications, VSB-Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava-Poruba, Czech Republic (phone: +420-596991641, e-mails: [email protected] and [email protected]).
Human speech also carries secondary information that characterizes the speaker and his state, for example age, gender, emotional state, or speech deviations such as a lisp and other speech features [8]. As the title suggests, this article describes emotional states and a tool to determine the emotional state by a subjective method.

Emotion is one of the human characteristics that represents a mental state; it causes physiological changes in the human body and, consequently, changes in human speech. Information on a particular emotion extracted from human speech can find use in different domains. Customer satisfaction surveys and customer interest in products are evaluated based on the emotional state of the person; this information is a direct response to some form of stimulus. Agents in call centers are paid for their work and their ability to communicate with customers. Government sectors such as the police, the fire department and especially the military forces put more emotional pressure on their employees than other sectors do. The ability of a soldier, a police officer or a firefighter to perform their work depends on the commands received. Thus, using a system for emotional state recognition from human speech provides protection against dispatching commands under a specific emotion such as apprehension, fear or terror.

Security applications usually require a person to use his speech as an authorization key. Human speech is affected by the physiological changes caused by changes of emotional state. For example, access can be denied to a person who is in fact authorized, because the authorization part of the system recognizes the changes in speech and evaluates them as an unauthorized key. It is therefore useful to use secondary speech information, which can help prevent authorized access being falsely denied.

II. STATE OF THE ART

It is a confirmed fact that the emotional state has an impact on human speech as well as on the physiological state of the body. This has led to improvements in the automatic classification of human emotion, achieved with the help of recognition techniques mainly in the last few decades. Nowadays, the computational power of processors is on a very different level compared to ten or twenty years ago, and this new hardware allows us to use these methods for emotion recognition in real time. A lot of secondary information obtained from speech could not be processed earlier due to a lack of computing performance and of suitable methods. The poor quality of training samples was also a problem.

Nowadays, there are many databases of emotional recordings.
The majority of database samples show simulated emotions, as they are recorded by actors, and do not capture real-life emotions. On the other hand, the quality of the sound recordings is very high, as they are recorded in a studio, excluding any extra noise. It is much easier to create this type of database than a real emotion database, where real emotion samples must be manually cut out of sound recordings that contain real emotions. Working with acted recordings is also easier because each one is labeled; the labels contain information such as the kind of emotion recorded, the gender of the actor, etc. Moreover, the actors mostly pronounce the same sentences, which guarantees the same context across recordings. Such recordings are more efficient for training an emotion classifier. Some of the best known and recent emotional speech databases are: HUMAINE [3], Emotional Prosody Speech and Transcripts, Danish Emotional Speech Corpus, Berlin Emotional Speech Database [4], and Serbian Emotional Speech Database [10].

III. CREATION OF SAMPLE DATABASE

The first step in creating a database of real emotions, not simulated by actors, was to record a few hours of broadcasting of two radio stations. Further recordings were available in the official webpage archive of a third radio station, and some recordings of television broadcasting were downloaded from the video-sharing portal YouTube; the audio track of the television broadcasts was cut out to create database samples. The parameters of the database samples had to meet the following requirements:

- a sample has to have a duration from one to six seconds;
- a sample should not contain environmental noise;
- a sample has to contain human speech in the form of a few words or a full sentence;
- a sample has to be uniquely named.

The name of a database sample consists of three parameters. The first one is the emotion state. There are seven basic emotion states [5], [6]; database samples were collected for only four of them, as recordings of the remaining states (boredom, disgust, fear) are difficult to find or hard to recognize using subjective methods. The second parameter numbers the samples, and the third and last one is the gender of the speaker in the sound recording, see Fig. 1.

Fig. 1: Database sample name.
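To make the naming scheme concrete, the following is a minimal sketch of a helper that assembles such a name; the function name, the separators and the zero-padding are our assumptions, since the paper only states that the name combines the emotion state, a sample number and the speaker's gender (cf. Fig. 1).

<?php
// Hypothetical helper illustrating the naming scheme of Fig. 1.
// The exact order, separators and padding are assumptions; the paper
// only specifies the three components of the name.
function build_sample_name($emotion, $number, $gender)
{
    // The four emotion states actually collected for the database.
    $states = array('neutral', 'anger', 'happiness', 'sadness');
    if (!in_array($emotion, $states, true)) {
        throw new InvalidArgumentException("Unknown emotion state: $emotion");
    }
    // e.g. "anger_042_m.wav" for the 42nd male anger sample
    return sprintf('%s_%03d_%s.wav', $emotion, $number, $gender);
}

echo build_sample_name('anger', 42, 'm'); // prints "anger_042_m.wav"
?>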
As the output format for database samples, we used a waveform audio file with 16-bit PCM (Pulse Code Modulation) coding, a mono channel and a sampling frequency of 16 kHz. These parameters are more than sufficient, given that the source broadcast recordings were stored as MPEG-2 Audio Layer III (MP3) audio files with a bitrate of 128 kbps. Some of the source recordings forming the database samples were captured using the VideoLAN Client media player. For editing and cutting the source recordings, the Audacity software was used; by default, Audacity is unable to edit MP3 audio files, so the LAME encoder library had to be installed.

IV. EVALUATION TOOL

The next step in building the emotional database was to create a tool for the evaluation of the database samples, whose output is the final emotion database. Out of all possible methods for evaluating emotion samples, we chose a subjective method. Subjective methods use people (subjects) to evaluate a small number of samples.

The evaluation tool is a webpage connected to a MySQL database that stores the results of the subjective evaluation. The webpage consists of four pages. The first one is an invitation page, which welcomes the evaluating subject and gives short instructions. The second page is the evaluation page, the core of the whole tool; it consists of an HTML5 audio player for playing the database sound samples and a rollout menu for selecting the state of emotion. The subject plays the recording selected by the algorithm and then selects an emotion in the rollout menu according to his decision. The result is sent to and saved in the MySQL database. The last two pages are a final page, announcing the end of the evaluation process to the subject, and an error page, announcing that something went wrong during the evaluation.

The tool also includes the MySQL database itself. As mentioned above, the database is used to save the results of the subjective evaluation. It consists of two connected tables, which are shown in Fig. 2.
Fig. 2: MySQL database tables.
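As a rough illustration of Fig. 2, the two tables might be created as sketched below; the column names are inferred from the description in the following paragraphs, and the fourth column of table one (an auto-increment key) is purely our assumption.

<?php
// A guess at the two MySQL tables of Fig. 2, using the same legacy
// mysql_* API that the paper mentions. Column names are inferred from
// the textual description and may differ from the original schema.
$link = mysql_connect('localhost', 'login', 'password');
mysql_select_db('emotion_db', $link);

// Table one: per-sample bookkeeping.
mysql_query("CREATE TABLE IF NOT EXISTS samples (
    id      INT AUTO_INCREMENT PRIMARY KEY,  -- assumed fourth column
    number  INT NOT NULL DEFAULT 0,          -- times loaded into the player
    emotion CHAR(1) NOT NULL,                -- first letter of the emotion
    ref_id  VARCHAR(32) NOT NULL UNIQUE      -- unique sample name
)", $link);

// Table two: evaluation results for each sample.
mysql_query("CREATE TABLE IF NOT EXISTS evaluations (
    ref_id    VARCHAR(32) NOT NULL,   -- same key as in table one
    origin    INT NOT NULL DEFAULT 0, -- votes matching the authors' label
    neutral   INT NOT NULL DEFAULT 0, -- votes per emotion state
    anger     INT NOT NULL DEFAULT 0,
    happiness INT NOT NULL DEFAULT 0,
    sadness   INT NOT NULL DEFAULT 0,
    total     INT NOT NULL DEFAULT 0  -- total number of evaluations
)", $link);
?>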
The first table holds information about the individual emotion samples and contains four kinds of information. The number column records how many times the sample has been loaded into the audio player.
The emotion column holds the first letter of the English name of the sample's emotion, and the ref_id column holds the unique name of the sample. The second table holds information about the evaluation of the samples and contains seven kinds of information. The meaning of the first one, ref_id, is the same as in table one. The origin column counts how many times the subjects selected, for a given sample, the same emotion that was originally assigned by the authors when the database samples were created. The third to sixth columns count the emotional states selected by the subjects during the evaluation process, and the last column holds the total number of evaluations of the sample.

The typical scripts mysql_connect($host, $login, $pass) and mysql_select_db($db_name, $connection_active) were used to connect the webpage to MySQL. The webpage also contains scripts that perform queries using mysql_query() with SELECT or UPDATE statements. It was also necessary to create a custom function named database_load() for loading and saving data from or to the database. The webpage further uses POST and GET forms to obtain data from previous page loads. Two custom functions are important here: generate_sample() and audio(). The first function generates the name of the sample that is then loaded into the HTML5 audio player by the second function, audio(). Using an HTML5 audio player is much easier, since Flash or any other plugin does not have to be installed; everything is part of the web browser and uses far less computing power. Custom functions were also created for inserting the obtained data into the database tables, and mysql_close() was used to close the connection to the database.
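Putting these pieces together, the core of the evaluation page might look roughly as follows. Only the function names generate_sample(), audio(), database_load(), mysql_connect(), mysql_query() and mysql_close() come from the paper; the bodies, the sample-selection strategy and the table columns are our reconstruction.

<?php
// Sketch of the evaluation page flow described above; a reconstruction,
// not the authors' original code.
$connection_active = mysql_connect('localhost', 'login', 'password');
mysql_select_db('emotion_db', $connection_active);

// Reconstruction of generate_sample(): here we simply pick the sample
// that has been loaded into the player the fewest times.
function generate_sample()
{
    $res = mysql_query("SELECT ref_id FROM samples ORDER BY number ASC LIMIT 1");
    $row = mysql_fetch_assoc($res);
    return $row['ref_id'];
}

// Reconstruction of audio(): emit an HTML5 player for the given sample,
// so no Flash or other browser plugin is needed.
function audio($ref_id)
{
    echo '<audio controls src="samples/' . htmlspecialchars($ref_id) . '.wav"></audio>';
}

// Store the emotion chosen in the rollout menu (sent via a POST form).
$allowed = array('neutral', 'anger', 'happiness', 'sadness');
if (isset($_POST['emotion'], $_POST['ref_id'])
        && in_array($_POST['emotion'], $allowed, true)) {
    $emotion = $_POST['emotion']; // safe as a column name: whitelisted above
    $ref_id  = mysql_real_escape_string($_POST['ref_id']);
    mysql_query("UPDATE evaluations
                 SET $emotion = $emotion + 1, total = total + 1
                 WHERE ref_id = '$ref_id'", $connection_active);
}

// Offer the next sample to the subject.
audio(generate_sample());

mysql_close($connection_active);
?>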
V. RESULTS EVALUATION

The evaluation of the database samples was performed by subjects, represented by students aged 18 to 26. The rate at which each kind of emotion was selected for a random sample is shown in Table 1.

TABLE 1: PROBABILITY WITH WHICH EACH EMOTION WAS SELECTED.

  Emotion     Probability for one sample (%)
  Neutral     33
  Anger       25
  Happiness   19
  Sadness     23

The sample evaluation was followed by an evaluation of its results. For each sample in the database, we determined the veracity of its evaluation as a percentage. Table 2 shows the first five samples of the database with their percentage of veracity, state of emotion and level of veracity. After the veracity values were determined, one of three levels of veracity was assigned to each sample: low, medium or high. The low level covers the range from 0 to 75 percent, the medium level from 75 to 90 percent and the high level from 90 to 100 percent; a sketch of this assignment follows.
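A minimal sketch of this level assignment, assuming the 75 and 90 percent boundaries belong to the higher level (the paper gives only the ranges, not which side the boundaries fall on):

<?php
// Map a veracity percentage to one of the three levels defined above.
// Treating exactly 75.0 and 90.0 as the higher level is our assumption.
function veracity_level($percent)
{
    if ($percent >= 90.0) {
        return 'high';
    } elseif ($percent >= 75.0) {
        return 'medium';
    }
    return 'low';
}

echo veracity_level(72.23); // "low",    cf. the first sample in Table 2
echo veracity_level(80.00); // "medium", cf. the fourth sample
echo veracity_level(97.02); // "high",   average of the first final database
?>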
TABLE 2: VERACITY OF THE FIRST FIVE SAMPLES WITH THE DETERMINED KIND OF EMOTION.

  Evaluation veracity (%)   Emotion kind   Level of veracity
  72.23                     neutral        low
  36.36                     anger          low
  45.45                     neutral        low
  80.00                     happiness      medium
  80.00                     neutral        medium

Three final emotion databases were created on the basis of these three levels. The names of the samples were formed according to the conditions shown in Fig. 1, with the emotion state initially selected subjectively by the authors; after the evaluation process, the emotion state of some samples changed depending on the evaluation output. The first database, with high-veracity samples and an average veracity of 97.02 percent, is suitable for learning of the neural emotion classifier being developed at our university [7]-[9]. The second database, with medium-veracity samples and an average veracity of 81.61 percent, is suitable for training the neural classifier and verifying its learning skills. Finally, the last database, with a low level of veracity and an average veracity of 57.66 percent, is formed by samples that contain mixed emotions; for this database it was difficult to determine the emotion state of the samples, and they require further evaluation.
Fig. 3: Number of samples per emotion state and per veracity level.

Figure 3 shows the differences in the counts of the individual veracity levels within each emotion state. In the case of the neutral emotion, the increase is almost exponential; in the case of happiness and sadness, it is almost linear; and in the case of anger, the numbers are approximately the same at all levels.

VI. CONCLUSION

The aim of this work was to create a database of emotion samples in the form of sound recordings that contain real, not simulated, emotions. After extensive editing and cutting of the source recordings, a sample database
of about 200 samples was prepared for evaluation. During the process of sample creation, there were problems with background noise and with noise caused by the quality of phone call recordings. The next step was to create a tool for the subjective evaluation of this sample database, with the purpose of creating the emotion database. The tool consists of a webpage connected to a MySQL database, which was used for saving the results of the evaluation process and for saving and loading auxiliary variables.

As part of the evaluation process, we analyzed the results. Based on this analysis, three final emotion databases were created. The first two databases, with high and medium veracity levels, are usable for learning and training of the neural classifier that has been developed in our department since 2011 [7]-[9]. The last database, with a low veracity level, should be evaluated again, as it contains samples with mixed emotions or samples that are not understandable.

The overall idea was to create the basis for an open database of real emotions, so that any subject could access the database and have the option to insert samples that would then be evaluated. As a possible future development, the tool has to be modified to allow inserting samples with strictly defined parameters, and it should be made publicly accessible. For further development, automatic evaluation of samples is an option to replace subjective evaluation; it would take the form of a neural classifier [7]-[9], and a function for automatic sending of the results of sample evaluation should also be implemented.

REFERENCES
[1] F. De Rango, P. Fazio, F. Scarcello, F. Conte, "A New Distributed Application and Network Layer Protocol for VoIP in Mobile Ad Hoc Networks", IEEE Transactions on Mobile Computing, vol. 13, no. 10, October 2014, pp. 2185-2198.
[2] F. De Rango, M. Tropea, P. Fazio, S. Marano, "Overview on VoIP: Subjective and Objective Measurement Methods", International Journal of Computer Science and Network Security, vol. 6, no. 1, January 2006, pp. 140-153.
[3] D. Ververidis, C. Kotropoulos, "A Review of Emotional Speech Databases", Proc. Panhellenic Conference on Informatics (PCI), 2003, pp. 560-574.
[4] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, "A Database of German Emotional Speech", Interspeech 2005 – Eurospeech, 9th European Conference on Speech Communication and Technology, September 2005, pp. 1517-1520.
[5] J. Ang, R. Dhillon, E. Shriberg, A. Stolcke, "Prosody-Based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog", Interspeech 2002 – ICSLP, 7th International Conference on Spoken Language Processing, 2002, pp. 2037-2040.
[6] C. M. Lee, S. S. Narayanan, "Toward Detecting Emotions in Spoken Dialogs", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, 2005, pp. 293-303.
[7] M. Voznak, P. Partila, M. Mehic, S. Jakovlev, "Recognizing Emotions from Human Speech Using 2-D Neural Classifier and Influence the Selection of Input Parameters on its Accuracy", 21st Telecommunications Forum TELFOR, art. no. 6716272, November 2013, pp. 482-485.
[8] P. Partila, M. Voznak, "Speech Emotions Recognition Using 2-D Neural Classifier", Nostradamus 2013: Prediction, Modeling and Analysis of Complex Systems, Springer International Publishing, 2013, pp. 221-231.
[9] P. Partila, M. Voznak, M. Mikulec, J. Zdralek, "Fundamental Frequency Extraction Method Using Central Clipping and its Importance for the Classification of Emotional State", Advances in Electrical and Electronic Engineering, vol. 10, no. 4, 2012, pp. 270-275.
[10] S. T. Jovicic, Z. Kasic, M. Dordevic, M. Rajkovic, "Serbian Emotional Speech Database: Design, Processing and Evaluation", 9th Conference Speech and Computer (SPECOM 2004), St. Petersburg, Russia, 2004, pp. 77-81.