Automatic Stress Detection from Speech by Using Discrete Wavelet Transforms
Firoz Shah A., Raji Sukumar A., Babu Anto P.
School of Information Science and Technology, Kannur University
Abstract: This paper deals with the automatic recognition of stress from spoken words in the Malayalam language. Automatic stress recognition from speech is an active area of speech and emotion research, with applications mostly in affective computing. Stress detection from speech means having a machine estimate the speaker's stress level from the speech signal with the help of machine learning algorithms. We created and evaluated an elicited-mode database of four hundred isolated spoken words. The Discrete Wavelet Transform (DWT) is used for feature extraction, and an Artificial Neural Network (ANN) is used for training and testing. We obtained an overall recognition accuracy of 85% in this experiment.
Key Words: Automatic Stress Recognition, Discrete Wavelet Transform, Artificial Neural Network
1. INTRODUCTION
Speech reflects the mental state of an individual and carries emotional cues and stress in the voice. Stress is one of the most common emotional states of humans. Voice Stress Analysis (VSA) is the study of people's mental states from their voice when they are under stress, and of how the brain functions in stressed states [1]. Stress in speech can be classified as lexical stress or rhythmic stress. Lexical stress relates to the syllables in isolated words, whereas rhythmic stress indicates the relative prominence of syllables in long sentences. Stress can be defined as a departure from the neutral mode of speech caused by frustration, workload, or emotions such as pain, fear and tension [2]. Automatic recognition of stress from voice means enabling a machine to recognize the stressed state of speakers from their voice. Human behavior under stress differs markedly from behavior in the normal state. Speech normally expresses different combinations of human thinking and decision making [3, 4]. The stress in a voice depends strongly on the individual. One of the most interesting research areas in the speech domain is the analysis of the psychological state of speakers. Stress can be analyzed by using simple feature extraction techniques to characterize the parameters related to stress. Voice stress recognition finds applications in cockpit electronics, polygraph testing, robotics, interactive voice response systems, call centre applications and Human Computer Interfaces (HCIs). The most investigated parameters for voice stress analysis are the fundamental frequency F0, the energy of the speech signal, loudness and jitter [5]. A sudden increase in F0 and rapid fluctuations in the F0 contour are the features most often used for stress analysis.
Instead of using the phonetic and prosodic features of speech, we introduce the Discrete Wavelet Transform (DWT) technique for the parametric representation of speech signals under stress.
2. FEATURE EXTRACTION
Feature extraction is the process of extracting relevant features that can be used for classifier training and testing. The computational complexity of the classifier depends strongly on the size of the feature vector. By introducing the Discrete Wavelet Transform (DWT) for feature extraction in this work, we can significantly reduce the size of the feature vector.
2.1 Discrete Wavelet Transform (DWT)
The Discrete Wavelet Transform (DWT) decomposes a signal into high frequency and low frequency components by using digital filtering techniques. In this work we take into account only the low frequency components of the signal, because the low frequency components characterize a signal more than its high frequency components [6, 7]. The DWT can be represented by the following equation:
$$ W(j, k) = \sum_{j} \sum_{k} X(k)\, 2^{-j/2}\, \Psi\!\left(2^{-j} n - k\right) \qquad (1) $$
where Ψ(t) is the basic analyzing function, called the mother wavelet. The digital filtering technique can be expressed by the following equations:
$$ Y_{high}[k] = \sum_{n} X[n]\, g[2k - n] \qquad (2) $$
$$ Y_{low}[k] = \sum_{n} X[n]\, h[2k - n] \qquad (3) $$
where Y_high and Y_low are the outputs of the high pass filter g and the low pass filter h, respectively, obtained after downsampling by 2.
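The filtering step in equations (2) and (3) can be sketched in a few lines of Python. This is only an illustration: it reads both equations in the usual convolve-and-downsample form Y[k] = Σ_n X[n] f[2k − n], and it uses the two-tap Haar filter pair for brevity, which is an assumption of the sketch and not the Daubechies-4 pair used in the experiment. The names `filter_and_downsample` and `dwt_level` are hypothetical.

```python
import math

# Haar filter pair, chosen only for brevity (assumption; the paper uses Daubechies-4).
S = 1 / math.sqrt(2)
H_LOW = [S, S]     # low pass filter h
G_HIGH = [S, -S]   # high pass filter g


def filter_and_downsample(x, f):
    """Return Y[k] = sum_n x[n] * f[2k - n] for every k that has a nonzero term."""
    n_out = (len(x) + len(f)) // 2
    y = []
    for k in range(n_out):
        acc = 0.0
        for n, sample in enumerate(x):
            tap = 2 * k - n          # index into the filter
            if 0 <= tap < len(f):
                acc += sample * f[tap]
        y.append(acc)
    return y


def dwt_level(x, h=H_LOW, g=G_HIGH):
    """One decomposition level: returns (Y_high, Y_low)."""
    return filter_and_downsample(x, g), filter_and_downsample(x, h)


signal = [4.0, 6.0, 10.0, 12.0]
y_high, y_low = dwt_level(signal)
```

Applying `dwt_level` again to `y_low` gives the next decomposition level, which is how a multi-level low frequency representation is built up.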
3. SPEECH CORPUS
Speech corpora for stress-based studies fall into two classes: natural stressed speech databases and elicited stressed speech databases. In a natural database the stressed speech is recorded in real situations, whereas in an elicited database the stress is induced in the speech. We created an elicited database consisting of four hundred isolated spoken words for this experiment. The words and their IPA format are given in Table 1.
Words in English    IPA format
amme                //æ/m/m/æ//
acha                //æ///tʃʰ/ɑː//
mole                //m/ɒ/l/ɛ//
mone                //m/ɒ/n/ɛ//
eda                 //ɛ/d/ɑː//
lethe               //l/ɛ//θ/ɛ//
devi                //d/ɛ/v/ɪ//
njano               //n/dʒ/ɑː/n/ɒ//
kutty               //k/ʊ/t/t/i//
maye                //m/ɑː/j/ɛ//
ayyo                //æ/aɪ/ɒ//
chetta              //tʃʰ/t(ʰ)/ɑː//
venda               //v/iː/n/d/ɑː//
kandu               //k(ʰ)/ɔː/n/d/juː//
poyi                //p/ɒ/i//
poda                //p/o/d/a://
pode                //p/o/d/I//
ede                 //ɛ/d/i://
vave                //v/a:/v/æ//
neeyo               //n/ɛ/ɔɪ/ɒ//
Table 1. Words in database and their IPA format
4. CLASSIFICATION AND RECOGNITION
An artificial neural network learns a defined task from examples rather than from explicitly programmed rules. A neural network is a complex pattern classifier composed of interconnected processing units called nodes, which perform mathematical operations in a way loosely modeled on the human brain [8]. It solves problems by self-learning and self-organization, and is characterized by its topology, its activation functions and the weight vectors used in its hidden and output layers. Neural networks can perform computations effectively because of their massively parallel structure, fault tolerance, ability to generalize and inherently adaptive learning mechanism. The architecture of the MLP is given in Fig. 1.
Fig. 1 The architecture of MLP neural network
The Multi Layer Perceptron (MLP) is a widely used pattern classifier with a layered architecture. Its processing nodes are arranged in layers, and the nodes in each layer are connected to every node in the subsequent layer through feed-forward connections. Each connection has a weight associated with it, and each signal traveling along a link is multiplied by that weight. The input layer, being the first layer, has input units that distribute the inputs to the units in the subsequent layer. In the following, or hidden, layer each unit sums its inputs, adds a threshold and nonlinearly transforms the sum to produce the unit output. The output layer units often have linear activations [9].
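The forward computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the network used in the experiment: the hidden nonlinearity is taken to be tanh, the layer sizes (3 hidden units, 2 outputs) and all weight values are hypothetical, and only the input size of 7 follows the feature-vector size reported in the paper. The function name `mlp_forward` is also hypothetical.

```python
import math


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP.

    Hidden units: weighted sum of inputs plus threshold, passed through tanh.
    Output units: weighted sum of hidden outputs plus threshold (linear).
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]


# Hypothetical sizes and weights: 7 inputs, 3 hidden units, 2 output units.
features = [1.0] * 7
w_hidden = [[0.1] * 7 for _ in range(3)]
b_hidden = [0.0, 0.0, 0.0]
w_out = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
b_out = [0.5, -0.5]
scores = mlp_forward(features, w_hidden, b_hidden, w_out, b_out)
predicted_class = scores.index(max(scores))
```

Training adjusts the weights and thresholds so that the largest output score corresponds to the correct class for the training feature vectors.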
5. EXPERIMENT AND RESULT
We created a speaker- and gender-independent database consisting of 400 isolated spoken words. The database was recorded from 10 male and 10 female speakers in the age group of 20 to 30. We selected words that show maximum stress effectiveness when uttered. The speech corpus was collected with a high quality studio recording microphone at a sampling frequency of 8 kHz (band-limited to 4 kHz). The recorded speech samples were processed, labeled and stored in the database. The Discrete Wavelet Transform (DWT) was used for feature extraction. Using the Daubechies-4 wavelet, we obtained a feature vector of size seven at the thirteenth level of decomposition; the coefficients from this last level form the feature vector. The feature vectors were used to train the neural network. The database was divided into two parts: 80% was used for training and the remaining 20% for testing. After training the neural network with the training feature vectors, we tested it with the testing feature vectors. The experiment gave a recognition accuracy of 85%.
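A short calculation shows how a feature vector of size seven can arise at the thirteenth level. The sketch below rests on several assumptions that the paper does not state: that "Daubechies4" refers to the 8-tap filter (named db4 in common wavelet toolboxes), that the implementation pads the signal at the boundaries so one level maps n samples to floor((n + L − 1) / 2) coefficients for filter length L, and that a word lasts about one second (8000 samples at 8 kHz). The function names are hypothetical.

```python
def dwt_coeff_len(n_samples, filter_len):
    """Coefficient count after one level, assuming boundary padding (no periodization)."""
    return (n_samples + filter_len - 1) // 2


def lengths_per_level(n_samples, filter_len, levels):
    """Length of the low frequency (approximation) coefficients after each level."""
    lengths = []
    for _ in range(levels):
        n_samples = dwt_coeff_len(n_samples, filter_len)
        lengths.append(n_samples)
    return lengths


DB4_FILTER_LEN = 8   # assumption: 8-tap Daubechies filter
N_SAMPLES = 8000     # assumption: a one-second word sampled at 8 kHz
lengths = lengths_per_level(N_SAMPLES, DB4_FILTER_LEN, 13)
print(lengths)
# [4003, 2005, 1006, 506, 256, 131, 69, 38, 22, 14, 10, 8, 7]
```

Under these assumptions the length reaches 7 at level 13. Because (7 + 7) // 2 = 7, further levels would not shrink the vector, so 7 is the smallest size this rule can produce for an 8-tap filter.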
6. CONCLUSION
We conducted an experiment on the automatic recognition of stress from speech and achieved an overall recognition accuracy of 85%. The results suggest that the DWT is a good feature extraction method for detecting stress from speech. An artificial neural network was used for classification. The efficiency of the approach can be evaluated further by using different machine learning techniques.
REFERENCES
[1] Aull, A. M. & Zue, V. W. (1985) “Lexical stress determination and its application to speech recognition”, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1549–1552.
[2] Cairns, D., Hansen, J. H. L. (1994) “Nonlinear Analysis and Detection of Speech under Stressed Conditions”, The Journal of the Acoustical Society of America, Vol. 96, No. 6, pp. 3392–3400.
[3] Streeter, L. A., MacDonald, N. H., Apple, W., Krauss, R. M., Galotti, K. M. (1988) “Acoustic and Perceptual Indicators of Emotional Stress”, Journal of the Acoustical Society of America, 73(3), pp. 917–928.
[4] Banziger, T. and Scherer, K. R. (2005) “The Role of Intonation in Emotional Expressions”, Speech Communication, Vol. 46, pp. 252–267.
[5] Kang, B.-S., Han, C.-H., Lee, S.-T., Youn, D.-H., and Lee, C. (2002) “Speaker dependent emotion recognition using speech signals”, in Proc. ICSLP 2000, Beijing, China.
[6] Mallat, S. G. (1989) “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674–693.
[7] Daubechies, I. (1988) “Orthonormal Bases of Compactly Supported Wavelets”, Communications on Pure and Applied Mathematics, Vol. 41, pp. 909–996.
[8] Haykin, Simon, Neural Networks: A Comprehensive Foundation, 2nd edition, ISBN 8L-2032373-4.
[9] Bishop, Christopher (1995) Neural Networks for Pattern Recognition, Oxford.