Sensors & Transducers, Vol. 175, Issue 7, July 2014, pp. 83-87

Sensors & Transducers © 2014 by IFSA Publishing, S. L. http://www.sensorsportal.com

An Extraction Method of Acoustic Features for Music Emotion Classification

1 Jiwei Qin, 2 Liang Xu, 3 Jinsheng Wang, 1 Fei Guo

1 Center of Network and Information Technology, Xinjiang University, Urumqi 830046, P. R. China
2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, P. R. China
3 Sichuan Rainbow Consulting & Software Co., Ltd, Mianyang 621000, P. R. China
1 Tel.: +86-991 858-2918, fax: +86-991 858-5005
E-mail: [email protected]

Received: 27 March 2014 / Accepted: 30 June 2014 / Published: 31 July 2014

Abstract: Taking the user's emotion in music retrieval and recommendation as the application background, this paper presents a method to extract the features associated with music emotion from existing physical features. Focusing on the degrees of stability, differentiation and relativity, the method simplifies the existing multidimensional features and extracts the physical music features associated with emotion. We verify the presented method on data from baidu.com. The experimental results show that the proposed method improves classification efficiency by about ten percent while maintaining precision. Copyright © 2014 IFSA Publishing, S. L.

Keywords: Music, Feature, Stability, Differentiation, Relativity.

1. Introduction

Music, as an emotion-loaded form, is regarded as a common entrainment resource in everyday life as well as a part of the psychological function of the individual. It carries emotional content that augments or reduces the listener's current emotion [1]. However, with the overwhelming amount of available music, it has become important to find, through emotion classification, the right music for a listener in a certain emotional state. So far, music emotion classification mainly depends on music content, such as pitch, sound, timbre, melody, lyrics and so on [2], and the features derived from this content can be divided into audio features and lyric-text features [3]. Because some music has no lyrics, acoustic features are usually used to determine the music emotion class.


Furthermore, each music track has a rich affective connotation, and its emotion changes with the musical rhythm. Researchers therefore usually select clips that depict the emotional connotation of the whole piece and extract features from them to determine the emotion class. For example, literature [4] analyzes three features under Thayer's arousal-valence (AV) model, namely the strength of the sound data, the tone and the rhythm, uses them to establish a hierarchical emotion detection system based on GMM, and finally classifies music clips. Literature [5] selects acoustic features of music and classifies emotions with an SVM. Literature [6] uses the nearest neighbor method, calculating the Euclidean distance between acoustic features to find similar documents. Literature [7-9] uses Naive Bayes, self-organizing maps, random forests, support vector machines, BP neural networks and the Gaussian mixture model for music emotion recognition and classification. On the lyrics side, researchers start from the lyric text to analyze music emotion. Undoubtedly, as mentioned above, these studies have effectively promoted the development of emotion categorization and have been applied in music affective recommendation systems. However, an in-depth analysis shows that the features extracted from audio have too many dimensions, and it is hard to identify which of them are related to the emotion categories. To solve this problem, this paper simplifies the multidimensional features of the existing audio signal and extracts music emotion features, improving the efficiency of classification.

2. Feature Selection

2.1. Indexes

The proposed feature extraction method relies on three indexes, the degrees of stability, differentiation and relativity, to select the features associated with music emotion. Stability evaluates how stable an acoustic feature is within a given emotion and is calculated as the coefficient of variation in expression (1):

S_{jk} = \frac{\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_{jki}-\bar{x}_{jk}\right)^{2}}}{\bar{x}_{jk}},  (1)

where S_jk refers to the stability, i.e. the coefficient of variation (C.V.), of feature k on emotion j, m is the total number of music pieces tagged with emotion j, x_jki is the value of feature k for music piece i tagged with emotion j, and \bar{x}_{jk} is the mean of these values.

Differentiation is the degree of dispersion of an acoustic feature across different emotions: the larger the dispersion, the higher the differentiation. If the mean value of acoustic feature i is expressed as \bar{x}_{i} and N is the number of emotion categories, differentiation is defined by expression (2):

D = \frac{1}{N}\sum_{j=1}^{N}\left(\bar{x}_{ij}-\bar{x}_{i}\right)^{2},  (2)

where \bar{x}_{ij} denotes the mean of feature i under emotion j.

Relativity is the degree of relevance between an acoustic feature and an emotion. Here, relativity is determined by partial correlation and defined by expression (3):

r_{1,2(3)} = \frac{r_{1,2}-r_{1,3}\,r_{2,3}}{\sqrt{1-r_{1,3}^{2}}\,\sqrt{1-r_{2,3}^{2}}},  (3)

where r_{1,2} is the relativity of feature 1 and feature 2.
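To make the three indexes concrete, the following sketch implements expressions (1)-(3) directly with NumPy. It is an illustration only; the function and argument names are ours, not the paper's.

```python
# Illustrative NumPy versions of expressions (1)-(3); names are ours, not the paper's.
import numpy as np

def stability(x_jk: np.ndarray) -> float:
    """Expression (1): coefficient of variation (C.V.) of feature k over the
    m tracks tagged with emotion j (standard deviation divided by mean)."""
    mean = x_jk.mean()
    std = np.sqrt(np.mean((x_jk - mean) ** 2))
    return std / mean

def differentiation(per_emotion_means: np.ndarray) -> float:
    """Expression (2): dispersion of the per-emotion means of one feature.
    The overall mean is taken here as the mean of the per-emotion means."""
    overall_mean = per_emotion_means.mean()
    return float(np.mean((per_emotion_means - overall_mean) ** 2))

def partial_correlation(x1: np.ndarray, x2: np.ndarray, x3: np.ndarray) -> float:
    """Expression (3): correlation of features 1 and 2 with feature 3 controlled.
    Assumes |r13| and |r23| are strictly below 1."""
    r12 = np.corrcoef(x1, x2)[0, 1]
    r13 = np.corrcoef(x1, x3)[0, 1]
    r23 = np.corrcoef(x2, x3)[0, 1]
    return (r12 - r13 * r23) / (np.sqrt(1 - r13 ** 2) * np.sqrt(1 - r23 ** 2))
```

In use, stability would be evaluated separately for each feature within the tracks of each emotion, and differentiation over the per-emotion means of that feature.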

2.2. The Filtering Process

The features related to emotion are chosen from a large number of acoustic features; the extraction of the music emotion features is shown in Fig. 1. Firstly, according to stability, the stable features are selected and the unstable ones are excluded. Secondly, the top n features are chosen from the stable features according to differentiation. Finally, the correlation (relativity) among the remaining features is examined.

Fig. 1. The extraction of the music emotion features.

3. Experiments

3.1. Data Set

Data Source. The data come from the top 150 popular music tracks of the well-known Chinese website "Ting.baidu.com". First, the 150 pieces of music were labeled with emotion categories by 10 participants, 5 female and 5 male; each piece's emotion category was determined by vote. In this preliminary experiment, music emotion is divided into positive (84 pieces of music) and negative (66 pieces of music).

Data Processing. Because the music files differ in format and length, the data were pre-processed as follows. Firstly, we convert the original formats (wav, mp3, rm, wma, etc.) into wav format with a sampling frequency of 44.1 kHz. Secondly, we extract the highlight of each piece, i.e. the fragment that expresses its emotion most strongly; the length of this fragment is usually 30 s to 50 s.
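As a rough illustration of this pre-processing step, the sketch below decodes a file, resamples it to 44.1 kHz mono and saves a fixed-length excerpt as wav. It assumes the librosa and soundfile packages; the excerpt offset is a hypothetical placeholder, since the paper does not describe how the highlight is located.

```python
# Pre-processing sketch (assumes the librosa and soundfile packages are installed).
# The 60 s offset is a hypothetical placeholder for locating the highlight.
import librosa
import soundfile as sf

TARGET_SR = 44100  # sampling frequency used in the paper

def extract_highlight(src_path: str, dst_path: str,
                      offset_s: float = 60.0, duration_s: float = 40.0) -> None:
    """Decode the source file, resample to 44.1 kHz mono and
    write a 30-50 s excerpt as wav."""
    y, sr = librosa.load(src_path, sr=TARGET_SR, mono=True,
                         offset=offset_s, duration=duration_s)
    sf.write(dst_path, y, sr)

# extract_highlight("song.mp3", "song_highlight.wav")
```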

3.2. Feature Analysis Tool

At present, jAudio is the most commonly used software for analyzing music features in the domain of feature extraction. The features extracted by jAudio are divided into time-domain features and frequency-domain features. Time-domain features are based on analysis of the time-domain waveform of the audio signal; they include the root mean square, the low-energy window ratio, the zero-crossing rate, etc. Frequency-domain features are based on the Fourier transform and the processing of frequency-domain parameters; they include the spectral centroid, the spectrum center distance, the mel-frequency cepstral coefficients (MFCC), etc. For example, we run jAudio to analyze the piece The Sad Story of Sleep Beauty; the results are shown in Table 1.
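The paper obtains its descriptors with jAudio, a Java tool. Purely for illustration, the sketch below computes a few comparable time- and frequency-domain descriptors with librosa and summarizes each by its overall average and standard deviation, mirroring the "Overall Average" / "Overall Standard Deviation" pairs listed in Table 1; the values will not match jAudio's output exactly.

```python
# Illustration only: comparable descriptors computed with librosa instead of jAudio.
import numpy as np
import librosa

def overall_stats(y: np.ndarray, sr: int) -> dict:
    """Overall average / standard deviation of a few frame-level descriptors."""
    frame_features = {
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(y)[0],
        "root_mean_square": librosa.feature.rms(y=y)[0],
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr)[0],
    }
    # MFCC has 13 coefficients, matching the dimension 13 listed in Table 1.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    for i, coeff in enumerate(mfcc):
        frame_features[f"mfcc_{i}"] = coeff
    stats = {}
    for name, frames in frame_features.items():
        stats[f"{name}_overall_average"] = float(np.mean(frames))
        stats[f"{name}_overall_std"] = float(np.std(frames))
    return stats

# y, sr = librosa.load("song_highlight.wav", sr=44100)
# print(overall_stats(y, sr))
```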

3.3. Feature Selection

The purpose of this experiment is to select features using the process proposed in Fig. 1, extracting the music emotion features on the basis of stability, differentiation and relativity from the features produced by jAudio. After selecting the features, we use them to classify music emotion and compare the accuracy and efficiency with those obtained before feature selection, in order to verify the validity of the screening method. We extract the physical features of the data set with jAudio; each piece of music has 78 features, covering both time-domain and frequency-domain features. The results are shown in Fig. 2.

Table 1. 78 dimensional fundamental acoustic features.

No.  Feature                                                     Dimension
1    Spectral Centroid Overall Standard Deviation                1
2    Spectral Centroid Overall Average                           1
3    Spectral Rolloff Point Overall Standard Deviation           1
4    Spectral Rolloff Point Overall Average                      1
5    Spectral Flux Overall Standard Deviation                    1
6    Spectral Flux Overall Average                               1
7    Compactness Overall Standard Deviation                      1
8    Compactness Overall Average                                 1
9    Spectral Variability Overall Standard Deviation             1
10   Spectral Variability Overall Average                        1
11   Root Mean Square Overall Standard Deviation                 1
12   Root Mean Square Overall Average                            1
13   Fraction Of Low Energy Windows Overall Standard Deviation   1
14   Fraction Of Low Energy Windows Overall Average              1
15   Zero Crossings Overall Standard Deviation                   1
16   Zero Crossings Overall Average                              1
17   Strongest Beat Overall Standard Deviation                   1
18   Strongest Beat Overall Average                              1
19   Beat Sum Overall Standard Deviation                         1
20   Beat Sum Overall Average                                    1
21   Strength Of Strongest Beat Overall Standard Deviation       1
22   Strength Of Strongest Beat Overall Average                  1
23   MFCC Overall Standard Deviation                             13
24   MFCC Overall Average                                        13
25   LPC Overall Standard Deviation                              10
26   LPC Overall Average                                         10
27   Method of Moments Overall Standard Deviation                5
28   Method of Moments Overall Average                           5

Fig. 2. Screen shot of musical acoustic features.


3.3.1. Stability

According to the emotion labels, the music tracks are divided into the positive and negative emotion categories. Within the same emotion, the stability of each physical feature is calculated as the coefficient of variation (C.V.) by expression (1); the resulting values are shown in Fig. 3. A feature is considered stable when its C.V. is less than 0.4, while features with larger C.V. values have poor stability. The features with large C.V. values are marked in red in Fig. 3 and removed from the data set.

Fig. 3. Screen shot of feature stability.

3.3.2. Differentiation

Given the negative and positive emotion labels, the differentiation of a feature is reflected by the absolute value of the ratio of the difference to the sum of the characteristic value center points of the two classes. As shown in Fig. 4, J13 is the differentiation of the corresponding feature between the negative and positive emotion labels. In this way we obtain the differentiation of all the features. Here we keep the features whose differentiation exceeds 0.05; ordered from large to small, they are ZCOSD (0.0744), SRPOSD (0.0669), LPCOSD4 (0.0517), MOMOSD1 (0.0513), LPCOSD7 (0.0504) and LPCOSD8 (0.0501).
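A compact sketch of the two screening steps above, assuming the 150 x 78 feature matrix and the binary emotion labels are available as a pandas DataFrame and Series (hypothetical names). The thresholds (C.V. < 0.4, differentiation > 0.05) are the ones used here; taking the "center point" to be the per-emotion mean is our assumption.

```python
# Screening sketch: `features` is a hypothetical (150 x 78) DataFrame of jAudio
# values, `labels` a Series of "positive"/"negative" tags with the same index.
import pandas as pd

def screen_features(features: pd.DataFrame, labels: pd.Series,
                    cv_max: float = 0.4, diff_min: float = 0.05) -> list:
    pos = features[labels == "positive"]
    neg = features[labels == "negative"]
    kept = []
    for col in features.columns:
        # Step 1 (stability): coefficient of variation within each emotion.
        cv_pos = pos[col].std(ddof=0) / abs(pos[col].mean())
        cv_neg = neg[col].std(ddof=0) / abs(neg[col].mean())
        if max(cv_pos, cv_neg) >= cv_max:
            continue
        # Step 2 (differentiation): |difference| / sum of the two center points,
        # with the center point taken here as the per-emotion mean (assumption).
        c_pos, c_neg = pos[col].mean(), neg[col].mean()
        diff = abs(c_pos - c_neg) / abs(c_pos + c_neg)
        if diff > diff_min:
            kept.append(col)
    return kept

# selected = screen_features(features, labels)
```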

Fig. 4. Screen shot of feature differentiation.

3.3.3. Relativity

We use IBM SPSS Statistics 19.0 to analyze the relativity of the six variables. As the results in Table 2 show, the six variables are mutually correlated, since the significance of each variable is below 0.05.

Table 2. Partial correlation analysis on six variables.

Correlation   SRPOSD   ZCOSD    LPCOSD4   LPCOSD7   LPCOSD8   MOMOSD1
SRPOSD        1.000    .978     .395      .430      .572      .844
ZCOSD         .978     1.000    .405      .439      .584      .862
LPCOSD4       .395     .405     1.000     .652      .558      .404
LPCOSD7       .430     .439     .652      1.000     .797      .415
LPCOSD8       .572     .584     .558      .797      1.000     .557
MOMOSD1       .844     .862     .404      .415      .557      1.000
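Table 2 itself was produced with SPSS. As a rough cross-check outside SPSS, the sketch below computes a partial correlation matrix for the six retained features from the inverse of their correlation matrix; note that this conditions each pair on all remaining variables, a stronger conditioning than the single control variable of expression (3).

```python
# Cross-check sketch: partial correlations of the six retained features via the
# inverse correlation (precision) matrix; the paper's Table 2 was computed in SPSS.
import numpy as np
import pandas as pd

def partial_corr_matrix(df: pd.DataFrame) -> pd.DataFrame:
    prec = np.linalg.inv(df.corr().to_numpy())   # precision matrix
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)               # standard partial-correlation identity
    np.fill_diagonal(pcorr, 1.0)
    return pd.DataFrame(pcorr, index=df.columns, columns=df.columns)

# six = features[["ZCOSD", "SRPOSD", "LPCOSD4", "LPCOSD7", "LPCOSD8", "MOMOSD1"]]
# print(partial_corr_matrix(six))
```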

3.4. Verification

The effectiveness of the feature selection method is verified in this section. We use the six selected features and compare the classification results before and after feature selection. We use the Euclidean distance to measure how far each track's values of ZCOSD, SRPOSD, LPCOSD4, LPCOSD7, LPCOSD8 and MOMOSD1 are from the central values of the negative and positive emotion classes. If the features of a piece of music are closer to the negative emotion center, the music is assigned to the negative emotion; otherwise, it is assigned to the positive emotion. The results are shown in Table 3. They indicate that the accuracy of musical emotion classification is 58 percent for the positive class and over 60 percent for the negative class. The result of the proposed method shows no obvious difference from the method without feature selection, which proves that the selection method is effective: the selected features can represent the full original feature set. Moreover, with the proposed feature selection, the time needed for music emotion classification decreases by about 10 percent.

Table 3. The correct rate of extracted features in emotion classification.

Class      Correct number   Error number   Accuracy
Positive   49               35             0.58
Negative   40               26             0.60
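The verification procedure above is essentially a nearest-centroid classifier on the six selected features. A minimal sketch, assuming the same hypothetical features/labels objects as in the earlier sketches:

```python
# Nearest-centroid sketch of the verification in Section 3.4, using the same
# hypothetical `features` / `labels` objects as above.
import numpy as np
import pandas as pd

SELECTED = ["ZCOSD", "SRPOSD", "LPCOSD4", "LPCOSD7", "LPCOSD8", "MOMOSD1"]

def nearest_centroid_predict(features: pd.DataFrame, labels: pd.Series) -> pd.Series:
    x = features[SELECTED]
    centers = x.groupby(labels).mean()                          # one center per emotion
    dists = {emotion: np.linalg.norm(x.to_numpy() - center.to_numpy(), axis=1)
             for emotion, center in centers.iterrows()}
    return pd.DataFrame(dists, index=x.index).idxmin(axis=1)    # closest center wins

# predicted = nearest_centroid_predict(features, labels)
# per_class_accuracy = (predicted == labels).groupby(labels).mean()
```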

4. Conclusion

According to the stability, differentiation and relativity of the features, this paper extracts music emotion features from the original multidimensional acoustic features. Without changing the accuracy of the original music classification, the dimensionality of the features is sharply reduced and the efficiency of music emotion classification is improved. After feature selection, the selected features can be used on their own in place of the earlier manual annotation method. The method proposed in this paper can also be used to analyze and select features in other areas.

Acknowledgment

The research was supported in part by the National Science Foundation of China under Grant Nos. 41261087, 61261036, 61308120.

References

[1]. Jiwei Qin, Qinghua Zheng, et al., An Emotion-oriented Music Recommendation Algorithm Fusing Rating and Trust, International Journal of Computational Intelligence Systems, Vol. 7, Issue 2, 2014, pp. 371-381.
[2]. Nicola Orio, Music Retrieval: A Tutorial and Review, Foundations and Trends in Information Retrieval, Vol. 1, Issue 1, 2006, pp. 1-90.
[3]. Michael A. Casey, Remco Veltkamp, Masataka Goto, et al., Content-Based Music Information Retrieval: Current Directions and Future Challenges, Proceedings of the IEEE, Vol. 96, Issue 4, 2008, pp. 668-696.
[4]. Dan Liu, Lie Lu, Automatic mood detection from acoustic music data, in Proceedings of the International Symposium on Music Information Retrieval, Baltimore, USA, 2003, pp. 81-87.
[5]. M. Ogihara, Content-based music similarity search and emotion detection, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Montreal, Quebec, Canada, 2004, pp. 17-21.
[6]. Tao Li, Mitsunori Ogihara, Content-based music similarity search and emotion detection, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 705-708.
[7]. Francois Pachet, Aymeric Zils, Evolving automatically high-level music descriptors from acoustic signals, in Proceedings of Computer Music Modeling and Retrieval, Springer Verlag, 2003, pp. 42-53.
[8]. Muyuan Wang, Naiyao Zhang, Hancheng Zhu, User-adaptive music emotion recognition, in Proceedings of the International Conference on Signal Processing, Istanbul, Turkey, 2004.
[9]. Alicja Wieczorkowska, Piotr Synak, Rory A. Lewis, et al., Extracting emotions from music data, in Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems, Saratoga Springs, USA, 2005, pp. 456-465.
