Combining auditory perception and visual features for regional recognition of Chinese folk songs

Xinyu Yang1, Jing Luo1, Yinrui Wang1, Xi Zhao1 and Juan Li2

1 Department of Computer Science, Xi'an Jiaotong University, No. 28 Xianning West Road, Xi'an, Shaanxi, China
2 Center of Music Education, Xi'an Jiaotong University, No. 28 Xianning West Road, Xi'an, Shaanxi, China

[email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT
The regional recognition of Chinese folk songs is not only conducive to discovering the musical characteristics and regional styles of folk songs from specific geographical areas, but also has important research value for existing music information retrieval systems. In this paper, an effective and novel approach for the regional recognition of Chinese folk songs is proposed, based on the fusion of auditory perception and visual features using an ensemble SVM classifier. When the auditory perception features are extracted, the temporal relations among the frame features are fully considered. For the visual features, color time-frequency maps are used in place of gray-scale images to capture more texture information, and, to better characterize the image texture, both the texture patterns and the corresponding intensity information are extracted. Experimental results show that the recognition method combining auditory perception and visual features can effectively identify Chinese folk songs of different regions with an accuracy of 89.29%, which outperforms other state-of-the-art approaches.

Figure 1. Time-frequency map of Shaanxi folk songs.

CCS Concepts
• Information systems → Information retrieval → Retrieval tasks and goals → Clustering and classification
• Computing methodologies → Machine learning → Learning paradigms → Supervised learning by classification

Figure 2. Time-frequency map of Jiangnan folk songs.

Keywords
Auditory perception features; Visual features; Regional recognition; Temporal structure; Features fusion.

1. INTRODUCTION
Recently, with the rapid development of digital audio music and the Internet, Chinese folk songs with their unique national styles have become known to and studied by more people. Due to the distinctive styles of Chinese folk songs, retrieval based on regional style has become an important retrieval method. However, there are still few studies on the regional recognition of Chinese folk songs [1-5], because Chinese folk songs are generally improvised and orally transmitted, so their creation rules are not as clear as those of genre songs. In regions where the lifestyle, living environment and language features are relatively close, the geographical boundaries of folk songs are blurred. This also increases the difficulty of regional recognition.

Similar to music genre classification, the extraction of audio features is a difficult problem in the regional recognition of Chinese folk songs. In existing Music Information Retrieval (MIR) systems, audio features are generally divided into acoustic features [6-7], auditory perception features [8-10] and visual features [11-14]. Among them, acoustic features were the most commonly used in the past, benefiting from their great success in the field of speech recognition. However, since acoustic features are highly sensitive to small acoustic changes in the audio signal, misjudgment can happen even when only very small changes occur within the same class of songs. To deal with this problem, auditory perception features were proposed. These features fully consider the auditory characteristics of the human ear and integrate a large amount of musical perceptual information, so they are more closely related to the perception and processing performed by the human ear and the nervous system. Thus, auditory perception features have become an important choice in music classification.

ICCAE 2018, February 24–26, 2018, Brisbane, Australia © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6410-2/18/02.

DOI: https://doi.org/10.1145/3192975.3193006


A new research direction has also emerged in which researchers convert the music audio signal into a spectrogram and then extract visual features. Visual features can describe the density and direction of the spectral texture and thus reflect the rhythm of the music, so they have great potential in the field of music classification. However, auditory perception and visual features (especially visual features) are rarely used in the study of regional recognition of Chinese folk songs. Existing research is more concerned with the design of recognition classifiers and normally uses acoustic features. For example, Liu et al. first divide Chinese folk songs into 10 regions, then extract tone, rhythm and loudness features, and finally classify 500 folk songs with the SVM (Support Vector Machine) algorithm [1]. Xu et al. use an RBF neural network to classify 517 songs from 10 regions, and select the best feature combination according to each feature's contribution to the classification [2]. Khoo et al. first select folk songs of five regions with distinct regional characteristics, then extract time-domain and frequency-domain features, and finally classify 312 folk songs using the regularized extreme learning machine (R-ELM) [5].

Figure 3. The whole feature extraction and classification scheme.

However, these methods do not consider the musical characteristics of folk songs when extracting features. Chinese folk songs differ from genre music: the audio features of genre music are repetitive, so features extracted with a fixed frame or block length can represent the songs effectively, whereas folk songs are mostly orally created and follow no fixed rules. Therefore, it is necessary to concatenate the frame features as a whole to maximize the recognition accuracy. In addition, the temporal relationship between frame features should be considered so as to reflect the dynamic changes in the folk songs. On the other hand, Figures 1 and 2 show an obvious difference in spectral texture: the spectral texture of the Shaanxi folk song is relatively flat with little change, while that of the Jiangnan folk song is more tortuous and flexible. Therefore, visual features should be useful for improving the classification performance.

To our knowledge, this is the first time that auditory perception and visual features have been used together to recognize Chinese folk songs of different regions. The audio features capture the perceptual information and the visual features capture the texture information. The whole feature extraction and classification scheme is depicted in Figure 3.

For the auditory perception features, we employ the Mel-Frequency Cepstral Coefficients (MFCC), the Spectral Centroid (SC) and Chroma. Among them, MFCC and SC are used to extract timbre features, and Chroma is used to extract melodic features. Moreover, the temporal relations between the features of each frame are considered: first, we use the improved Continuous Hidden Markov Model (CHMM) [15-16] to model the features of each class in the training set, and then calculate the output probabilities of each folk song with respect to all CHMM models and combine these probabilities into a new feature vector. For the visual features, we use the color spectrogram instead of the gray-scale image used in existing research to transform the audio file of each folk song; because the color space contains more texture information, the recognition accuracy is improved. The Uniform Local Binary Pattern (uniform LBP) is used to extract the texture patterns, and Contrast is adopted to extract the intensities of these patterns. Finally, the two sets of features are classified by fusing the scores of heterogeneous SVM classifiers. Experiments show that the method we propose effectively identifies folk songs of different regions, with an accuracy of 89.29% that is superior to existing methods.

The article is organized as follows. Section 2 introduces the selection and extraction of the auditory perception features in detail. Section 3 introduces the selection and extraction of the visual features, in which the selection, extraction and dimensionality reduction of the texture features of the color spectrogram are described in detail. Section 4 presents the experimental results and evaluations of the regional recognition. Finally, we summarize our work and discuss future work.

2. AUDITORY PERCEPTION FEATURES EXTRACTION
In this section, we introduce the extraction of the auditory perception features and their processing based on the CHMM model. Section 2.1 briefly describes the reasons for selecting the auditory perception features, namely MFCC, Chroma and the spectral centroid (SC). Section 2.2 introduces the processing of the auditory perception features based on the CHMM model.

2.1 Auditory Perception Features
In this paper, the three most commonly used features are adopted to extract perceptual information: MFCC, Spectral Centroid (SC) and Chroma. MFCC combines the nonlinear perception of frequency by the human ear with the mechanism of speech generation, reflecting the short-term amplitude spectrum of speech. SC is also one of the important features describing timbre properties: it is relatively low when the song contains more low-frequency content, and high on the contrary. Correspondingly, MFCC and SC values have a very wide range, so the two features can be used to reflect the differences in timbre between different regions' folk songs. The Chroma feature is designed according to the periodicity (the twelve-tone equal temperament system) of the human ear's perception of sound, and it is a good representation of core melodic elements in music.
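To make the feature set concrete, the following minimal sketch extracts the three frame-level features with librosa; the library choice is our assumption (the paper does not name its toolchain), and the frame parameters follow the setup described later in Section 4.1.

```python
# Hedged sketch: frame-level MFCC, Spectral Centroid and Chroma extraction.
# librosa is an assumed toolchain; frame parameters follow Section 4.1
# (22050 Hz mono, frame length 1024, frame shift 512).
import librosa
import numpy as np

def extract_frame_features(path, sr=22050, n_fft=1024, hop=512):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    sc = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    # One observation per frame: timbre (MFCC, SC) plus melody (Chroma).
    return np.vstack([mfcc, sc, chroma]).T  # shape: (frames, 20 + 1 + 12)
```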


2.2 Processing of Auditory Perception Features
To maintain the temporal relationship between the auditory perception features of each frame and keep the integrity of the auditory perception of each folk song, we use the following approach. First, the auditory perception features of each frame are taken as a time series, and the CHMM, which is widely used in time series analysis, is used to model each class of folk songs in the training set. Then, the output probabilities of each folk song with respect to all CHMM models are calculated, and these probabilities are concatenated as the new feature vector of the folk song.

Figure 4. Extraction and processing of auditory perception features.

The extraction and processing of auditory perception features are shown in Figure 4. The main steps are as follows.

Step 1: Randomly select N folk songs from each of the K regions to form the training set; all remaining folk songs form the test set. The total number of samples is M. Auditory perception features are extracted for each frame.

Step 2: The auditory perception features of each folk song in the training set are used as an observation vector, and the Baum-Welch algorithm is used to train a CHMM model for each region's folk songs (the specific training process of the CHMM model based on the improved Baum-Welch algorithm can be found in [21]).

Step 3: The probability of each observation vector with respect to each CHMM model is calculated using the Viterbi algorithm, and these probabilities are concatenated into a one-dimensional vector $[p_{j1}^m, p_{j2}^m, \ldots, p_{ji}^m, \ldots, p_{jK}^m]$, where $p_{ji}^m$ denotes the output probability of the j-th folk song of the m-th region with respect to the CHMM model of the i-th region's folk songs. In the same way, the perceptual feature vectors of all folk songs in the test set are transformed into new feature vectors $[p_{j1}^*, p_{j2}^*, \ldots, p_{ji}^*, \ldots, p_{jK}^*]$, where $p_{ji}^*$ denotes the output probability of the j-th folk song in the test set with respect to the i-th CHMM model.

Step 4: The feature vectors of all songs in the training set are normalized to form the final training vector set of the auditory perception features $[p_1^m, p_2^m, \ldots, p_j^m, \ldots, p_N^m]^T$ $(m = 1, 2, \ldots, K)$, where $p_j^m$ is the output-probability vector of the j-th folk song of the m-th region with respect to all CHMM models.

Step 5: The feature vectors of all songs in the test set are normalized to form the final test vector set of the auditory perception features $[p_1^*, p_2^*, \ldots, p_j^*, \ldots]^T$, where $p_j^*$ is the output-probability vector of the j-th folk song in the test set with respect to all CHMM models.
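A minimal sketch of Steps 1–5 follows, using hmmlearn's GaussianHMM as a stand-in for the improved CHMM of [21]; the library, the diagonal covariance and the L2 normalization are our assumptions, and decode() returns the Viterbi log-probability used in Step 3.

```python
# Hedged sketch of the CHMM-based probability features (Steps 1-5).
# hmmlearn's GaussianHMM stands in for the improved CHMM of [21].
import numpy as np
from hmmlearn import hmm

def train_region_models(features_by_region, n_states=5, n_iter=200):
    """Step 2: train one HMM per region with Baum-Welch."""
    models = {}
    for region, songs in features_by_region.items():
        X = np.vstack(songs)               # all frames of the region's songs
        lengths = [len(s) for s in songs]  # per-song sequence lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=n_iter, tol=1e-5)
        m.fit(X, lengths)
        models[region] = m
    return models

def song_probability_vector(models, song):
    """Steps 3-5: Viterbi log-probability under every region model,
    concatenated into one K-dimensional vector and normalized."""
    v = np.array([models[r].decode(song, algorithm="viterbi")[0]
                  for r in sorted(models)])
    return v / np.linalg.norm(v)
```

Calling song_probability_vector for every training and test song yields the final feature sets of Steps 4 and 5.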

3. VISUAL FEATURES EXTRACTION
In this section, we describe the selection, extraction and processing of visual features from the color spectrogram. Section 3.1 describes the reasons for choosing the uniform LBP and Contrast visual features. The extraction process of these visual features is described in Section 3.2, and Section 3.3 deals with the dimensionality reduction of the visual features.

3.1 Visual Features
The extraction of texture features is based on the color time-frequency map rather than the gray-scale map, because the color space contains more information than the gray space. This improves the recognition accuracy of the image, and hence the recognition accuracy of the folk song.

Local Binary Pattern (LBP) has been one of the most powerful and successful texture features of the last ten years [17]. The binary pattern is built from the differences between a pixel and its equally spaced neighbours at a previously defined distance. Since the number $2^P$ of standard LBPs increases sharply with the number of sampling points $P$ in the neighbourhood of radius $R$, while the actual number of pixels in this region is relatively small, using the standard LBP results in an overly sparse histogram, which loses statistical significance and texture information. Therefore, we choose the uniform LBP instead of the standard LBP. The uniform LBP reduces the number of patterns in the neighbourhood of radius $R$ from $2^P$ to $P \times (P-1) + 3$ (the $P \times (P-1) + 2$ uniform LBPs plus one bin for the remaining non-uniform LBPs), while still preserving enough patterns with significant rendering capability.
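As an illustration, a 59-bin uniform-LBP histogram for P = 8 can be computed with scikit-image; the library is our assumption, and its 'nri_uniform' method yields exactly the P×(P−1)+3 = 59 patterns described above.

```python
# Hedged sketch: 59-bin uniform LBP histogram for one gray-scale image.
# scikit-image is an assumed toolchain; method='nri_uniform' gives the
# P*(P-1)+2 uniform patterns plus one bin for all non-uniform patterns.
import numpy as np
from skimage.feature import local_binary_pattern

def uniform_lbp_histogram(gray_image, P=8, R=1):
    codes = local_binary_pattern(gray_image, P, R, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=P * (P - 1) + 3,
                           range=(0, P * (P - 1) + 3))
    return hist / hist.sum()  # normalized 59-dimensional histogram
```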


Figure 5. The 58 kinds of $LBP^{u2}_{P,R}$ operators of the uniform LBP (P = 8).

Figure 6. The extraction process of visual features with the G channel as the center (the three planes represent the three color channels).

In addition, the recognition of texture is based not only on the distribution of the patterns, but also on their strength. However, it is not difficult to see that the $LBP^{u2}_{P,R}$ value of the uniform LBP is not affected by any monotonic change of gray value: $LBP^{u2}_{P,R}$ is the same as long as the pixel values in the neighbourhood are in the same order, regardless of the intensity (or contrast) information of the image. Therefore, we choose the Contrast feature to measure the intensity of the uniform LBP. The variance is usually used as the measure of the Contrast feature [18], defined as

$VAR_{P,R} = \frac{1}{P}\sum_{i=0}^{P-1}(g_i - \mu)^2$   (1)

where $\mu = \frac{1}{P}\sum_{i=0}^{P-1} g_i$. From this definition we can see that the Contrast feature is also rotation invariant.

In this paper, the value of $P$ is set to 8 to avoid too high a computational complexity, and the neighbourhood radius $R$ is only taken to be 1 or 2, because as the radius $R$ increases, the correlation between the central pixel and its neighbourhood pixels weakens, which makes the visual features less meaningful. Figure 5 shows the 58 kinds of $LBP^{u2}_{P,R}$ operators of the uniform LBP.
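A matching sketch for the Contrast feature of Equation (1), again assuming scikit-image, whose 'var' method computes the rotation-invariant neighbourhood variance; the per-image quantization into 16 levels is a simplification of the quantization described in Section 3.2.

```python
# Hedged sketch: VAR_{P,R} contrast histogram with 16-level quantization.
import numpy as np
from skimage.feature import local_binary_pattern

def contrast_histogram(gray_image, P=8, R=1, levels=16):
    var = local_binary_pattern(gray_image, P, R, method="var")  # Eq. (1)
    var = np.nan_to_num(var)  # guard against non-finite border values
    # Quantize the continuous VAR values to 16 levels, then bin them.
    edges = np.linspace(0.0, var.max() + 1e-9, levels + 1)
    hist, _ = np.histogram(var, bins=edges)
    return hist / hist.sum()  # normalized 16-dimensional histogram
```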

3.2 Processing of Visual Features Extraction
When extracting the above two features, we considered the following problems. Color images are combinations of the RGB channels, so the correlation between different color channels should also be captured when extracting texture features. In addition, since the values of the Contrast feature are continuous, the $VAR_{P,R}$ operators obtained in different local regions may be completely different, which would result in a very large feature dimension; the discretization of the $VAR_{P,R}$ operators therefore needs to be considered during extraction.

To solve these two sub-problems, we first transform the whole audio of each folk song into a color time-frequency map, and then transform each color time-frequency map into three gray-scale images corresponding to the R, G and B channels. Second, we apply the $LBP^{u2}_{P,R}$ and $VAR_{P,R}$ operations to the gray image of each color channel. To account for the correlation between the three color channels, we also apply the $LBP^{u2}_{P,R}$ operation to each pair of color channels in addition to the single-channel operation, taking the pixel at the neighbourhood center and the surrounding pixels from different channels. Furthermore, to deal with the continuity of the $VAR_{P,R}$ values, we quantize the feature space of the $VAR_{P,R}$ operations before computing the $VAR_{P,R}$ histogram eigenvector, which greatly reduces the dimension of the feature space; the quantization level is set to 16 in this paper (i.e., 4 quantization bits). Figure 6 shows the extraction process of the visual features with the G channel as the center; the process for the other two color channels is identical. The main steps of the visual feature extraction are as follows.

Step 1: All the audio files of the folk songs are converted into color time-frequency maps, and the gray images of the three RGB color channels are extracted from each map. At the same time, the neighbourhood radius $R$ and the number of neighbourhood elements $P$ are determined.

Step 2: For each folk song, a pixel $A$ with value $g_c^G$ is taken from the gray-scale image of the G channel as the center. The values of the $P$ points in its neighbourhood of radius $R$ are then extracted from the G channel itself and from the two external channels R and B, denoted by $(g_0^G, g_1^G, \ldots, g_{P-1}^G)$, $(g_0^R, g_1^R, \ldots, g_{P-1}^R)$ and $(g_0^B, g_1^B, \ldots, g_{P-1}^B)$. (In the experiments, a circular neighbourhood is used, and the pixel values on the circle are estimated by bilinear interpolation.)

Step 3: For the point $A$ with pixel value $g_c^G$, the LBP codes and the three $LBP^{u2}_{P,R}$ operators corresponding to the three neighbourhoods $(g_0^G, \ldots, g_{P-1}^G)$, $(g_0^R, \ldots, g_{P-1}^R)$ and $(g_0^B, \ldots, g_{P-1}^B)$ are calculated.

Step 4: The operator $VAR_{P,R}$ corresponding to the neighbourhood values $(g_0^G, g_1^G, \ldots, g_{P-1}^G)$ is calculated by Equation (1).

Step 5: Steps 2–4 are repeated until the three $LBP^{u2}_{P,R}$ operators and one $VAR_{P,R}$ value are obtained for every pixel in the G-channel image. Three LBP histogram sequences for the G channel are then obtained by collecting statistics of the $LBP^{u2}_{P,R}$ operators in the G–R and G–B interactive channels and the G internal channel, respectively. Similarly, the Contrast histogram sequence is obtained by collecting statistics of the $VAR_{P,R}$ values in the G internal channel.

Step 6: Following Steps 2–5, the R channel image and the B channel image are taken as the center in turn, and the histogram sequences corresponding to the R channel image and the B channel image are obtained.

Step 7: All the histogram sequences corresponding to the R, G and B color channels are concatenated and transformed into a vector whose elements are the histogram counts. The visual features of each song are obtained after vector normalization.
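To illustrate the cross-channel idea of Steps 2 and 3, the following numpy sketch computes an LBP code map whose center pixels come from one channel and whose neighbours come from another; for simplicity it uses the 8 integer offsets of a 3×3 neighbourhood (R = 1) in place of the bilinear-interpolated circular neighbourhood, and it stops at the raw codes (mapping them to the 59 uniform bins proceeds as in Section 3.1).

```python
# Hedged sketch of a cross-channel LBP (Steps 2-3), simplified to R = 1
# with integer offsets instead of bilinear interpolation.
import numpy as np

# 8 neighbour offsets for P = 8, R = 1, enumerated clockwise.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def cross_channel_lbp(center_ch, neighbor_ch):
    """Threshold `neighbor_ch` pixels against `center_ch` centers,
    e.g. center_ch = G plane, neighbor_ch = R plane."""
    h, w = center_ch.shape
    center = center_ch[1:h-1, 1:w-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbor = neighbor_ch[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(np.uint8) << bit
    return codes
```

Calling this with (G, G), (G, R) and (G, B) gives the internal-channel code map and the two interactive-channel code maps of Step 3.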

3.3 Dimensionality Reduction of Visual Features
From Section 3.2 we can see that the visual feature vector is composed of 9 LBP histogram sequences and 3 Contrast histogram sequences for a given radius $R$ and number of sampling points $P$. Since the LBP and Contrast histograms have 59 and 16 dimensions respectively, the visual feature has 59×9 + 16×3 = 597 dimensions. Such a high dimension makes the computational complexity increase dramatically as the number of songs grows, so it is necessary to reduce the dimensionality of the visual features. No dimensionality reduction is applied to the Contrast feature, because it has fewer dimensions and does not exist in the interaction channels, so it has little influence on the dimension of the final visual features.

The CV (Coefficient of Variation) is used to measure the difference of the same pattern in the same color channel across the three regions' folk songs. It is calculated as Equation (2):

$CV = \sigma / \mu$   (2)

where $\sigma$ is the standard deviation of the pattern and $\mu$ is its mean. A larger CV value corresponds to a greater difference of the corresponding pattern across the three regions' folk songs, and vice versa.

In this paper, a pattern with little discriminative power is taken to be one whose CV value is less than $\alpha$ in the internal channels or less than $\beta$ in the interaction channels; such patterns are discarded. In Section 4, we experimentally determine appropriate values of $\alpha$ and $\beta$ for each radius $R$ and find the radius $R$ that gives the highest recognition accuracy. After the optimal radius $R$ is determined, the visual features of each song in the training and test sets are concatenated with the corresponding auditory perception features to form the new training and test feature vector sets.
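A minimal sketch of this CV-based selection follows, assuming each region's histograms are stacked row-wise in a numpy array and that the comparison is made on the class-mean histograms; the helper name and these details are our illustration rather than the paper's exact procedure.

```python
# Hedged sketch: keep only patterns whose coefficient of variation across
# the three regions' mean histograms reaches the threshold (Equation (2)).
import numpy as np

def select_patterns(region_histograms, threshold):
    # region_histograms: list of (songs x dims) arrays, one per region.
    means = np.array([h.mean(axis=0) for h in region_histograms])
    cv = means.std(axis=0) / (means.mean(axis=0) + 1e-12)  # CV = sigma / mu
    return np.flatnonzero(cv >= threshold)  # indices of retained patterns
```

Internal-channel patterns would be filtered with threshold α and interaction-channel patterns with threshold β.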

4. EXPERIMENTS AND ANALYSIS
In this section, we evaluate the regional recognition accuracy of Chinese folk songs with the combination of auditory perception features and visual features through a series of experiments.

4.1 Dataset and Experimental Setup
We use three very representative types of Chinese folk songs as our dataset: ShaanXi-XinTianYou (SX), JiangNan-XiaoDiao (JN) and HuNan-HaoZi (HN). The recordings come from "The Integration of Chinese Folk Songs" [19] and comprise 109 ShaanXi-XinTianYou, 101 JiangNan-XiaoDiao and 134 HuNan-HaoZi; the same dataset was used in our previous work on the regional classification and music analysis of Chinese folk songs [20-22].

In the experiments, the audio files of the folk songs are monophonic, with a sampling rate of 22050 Hz, 16 sampling bits, a frame length of 1024 and a frame shift of 512. The regional recognition experiments are divided into five groups. Within each group, we randomly select 80% of each region's folk songs as the training set and the remaining 20% as the test set, and use accuracy as the evaluation metric. An SVM with the RBF kernel is used as the classifier in each group, with the parameters $C$ and $\lambda$ determined by grid search. After the auditory perception features and visual features are extracted, each feature set is classified by its own SVM classifier, and the recognition accuracy of the combined features is then obtained with the sum rule [23], a well-known late fusion method.
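The following sketch mirrors this setup with scikit-learn, which is our assumption (the paper does not name its SVM implementation): one RBF-kernel SVM per feature set, grid-searched hyperparameters, and sum-rule fusion of the class scores.

```python
# Hedged sketch: per-feature-set RBF SVMs fused by the sum rule [23].
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_svm(X, y):
    grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}  # assumed grid
    clf = GridSearchCV(SVC(kernel="rbf", probability=True), grid, cv=5)
    return clf.fit(X, y)

def sum_rule_predict(clf_audio, clf_visual, X_audio, X_visual):
    # Sum rule: add the per-class posterior scores of the two classifiers.
    scores = clf_audio.predict_proba(X_audio) + clf_visual.predict_proba(X_visual)
    return np.argmax(scores, axis=1)
```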

4.2 Parameters Determination for Features Extraction
To obtain better recognition performance, several parameters should be tuned to optimize the extraction of the auditory perception and visual features.

4.2.1 Auditory Perception Features
The optimal parameters of the CHMM models used to extract the auditory perception features are determined by both the number of HMM states W and the number of Gaussian models P (a detailed introduction to the model can be found in [21]). We vary both W and P from 3 to 8. The initial state probability distribution and the initial state transition matrix are generated randomly, the number of training iterations of the CHMM models is 200, and the allowable error is 10^-5. Here, only the newly extracted auditory perception feature vectors are used for the regional recognition of Chinese folk songs, in order to obtain the optimal parameters of the CHMM models.

Table 1. Averaged recognition accuracy using combinations of different numbers of HMM states W and Gaussian models P

| W \ P | 3      | 4      | 5      | 6      | 7      | 8      |
|-------|--------|--------|--------|--------|--------|--------|
| 3     | 75.71% | 75.14% | 76.00% | 74.57% | 76.57% | 74.86% |
| 4     | 78.00% | 78.29% | 77.14% | 77.71% | 77.43% | 77.43% |
| 5     | 80.29% | 81.43% | 80.57% | 79.43% | 78.86% | 78.00% |
| 6     | 78.86% | 80.00% | 79.14% | 79.71% | 77.14% | 76.57% |
| 7     | 77.14% | 78.57% | 78.86% | 78.00% | 76.86% | 75.71% |
| 8     | 75.43% | 76.86% | 77.71% | 76.00% | 75.43% | 74.29% |

As can be seen from Table 1, the best parameters of the model are W = 5 and P = 4, with an accuracy of 81.43% when using only the auditory perception features. In addition, the accuracy of every parameter combination in Table 1 is above 74%, much higher than random guessing, which indirectly shows the feasibility of using auditory perception features for the recognition of Chinese folk songs.

4.2.2 Visual Features
The number of neighbourhood elements P is set to 8, and R is set to 1 or 2. Moreover, we need to consider the recognition accuracy of Chinese folk songs for different values of R, α and β. Here, only the extracted visual features are used for the regional recognition in order to obtain the optimal parameters. The average recognition results of the five groups of experiments are shown in Table 2. It can be seen that the recognition accuracies are all above 80% for all tested values of R, α and β, which shows the feasibility of using visual features for the regional recognition of Chinese folk songs. The neighbourhood radius R affects the extraction of visual features, and a smaller radius is more suitable for the recognition of Chinese folk songs: when R is larger, the correlation between the element at the neighbourhood center and the P elements around it is weaker. At the same time, the recognition accuracy is highest when α = 0.1 and β = 0.5, regardless of whether R = 1 or R = 2, and the accuracy for R = 1 is higher than for R = 2. It can therefore be concluded that α = 0.1 and β = 0.5 are the most appropriate thresholds and R = 1 is the best radius for folk song recognition.

Table 2. Recognition accuracy using different combinations of α and β

| R   | α    | β=0.30 | β=0.40 | β=0.50 | β=0.60 | β=0.70 |
|-----|------|--------|--------|--------|--------|--------|
| R=1 | 0.05 | 84.00% | 83.14% | 82.29% | 84.00% | 83.43% |
|     | 0.10 | 82.29% | 84.30% | 84.68% | 84.29% | 83.14% |
|     | 0.15 | 83.14% | 83.71% | 82.29% | 82.86% | 84.00% |
|     | 0.20 | 82.29% | 83.43% | 82.00% | 83.14% | 82.00% |
| R=2 | 0.05 | 82.86% | 81.43% | 82.29% | 82.00% | 81.43% |
|     | 0.10 | 82.57% | 82.86% | 83.20% | 82.86% | 82.00% |
|     | 0.15 | 82.00% | 82.29% | 81.43% | 81.14% | 82.26% |
|     | 0.20 | 81.43% | 82.00% | 81.14% | 82.57% | 81.14% |

In addition, in order to evaluate the dimensionality reduction, we compare the recognition accuracy and recognition time before and after dimensionality reduction, using Matlab 2014b on a CPU i5-4590 3.30 GHz with 8 GB RAM and the optimal parameters obtained from the above analysis. The results are shown in Table 3. The dimensionality reduction not only improves the recognition accuracy of the folk songs, but also reduces the recognition time.

Table 3. Recognition accuracy and time before and after dimensionality reduction of the visual features

| Experiment group | Accuracy (before) | Accuracy (after) | Time (s, before) | Time (s, after) |
|------------------|-------------------|------------------|------------------|-----------------|
| 1                | 82.01%            | 83.77%           | 0.2820           | 0.1431          |
| 2                | 83.07%            | 85.02%           | 0.2986           | 0.1620          |
| 3                | 81.29%            | 83.71%           | 0.2973           | 0.1560          |
| 4                | 82.65%            | 84.91%           | 0.2720           | 0.1420          |
| 5                | 82.57%            | 85.99%           | 0.2799           | 0.1440          |

4.3 Classification Performance
The auditory perception features and visual features extracted with the optimized parameters are combined to recognize Chinese folk songs. Table 4 shows the average recognition results of the five groups of experiments. From these results, the recognition accuracy using the perceptual and visual features together is 89.29%, higher than the 81.43% obtained using only auditory perception features and the 84.68% obtained using only visual features. This also shows that the two kinds of features can be combined into a stronger feature.

Table 4. Confusion matrix of the averaged recognition result using the perceptual features and visual features

|    | SX   | JN   | HN   |
|----|------|------|------|
| SX | 20.3 | 0.6  | 1.1  |
| JN | 1    | 19.3 | 0.7  |
| HN | 2.8  | 1.3  | 22.9 |

In order to demonstrate the superiority of our method for the regional recognition of Chinese folk songs, we compare it with other folk song recognition algorithms on the dataset of this paper. The results are shown in Table 5. Our method, which combines the auditory perception features with the visual features, achieves the highest recognition accuracy among the existing algorithms based on audio files. It also shows that, compared with algorithms using only acoustic features (or part of the auditory perception features) [1-5], [21-22], our method greatly improves the recognition accuracy.

Table 5. Recognition accuracies of the proposed method and other existing approaches

| Method                                        | Accuracy |
|-----------------------------------------------|----------|
| Liu Y et al. [1]                              | 76.57%   |
| Xu J et al. [2]                               | 54.29%   |
| Liu Y et al. [3]                              | 83.71%   |
| Liu Y et al. [4]                              | 84.85%   |
| Khoo S et al. [5]                             | 56.29%   |
| Li J et al. [21]                              | 83.14%   |
| Li J et al. [22]                              | 88.71%   |
| Only auditory perception features             | 81.43%   |
| Only visual features                          | 84.68%   |
| Combined features (directly concatenated)^a   | 88.57%   |
| Combined features (sum rule)^b                | 89.29%   |

a Directly concatenating the two feature vectors into one vector.
b Using the sum rule with an ensemble SVM classifier.

5. CONCLUSION AND FUTURE WORK
In this paper, we combine auditory perception features and visual features for the regional recognition of Chinese folk songs. Our method not only fully considers the musical characteristics of folk songs, but also improves the extraction of both the perceptual and the visual features. For the auditory perception features, the temporal relations between the features of each frame are considered. For the visual features, we use the color spectrogram to extract the texture patterns of the spectrum and the intensities of these patterns. Finally, the two kinds of features are fused using an ensemble SVM classifier. Experiments show that the proposed recognition method achieves an accuracy of 89.29%, which outperforms existing methods. In the future, we will try more musical perceptual features and visual features for the regional recognition of Chinese folk songs, and add folk songs from more regions to improve our method. In addition, we will also consider using popular Deep Learning methods to learn new features for the regional recognition of Chinese folk songs.


6. REFERENCES
[1] Liu, Y., Xu, J., Wei, Y. and Tian, Y. 2007. The study of the classification of Chinese folk songs by regional style. In Proc. of the Int. Conf. on Semantic Computing. 657–662.
[2] Xu, J., Wang, P. and Yan, L. 2008. Feature selection for automatic classification of Chinese folk songs. In Proc. of the Congress on Image and Signal Processing. 441–446.
[3] Yi, L., Lei, W., Liu, Z. L. and Peng, W. 2008. The feature selection of regional style classification of Chinese folk songs. Acta Electronica Sinica. 36, S1, 152–156.
[4] Liu, Y., Wei, L. and Wang, P. 2009. Regional style automatic identification for Chinese folk songs. In Proc. of the WRI World Congress on Computer Science and Information Engineering. 5–9.
[5] Khoo, S., Man, Z. and Cao, Z. 2013. Automatic Han Chinese folk song classification using the musical feature density map. In Proc. of the Int. Conf. on Signal Processing and Communication Systems. 1–9.
[6] Panagakis, Y., Kotropoulos, C. L. and Arce, G. R. 2014. Music genre classification via joint sparse low-rank representation of audio features. IEEE/ACM Trans. on Audio, Speech and Language Processing. 22, 12, 1905–1917.
[7] Tahon, M. and Devillers, L. 2016. Towards a small set of robust acoustic features for emotion recognition: Challenges. IEEE/ACM Trans. on Audio, Speech and Language Processing. 24, 1, 16–28.
[8] Ghosal, A., Saha, S. K., Dhara, B. C. and Chakraborty, R. 2012. Music classification based on MFCC variants and amplitude variation pattern: A hierarchical approach. Int. J. of Signal Processing, Image Processing and Pattern Recognition. 5, 1, 131–150.
[9] Nandedkar, V. 2011. Audio retrieval using multiple feature vectors. Int. J. of Electrical and Electronics. 1, 1, 1–5.
[10] Fu, Z., Lu, G., Kai, M. T. and Zhang, D. 2011. A survey of audio-based music classification and annotation. IEEE Trans. on Multimedia. 13, 2, 303–319.
[11] Wu, M. J., Chen, Z. S., Jang, J. S. R., Ren, J. M., Li, Y. H. and Lu, C. H. 2011. Combining visual and acoustic features for music genre classification. In Proc. of the 10th Int. Conf. on Machine Learning and Applications and Workshops. 124–129.
[12] Costa, Y. M. G., Oliveira, L. S., Koerich, A. L., Gouyon, F. and Martins, J. G. 2012. Music genre classification using LBP textural features. Signal Processing. 92, 11, 2723–2737.
[13] Nanni, L., Costa, Y. M. G., Lucio, D. R., Jr, C. N. S. and Brahnam, S. 2017. Combining visual and acoustic features for audio classification tasks. Pattern Recognition Letters. 88, 49–56.
[14] Nanni, L., Costa, Y. M. G., Lumini, A., Kim, M. Y. and Baek, S. R. 2016. Combining visual and acoustic features for music genre classification. Expert Systems with Applications. 45, C, 108–117.
[15] Abdelaziz, A. H., Zeiler, S. and Kolossa, D. 2015. Learning dynamic stream weights for coupled-HMM-based audiovisual speech recognition. IEEE/ACM Trans. on Audio, Speech and Language Processing. 23, 5, 863–876.
[16] Mccracken, M. and Patwari, N. 2012. Hidden Markov estimation of bistatic range from cluttered ultra-wideband impulse responses. IEEE Trans. on Mobile Computing. 13, 7, 1–4.
[17] Ojala, T., Pietikainen, M. and Maenpaa, T. 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence. 24, 7, 971–987.
[18] Cao, H. G., Yuan, B. H. and Zhu, H. S. 2012. Recognition of intersected face based on contrast information and local binary pattern. Journal of Shandong University. 42, 4, 29–34.
[19] The Editorial Committee of "Integration of Chinese Folk Songs". 1994. Integration of Chinese Folk Songs. Chinese ISBN Center.
[20] Li, J., Wang, Y. and Yang, X. 2016. General characteristics analysis of Chinese folk songs based on layered stabilities detection audio segmentation algorithm. In Proc. of the 42nd Int. Computer Music Conference. 16–20.
[21] Li, J., Wang, Y. and Yang, X. 2017. Regional recognition of Chinese folk songs based on LSD audio segmentation algorithm. In Proc. of the 9th Int. Conf. on Computer and Automation Engineering. 60–65.
[22] Li, J., Ding, J. and Yang, X. 2017. The regional style classification of Chinese folk songs based on GMM-CRF model. In Proc. of the 9th Int. Conf. on Computer and Automation Engineering. 66–72.
[23] Kittler, J., Hatef, M., Duin, R. P. W. and Matas, J. 1998. On combining classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence. 20, 3, 226–239.
