Multimed Tools Appl DOI 10.1007/s11042-016-4332-z

Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection

Yanxiong Li & Xue Zhang & Hai Jin & Xianku Li & Qin Wang & Qianhua He & Qian Huang

Received: 7 July 2016 / Revised: 29 December 2016 / Accepted: 29 December 2016
© Springer Science+Business Media New York 2017

Abstract Extraction of effective audio features from acoustic events strongly influences the performance of an Acoustic Event Detection (AED) system, especially in adverse audio conditions. In this study, we propose a framework for extracting a Deep Audio Feature (DAF) using a multi-stream hierarchical Deep Neural Network (DNN). The DAF output by the proposed framework fuses the potentially complementary information of multiple input feature streams and can thus be more discriminative than those input features for AED. We take two input feature streams and a two-stage hierarchical DNN as an example to illustrate the extraction of DAF. The effectiveness of different audio features for AED is evaluated on two audio corpora, i.e. a BBC (British Broadcasting Corporation) audio dataset and a TV audio dataset, with different signal-to-noise ratios. Experimental results show that DAF outperforms the other features for AED under several experimental conditions.

Keywords: Audio feature · Acoustic event · Deep neural network

1 Introduction

An acoustic event is an audio segment that humans would label as a distinctive concept [14]. The goal of AED is to label temporal regions within an audio recording, resulting in a symbolic description in which each annotation gives the start time, end time and label of a single instance of a particular event type. Classification and detection of acoustic events are useful for multimedia retrieval [3] and for audio-based surveillance and monitoring [1, 17]. The problem has received great attention from the research community, with many recent evaluation campaigns [7, 11, 34, 36].

* Corresponding author: Yanxiong Li, [email protected]

1 School of Electronic and Information Engineering, South China University of Technology, 381 Wushan Road, Guangzhou, China


The AED problem has not yet been solved, due to large variations of the time-frequency characteristics within each kind of acoustic event, non-stationary additive background noise, overlap of acoustic events, and so on [27]. The overall performance of an AED system relies upon two main parts: the feature extraction stage and the classification stage. Therefore, in most recent studies [21, 24, 26], these two stages are the main concerns of researchers for AED, owing to the larger diversity and variability of acoustic events compared with speech signals.

Classification is routinely performed by statistical machine learning algorithms such as the Gaussian Mixture Model (GMM) [13], Hidden Markov Model (HMM) [31], Support Vector Machine (SVM) [25, 35, 39, 42, 44] or decision tree [18], which model class-specific feature distributions or estimate discriminative system parameters. Recently, some works using a DNN as a classifier for AED were reported in [4, 10, 22]. McLoughlin et al. used a DNN as a classifier for robust classification of acoustic events and obtained better results than an SVM classifier [22]. Gencoglu et al. used DNN classifiers for recognizing 61 classes of acoustic events [10], and obtained classification accuracies of 60.3% and 54.8% for DNN and HMM classifiers, respectively.

A good feature should be able to discriminate effectively between different classes of acoustic events, while keeping the variation within a given audio class small. It should also be insensitive to external influences, such as noise or the environment. Mel-Frequency Cepstral Coefficients (MFCC) are among the most popularly used features for AED. However, they are not suitable for discriminating acoustic events under non-stationary noise, due to their poor robustness [30]. Gabor filter-bank features are biologically inspired spectro-temporal features, derived from a filter bank of two-dimensional Gabor functions that decompose the spectro-temporal power density into components which capture spectral, temporal and joint spectro-temporal modulation patterns [32]. The use of Gabor features for robust AED in combination with an HMM classifier has been proposed by Schädler et al., reporting improved performance relative to an MFCC baseline in adverse noise conditions [30, 33]. However, these features cannot deeply represent complex acoustic events and have their own shortcomings, such as limited robustness and discriminability.

In order to represent the deep differences among various acoustic events and to integrate the strengths of the input feature streams, we propose a framework for extracting a Deep Audio Feature (DAF) using multi-stream hierarchical DNNs. In this framework, the hidden layers of each DNN are trained to extract information relevant for discriminating between acoustic events. The output of the narrowest hidden layer (i.e. the bottleneck layer) of a DNN is used as a new feature representation. The bottleneck layer is a special hidden layer of a DNN whose neurons are much fewer than those of the other hidden layers. The activation signals in the bottleneck layer can be used as a compact representation of the original high-dimensional inputs fed to the input layer of the DNN [12, 41]. To the best of our knowledge, there is no report on extracting DAF using multi-stream hierarchical DNNs in the application of AED. DAF not only takes advantage of the complementary information between different feature streams but also exploits the non-linear transformation and dimensionality reduction introduced by the DNNs.
Evaluated on two audio corpora, the BBC audio dataset and the TV audio dataset, the proposed DAF is superior to both MFCCs and Gabor features for AED under several experimental conditions. The rest of this paper is structured as follows. Section 2 presents the proposed framework for DAF extraction. Section 3 presents experiments and discussions for AED using different audio features on the two audio corpora, and conclusions are finally drawn in Section 4.


2 Proposed framework for DAF extraction

This section presents the proposed framework for extracting DAF, in which a DNN is used as an extractor of audio features.

2.1 Components of the framework

The proposed method aims at mining the potentially complementary information contained in multiple input feature streams through the non-linear transformations of multi-stream hierarchical DNNs. In this study, we take two input feature streams and two stages of DNNs as an example, as shown in Fig. 1. In Fig. 1, the textbox "×31" denotes a context of 31 frames of features fed to a DNN, while the textbox "DCT" represents the discrete cosine transform. The two input feature streams are respectively fed to DNN1 and DNN2 of the first stage, and the transformed features are then output from the bottleneck layers (pointed to by dotted arrows in Fig. 1) of DNN1 and DNN2. The outputs of the bottleneck layers of DNN1 and DNN2 are called bottleneck features of the input feature streams and are denoted BN1 and BN2, respectively. BN1 and BN2 are expected to be more discriminative, because they represent a non-linear feature transformation and dimensionality reduction of the corresponding input feature streams. The combination of BN1 and BN2 is then fed to the DNN of the second stage, as shown in the middle part of Fig. 1. The output of the bottleneck layer of the second-stage DNN is the anticipated feature, i.e. DAF. The DAF, which merges the potentially complementary information of feature stream 1 and feature stream 2 through the hierarchical DNNs, can thus possess the strengths of all input features for AED.

Each component DNN in Fig. 1, e.g. DNN1 of the first stage, is the Deep Belief Network (DBN) proposed in [16]. The DNN is built from individually pre-trained RBM (Restricted Boltzmann Machine) pairs. Each RBM pair consists of V visible and H hidden stochastic neurons, v = [v_1, v_2, …, v_V]^T and h = [h_1, h_2, …, h_H]^T. The intermediate and final layers are Bernoulli-Bernoulli RBMs, while the DNN input layer is a Gaussian-Bernoulli RBM.

Fig. 1 The proposed framework for extracting DAF based on multi-stream hierarchical DNNs. Feature stream 1 and feature stream 2 are fed to DNN1 and DNN2 of the first stage (a ×31 frame context and a DCT are applied where indicated), whose bottleneck layers (dotted arrows) output BN1 and BN2; the combined BN1 and BN2 are fed to the DNN of the second stage, whose bottleneck layer outputs the Deep Audio Feature

In a Bernoulli-Bernoulli RBM, the neurons are assumed to be binary, i.e. v_bb ∈ {0, 1}^V and h_bb ∈ {0, 1}^H. The energy function E_bb(v, h) of the state (v, h) is defined by

$$E_{bb}(\mathbf{v},\mathbf{h}) = -\sum_{i=1}^{V}\sum_{j=1}^{H} v_i h_j \omega_{ji} - \sum_{i=1}^{V} v_i b_i^v - \sum_{j=1}^{H} h_j b_j^h, \qquad (1)$$

where ω_ji denotes the weight between the ith visible unit and the jth hidden unit, and b_i^v and b_j^h are the corresponding real-valued biases. The parameters of the Bernoulli-Bernoulli RBM are θ_bb = {W, b^h, b^v}, with weight matrix W = {ω_ij}_{V×H} and biases b^h = [b_1^h, b_2^h, …, b_H^h]^T and b^v = [b_1^v, b_2^v, …, b_V^v]^T. The visible neurons of a Gaussian-Bernoulli RBM are real-valued, i.e. v_gb ∈ R^V, whereas the hidden neurons are binary, i.e. h_gb ∈ {0, 1}^H. Hence, the energy function is

$$E_{gb}(\mathbf{v},\mathbf{h}) = \sum_{i=1}^{V}\frac{(v_i - b_i^v)^2}{2\sigma_i^2} - \sum_{i=1}^{V}\sum_{j=1}^{H}\frac{v_i}{\sigma_i^2}\, h_j \omega_{ji} - \sum_{j=1}^{H} h_j b_j^h. \qquad (2)$$

Every visible unit v_i adds a parabolic offset to the energy function, governed by σ_i. Therefore, the Gaussian-Bernoulli RBM parameters contain one extra term, θ_gb = {W, b^h, b^v, σ²}, with pre-given variances σ_i². Given an energy function E(v, h) defined by either Eq. (1) or Eq. (2), the joint probability associated with a configuration (v, h) is defined by

$$p(\mathbf{v},\mathbf{h};\theta) = \frac{1}{Z}\, e^{-E(\mathbf{v},\mathbf{h};\theta)}, \qquad (3)$$

where Z is the partition function, $Z = \sum_{\mathbf{v}}\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\theta)}$. The training of a DNN with a bottleneck layer starts from Hinton's unsupervised generative pre-training procedure [15, 16], in which a stack of RBMs is trained to initialize the weight matrices with better starting points. After pre-training, the DNN with a bottleneck layer is fine-tuned in a supervised manner with the error back-propagation procedure to optimize a specified objective function. The details of DNN training can be found in [16].
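For concreteness, the two energy functions in Eqs. (1) and (2) can be evaluated with a few lines of NumPy. The sketch below is only illustrative: the variable names and the tiny random configuration are ours, and the σ_i² term in the cross-product of Eq. (2) follows the reconstruction above.

```python
import numpy as np

def energy_bb(v, h, W, b_v, b_h):
    """Bernoulli-Bernoulli RBM energy, Eq. (1); v and h are binary vectors."""
    return -(v @ W @ h) - v @ b_v - h @ b_h

def energy_gb(v, h, W, b_v, b_h, sigma):
    """Gaussian-Bernoulli RBM energy, Eq. (2); v is real-valued, h is binary."""
    return (np.sum((v - b_v) ** 2 / (2.0 * sigma ** 2))
            - ((v / sigma ** 2) @ W @ h)
            - h @ b_h)

# Tiny example with V = 4 visible and H = 3 hidden units (arbitrary values).
rng = np.random.default_rng(0)
W, b_v, b_h, sigma = rng.normal(size=(4, 3)), np.zeros(4), np.zeros(3), np.ones(4)
v_bin, h_bin = rng.integers(0, 2, 4), rng.integers(0, 2, 3)
print(energy_bb(v_bin, h_bin, W, b_v, b_h),
      energy_gb(rng.normal(size=4), h_bin, W, b_v, b_h, sigma))
```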

2.2 Feature extraction

Both MFCCs and Gabor features are chosen as the two streams of input features, considering their different strengths in representing the characteristics of acoustic events. MFCCs characterize an acoustic event in the cepstral domain and are expected to perform best for AED under high-SNR (Signal-to-Noise Ratio) conditions. By contrast, Gabor features represent an acoustic event in the spectro-temporal domain and are expected to be robust in adverse noise conditions. Therefore, we aim to extract a more discriminative and robust feature, i.e. DAF, by combining MFCCs with Gabor features through hierarchical DNNs.

2.2.1 MFCCs

MFCC extraction is a process that produces features representing the characteristics of a frame of an acoustic event. The extraction procedure of MFCCs is shown in Fig. 2. The waveform of an acoustic event is first split into overlapping frames and windowed using a Hamming window [19]. The Fast Fourier Transform (FFT) is used to compute the power spectrum, which is then smoothed with a bank of triangular filters.

Fig. 2 The extraction of MFCCs: s[n] → Hamming window → h[n] → |FFT|² → |H[k]| → Mel filter bank and logarithm → X[l] → DCT → C[m] (MFCCs)

The center frequencies of these triangular filters are uniformly spaced on the Mel scale. Finally, the logarithmic filter-bank outputs are converted into MFCCs by taking the Discrete Cosine Transform (DCT). Here, we use a 25 ms frame length with 10 ms overlap. The given waveform of an acoustic event, s[n], is pre-processed by applying a Hamming window, which splits the waveform into short frames. The windowed audio frame h[n] is computed by

$$h[n] = s[n]\left(0.54 - 0.46\cos\frac{2n\pi}{N-1}\right), \qquad (4)$$

where N is the number of samples in one frame (i.e. the frame length). To analyze h[n] in the frequency domain, an N-point FFT is performed to convert it into its corresponding frequency components. The value and the magnitude of a frequency component are computed by

$$H[k] = \sum_{n=0}^{N-1} h(n)\, e^{-j\frac{2nk\pi}{N}}, \qquad 0 \le k \le N-1, \qquad (5)$$

$$|H[k]| = \sqrt{\big(\mathrm{Re}(H[k])\big)^2 + \big(\mathrm{Im}(H[k])\big)^2}, \qquad (6)$$

where Re(H[k]) and Im(H[k]) represent the real and imaginary parts of H[k], respectively. The logarithmic power spectrum on the Mel scale is computed using a filter bank comprising L Mel filters [6],

$$X[l] = \log\left(\sum_{k=k_{ll}}^{k_{lu}} |H[k]|\, W_l[k]\right), \qquad (7)$$

where l = 0, 1, …, L−1; W_l[k] is the lth triangular filter; and k_ll and k_lu are the lower and upper bounds of the lth filter, respectively. The lower and upper bounds of a filter are determined by considering the relationship between frequency and the Mel scale [6]. A DCT is finally applied to X[l] to obtain the MFCCs, i.e. C[m]:

$$C[m] = \sum_{l=1}^{L} X[l]\cos\left(\frac{m(l-0.5)\pi}{L}\right), \qquad (8)$$

where m = 1, …, M. M is the order of the MFCCs and is set to 12 here.
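The whole chain of Eqs. (4)-(8) can be sketched directly in NumPy. The sketch below assumes 25 ms frames (400 samples at 16 kHz), a 10 ms hop and 26 Mel filters; the exact hop and filter count are our assumptions, since the paper only fixes the frame length, the overlap and M = 12.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr, f_lo=0.0, f_hi=None):
    """Triangular filters whose centre frequencies are uniformly spaced on the Mel scale."""
    f_hi = f_hi or sr / 2.0
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(1, n_filters + 1):
        lo, c, hi = bins[l - 1], bins[l], bins[l + 1]
        fb[l - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising slope
        fb[l - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling slope
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=12):
    """Eqs. (4)-(8): Hamming window, FFT magnitude, log Mel filter bank, DCT."""
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))   # Eq. (4)
    fb = mel_filterbank(n_filters, frame_len, sr)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        h = signal[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(h, frame_len))                      # Eqs. (5)-(6)
        X = np.log(fb @ mag + 1e-10)                                 # Eq. (7)
        l = np.arange(1, n_filters + 1)
        C = np.array([np.sum(X * np.cos(m * (l - 0.5) * np.pi / n_filters))
                      for m in range(1, n_ceps + 1)])                # Eq. (8)
        feats.append(C)
    return np.array(feats)   # shape (n_frames, 12)
```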

2.2.2 Gabor features

The audio signal can also be represented by spectro-temporal modulation patterns called Gabor features, as proposed in [29]. The use of Gabor filters [9] is motivated by their similarity to the spectro-temporal receptive fields of neurons in the auditory cortex of mammals [28]. Gabor features are computed by convolving the log Mel spectrogram of the audio with a set of 2-D modulation filters of varying extents, tuned to different rates and directions [23]. In the context of this paper, we use the set of 41 Gabor filters first used in [30]. The adopted filter-bank parameters (e.g. filter spacing, and the lowest and highest modulation frequencies in time and frequency) were empirically determined based on the suggestions in [30]. For each feature frame, every 2-D Gabor filter was convolved with a set of 23 log Mel spectral bands, with frequencies ranging from 64 to 4000 Hz. The resulting feature vector has 41 × 23 = 943 initial dimensions. This dimension was reduced to 311 by subsampling the 23-dimensional output of each filter, which aims at decreasing the redundancy of the selected features.
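As an illustration of the convolution step only (not a reproduction of the 41-filter bank of [30], whose exact parameters we do not restate), the sketch below builds a 2-D Gabor filter and applies a small bank of them to a 23-band log Mel spectrogram; the Hann envelope, filter sizes and modulation frequencies are placeholder choices.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(omega_f, omega_t, size_f=11, size_t=11):
    """One 2-D Gabor filter: a complex sinusoid with spectral/temporal modulation
    frequencies omega_f and omega_t (rad/channel, rad/frame) under a Hann envelope."""
    f = np.arange(size_f) - size_f // 2
    t = np.arange(size_t) - size_t // 2
    envelope = np.outer(np.hanning(size_f), np.hanning(size_t))
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))
    return envelope * carrier

def gabor_features(log_mel, filters):
    """Convolve a (23, n_frames) log Mel spectrogram with each filter and stack the
    outputs; the channel subsampling that reduces 943 to 311 dims is omitted."""
    outs = [convolve2d(log_mel, np.real(g), mode='same') for g in filters]
    return np.concatenate(outs, axis=0)        # (n_filters * 23, n_frames)

# Example: a few placeholder filters applied to a random 23-band spectrogram.
bank = [gabor_filter(wf, wt) for wf in (0.0, 0.5) for wt in (0.25, 0.5)]
feats = gabor_features(np.random.randn(23, 100), bank)   # (4 * 23, 100)
```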

2.2.3 Bottleneck features

Generally, bottleneck features are generated by a specially structured DNN which contains a narrow hidden layer with the smallest number of neurons, i.e. the bottleneck layer. Since the number of neurons in the bottleneck layer is normally much smaller than in the other hidden layers, DNN training forces the activation signals in the bottleneck layer to form a compact, lower-dimensional representation of the original inputs, as long as the DNN is carefully trained to achieve a low classification error rate at the output layer. Figure 3 gives the diagram of a DNN for extracting bottleneck features. The MFCCs or Gabor features are first fed to the input layer and then through the hidden layers of the DNN, and the transformed features, i.e. the bottleneck features, are finally extracted from the bottleneck layer. The symbols NI, NH, NB and NO in Fig. 3 denote the numbers of neurons of the input layer, hidden layers, bottleneck layer and output layer, respectively. Their settings are discussed in Sections 3.2 and 3.3.1.
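A PyTorch sketch of the kind of network drawn in Fig. 3 is given below, together with how two first-stage networks and a second-stage network would be chained to produce DAF. It is only a functional stand-in for the DBN pipeline of Section 2.1: RBM pre-training and the TNet toolkit are not reproduced, the sigmoid activations and class count are our assumptions, and the 31-frame context splicing and DCT of Section 3.2 are omitted here for brevity.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Sigmoid MLP with a narrow bottleneck layer just before the output layer (Fig. 3).
    Trained with plain back-propagation in this sketch (no RBM pre-training)."""
    def __init__(self, n_in, n_hidden=500, n_layers=3, n_bottleneck=26, n_out=10):
        super().__init__()
        dims = [n_in] + [n_hidden] * n_layers
        self.hidden = nn.Sequential(*[m for i in range(n_layers)
                                      for m in (nn.Linear(dims[i], dims[i + 1]), nn.Sigmoid())])
        self.bottleneck = nn.Sequential(nn.Linear(n_hidden, n_bottleneck), nn.Sigmoid())
        self.out = nn.Linear(n_bottleneck, n_out)   # softmax is applied inside the loss

    def forward(self, x):
        return self.out(self.bottleneck(self.hidden(x)))

    def extract(self, x):
        """Return bottleneck activations, i.e. BN1, BN2 or DAF depending on the stage."""
        with torch.no_grad():
            return self.bottleneck(self.hidden(x))

# Two-stage composition (context splicing and DCT between stages omitted):
dnn1 = BottleneckDNN(n_in=192)   # first stage, MFCC stream (spliced and DCT-reduced)
dnn2 = BottleneckDNN(n_in=311)   # first stage, Gabor stream
dnn3 = BottleneckDNN(n_in=52)    # second stage, fed by [BN1, BN2]
x_mfcc, x_gabor = torch.rand(8, 192), torch.rand(8, 311)
bn = torch.cat([dnn1.extract(x_mfcc), dnn2.extract(x_gabor)], dim=1)   # (8, 52)
daf = dnn3.extract(bn)           # (8, 26) Deep Audio Feature (untrained demo)
```

After supervised training of each network with a frame-level cross-entropy loss, extract() would provide the 26-dimensional BN1, BN2 or DAF vectors, depending on which stage the network implements.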

3 Experiments and discussions

This section begins by introducing the experimental data from two audio corpora, and then presents the experimental setups, including the parameter settings of DNN training and feature extraction, and the definitions of the performance evaluation metrics. Finally, different classifiers and features are compared for AED under several experimental conditions.

Fig. 3 The diagram of a DNN for extracting bottleneck features: input layer (NI neurons), hidden layers (NH neurons each), bottleneck layer (NB neurons) and output layer (NO neurons)


3.1 Experimental data

Experimental evaluation is carried out on data from two audio corpora, the BBC audio dataset and the TV audio dataset. The BBC audio dataset is collected from the BBC Sound Effects Library [2] and consists of a wide range of audio recordings, covering transport, animals, musical instruments, speech and other themes. The length of the BBC audio dataset is 10.19 h in total. The TV audio dataset consists of the sound tracks of different types of videos, including sport, situation comedy, award ceremony, action movie and war movie, with a total length of about 26.6 h. These videos were downloaded from the Internet and their details are given in Table 1. The acoustic events from the BBC audio dataset are relatively pure due to its high-quality recording environment. By contrast, the acoustic events in the TV audio dataset are of varying complexity. For example, the acoustic events from war/action movies are relatively complex, and there are some mixed acoustic events, e.g. speech + music. Both the BBC audio dataset and the TV audio dataset are popularly used as evaluation datasets in many applications, such as acoustic event classification [43] and audio retrieval [20].

Each audio file in the experimental data is first annotated by one of 10 postgraduates (the annotators) and then carefully checked by one of another 10 postgraduates (the verifiers). If the annotator and the verifier have no differing opinions about the annotations in an audio file, the annotations are considered correct; otherwise, the controversial annotations are checked by another 3 verifiers and the opinions agreed by at least 2 of them are used as the final annotations. The annotation information includes the endpoints and identities of the acoustic events, among other details.

Table 2 lists the details of the training data, which are segments of acoustic events contained in the TV audio dataset and the BBC audio dataset. The TV audio dataset contains 10 types of acoustic events, namely Applause, Cheer, Fighting, Gunshot, Laugh, Music, Speech, S+M (Speech + Music), Background noise and Other, with 24384 samples in total and a total duration of 16.36 h (58888 s). The BBC audio dataset contains 14 types of acoustic events, namely Male, Female, Baby, Applause, Laugh, Bell, Drum, Bird, Bass, Flute, Running water, Gunshot, Rain and Wind, with 6079 samples in total and a total duration of 8.85 h (31853 s). One sample of an acoustic event in Table 2 is one audio segment with a duration of 2~5 s. 80% of the data listed in Table 2 are used as the training subset, while the remaining 20% are used as the cross-validation subset for the training of the feature extractors and classifiers. Table 3 gives the information about the test data used to evaluate the performance of different classifiers and features for AED. The test data of the TV and BBC audio datasets contain 7 audio files with a total duration of 4.03 h and 8 audio files with a total duration of 1.34 h, respectively.

Table 1 The details of the corresponding videos of the TV audio dataset

Data type          Video names                                                                                        Duration
Sport              2014 and 2015 Australian Open Women's Singles Finals; 2015 US Open Men's Singles Final;            4.2 h
                   2015 Basel Open Men's Singles Final
Situation comedy   Friends (Season 5); The Big Bang Theory (Season 8); Home With Children (Season 1)                  6.5 h
Award ceremony     65th Emmy Awards; 72nd Golden Globe Awards; 87th Oscars; 32nd Hong Kong Film Awards                5.1 h
Action movie       Kill Bill; Swordfish; The Bourne Identity; Dragons Forever; Kids From Shaolin; Mr. & Mrs. Smith    5.7 h
War movie          Pearl Harbor; Saving Private Ryan; Windtalkers                                                     5.1 h

Table 2 The details of the training data

TV audio dataset:
Type         No. of samples   Duration (s)
Applause     728              2417
Cheer        1810             3689
Fighting     132              598
Gunshot      781              3019
Laugh        1581             2766
Music        1864             11463
Other        1722             3849
S+M          2411             4510
Background   4935             10846
Speech       8420             15731
ALL          24384            58888

BBC audio dataset:
Type            No. of samples   Duration (s)
Male            663              3264
Female          626              2852
Baby            473              2192
Applause        433              2342
Laugh           689              2083
Bell            608              1020
Drum            422              1901
Bird            535              3031
Bass            160              877
Flute           368              2630
Running water   218              2336
Gunshot         451              2300
Rain            215              2379
Wind            218              2646
ALL             6079             31853

The names and durations of the audio files in the test data are given in the third column of Table 3. For example, the first test file of the TV dataset is Mr. & Mrs. Smith, with a duration of 35 min and 26 s. The duration of each test audio file ranges from 19 to 54 min for the TV audio dataset, and is about 10 min for the BBC audio dataset. Each audio file contains a different number of acoustic events with different durations. For the TV audio dataset, the duration of Speech accounts for about 35.5% of the total. For the BBC audio dataset, the duration of Male speech accounts for about 26.64% of the total. The test data are different from the training data. All audio data are saved as single-channel WAV files with 16 kHz sampling frequency and 16-bit quantization.

3.2 Experimental setup

DNN training is carried out using the TNet toolkit¹ [38]. The learning rate is fixed at 0.005 as long as the single-epoch increment in cross-validation frame accuracy is higher than 0.5%. For the subsequent epochs, the learning rate is halved until the cross-validation increment of the accuracy is less than the stopping threshold of 0.1%. Here, a DNN is used as both feature extractor and classifier. The structures and parameters of the DNN feature extractor and the DNN classifier are the same, except that the former has one bottleneck layer while the latter does not. For the DNN feature extractor, the bottleneck layer has 26 neurons and is located immediately before the output layer. To model the dynamic properties of acoustic events, adjacent frames are also taken into consideration. In the first stage of the proposed framework in Fig. 1, a context of 31 frames is selected for the DNN fed by the MFCC feature stream. The neuron number of the input layer of the DNN fed by 12-dimensional MFCCs is hence 12 × 31 = 372, which is then reduced to 192 by the DCT. Gabor features inherently carry a context description, since a set of 2-D spectro-temporal filters is exploited in their extraction process; therefore, the 311-dimensional Gabor features are directly used as the other feature stream. In the second stage of the proposed framework in Fig. 1, we use the same temporal context of 31 frames. Therefore, the neuron number of the input layer of the DNN at the second stage is (26 + 26) × 31 = 1612, which is then reduced to 832 by the DCT. Finally, the 26-dimensional bottleneck features from the bottleneck layer of the second-stage DNN are used as the DAF. The neuron number of the output layer of a DNN depends on the classification task and is equal to the number of types of acoustic events; that is, it is 10 for the TV audio dataset and 14 for the BBC audio dataset. The DNN classifiers are also trained on the data listed in Table 2.

http://speech.fit.vutbr.cz/software/neural-network-trainer-tnet
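The dimensionalities quoted above (12 × 31 = 372 reduced to 192, and 52 × 31 = 1612 reduced to 832) are consistent with applying a DCT along the 31-frame trajectory of each coefficient and keeping its first 16 coefficients. The sketch below implements that interpretation; the "keep 16" choice is our inference from the stated sizes, not something the paper spells out.

```python
import numpy as np
from scipy.fftpack import dct

def splice_and_compress(feats, context=31, keep=16):
    """Stack a `context`-frame window around each frame and compress each coefficient's
    temporal trajectory with a DCT, keeping the first `keep` DCT coefficients.
    With 12-dim MFCCs: 12 * 31 = 372 -> 12 * 16 = 192 dims (Section 3.2)."""
    half = context // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode='edge')   # (n + context - 1, d)
    n, d = feats.shape
    out = np.empty((n, d * keep))
    for t in range(n):
        window = padded[t:t + context]                     # (31, d)
        traj = dct(window, axis=0, norm='ortho')[:keep]    # temporal DCT per coefficient
        out[t] = traj.T.reshape(-1)                        # (d * keep,)
    return out

# Example: 100 frames of 12-dim MFCCs -> 100 frames of 192-dim spliced features.
compressed = splice_and_compress(np.random.randn(100, 12))   # (100, 192)
```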

Table 3 The details of the test data (#F: number of audio files; PD: percentage of duration; TD: total duration)

TV corpus (#F = 7, TD = 4.03 h)
File names (duration): Mr. & Mrs. Smith (35:26); 65th Emmy Awards (54:04); Friends (19:34); Home With Children (21:40); The Big Bang Theory (22:13); Women's Singles Finals (43:00); Pearl Harbor (45:52)
Event types (PD, %): Speech 35.5; Background 15.5; Applause 2.3; Cheer 4.2; Fighting 0.8; Gunfire 3.6; Laugh 5.5; Music 22.7; Other 4.6; S+M 5.3

BBC corpus (#F = 8, TD = 1.34 h)
File names (duration): BBC_1 (10:03); BBC_2 (10:03); BBC_3 (10:05); BBC_4 (10:03); BBC_5 (10:00); BBC_6 (10:13); BBC_7 (10:00); BBC_8 (10:03)
Event types (PD, %): Male 26.64; Female 23.8; Baby 4.7; Applause 5.0; Laugh 4.5; Bell 2.3; Drum 4.2; Bird 6.5; Bass 0.9; Flute 5.7; Running water 5.0; Gunshot 5.0; Rain 4.4; Wind 1.4

Precision Rate (PR) and Recall Rate (RR) are popularly used as performance metrics for AED [36], and the F score, defined as the harmonic mean of PR and RR, is used to characterize the overall performance of different features for AED. The definitions of PR, RR and F score are given in Eqs. (9), (10) and (11), respectively:

$$PR = \frac{c}{c + w}, \qquad (9)$$

$$RR = \frac{c}{r}, \qquad (10)$$

$$F = \frac{2 \times PR \times RR}{PR + RR}, \qquad (11)$$

where r, c and w denote the total number of labeled frames, the total number of correctly detected frames and the total number of incorrectly detected frames of acoustic events in the test data, respectively.
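A minimal frame-level implementation of Eqs. (9)-(11) is sketched below; the use of None to mark frames without a labeled or detected event, and the exact counting of "incorrectly detected frames", are our reading of the definitions above.

```python
def frame_scores(ref, hyp):
    """Frame-level PR, RR and F score per Eqs. (9)-(11).
    ref and hyp are per-frame event labels (None = no event)."""
    r = sum(1 for y in ref if y is not None)                          # labeled frames
    c = sum(1 for y, z in zip(ref, hyp) if z is not None and y == z)  # correct detections
    w = sum(1 for y, z in zip(ref, hyp) if z is not None and y != z)  # incorrect detections
    pr = c / (c + w) if c + w else 0.0
    rr = c / r if r else 0.0
    f = 2 * pr * rr / (pr + rr) if pr + rr else 0.0
    return pr, rr, f
```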

Table 4 The relationship between the F score (%) and the neuron number of the hidden layer

Neurons per hidden layer:   100     300     500     700     900     1100
TV                          68.43   69.40   70.47   68.55   68.25   67.21
BBC                         86.45   87.43   89.15   88.20   87.18   86.66

Table 5 The relationship between the F score (%) and the number of hidden layers

Hidden layers:   1       2       3       4       5       6       7
TV               70.47   72.71   75.21   77.03   79.53   76.47   74.15
BBC              89.15   92.21   95.46   93.27   90.72   88.13   86.24

3.3 Results and discussions

In this subsection, we first discuss the determination of the DNN structure, i.e. the number of hidden layers and the number of neurons in each hidden layer. Then we compare the performance of different popular classifiers, e.g. DNN, HMM, GMM and SVM, evaluated on the test data using the proposed DAF. Finally, different features are compared on the test data with different SNRs.

3.3.1 Settings of DNN structure

As introduced in Section 2.1, the type of DNN used in this study is the DBN, which consists of one input layer, one output layer and several hidden layers. The number of neurons of the input layer depends on the dimension of the input features, while the number of neurons of the output layer is equal to the number of types of acoustic events being detected. The settings of the neuron numbers of both the input and output layers have been discussed in Section 3.2. In this section, we discuss how to set the number of hidden layers and the neuron number of each hidden layer. We first determine the neuron number of each hidden layer using a DNN with one hidden layer, and then increase the number of hidden layers while fixing the neuron number of each hidden layer.

Table 4 gives the relationship between the F score and the neuron number of the hidden layer. The DNN with one hidden layer is tuned on the training data of both the TV and BBC audio datasets for AED. It can be seen from Table 4 that the DNN obtains the highest F score when the neuron number of its hidden layer is equal to 500. Table 5 presents the relationship between the F score and the number of hidden layers. The DNN is tuned on the training data of both the TV and BBC audio datasets for AED. We can see from Table 5 that the DNN yields the highest F score when the number of its hidden layers is equal to 5 and 3 for the TV and BBC audio datasets, respectively. Hence, the neuron number of each hidden layer of the DNN is set to 500 for both the TV and BBC audio datasets, and the number of hidden layers of the DNN is set to 5 and 3 for the TV audio dataset and the BBC audio dataset, respectively.

Table 6 The comparison of different classifiers for AED evaluated on the test data (%)

          DNN                     HMM                     GMM                     SVM
Corpus    RR      PR      F       RR      PR      F       RR      PR      F       RR      PR      F
TV        76.10   78.60   77.33   71.46   71.58   71.52   70.09   69.90   69.99   71.85   72.24   72.04
BBC       92.33   95.37   93.83   86.74   87.83   87.28   85.18   86.89   86.03   88.25   89.32   88.78

Table 7 The comparison of different features for AED evaluated on the test data (%)

          MFCCs                   Gabor features          DAF
Corpus    RR      PR      F       RR      PR      F       RR      PR      F
TV        63.88   70.11   66.85   68.49   69.11   68.79   76.10   78.60   77.33
BBC       88.05   90.92   89.46   87.39   84.00   85.66   92.33   95.37   93.83

3.3.2 Classifiers comparison

In this section, we compare popular classifiers for AED, using the proposed DAF as their input feature. These classifiers are first trained using the training data and then evaluated on the test data for AED. The parameters of the DNN classifier are set according to Section 3.3.1. The HMM and GMM classifiers are implemented with the HTK toolkit [40], while the SVM classifier is realized with the LIBSVM toolkit [5]. The parameter settings of HMM, GMM and SVM are optimized using the training data in Table 2, and their main parameters are experimentally set as follows: HMM, 6 states with 32 Gaussian components per state; GMM, 128 Gaussian components; SVM, an RBF (Radial Basis Function) kernel with the one-vs-one multi-class method.

The experimental results are given in Table 6. Evaluated on the test data of the TV audio dataset, the F score obtained by the DNN classifier is 77.33%, which is higher than that obtained by the other classifiers (HMM, GMM and SVM). Evaluated on the test data of the BBC audio dataset, the DNN classifier yields an F score of 93.83%, which is also higher than that of the other classifiers. These results indicate that the DNN classifier outperforms the other classifiers in terms of F score for AED under the same experimental conditions. Hence, we choose the DNN classifier for comparing the performance of different features in the following experiments.
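For readers without HTK or LIBSVM, the GMM and SVM baselines can be approximated with scikit-learn as sketched below (one 128-component GMM per event class, and an RBF-kernel SVM, which scikit-learn trains in a one-vs-one fashion). The covariance type and the SVM hyperparameters are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_gmms(features_by_class, n_components=128):
    """One GMM per event class; covariance type is our assumption."""
    return {label: GaussianMixture(n_components, covariance_type='diag').fit(X)
            for label, X in features_by_class.items()}

def gmm_classify(gmms, X):
    """Assign each frame to the class whose GMM gives the highest log-likelihood."""
    labels = list(gmms)
    scores = np.stack([gmms[l].score_samples(X) for l in labels], axis=1)
    return [labels[i] for i in scores.argmax(axis=1)]

# RBF-kernel SVM baseline; C and gamma are placeholder values.
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
```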

3.3.3 Features comparison

Table 7 gives the performance achieved by different features for AED. For the TV audio dataset, DAF improves the F score by 10.48% and 8.54% (absolute) compared with MFCCs and Gabor features, respectively. For the BBC audio dataset, DAF improves the F score by 4.37% and 8.17% compared with MFCCs and Gabor features, respectively. These results show that DAF is superior to both MFCCs and Gabor features in terms of F score for AED.

Table 8 Confusion matrix (%) of AED for DAF when evaluated on the TV audio dataset (rows: reference events; columns: detected events)

             Applause  Cheer   Fighting  Gunshot  Laugh   Music   Other   S+M     Background  Speech
Applause     85.52     3.63    0.00      0.00     0.72    4.17    0.00    0.39    0.06        5.51
Cheer        3.56      68.72   0.02      1.18     1.57    5.77    0.57    0.56    0.86        17.19
Fighting     0.42      2.36    29.26     5.62     2.28    12.76   11.95   4.18    5.91        25.26
Gunshot      0.00      0.67    0.84      75.87    0.00    14.53   5.03    0.56    1.26        1.24
Laugh        0.20      1.03    0.00      0.01     86.92   0.73    0.34    0.19    0.66        9.92
Music        0.53      0.68    0.30      0.84     0.49    88.13   1.82    3.91    0.93        2.37
Other        0.83      0.99    2.50      2.98     2.28    11.56   41.71   1.28    9.56        26.31
S+M          0.06      0.25    0.11      0.45     0.07    11.81   0.35    73.32   0.24        13.34
Background   0.91      0.24    0.29      0.41     3.71    17.93   8.84    0.29    25.12       42.26
Speech       0.09      0.41    0.06      0.11     0.90    0.90    0.33    1.46    0.36        95.38

Table 9 Confusion matrix (%) of AED for DAF when evaluated on the BBC audio dataset (rows: reference events; columns: detected events)

           Applause  Baby    Bass    Bell    Bird    Drum    Female  Flute   Gunshot  Laugh   Male    Rain    Water   Wind
Applause   81.98     12.51   0.02    0.17    0.11    0.00    0.59    0.01    1.49     2.75    0.36    0.00    0.00    0.01
Baby       0.00      93.92   0.02    0.46    0.21    0.00    0.97    3.81    0.02     0.00    0.58    0.00    0.00    0.01
Bass       0.00      0.81    94.76   0.53    0.02    2.99    0.05    0.03    0.00     0.02    0.20    0.00    0.00    0.59
Bell       0.13      5.08    0.00    84.64   3.57    0.00    0.12    6.14    0.10     0.00    0.22    0.00    0.00    0.00
Bird       0.19      0.29    0.00    0.03    98.87   0.07    0.20    0.07    0.03     0.01    0.23    0.01    0.00    0.00
Drum       0.06      0.70    0.23    0.19    0.00    95.22   0.29    0.07    2.07     0.49    0.68    0.00    0.00    0.00
Female     0.21      2.71    0.02    0.06    0.05    0.30    87.95   0.43    0.42     0.37    7.46    0.00    0.01    0.01
Flute      0.00      0.53    0.00    0.74    0.45    0.11    0.14    97.79   0.08     0.00    0.09    0.00    0.00    0.07
Gunshot    0.22      0.13    0.00    0.08    0.00    0.01    0.25    0.94    97.04    0.67    0.66    0.00    0.00    0.00
Laugh      0.07      0.37    0.00    0.09    0.09    0.17    2.17    0.05    0.45     94.68   1.83    0.00    0.01    0.02
Male       0.72      1.32    0.01    0.05    0.04    0.16    7.69    0.16    1.55     0.74    87.53   0.01    0.02    0.00
Rain       0.06      0.00    0.00    0.01    0.00    0.03    0.39    0.00    0.09     0.01    0.37    99.04   0.00    0.00
Water      0.00      0.20    0.00    0.03    0.03    0.11    0.26    0.00    0.02     0.01    0.40    0.00    98.94   0.00
Wind       0.03      0.05    0.02    0.04    0.02    0.04    0.13    0.02    0.01     0.00    0.25    0.00    0.00    99.39

Fig. 4 F scores of different features (MFCC, Gabor, DAF) evaluated on the TV dataset with different SNRs (clean, 20 dB, 10 dB, 0 dB) under four noise types: Babble, Destroyerops, F16 and Factory1

In addition, the improvement obtained by DAF is more significant for the TV audio dataset (with lower SNR) than for the BBC audio dataset (with higher SNR). That is, the advantage of DAF over the other, traditional features is more obvious when the SNR of the test data is lower.

A confusion matrix contains information about the actual and predicted classifications made by a classifier, and is often used to describe the performance of a classifier on a set of test data. The definition and calculation of the confusion matrix are discussed in detail in [8]. To examine the confusions among different acoustic events obtained with DAF, Tables 8 and 9 give the confusion matrices of AED when evaluated on the TV and BBC audio datasets, respectively. It can be seen from Table 8 that the acoustic event with the lowest F score (25.12%) is Background, which is dramatically confused with Speech (confusion of 42.26%). The reason may be that Background contains many background speech segments which are similar to Speech. Similarly, Fighting is greatly confused with Speech and has a very low F score (29.26%). The F scores of all acoustic events in Table 9 are obviously higher than those in Table 8. For the BBC audio dataset, the acoustic event with the lowest F score (81.98%) is Applause, which is confused with Baby (confusion of 12.51%) and Laugh (confusion of 2.75%). The reason may be that Applause contains segments of both screaming (similar to the Baby sound) and laughter, which often occur simultaneously with Applause.

To evaluate the robustness of the different features, the audio files in the test data are artificially mixed with noises from the NOISEX-92 database [37] at SNRs from 0 to 30 dB. The noises Babble, Factory1, F16 and Destroyerops are used, and their energies concentrate mostly at lower frequencies.
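A common way to realize this mixing is to scale the noise so that the clean-to-noise power ratio matches the target SNR, as sketched below; the paper does not detail its exact mixing procedure, so this is only an illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add a noise excerpt to a clean signal at a target SNR (dB), using the usual
    power-ratio scaling rule."""
    if len(noise) < len(clean):                       # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    p_clean = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```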

Fig. 5 F scores of different features (MFCC, Gabor, DAF) evaluated on the BBC dataset with different SNRs (clean, 30 dB, 20 dB, 10 dB, 0 dB) under four noise types: Babble, Destroyerops, F16 and Factory1

This characteristic of these four types of noise is similar to that of realistic non-stationary noises. Figures 4 and 5 give the F scores obtained by the different features when evaluated on the test data of the two audio datasets with different SNRs. As the SNR decreases, the F scores obtained by all features decrease. DAF outperforms both MFCCs and Gabor features for AED at all SNRs. Tables 10 and 11 give the average F scores of the three features evaluated on the test data contaminated by the four kinds of noise. For the TV audio dataset, DAF achieves average F score improvements of 9.08%, 12.01%, 6.79% and 13.64% over MFCCs, and 5.29%, 3.56%, 4.75% and 3.26% over Gabor features, for Babble, Destroyerops, F16 and Factory1, respectively. For the BBC audio dataset, DAF improves the average F scores by 8.77%, 10.64%, 8.76% and 10.40% over MFCCs, and by 11.19%, 6.29%, 7.04% and 6.99% over Gabor features, respectively. These results highlight the noise-robust discriminative capability of the proposed DAF.

Table 10 Average F scores (%) of the three features evaluated on the TV test data contaminated by four kinds of noise

                 Babble   Destroyerops   F16     Factory1
MFCCs            53.21    47.06          46.64   43.05
Gabor features   57.00    55.51          48.68   53.43
DAF              62.29    59.07          53.43   56.69

Table 11 Average F scores (%) of the three features evaluated on the BBC test data contaminated by four kinds of noise

                 Babble   Destroyerops   F16     Factory1
MFCCs            59.51    49.82          50.99   47.72
Gabor features   57.09    54.17          52.71   51.13
DAF              68.28    60.46          59.75   58.12

4 Conclusions

This paper proposed a framework for extracting DAF using multi-stream hierarchical DNNs. We took two input feature streams and two stages of DNNs as an example. MFCCs and Gabor features were chosen as the two streams of input features, considering their different strengths in representing the characteristics of acoustic events. We first discussed the extraction of DAF in detail and then compared the performance of different classifiers (DNN, HMM, GMM and SVM) and features (DAF, MFCCs and Gabor features) for AED. Evaluated on both the BBC audio dataset and the TV audio dataset, the experimental results showed that the DNN classifier performed best among all classifiers, and that DAF was superior to both MFCCs and Gabor features for AED under different experimental conditions. The proposed framework can fuse the potentially complementary information contained in multiple input feature streams through the non-linear transformations of multi-stream hierarchical DNNs.

In the future, we will investigate the performance of DAF when increasing the number of input feature streams and the number of DNN stages. As for the robustness of the different features for AED, we found that DAF did not perform satisfactorily in low-SNR conditions, e.g. 0 dB; this could be addressed by choosing more robust input features. Finally, we will consider other kinds of networks for extracting DAF, e.g. convolutional neural networks, and hope to obtain more effective features for AED under different experimental conditions.

Acknowledgments The work was supported by the National Natural Science Foundation of China (61101160, 61271314, 61571192), the Fundamental Research Funds for the Central Universities, South China University of Technology, China (2015ZZ102), the Project of the Pearl River Young Talents of Science and Technology in Guangzhou, China (2013J2200070), the Science and Technology Planning Project of Guangdong Province (2014A050503022, 2015A010103003) and the Foundation of the China Scholarship Council (201208440078).

References

1. Atrey PK, Maddage M, Kankanhalli MS (2006) Audio based event detection for multimedia surveillance. In: Proc. of IEEE ICASSP, pp 813–816. IEEE
2. British Broadcasting Corporation (BBC), "BBC Sound Effects Library", http://www.sound-ideas.com/bbc.html, Accessed May 2015
3. Bugalho M, Portelo J, Trancoso I, Pellegrini T, Abad A (2009) Detecting audio events for semantic video search. In: Proc. of INTERSPEECH, pp 1151–1154. ISCA
4. Cakir E, Heittola T, Huttunen H, Virtanen T (2015) Polyphonic sound event detection using multi label deep neural networks. In: Proc. of International Joint Conference on Neural Networks, pp 1–7. IEEE
5. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. In: ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. ACM
6. Childers DG, Skinner DP, Kemerait RC (1977) The cepstrum: a guide to processing. In: Proceedings of the IEEE, 65(10):1428–1443. IEEE
7. Diment A, Heittola T, Virtanen T (2013) Sound event detection for office live and office synthetic AASP challenge. In: Proc. of IEEE AASP challenge on detection and classification of acoustic scenes and events. IEEE
8. Fawcett T (2011) An introduction to ROC analysis. In: Pattern Recognition Letters, 27(8):861–874. Elsevier
9. Gabor D (1946) Theory of communication. Journal of the Institution of Electrical Engineers, 93:429–457
10. Gencoglu O, Virtanen T, Huttunen H (2014) Recognition of acoustic events using deep neural networks. In: Proc. of the 22nd European Signal Processing Conference, pp 506–510. ISCA
11. Giannoulis D, Stowell D, Benetos E, Rossignol M, Lagrange M, Plumbley MD (2013) A database and challenge for acoustic scene classification and event detection. In: Proc. of EUSIPCO, pp 1–5. ISCA
12. Grezl F, Karafiat M, Kontar S, Cernocky J (2007) Probabilistic and bottle-neck features for LVCSR of meetings. In: Proc. of IEEE ICASSP, pp 757–760. IEEE
13. Heittola T, Klapuri A (2008) TUT acoustic event detection system 2007. In: Multimodal technologies for perception of humans, vol. 4625 of Lecture Notes in Computer Science, pp 364–370. Springer
14. Heittola T, Mesaros A, Virtanen T, Gabbouj M (2013) Supervised model training for overlapping sound events based on unsupervised source separation. In: Proc. of IEEE ICASSP, Vancouver, Canada, pp 8677–8681. IEEE
15. Hinton GE, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. In: IEEE Signal Processing Magazine, 29(6):82–97. IEEE
16. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18:1527–1554. MIT Press
17. Jin F, Sattar F, Krishnan S (2012) Log-frequency spectrogram for respiratory sound monitoring. In: Proc. of IEEE ICASSP, pp 597–600. IEEE
18. Lin KZ, Pwint M (2010) Structuring sport video through audio event classification. In: PCM 2010, Part I, LNCS 6297, pp 481–492. Springer
19. Loren DE, Robert KO (1968) Programming and analysis for digital time series data. United States Department of Defense, first edition, Shock and Vibration Information Center
20. Lu L, Hanjalic A (2009) Audio keywords discovery for text-like audio content analysis and retrieval. In: IEEE Trans. on Multimedia 10(1):74–85. IEEE
21. Ma L, Milner B, Smith D (2006) Acoustic environment classification. In: ACM Trans. on Speech and Language Processing, 3(2):1–22. ACM
22. McLoughlin I, Zhang HM, Xie ZP, Song Y, Xiao W (2015) Robust sound event classification using deep neural networks. In: IEEE Trans. on Audio, Speech, and Language Processing, 23(3):540–552. IEEE
23. Moritz N, Anemüller J, Kollmeier B (2011) Amplitude modulation spectrogram based features for robust speech recognition in noisy and reverberant environments. In: Proc. of IEEE ICASSP, pp 5492–5495. IEEE
24. Niessen ME, Van Kasteren TLM, Merentitis A (2013) Hierarchical modeling using automated sub-clustering for sound event recognition. In: Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp 1–4. IEEE
25. Nogueira W, Roma G, Herrera P (2013) Automatic event classification using front end single channel noise reduction, MFCC features and a support vector machine classifier. In: IEEE AASP challenge: detection and classification of acoustic scenes and events. IEEE
26. Okuyucu C, Sert M, Yazici A (2013) Audio feature and classifier analysis for efficient recognition of environmental sounds. In: Proc. of IEEE International Symposium on Multimedia, pp 125–132. IEEE
27. Phan H, Maaß M, Mazur R, Mertins A (2015) Random regression forests for acoustic event detection and classification. In: IEEE Trans. on Audio, Speech and Language Processing, 23(1):20–31. IEEE
28. Qiu A, Schreiner C, Escabi M (2003) Gabor analysis of auditory midbrain receptive fields: spectro-temporal and binaural composition. J Neurophysiol 90(1):456–476. American Physiological Society
29. Schädler MR, Kollmeier B (2012) Normalization of spectro-temporal Gabor filter bank features for improved robust automatic speech recognition systems. In: Proc. of INTERSPEECH, pp 1–4. ISCA
30. Schädler MR, Meyer BT, Kollmeier B (2012) Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J Acoust Soc Am 131(5):4134–4151. Acoustical Society of America
31. Schröder J, Cauchi B, Schädler MR, Moritz N, Adiloglu K, Anemüller J, Doclo S, Kollmeier B, Goetze S (2013) Acoustic event detection using signal enhancement and spectro-temporal feature extraction. In: IEEE AASP challenge: detection and classification of acoustic scenes and events. IEEE
32. Schröder J, Goetze S, Anemüller J (2015) Spectro-temporal Gabor filterbank features for acoustic event detection. In: IEEE/ACM Trans. on Audio, Speech, and Language Processing, 23(12):2198–2208. IEEE/ACM
33. Schröder J, Moritz N, Schädler MR, Cauchi B, Adiloglu K, Anemüller J, Doclo S, Kollmeier B, Goetze S (2013) On the use of spectro-temporal features for the IEEE AASP challenge detection and classification of acoustic scenes and events. In: Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp 1–4. IEEE
34. Temko A, Malkin R, Zieger C, Macho D, Nadeu C, Omologo M (2007) CLEAR evaluation of acoustic event detection and classification systems. Lecture Notes in Computer Science, 4122:311–322. Springer
35. Temko A, Nadeu C (2009) Acoustic event detection in meeting-room environments. In: Pattern Recognition Letters, 30(14):1281–1288. Elsevier
36. Temko A, Nadeu C, Macho D, Malkin R, Zieger C, Omologo M (2009) Acoustic event detection and classification. In: Computers in the Human Interaction Loop, pp 61–73. Springer
37. Varga A, Steeneken HJM (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. In: Speech Communication, 12(3):247–251
38. Veselý K, Burget L, Grézl F (2010) Parallel training of neural networks for speech recognition. In: Proc. of INTERSPEECH, pp 439–446. ISCA
39. Wang S, Yang X, Zhang Y, Phillips P, Yang J, Yuan T (2015) Identification of green, oolong and black teas in China via wavelet packet entropy and fuzzy support vector machine. In: Entropy, 17(10):6663–6682. MDPI
40. Young SJ, Evermann G, Gales MJF, Hain T, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland PC (2006) The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge
41. Yu D, Seltzer ML (2011) Improved bottleneck features using pretrained deep neural networks. In: Proc. of INTERSPEECH, pp 237–240. ISCA
42. Zhang Y, Chen S, Wang S, Yang J, Phillips P (2015) Magnetic resonance brain image classification based on weighted-type fractional Fourier transform and nonparallel support vector machine. Int J Imaging Syst Technol 25(4):317–327. Wiley
43. Zhang X, He Q, Feng X (2015) Acoustic feature extraction by tensor-based sparse representation for sound effects classification. In: Proc. of IEEE ICASSP, pp 166–170. IEEE
44. Zhang Y, Wu L (2012) Classification of fruits using computer vision and a multiclass support vector machine. In: Sensors, 12(9):12489–12505. MDPI

Yanxiong Li received the B.S. degree and the M.S. degree both in electronic engineering from Hunan Normal University in 2003 and 2006, respectively, and the Ph.D. degree in telecommunication and information system from South China University of Technology (SCUT) in 2009. From Sept. 2008 to Sept. 2009, he worked as a research associate with the Department of Computer Science at the City University of Hong Kong. From Sept. 2013 to Sept. 2014, he worked as a postdoctoral researcher with the Department of Computer Science at the University of Sheffield, UK. From Jul. 2016 to Aug. 2016, he worked as a visiting scholar with the Institute for Infocomm Research (I2R), Singapore. He currently works as an associate professor with the School of Electronic and Information Engineering at SCUT. His research interests include audio processing, pattern recognition.


Xue Zhang received the B.Eng. degree in electronic engineering from Xiamen University of Technology in 2015. She is currently a Master student at the School of Electronic and Information Engineering, South China University of Technology. Her research interests include audio analysis.

Hai Jin received the B.Eng. degree and M. Eng. degree in electronic and information engineering from Anhui University in 2013 and South China University of Technology in 2016, respectively. He currently works with IFLYTEK CO., LTD., China. His research interests include speech quality evaluation.


Xianku Li received the B.Eng. degree in electronic engineering from Yangtze University in 2016. He is currently a Master student at the School of Electronic and Information Engineering, South China University of Technology. His research interests include audio signal processing.

Qin Wang received the B.Eng. degree in electronic engineering from Hainan University in 2014. She is currently a Master student at the School of Electronic and Information Engineering, South China University of Technology. Her research interests include audio content analysis.


Qianhua He received the B.S. degree in physics from Hunan Normal University in 1987, the M.S. degree in medical instrument engineering from Xi'an Jiaotong University in 1990, and the Ph.D. degree in communication engineering from South China University of Technology (SCUT) in 1993. Since 1993, he has been at the Institute of Radio and Auto-control of SCUT. From 1994 to 2001, he worked with the Department of Computer Science, City University of Hong Kong, for about 3 years over 4 periods. From Nov. 2007 to Oct. 2008, he worked with the University of Washington in Seattle as a visiting scholar. He currently works as a professor with the School of Electronic and Information Engineering at SCUT. His research interests include speech recognition, speaker recognition and its security, and optimal algorithm design.

Qian Huang received the B.Eng. degree in electronic engineering from Changsha University of Technology in 1989, and received the M.Eng. degree in electronic engineering from South China University of Technology (SCUT) in 1992, and received the Ph.D. degree in electronic and information engineering from SCUT in 2005. From June 1997 to June 1998, she worked as a research associate with the City University of Hong Kong. From Oct. 2006 to Oct. 2007, she worked as a visiting scholar with the University of Bradford, UK. She currently works as a professor with the School of Electronic and Information Engineering at SCUT. Her research interests include audio and image processing, pattern recognition.
