An Online Learning Algorithm for Voice Activation Detection Based on a Pretrained Online Extreme Learning Machine

Tianle Zhang, Yunlei Yang, Muzhou Hou*, Hongli Sun, Zhong Gao, Futian Weng, Zheng Wang
School of Mathematics and Statistics, Central South University, Changsha 410083, China
[email protected]

Jianshu Luo
College of Science, National University of Defense Technology, Changsha 410073, China
[email protected]

CSAE '18, October 22–24, 2018, Hohhot, China. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6512-3/18/10. https://doi.org/10.1145/3207677.3278024

ABSTRACT
There is often a difference between the training corpus and the test corpus in the noise they contain, which limits the practical application of speech endpoint detection. We propose a generalized regularized online sequential extreme learning machine with a forgetting factor (GR-OSELM-FF) for voice activation detection, so that the model can adapt to differences between test and training samples. Practical usability is our driving motivation: the proposed model should be easily adaptable to new conditions. When a new voice stream arrives in the test or deployment phase, the model can directly adjust its output weights. To overcome the ELM's vulnerability to random hidden layer parameters, we use an extreme learning machine-based autoencoder (ELM-AE) to initialize the model parameters instead of using random initialization. The experimental results show that the models pretrained with ELM-AE, which extracts the latent information in the data, achieve better performance. They also show that the proposed algorithm maintains good accuracy and omission rates in different SNR noise environments and on real-world voice samples.

CCS CONCEPTS • Theory of computation → Theory and algorithms for application domains; Machine learning theory; Online learning theory

KEYWORDS Voice activation detection, online extreme learning machine, extreme learning machine-based autoencoder, pretraining.

1 INTRODUCTION
Voice activation detection (VAD) [1], an important preprocessing step in speech recognition, reduces the amount of computation and storage required and can greatly improve the accuracy of speech recognition. VAD aims to distinguish speech from background noise in complex real-world environments and to determine the beginning and ending points of speech.

At present, traditional algorithms for endpoint detection fall into two categories: feature-based and model-based. In the 1970s, Rabiner [2] used the short-term energy and short-time zero-crossing rate of sound segments as features to detect speech start and end points, and for a long time afterwards feature-based algorithms dominated VAD. These methods extract feature parameters in the time and frequency domains and, exploiting the different distributions of speech and nonspeech, set a threshold to distinguish the two; the main features used to set the threshold are short-term energy, the short-term zero-crossing rate, linear prediction cepstrum coefficients (LPCC) [3], etc. However, feature-based methods perform poorly in complex real environments with a low signal-to-noise ratio (SNR).

Model-based speech endpoint detection algorithms are therefore receiving increasing attention. The most representative are statistical modeling methods such as the Gaussian mixture model (GMM) [4], the support vector machine (SVM) [5,6], and neural network methods [7-13]. However, neural network-based VADs are still far from practical deployment due to their high computational cost and iterative complexity.

One remaining problem in VAD is that deployed systems often cannot cope with new audio streams. In a new environment, or given new labeled samples, it is difficult to adjust the parameters to adapt while maintaining performance. For example, a VAD trained on a low-SNR corpus may be applied in a high-SNR environment; such an algorithm needs to adjust its parameters easily when faced with the new samples. The aim of this study is to address this problem.

In this paper, to address these shortcomings of existing technologies, we propose an endpoint detection algorithm based on the extreme learning machine framework. The proposed algorithm uses an OSELM with a forgetting factor and generalized regularization (GR-OSELM-FF), so that, relative to traditional iterative neural network-based VAD, the output weights can be adjusted easily and adapted quickly to new environmental noise. The forgetting factor reduces the impact of the source corpora and mitigates overfitting. The ELM is a special neural network with random hidden layer parameters, so its performance is easily affected by those random parameters and is not stable. We therefore use an extreme learning machine-based autoencoder (ELM-AE), an unsupervised learning method, to pretrain GR-OSELM-FF, because the latent information in speech is reflected in the output weights of the ELM-AE.

The contents of this paper are as follows. Section 2 describes voice activity detection. Section 3 describes the extreme learning machine-based autoencoder. Section 4 describes the generalized regularized online sequential extreme learning machine with a variable forgetting factor. Section 5 describes the proposed VAD algorithm. Section 6 reports the experimental setup and results. Finally, Section 7 summarizes the work and outlines future research directions.


2 VOICE ACTIVITY DETECTION
At present, many effective VAD algorithms have been applied in practice and show a certain robustness to background noise. However, these traditional classification algorithms are usually heuristic designs that rely on a large number of empirically set thresholds, and their performance degrades under low-SNR conditions. How to obtain a simple model that is robust in various noise environments therefore remains a focus of research in this field.

2.1 Feature Extraction
Voice feature selection is closely related to the performance of a VAD algorithm. More features generally improve VAD performance, but they also seriously affect the efficiency of identification and computation. In this paper, the combination of MFCC and the differential coefficients of MFCC (DC-MFCC) is used as the feature input.

MFCC: The Mel frequency describes the correspondence between the physical speech frequency and the perceived frequency, which conforms better to the auditory characteristics of the human ear. Its expression is

f_{Mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)

The flowchart for extracting MFCC is shown in Fig. 1, and the specific process is as follows:
Step 1: A discrete Fourier transform (DFT) is performed on the preprocessed speech vectors.
Step 2: The resulting discrete spectrum is filtered by a bank of P triangular filters to obtain a set of coefficients m_j.
Step 3: A discrete cosine transform maps the filter outputs into the cepstrum domain:

C_i = \sqrt{\frac{2}{P}} \sum_{j=1}^{P} m_j \cos\left(\frac{\pi i}{P}(j - 0.5)\right), \quad i = 1, 2, \ldots, N \qquad (2)

Figure 1: Flowchart for extracting MFCC.

DC-MFCC: The standard cepstrum parameter MFCC reflects only the static characteristics of the speech, while the dynamic characteristics of speech can be described by the differential spectrum of these static features. Experiments show that combining the dynamic and static features effectively improves the recognition performance of the system. The difference parameters can be calculated using the following formula:

d_i(n) = \frac{\sum_{j=-N}^{N} j \, c_{i+j}(n)}{\sqrt{\sum_{k=-N}^{N} k^2}} \qquad (3)

where d_i(n) represents the nth coefficient of the first-order difference MFCC feature vector of the ith speech frame and c_i(n) represents the nth coefficient of the MFCC feature vector of the ith frame. By substituting the results of the above equation back into it, the second-order differential parameters can be calculated.

2.2 Preprocessing of VAD and Parameters

Sound signals need to be preprocessed before voice activation detection; preprocessing includes pre-emphasis, windowing, and framing. Pre-emphasis compensates for the high-frequency components of the speech signal. Windowing and framing exploit the short-term stationarity of speech to ensure that each input segment is smooth and stable. The preprocessing parameters and the MFCC features are shown in Table 1.

Table 1: Parameters of Preprocessing and MFCC Features

Parameter                  Value
Sample frequency           16 kHz
Pre-emphasis coefficient   0.95
Frame length               16 ms
Frame shift                10 ms
Window function            Hamming
MFCC                       13
DC-MFCC                    3
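A minimal sketch of this preprocessing under the Table 1 settings follows; the helper name and the simple truncating framing are ours, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def preprocess(signal, fs=16000, pre_emphasis=0.95,
               frame_len_ms=16, frame_shift_ms=10):
    """Pre-emphasis, framing, and Hamming windowing (Table 1 settings)."""
    # Pre-emphasis boosts high frequencies: y[t] = x[t] - a * x[t-1].
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    frame_len = int(fs * frame_len_ms / 1000)      # 256 samples at 16 kHz
    frame_shift = int(fs * frame_shift_ms / 1000)  # 160 samples
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len)
```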

2.3 Voice Activity Detection
VAD includes a training phase and a testing phase; a schematic diagram is shown in Fig. 2. In both phases, the voice signal undergoes the preprocessing of Section 2.2 (pre-emphasis, windowing, and framing, with the parameters in Table 1). In the training phase, the features of the preprocessed sound signal are first extracted frame by frame, including the MFCC and its first-order and second-order differential coefficients. The extracted features are then normalized, and the normalized features and their labels are used for supervised training, yielding a trained neural network-based VAD model. In the testing phase, the same frame-by-frame feature extraction and normalization are performed on the preprocessed speech signals; the features are then fed to the trained network, and the detection results are output.

Figure 2: Schematic diagram of VAD.
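Composed from the sketches above, the front end of Fig. 2 might look as follows; the z-score normalization is an assumption, since the paper does not specify its normalization scheme.

```python
import numpy as np

def make_features(signal, fs=16000):
    """Front end of Fig. 2: preprocessing, MFCC, first- and second-order
    deltas, then normalization (a plain z-score stand-in here)."""
    frames = preprocess(signal, fs)           # Section 2.2 sketch
    c = mfcc_from_frames(frames, fs)          # Section 2.1 sketch
    d1 = delta_features(c)                    # Eq. (3)
    d2 = delta_features(d1)                   # second-order deltas
    feats = np.hstack([c, d1, d2])
    # In a real pipeline, normalize with statistics from the training set.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```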

3 ELM-BASED AUTOENCODER
The extreme learning machine (ELM) is a learning algorithm based on the structure of single-hidden layer feedforward networks (SLFNs) that was proposed by G. B. Huang et al. in 2006 and later extended to "generalized" SLFNs in which the hidden layer units need not be neuron-like [14]. The hidden neurons' parameters are randomly generated without iterative tuning and are independent of the data [15]. The network topology of the ELM is shown in Fig. 3.

Figure 3: The network topology of ELM.

Given a training set with N distinct samples

\aleph = \{ (x_j, t_j) \mid x_j \in \mathbb{R}^n, t_j \in \mathbb{R}^m, j = 1, \ldots, N \},

where x is the predictive variable with dimension n and t is the objective variable with dimension m, the mathematical model of an SLFN with n hidden neurons and an activation function G(\cdot) is described as

\sum_{i=1}^{n} \beta_i G(w_i, b_i, x_j), \quad j = 1, 2, \ldots, N \qquad (4)

where b_i \in \mathbb{R} is the randomly assigned bias of the ith hidden node and w_i \in \mathbb{R}^n is the randomly assigned input weight vector connecting the input nodes to the ith hidden node. G(w_i, b_i, x_j) is the output of the ith hidden node with respect to the input sample x_j. When the SLFN approximates the data exactly, that is, when the error between the output \hat{t}_j and the actual t_j is zero, the relationship is

\sum_{i=1}^{n} \beta_i G(w_i, b_i, x_j) = t_j, \quad j = 1, 2, \ldots, N \qquad (5)

Eq. (5) can be written as

H \beta = T \qquad (6)

where

H = [h_1^T \; h_2^T \; \cdots \; h_N^T]^T = \begin{bmatrix} G(w_1, b_1, x_1) & \cdots & G(w_n, b_n, x_1) \\ \vdots & & \vdots \\ G(w_1, b_1, x_N) & \cdots & G(w_n, b_n, x_N) \end{bmatrix}_{N \times n} \qquad (7)

\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_n^T \end{bmatrix}_{n \times m}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} \qquad (8)

The output weights can be computed as the least-squares solution of the SLFN in Eq. (6), given by

\beta = H^{\dagger} T \qquad (9)

where H^{\dagger} is the Moore-Penrose generalized inverse of the matrix H. If H^T H is nonsingular, Eq. (9) can be written as

\beta = H^{\dagger} T = (H^T H)^{-1} H^T T \qquad (10)

In addition, we add an l2 regularization constraint to the loss function of the ELM, which improves generalization and robustness. Eq. (10) then becomes

\beta = \begin{cases} \left( H^T H + \frac{I}{C} \right)^{-1} H^T T & \text{if } N \ge L \\ H^T \left( H H^T + \frac{I}{C} \right)^{-1} T & \text{if } N < L \end{cases} \qquad (11)

where C is the regularization parameter, I is the identity matrix, N is the number of samples, and L is the number of hidden layer neurons.

An autoencoder network is a special neural network usually used for unsupervised learning, one of the most promising fields in deep learning. Its learning phase can be divided into two stages, encoding and decoding: in the encoding stage, the input data are mapped to a higher- or lower-dimensional space, and in the decoding stage that representation reconstructs the original input data. The desired output of the autoencoder network is set to be the same as the original input data x_j \in \mathbb{R}^n (T = X). Through this special structure, the autoencoder network can explore the implicit information in the original input data and encode it into the output weights. Autoencoder networks are usually trained with backpropagation, but this is inefficient because of the required iterations, especially in deep structures with multiple hidden layers. Therefore, we train the autoencoder network with the ELM training algorithm instead of backpropagation to improve training efficiency. A one-hidden-layer ELM-based autoencoder network is shown in Fig. 4.

Figure 4: The network topology of an ELM-based autoencoder.

In fact, the original data can provide valuable information about the network parameters, so we can use the output weights of the ELM-AE to pretrain the ELM. The output weights of the ELM-AE are used as the input weights of the subsequent network, such as an ELM classifier, to achieve better performance; this is a data-driven method. The algorithm for pretraining the ELM by the ELM-AE is shown in Fig. 5.

Figure 5: The flow chart for pretraining the ELM by ELM-AE.
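To make Eqs. (4)-(11) and the pretraining idea of Fig. 5 concrete, here is a minimal NumPy sketch under stated assumptions: a sigmoid activation, the N >= L branch of Eq. (11), no hidden bias in the classifier stage, and no orthogonalization in the ELM-AE. The function names are ours, not the authors'.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_ae_weights(X, L, C=1e3, rng=None):
    """ELM-AE: random hidden layer with target T = X; returns the output
    weights beta (L x n) from Eq. (11), whose transpose seeds the ELM
    input weights as in Fig. 5."""
    rng = np.random.default_rng(rng)
    n = X.shape[1]
    W = rng.standard_normal((n, L))      # random encoding weights
    b = rng.standard_normal(L)
    H = sigmoid(X @ W + b)
    # Regularized least squares, N >= L case of Eq. (11).
    return np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ X)

def pretrained_elm_fit(X, T, L, C=1e3, rng=None):
    """Train an ELM whose input weights come from the ELM-AE instead of
    random initialization."""
    W = elm_ae_weights(X, L, C, rng).T   # (n, L): data-driven input weights
    H = sigmoid(X @ W)
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)
    return W, beta
```

In the pipeline of Section 5, the weights returned by `elm_ae_weights` on the speech training set would initialize the hidden layer of the online model described next.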

4 GENERALIZED REGULARIZED ONLINE SEQUENTIAL EXTREME LEARNING MACHINE WITH VARIABLE FORGETTING FACTOR
Unfortunately, the ELM is a batch learning algorithm. In practical applications such as solar radiation forecasting, stock price forecasting, and weather forecasting, time series data often arrive as a data stream that may never end, and the underlying distribution and trend of the data change continuously with time. When new data are obtained, traditional artificial intelligence algorithms such as the ANN, SVM, and ELM must gather both old and new data and retrain, so they struggle with "big data" and with nonstationary, time-varying time series prediction problems.

To effectively solve the nonstationary time series problem of a data stream, the online sequential extreme learning machine (OSELM) was developed. OSELM learning consists of a preliminary ELM batch learning process followed by a continuous one-by-one or block-by-block learning process. Since OSELM computes a matrix inverse during each update, once the autocorrelation matrix of the hidden layer output matrix becomes singular or ill-conditioned, the generalization ability of OSELM degrades severely. Huynh and Won [16] therefore combined Tikhonov regularization with OSELM and proposed a regularized OSELM to improve the stability and generalization of the algorithm. To strengthen the role of new samples during online learning, the concept of the forgetting factor [17] was introduced into OSELM: old samples are gradually forgotten so that the updated predictive model stays closer to the current state of the time-varying system.

Theoretically, the R-ELM-FF algorithm is equivalent to minimizing the following least-squares loss function with a forgetting factor and l2 regularization:

J_{FR}(\beta_k) = \sum_{i=1}^{k} \lambda^{k-i} \| t_i - h_i \beta_k \|^2 + \lambda^k \delta \| \beta_k \|^2 \qquad (12)

where \lambda is the forgetting factor, which weights the old and new samples, and \delta is the regularization parameter, which improves the stability and generalization ability of the algorithm. Solving Eq. (12) by the recursive least squares method with the recursive calculation formula of [18], \beta_k can be deduced as

P_k = \frac{1}{\lambda}\left( P_{k-1} - \frac{P_{k-1} h_k^T h_k P_{k-1}}{\lambda + h_k P_{k-1} h_k^T} \right), \qquad \beta_k = \beta_{k-1} + P_k h_k^T \left( t_k - h_k \beta_{k-1} \right) \qquad (13)

However, as time k increases, \lambda^k \delta \|\beta_k\|^2 decreases exponentially toward zero, which causes the regularization to fade to failure. GR-ELM-FF introduces a constant-coefficient regularization term \delta \|\beta_k\|^2 into the cost function to replace the exponential regularization term \lambda^k \delta \|\beta_k\|^2. Its cost function is expressed as follows:

J_{FR}(\beta_k) = \sum_{i=1}^{k} \lambda^{k-i} \| t_i - h_i \beta_k \|^2 + \delta \| \beta_k \|^2 \qquad (14)

The GR-OSELM-FF works as follows. Assume that the data samples arrive as a data stream, the activation function is G(w, b, x), the number of hidden neurons is n, the regularization parameter is \delta, and the forgetting factor is \lambda.

Step 1: Initialization. Given the initial training subset \aleph_{k-1} = \{ (x_j, t_j) \mid x_j \in \mathbb{R}^n, t_j \in \mathbb{R}^m, j = 1, \ldots, k-1 \}, proceed as follows:
1) Generate the hidden layer neuron parameters (w_i, b_i), i = 1, 2, \ldots, n, randomly;
2) Compute the hidden layer output matrix by Eq. (7): H_{k-1} = [h_1^T \; h_2^T \; \cdots \; h_{k-1}^T]^T;
3) Calculate the initial output weight matrix \beta_{k-1} = P_{k-1} H_{k-1}^T T_{k-1}, where P_{k-1} = \left( H_{k-1}^T H_{k-1} + \delta I \right)^{-1} and T_{k-1} = [t_1 \; t_2 \; \cdots \; t_{k-1}]^T.

Step 2: Online learning and forecasting. Perform the following steps for each new sample (x_k, t_k):
1) Compute the hidden layer output vector for the new input x_k: h_k = [ G(w_1, b_1, x_k) \; \cdots \; G(w_n, b_n, x_k) ];
2) Calculate the network output, that is, the forecasted value of t_k: \hat{t}_k = h_k \beta_{k-1};
3) Update the output weight with the actual label t_k:

P_k^* = \frac{1}{\lambda}\left[ P_{k-1} - \frac{(1-\lambda)\delta}{\lambda} P_{k-1} \left( I + \frac{(1-\lambda)\delta}{\lambda} P_{k-1} \right)^{-1} P_{k-1} \right]

P_k = P_k^* - \frac{P_k^* h_k^T h_k P_k^*}{1 + h_k P_k^* h_k^T}

\beta_k = \beta_{k-1} + P_k h_k^T \left( t_k - h_k \beta_{k-1} \right) - (1-\lambda)\delta P_k \beta_{k-1}

4) Return to Step 2.
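A NumPy sketch of the algorithm above follows. Since the Step 2 update formulas were reconstructed from a garbled source, this class is an illustration to be checked against [16-18], not a verified implementation; the class name, default hyperparameters, and sigmoid activation are our choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GROSELMFF:
    """Sketch of GR-OSELM-FF (Eq. (14) with the recursive updates above)."""

    def __init__(self, W, b, delta=1e-2, lam=0.99):
        self.W, self.b = W, b              # fixed random (or ELM-AE) hidden layer
        self.delta, self.lam = delta, lam  # regularization and forgetting factor

    def init_batch(self, X, T):
        # Step 1: regularized batch solution on the initial training subset.
        H = sigmoid(X @ self.W + self.b)
        L = H.shape[1]
        self.P = np.linalg.inv(H.T @ H + self.delta * np.eye(L))
        self.beta = self.P @ H.T @ T

    def update(self, x, t):
        # Step 2: forecast, then fold the labeled sample into the weights.
        lam, delta = self.lam, self.delta
        h = sigmoid(x[None, :] @ self.W + self.b)   # (1, L)
        y_hat = h @ self.beta                       # forecast of t
        L = self.P.shape[0]
        c = (1.0 - lam) * delta / lam
        P_star = (self.P - c * self.P @ np.linalg.solve(
            np.eye(L) + c * self.P, self.P)) / lam
        P_new = P_star - (P_star @ h.T @ h @ P_star) / (1.0 + h @ P_star @ h.T)
        self.beta = (self.beta + P_new @ h.T @ (np.atleast_2d(t) - y_hat)
                     - (1.0 - lam) * delta * P_new @ self.beta)
        self.P = P_new
        return y_hat
```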

5 VOICE ACTIVITY DETECTION WITH PRETRAINED GR-OSELM-FF BY ELM-AE
Based on the speech training set, we construct an ELM-AE network to mine the potential information in the speech into its output weights. We use the output weights of the ELM-AE to initialize the input layer parameters in the initialization network of GR-OSELM-FF. We then train this initialized network on the speech and labels of the training set, and perform online learning and forecasting on the new corpus segment by segment. The flow diagram of VAD by pretraining GR-OSELM-FF is shown in Fig. 6.

Figure 6: The flow diagram of VAD by pretraining GR-OSELM-FF.

6 EXPERIMENTS AND RESULTS
The TIMIT speech library, the NoiseX-92 noise library, and a real-world corpus (collected from a realistic paging microphone) were used to evaluate the performance of the proposed method for voice activation detection. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) and NOISEX-92 are publicly available voice libraries. The original NOISEX-92 data were sampled at 19.98 kHz and stored as 16-bit integers; they were downsampled to 16 kHz before our experiments. Four kinds of noise signals (white, factory, babble, and pink) from the NOISEX-92 database were added to the clean speech signals of TIMIT at SNRs of 15 dB, 10 dB, 5 dB, 0 dB, and -5 dB.

The TIMIT data are divided into training and test sets in a ratio of 7:3. We used two-thirds of the training set for training and the rest as a validation set. To verify that the pretrained model performs better, we tested models with random input parameters and with pretrained parameters on the training and validation sets, repeating each test 50 times to compare the mean and variance of the accuracy of the two models. The hyperparameters (the number of hidden neurons L, the forgetting factor \lambda, and the regularization parameter \delta) were tuned by a grid-search strategy, and the sigmoid function was used as the activation function of the hidden layer. The performance comparison of the pretrained and non-pretrained models on the validation set is shown in Table 2.

To test the algorithm's robustness to noise, we randomly intercepted the four kinds of noise signals (white, factory, babble, and pink) at different signal-to-noise ratios and added them to speech signals from the TIMIT test set. We used the mixed noisy signals, the pure speech signals, and the real-world corpus as online learning and recognition data. We used the accuracy, omission rate, and false-alarm rate as the performance evaluation indexes of the algorithm (a sketch of these frame-level metrics follows). As a preprocessing stage for speech recognition, VAD is critical to the performance of human voice recognition and extraction. The accuracy reflects how well the algorithm recognizes voiced and unvoiced speech. The omission rate measures voiced speech that was not extracted, which lowers the accuracy of subsequent speech recognition. The false-alarm rate measures environmental noise recognized as voice, which increases the storage of voice data and significantly decreases the efficiency of speech recognition. The experimental results are shown in Table 3. Fig. 7 compares the results on the last utterance after online learning on the TIMIT test set, demonstrating excellent performance and learning ability.
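The paper does not define the three indexes formally; the following sketch assumes the usual frame-level definitions.

```python
import numpy as np

def vad_metrics(pred, truth):
    """Frame-level VAD metrics used in Section 6.

    pred, truth: binary arrays (1 = speech frame, 0 = noise frame).
    Returns accuracy, omission rate (speech frames missed), and
    false-alarm rate (noise frames flagged as speech), in percent.
    """
    pred, truth = np.asarray(pred), np.asarray(truth)
    accuracy = 100.0 * np.mean(pred == truth)
    omission = 100.0 * np.sum((truth == 1) & (pred == 0)) / max(np.sum(truth == 1), 1)
    false_alarm = 100.0 * np.sum((truth == 0) & (pred == 1)) / max(np.sum(truth == 0), 1)
    return accuracy, omission, false_alarm
```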

Table 2: Performance Comparison in the Validation Set

Measure        | Pretrained GR-OSELM-FF | GR-OSELM-FF | Pretrained OSELM | OSELM  | Pretrained ELM | ELM
Accuracy (%)   | 91.33                  | 90.30       | 89.51            | 89.04  | 89.38          | 89.17
Accuracy Std.  | 0.0017                 | 0.002       | 0.0019           | 0.0034 | 0.0018         | 0.0032
Accuracy Min.  | 91.01                  | 89.92       | 89.15            | 88.34  | 89.02          | 88.84
Accuracy Max.  | 92.10                  | 90.6        | 90.11            | 90.06  | 90.1           | 89.57

Figure 7: The voice activation detection comparison on the last utterance of the TIMIT test set.

Table 3: Performance comparison when the TIMIT training set is used for training. The test data include a real-world corpus (collected from a realistic paging microphone) and the TIMIT test set mixed with noise (white, pink, babble, factory) at different SNRs.

Noise / SNR (dB)   | ELM: Acc, OR, F-A (%) | OSELM: Acc, OR, F-A (%) | GR-OSELM-FF: Acc, OR, F-A (%)
Pure Voice         | 73.21, 18.71, 68.08   | 75.58, 16.84, 81.94     | 93.03, 4.51, 43.39
Real-World Voice   | 73.29, 26.71, 0.00    | 76.93, 20.71, 2.36      | 79.43, 19.29, 4.28
White    -5        | 64.02, 30.25, 59.43   | 80.87, 9.38, 73.10      | 86.74, 3.15, 49.98
White     0        | 77.3, 13.59, 68.11    | 87.78, 1.06, 76.84      | 87.60, 3.32, 35.3
White     5        | 64.13, 30.1, 69.62    | 78.57, 12.3, 70.66      | 89.08, 3.16, 39.79
White    10        | 77.66, 13.70, 67.86   | 87.87, 1.86, 70.01      | 90.29, 2.98, 30.76
White    15        | 77.04, 14.40, 67.93   | 87.88, 2.14, 67.85      | 90.35, 3.24, 20
Babble   -5        | 80.95, 9.15, 74.16    | 87.36, 1.62, 76.17      | 87.55, 4.72, 41.10
Babble    0        | 79.62, 10.71, 73.75   | 86.53, 2.07, 70.54      | 89.15, 4.78, 26.86
Babble    5        | 74.16, 17.82, 66.69   | 81.98, 10.89, 52.10     | 89.01, 6.71, 23.44
Babble   10        | 72.28, 20.19, 64.84   | 81.44, 11.75, 50.36     | 89.47, 6.80, 28.79
Babble   15        | 76.15, 15.06, 70.56   | 89.35, 2.76, 50.43      | 92.28, 4.07, 25.39
Factory  -5        | 85.04, 4.19, 76.63    | 87.67, 1.00, 78.31      | 84.48, 6.06, 47.27
Factory   0        | 83.77, 5.77, 75.60    | 87.62, 1.30, 76.40      | 86.52, 4.06, 44.95
Factory   5        | 81.94, 8.03, 74.11    | 87.75, 1.61, 73.01      | 89.53, 4.19, 45.27
Factory  10        | 80.27, 10.15, 72.39   | 87.91, 1.96, 68.95      | 88.37, 4.34, 36.96
Factory  15        | 78.74, 12.08, 70.92   | 88.00, 2.3, 65.63       | 88.91, 4.35, 32.18
Pink     -5        | 84.32, 5.09, 75.95    | 87.87, 0.74, 78.53      | 86.53, 3.08, 52.28
Pink      0        | 83.20, 6.56, 74.51    | 87.83, 0.95, 77.29      | 86.84, 3.28, 58.15
Pink      5        | 71.75, 20.39, 67.81   | 87.81, 1.24, 75.23      | 85.68, 5.54, 50.96
Pink     10        | 80.25, 10.31, 71.35   | 87.76, 1.61, 72.86      | 87.7, 3.78, 46.91
Pink     15        | 78.76, 12.16, 70.11   | 87.65, 2.00, 70.85      | 87.86, 4.10, 43.12
(Notes: Acc, accuracy; OR, omission rate; F-A, false-alarm rate)

7 CONCLUSIONS
In this paper, we used a variant of the online extreme learning machine to build a speech endpoint detection model and used the ELM-AE to overcome the performance degradation caused by the ELM's random hidden layer parameters. This is the first work to use an online extreme learning machine for voice endpoint detection. The experimental results showed that the ELM-AE compensates for the ELM's sensitivity to random weights and biases. The algorithm is not only easy to implement but also more robust to different noise signals, maintaining good recognition performance in different noise environments.

In the future, our work will focus on two points. 1) We will consider manifold algorithms in semi-supervised and unsupervised learning to exploit a small number of labeled samples together with unlabeled samples in the online learning phase; as is well known, sample labeling consumes considerable human and financial resources. 2) We will focus on transfer learning, using unsupervised domain-adaptive methods, such as training large-scale deep models over many noise types with massive computing power, to resolve the differences between the training source domain and the target domain. This will improve the robustness of the algorithm when it is trained in one scenario and applied in a new noisy environment.

ACKNOWLEDGMENTS
This work was partially supported by the National Natural Science Foundation of China (61375063, 61773404, 11301549, and 11271378) and by the Fundamental Research Funds for the Central Universities of Central South University (2018zzts322).

REFERENCES
[1] Wu B F and Wang K C. 2005. Robust endpoint detection algorithm based on the adaptive band-partitioning spectral entropy in adverse environments. IEEE Transactions on Speech & Audio Processing, 13(5), 762-775.
[2] Rabiner L R and Schafer R W. 1978. Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ.
[3] Yang X, Tan B, Ding J, et al. 2010. Comparative Study on Voice Activity Detection Algorithm. In International Conference on Electrical and Control Engineering. IEEE Computer Society, 599-602.
[4] Sohn J, Kim N S, and Sung W. 1999. A Statistical Model-Based Voice Activity Detection. IEEE Signal Processing Letters, 6(1), 1-3.
[5] Dong E, Liu G, Zhou Y, et al. 2003. Applying support vector machines to voice activity detection. Journal of China Institute of Communications, 2, 1124-1127.
[6] Feng Z, Feng J, and Dai F. 2016. The Application of Extreme Learning Machine and Support Vector Machine in Speech Endpoint Detection. International Journal of Control & Automation, 9(12), 191-202.
[7] Hussain A, Samad S A, and Fah L B. 2000. Endpoint detection of speech signal using neural network. In Proceedings of TENCON 2000, vol. 1. IEEE, 271-274.
[8] Wang Q, Du J, Bao X, et al. 2015. A Universal VAD Based on Jointly Trained Deep Neural Networks. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden.
[9] Wang L, Zhang C, Woodland P C, et al. 2016. Improved DNN-based segmentation for multi-genre broadcast audio. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 5700-5704.
[10] Zhang X L. 2014. Unsupervised domain adaptation for deep neural network-based voice activity detection, 6864-6868.
[11] Lin Y, Liang Y, Yoshida S, et al. 2016. A Hybrid Algorithm of Extreme Learning Machine and Sparse Auto-Encoder. In International Conference on Smart Computing and Communication. Springer, Cham, 194-204.
[12] Hussain A, Samad S A, and Fah L B. 2000. Endpoint detection of speech signal using neural network. In Proceedings of TENCON 2000, vol. 1. IEEE, 271-274.
[13] Zhang X L and Wu J. 2013. Deep Belief Networks Based Voice Activity Detection. IEEE Press.
[14] Huang G B, Zhu Q Y, and Siew C K. 2006. Extreme learning machine: Theory and applications. Neurocomputing, 70(1), 489-501.
[15] Yun L Y, Zhang T L, Hou M Z, et al. 2018. Application Research on a Neural Network Method for a Class of Two-point Boundary Value Problems of Differential Equation. Journal of Xuzhou Institute of Technology (Natural Sciences Edition), 33(02), 57-63.
[16] Huynh H T and Won Y. 2011. Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks. Pattern Recognition Letters, 32(14), 1930-1935.
[17] Zhao J, Wang Z, and Dong S P. 2012. Online sequential extreme learning machine with forgetting mechanism. Neurocomputing, 87(15), 79-89.
[18] Xian Z. 2011. Selective forgetting extreme learning machine and its application to time series prediction. Acta Physica Sinica, 60(8), 080504.
