IEEE International Symposium on Circuits and Systems, Island of Kos, Greece, pp. 3153-3156, May 21-24, 2006

Minimum mean squared error time series classification using an echo state network prediction model

Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida
Gainesville, FL, USA 32611
Email: [email protected]

Abstract— The echo state network (ESN) has been recently proposed as an alternative recurrent neural network model. An ESN consists of a reservoir of conventional processing elements, which are recurrently interconnected with untrained random weights, and a readout layer, which is trained using linear regression methods. The key advantage of the ESN is the ability to model systems without the need to train the recurrent weights. In this paper, we use an ESN to model the production of speech signals in a classification experiment using isolated utterances of the English digits “zero” through “nine.” One prediction model for each digit was trained using frame-based speech features (cepstral coefficients) from all train utterances, and the readout layer consisted of several linear regressors which were trained to target different portions of the time series using a dynamic programming algorithm (Viterbi). Each novel test utterance was classified with the label from the digit model with the minimum mean squared prediction error. Using a corpus of 4130 isolated digits from 8 male and 8 female speakers, the highest classification accuracy attained with an ESN was 100.0% (99.1%) on the train (test) set, compared to 100% (94.7%) for a hidden Markov model (HMM). HMM performance increased to 100.0% (99.8%) when context features (first- and second-order temporal derivatives) were appended to the cepstral coefficients. The ESN offers an attractive alternative to the HMM because of the ESN’s simple train procedure, low computational requirements, and inherent ability to model the dynamics of the signal under study.

I. INTRODUCTION

Nonlinear dynamic models have been applied to many problems, including system identification [1], [2], prediction [3], channel equalization [4], detection [5], and classification [6]–[8]. Motivations for the use of nonlinear dynamic models over other methods include superior performance in various applications (e.g., phoneme recognition [9], [10], and Mackey-Glass prediction [11]) as well as biological inspiration [12], [13]. A nonlinear dynamic model is often formulated as a recurrent neural network (RNN) in order to employ one of the available RNN train methods [14].

A recent addition to the class of recurrent neural networks is Jaeger’s echo state network (ESN) [11], which uniquely divides the parameters of the model into two parts: 1) a reservoir of sigmoidal processing elements fully interconnected with a static (untrained) random weight matrix, and 2) a readout which projects the reservoir state values onto a linear regressor. The reservoir weight matrix is randomly determined and subject to simple constraints in order to satisfy the “echo state” requirements [11]. The constraints on the reservoir weight matrix guarantee that the reservoir state is driven by the input with fading memory. The simple train procedure of the ESN distinguishes it from other RNN designs that employ computationally expensive iterative train methods, making the ESN an attractive model for a wide range of applications.

Artificial neural networks are typically employed as classifiers by training the network to produce one of two Bayesian terms: 1) the class-conditional density function p(x|ωi), or 2) the a posteriori probability P(ωi|x) for the ith class ωi [15]. The density p(x|ωi) can be estimated indirectly by modeling the production of x through a predictive model:

x̂(n + 1) = F(χ(n), . . . , A)    (1)

where χ(n) is a set of the history of x up to x(n), and the model parameter set A of F is determined such that the mean-squared error (MSE) between x(n + 1) and the estimate x̂(n + 1) is minimized. With some assumptions on the distribution of the error signal, the minimization of the prediction MSE has been shown to be equivalent to the maximization of p(x|ωi) [16].

Classification using predictive models has previously been demonstrated using a multilayer perceptron [16] and a time delay neural network [17] operating on frame-based speech features. These network architectures are feed forward with limited or no memory of the input history. The ESN, in contrast, is recurrent and contains a fading memory of the input history in the network reservoir. Furthermore, the above-mentioned feed-forward networks were trained using backpropagation techniques, a relatively slow process that modifies all the weights of the network and is not guaranteed to find the global minimum of the network cost function. In an ESN, however, the vast majority of the weights are not trained, and the remaining weights are easily trained because the cost function relating the output weights to the prediction MSE is convex. In the following experiments, an ESN was trained to predict frame-based speech features in a classification experiment.
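Classification with per-class predictive models of the form of Eq. 1 then amounts to a minimum-MSE decision. The following is a minimal sketch, assuming each class provides a one-step predictor; the names and calling convention are illustrative and not taken from the paper.

```python
import numpy as np

def classify_min_mse(x, predictors):
    """Label a feature time series x (shape [T, d]) with the class whose
    one-step predictor gives the smallest mean squared prediction error.
    `predictors` maps a class label to a callable that returns predictions
    of x(2), ..., x(T) given x(1), ..., x(T-1); both are placeholders for
    whatever models implement Eq. 1."""
    mse = {label: float(np.mean((x[1:] - predict(x[:-1])) ** 2))
           for label, predict in predictors.items()}
    return min(mse, key=mse.get)   # minimum-MSE class label
```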

II. METHODS

A. Echo state network

An ESN with an input u(n), a reservoir state x(n) with M processing elements, and an output y(n) is described in Eq. 2:

x(n + 1) = tanh(Win u(n + 1) + W x(n))
y(n) = Wout x(n).    (2)

The input is tied to each processing element in the reservoir by an input weight matrix Win, the processing elements are recurrently connected by a weight matrix W, and the reservoir is tied to the output through a linear readout Wout. The ESN described in Eq. 2 is a simplified version of a fully connected ESN [11], removing all feedback except that in the reservoir. To establish an ESN described in Eq. 2 with the property of echo states, the recurrent weight matrix W must satisfy the condition σmax < 1, where σmax is the largest singular value of W. To guarantee the echo state property for the following experiments, W was initialized with random values from a zero-mean, unity-variance normal distribution, then scaled by 0.9/σmax. The input weight matrix Win was set with random values from a normal distribution with a mean of zero and a standard deviation of 0.2.

B. Multi-filter readout

To account for the nonstationary nature of speech, a multi-filter readout for each digit model was trained using dynamic programming. The multi-filter readout allowed each filter to segment a portion of the speech signal, reducing the degree of nonstationarity within each segment. For each digit ωi, a set of K filters {Wout,k(ωi)}, k = 1 . . . K, was constrained such that if y(n) was generated using Wout,k(ωi), then y(n + 1) could only be generated using filter k or k + 1, up to the last filter in the set. The readout was further constrained to always start with the first filter and end with the last filter for each utterance. Therefore, the sequence k(n) represented a filter switching sequence.

To batch train the multi-filter readout for each digit, the following procedure was used. First, the reservoir states x(n) for each input u(n) were generated using Eq. 2, then the K filters Wout,k(ωi) were initialized by segmenting x(n) into K equal-length subsections xk(n) along with the desired signal dk(n). For each class ωi, the following terms were accumulated over all train utterances:

Ak(ωi) = Xk^T Xk
Bk(ωi) = Xk^T Dk    (3)

where Xk = [xk(1), xk(2), . . . , xk(Nk)]^T and Dk = [dk(1), dk(2), . . . , dk(Nk)]^T represent the reservoir states and desired signals, respectively, for the kth segment. After accumulation over all train utterances,

Wout,k(ωi) = Ak^−1(ωi) Bk(ωi).    (4)

Fig. 1. Prediction gain (ratio of desired signal power to error signal power, in dB) versus train iteration for each ESN predictor model, M = 60, F = 12. Error signal power, over all utterances of each digit, decreased after each iteration of the train procedure until convergence. Each curve terminated when the prediction gain for successive iterations remained unchanged.
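To make the recipe in Sections II-A and II-B concrete, the following is a minimal numpy sketch of the reservoir setup, the state update of Eq. 2, and the per-segment least-squares readout of Eqs. 3 and 4. The function names, fixed random seed, and array-shape conventions are illustrative choices and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_esn(n_inputs, M):
    """Initialize weights as in Sec. II-A: W drawn from N(0, 1) and rescaled
    so its largest singular value is 0.9; Win drawn from N(0, 0.2**2)."""
    W = rng.normal(0.0, 1.0, (M, M))
    W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]   # sigma_max -> 0.9
    W_in = rng.normal(0.0, 0.2, (M, n_inputs))
    return W_in, W

def run_reservoir(u, W_in, W):
    """Drive the reservoir with input frames u (shape [T, n_inputs]) using
    Eq. 2: x(n+1) = tanh(Win u(n+1) + W x(n)). Returns states, shape [T, M]."""
    x = np.zeros(W.shape[0])
    states = np.zeros((u.shape[0], W.shape[0]))
    for n in range(u.shape[0]):
        x = np.tanh(W_in @ u[n] + W @ x)
        states[n] = x
    return states

def train_readout_filter(X_k, D_k):
    """Batch least-squares solution for one readout filter (Eqs. 3 and 4):
    A_k = X_k^T X_k, B_k = X_k^T D_k, W_out,k = A_k^-1 B_k.
    X_k stacks the segment-k reservoir states (one row per frame) and D_k
    the corresponding desired next feature frames."""
    A_k = X_k.T @ X_k
    B_k = X_k.T @ D_k
    return np.linalg.solve(A_k, B_k).T   # [n_outputs, M], so y(n) = W_out,k x(n)
```

In practice the Ak and Bk terms would be accumulated utterance by utterance, exactly as Eq. 3 describes; stacking the segment-k states of all train utterances into X_k before solving is an equivalent formulation.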

After initialization of the readout filters, the output from the kth filter, yk(n), was produced for each utterance of a digit and used to determine the error sequence ek²(n) = ||d(n) − yk(n)||². The optimum filter switching path k*(n) through ek²(n) for each utterance was determined using a dynamic programming algorithm similar to the Viterbi algorithm used with HMMs [18]. The path k*(n) for each utterance was then used to generate new segments Xk* and Dk*, which were accumulated according to Eq. 3. After processing all utterances, new filters Wout,k(ωi) were generated according to Eq. 4. The above train procedure was previously proven to converge [17].

Figure 1 shows convergence of the training algorithm by the asymptotic curves for prediction gain on the train data for each ESN prediction model with M = 60 processing elements and F = 12 filters. Prediction gain is the ratio of desired signal power to error signal power, in dB. The curves for each digit prediction model in Figure 1 terminated when the prediction gain did not change between successive iterations of the train procedure. An example of the segmentation of an utterance of the word “six” from a male test speaker, using the trained readout filters for an ESN with M = 100 and F = 6, is shown in Figure 2. The vertical gray bars denote phoneme boundaries in the utterance and appear to match well with the endpoints of the different regions (phonemes) of the utterance.
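The optimum switching path can be found with a standard dynamic-programming recursion over the per-filter error sequences. The sketch below shows one way to implement the monotonic “stay or advance by one filter” constraint described above; the exact implementation used by the authors is not given in the paper, and the zero-based indexing and function name here are illustrative.

```python
import numpy as np

def optimal_switching_path(E):
    """Minimum-cost filter switching path through the squared-error matrix
    E (shape [K, N]), where E[k, n] = ||d(n) - y_k(n)||**2. The path must
    start with filter 0, end with filter K-1, and may only stay on the same
    filter or advance by one filter between successive frames."""
    K, N = E.shape
    cost = np.full((K, N), np.inf)
    back = np.zeros((K, N), dtype=int)
    cost[0, 0] = E[0, 0]                    # start-with-first-filter constraint
    for n in range(1, N):
        for k in range(K):
            stay = cost[k, n - 1]
            advance = cost[k - 1, n - 1] if k > 0 else np.inf
            back[k, n] = k if stay <= advance else k - 1
            cost[k, n] = min(stay, advance) + E[k, n]
    path = np.empty(N, dtype=int)
    path[-1] = K - 1                        # end-with-last-filter constraint
    for n in range(N - 1, 0, -1):
        path[n - 1] = back[path[n], n]
    return path
```

The returned path plays the role of k*(n): it re-segments the reservoir states and desired signals before the readout filters are re-estimated with Eqs. 3 and 4.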

Fig. 2. Example of segmentation of the word “six” using an ESN predictor model with M = 100 processing elements and F = 6 filters in the multi-filter readout (signal amplitude versus time, in ms). Each vertical gray line represents a transition in the optimum switching path, determined using dynamic programming.

C. Automatic speech recognition experiment

Isolated utterances of the English digits “zero” through “nine” from 8 male and 8 female speakers were used to train the classifiers used in the experiments. The train set consisted of 26 utterances of each digit from 4 male and 4 female speakers, while the test set consisted of an equal number of utterances from 8 different speakers. Therefore, each digit model was trained with about 200 utterances from 8 speakers and tested on about 200 utterances from 8 different speakers.

Frame-based speech features, using human factor cepstral coefficients (HFCC) [19], were extracted from the speech utterances using 20-ms Hamming windows and pre-emphasis (α = 0.95) at 100 frames per second. The features were the log energy of each frame and the first 12 cepstral coefficients. The ESN classifier was compared to a left-to-right HMM, trained using a maximum likelihood procedure [20]. Covariance matrices were assumed diagonal, and the HMM was used with and without context features (first- and second-order temporal derivatives over ±4 frames [21]).
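As a concrete illustration of the front end described above (pre-emphasis with α = 0.95 and 20-ms Hamming windows at 100 frames per second), here is a minimal framing sketch. It does not reproduce the HFCC cepstra of [19]; the function name, hop computation, and the small constant inside the log are assumptions.

```python
import numpy as np

def frame_signal(signal, fs, alpha=0.95, win_ms=20.0, frame_rate_hz=100.0):
    """Pre-emphasize, then cut into 20-ms Hamming-windowed frames at
    100 frames/s; returns the windowed frames and per-frame log energy."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    win_len = int(round(win_ms * 1e-3 * fs))       # 20-ms analysis window
    hop = int(round(fs / frame_rate_hz))           # 10-ms hop -> 100 frames/s
    assert len(emphasized) >= win_len, "signal shorter than one window"
    window = np.hamming(win_len)
    n_frames = 1 + (len(emphasized) - win_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    return frames, log_energy
```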

III. RESULTS AND DISCUSSION

Table I shows the classification accuracy of the ESN classifier on test and train data for various numbers of processing elements M and readout filters F. To decrease the variation in scores, the same test and train speakers were maintained for all experiments for both classifiers. In addition, experiments with equal M used the same input weights Win and reservoir weights W. Experiments with M and F fixed, using several different W and Win, showed that accuracy scores varied by about ±0.5 percentage points.

For almost all experiments in Table I, the train set was nearly completely learned by the ESN classifier, while accuracy on the test set ranged between 97 and 99%. Larger M increased the effective memory depth of the reservoir and also the number of basis functions available to predict the next frame of features, leading to improved performance, while larger F led to finer segmentation of each utterance. Beyond a certain point, increasing M did not improve classification accuracy, due to several factors: limited train data, fading influence of past samples on prediction, and saturating reservoir memory depth.

TABLE I
CLASSIFICATION ACCURACY FOR THE ESN CLASSIFIER, WITH M PROCESSING ELEMENTS AND F FILTERS IN THE READOUT. IDENTICAL TEST AND TRAIN SPEAKER SETS WERE MAINTAINED FOR ALL EXPERIMENTS. EXPERIMENTS WITH THE SAME M USED THE SAME RESERVOIR WEIGHTS W AND INPUT WEIGHTS Win.

  M     F    Test %   Train %
  13    1    88.6     97.0
  13    4    95.4     99.8
  13    8    98.1     100
  13   12    98.3     100
  30    1    89.0     99.5
  30    4    97.0     100
  30    8    98.3     100
  30   12    98.6     100
  60    1    94.4     99.9
  60    4    98.6     100
  60    8    98.6     100
  60   12    99.1     100
 100    1    97.2     99.9
 100    4    98.5     100
 100    8    98.7     100
 100   12    98.2     100
 150    1    96.2     100
 150    4    97.9     100
 150    8    97.7     100
 150   12    98.4     100
 200    1    96.5     100
 200    4    98.0     100
 200    8    97.4     100
 200   12    98.3     100

Table II shows the results for the HMM classifier for various numbers of states N and mixtures per state M, with and without context features. The first row in Table II shows HMM performance when each digit model was essentially a single 39-dimensional Gaussian kernel with a diagonal covariance matrix (78 parameters estimated). Accuracy on the test set was 78.9%, compared to 88.6% for the first row in Table I, which was essentially a single-state prediction model (13 parameters estimated). For larger HMMs, performance on the test set was near 99.5% correct, although the results for HMMs trained with no context information saturated at 94.7% regardless of model size. The ESN classifier did not use temporal derivatives as input to the reservoir because temporal information is naturally included in the predictive model framework.

Computational cost for each classifier was estimated according to execution time. While not the best measure, execution time provides some insight into the efficiency of each classifier for a given set of train data. As many parameters as practical were fixed between the ESN and HMM classifier experiments (e.g., same computer, same amount of data processed, similar number of train iterations, similar degree of code optimization). For the HMM results in the first row of Table II for M = N = 1 (78 parameters per model), training required 142 seconds and testing on the test set required 153 seconds.

TABLE II
CLASSIFICATION ACCURACY FOR AN HMM CLASSIFIER, WITH N STATES AND M MIXTURES PER STATE. IDENTICAL TEST AND TRAIN SPEAKER SETS WERE MAINTAINED FOR ALL EXPERIMENTS. RESULTS FROM MODELS TRAINED WITH NO CONTEXT (N.C.) ARE SHOWN IN THE LOWER GROUP.

  N    M    Test %   Train %
  1    1    78.9     84.2
  1    6    99.4     99.9
  6    1    99.3     99.9
  6    5    99.6     100
  8    1    99.5     99.9
  8    6    99.7     100
 10    6    99.8     100

  N.C.
  1    1    77.3     88.6
  6    1    93.3     99.4
 10    1    94.5     99.0
 10    6    94.7     100

By comparison, for the ESN results in the first row of Table I for M = 13 and F = 1 (13 parameters per model), training required 3 seconds and testing on the test set required 8 seconds, more than an order of magnitude less than the HMM. The HMM results for N = 8, M = 1 (638 parameters per model) required 1703 seconds for training and 533 seconds for testing on the test set. The ESN results in Table I for M = 60 and F = 12 (720 parameters per model) were near the top of the performance range for the ESN classifier, and training required 268 seconds while testing on the test set required 68 seconds. The highest HMM accuracy listed in Table II was for N = 10 states and M = 6 mixtures per state (4758 parameters per model) and required 6318 seconds for training and 1265 seconds for testing on the test set. The slight performance gain of the HMM compared to the ESN was achieved with a significant increase in execution time and a modest increase in the number of estimated parameters per model.

IV. CONCLUSIONS

We have demonstrated the use of an echo state network predictor model as a classifier in an automatic speech recognition experiment. Unlike previous realizations of hybrid connectionist-stochastic predictive models for classification, the ESN classifier in the current work did not require training for the vast majority of the model’s parameters. The simple train procedure of the ESN led to significantly shorter execution times for slightly lower performance compared to an HMM. Time warping in the ESN classifier was achieved using a dynamic programming procedure, similar to the Viterbi algorithm, that determined the optimum filter switching path over the duration of a word utterance. Both classifiers were able to fully learn the train sets, while the HMM performance on the test set was about 0.7 percentage points above the range of ESN performance.

An advantage of the HMM over the current implementation of the ESN classifier was the use of multiple Gaussian kernels per state. While the ESN readout filters accounted for time warping in the same manner as the states of an HMM, each segmented region was modeled by only a single linear regressor.

The use of multiple regressors per segmented region, which could be achieved by modifying the path constraint in the dynamic programming algorithm, may improve ESN classification performance.

REFERENCES

[1] M. J. Korenberg, “Parallel cascade identification and kernel estimation for nonlinear systems,” Ann. Biomed. Eng., vol. 19, pp. 429–455, 1991.
[2] H. Jaeger, “Adaptive nonlinear system identification with echo state networks,” in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. Cambridge, MA: MIT Press, 2003, pp. 593–600.
[3] H. D. I. Abarbanel, T. W. Frison, and L. S. Tsimring, “Obtaining order in a world of chaos,” IEEE Sig. Proc. Mag., pp. 49–65, May 1998.
[4] H. Jaeger and H. Haas, “Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication,” Science, vol. 304, no. 5667, pp. 78–80, 2004.
[5] L. A. Feldkamp and G. V. Puskorius, “A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification,” Proc. IEEE, vol. 86, no. 11, pp. 2259–2277, 1998.
[6] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proc. National Academy of Sciences, vol. 79, pp. 2554–2558, April 1982.
[7] D. L. Wang and D. Terman, “Locally excitatory globally inhibitory oscillator networks,” IEEE Trans. Neural Networks, vol. 6, no. 1, pp. 283–286, January 1995.
[8] F. C. Hoppensteadt and E. M. Izhikevich, “Pattern recognition via synchronization in phase-locked loop neural networks,” IEEE Trans. Neural Networks, vol. 11, no. 3, pp. 734–738, May 2000.
[9] A. J. Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Trans. Neural Networks, pp. 298–305, Mar. 1994.
[10] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, June–July 2005.
[11] H. Jaeger, “The ‘echo state’ approach to analysing and training recurrent neural networks,” German National Research Center for Information Technology, Fraunhofer Institute for Autonomous Intelligent Systems, Tech. Rep. GMD Report 148, December 2001.
[12] C. A. Skarda and W. J. Freeman, “How brains make chaos in order to make sense of the world,” Behavioral and Brain Sciences, vol. 10, pp. 161–195, 1987.
[13] D. Watts and S. Strogatz, “Collective dynamics of small-world networks,” Nature, vol. 393, pp. 440–442, 1998.
[14] A. F. Atiya and A. G. Parlos, “New results on recurrent network training: unifying the algorithms and accelerating convergence,” IEEE Trans. Neural Networks, vol. 11, no. 3, pp. 697–709, May 2000.
[15] J. Schurmann, Ed., Pattern Classification: A Unified View of Statistical and Neural Approaches. New York, NY: John Wiley & Sons, Inc., 1996, ISBN: 0-471-13534-8.
[16] E. Levin, “Word recognition using hidden control neural architecture,” in Proc. Int. Conf. Acoust., Speech, and Signal Process., vol. 1, Albuquerque, NM, Apr. 1990, pp. 433–436.
[17] K. Iso and T. Watanabe, “Speaker-independent word recognition using a neural prediction model,” in Proc. Int. Conf. Acoust., Speech, and Signal Process., vol. 1, Albuquerque, NM, Apr. 1990, pp. 441–444.
[18] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.-F. Lee, Eds. San Mateo, CA: Kaufmann, 1990, pp. 267–296.
[19] M. D. Skowronski and J. G. Harris, “Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition,” J. Acoust. Soc. Am., vol. 116, no. 3, pp. 1774–1780, September 2004.
[20] K. Murphy, “Bayes Net Toolbox for Matlab,” 2005, URL: http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html [Aug. 9, 2005].
[21] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust., Speech, and Signal Process., vol. 29, no. 2, pp. 254–272, Apr. 1981.
