MULTI-LAYER PERCEPTRONS AND PROBABILISTIC NEURAL NETWORKS FOR PHONEME RECOGNITION*
K. O. E. Elenius and H. G. C. Travén**

Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report
STL-QPSR, Vol. 34, No. 2-3, 1993, pp. 001-006
http://www.speech.kth.se/qpsr
ABSTRACT
Two artificial neural networks have been trained to recognise phonemes in continuous speech: multi-layer perceptron (MLP) nets and probabilistic neural networks (PNN). The speech material was recorded by one male Swedish speaker and the sentences were phonetically labelled. Fifty sentences were used for training and another fifty were used for testing. Both networks had a single hidden layer and 38 output nodes corresponding to Swedish phonemes. The MLP was trained by the supervised back-propagation algorithm. The PNN was trained by a self-organising clustering algorithm, a stochastic approximation to the expectation maximisation algorithm. The classification results for a feed-forward MLP and the PNN were rather similar, but an MLP with simple recurrency using context nodes gave the best performance. Several other differences of practical value were noted.
1. BACKGROUND
Phoneme recognition can be viewed as a pattern classification problem, the task being that of classifying multivariate measurements derived from a speech signal. Multi-layer perceptrons (MLP) and probabilistic neural networks (PNN) are both feed-forward neural networks that can be used as general purpose classifiers. Viewed within the framework of Bayesian classification the MLP and the PNN are related, but use two essentially complementary statistical models. The MLP models the discriminant functions for different phoneme categories, essentially by a piece-wise planar approximation. It can be shown that several of the commonly used cost functions are minimised when network outputs correspond to a posteriori class probabilities, see Michael and Lippmann (1991). The PNN approximates class conditional probability densities by a Gaussian mixture but has no explicit model of the discriminants. In the two networks, connection weights correspond to normal vectors and mean values, respectively.
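The two complementary views can be illustrated with a toy sketch: a PNN-style classifier compares Gaussian-mixture estimates of the class-conditional densities, while the MLP-style route trains a discriminant function directly on the cross-entropy criterion (shown here in its simplest, zero-hidden-layer form). The data and all parameter values are invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented two-class toy data standing in for per-frame spectral vectors.
X0 = rng.normal(-1.0, 0.7, size=(200, 2))
X1 = rng.normal(+1.0, 0.7, size=(200, 2))

def gaussian(x, mean, var):
    """Isotropic Gaussian density, the building block of the PNN mixture."""
    d = x - mean
    return np.exp(-(d @ d) / (2 * var)) / (2 * np.pi * var)

def pnn_classify(x, var=0.5):
    """PNN view: estimate p(x | class) by a mixture centred on stored
    patterns, then pick the larger class-conditional density (equal priors)."""
    p0 = np.mean([gaussian(x, m, var) for m in X0])
    p1 = np.mean([gaussian(x, m, var) for m in X1])
    return int(p1 > p0)

def train_discriminant(X, y, steps=500, lr=0.5):
    """MLP view (degenerate case): a logistic discriminant trained on the
    cross-entropy criterion, so its output estimates P(class | x) directly."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                        # gradient of the cross-entropy cost
        w -= lr * (X.T @ g) / len(X)
        b -= lr * g.mean()
    return w, b
```

On clearly separated points the two routes agree; they differ in what is modelled (densities versus the decision boundary), which is the complementarity noted above.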
2. SPEECH MATERIAL
The speech material consisted of 100 Swedish sentences and was recorded by one male speaker. It was sampled at 16 kHz using a 6.3 kHz low-pass filter, see Hunnicutt (1987). The smoothed output of 16 Bark-scaled filters in the range 200 Hz to 6.0 kHz were input to the network. The interval between successive spectral frames was 10 ms and the integration time was 25.6 ms, compare Fig. 1. Fifty sentences were used for training and fifty were used for testing. The number of phonemes for the training material was 2202 and the total number of 10 ms frames was 15258 (2064 and 13671 for the test material). The sentences were phonetically labelled (Nord, 1988) and had a natural (large) variation in distribution.

* This paper has been published in the Proceedings of EuroSpeech '93, Berlin.
** Department of Numerical Analysis and Computing Science (NADA), KTH, S-100 44 Stockholm, Sweden.
Figure 1. Example spectrogram of the speech material using the network speech input representation of 10 ms frame rate and 16 filter amplitudes. Part of sentence: "omsorgsfullt bilen" (carefully the car). The manual labelling is also shown.
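The input representation described above (16 kHz signal, 10 ms frame rate, roughly 25.6 ms analysis window, 16 filter channels between 200 Hz and 6 kHz) can be sketched as follows. This is a hedged stand-in, not the paper's front end: log-spaced bands approximate the Bark-scaled filters, and the FFT-based band energies approximate the smoothed filter outputs.

```python
import numpy as np

FS = 16_000                  # sampling rate used in the paper
HOP = FS // 100              # 10 ms frame interval -> 160 samples
WIN = 410                    # ~25.6 ms integration window
N_BANDS = 16                 # filter channels covering 200 Hz .. 6 kHz
N_FFT = 512

def frame_signal(x):
    """Slice the signal into overlapping analysis windows, 10 ms apart."""
    n = 1 + max(0, (len(x) - WIN) // HOP)
    return np.stack([x[i * HOP : i * HOP + WIN] for i in range(n)])

def filterbank_features(x):
    """Log band energies on 16 log-spaced bands (a stand-in for Bark bands).
    Returns one 16-dimensional feature vector per 10 ms frame."""
    frames = frame_signal(x) * np.hamming(WIN)
    spec = np.abs(np.fft.rfft(frames, n=N_FFT)) ** 2
    freqs = np.fft.rfftfreq(N_FFT, d=1.0 / FS)
    edges = np.geomspace(200.0, 6000.0, N_BANDS + 1)
    feats = np.empty((len(frames), N_BANDS))
    for b in range(N_BANDS):
        band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        feats[:, b] = np.log(spec[:, band].sum(axis=1) + 1e-10)
    return feats
```

One second of speech yields 98 frames of 16 amplitudes each, matching the frame-rate bookkeeping used for the 15258 training frames.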
3. NETWORKS
Both networks had a single hidden layer, containing 64 nodes in the MLP and 128 in the PNN. There were 38 output nodes corresponding to Swedish phonemes. The MLP was trained with the back-propagation algorithm and the cross-entropy criterion (Solla et al., 1988). The PNN was trained by a self-organising clustering algorithm, a stochastic approximation to the expectation maximisation (EM) algorithm, compare Travén (1993b). The MLP network and some of its results have earlier been described in Elenius & Blomberg (1992), whereas the PNN experiments have been described in Travén (1993a).

A straightforward way to include coarticulation information, which naturally is important in phoneme recognition, is to use an input window to the network that extends over several spectral frames. We used this technique for both types of networks. Another way of including this information is to add recurrency to the network, compare Watrous (1990), and Robinson & Fallside (1991). We have tried to add simple recurrency to the MLP network using the technique of context nodes, compare Elman (1988) and Servan-Schreiber et al. (1988). The context nodes contain delayed values of other nodes in the network. We used context nodes for the hidden layer that stored the one frame delayed (10 ms) values of the hidden nodes. The context nodes were connected to the hidden nodes with trainable weights. We also used 10 ms delayed context nodes for the output nodes, connected to the outputs. Using both types of context nodes simultaneously gave somewhat better performance than using either of them separately.

The PNN has a connectivity structure similar to the time-delay neural network described by Waibel et al. (1989a) and Lang et al. (1990). The hidden nodes have an input window extending over three spectral frames (30 ms). Hidden nodes are connected to output nodes through time delayed connections (five connections 30 ms apart between each pair of hidden and output nodes).
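The context-node recurrency described above can be sketched as a forward pass: the context nodes simply hold the previous frame's hidden and output activations and feed them back through trainable weights. The layer sizes come from the paper; the weights here are random placeholders (the paper trains them by back-propagation), so this shows only the wiring, not a trained net.

```python
import numpy as np

rng = np.random.default_rng(1)

N_IN, N_HID, N_OUT = 16, 64, 38   # filter channels, hidden nodes, phonemes

# Placeholder weights; in the paper all four sets are trained jointly.
W_in   = rng.normal(scale=0.1, size=(N_HID, N_IN))
W_hctx = rng.normal(scale=0.1, size=(N_HID, N_HID))  # hidden context nodes
W_out  = rng.normal(scale=0.1, size=(N_OUT, N_HID))
W_octx = rng.normal(scale=0.1, size=(N_OUT, N_OUT))  # output context nodes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(frames):
    """One pass over a frame sequence. Context nodes hold the one frame
    delayed (10 ms) hidden and output activations, as in the text."""
    h_ctx = np.zeros(N_HID)
    o_ctx = np.zeros(N_OUT)
    outputs = []
    for x in frames:
        h = sigmoid(W_in @ x + W_hctx @ h_ctx)
        o = sigmoid(W_out @ h + W_octx @ o_ctx)
        outputs.append(o)
        h_ctx, o_ctx = h, o           # delay both sets by one frame
    return np.stack(outputs)
```

Because the feedback runs through the state rather than through a wider input window, the net can in principle integrate over an adaptive time span, which is the point made in the Results section.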
The training of the hidden nodes is unsupervised. The hidden nodes learn to recognise time invariant features using an EM algorithm for invariant Gaussian mixtures, described in Travén (1993a). Briefly, hidden node outputs are smoothed over five translations in 10 ms steps (from -20 to +20 ms) of the input window. During learning the contributions from different translations are weighted with the probability
of the translated pattern. The output nodes are linear and trained with the least-mean-square algorithm. The PNN did not have any recurrent connections.

4. RESULTS
The classification performance of the networks was evaluated against the manual segmentation. It was first evaluated on the frame level: a frame was counted correctly labelled if the phoneme node with the maximum output value corresponded to the label assigned by the human labeller. Note that although the output is associated with a specific frame, the input window usually contains more than one frame. The relation between performance and the size of the input spectral window for the feed-forward MLP is shown in Table 1. The spectral amplitudes were fed to a hidden layer of 64 nodes.

window size   % correct frames
10 ms         58
30 ms         64
50 ms         66
70 ms         70

Table 1. Frame-level phoneme recognition performance for different input window sizes of the feed-forward MLP network.
We also evaluated the segment-level classification performance for some conditions. In this case the mean output over all consecutive frames with the same (manually assigned) phonetic label was used. We noted if the correct phoneme had the maximum mean output. We likewise measured if it was among the two (or three) nodes with the highest mean output. Results for the PNN with 150 ms input window and the feed-forward MLP with 70 ms window may be seen in Fig. 2 together with the result of a simple recurrent MLP with 50 ms window. The results of the MLP without simple recurrence and the PNN were rather similar. Introducing coarticulation information by expanding the size of the input spectral window was shown to significantly improve performance for both methods, compare Table 1. However, an MLP with a 100 ms input window did not give better results than a 70 ms window, probably since it had too many weights, compare below. Table 2 contains a list of the most frequent confusions made by the PNN. Confusions are generally towards phonemes with a high a priori probability that are acoustically similar. The same type of errors were observed for the MLP network.
[Table 2 could not be recovered from the scan. Its columns were correct phoneme, reported phoneme, and frequency; the listed frequencies ranged from 24 down to 10.]

Table 2. The 16 most frequent confusions made by the PNN. Errors are in absolute numbers. The rest of the errors had less than 10 occurrences.
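The segment-level scoring described above is simple enough to state in code: average the node outputs over every run of consecutive frames sharing one manual label, then count the segment correct if its label is among the k nodes with the highest mean output. The toy outputs below are invented; only the scoring rule comes from the paper.

```python
import numpy as np

def segment_topk_accuracy(frame_outputs, frame_labels, k):
    """Fraction of label segments whose phoneme is among the k nodes with
    highest mean output over the segment's frames."""
    correct = total = 0
    start = 0
    for t in range(1, len(frame_labels) + 1):
        # A segment ends where the manual label changes (or at the end).
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            mean_out = frame_outputs[start:t].mean(axis=0)
            topk = np.argsort(mean_out)[-k:]        # k highest-scoring nodes
            correct += frame_labels[start] in topk
            total += 1
            start = t
    return correct / total

# Invented example: 5 frames, 38 phoneme nodes, two segments (labels 3, 7).
outs = np.full((5, 38), 0.1)
outs[:3, 3] = 0.9            # segment 1: node 3 clearly wins
outs[3:, 5] = 0.9            # segment 2: node 5 wins, correct node 7 is 2nd
outs[3:, 7] = 0.8
labels = [3, 3, 3, 7, 7]
```

Here top-1 accuracy is 0.5 but top-2 accuracy is 1.0, which is exactly the kind of gap Fig. 2 reports between the "Segment 1st" and "Segment 2nd" bars.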
[Figure 2: bar chart; legend: PNN 150 ms window, MLP 70 ms window; bars for frames (training), frames (test), and segment level 1st, 2nd, 3rd.]

Figure 2. The performance of the PNN and the MLP with and without simple recurrence. Same speech material of a single male Swedish speaker evaluated on frame and segment (phoneme) level. Segment-level results indicate when the correct phoneme is among the top 1, 2 or 3 nodes with highest output values. Frame-level results are also shown for the training material.
The MLP with recurrence performed significantly better than the other networks, except when allowing the correct phoneme segment to be among the three best ones, when the PNN performed equally well. Adding simple recurrence to an MLP net with a 10 ms input window gave a substantial improvement from 58% to 69% frame recognition. This is only 1% less than the result for the 70 ms window feed-forward MLP, which has more than double the number of connections (9 600 weights compared to 4 196). This illustrates the power of using recurrency. The net now has the ability to keep a "memory" of relevant events that have passed. The net is also allowed to adjust the integration time according to what is optimal for each case. Adding context nodes to a net with a 50 ms input window resulted in a 6% improvement (to 73%), which shows that recurrency may be combined with an input window, compare Cho et al. (1990). However, these types of nets get many weights (13 092 in our case), which means that training becomes problematic with the around 15 000 frames available. This is far too few considering the rule of thumb saying that the number of training examples should be around 10 times the number of weights. Accordingly, using context nodes with a 70 ms input window did not improve performance, which thus most probably is due to the limited training set. Also, adding another set of context nodes keeping the 10 ms delayed values of the first set of context nodes, which should result in an improved and more flexible recurrency, gave no improvements, probably due to the increased number of weights. Lee et al. (1991) have shown a small positive effect by using a similar technique.
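The quoted weight counts can be checked with a back-of-envelope count of connections, assuming 16 filter channels per 10 ms frame, 64 hidden and 38 output nodes, and ignoring bias terms. Under these assumptions the 9 600 and 13 092 figures reproduce exactly; the 4 196 figure for the 10 ms recurrent net does not factor the same way, so it is left unchecked here.

```python
# Connection counts for the networks discussed in the text (biases ignored).
N_FILT, N_HID, N_OUT = 16, 64, 38

def mlp_weights(window_frames, hidden_ctx=False, output_ctx=False):
    """Weights in a one-hidden-layer MLP with an input window of
    `window_frames` spectral frames and optional Elman-style context nodes."""
    w = window_frames * N_FILT * N_HID + N_HID * N_OUT
    if hidden_ctx:
        w += N_HID * N_HID   # hidden context -> hidden connections
    if output_ctx:
        w += N_OUT * N_OUT   # output context -> output connections
    return w

# 70 ms feed-forward window = 7 frames -> 9 600 weights, as quoted.
print(mlp_weights(7))
# 50 ms window plus both context-node sets -> 13 092 weights, as quoted.
print(mlp_weights(5, hidden_ctx=True, output_ctx=True))
```

With around 15 000 training frames, the 10-examples-per-weight rule of thumb would call for well over 100 000 frames for the 13 092-weight net, which supports the overfitting explanation given above.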
REFERENCES
Altosaar, T. and Karjalainen, M. (1992). "Diphone-based speech recognition using time-event neural networks," Proceedings of ICSLP 92, Vol. 2, pp. 979-982. Banff.
Cho, Y.D., Kim, K.C., Yoon, H.S., Maeng, S.R. and Cho, J.W. (1990). "Extended Elman's recurrent neural network for syllable recognition," Proceedings of ICSLP 90, Vol. 2, pp. 1057-1060. Kobe.
Elenius, K. and Blomberg, M. (1992). "Experiments with artificial neural networks for phoneme and word recognition," Proceedings of ICSLP 92, Vol. 2, pp. 1279-1282. Banff.
Elman, J.L. (1988). "Finding structure in time," Technical Report 8801, Centre for Research in Language, University of California, San Diego.
Hunnicutt, S. (1987). "Acoustic correlates of redundancy and intelligibility," STL-QPSR, No. 2-3, pp. 7-14, Dept. of Speech Communication, KTH, Stockholm.
Lang, K.J., Waibel, A.H. and Hinton, G.E. (1990). "A time-delay neural network architecture for isolated word recognition," Neural Networks, Vol. 3, pp. 23-43.
Lee, S.J., Kim, K.C., Yoon, H.S. and Cho, J.W. (1991). "Application of fully recurrent neural networks for speech recognition," Proceedings of ICASSP-91, pp. 77-80. Toronto, Canada.
Lee, Y. (1991). "Handwritten digit recognition using K-nearest-neighbour, radial-basis function, and backpropagation neural networks," Neural Computation, Vol. 3, No. 3, pp. 440-449.
Michael, D. and Lippmann, R.P. (1991). "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural Computation, Vol. 3, No. 4, pp. 461-483.
Nord, L. (1988). "Acoustic-phonetic studies in a Swedish speech data bank," Proc. SPEECH '88, Book 3, pp. 1147-1152 (7th FASE Symposium), Institute of Acoustics, Edinburgh.
Pearlmutter, B.A. (1990). "Dynamic recurrent neural networks," Technical Report CMU-CS-88-191, Computer Science Department, Carnegie Mellon University, Pittsburgh.
Robinson, T. and Fallside, F. (1991). "A recurrent error propagation network speech recognition system," Computer Speech and Language, Vol. 5, No. 3, pp. 259-274.
Servan-Schreiber, D., Cleeremans, A. and McClelland, J.L. (1988). "Encoding sequential structure in simple recurrent networks," Technical Report CMU-CS-88-183, Computer Science Department, Carnegie Mellon University, Pittsburgh.
Solla, S., Levin, E. and Fisher, M. (1988). "Accelerated learning in layered neural networks," Complex Systems, 2, pp. 625-640.
Travén, H.G.C. (1991). "A neural network approach to statistical pattern classification by 'semi-parametric' estimation of probability density functions," IEEE Trans. on Neural Networks, Vol. 2, No. 3, pp. 366-377.
Travén, H.G.C. (1993a). "Invariance constraints for improving generalisation in probabilistic neural networks," Proc. IEEE ICNN '93, pp. 1348-1353. San Francisco.
Travén, H.G.C. (1993b). On Pattern Recognition Applications of Artificial Neural Networks, Dissertation, Thesis TRITA-NA-P9318, Dept. of Numerical Analysis and Computing Science (NADA), KTH, Stockholm.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. (1989a). "Phoneme recognition using time-delay neural networks," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 37, No. 3, pp. 328-339.
Waibel, A., Sawai, H. and Shikano, K. (1989b). "Modularity and scaling in large phonemic neural networks," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 37, No. 12, pp. 1888-1898.
Watrous, R. (1990). "Phoneme discrimination using connectionist networks," JASA, Vol. 87, No. 4, pp. 1753-1772.