S4.12 SPEECH CODING WITH MULTI-LAYER NETWORKS

Yoshua BENGIO*, Regis CARDIN*, Piero COSI**, Renato DE MORI*
*School of Computer Science, McGill University, 3480 University Street, Montreal, Quebec, Canada H3A 2A7
**Centro di Studio per le Ricerche di Fonetica, CNR, via G. Oberdan 10, 35122 Padova, Italy

ABSTRACT

Our research effort combines a structural, knowledge-based approach for describing speech units with neural networks capable of automatically learning the relations between acoustic properties and speech units. We are investigating how speech coding can be performed by sets of multi-layer neural networks whose execution is decided by a data-driven strategy. Coding is based on phonetic properties characterizing a large population of speakers. Results on speaker-independent recognition of vowels using an ear model for preprocessing are reported.

1. INTRODUCTION

Important efforts have been devoted in recent years to the coding of portions of the speech signal into representations. An example of such coding is vector quantization, which describes a speech interval with a label belonging to a given vocabulary. This method for obtaining descriptions of speech intervals has been applied to speech transmission and to Automatic Speech Recognition (ASR). Extensive and successful experiments [1] have been performed on the automatic recognition of large vocabularies with speaker-dependent coding.

This paper introduces new coding schemes for speaker-independent ASR based on the conjecture that coding speech for recognition does not have the same objectives as coding speech for transmission. In the case of recognition, coding should be based on properties of speech that are characteristic of a large population of speakers. In order to capture speech characteristics, decision functions, which have to describe the existence of phonetic properties in speech intervals, have to be learned from examples rather than being defined by algorithms conceived by the coder designer. The main motivation for such a new approach to coder design is that perceptually significant features do not necessarily exhibit small distortions in the space of acoustic parameters; rather, they are expected to cluster input patterns that are perceived as the same sound with a small distortion in a perceptual space. Moreover, speech coding will be more accurate if acoustic contexts around the interval to be described are taken into account. Different descriptions can be extracted from acoustic data by Property Extractors (PE) in different acoustic situations. An acoustic segment of broad-band noise is described in a way that differs from that used for a segment showing transitions of vocal tract resonances. It will be shown that the association of spectral samples collected in a short time interval with heterogeneous descriptions relative to larger intervals and extracted by knowledge-based algorithms can be obtained with Multi-Layer Networks (MLN).

Recently, a large number of scientists have been investigating and applying learning systems based on multi-layer neural networks. Definitions of MLNs, motivations and algorithms for their use can be found in [3-4]. Theoretical results have shown that MLNs can perform a variety of complex functions [4]. Applications have also shown that MLNs have interesting generalization performances, capable of capturing information related to pattern structures as well as characterizations of parameter variation [5-8]. Furthermore, algorithms for MLNs allow learning to be competitive [5]. A need for competitive learning of speech properties has been made evident in recent publications [9-10].


While in the popular Hidden Markov Models (HMM) there is a model for each unit, an MLN is built for a set of units. Another reason for the interest in MLNs is the possibility of conceiving efficient architectures for them. We will investigate in this paper how learning phonetic properties can be performed by sets of MLNs whose execution is decided by a data-driven strategy. This strategy analyzes morphologies of the input data and selects the execution of a set of MLNs as well as the time and frequency resolution of the spectral samples that are applied at their inputs. In principle an MLN can implement every function a digital computer can implement, provided that a suitable number of nodes, a suitable topology and a suitable choice of weights for internode links are chosen. The problem here is that of learning internode weights that characterize an acceptable model of an ideal coder: one that will detect phonetic properties whenever they are present in a pattern, even if not all the instantiations of a property have been used for training the coder.

2. SPEAKER-INDEPENDENT RECOGNITION OF TEN VOWELS IN FIXED CONTEXTS

A first experiment was performed for speaker-independent vowel recognition. The purpose was to train an MLN capable of discriminating among 10 different American-English vowels, represented in the ARPABET by the following VSET:

VSET = {iy, ih, eh, ae, ah, uw, uh, ao, aa, er}    (1)

The interest was to investigate the generalization capability of the network with respect to inter-speaker variability. Some vowels (ix, ax, ey, ay, oy, aw, ow) were not used in this experiment because we attempted to recognize them through features learned by using only VSET. The words used are those belonging to the WSET defined in the following:

WSET = {BEEP, PIT, BED, BAT, BUT, FUR, FAR, SAW, PUT, BO...}    (2)

The signal processing method used for this experiment consists of an ear model (see [2]) with 40 channels, simulating the probability of firing as a function of time for a set of similar fibers acting as a group in the auditory nerve, followed by a Generalized Synchrony Detector (GSD) enhancing spectral peaks due to vocal tract resonances. The output of the GSD was collected every 5 ms and represented by a 40-coefficient vector. This type of output is supposed to retain most of the relevant spectral information for discriminating sonorant phonemes. The GSD output of the vocalic part of the signal was sent to an MLN. Vowels were automatically singled out by an algorithm proposed in [12], and a linear interpolation procedure was used to reduce the variable number of frames per vowel to 10 (the first and the last 20 ms were not considered in the interpolation procedure). The resulting 400 spectral coefficients (40 spectral coefficients per frame x 10 frames) became the inputs of the MLN. The network also had a single hidden layer with 20 nodes and ten output nodes, one for each vowel.
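As a concrete illustration of this input preparation, the sketch below linearly interpolates a variable-length sequence of 40-coefficient GSD frames down to a fixed 10-frame pattern and flattens it into the 400-dimensional network input. The frame rate, trimming and dimensions follow the text; the function and variable names are our own, hypothetical choices.

import numpy as np

FRAME_MS = 5          # one GSD vector every 5 ms
TRIM_MS = 20          # first and last 20 ms excluded from interpolation
N_CHANNELS = 40       # ear-model / GSD coefficients per frame
N_TARGET_FRAMES = 10  # fixed number of frames fed to the MLN

def vowel_to_mln_input(gsd_frames: np.ndarray) -> np.ndarray:
    """Map a (n_frames, 40) vowel segment to the 400-dim MLN input.

    A sketch of the preprocessing described above: trim 20 ms at each
    end, then linearly interpolate each channel to 10 equally spaced
    frames and flatten.
    """
    trim = TRIM_MS // FRAME_MS
    core = gsd_frames[trim:len(gsd_frames) - trim]

    # Positions of the 10 output frames on the original time axis.
    src_t = np.arange(len(core))
    dst_t = np.linspace(0, len(core) - 1, N_TARGET_FRAMES)

    # Interpolate every spectral channel independently.
    fixed = np.stack(
        [np.interp(dst_t, src_t, core[:, c]) for c in range(N_CHANNELS)],
        axis=1,
    )
    return fixed.reshape(-1)  # 10 frames x 40 coefficients = 400 inputs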

The Error Back Propagation Algorithm (EBPA) is a learning algorithm for a class of non-linear neural networks. These networks are made of connected units. Each unit computes a non-linear function of the weighted sum of its inputs and produces an output that can be sent to many other units. The type of network that we considered is feedforward (non-recurrent) and organized in layers. Units which are neither input nor output units are called hidden units. The network learns to compute a non-linear function from the input units to the output units. The EBPA iteratively computes the weights that minimize a square error measure defined over a set of training input/output examples:

training set = { (IN_i, OUT_i) }    (3)

where IN_i is a vector of input values and OUT_i is a vector of desired output values (OUT_i1, OUT_i2, ..., OUT_in). The minimized square error measure is

E = Σ_i Σ_j ( OUT_ij − Y_j(IN_i) )²

where i varies over the training set of examples and j varies over the n units of the output layer. The basic EBPA uses gradient descent in the space of weights to minimize E. Hence the basic learning rule is:

ΔW = − learning_rate · ∂E/∂W    (4)

where ∂E/∂W can be computed by back-propagating the error from the output units to the hidden units, as described in [3]. In order to reduce the training time and accelerate the learning, various techniques can be used. Strictly speaking, the gradient descent rule implies a modification of the weights after all the examples have been shown. This is called batch learning. However, it was experimentally found, at least for pattern recognition problems, that it is much faster to perform on-line learning, i.e. updating the weights after the presentation of each example. On the other hand, batch learning provides an accurate measure of the performance of the network as well as of the gradient ∂E/∂W, and these two pieces of information can be used to adapt the learning rate during training in order to minimize the number of necessary training iterations. In our experiments we used various techniques to improve learning time and generalization: jumping from on-line learning to batch learning when appropriate, using local (weight-specific) and adaptive learning rates, balancing the presentation of exemplars among the different classes, training small modules with a simpler architecture and then adding more hidden units or hidden layers to them, breaking up the problem into small modules (that can then be combined using Waibel's glue units [10]), and training on time-shifted versions of the inputs to learn time invariance and insensitivity to errors in the segmentation preprocessing. A minimal sketch of such a training loop is given at the end of this section.

An important issue in using a neural network for pattern recognition is generalization, i.e. how a trained network will perform on new examples that were not part of its training set. For this crucial issue, the architecture of the network has a great importance. It is generally admitted that solutions to the problem of learning the training set that have a minimal number of degrees of freedom, or many constraints imposed on them, will probably generalize better. A technique that we used to improve generalization is to monitor the generalization ability of the network during training by using an additional set of "control" test examples. The network used for further tests is the one that performed best on the control test set. Back-propagation is useful for learning features that associate or differentiate pairs of patterns, with the possibility of creating different regions for different acoustic realizations (e.g. allophones) of the same phonological unit. In contrast with Hidden Markov Models, neural networks can learn from presentations of examples from all classes, with the possibility of emphasizing what makes classes different and different examples of the same class similar [5].

Under the experimental conditions described above, the voices of 13 speakers (7 male, 6 female) were used for learning, with 5 samples per vowel per speaker. The voices of seven new speakers (3 male, 4 female) were used for recognition, with 5 samples per vowel per speaker. In 95.4% of the cases, correct hypotheses were generated with the highest evidence; in 98.5% of the cases correct hypotheses were found in the top two candidates, and in 99.4% of the cases in the top three candidates. The same experiment with FFT spectra instead of data from the ear model gave an 87% recognition rate under similar experimental conditions. The use of the ear model allowed spectra to be produced with a limited number of well-defined spectral lines. This represents a good use of speech knowledge, according to which formants are vowel parameters with low variance. The use of male and female voices allowed the network to achieve excellent generalization with samples from a limited number of speakers.

Encouraged by the results of this first experiment, other problems appeared worth investigating with the proposed approach. These problems are all related to the possibility of extending what has been learned for ten vowels to recognize new vowels. An appealing generalization possibility relies on the recognition of vowel features. By learning a set of features on a set of vowels, new vowels can be characterized just by different combinations of the learned features. Features like the place of articulation and the manner of articulation related to tongue position are good descriptors of the vowel generation system. It can be expected that their values have low variance when different speakers pronounce the same vowel.
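To make the training procedure above concrete, the following is a minimal sketch of an EBPA loop for the 400-20-10 network of this section, with on-line weight updates following equation (4). It is an illustrative reconstruction, not the authors' code: the class, learning rate and data arrays are hypothetical, biases are omitted, and the adaptive-rate and module-combination techniques mentioned above are left out for brevity.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VowelMLN:
    """Feedforward 400-20-10 network trained with the EBPA of eq. (4)."""

    def __init__(self, n_in=400, n_hidden=20, n_out=10, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.lr = lr  # a single global learning rate, for simplicity

    def forward(self, x):
        self.h = sigmoid(x @ self.W1)       # hidden-unit activations
        self.y = sigmoid(self.h @ self.W2)  # one output evidence per vowel
        return self.y

    def train_online(self, x, target):
        """One on-line step: dW = -lr * dE/dW with E = sum_j (t_j - y_j)^2
        (the constant factor 2 is absorbed into the learning rate)."""
        y = self.forward(x)
        d_out = (y - target) * y * (1.0 - y)                   # output deltas
        d_hid = (d_out @ self.W2.T) * self.h * (1.0 - self.h)  # back-propagated
        self.W2 -= self.lr * np.outer(self.h, d_out)
        self.W1 -= self.lr * np.outer(x, d_hid)
        return float(np.sum((target - y) ** 2))

def train(net, patterns, targets, epochs=70):
    """On-line EBPA over the training set, reshuffled each cycle."""
    idx = np.arange(len(patterns))
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            net.train_online(patterns[i], targets[i])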


3. THE RECOGNITION OF PHONETIC FEATURES

The same procedure introduced in the previous section was used for learning three networks, namely MLNV1, MLNV2 and MLNV3. These networks have the same structure as the one introduced in the previous section, with the only difference that they have more outputs. MLNV1 has five additional outputs corresponding to the five places of articulation PL1, ..., PLi, ..., PL5. MLNV2 has five new outputs, namely MN1, ..., MNj, ..., MN5. MLNV3 has two additional outputs, namely T (tense) and L (lax). The ten vowels used for this experiment have the features defined in Table I.

Table I. Phonetic features (target outputs) for vowels of the experiment: place, manner and tenseness of articulation.

After having learned the weights of the three networks, the outputs corresponding to the individual vowels were ignored and confusion matrices were derived only for the outputs corresponding to the phonetic features. An error corresponds to the fact that an output has a degree of evidence higher than the degree of the output corresponding to the feature possessed by the vowel whose pattern has been applied at the input. The confusion matrix for the features is shown in Table II. The overall error rates are 4.57%, 5.71% and 5.43% respectively for the three sets of features. Error rates on the training data were always zero after a number of training cycles (between 60 and 70) of the three networks. Several rules can be conceived for recognizing vowels through their features. The most severe rule is that a vowel is recognized only if all three of its features have been scored with the highest evidence. With such a rule, 313 out of 350 vowels are correctly recognized, corresponding to an 89.43% recognition rate. In 28 cases, combinations of features having the highest score did not correspond to any vowel, so a decision criterion had to be

introduced in order to generate the best vocalic hypothesis. It is important to consider as an error the case in which the features of a vowel not contained in the set defined by (1) receive the highest score. Considering these vowels as well as other vowels not in (1), an error rate of 2.57% was found. This leads to the conclusion that an error rate between 2.57% and 10.57% can be obtained, depending on the decision criterion used for those cases in which the set of features having the highest membership in each network does not correspond to any vowel.

Table II. Confusion matrix of phonetic features (place of articulation, manner of articulation, tenseness) for new speakers.

An appealing criterion consists in computing the centers of gravity of the place and manner of articulation using the following relation:

CG = ( Σ_{i=1..5} i · μ(i) ) / ( Σ_{i=1..5} μ(i) )    (5)

Let CGP and CGM be respectively the centers of gravity of the place and manner of articulation. A degree of "tenseness" can be computed by dividing the membership of "tense" by the sum of the memberships of "tense" and "lax". Each sample can now be represented as a point in a three-dimensional space having CGP, CGM and the degree of tenseness as dimensions. Euclidean distances are computed for those sets of features not corresponding to any vowel with respect to the points representing theoretical values for each vowel. With centers of gravity and Euclidean distance, an error rate of 7.24% was obtained. Another interesting criterion consists in introducing a subjective probability for a feature, defined as the ratio of the feature membership over the sum of the memberships of the other features. For example, for feature PLi a probability x_i is defined as follows:

x_i = μ(PLi) / ( Σ_{k=1..5} μ(PLk) )    (6)

The probability of a vowel is then defined as the product of the subjective probabilities of the features of the vowel. As the denominator of the probability of a vowel is the same for all the vowels, the vowel with the highest probability is the one with the highest product of the evidences of its features. By smoothing each membership with its neighbors and multiplying the memberships of the features of each vowel, an error rate of 8.8% was obtained. The error rate obtained with gravity centers is not far from the one obtained in the previous section with ten vowels. In this case the possibility of error was higher because the system was allowed to recognize feature combinations for all the vowels of American English. For those cases in which the features that reached the maximum evidence did not define a set corresponding to any vowel of American English, an error analysis was made. Most of the errors were systematic (PL2 confused with PL4 and MN2 confused with MN4). The features with maximum evidence can be used as a code for describing an unknown vowel. When this code does not correspond to any acceptable vowel, it can be mapped into the right one, corresponding to the true features of the vowel, when the wrong code always corresponded to the same vowel. When the wrong code corresponds to more than one vowel, a procedure is executed that computes Euclidean distances on gravity centers. With this criterion, which is derived from the test data, an error rate of 3.24% can be obtained. This error rate cannot be used for establishing the performance of the feature networks, because it corrects some errors by recoding the memberships using a function that has been learned by analyzing the test data. Nevertheless, it suggests that feature-based MLNs may outperform a straightforward phoneme-based MLN if successive refinements are performed using more than one training set. In fact, after a few experiments, interpretations for the codes PL-00001, MN-00001 and PL-01000, MN-10000 can be inferred and applied to successive experiments, leading to a correct recognition rate close to 96%.
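The center-of-gravity criterion of equation (5) can be sketched as follows. The code computes CGP, CGM and the tenseness degree from the five place memberships, five manner memberships and the tense/lax memberships, then falls back on Euclidean distance to per-vowel reference points when the top-scoring feature combination matches no vowel. The reference table VOWEL_POINTS below is hypothetical: the paper derives the theoretical per-vowel values from Table I, which is not reproduced here.

import numpy as np

def center_of_gravity(memberships):
    """Eq. (5): CG = sum(i * mu_i) / sum(mu_i), i = 1..5."""
    mu = np.asarray(memberships, dtype=float)
    idx = np.arange(1, len(mu) + 1)
    return float(np.sum(idx * mu) / np.sum(mu))

def tenseness(mu_tense, mu_lax):
    """Degree of tenseness: membership of 'tense' over tense + lax."""
    return mu_tense / (mu_tense + mu_lax)

# Hypothetical theoretical points (CGP, CGM, tenseness) per vowel.
VOWEL_POINTS = {"iy": (5.0, 1.0, 1.0), "aa": (1.0, 5.0, 1.0)}  # etc.

def nearest_vowel(place_mu, manner_mu, mu_tense, mu_lax):
    """Fallback decision by Euclidean distance in (CGP, CGM, T) space."""
    p = np.array([center_of_gravity(place_mu),
                  center_of_gravity(manner_mu),
                  tenseness(mu_tense, mu_lax)])
    return min(VOWEL_POINTS,
               key=lambda v: np.linalg.norm(p - np.array(VOWEL_POINTS[v])))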


4. STRUCTURE OF A CODER USING MLNs

A coder can be conceived using MLNs. The structure of the coder is shown in Figure 1. Different coder components are executed in different situations. The input layer of each MLN is fed by a Property Extractor (PE) that acts as a window analyzing the data with variable time and frequency resolution. MLNs are organized into sets. Let MLNSi be the set of MLNs activated in situation Si. Some of the MLNs in MLNSi are executed concurrently, under the only condition that situation Si is detected. Let the collection of these MLNs be called a Concurrent MLN (CMLNi). The MLNs which are components of the same CMLN are trained separately. This makes it possible to perform learning on each subnetwork with a reasonable number of training data. A sketch of this data-driven dispatching is given below.
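The following sketch illustrates the data-driven control strategy just described: a situation detector selects which set of concurrently executed networks (CMLN) runs on a segment, each network fed by its own property extractor. The registry, the spectral-flatness heuristic and all names are hypothetical stand-ins; the paper does not specify how the situations are detected.

from typing import Callable, Dict, List, Tuple
import numpy as np

# A property extractor maps raw signal to a fixed-size input pattern;
# an MLN maps that pattern to a vector of property evidences.
PropertyExtractor = Callable[[np.ndarray], np.ndarray]
Network = Callable[[np.ndarray], np.ndarray]

# One (extractor, network) pair per member MLNij of the CMLN for each
# situation S1..S4 (hypothetical registry, filled in elsewhere).
CMLN: Dict[str, List[Tuple[PropertyExtractor, Network]]] = {
    "S1": [], "S2": [], "S3": [], "S4": [],
}

def spectral_flatness(frame: np.ndarray) -> float:
    """Geometric over arithmetic mean of a magnitude spectrum."""
    mag = np.abs(np.fft.rfft(frame)) + 1e-12
    return float(np.exp(np.mean(np.log(mag))) / np.mean(mag))

def detect_situation(segment: np.ndarray, flat_thresh: float = 0.5) -> str:
    """Toy stand-in for the data-driven detector: high spectral flatness
    ~ broad-band noise or silence (S2), low ~ narrow-band resonances (S1);
    a change of regime across the segment is coded as a transition."""
    half = len(segment) // 2
    noisy_start = spectral_flatness(segment[:half]) > flat_thresh
    noisy_end = spectral_flatness(segment[half:]) > flat_thresh
    if noisy_start and noisy_end:
        return "S2"
    if not noisy_start and not noisy_end:
        return "S1"
    return "S3" if noisy_start else "S4"  # S3: S2 -> S1, S4: S1 -> S2

def code_segment(segment: np.ndarray) -> List[np.ndarray]:
    """Run all member MLNs of the CMLN selected by the situation."""
    si = detect_situation(segment)
    return [mln(pe(segment)) for pe, mln in CMLN[si]]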


Figure 1. General scheme of the coder using MLNs.

There is only one CMLN for each situation. Each MLN of a CMLN is trained for detecting a set of phonetic properties. There are four situations Si. The coder in each situation has the same structure, and the components MLNij of CMLNi have the same type of function for all i and the same j, according to the following scheme: MLNi1 generates hypotheses about the types of phonemes, MLNi2 generates hypotheses about the place of articulation, MLNi3 generates hypotheses about the manner of articulation, and MLNi4 generates hypotheses about voicing or tenseness. The set of outputs of MLNij is denoted O(ij); the outputs of the four MLNs of CMLN1 can be described as follows. O(11) contains the properties characterizing types of phonemes:

O(11) = {oral-sonorant, nasal, plosive, fricative, silence}    (7)

O(12) contains properties describing the place of articulation. For situation S1, the set is:

O(12) = {PL1, PL2, PL3, PL4, PL5}    (8)

where PLk (1 ≤ k ≤ 5) is a value of the place of articulation of a sound (PL1 is for "back", PL5 is for "front"). Notice that these sets of properties are complete in the sense that every sound must have a place of articulation or it must belong to a phoneme type. For a specific pattern, the network may not be able to precisely hypothesize a place of articulation; rather, it will produce a

distribution of evidences over the five places. The outputs of the subnetworks in CMLNi may have associated memberships showing that their values would not allow a reliable decision. This suggested the introduction of Variable Depth Analysis (VDA), following a concept already proposed in previous work [11]. VDA consists in recognizing situations in which new degrees of evidence have to be proposed, based on new, more specialized subnetworks that could be fed by new acoustic properties. An example of a precondition for performing VDA is when the place of articulation is greater than a threshold (e.g. 2.7), indicating that a nasal sound could be present. In such a case a more accurate distinction on the place of articulation has to be performed in order to distinguish among nasal sounds. This condition triggers the execution of a new MLN whose outputs are the places of articulation of nasal sounds and of front vowels. Another subnetwork for VDA is the one for the distinction between the places of articulation of the phonemes /b/ and /d/. It will be executed only when the evidences of these phonemes are close and among the highest in the acoustic segment under analysis, and it will use detailed data (mostly related to properties relevant for this type of confusion) at its input. The VDA unit is a pool of MLNs that can be invoked under specified preconditions.
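A minimal sketch of such a VDA precondition check follows, using the place-of-articulation example from the text. The 2.7 threshold and the place center of gravity come from the paper; the trigger names, the /b/-/d/ margins and the pool registry are hypothetical.

from typing import Callable, Dict, List
import numpy as np

# Hypothetical pool of specialized VDA networks, keyed by the
# precondition that invokes them.
VDA_POOL: Dict[str, Callable[[np.ndarray], np.ndarray]] = {}

def vda_preconditions(cgp: float, evidences: Dict[str, float]) -> List[str]:
    """Collect the VDA triggers that fire for one acoustic segment."""
    triggers = []
    # Paper's example: a place-of-articulation center of gravity above
    # 2.7 suggests a nasal sound, so refine among nasals / front vowels.
    if cgp > 2.7:
        triggers.append("nasal_refinement")
    # Run the /b/-/d/ discriminator only when both evidences are high
    # and close to each other (the margins here are illustrative).
    b, d = evidences.get("b", 0.0), evidences.get("d", 0.0)
    if min(b, d) > 0.4 and abs(b - d) < 0.1:
        triggers.append("b_vs_d")
    return triggers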

The following considerations motivate the use of different PEs and of different CMLNs in different situations. In the speech signal there are events characterized by abrupt transients. A plosive sound or the beginning of an utterance may produce events of that nature. A fricative sound will produce a similar situation, in which a sequence of spectra exhibiting broad-band noise is followed by a sequence of spectra exhibiting narrow-band resonances. In such situations, which will be indicated as S3, it is important to analyze possible bursts, requiring a PE with high time resolution and not very high frequency resolution in high frequency bands. The same considerations in high frequency bands can be made for characterizing frication noise. Notice that the selection of property extractors is suggested only by the data. In each situation, any type of phonetic or phoneme hypothesis can be generated. In situation S1, speech segments exhibit narrow-band resonances. Situation S2 is the one in which spectra exhibit broad-band energies or are silence-like. Situation S3 is the transition between S2 and S1. Situation S4 is the transition between S1 and S2.

Figure 2. Structure of CMLN3 (the signal's FFT data feed property extractors PE31-PE34, which feed the member MLNs).

5. CONCLUSION

The preliminary experiments reported here on speaker normalization, combining multi-layer neural network techniques and a knowledge-based approach, show promising results. Multi-layer neural networks, trained to characterize vowels preprocessed by an ear model in function of phonetic features, were able to generalize to new speakers and to new vowels. We described a modular speech coding system based on sets of MLNs whose execution is decided by a data-driven strategy. These MLNs are organized in function of situations detected in the speech signal and in function of the type of phonetic features they have to detect. In cases when MLNs do not have a reliable output, a variable depth analysis is performed with specialized MLNs in order to discriminate finely among a small number of phonemes, using correspondingly specialized property extractors as input.

ACKNOWLEDGMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The Centre de Recherche en Informatique de Montréal (CRIM) kindly provided computing time on its facilities.

REFERENCES

[1] F. Jelinek, "The development of an experimental discrete dictation recognizer", Proceedings of the IEEE, pp. 1616-1624, November 1984.

[2] S. Seneff, "Pitch and spectral analysis of speech based on an auditory synchrony model", RLE Technical Report 504, MIT, 1985.

[3] D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning internal representations by error propagation", in Parallel Distributed Processing, vol. 1, pp. 318-362, MIT Press, 1986.

[4] D.C. Plaut and G.E. Hinton, "Learning sets of filters using back-propagation", Computer Speech & Language, vol. 2, pp. 35-61, 1987.

[5] H. Bourlard and C.J. Wellekens, "A link between Markov models and multilayer perceptrons", Proc. of the 1988 IEEE Conference on Neural Information Processing Systems, Denver, CO.

[6] H. Bourlard and C.J. Wellekens, "Multilayer perceptrons and automatic speech recognition", IEEE First International Conference on Neural Networks, San Diego, pp. IV-407-IV-416, June 1987.

[7] R.L. Watrous and L. Shastri, "Learning phonetic features using connectionist networks", Proceedings of the 10th International Joint Conference on Artificial Intelligence, 1987, pp. 851-854.

[8] A. Waibel, T. Hanazawa and K. Shikano, "Phoneme recognition: neural networks vs. hidden Markov models", Proc. ICASSP 1988, paper S3.3.

[9] L.R. Bahl, P.F. Brown, P.V. de Souza and R.L. Mercer, "Speech recognition with continuous-parameter hidden Markov models", Proc. ICASSP 1988, pp. 40-43.

[10] A. Waibel, "Modularity in neural networks for speech recognition", Proc. of the 1988 IEEE Conference on Neural Information Processing Systems, Denver, CO.

[11] R. De Mori, E. Merlo, M. Palakal and J. Rouat, "Use of procedural knowledge for automatic speech recognition", Proceedings of the 10th International Joint Conference on Artificial Intelligence, Milan, August 1987, pp. 840-844.

[12] R. De Mori, P. Laface and Y. Mong, "Parallel algorithms for syllable recognition in continuous speech", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-7, no. 1, pp. 56-69, 1985.

[13] Y. Bengio, R. Cardin, P. Cosi and R. De Mori, "Use of multilayered networks for coding speech with phonetic features", Proc. of the 1988 IEEE Conference on Neural Information Processing Systems.

[14] Y. Bengio and R. De Mori, "Speaker normalization and automatic speech recognition using spectral lines and neural networks", Proc. of the 1988 Canadian Conference on Artificial Intelligence.
