Draft of the paper with the same title, distributed for academic non-commercial purposes; published as Gomez, P., Rodellar, V., Alvarez, A., Bobadilla, J., Bernal, J., Nieto, V., & Perez, M. (1995). "Estimation of Speech-Formant Dynamics using Neural Networks", Proc. of the Fourth European Conference on Speech Communication and Technology. ©ISCA http://www.isca-speech.org/archive/eurospeech_1995/e95_2221.html
Estimation of Speech-Formant Dynamics using Neural Networks

Formant Dynamics is an interesting research field in Speech Perception, Speech Parameterization, Synthesis, and Recognition. The Spectrogram shown in Fig. 1.a, corresponding to the utterance "I wished you were here a year ago", reveals important changes with time in the positions of the poles of the Vocal Tract Transfer Function. The associated template shown in Fig. 1.b gives the set of PARCOR parameters generating the Spectrogram. As could be expected, changes in the Spectrogram are related to changes in the PARCORgram. In many applications, such as Phone Labelling [Robinson.94] or Formant Tracking [Nagayama.94], it would be desirable to detect the set of Vector Parameters responsible for these changes, or to infer Formant Dynamics from the Vector Parameters. The present work describes one such application, using Gradient-Adaptive Lattices and Time-Delay Neural Networks.

Figure 1. a) Spectrogram of the utterance "I wished you were here a year ago", by a male speaker. b) Associated set of PARCOR parameters used to generate the Spectrogram, obtained from the output of a Gradient-Adaptive Lattice Filter re-sampled every 5 msec.
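To illustrate how a Gradient-Adaptive Lattice can track PARCOR (reflection) coefficients sample by sample, a minimal sketch follows. This is not the original implementation: the function name, the unnormalized gradient update, and the step size mu are illustrative assumptions; a practical front end would also pre-emphasize the signal and re-sample the trajectory every 5 msec as described above.

```python
import numpy as np

def gal_parcor(u, order=16, mu=0.001):
    """Gradient-Adaptive Lattice: track PARCOR (reflection) coefficients.

    u: 1-D speech signal. Returns a (len(u), order) array holding the
    coefficient trajectory h_n (one row per sample).
    """
    k = np.zeros(order)             # reflection coefficients k_m
    b_prev = np.zeros(order + 1)    # delayed backward errors b_m(n-1)
    traj = np.zeros((len(u), order))
    for n, sample in enumerate(u):
        f = np.zeros(order + 1)     # forward prediction errors f_m(n)
        b = np.zeros(order + 1)     # backward prediction errors b_m(n)
        f[0] = b[0] = sample
        for m in range(1, order + 1):
            f[m] = f[m - 1] - k[m - 1] * b_prev[m - 1]
            b[m] = b_prev[m - 1] - k[m - 1] * f[m - 1]
            # gradient-descent update on f_m^2 + b_m^2
            k[m - 1] += mu * (f[m] * b_prev[m - 1] + b[m] * f[m - 1])
        b_prev = b
        traj[n] = k
    return traj
```

On a first-order autoregressive signal x(n) = a·x(n-1) + e(n), the first reflection coefficient of this sketch converges to a, which is the expected PARCOR value for such a process.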

The problem the researcher faces when relating the Spectrogram to Vector Parameters, such as PARCOR, LPC, FFT-mel-scale, Cepstrum, or others, is the difficulty of expressing their underlying nonlinear relations explicitly. As an example, consider the relation between the sets of formants {ϕ_i^n} and PARCOR parameters {h_i^n}:

    ϕ_i^n = F{h_i^n};  1 ≤ i ≤ k                                (1)

which is strongly nonlinear and cannot be established analytically. Nagayama showed recently [Nagayama.94] that, using standard Back-Propagation Networks, a relationship equivalent to Eq. 1 could be established between FFT-derived mel-scale parameters and x-y mappings on the Formant Space. These are known as Sammon Mappings, after an early work by Sammon [Sammon.69]. The present approach demonstrates the possibility of establishing associative linear relationships in the domain of the dynamic features of both sets of parameters, as long as the sampling conditions ensure the quasi-stationarity of the spectrum. Besides, it will be shown that certain Time-Delay Networks may optimally represent such relationships, as:

    ϕ'_n = W [h_n^T, h_{n-1}^T]^T                               (2)

where W is the Matrix of Weights associating the sets of features, which may be expressed as:

    W = [ℑ[F], -ℑ[F]]                                           (3)

ℑ[F] being the Jacobian Matrix derived from Eq. 1. W may be evaluated by a Time-Delay Neural Network, using the Gradient Algorithm to minimize the error between the estimated and the real outputs of the network:

    ε_n = ϕ'_n - y_n = ϕ'_n - W_n x_n                           (4)

    x_n = [h_n^T, h_{n-1}^T]^T                                  (5)

    w_ij^{n+1} = w_ij^n + 2 µ ε_j^n x_i^n;  1 ≤ j ≤ r;  1 ≤ i ≤ 2k    (6)

µ being the Adaptation Step, k = 16 being the dimension of each input template, and r = 2 the dimension of the output vector, as only the first two formants will be used to characterize most of the dynamic sounds of interest. Figure 2 shows the General Framework used to train and apply such a Neural Network.
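The adaptation loop of Eqs. 2-6 can be sketched as follows. This is a minimal NumPy sketch, not the original implementation: the function name, the synthetic ordering of the code books, and the default values of mu and the epoch count are illustrative assumptions.

```python
import numpy as np

def tdadaline_train(H, Phi, mu=0.01, epochs=100):
    """Train a Time-Delay ADALINE: y_n = W [h_n; h_{n-1}] (Eq. 2).

    H   : (N, k) array of input vector templates h_n (k = 16 PARCOR coeffs).
    Phi : (N, r) array of target formant vectors (r = 2, first two formants).
    Returns the (r, 2k) weight matrix W.
    """
    N, k = H.shape
    r = Phi.shape[1]
    W = np.zeros((r, 2 * k))                       # weight matrix W_n
    for _ in range(epochs):
        for n in range(1, N):
            x = np.concatenate([H[n], H[n - 1]])   # Eq. 5: x_n = [h_n, h_{n-1}]
            y = W @ x                              # network output y_n
            err = Phi[n] - y                       # Eq. 4: eps_n = phi'_n - y_n
            W += 2 * mu * np.outer(err, x)         # Eq. 6: LMS weight update
    return W
```

When the targets are generated by an exact linear map of the delayed input pair, this noise-free LMS recursion recovers that map, consistent with the associative linear relationship of Eq. 2.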
The main relevance of this technique is the low computational cost of the Retrieving Process, provided that the Network has been properly trained. The sounds of interest for our study comprise mainly glides and diphthongs, although the set of plosives (voiced and unvoiced), the affricates (voiced and unvoiced), the voiced approximants and, in general, those sounds in which Formant Dynamics is determinant for their perception and discrimination [Klatt.87] could also be included. For the experiments shown in Fig. 3 we used the following materials: Training Group: 5 samples of the words father, fee, and foot spoken by 6 different speakers (3 male and 3 female). Testing Group: 5 samples of the utterance I wished you were here a year ago spoken by the same speakers. A Neural Network like the one shown in Fig. 2 was trained with material from the Training Group.

(Figure 2 diagram, not reproduced: a) the Speech Trace u(t) feeds a Gradient-Adaptive Lattice producing the Code Book of Vector Templates (16-dim PARCORgram h_n), while the L-D Algorithm yields LPC parameters a_n, whose Transfer Function evaluation produces the Spectrogram f_n(m); a Formant Extractor then builds the Code Book of Objectives. b) The ADALINE with weights W_n takes h_n and h_{n-1} and produces y_1n, y_2n; the error ε_n against the objectives ϕ'_1n, ϕ'_2n (First Two Formants from the Spectrogram) drives the Gradient Adaptive Algorithm update ∆W_n.)

Figure 2. General Framework for training and applying the TD-ADALINE. a) Methodology for Code-Book building. b) Training the Neural Network.

The Code-Books of Input Vector Templates and Output Objectives were composed of 1476 16-dim. and 2-dim. vectors, respectively. 273 Training Epochs were necessary to bring the Network to a convergence rate under 0.001. Once training was completed, the recordings in the Testing Group were fed to the Network. Figure 3 shows the x-y plots produced by the Network for one of these samples (Male Speaker #3, Utterance #3), whose spectrogram is given in Fig. 1.
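Once W is fixed, retrieval reduces to a single matrix-vector product per frame, which is what makes the Retrieving Process so cheap. A hypothetical sketch (the function name and array layout are illustrative assumptions, not the original code):

```python
import numpy as np

def retrieve_formants(W, H):
    """Retrieve first-two-formant trajectories from a trained TD-ADALINE.

    W : (2, 32) trained weight matrix.
    H : (N, 16) sequence of PARCOR vector templates from the test utterance.
    Returns an (N-1, 2) array of estimated (phi'_1, phi'_2) points,
    i.e. the coordinates of the x-y plots of Fig. 3.
    """
    X = np.hstack([H[1:], H[:-1]])   # rows x_n = [h_n, h_{n-1}], Eq. 5
    return X @ W.T                   # one product per frame: y_n = W x_n
```

The vectorized form above is equivalent to applying y_n = W x_n frame by frame, so an entire utterance is mapped to its formant trajectory with a single matrix product.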

Figure 3. a) Trajectory for the group /aωI/, corresponding to the fragment between 640 and 880 msec. b) Idem for the group /çjuωεrç/ between 960 and 1220 msec. c) Idem for the group /çiεrεjir/ between 1300 and 1980 msec. d) Idem for the group /gëöow/ between 2020 and 2400 msec.

In Fig. 3.a the trajectory shows that the speaker tended to articulate /awI/ instead of /ajwI/. The second frame, corresponding to Fig. 3.b, shows that /š/ modifies the following diphthong, which is produced as /çjuωεrç/, and that the aspiration following the rhotic at the end of the frame was articulated as /ç/ rather than as /h/ [Laver.94]. The frame of Fig. 3.c shows a complicated pattern, oscillating rapidly between /I/, /i/, /ε/ and /ër/. Finally, in frame 3.d, the plot shows the influence of the palatal occlusion produced by /g/ on the next sound, which initiates as /I/, is followed by /ë/, and approaches /ö/, ending in /o/. Due to the simple mechanisms involved in the Retrieving Process, these x-y plots were obtained in real time on a PC 486 machine, without the need for additional DSP hardware. The data acquisition platform was a SoundBlaster 16 ASP card. The system may be used as a Microphonic Joystick for Perception and Production Reinforcement in applications of Computer-Assisted Language Learning/Training (CALL-CALT) [Gómez.94]. This is especially useful in Accent Reduction for non-native speakers studying English as a Foreign Language [Sarabasa.94].

References

[Klatt.87] D. H. Klatt, "Review of text-to-speech conversion for English", J. Acoust. Soc. Am., Vol. 82, No. 3, September 1987, pp. 737-793.

[Laver.94] J. Laver, Principles of Phonetics, Cambridge University Press, Cambridge, UK, 1994.


[Nagayama.94] I. Nagayama, N. Akamatsu and T. Yoshino, "Phonetic visualization for Speech Training System by using Neural Network", Proc. of the ICSLP'94, Yokohama, Japan, September 18-22, 1994, pp. 2027-2030.

[Robinson.94] A. J. Robinson, "An Application of Recurrent Nets to Phone Probability Estimation", IEEE Trans. on Neural Networks, Vol. 5, No. 2, March 1994, pp. 298-305.

[Gómez.94] P. Gómez, D. Martínez, V. Nieto and V. Rodellar, "MECALLSAT: A Multimedia Environment for Computer-Aided Language Learning incorporating Speech Assessment Techniques", Proc. of the ICSLP'94, Yokohama, Japan, September 18-22, 1994, pp. 1295-1298.

[Ritter.92] H. Ritter, T. Martinetz and K. Schulten, Neural Computation and Self-Organizing Maps, Addison-Wesley, Reading, MA, 1992.

[Sarabasa.94] A. Sarabasa, "Perception and Production Saturation of Spoken English as a first phase in reducing Foreign Accent", Proc. of the ICSLP'94, Yokohama, Japan, September 18-22, 1994, pp. 2015-2018.

[Sammon.69] J. W. Sammon, "A Nonlinear Mapping for Data Structure Analysis", IEEE Trans. on Computers, Vol. 18, No. 5, 1969, pp. 401-409.
