A method to extract articulatory parameters from the speech signal using Neural Networks

António Branco†, Ana Tomé†, António Teixeira†, Francisco Vaz‡

† Departamento de Electrónica e Telecomunicações da Universidade de Aveiro, Campus Universitário - 3800 AVEIRO, PORTUGAL
email: [email protected]
‡ INESC, Campus Universitário, 3800 AVEIRO, PORTUGAL

Abstract: In this paper we present a method that uses artificial neural networks for acoustic to articulatory mapping. An assembly of Kohonen neural nets is used: in the first stage a network maps cepstral values, and each neuron contains a sub-net in a second stage that maps the articulatory space. The method allows both the acoustic to articulatory mapping, ensuring smoothly varying vocal tract shapes, and the study of the non-uniqueness problem.

I. Introduction

Articulatory speech synthesis models the vocal apparatus with slowly varying physiological parameters (Fig. 1), e.g. tongue body, lips, velum, etc., and promises naturalness in text-to-speech applications. Nevertheless, the estimation of the vocal tract shape from the speech signal is a long-standing problem, due essentially to the non-uniqueness of the acoustic to articulatory mapping.

Several methods have been studied for the acoustic to articulatory mapping, as reported by Schroeder [11], Wakita [16], Sondhi [14] and Atal [1]; the main conclusion of their work is that the speech signal alone is inadequate to determine the area function uniquely. Schroeter et al. [12] used a codebook look-up procedure combined with dynamic programming. Rahim et al. [10] introduced a neural network approach using an assembly of MLPs, each designed for a specific region of the articulatory space.

This paper describes a neural network approach that uses an assembly of Kohonen [3,4,5,6] neural nets. In the first stage a network maps LPC-derived weighted cepstral values [9]. Each neuron of the first stage contains a sub-net in a second stage that maps the articulatory space. The method allows both the acoustic to articulatory mapping, ensuring smoothly varying vocal tract shapes, and the study of the non-uniqueness problem.

II. Articulatory Synthesizer

The articulatory model that is used was first developed by Mermelstein [8]. The points represented in Fig. 1 are the main articulators: C is the tongue body, V is the velum, T is the tongue tip, J represents the jaw, the lips are represented by LIPP (protrusion) and LIPO (opening), and P represents the hyoid. The lowest part of the pharynx is defined by WH, HK1 and G1K.

Fig. 1 - Vocal tract outline based on the Mermelstein model [8].
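For concreteness, the sketch below bundles these ten articulator values into one record. The field names follow the labels above, but the scalar typing is an illustrative simplification (in the Mermelstein model several of these points are 2-D positions), and no units or ranges from the paper are implied.

```python
from dataclasses import dataclass, astuple

@dataclass
class ArticulatoryConfig:
    """One static vocal tract configuration (labels as in Fig. 1).
    Scalar fields are a simplification; units are unspecified here."""
    C: float      # tongue body
    V: float      # velum
    T: float      # tongue tip
    J: float      # jaw
    LIPP: float   # lip protrusion
    LIPO: float   # lip opening
    P: float      # hyoid
    WH: float     # lower pharynx outline
    HK1: float    # lower pharynx outline
    G1K: float    # lower pharynx outline

    def as_vector(self) -> list:
        """10-dimensional parameter vector, used later as sub-net input."""
        return list(astuple(self))
```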


Considering a lossy vocal tract (heat conduction, viscosity and yielding walls), assuming plane-wave propagation and reducing the vocal tract to K cylindrical sections terminated with a radiation impedance, one can compute the transfer function of the vocal tract. We computed the transfer function of the vocal tract using a hybrid time-frequency domain method, based on the work of Sondhi [13,15].
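The hybrid method of [13,15] includes the loss terms; as a simpler illustration of the concatenated-tube idea alone, the sketch below computes the frequency response of a lossless K-section tube with an idealized (short-circuit) radiation load using acoustic chain matrices. The area function and section length are placeholder values, so this is a toy model rather than the synthesizer actually used in the paper.

```python
import numpy as np

RHO = 1.14e-3    # air density (g/cm^3), a typical value in speech work
C_SOUND = 3.5e4  # speed of sound (cm/s)

def tube_transfer_function(areas, seg_len, freqs):
    """Volume-velocity transfer function U_lips/U_glottis of a lossless
    tube of K cylindrical sections (chain-matrix formulation).
    areas: K cross-sections in cm^2, ordered glottis -> lips."""
    H = np.zeros(len(freqs), dtype=complex)
    for idx, f in enumerate(freqs):
        beta = 2 * np.pi * f / C_SOUND          # wavenumber
        M = np.eye(2, dtype=complex)            # [P_g, U_g]^T = M [P_l, U_l]^T
        for A in areas:
            Z0 = RHO * C_SOUND / A              # characteristic impedance
            cos_bl, sin_bl = np.cos(beta * seg_len), np.sin(beta * seg_len)
            M = M @ np.array([[cos_bl, 1j * Z0 * sin_bl],
                              [1j * sin_bl / Z0, cos_bl]])
        # Idealized termination: P_lips = 0, so U_lips / U_glottis = 1 / M[1, 1].
        H[idx] = 1.0 / M[1, 1]
    return H

freqs = np.linspace(50, 5000, 500)
areas = np.array([2.0, 1.2, 0.8, 1.5, 3.0, 4.5, 3.5, 2.0])  # toy area function
H = tube_transfer_function(areas, seg_len=17.5 / len(areas), freqs=freqs)
# Peaks of |H| approximate the formant frequencies of this configuration;
# a uniform 17.5 cm tube would give the familiar 500/1500/2500 Hz pattern.
```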


The formant generators were computed using a fast and robust method proposed by Lin [7]. The glottal excitation was modelled with the Rosenberg polynomial [17]. For each articulatory configuration we were able to compute the LPC-derived cepstral parameters from the synthesized sound.
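The LPC-to-cepstrum step can be sketched with the standard recursion given in Rabiner and Juang [9], followed by the sinusoidal band-pass lifter of Juang, Rabiner and Wilpon [2], which reappears later as the weighting J(n). The predictor coefficients below are placeholders; in practice they would come from an LPC analysis of the synthesized frame.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients via the recursion in [9].
    `a` holds a_1..a_p of the all-pole model H(z) = G / (1 - sum a_k z^-k)."""
    p = len(a)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c

def juang_lifter(n_ceps):
    """Band-pass lifter J(n) = 1 + (L/2) sin(pi n / L) from [2]."""
    n = np.arange(1, n_ceps + 1)
    return 1.0 + (n_ceps / 2.0) * np.sin(np.pi * n / n_ceps)

a = np.array([1.5, -0.9, 0.2, 0.1])   # toy LPC coefficients (illustrative)
c = lpc_to_cepstrum(a, 18)            # 18 cepstral parameters, as in the paper
c_weighted = juang_lifter(18) * c     # weighted cepstral input vector
```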

III. Topology

The principal net is a Kohonen net with a lateral dimension of 6x6, with 18 cepstral parameters as input. Each centroid of the cepstral net contains a 2x2 Kohonen sub-net, with both articulatory and cepstral parameters as input (see Fig. 2).

Fig. 2 - Representation of the dynamic Kohonen codebook. Each centroid of the cepstral Kohonen net (6x6) contains a spatial Kohonen net (2x2).

The articulatory parameters used for input were the articulators of the articulatory model: C, V, T, J, LIPP, LIPO, P, WH, HK1 and G1K.

IV. Training the Assembly of Kohonen Neural Networks

A - Learning Rule

Based on the previous work of Kohonen [3,4,5,6], the weights are updated with the following rule:

w_{i,j}(t+1) = w_{i,j}(t) + ε(t)·(x_j(t) − w_{i,j}(t))·Λ(i*, i),  for i ∈ N_{i*}, j = 1, 2, ..., N

where x_j is the j-th input at time t, and w_{i,j}(t) is the weight of the connection between input j and neuron i at time t. N_{i*} is the winner i* together with its neighbours. ε(t) is a decreasing learning constant. Λ(i*, i) is a decreasing neighbourhood function between the winner and the neuron that updates its weights. The decreasing neighbourhood function can be modelled with a function like a "Mexican hat", given by

Λ(i*, i) = exp(−||r_i − r_{i*}||² / 2σ²(t))

where r_i is the position of the neuron and r_{i*} is the position of the winner. The neighbourhood decreases over time through the decreasing σ,

σ(t) = σ_i (σ_f / σ_i)^(t / t_max)

where σ_i is the initial sigma, σ_f is the final sigma, t is the time and t_max is the maximum time of the decrease.

The decreasing learning constant function is given by

ε(t) = ε_i (ε_f / ε_i)^(t / t_max)

where ε_i is the initial learning constant, ε_f is the final learning constant, t is the time and t_max is the maximum time for the decrease. Good results were obtained with t_max = 100 iterations.

As a rule of thumb for a good expansion of the Kohonen net, we used for the initial neighbourhood half the number of neurons on the side of the plane (e.g. for a 4x4 Kohonen net, an initial neighbourhood of 2.0 must be used). Good results for the learning were obtained using a learning constant of 0.8. The ratio between the initial and final values of both the neighbourhood and the learning constant was 1/10. The values used were:

σ_i = half of the plane side of the Kohonen net
σ_f = σ_i / 10
ε_i = 0.8
ε_f = 0.08
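A compact sketch of this update, with the Gaussian-shaped neighbourhood and the exponential σ(t) and ε(t) schedules above, might look as follows. The grid size, input dimension and schedule constants match the main net described in the paper; the random initialization and random stand-in inputs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

SIDE, DIM = 6, 18                        # 6x6 main net, 18 cepstral inputs
T_MAX = 100                              # iterations, as in the paper
SIGMA_I, SIGMA_F = SIDE / 2, SIDE / 20   # sigma_f = sigma_i / 10
EPS_I, EPS_F = 0.8, 0.08

W = rng.random((SIDE, SIDE, DIM))        # codebook weights w_{i,n}
grid = np.stack(np.meshgrid(np.arange(SIDE), np.arange(SIDE),
                            indexing="ij"), axis=-1)  # neuron positions r_i

def train_step(x, t):
    """One Kohonen update for input vector x at iteration t."""
    sigma = SIGMA_I * (SIGMA_F / SIGMA_I) ** (t / T_MAX)
    eps = EPS_I * (EPS_F / EPS_I) ** (t / T_MAX)
    # Winner i*: neuron whose weight vector is closest to x.
    dists = np.sum((W - x) ** 2, axis=-1)
    i_star = np.unravel_index(np.argmin(dists), dists.shape)
    # Neighbourhood Lambda(i*, i) = exp(-||r_i - r_i*||^2 / (2 sigma^2)).
    d2 = np.sum((grid - np.array(i_star)) ** 2, axis=-1)
    lam = np.exp(-d2 / (2 * sigma ** 2))
    W[...] = W + eps * lam[..., None] * (x - W)

for t in range(T_MAX):
    train_step(rng.random(DIM), t)   # random stand-in for a cepstral pattern
```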

B - Training Patterns

The training patterns are generated by randomly sampling the articulatory space of the Mermelstein model [8] for voiced sounds in static configurations. The input patterns for the main network were 18 LPC-derived weighted cepstral parameters with the Juang [2] lifter. The input patterns for the sub-nets were the articulatory parameters of the Mermelstein model together with the cepstral parameters. The codebooks stored in the two-level nets are created in three consecutive steps. First, the cepstral net is trained, i.e., the net weights (w_{i,n}) are calculated according to the Kohonen learning rule explained in the last section. After that, the learning patterns are divided into groups; each group holds the patterns for which a particular cepstral neuron won. Then, the spatial sub-net connected to a particular cepstral centroid is trained with the group of patterns associated with it. After this training phase, the sub-net weights (ς_{j,n} and β_{j,k}) have been computed using the Kohonen learning rule stated before.
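The three steps can be sketched as below. The helper names and the representation of a pattern as a (cepstral vector, articulatory vector) pair are assumptions for illustration; `train_som` condenses the update rule from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_som(data, side, t_max=100):
    """Minimal Kohonen training loop (same update rule as sketched above)."""
    W = rng.random((side, side, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(side), np.arange(side),
                                indexing="ij"), axis=-1)
    for t in range(t_max):
        x = data[rng.integers(len(data))]
        sigma = (side / 2) * 0.1 ** (t / t_max)   # sigma_i -> sigma_i / 10
        eps = 0.8 * 0.1 ** (t / t_max)            # 0.8 -> 0.08
        d = np.sum((W - x) ** 2, axis=-1)
        i_star = np.unravel_index(np.argmin(d), d.shape)
        lam = np.exp(-np.sum((grid - np.array(i_star)) ** 2, axis=-1)
                     / (2 * sigma ** 2))
        W += eps * lam[..., None] * (x - W)
    return W

def winner(W, x):
    d = np.sum((W - x) ** 2, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

def build_codebook(patterns):
    """Three-step construction of the two-level codebook."""
    # Step 1: train the 6x6 cepstral net on the cepstral vectors alone.
    main_W = train_som(np.array([c for c, _ in patterns]), side=6)
    # Step 2: group the patterns by the cepstral neuron that wins them.
    groups = {}
    for c, art in patterns:
        groups.setdefault(winner(main_W, c), []).append(np.concatenate([c, art]))
    # Step 3: train one 2x2 spatial sub-net per centroid on its own group,
    # on joint vectors of 18 cepstral + 10 articulatory components.
    sub_W = {k: train_som(np.array(g), side=2) for k, g in groups.items()}
    return main_W, sub_W

# Toy usage: 200 random (cepstral, articulatory) pattern pairs.
patterns = [(rng.random(18), rng.random(10)) for _ in range(200)]
main_W, sub_W = build_codebook(patterns)
```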

V. Acoustic-to-Articulatory Mapping

Inverse mapping is done by retrieving the information stored in the codebooks. First, the winner of the cepstral net must be found using a cepstral distance. Then, the spatial winner is found using the sub-net connected to the winning cepstral neuron. The spatial winning configuration is determined using a distance that combines both cepstral and spatial distances. The cepstral net cost function for each neuron is the weighted cepstral distance

dist_cep(i) = Σ_{n=1}^{18} [J(n)·c_n − w_{i,n}]²,  i = 1, 2, ..., I

where J(n) is the Juang lifter value of the n-th cepstral parameter (c_n) and I is the number of neurons of the net. The winner neuron of this net must have the minimum value of dist_cep(i). After that, for the spatial net connected to the winning cepstral neuron, the cost function of each sub-net neuron is

dist_spa(j) = α Σ_{k=1}^{10} [A_k − β_{j,k}]² + (1 − α) Σ_{n=1}^{18} [J(n)·c_n − ς_{j,n}]²,  j = 1, 2, ..., J

where A_k is an articulatory parameter, α takes the value 0.5, and J is the number of neurons of the spatial sub-net. The sub-net winner neuron likewise has the minimum value of the spatial distance dist_spa(j).
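A direct implementation of this two-stage lookup could take the form below. It assumes the `main_W` and `sub_W` codebooks from the training sketch earlier, and it feeds the previous frame's articulatory estimate in as A_k; that reading of the combined distance, as the mechanism that keeps successive vocal tract shapes close, is an assumption, since the paper does not spell it out here.

```python
import numpy as np

ALPHA = 0.5   # weight of the articulatory term, as in the paper

def inverse_map(c_weighted, prev_art, main_W, sub_W):
    """Two-stage codebook lookup: cepstral winner, then spatial winner.
    c_weighted: lifter-weighted 18-dim cepstral vector (J(n) * c_n).
    prev_art:   previous frame's 10 articulatory parameters (assumed
                to play the role of A_k in the combined distance)."""
    # Stage 1: winner of the 6x6 cepstral net by weighted cepstral distance.
    d_cep = np.sum((main_W - c_weighted) ** 2, axis=-1)
    i_star = np.unravel_index(np.argmin(d_cep), d_cep.shape)
    # Stage 2: combined cost over the attached 2x2 sub-net, whose weight
    # vectors hold 18 cepstral plus 10 articulatory components.
    W = sub_W[i_star]
    d_spa = (ALPHA * np.sum((W[..., 18:] - prev_art) ** 2, axis=-1)
             + (1 - ALPHA) * np.sum((W[..., :18] - c_weighted) ** 2, axis=-1))
    j_star = np.unravel_index(np.argmin(d_spa), d_spa.shape)
    return W[j_star][18:]    # articulatory parameters of the winning neuron
```

Run frame by frame, with each output fed back as `prev_art`, this yields a smoothly varying articulatory trajectory.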

VI. Simulation Results and Conclusion

We obtained good results in the expansion of the Kohonen neural nets using the decreasing neighbourhood and learning functions. The method proved to be a good way to understand the non-uniqueness of the acoustic to articulatory mapping, essentially because the topology of the neural networks allows the problem to be studied: it gathers in the same spatial sub-net configurations with similar cepstral properties. By inspecting these configurations one can easily find any articulatory differences between them while ensuring similar cepstral properties. Using this method we were able to find configurations similar in cepstral distance but different in the corresponding articulatory parameters (Fig. 3).

Fig. 3 - Example of different articulatory configurations with similar transfer functions.

According to the examples described in the literature [18,19], and with interpolations between them, we were also able to create codebooks storing the articulatory configurations of the Portuguese oral vowels.

Another experiment in inverse mapping was done with oral vowels spoken by a 24-year-old male. Comparing the obtained formant values with results in previous studies, the F1/F2 graphic (Fig. 4) displayed a similar acoustic triangle configuration [19]. This result validates the original vowel configurations described in the literature.

the Art," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-27, nOl,pp. 281-285. [17] Ciinimiiigs ,"Glotal Models for Digital Speech Processing: A historical survey and new results," Digital Signal Processing, A Review Journal - Academic Press. Vo1.5 , 1995.

VI. Acknowledgements

[ I X ] Borden, G., Harris,K., Raphael, L. (1980) "Speech

One of the au~hors(A.Branco) is grateful to JNICT(PRAX1S XXI) lor the financial support dui-ing the course of this work. This research was performed in partial fulfillment of the requirements of the University of Aveiro for the Msc in Electronics and Telecoininunications

Science Primer," William 8:Wilkins.

[19] Martins,M. (1971), "Aiijlise aclistica das vogais orais t6nlcas em PortuguW Instittito de FoiiCtica da Faculdadc de Letras de Lisboa, Novembro de 197 I

VII. References [I] Atal, B., Chang, J., Mathcws, M., Tukey, J.;"lnversion of articulatory-to-acoustic transformation in the vocal tract by computer solting technique," J.Acoust. Soc. Am. Vol 63(5), pp. 1535-1555, 1978. [2] Juang, B., liabiiier, L., Wilpon, J, (1987) "On the use of batid-pass Iiftering i n speech recognition." E E E Trans. Acousl., Specch, Signal Processing, vol. ASSP-35, pp. 947954. [3] Kohonen, T. ( 1 982), "Self Organized Formation of Topologically Cot-rect Feature Maps," Biological Cybernetics, 11'43, pg. 55-69. [4] Kohonen, T. (I982b), "Clustenng Taxonomy, and Topological Maps of Pattems," Proceedings of the hth intematiolial conference on pattern recognitloti, October. [ S ] Kohonen, T. (1 987). "Adaptive, associative, and selforganizing fiinctions in iieiil.al computing," Applied Optics, Vol. 26, no23, 1 December. [GI Kohonen, T. ( I 988), "Tlic Neural phonetic typewriter," Computer, March. [7] Lin, Qiguang (I99S), "A Fast Algorithm for Computing the Vocal-Tract linpiilse Response from the Transfer Function," IEEE Transactions on Speech and Audio I'rocessing, Vnl.3, nob, November I995 181 Mernielstein, P. (1973). "Articulatory model for ilie study of speech production," J.Acoust. Soc. Ani. Vol. 53, pp. 829841. [9] Rabiner, I-., Juang. D., "Fundanicntals of Speech liecognition," Prentice Hall. [IO] Rahiin, M., Goodyear, C.. et all (1993), "On the use of neural networks i n articulatory speech syiithesis,"J.Acoust. Soc. Am. Vol. 9 3 ( 2 ) , pp. 1109-I 121. [I I ] Schroeder, M.(1967). "l~eterminationof the Geometry of the Human Vocal Tract by Acoustic Measurements," J.Acoust. Soc. Ani. Vol 41(4), pp. 1002-1010. [I21 Schroeter,l., Soiitllii. M . (1994),"TecIinicjiies for Estimatinx Voc;il-Tract Shapes from the Spcech Signsl,"lEEE Transactions oil Speech and Audio Processing, Vol.2,n"l ,PART 11, Jaiiiiary 1994. [ I l l Sondhi, M. (1974) "Model for wave propagatioii 111 a lossy vocal tmct," J . Acoust. Soc. Am., Vol. 55(5), pp. 1070-

1075. [ 141 Sontllii, M. (19791, "Esriniation of Vocal-Tract Areas:

The Need lor Acouslical Measurements," IEEE Transactions oil Acoustics, and Si&nal I'rocessing, Vol. ASSP-27, n"3, pp 26X-271.

[ I S ] Sondhi. M., Schroeter, J. (1987), "A Hybrid TimeFrequency Domain Articulatory Speech Synthesizer," lEEE Transactions 011 Acoustics, and Signal I'rocessing, Vol. ASSP-35, Ii"7. pp.955-967. [ 161 Wakita, H. (1979), "Estimation of Vocnl-Tract Shapes fi-om Acoustical Analysis of the Speech Wave: The State of

