EUROSPEECH 2003 - GENEVA
Probability Models of Formant Parameters for Voice Conversion Dimitrios Rentzos
Saeed Vaseghi
Qin Yan Ching-Hsiang Ho* Emir Turajlic
Department of Electronics and Computer Engineering Brunel University, Middlesex UB8 3PH, UK *Fortune Institute of Technology, Kaohsiung, Taiwan, 842, R.O.C. (Dimitrios.Rentzos, Saeed.vaseghi, Qin.Yan, Emir.turajlic)@brunel.ac.uk,
[email protected]
Abstract This paper explores the estimation and mapping of probability models of formant parameter vectors for voice conversion. The formant parameter vectors consist of the frequency, bandwidth and intensity of resonance at formants. Formant parameters are derived from the coefficients of a linear prediction (LP) model of speech. The formant distributions are modelled with phonemedependent two-dimensional hidden Markov models with state Gaussian mixture densities. The HMMs are subsequently used for re-estimation of the formant trajectories of speech. Two alternative methods are explored for voice morphing. The first is a non-uniform frequency warping method and the second is based on spectral mapping via rotation of the formant vectors of the source towards those of the target. Both methods transform all formant parameters (Frequency, Bandwidth and Intensity). In addition, the factors that affect the selection of the warping ratios for the mapping function are presented. Experimental evaluation of voice morphing examples is presented.
1. Introduction Voice conversion has applications in all voice output systems such as text to speech synthesis, voice editing, broadcast systems, Internet, toys, robots, Karaoke and entertainment applications. An effective voice conversion system requires two essential components: (a) accurate probability models of the distributions of sets of feature vectors that characterise the source and the target speakers’ voice and (b) an effective signal processing system for mapping the source speaker’s voice to the target speaker’s voice. There are two broad approaches to voice conversion: (a) non-parametric mapping of the spectral vectors of a source speaker to those of a target speaker using a source-to-target spectral codebook [3,4], and (b) parametric (mainly LP) modelbased methods [5] of mapping via the modification of the source model parameters towards the estimates of the target model parameters. Parametric modelling allows a more flexible and selective modification of spectral parameters of the vocal tract and also allows modification of the glottal and prosodic parameters. In this paper we consider some of the practical issues in the parametric
2405
modelling and mapping of formants in the context of voice conversion. This paper is organised as follows: in section 2 formant estimation is described. Section 3 considers and compares two different methods of parametric mapping of formant features. Section 4 describes experimental results and section 5 concludes the paper.
2. Formant Probability Modelling and Trajectory Estimation To achieve formant-based voice mapping, an accurate formant model estimation is required to deal with the problems of the variability of the number of formants across the phonemes and the merging and de-merging of neighbouring formants (such as F2 and F3) over time. The problems can be alleviated using a hidden Markov model (HMM) based formant estimation procedure [1,6]. The formant candidate vectors, with unknown labels, are derived from the poles of an LP model of speech. The pole (formant candidate) features are frequency, bandwidth, delta frequency, delta bandwidth and intensity. A 2-D HMM with N left-to-right states distributed across frequency, and M states distributed across time, is used to classify the formant observations as shown in figure (1). The distribution of formants within each HMM state may be modelled by a mixture Gaussian distribution as M
p( Fk )
¦ ck N ( Fk , P Fk , V Fk )
(1)
k 1
where N ( Fk , P Fk , V Fk ) is a Gaussian model of Formant with mean P Fk and variance V Fk . To illustrate the accuracy with which the distributions of the formants are modelled by Gaussian mixture probability F re q u e n c y
T im e
Figure 1: A 2-D HMM formant tracking model.
EUROSPEECH 2003 - GENEVA
weight derived from the model.
3. Parametric Formant Mapping In this section two alternative methods for voice conversion via formant mapping are considered: (a) nonuniform spectral mapping and (b) LP frequency response mapping through rotation of the poles of LP models. 3.1 Spectrum Mapping Through Frequency Warping The equation for the transformation of the spectrum of a source speaker to a target speaker is expressed as (3)
Y ( f , t ) J ( f , t ) X [W ( f , t ) f ]
Figure 2: Formant histograms and models for male speaker
models of HMM states, the plots of histograms of formants were superimposed on the plots of the distributions of the mixture Gaussians of HMM states as shown in figure (2). The raw experimental data used are the frequencies of the poles of the LP model for all training data segments of each phoneme of a speaker. In figure (2) the peaks of the histogram and mixture Gaussian distribution curves correspond to the formants. Figure (2) illustrates, that as expected, the HMMs of formants closely fit the histograms of the distribution of formants. Given the phoneme-dependent HMM formant classifier and the associated formant bandwidths, formant contour estimation for a speech waveform is achieved through minimisation of a weighted mean square error objective function as 2· I t § ˆF t min k¦ w t ¨ Fi t Fk t ¸ (2) k ki ¨ 2 ¸ Fk t i 1 © BWi t ¹ where Ik(t) is the total number of poles in the tth speech frame classified as formant k and wki is a probabilistic
/3 0 RGHO S o u rc e Speaker HMM M odel
) R UP D Q W 7 U D MH F WR U \ : D U S LQ J ) D F WR U V
T a rg e t Speaker HMM M odel
) R UP D Q W 7 U D MH F WR U \
/3 0 RGHO
/3&6SHFWUXP:DUSLQJ3ROH5RWDWLRQ
S o u rc e Speech
F o rm a n t T r a c k in g
F o rm a n t M a p p in g
speaker , FiT,i 1,t , to that of the source speaker, FiS,i 1,t , as, a( i , t )
FiT,i 1,t
FiT1,t FiT,t
FiS,i 1,t
FiS1,t FiS,t
6 S H H FF KK 5 H F R QQ WLR QQ V WU X F WLR
i 1, , N
(4)
where FiT,t and FiS,t are the formant frequency values for the target and source respectively, i denotes the formant index and N is the total number of formants. The conversion ratios for bandwidth of resonance at formants, ȕ(f,t), are obtained as E ( i,t )
BW iT,t BW i S,t
i
1, , N
(5)
where BW i S and BW iT denote the source and target bandwidths for formant i. The conversion ratios for intensity of resonance at formants are derived in a similar way to the frequency warping ratios in equation 5. To apply the formant parameter conversion vectors to an LP spectrum with K spectral samples, the discrete warping ratios at the formants are interpolated to obtain K spectral values. The interpolation of frequency and intensity functions can be performed using one of the widely used polynomial interpolation methods such as the Hermite or
M apped Speech
T a rg e t Speech
M odel E s ti m a ti o n
where X, Y, t, and f denote the source spectrum, the transformed spectrum, and the time and frequency variables respectively. The method for warping the frequency axis is based on a formant-dependent function for rescaling the frequency axis in order to achieve the desired change in the frequency and bandwidth of the formants. The composite frequency warping function W(f,t) includes the mapping functions for both the formant frequency and bandwidth. The spectral amplitude shaping function J(f,t) is used to map the energy intensities between the source and target formant frequencies. In order to derive the parametric mapping functions W(f,t) and J(f,t) the source to target formant parameter vector conversion ratios are calculated; namely the formant frequency ratio vector Į(i,t), the formant bandwidth ratio vector ȕ(i,t) and formant intensity ratio vector Ȗ(i,t). The ratio D(i,t) is calculated from the ratio of frequency differences between two successive formants of the target
Speech R e c o n s tr u c t i o n
Figure 3: Spectrum Mapping Procedure
2406
(a)
Formant magnitude (dB)
EUROSPEECH 2003 - GENEVA
3dB
X (f) Y (f)
Fi-B W i/2
Warping function
1+x
where M ( pi ) , M ( pic ) are the angles of the original and mapped poles, and D(i,t), ȕ(i,t) are the frequency and bandwidth mapping ratios (equations 3,4).
Y (f)=X (ȕ(f) f)
Fi
Fi+B W i/2 Frequency (H z)
ȕ (f)
(b)
1
Frequency (H z)
1-x
Figure 4: Bandwidth warping: (a) original and reduced spectral shape of formant i (b) the warping function for reduction of formant’s bandwidth.
the piecewise cubic spline interpolation. The process of reduction of the bandwidth of a formant together with the bandwidth warping function are illustrated in figure 4. The frequency warping and bandwidth warping functions are combined to form a composite warping function as W ( f , t ) D ( f , t )* E ( f , t ) f 1, , K (6)
3.2 Spectrum Mapping Through Pole Rotation The formant and spectral shape of speech can be modified through rotation of the poles of an LP model of speech. The main issues in this method are: (a) the classification and labelling of the LP model poles with speech formants, (b) the mapping of intensity of resonance at formants which cannot be achieved by just moving the poles (c) the interaction of poles after rotation may distort the resultant spectrum if not controlled carefully. The first problem, the pole-formant pairing is solved using the HMM-based formant tracking. To achieve frequency warping and spectrum shaping, the poles’ angular frequencies are rotated to shift formant frequencies and the poles’ radii are modified to increase or decrease the formant bandwidth. The intensity of the entire spectrum after the poles have been moved is adjusted using the spectral shaping procedure described in section 3.1. The frequency of the resonance at a formant depends on the angle of the pole, and its bandwidth on the pole radius. The equations describing the frequency and bandwidth of a pole p with a sampling frequency Fs, are given by Fs (7) Fi angle( p) * 2S Fs BWi log(abs ( p)) * (8)
S
The poles frequencies and bandwidths are multiplied as, Fs M ( pic ) D ( i , t ) u u M ( pi ) (9) 2S Fs abs( pic ) E ( i ,t ) u log( abs( p )) u (10)
S
2407
3.3 Selection of Mapping Parameter Resolution The voice conversion ratios for mapping of the parameters of a source speaker towards the parameters of a target speaker can be made frame-dependent, phonemedependent or just one constant set of average mapping values may be used for a pair of source and target speakers. Average conversion vector: One set of average parameter vector conversion ratios are obtained over all vowels and used for mapping the source speech to target voice. Phoneme-based conversion vector: A set of Phonemedependent transformation ratios are obtained and applied to each cluster of triphones. This method required the phonemic labelling/segmentation of source speech prior to transformation. Frame-based conversion vector: The source and target speech frames are time-aligned and a set of transformation ratios is obtained for each frame.
4. Experiments Experiments were performed to train formant models of speakers and to determine the effectiveness of the mapping methods in converting the source speech to target speech. Three speaker-dependent databases, collected from three American English speakers, two males and one female were used as a case study. The databases consist of ten minutes of speech per speaker with a sampling rate of 10 kHz. An LP order of 13 is used and the LP coefficients are estimated every 10 ms using a 25 ms signal window. The first set of experiments obtained phoneme-dependent formant models for the source and target speakers, the models are used to obtain the best mapping function between the source and target speakers. Figure 5 shows a comparison of the formant models between two male speakers (M1/M2) and a male and a female (M1/F). It is interesting to note that for the male-to-male case the differences in frequencies are not large. This is expected since the two male speakers are of the same sex, accent and have a similar age. Thus for males or females of a similar age and physique the most important formant characteristics that distinguish two voices are the bandwidths and intensities of formants which can differ significantly as can also be shown in table 1. In the femaleto-male example the formant frequency difference is important (about 12%). In the male-to-male voice conversion often the bandwidth and in particular the /ae/, formant 2 M1/M2 M1/F F (%) 4.84 12.12 BW (%) 49 24 I (dB) -11.28 -10.75 Table 1: Frequency (F), bandwidth (BW) and intensity (I) differences for formant 2 of phoneme /ae/
EUROSPEECH 2003 - GENEVA
Figure 7: Illustrations of Source, target and mapped spectra of phonemes /iy/ and /ey/ Figure 5: Comparison of formant models for a male/male pair and a male/female speaker pair
spectral intensity differences are most significant correlates of voice characteristics. To evaluate the effectiveness of the phoneme-dependent mapping ratios, the average formant trajectories for phonemes of the source and target speakers were extracted (figure 6). Speech was mapped using both the spectrum warping and pole rotating methods. Two mapping examples can be seen in figure 7. Significant voice modifications can be achieved as in the first plot where the source has one less formant than the target which is “reproduced”. Pole rotating is a more flexible method in which theoretically any change in spectral shape can be performed. On the other hand its ability to produce large modifications in voice spectrum also makes it a less robust method than frequency warping. Experiments using all three conversion ratios resolutions (constant parameters, phoneme-based and frame based) were performed and it
Figure 6: Average formants trajectories vowels /eh/ and /uw/ for source (male1) and target (male2) speakers.
2408
was established that overall a smoothed phoneme-based method produced preferred results.
5. Conclusion The problems of modelling and mapping of the formants of a source speaker to a target speaker were explored. It is shown that he distributions of the formants are accurately modelled with 2D-HMMs of formants. A main focus of the paper was presentation and evaluation of different methods for formant mapping in voice conversion. A spectrum warping and a pole rotation method were evaluated and it was established that both methods are effective. Different mapping resolutions were evaluated, and the phoneme based mapping was found to produce the best results. Acknowledgement We acknowledge the support of UK’s Engineering and physical sciences research council Grant No GR/M98036/01.
6. References [1] A. Acero (1999), “Formant Analysis and Synthesis using Hidden Markov Models”, Proc. of the Eurospeech Conference, Budapest . [2] Allen J., Hunnicutt S., and Klatt D. “From text to speech: the MITalk system”, MIT Press, Cambridge, MA 1998. [3] Y.Stylianou, O.Cappe and E.Moulines, ”Continuous Probabilistic Transform for Voice Conversion”, IEEE Proc. on Speech and Audio Processing, Vol.6, No.2, March 1998, pp.131-142. [4] A. Kain and M. Macon. "Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction". Proceedings of ICASSP, May 2001. [5] M. Abe (1991), “A segment-based approach to voice conversion,” in Proc. ICASSP, pp 765-768. [6] C.H. Ho, Dimitrios Rentzos, Saeed Vaseghi (2002), “Formant Model estimation and transformation for Voce Morphing” in Proc. ICSLP, pp. 2149-2152.