ONE SENTENCE VOICE ADAPTATION USING GMM-BASED FREQUENCY-WARPING AND SHIFT WITH A SUB-BAND BASIS SPECTRUM MODEL

Masatsune Tamura, Masahiro Morita, Takehiko Kagoshima, and Masami Akamine

Knowledge Media Laboratory, Corporate Research and Development Center, Toshiba Corporation

ABSTRACT

This paper presents a rapid voice adaptation algorithm using GMM-based frequency warping and shift with parameters of a sub-band basis spectrum model (SBM)[1]. The SBM parameter represents the shape of the speech spectrum and is calculated by fitting a sub-band basis to the log-spectrum. Since the parameter is a frequency-domain representation, frequency warping can be applied to it directly. For each mixture component of a GMM, a frequency warping function that minimizes the distance between source and target SBM parameter pairs is derived using a DP (dynamic programming) algorithm. The proposed method is evaluated in a unit-selection-based voice adaptation framework applied to a unit-fusion-based text-to-speech synthesizer. The experimental results show that the proposed adaptation method enables rapid voice adaptation using just one sentence, compared to the conventional GMM-based linear transformation of mel-cepstra.

Index Terms—voice adaptation, frequency warping, sub-band basis spectrum model, unit fusion speech synthesis

1. INTRODUCTION

Voice conversion[2]-[5] is a technique that converts source speech into target speech that sounds as if it were uttered by a target speaker. By applying voice conversion to the speech unit database of a text-to-speech synthesizer (TTS), the synthesizer can be adapted to a new voice using only a small amount of recorded speech from the target speaker. GMM-based voice conversion[3]-[5] is one of the most widely used approaches. A GMM and voice conversion functions for its mixture components are trained on paired utterances of a source speaker and a target speaker. For transforming spectral features, linear regression of mel-cepstrum parameters is widely used[3][4]. Since the regression matrices have many parameters to estimate (the square of the order of the cepstrum parameter), adaptation is slow, and when the amount of adaptation data is small, the estimated regression matrices are unreliable. Over-smoothing of the converted spectra is another problem: the spectral peaks of a converted spectrum become unclear, so the adapted speech suffers degraded voice quality.

To address these problems, voice conversion algorithms using frequency warping[4][5] have been proposed. The vocal tract length of a speaker is one of the properties that characterize a voice; in the frequency domain, it is reflected in the locations of the formant frequencies. Applying frequency warping can therefore change the voice characteristics of speech. The strengths of the formants also reflect the vocal tract shape, so sub-band energy conversion[5] or filtering applied along with frequency warping changes the voice characteristics as well. One method applies dynamic frequency warping to the STRAIGHT spectrum[4] to reduce the over-smoothing problem. The converted spectrum is calculated by interpolating a frequency-warped source spectrum and a spectrum generated by


the cepstrum transformation. That method still uses GMM-based linear regression of cepstra, so its adaptation is still slow. Another method uses GMM-based weighted frequency warping[5], with a piecewise linear warping function and sub-band energy conversion. The warping function is calculated from the positions of the formants of the mean target and source spectra for each mixture component. Because it uses only a small number of formant positions and is not estimated through a data optimization process such as mean-squared-error minimization, the conversion function is not precise.

In this paper, we propose a GMM-based voice conversion method that applies frequency warping and shift to the parameters of the sub-band basis spectrum model (SBM)[1]. The SBM parameter represents the shape of the pitch-synchronous log-spectrum and phase. It uses sub-band basis vectors built from 1-cycle sinusoidal shapes, similar to sparse coding bases, and the SBM parameter is calculated by fitting these bases to the log-spectrum. This parameter is well suited to voice adaptation for unit-fusion-based TTS, since synthetic speech from an analysis-synthesis database shows no significant degradation compared to the original database. Because the SBM parameter is a frequency-domain representation, frequency warping can be applied to it directly. A GMM is trained on SBM parameters of the source and target speakers, and frequency warping functions and shift vectors are then estimated for the respective mixture components. The frequency warping function that minimizes the distance between source and target SBM parameter pairs in each mixture component is derived with a DP (dynamic programming) algorithm: a distance matrix weighted by the posterior probabilities of the GMM is calculated, and a DP path is searched on it. The shift vectors, which represent filters, are also calculated to minimize the difference between the warped source parameters and the target parameters. Adaptation with the proposed method is rapid because each conversion function has few parameters (twice the order of an SBM parameter), and since the conversion functions are estimated by minimizing the distances over the training pairs, the source spectrum can be converted to the target precisely.

The proposed method is evaluated in a unit-selection-based voice adaptation framework for the plural-unit-selection-and-fusion-based TTS[6][7]. The system does not need a parallel corpus of the source and target speakers; instead, it uses a cost function to pair source and target speech units, which enables non-parallel training of the voice conversion functions. The experimental results show that the proposed voice conversion method is effective compared to GMM-based linear regression of mel-cepstrum parameters when only a very small number of training utterances, such as one sentence, is available.

2. THE VOICE ADAPTATION SYSTEM

Figure 1 illustrates the flow of the proposed voice-adaptation-based speech synthesis system. The system consists of a voice conversion module and a TTS module.


[Figure 1. Flow diagram of the proposed voice-adaptation-based speech synthesis system: the voice conversion module (training data preparation, conversion function training, voice conversion) turns the large source speech unit database and a small amount of target adaptation data into a large adapted speech unit database, from which the TTS module synthesizes speech for input text.]

[Figure 2. Example of SBM parameter: (a) sub-band basis; (b) log-spectrum using FFT, SBM parameter, and spectrum reconstructed from the SBM parameter.]

The voice conversion module takes the speech unit database of the source speaker and a small amount of adaptation data from a target speaker as inputs, and outputs a voice-adapted speech unit database. It consists of a training data preparation process, a conversion function training process, and a voice conversion process. In the training data preparation process, the speech samples of the adaptation data are segmented into speech units (half-phones), and pitch-cycle waveforms are extracted from the speech units by applying a Hanning window. SBM parameters are then extracted from each pitch-cycle waveform. Next, for each speech unit of the adaptation data, a source speech unit is selected to form a pair with the target speech unit for conversion function training. The unit selection is performed by minimizing a cost function[7], defined by

$C(u_t, u_c) = \sum_i w_i C_i(u_t, u_c)$,   (1)

where $u_t$ and $u_c$ denote the target speech unit and the source speech unit, $C_i(u_t, u_c)$ is a sub-cost function, and $w_i$ is the weight of the sub-cost. F0, duration, phoneme environment, and boundary spectral sub-costs are used.
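As a concrete illustration of equation (1), the sketch below pairs a target unit with the lowest-cost source unit. It is a minimal sketch under stated assumptions: the sub-cost definitions and the unit attributes ("f0", "dur") are hypothetical stand-ins, since the paper only names the sub-costs and (in Section 4) their weights.

```python
# Minimal sketch of the unit-selection cost of Eq. (1).
# Sub-costs and unit attributes are illustrative assumptions.

def unit_cost(u_t, u_c, sub_costs, weights):
    """C(u_t, u_c) = sum_i w_i * C_i(u_t, u_c)."""
    return sum(w * c(u_t, u_c) for w, c in zip(weights, sub_costs))

# Two toy sub-costs: squared F0 and duration differences. The other
# sub-costs of Section 2 and their normalization are omitted here.
sub_costs = [
    lambda t, c: (t["f0"] - c["f0"]) ** 2,
    lambda t, c: (t["dur"] - c["dur"]) ** 2,
]
weights = [10.0, 3.0]  # F0 and duration weights reported in Section 4

target_unit = {"f0": 220.0, "dur": 0.08}
source_units = [{"f0": 200.0, "dur": 0.07},
                {"f0": 225.0, "dur": 0.09}]

# Pair the target unit with the lowest-cost source unit.
best = min(source_units,
           key=lambda u: unit_cost(target_unit, u, sub_costs, weights))
print(best)
```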

In the conversion function training process, the voice conversion function is trained. It is based on the GMM-based frequency warping and shift algorithm described in Section 3 and consists of frequency warping functions and shift vectors. A GMM and the conversion functions for its respective Gaussian components are trained in this process. In the voice conversion process, the conversion function is applied to the speech units in the source speech unit database, and the converted speech units are stored in the adapted speech unit database. The SBM parameters of the pitch-cycle waveforms of each source speech unit are converted to those of a target speech unit by applying the conversion function. Pitch-cycle waveforms for the converted speech units are generated by inverse FFT of the spectra reconstructed from the converted SBM parameters; the phase spectra reconstructed from the phase parameters of the source units are used in this process. The converted speech units are generated by overlap-adding the pitch-cycle waveforms, and an LSP post-filter[8] is applied to the pitch-cycle waveforms. In the TTS module, speech is synthesized from the input text using the adapted speech unit database. The TTS module consists of a text analysis part, a prosody generation part, and a speech synthesis part.


A plural unit selection and fusion method[6] is used in the speech synthesis part.
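Returning to the voice conversion process, its resynthesis step can be sketched as below. This is a simplified stand-in, not the paper's implementation: the basis centers are equally spaced over the whole band (the SBM[1] places the lower half band on a mel scale), the dB scaling of the log-spectrum is an assumption, and the LSP post-filter is omitted.

```python
import numpy as np

def subband_basis(n_bands=50, n_bins=257):
    """Hanning-shaped (1-cycle sinusoid) sub-band basis vectors.
    Simplification: centers are equally spaced over the whole band,
    whereas the SBM[1] places the lower half band on a mel scale."""
    centers = np.linspace(0.0, n_bins - 1.0, n_bands)
    half_width = (n_bins - 1.0) / (n_bands - 1.0)  # adjacent bands overlap
    d = np.abs(np.arange(n_bins) - centers[:, None]) / half_width
    return np.where(d < 1.0, 0.5 * (1.0 + np.cos(np.pi * d)), 0.0)

def pitch_cycle_waveform(sbm, src_phase, basis):
    """One pitch-cycle waveform from converted SBM parameters, reusing
    the phase spectrum of the source unit as described above."""
    log_mag = basis.T @ sbm                # reconstructed log-spectrum
    mag = 10.0 ** (log_mag / 20.0)         # assuming a dB scale
    spec = mag * np.exp(1j * src_phase)    # attach source phase
    wav = np.fft.irfft(spec)               # inverse FFT
    return wav * np.hanning(len(wav))      # window before overlap-add

def overlap_add(cycles, pitch_marks, n_samples):
    """Overlap-add pitch-cycle waveforms centered at the pitch marks."""
    out = np.zeros(n_samples)
    for wav, p in zip(cycles, pitch_marks):
        start = max(0, p - len(wav) // 2)
        stop = min(n_samples, start + len(wav))
        out[start:stop] += wav[:stop - start]
    return out
```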

3. VOICE ADAPTATION USING GMM-BASED FREQUENCY WARPING AND SHIFT

The proposed method uses SBM parameters[1] for voice adaptation. Figure 2 shows an example of an SBM parameter. Figure 2 (a) illustrates the sub-band basis vectors, which are placed on a mel frequency scale for the lower half band and an equally spaced scale for the upper half band. Each basis vector is generated from a 1-cycle sinusoidal function (a Hanning window shape). Figure 2 (b) shows an example of the log-spectrum of the phoneme "a" from a female speaker: the FFT spectrum, the SBM parameter, and the spectrum reconstructed from the SBM parameter. The SBM parameter is plotted with points and vertical lines at the center frequencies of the respective basis vectors.

Voice conversion is performed by the GMM-based frequency warping and shift algorithm in the SBM parameter domain. The conversion function is defined by

$\mathbf{y} = \sum_{m=1}^{M} P(\mathbf{x}, c_{\mathbf{x}} = m \mid \lambda_{gmm}) \{\mathrm{warp}_m(\mathbf{x}) + \mathrm{shift}_m\}$,   (2)

where $\mathbf{y} = \{y(1), \ldots, y(N)\}$ is the converted parameter, $\mathbf{x}$ is the source parameter, and $M$ is the number of mixtures in the GMM $\lambda_{gmm}$. $P(\mathbf{x}, c_{\mathbf{x}} = m \mid \lambda_{gmm}) \equiv \gamma_m(\mathbf{x})$ is the posterior probability that the mixture $c_{\mathbf{x}}$ for a given observation $\mathbf{x}$ is $m$. The warping function $\mathrm{warp}_m(\mathbf{x})$ for mixture $m$ is defined by a mapping function $\psi_m(k)$ that maps the coefficients of the source parameter to the target parameter. The shift term $\mathrm{shift}_m$ is defined by a shift vector $\mathbf{s}_m = \{s_m(1), \ldots, s_m(N)\}$. Since the parameter is in the log-spectrum domain, a shift operation corresponds to a filtering operation in the time domain. For each element of $\mathbf{y}$, equation (2) can be written as

$y(k) = \sum_{m=1}^{M} \gamma_m(\mathbf{x}) \{x(\psi_m(k)) + s_m(k)\}$.   (3)

In the warping function $\mathrm{warp}_m(\mathbf{x})$, a smoothing operation is applied in addition to $\psi_m(k)$. The mapping function $\psi_m(k)$ maps the elements of $\mathbf{x}$ to $\mathbf{y}$ by skipping or repeating some elements of $\mathbf{x}$; the smoothing operation uses interpolated values for skipped elements and smoothed values for repeated elements. This reduces unnatural spectral jumps and flat spectra.
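A minimal sketch of the conversion of equation (3) follows, under these assumptions: each mapping $\psi_m$ is stored as an integer index array, the GMM is given as a tuple of source-part weights, means, and covariances, and the helper names (`posteriors`, `convert`) are hypothetical. The smoothing of skipped and repeated elements described above is omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, weights, means, covs):
    """gamma_m(x): posterior of each mixture for a source vector x,
    using the source part of the joint Gaussians (see training below)."""
    p = np.array([w * multivariate_normal.pdf(x, mean=mu, cov=cv)
                  for w, mu, cv in zip(weights, means, covs)])
    return p / p.sum()

def convert(x, gmm, psis, shifts):
    """Eq. (3): y(k) = sum_m gamma_m(x) * (x(psi_m(k)) + s_m(k)).
    Smoothing of skipped/repeated elements is omitted in this sketch."""
    gamma = posteriors(x, *gmm)            # gmm = (weights, means, covs)
    y = np.zeros_like(x)
    for g, psi, s in zip(gamma, psis, shifts):
        y += g * (x[psi] + s)              # warp by indexing, then shift
    return y
```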


To train the warping functions and the shift parameters, training data pairs are used. For each target speech unit, a source speech unit is selected from the source speech unit database to form a training pair, as described in Section 2. The GMM $\lambda_{gmm}$ is trained on these pairs. An observation vector for the GMM, $\mathbf{o}_t = \{\mathbf{o}_t^{src\prime}, \mathbf{o}_t^{target\prime}\}'$, consists of a source parameter $\mathbf{o}_t^{src}$ and a target parameter $\mathbf{o}_t^{target}$. The combined vector is used for training, and the Gaussian for the $\mathbf{o}_t^{src}$ part is used for conversion. The GMM parameters are initialized by the LBG (Linde-Buzo-Gray) algorithm and re-estimated by maximum likelihood estimation.

The conversion function consists of mapping functions $\psi_m(k)$ and shift vectors $\mathbf{s}_m$ for the respective mixture components. They are trained iteratively to minimize the error function:

1. Initialization: set $\mathbf{s}_m$ to $\mathbf{0}$.
2. Calculate $\psi_m$ that minimizes the distance between the target data and the source data.
3. Calculate $\mathbf{s}_m$ that minimizes the error function.
4. Go to step 2 until the average distance between the converted parameters and the target parameters converges.

The squared error $E$ between the converted target parameters and the source parameters is

$E = \sum_t \|\mathbf{y}_t - \hat{\mathbf{y}}_t\|^2 = \sum_t \sum_{m=1}^{M} \gamma_m(\mathbf{x}_t) \|\mathbf{y}_t - \{\mathrm{warp}_m(\mathbf{x}_t) + \mathrm{shift}_m\}\|^2$,   (4)

where $\mathbf{x}_t$ and $\mathbf{y}_t$ denote the source and target of the $t$-th training pair. Here, we assume that the error distribution for each Gaussian mixture is independent. The mapping function $\psi_m(k)$ can then be obtained by

$\psi_m(k) = \arg\min_{\psi_m(k)} \sum_t \gamma_m(\mathbf{x}_t) \{(y_t(k) - s_m(k)) - x_t(\psi_m(k))\}^2$.   (5)

Let $D_m(i, j) = \sum_t \gamma_m(\mathbf{x}_t) \{(y_t(i) - s_m(i)) - x_t(j)\}^2$ be the weighted distance matrix. The DP path is obtained by searching for a path that minimizes

$dist_m(i, j) = \min\{dist_m(i-1, j),\; dist_m(i-1, j-1),\; dist_m(i-1, j-2)\} + D_m(i, j)$.   (6)

Using equation (6), an optimum DP path for each Gaussian mixture can be obtained from multiple training data pairs. Next, in step 3, the shift vector $\mathbf{s}_m$ is calculated as the weighted average difference between the mapped source parameters and the target parameters:

$s_m(k) = \sum_t \gamma_m(\mathbf{x}_t) \{y_t(k) - x_t(\psi_m(k))\} \,/\, \sum_t \gamma_m(\mathbf{x}_t)$.   (7)
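The sketch below implements this iterative training for one mixture component. It is a sketch under stated assumptions, not the paper's code: the function names are hypothetical, a fixed iteration count replaces the convergence test of step 4, and the DP end points are left unconstrained since the paper does not state boundary conditions.

```python
import numpy as np

def dp_warping(D):
    """Search the DP path of Eq. (6) on the weighted distance matrix D
    (rows: target bins i, columns: source bins j)."""
    n, k = D.shape
    dist = np.full((n, k), np.inf)
    back = np.zeros((n, k), dtype=int)
    dist[0] = D[0]
    for i in range(1, n):
        for j in range(k):
            steps = [dist[i - 1, j - d] for d in (0, 1, 2) if j - d >= 0]
            d_best = int(np.argmin(steps))
            dist[i, j] = steps[d_best] + D[i, j]
            back[i, j] = j - d_best
    psi = np.zeros(n, dtype=int)
    psi[-1] = int(np.argmin(dist[-1]))   # free end point (an assumption)
    for i in range(n - 1, 0, -1):
        psi[i - 1] = back[i, psi[i]]
    return psi

def train_warp_and_shift(X, Y, gamma, n_iter=10):
    """Alternate the warping update of Eqs. (5)-(6) and the shift update
    of Eq. (7) for one mixture m. X, Y: (T, N) source/target SBM
    parameters of the training pairs; gamma: (T,) posteriors gamma_m."""
    s = np.zeros(X.shape[1])
    psi = np.arange(X.shape[1])
    for _ in range(n_iter):  # fixed count in place of a convergence test
        # D_m(i, j) = sum_t gamma_t * ((y_t(i) - s(i)) - x_t(j))^2
        diff = (Y - s)[:, :, None] - X[:, None, :]
        D = np.einsum('t,tij->ij', gamma, diff ** 2)
        psi = dp_warping(D)                            # step 2, Eq. (5)
        s = (gamma[:, None] * (Y - X[:, psi])).sum(0) / gamma.sum()  # Eq. (7)
    return psi, s
```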

[Figure 3. Example of the conversion function: (a) mapping function $\psi_m$; (b) shift vector $\mathbf{s}_m$; (c) example of source, target, and converted SBM parameters.]

Figure 3 shows an example of (a) the mapping function, (b) the shift vector, and (c) the source, target, and converted parameters. These parameters are plotted at the center frequencies of the respective sub-band basis vectors. The figures show that the spectral shape moves closer to the target when frequency warping and shift are applied directly to the SBM parameter.
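As a toy end-to-end check of the training sketch above (it reuses the hypothetical `train_warp_and_shift`), the snippet below builds a synthetic single-mixture problem whose target is a warped and shifted copy of the source; the data and numbers are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 50                      # training pairs, SBM parameter order
X = rng.normal(size=(T, N))         # stand-in source SBM parameters
Y = np.roll(X, 2, axis=1) + 0.5     # target = warped + shifted source
gamma = np.ones(T)                  # single-mixture posteriors

psi, s = train_warp_and_shift(X, Y, gamma)
Y_hat = X[:, psi] + s               # converted parameters (M = 1)
print("residual MSE:", float(np.mean((Y_hat - Y) ** 2)))
```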

4. EXPERIMENTS

A MOS evaluation test was conducted to compare the conventional and proposed methods. GMM-based linear regression of mel-cepstrum parameters was used as the baseline. The speech unit databases of one female speaker (624 sentences) and one male speaker (802 sentences) were used as the conversion sources, and those of four female speakers (FA, FB, FC, FD) and one male speaker (MA) were used as the targets. Adaptation was performed only on voiced speech units. One sentence and 50 sentences were used for adaptation, and a different set of 50 sentences was used for calculating spectral distance. The sub-cost weights $w_i$ in equation (1) were set experimentally to {10, 3, 1, 1, 3} for the normalized F0 target, duration target, phonetic context, spectrum concatenation, and power concatenation sub-costs, respectively.

Figure 4 shows the objective measure for (a) the proposed method (SBM) and (b) the baseline method (MCEP) for one-sentence adaptation. The log-spectral distance between the reconstructed spectrum and the target spectrum over the test sentences was used as the objective measure. The test data pairs of target parameters and source or adapted parameters were created by unit selection. The x-axis represents the number of mixtures used for conversion; SOURCE shows the case of the source speech unit database without conversion. The y-axis represents the log-spectral distance. The distances for the respective target speakers and their average (indicated as ALL) are plotted. For the proposed method, the distance does not increase rapidly as the number of mixtures increases, whereas the distance of the baseline method does. Multiple mixtures can therefore be used even for one-sentence adaptation with the proposed method, which means the proposed method can reflect the acoustic space of the source speaker efficiently even when little adaptation data is available. One reason is that the number of conversion parameters per mixture is small (100) for SBM conversion, while it is large (2500) for MCEP conversion. Thus, adaptation with the proposed method is rapid.
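The objective measure can be computed as below; the exact variant of the log-spectral distance (frame-wise RMS in dB, averaged over frames) is an assumption, since the paper does not spell it out.

```python
import numpy as np

def log_spectral_distance(A, B):
    """Mean frame-wise RMS difference in dB between two sets of
    log-magnitude spectra, each of shape (frames, bins)."""
    return float(np.mean(np.sqrt(np.mean((A - B) ** 2, axis=1))))
```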


[Figure 4. Log-spectral distance for one-sentence adaptation versus the number of GMM mixtures (1 to 32), for target speakers FA, FB, FC, FD, MA and their average (ALL), with SOURCE as the unconverted reference: (a) proposed method (SBM); (b) baseline method (MCEP).]

Figure 5 shows the results of the MOS evaluation. The subjects listened to the stimuli together with synthetic speech from a large speech unit database of the target speaker, and gave five-point scores for both speech quality (1: poor, 3: fair, 5: excellent) and similarity (1: different, 3: resembling, 5: same). Seven subjects participated, and four sentences were evaluated for each condition. In Figure 5, (a) shows the comparison between the baseline method (MCEP) and the proposed method (SBM). SBM-1 and MCEP-1 use one sentence for adaptation, and SBM-50 and MCEP-50 use 50 sentences. SOURCE represents synthetic speech from the source speaker's speech unit database without conversion. For synthesizing the test samples, the target speaker's prosody generated in the TTS module was used. The MOS results for speech quality and similarity are shown, with scores averaged over the target speakers. Based on the results of the objective evaluation, the number of mixtures for the proposed method was set to 2 (FC, MA) or 4 (FA, FB, FD), and that for the baseline was set to 1. For adaptation with 50 sentences, they were set to 64 (FA) or 128 (others) for the proposed method and 8 for the baseline.

The results show that the speech quality of the proposed method using one sentence is higher than that of the baseline system, while the similarity stays almost the same. For adaptation with 50 sentences, the scores of the proposed method are close to the baseline for both speech quality and similarity. For the proposed method, the speech quality scores are close to "fair", and the similarity scores are higher than "resembling". The speech quality for SOURCE is higher than the others, but its similarity score is below 2. Consequently, the results show that the proposed method is effective when only a small number of adaptation sentences is available. The proposed method, which synthesizes fair-quality speech, can be used in applications where adaptation data is limited, such as server-based speech-to-speech translation or an avatar interface with the user's own voice.

Figure 5 (b) shows the comparison among frequency warping only (DFW), shift only (SHT), and their combination (DFW+SHT). For this evaluation, FB and FC were used as target speakers, and their average scores are shown. The similarity of DFW+SHT is better than that of both DFW and SHT for one-sentence adaptation. For adaptation with 50 sentences, DFW+SHT scored higher in both similarity and speech quality. In summary, the results show that voice conversion based on frequency warping and shift is more effective than frequency warping only or shift only.

[Figure 5. Mean opinion scores for speech quality and similarity: (a) comparison between the baseline (MCEP) and proposed (SBM) methods; (b) comparison among DFW+SHT, DFW, and SHT.]

5. CONCLUSION

In this paper, we proposed a voice adaptation method using GMM-based frequency warping and shift with parameters of the sub-band basis spectrum model. The proposed method was compared to the baseline GMM-based mel-cepstrum linear regression method. The results show that the proposed method is effective when only a small number of adaptation utterances is available. The results also showed that frequency warping and shift conversion achieves higher similarity MOS scores than adaptation by shift only or frequency warping only. Our future work includes speaking-style adaptation, cross-lingual adaptation, and applying the proposed method to HMM-based speech synthesis.

6. REFERENCES

[1] M. Tamura, T. Kagoshima, and M. Akamine, "Sub-band spectrum parameter for pitch-synchronous log-spectrum and phase based on approximation of sparse coding," Proc. INTERSPEECH, pp. 2406-2409, 2010.
[2] Y. Stylianou, "Voice transformation: a survey," Proc. ICASSP, pp. 3585-3588, Apr. 2009.
[3] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, vol. 6, pp. 131-142, 1998.
[4] T. Toda, H. Saruwatari, and K. Shikano, "Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum," Proc. ICASSP, pp. 841-844, 2001.
[5] D. Erro and A. Moreno, "Weighted frequency warping for voice conversion," Proc. INTERSPEECH, pp. 1965-1968, 2007.
[6] T. Mizutani and T. Kagoshima, "Concatenative speech synthesis based on the plural unit selection and fusion method," IEICE Trans. Inf. & Syst., vol. E88-D, no. 11, pp. 2565-2572, 2005.
[7] M. Tamura and T. Kagoshima, "A study on voice conversion for plural speech unit selection and fusion based speech synthesis," Proc. ASJ 2008, 2-P-5, Sept. 2008 (in Japanese).
[8] Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, "USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method," Proc. Blizzard Challenge Workshop, 2006.