Low-Delay Voice Conversion based on Maximum Likelihood Estimation of Spectral Parameter Trajectory

Takashi Muramatsu†, Yamato Ohtani†, Tomoki Toda†, Hiroshi Saruwatari†, and Kiyohiro Shikano†

†Graduate School of Information Science, Nara Institute of Science and Technology
8916-5, Takayama-cho, Ikoma-shi, Nara 630-0192, Japan
E-mail: †{takashi-mu,yamato-o,tomoki,sawatari,shikano}@is.naist.jp

Abstract

As typical voice conversion methods, two spectral conversion processes have been proposed: 1) the frame-based conversion, which converts spectral parameters frame by frame, and 2) the trajectory-based conversion, which converts all spectral parameters over an utterance simultaneously. The former process is capable of real-time conversion but sometimes causes inappropriate spectral movements. The latter process provides converted spectral parameters exhibiting proper dynamic characteristics, but a batch process is inevitable. To achieve a real-time conversion process that considers spectral dynamic characteristics, we propose a time-recursive conversion algorithm based on maximum likelihood estimation of the spectral parameter trajectory. Experimental results show that the proposed method achieves a low-delay conversion process, e.g., only one frame of delay, while keeping the conversion performance comparably high to that of the conventional trajectory-based conversion.

Index Terms: speech synthesis, voice conversion, Gaussian mixture model, maximum likelihood estimation, time-recursive algorithm

1. Introduction

Voice conversion (VC) is a technique for converting source speech features into target ones while preserving linguistic information. There are many VC applications for speech communication, such as bandwidth expansion for telephone speech [1] and body-transmitted speech enhancement [2]. A real-time conversion process is undoubtedly essential in these applications.

As a typical statistical approach to VC, a conversion method based on the Gaussian mixture model (GMM) has been proposed. In this method, the joint probability density of source and target features is modeled by a GMM, which is trained beforehand using a parallel data set consisting of several sentence pairs uttered by the source and target speakers. The trained GMM allows conversion from any sample of the source features into that of the target features. Two main GMM-based conversion methods have been proposed so far: 1) the frame-based conversion method, which converts spectral parameters frame by frame based on the minimum mean square error (MMSE) criterion [3], and 2) the trajectory-based conversion method, which converts all spectral parameters over an utterance simultaneously based on maximum likelihood estimation (MLE) [4]. The former method is capable of a real-time (on-the-fly) conversion process because the source features at individual frames are converted independently of each other. However, this method sometimes causes spectral movements with inappropriate dynamic characteristics because it ignores the feature correlation between frames. Consequently, the converted speech quality is often degraded. The latter method, on the other hand, provides a converted spectral parameter sequence exhibiting proper dynamic characteristics by considering the dynamic features of the converted parameters, which significantly improves the quality of the converted speech. However, it does not allow a real-time conversion process, because the source features over an entire utterance need to be converted simultaneously in order to consider the feature correlation between frames.

In order to achieve a real-time conversion process that considers the dynamic characteristics of the converted parameters, we propose a time-recursive conversion algorithm based on MLE of the spectral parameter trajectory. This method is inspired by the parameter generation algorithm for HMM-based speech synthesis [5] and the vector quantization algorithm for speech coding [6]. Experimental results demonstrate that the proposed method achieves a low-delay conversion process while keeping the conversion performance comparably high to that of the conventional trajectory-based conversion.

This paper is organized as follows. In Section 2, we describe the conventional VC methods. In Section 3, we describe the proposed low-delay VC method. In Section 4, we provide the experimental results. Finally, conclusions are given in Section 5.

Copyright © 2008 ISCA. Accepted after peer review of full paper.

September 22-26, Brisbane, Australia

2. Conventional GMM-Based Spectral Conversion Methods

2.1. Training of GMM

D-dimensional source and target feature vectors at frame t are denoted by x_t and y_t, respectively. The joint probability density of the source and target feature vectors is modeled as follows:

P(z_t | \lambda^{(z)}) = \sum_{m=1}^{M} w_m \mathcal{N}(z_t; \mu_m^{(z)}, \Sigma_m^{(z)}),   (1)

where z_t is the joint vector [x_t^\top, y_t^\top]^\top. The superscript \top indicates transposition. The mixture component index is m, the total number of mixture components is M, and the weight of the m-th mixture component is w_m. \mathcal{N}(\cdot; \mu, \Sigma) denotes the normal distribution with mean vector \mu and covariance matrix \Sigma. The GMM parameter set \lambda^{(z)} consists of the weights, mean vectors, and covariance matrices of the individual mixture components. The mean vector \mu_m^{(z)} and the covariance matrix \Sigma_m^{(z)} of the m-th mixture component are written as

\mu_m^{(z)} = \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix}, \quad
\Sigma_m^{(z)} = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix},   (2)

where \mu_m^{(x)} and \mu_m^{(y)} are the mean vectors of the m-th mixture component for the source and the target, respectively. The matrices \Sigma_m^{(xx)} and \Sigma_m^{(yy)} are the covariance matrices of the m-th mixture component for the source and the target, respectively, and \Sigma_m^{(xy)} and \Sigma_m^{(yx)} are the corresponding cross-covariance matrices. These parameters are estimated with the EM algorithm using the joint vectors extracted from the training data [7].
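As an illustration of the training step above, the joint density of Eq. (1) can be fitted with an off-the-shelf EM implementation. This is a minimal sketch, not the authors' code: scikit-learn's GaussianMixture stands in for the EM training, and the helper name `train_joint_gmm` and the toy data are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(x, y, n_components=64, seed=0):
    """Fit a GMM to joint vectors z_t = [x_t; y_t] (Eq. (1))."""
    z = np.hstack([x, y])  # (T, 2D) joint vectors from time-aligned frames
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",   # full Sigma^(z), as in Eq. (2)
                          random_state=seed, reg_covar=1e-6)
    gmm.fit(z)  # EM estimation of weights, means, covariances
    return gmm

# toy usage with random "time-aligned" source/target features
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 4))
y = 0.5 * x + rng.normal(scale=0.1, size=(500, 4))
gmm = train_joint_gmm(x, y, n_components=2)
```

The source/target blocks of Eq. (2) are then simply the corresponding sub-blocks of `gmm.means_` and `gmm.covariances_`.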

2.2. Frame-Based Conversion

The conditional probability density of y_t given x_t is represented as follows:

P(y_t | x_t, \lambda^{(z)}) = \sum_{m=1}^{M} P(m | x_t, \lambda^{(z)}) P(y_t | x_t, m, \lambda^{(z)}),   (3)

where

P(m | x_t, \lambda^{(z)}) = \frac{w_m \mathcal{N}(x_t; \mu_m^{(x)}, \Sigma_m^{(xx)})}{\sum_{n=1}^{M} w_n \mathcal{N}(x_t; \mu_n^{(x)}, \Sigma_n^{(xx)})},   (4)

P(y_t | x_t, m, \lambda^{(z)}) = \mathcal{N}(y_t; E_{m,t}^{(y)}, D_m^{(y)}).   (5)

The mean vector E_{m,t}^{(y)} and the covariance matrix D_m^{(y)} of the m-th conditional probability distribution are written as

E_{m,t}^{(y)} = \mu_m^{(y)} + \Sigma_m^{(yx)} \Sigma_m^{(xx)-1} (x_t - \mu_m^{(x)}),   (6)

D_m^{(y)} = \Sigma_m^{(yy)} - \Sigma_m^{(yx)} \Sigma_m^{(xx)-1} \Sigma_m^{(xy)}.   (7)

The MMSE-based conversion [3] is performed as follows:

\hat{y}_t = \sum_{m=1}^{M} P(m | x_t, \lambda^{(z)}) E_{m,t}^{(y)},   (8)

where \hat{y}_t is the converted target feature vector. This method converts spectral parameters frame by frame.

2.3. Trajectory-Based Conversion

The target feature vector is defined as the 2D-dimensional vector Y_t consisting of D-dimensional static and dynamic features at frame t. The relationship between the time sequences Y = [Y_1^\top, Y_2^\top, \cdots, Y_t^\top, \cdots, Y_T^\top]^\top and y = [y_1^\top, y_2^\top, \cdots, y_t^\top, \cdots, y_T^\top]^\top is represented as Y = W y, where

W = [w_1, w_2, \cdots, w_T]^\top,   (9)

w_t = [0_{2D \times (t-2)D}, w^\top, 0_{2D \times (T-t)D}]^\top,   (10)

w = \begin{bmatrix} 0_{D \times D} & I_{D \times D} \\ -g I_{D \times D} & g I_{D \times D} \end{bmatrix}.   (11)

As the source feature vector X_t, the feature vector including both static and dynamic features [4] or the feature vector concatenated over multiple frames [8] is employed. Its time sequence is written as X = [X_1^\top, X_2^\top, \cdots, X_t^\top, \cdots, X_T^\top]^\top. Using the joint vectors Z_t = [X_t^\top, Y_t^\top]^\top, the GMM parameter set \lambda^{(Z)} of the joint probability density P(Z_t | \lambda^{(Z)}) is trained beforehand with the conventional training framework [7]. In the conversion process, the converted feature vector sequence \hat{y} is determined by maximizing the likelihood function P(Y | X, \lambda^{(Z)}) as follows:

\hat{y} = \arg\max_y P(Y | X, \lambda^{(Z)}) \simeq \arg\max_y P(\hat{m} | X, \lambda^{(Z)}) P(Y | X, \hat{m}, \lambda^{(Z)}),   (12)

where the suboptimal mixture component sequence \hat{m} is determined by maximizing the posterior probability P(m | X, \lambda^{(Z)}). The ML estimate of \hat{y} is given by solving the following linear equation:

R \hat{y} = r,   (13)

where

R = P^{-1} = W^\top D_{\hat{m}}^{(Y)-1} W,   (14)

r = W^\top D_{\hat{m}}^{(Y)-1} E_{\hat{m}}^{(Y)},   (15)

E_{\hat{m}}^{(Y)} = [E_{\hat{m}_1,1}^{(Y)\top}, \cdots, E_{\hat{m}_t,t}^{(Y)\top}, \cdots, E_{\hat{m}_T,T}^{(Y)\top}]^\top,   (16)

D_{\hat{m}}^{(Y)-1} = \mathrm{diag}[D_{\hat{m}_1}^{(Y)-1}, \cdots, D_{\hat{m}_t}^{(Y)-1}, \cdots, D_{\hat{m}_T}^{(Y)-1}].   (17)

The mean vector E_{\hat{m}_t,t}^{(Y)} and the covariance matrix D_{\hat{m}_t}^{(Y)} are calculated based on \lambda^{(Z)}. This paper uses only the diagonal elements of D_{\hat{m}_t}^{(Y)}.

Figure 1 shows the schematic image of the trajectory-based conversion process. Note that the matrix P is generally full. It is therefore necessary to store the statistics of the conditional probabilities over the whole time sequence, shown in Eqs. (16) and (17), for determining \hat{y}.

[Figure 1: Schematic image of the trajectory-based conversion process: source features X_1, X_2, ..., X_T and mixture components m_1, m_2, ..., m_T yield the statistics P and r, from which the converted vectors \hat{y}_1, \hat{y}_2, ..., \hat{y}_T are determined.]

3. Proposed Low-Delay Conversion Method

We propose a time-recursive conversion algorithm by modifying the trajectory-based conversion in a manner similar to that described in [5][6]. The following matrix W^{(t-1)} is obtained by replacing the elements w_\tau of W with \bar{w}_\tau from time t to T:

W^{(t-1)} = [w_1, \cdots, w_{t-1}, \bar{w}_t, \bar{w}_{t+1}, \cdots, \bar{w}_T]^\top,   (18)

where

\bar{w}_t = [0_{2D \times (t-2)D}, \bar{w}^\top, 0_{2D \times (T-t)D}]^\top,   (19)

\bar{w} = \begin{bmatrix} 0_{D \times D} & I_{D \times D} \\ 0_{D \times D} & 0_{D \times D} \end{bmatrix}.   (20)

In this case, the set of linear equations corresponding to Eq. (13) can be written as

R^{(t-1)} y^{(t-1)} = r^{(t-1)},   (21)

where

R^{(t-1)} = W^{(t-1)\top} D_{\hat{m}}^{(Y)-1} W^{(t-1)},   (22)

r^{(t-1)} = W^{(t-1)\top} D_{\hat{m}}^{(Y)-1} E_{\hat{m}}^{(Y)},   (23)

y^{(t-1)} = [y_1^{(t-1)\top}, \cdots, y_t^{(t-1)\top}, \cdots, y_T^{(t-1)\top}]^\top.   (24)
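Before turning to the recursion, the batch ML solution of Eqs. (13)-(17) can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions, not the authors' implementation: a single delta window with coefficient g, purely diagonal covariances, the first-frame delta computed from y_1 alone, and the helper names `build_W` and `mlpg_solve` are ours.

```python
import numpy as np

def build_W(T, D, g=0.5):
    """Stack static/delta windows: Y = W y with Y_t = [y_t; g*(y_t - y_{t-1})]."""
    W = np.zeros((2 * T * D, T * D))
    I = np.eye(D)
    for t in range(T):
        r = 2 * t * D
        W[r:r + D, t * D:(t + 1) * D] = I                  # static row: y_t
        W[r + D:r + 2 * D, t * D:(t + 1) * D] = g * I      # delta: +g y_t
        if t > 0:                                          # delta: -g y_{t-1}
            W[r + D:r + 2 * D, (t - 1) * D:t * D] = -g * I
    return W

def mlpg_solve(E, Dinv, Dim, g=0.5):
    """Solve (W^T D^-1 W) y = W^T D^-1 E, i.e. Eqs. (13)-(15)."""
    T = E.shape[0]
    W = build_W(T, Dim, g)
    E_flat = E.reshape(-1)         # stacked [static; delta] means, (2TD,)
    Dinv_flat = Dinv.reshape(-1)   # diagonal of D^-1, (2TD,)
    R = W.T @ (Dinv_flat[:, None] * W)   # Eq. (14)
    r = W.T @ (Dinv_flat * E_flat)       # Eq. (15)
    return np.linalg.solve(R, r).reshape(T, Dim)

# toy usage: 5 frames, 2-dimensional static feature
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))   # per-frame [static; delta] conditional means
Dinv = np.ones((5, 4))        # unit diagonal precisions
y = mlpg_solve(E, Dinv, Dim=2)
```

Because R couples all frames, this batch solve needs the whole utterance at once; the time-recursive algorithm below approximates the same solution frame by frame.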

The matrix \bar{w} is defined so that R^{(0)} is a D-by-D block diagonal matrix. When \bar{w}_t is replaced with w_t, the following equation is obtained:

R^{(t)} y^{(t)} = r^{(t)},   (25)

where

W^{(t)} = [w_1, \cdots, w_{t-1}, w_t, \bar{w}_{t+1}, \cdots, \bar{w}_T]^\top.   (26)

Using the above equations, the following update equations are derived:

R^{(t)} = R^{(t-1)} + v_t^\top D_{\hat{m}_t}^{(Y)-1} v_t,   (27)

r^{(t)} = r^{(t-1)} + v_t^\top D_{\hat{m}_t}^{(Y)-1} E_{\hat{m}_t,t}^{(Y)},   (28)

where

v_t = w_t - \bar{w}_t.   (29)

Using the matrix inversion lemma as in the derivation of the RLS algorithm [9], y^{(t)} and P^{(t)} = R^{(t)-1} are calculated recursively from y^{(t-1)} and P^{(t-1)} = R^{(t-1)-1}, respectively. All elements of y^{(t)} are updated by replacing \bar{w}_t with w_t. Note that y^{(0)} is equal to the static parts of E_{\hat{m}}^{(Y)}. By replacing \bar{w}_t with w_t from time 1 to T, the final updated vector y^{(T)} is obtained, which corresponds to the solution of Eq. (13), \hat{y}.

Table 1 shows the time-recursive algorithm derived by further introducing the sliding-window concept of the RLS algorithm into the above recursive algorithm. In this algorithm, the sizes of the recursively updated parameters y_L^{(t)} and P_L^{(t)} are limited to LD-by-1 and LD-by-LD, respectively. Therefore, the converted vectors obtained by this algorithm are suboptimal. Equations (A.1) and (A.2) are the key parts of this algorithm. Old elements are discarded by using the matrix J_a, written as

J_a = \begin{bmatrix} 0_{(L-1)D \times D} & I_{(L-1)D \times (L-1)D} \\ 0_{D \times D} & 0_{D \times (L-1)D} \end{bmatrix},   (30)

and then new components are added by using the LD × LD matrix \tilde{P}_L^{(t)} and the LD × 1 vector \tilde{y}_L^{(t)}, which are respectively written as

\tilde{P}_L^{(t)} = \mathrm{diag}[0_{D \times D}, \cdots, 0_{D \times D}, \tilde{R}_t^{-1}],   (31)

\tilde{y}_L^{(t)} = [0^\top, \cdots, 0^\top, (\tilde{R}_t^{-1} \tilde{r}_t)^\top]^\top.   (32)

The new components, the D × D matrix \tilde{R}_t and the D × 1 vector \tilde{r}_t, are written as

\tilde{R}_t = J_b \bar{w}_L^\top D_{\hat{m}_t}^{(Y)-1} \bar{w}_L J_b^\top,   (33)

\tilde{r}_t = J_b \bar{w}_L^\top D_{\hat{m}_t}^{(Y)-1} E_{\hat{m}_t,t}^{(Y)},   (34)

where the 2D × LD matrices w_L, \bar{w}_L, and v_L and the D × LD matrix J_b are respectively written as follows:

w_L = [0_{2D \times D}, \cdots, 0_{2D \times D}, w],   (35)
\bar{w}_L = [0_{2D \times D}, \cdots, 0_{2D \times D}, \bar{w}],   (36)
v_L = w_L - \bar{w}_L,   (37)
J_b = [0_{D \times D}, \cdots, 0_{D \times D}, I_{D \times D}].   (38)

Table 1: Summary of the time-recursive algorithm.

Q = J_a P_L^{(t-1)} J_a^\top + \tilde{P}_L^{(t)}   (A.1)
\bar{y} = J_a y_L^{(t-1)} + \tilde{y}_L^{(t)}   (A.2)
\imath = Q v_L^\top   (A.3)
G = v_L \imath   (A.4)
k = \imath (I_{2D \times 2D} + D_{\hat{m}_t}^{(Y)-1} G)^{-1}   (A.5)
y_L^{(t)} = \bar{y} + k D_{\hat{m}_t}^{(Y)-1} (E_{\hat{m}_t,t}^{(Y)} - v_L \bar{y})   (A.6)
P_L^{(t)} = Q - k D_{\hat{m}_t}^{(Y)-1} \imath^\top   (A.7)

The value L is the number of frames stored in this algorithm, which is determined by

L = \max(d, 1) + 1,   (39)

where d is the length of frame delay allowed in the spectral conversion. The converted vector at time t, \hat{y}_t, is obtained as

\hat{y}_t = y_{t-d}^{(t)},   (40)

where y_{t-d}^{(t)} is a vector contained in y_L^{(t)}, written as

y_L^{(t)} = [y_{t-(L-1)}^{(t)\top}, \cdots, y_t^{(t)\top}]^\top.   (41)

[Figure 2: Schematic image of the low-delay conversion process with d = 1: the stored statistics P_L^{(t)} and y_L^{(t)} are propagated frame by frame over X_{t-2}, X_{t-1}, X_t, X_{t+1} (with mixture components m_{t-2}, ..., m_{t+1}) to produce \hat{y}_{t-2}, ..., \hat{y}_{t+1}.]

Figure 2 shows the schematic image of this conversion process with d = 1. The statistics of the conditional probabilities at the previous frames are effectively propagated to the next frame. Consequently, the converted feature vectors are determined frame by frame while considering the statistics of the conditional probabilities at the current and succeeding d frames as well as at all previous frames. In the proposed algorithm, the statistic E_{\hat{m}_t,t}^{(Y)} in (A.6) of Table 1 changes continuously frame by frame because it is represented as a linear transformation of the source feature vectors, unlike in the algorithms described in [5][6].

It is noted that this time-recursive algorithm is valid only if the cross-covariance values between the static and dynamic features in D_{\hat{m}}^{(Y)} are zeros. If the full elements of D_{\hat{m}}^{(Y)} are used, it is necessary to derive the time-recursive algorithm based on a replacement of elements of D_{\hat{m}}^{(Y)-1} instead of W^{(t)}, in a manner similar to that described in [5].

4. Experimental Evaluations

The performance of the proposed method is evaluated in the framework of body-transmitted VC [2]. Voice conversion frameworks have been proposed for improving the quality of several types of body-transmitted speech. In this paper, we conduct the conversion from body-transmitted ordinary speech (BTOS) into natural speech. This conversion system is very effective for achieving noise-robust speech communication.

4.1. Experimental Conditions

We simultaneously recorded BTOS and natural speech uttered by the same speaker using a NAM microphone [2], which is one of the body-conductive microphones, and an air-conductive microphone. The sampling frequency was set to 8 kHz. The 0-th through 16-th mel-cepstral coefficients were used as the spectral feature. The frame shift was set to 5 ms. In order to compensate for the missing acoustic features of BTOS caused by the body transmission effects described in [2], we used the spectral segment vector as the source feature, which was constructed by
concatenating the mel-cepstra at the current ±4 frames and then reducing the dimension of the concatenated vector using PCA with a loss of no more than 20% of the information. STRAIGHT [10] analysis was employed for the target spectral extraction. On the other hand, simple FFT analysis was employed for the source spectral extraction to reduce the processing time. A preliminary experiment showed that using a different spectral analysis method for the source speech does not cause any significant difference in conversion performance. We used four speakers, two male and two female. Table 2 shows the number of training/evaluation sentences for each speaker. The conversion model was trained for each speaker separately. The number of mixture components was set to 64.

Table 2: Number of sentences for training/evaluation.

speaker      male A   male B   female C   female D
training       93       89       100        100
evaluation     46       45        47         47

4.2. Objective Evaluations

We used the mel-cepstral distortion between the target and converted mel-cepstra as the objective evaluation measure. Figure 3 shows the mel-cepstral distortions of the frame-based, trajectory-based, and proposed methods. The proposed method outperforms the conventional frame-based method even in the no-delay condition because it effectively considers the dynamic features of the converted spectra. The distortion decreases quickly as the delay value increases, and it is almost equal to that of the trajectory-based method when the delay value is set to around five.

[Figure 3: Mel-cepstral distortion in each spectral conversion method. Mel-cepstral distortion (dB, roughly 3.5 to 3.7) is plotted against delay (0 to 30 frames) for the time-recursive, trajectory-based, and frame-based methods.]

4.3. Subjective Evaluations

We conducted a preference test on speech quality. We evaluated speech converted by the two conventional methods (frame-based and trajectory-based) and by the proposed method with the delay value set to 0, 1, and 5, respectively. Pairs of converted voices from two different methods were randomly presented to five listeners, who were then asked which voice sounded more natural. Figure 4 shows the preference scores of the individual methods with 95% confidence intervals. The converted speech quality of the proposed method with no delay is almost equal to that of the frame-based conversion. It is significantly improved by introducing some delay, even only one frame (i.e., 5 ms). Consequently, the proposed method achieves a very low-delay frame-by-frame conversion process while keeping the converted speech quality comparably high to that of the trajectory-based conversion.

[Figure 4: Result of preference tests of speech quality. Preference scores (%) with 95% confidence intervals for the frame-based, Delay0, Delay1, Delay5, and trajectory-based methods.]

5. Conclusions

In this paper, we proposed a spectral conversion method for high-quality and low-delay VC. The experimental results show that the proposed time-recursive spectral conversion algorithm achieves high-quality and low-delay conversion. We plan to reduce the computational cost of the conversion process and the required memory size in order to develop a small run-time VC system for embedded applications.

6. Acknowledgments

This work was supported in part by a MEXT Grant-in-Aid for Young Scientists (A).

7. References

[1] K. Y. Park and H. S. Kim, "Narrowband to wideband conversion of speech using GMM based transformation," in Proc. ICASSP, Istanbul, 2000, pp. 1847-1850.
[2] M. Nakagiri, T. Toda, H. Kashioka, and K. Shikano, "Improving body transmitted unvoiced speech with statistical voice conversion," in Proc. INTERSPEECH, Pittsburgh, USA, 2006, pp. 2270-2273.
[3] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.
[4] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, and Language Processing, Vol. 15, No. 8, pp. 2222-2235, 2007.
[5] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," in Proc. ICASSP, Detroit, USA, 1995, pp. 660-663.
[6] K. Koishida, K. Tokuda, T. Masuko, and T. Kobayashi, "Vector quantization of speech spectral parameters using statistics of static and dynamic features," IEICE Trans. Information and Systems, Vol. E84-D, No. 10, pp. 1427-1434, 2001.
[7] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, Seattle, USA, 1998, pp. 285-288.
[8] T. Toda, A. W. Black, and K. Tokuda, "Statistical mapping between articulatory movements and acoustic spectrum with a Gaussian mixture model," Speech Communication, Vol. 50, No. 3, pp. 215-227, 2008.
[9] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, 1991.
[10] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, Vol. 27, No. 3-4, pp. 187-207, 1999.
