NON-PARALLEL TRAINING FOR MANY-TO-MANY EIGENVOICE CONVERSION

Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
{yamato-o, tomoki, sawatari, shikano}@is.naist.jp

(This work was supported in part by MIC SCOPE.)

ABSTRACT

This paper presents a novel training method for an eigenvoice Gaussian mixture model (EV-GMM) that effectively uses non-parallel data sets for many-to-many eigenvoice conversion, a technique for converting an arbitrary source speaker's voice into an arbitrary target speaker's voice. In the proposed method, an initial EV-GMM is trained with the conventional method using parallel data sets consisting of a single reference speaker and multiple pre-stored speakers. Then, the initial EV-GMM is further refined using non-parallel data sets including a larger number of pre-stored speakers while considering the reference speaker's voices as hidden variables. The experimental results demonstrate that the proposed method yields significant quality improvements in converted speech by enabling us to use data of a larger number of pre-stored speakers.

Index Terms— Voice conversion, Gaussian mixture model, eigenvoice, many-to-many, non-parallel training

1. INTRODUCTION

Voice conversion (VC) [1] allows us to convert the voice characteristics of a source speaker into those of a target speaker without changing linguistic information. As one of the statistical approaches to VC, a conversion method based on the Gaussian mixture model (GMM) has been proposed [2]. In this method, a GMM of the joint probability density of source and target acoustic features is trained in advance using a parallel data set consisting of utterance pairs of the source and target voices. The trained GMM enables the conversion from the source into the target in a probabilistic manner [3]. Although this method works reasonably well, its training framework is inflexible and limits many VC applications.

In order to train the GMM more flexibly, we have proposed eigenvoice conversion (EVC) [4]. In EVC, we train an eigenvoice GMM (EV-GMM) in advance using multiple parallel data sets consisting of a single reference speaker and many pre-stored speakers. A GMM for the reference speaker and an arbitrary speaker is effectively built by adapting the EV-GMM to that speaker. In this adaptation, a small number of free parameters (i.e., speaker-dependent weight values for the individual eigenvoices) are estimated using only a few utterances of the adapted speaker in a text-independent manner. This method effectively uses prior knowledge extracted from many parallel data sets for building a new GMM.

We have developed several VC frameworks based on EVC, such as many-to-one EVC and one-to-many EVC [5]. The many-to-one EVC framework enables the conversion from an arbitrary source speaker's voice into a pre-determined target speaker's voice. The EV-GMM can be rapidly adapted to a new source speaker using only an input utterance to be converted, and the rapid adaptation performance is significantly improved by applying maximum a posteriori (MAP) adaptation to the unsupervised weight estimation [6].
On the other hand, the one-to-many EVC framework enables the conversion from a pre-determined source speaker's voice into an arbitrary target speaker's voice. One interesting application of one-to-many EVC is voice quality control, which allows us to intuitively control the converted voice quality by manipulating voice quality control scores that capture voice characteristics represented by several primitive words such as gender and age [7].

To develop a more flexible VC framework enabling the conversion from an arbitrary source speaker's voice into an arbitrary target speaker's voice, we have proposed many-to-many EVC [8], inspired by the multistep VC process [9]. In the adaptation process, we build two adapted GMMs, one for many-to-one EVC from an arbitrary source speaker's voice into the reference speaker's voice, and the other for one-to-many EVC from the reference speaker's voice into an arbitrary target speaker's voice, by adapting the single EV-GMM to the source speaker and the target speaker separately using a few arbitrary utterances of each speaker. In the conversion process, a source voice is converted into a target voice through the reference voice using the two adapted GMMs: the conversion from the source into the reference and the conversion from the reference into the target are performed sequentially while the reference speaker's voice is treated as a hidden variable. Although many-to-many EVC is a quite flexible framework for building a conversion model for any arbitrary speaker pair, we still need parallel data for training the EV-GMM in advance.

In this paper, we propose a method of refining the EV-GMM by additionally using arbitrary utterance sets of a larger number of pre-stored speakers, i.e., non-parallel data sets from various speakers, in order to relax the need for parallel data sets in EV-GMM training. In the proposed method, the initial EV-GMM is trained using the existing multiple parallel data sets, and then it is refined using only non-parallel data sets including a larger number of speakers while the reference voices corresponding to those data sets are treated as hidden variables. Note that such non-parallel data sets are generally much more easily available than multiple parallel data sets. Therefore, the proposed method allows us to extract more informative prior knowledge from a much larger number of speakers in EV-GMM training.

This paper is organized as follows. In Section 2, we describe the many-to-many EVC framework. In Section 3, the proposed EV-GMM training with non-parallel data sets is described. In Section 4, we describe experimental evaluations. Finally, we summarize this paper in Section 5.

2. MANY-TO-MANY EIGENVOICE CONVERSION

Figure 1 shows an overview of many-to-many EVC. In this paper, we use the one-to-many EV-GMM for many-to-many EVC.

Fig. 1. Overview of many-to-many EVC: the single EV-GMM is used as a many-to-one EV-GMM for many-to-one EVC from arbitrary source speakers into the reference speaker, and as a one-to-many EV-GMM for one-to-many EVC from the reference speaker into arbitrary target speakers.

2.1. Eigenvoice Gaussian mixture model (EV-GMM)

We employ 2D-dimensional acoustic features consisting of D-dimensional static and dynamic features, X_t = [x_t^\top, \Delta x_t^\top]^\top for the reference speaker and Y_t^{(s)} = [y_t^{(s)\top}, \Delta y_t^{(s)\top}]^\top for the s-th pre-stored speaker, where \top denotes transposition of the vector. The joint probability density of the acoustic features of the reference speaker and the individual pre-stored speakers is modeled by an EV-GMM as follows:

P(X_t, Y_t^{(s)} \mid \lambda, w^{(s)}) = \sum_{m=1}^{M} \alpha_m \mathcal{N}([X_t^\top, Y_t^{(s)\top}]^\top; \mu_m^{(X,Y)}(w^{(s)}), \Sigma_m^{(X,Y)}),   (1)

\mu_m^{(X,Y)}(w^{(s)}) = \begin{bmatrix} \mu_m^{(X)} \\ B_m w^{(s)} + b_m^{(0)} \end{bmatrix}, \quad \Sigma_m^{(X,Y)} = \begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XY)} \\ \Sigma_m^{(YX)} & \Sigma_m^{(YY)} \end{bmatrix},   (2)

where \mathcal{N}(\cdot; \mu, \Sigma) denotes the Gaussian distribution with mean vector \mu and covariance matrix \Sigma. In the EV-GMM, each individual mean vector is modeled by the bias vector b_m^{(0)}, the representative vectors B_m = [b_m^{(1)}, b_m^{(2)}, \cdots, b_m^{(J)}], and the weight vector w^{(s)}. Acoustic features of various target speakers are effectively modeled by setting only w^{(s)} to appropriate values. The other parameters \lambda, which include the mixture component weights, the reference mean vectors, the bias vectors, the representative vectors, and the covariance matrices, are tied across all speakers.
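To make the structure of Eq. (2) concrete, the following is a minimal numpy sketch of how an adapted target-side mean vector is assembled from the bias vector, the representative vectors, and a speaker weight vector. The sizes follow the experimental setup in Section 4 (J = 26 representative vectors, 24 mel-cepstral coefficients plus their deltas), but the random values are purely illustrative and this is not the authors' implementation.

```python
import numpy as np

D2 = 48  # 2D: static + dynamic feature dimension (24 mel-cepstra + deltas)
J = 26   # number of representative (eigenvoice) vectors

rng = np.random.default_rng(0)
b0 = rng.normal(size=D2)        # bias vector b_m^(0)
B = rng.normal(size=(D2, J))    # representative vectors B_m = [b_m^(1), ..., b_m^(J)]
w = rng.normal(size=J)          # speaker-dependent weight vector w^(s)

# Adapted target-side mean of component m in Eq. (2): B_m w^(s) + b_m^(0)
mu_Y = B @ w + b0
print(mu_Y.shape)  # (48,)
```

Adapting the EV-GMM to a new speaker thus amounts to choosing only the J weights, which is what makes adaptation from a few utterances feasible.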
2.2. Training of EV-GMM

The EV-GMM is trained on all of the parallel data sets, consisting of time-aligned acoustic features of the reference speaker and the individual pre-stored speakers determined by DTW, in the adaptive training paradigm [10] as follows:

(\hat{\lambda}, \hat{w}_1^S) = \arg\max_{\lambda, w_1^S} \prod_{s=1}^{S} \prod_{t=1}^{T_s} P(X_t, Y_t^{(s)} \mid \lambda, w^{(s)}).   (3)

Both the canonical model parameters \lambda and the adaptive parameters, i.e., the set of weight vectors w_1^S = (w^{(1)}, \cdots, w^{(S)}) for the individual pre-stored speakers, are optimized in the maximum likelihood (ML) sense. We employ the PCA-based EV-GMM [4] as the initial model for adaptive training.

2.3. Unsupervised adaptation of EV-GMM

The EV-GMM is adapted to a new source and target speaker pair by estimating w separately for each of the two speakers in the manner described in [11]. In order to perform the unsupervised adaptation using only arbitrary utterances, the following likelihood of the marginal probability density function is maximized with respect to the weight vector [4]:

\hat{w}^{(i)} = \arg\max_{w} \prod_{t=1}^{T^{(i)}} \int P(X_t, Y_t^{(i)} \mid \lambda, w) \, dX_t,   (4)

where Y^{(i)} is a time sequence of the given source features and \hat{w}^{(i)} is the ML estimate of the weight vector for the source speaker. The ML estimate \hat{w}^{(o)} of the weight vector for the given target speaker's data Y^{(o)} is determined in the same manner.
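Integrating out X_t in Eq. (4) leaves an ordinary GMM over Y with adapted means B_m w + b_m^{(0)} and covariances \Sigma_m^{(YY)}, so the weight vector can be estimated with EM. The sketch below is our illustrative reading of that procedure (cf. [4][11]), not the authors' code; the function name and the fixed iteration count are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_weights(Y, alpha, B, b0, Sigma_YY, n_iter=10):
    """EM estimate of w maximizing prod_t sum_m alpha_m N(Y_t; B_m w + b_m^(0),
    Sigma_m^(YY)), i.e. the marginal likelihood of Eq. (4).

    Y: (T, D2) features; alpha: (M,); B: (M, D2, J); b0: (M, D2);
    Sigma_YY: (M, D2, D2).
    """
    M, D2, J = B.shape
    w = np.zeros(J)
    for _ in range(n_iter):
        # E-step: component posteriors gamma_{t,m} under the current w
        means = B @ w + b0  # (M, D2)
        log_resp = np.stack(
            [np.log(alpha[m]) + multivariate_normal.logpdf(Y, means[m], Sigma_YY[m])
             for m in range(M)], axis=1)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        gamma = np.exp(log_resp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form update of w (same form as Eq. (12) below)
        lhs = np.zeros((J, J))
        rhs = np.zeros(J)
        for m in range(M):
            P = np.linalg.solve(Sigma_YY[m], B[m])  # Sigma^{-1} B, shape (D2, J)
            g = gamma[:, m].sum()
            Ybar = gamma[:, m] @ Y                  # weighted feature sum, (D2,)
            lhs += g * B[m].T @ P
            rhs += P.T @ (Ybar - g * b0[m])
        w = np.linalg.solve(lhs, rhs)
    return w
```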
2.4. Many-to-many conversion with adapted EV-GMMs

The source speaker's features are converted into the target speaker's features by sequentially performing the first-step conversion from the source features into the reference features and the second-step conversion from the converted reference features into the target features. Mixture components are shared between the two adapted EV-GMMs in order to keep the phonemic spaces consistent through these two conversion processes. Moreover, the converted reference features are regarded as hidden variables so as to account for conversion errors caused by the first-step conversion process. Consequently, a single conversion model of the joint probability density of the source and target features is derived from the two adapted EV-GMMs [8].

Let a time sequence of the source features, that of the target features, and that of the reference features be Y^{(i)} = [Y_1^{(i)\top}, \cdots, Y_T^{(i)\top}]^\top, Y^{(o)} = [Y_1^{(o)\top}, \cdots, Y_T^{(o)\top}]^\top, and X = [X_1^\top, \cdots, X_T^\top]^\top, respectively. The converted static feature vectors \hat{y}^{(o)} = [\hat{y}_1^{(o)\top}, \cdots, \hat{y}_T^{(o)\top}]^\top are obtained as follows:

\hat{y}^{(o)} = \arg\max_{y^{(o)}} \sum_{\text{all } m} P(m \mid Y^{(i)}, \lambda, \hat{w}^{(i)}) P(Y^{(o)} \mid Y^{(i)}, m, \lambda, \hat{w}^{(i)}, \hat{w}^{(o)}),   (5)

subject to Y^{(o)} = W y^{(o)},   (6)

where W is a window matrix extending the static feature vector sequence to the joint static and dynamic feature vector sequence, and m denotes a mixture component sequence. The probability density P(Y^{(o)} \mid Y^{(i)}, m, \lambda, \hat{w}^{(i)}, \hat{w}^{(o)}) is given by

P(Y^{(o)} \mid Y^{(i)}, m, \lambda, \hat{w}^{(i)}, \hat{w}^{(o)}) = \int P(Y^{(o)} \mid X, m, \lambda, \hat{w}^{(o)}) P(X \mid Y^{(i)}, m, \lambda, \hat{w}^{(i)}) \, dX = \prod_{t=1}^{T} \mathcal{N}(Y_t^{(o)}; \tilde{E}_{m,t}^{(Y)}, \tilde{D}_m^{(Y)}),   (7)

where

\tilde{E}_{m,t}^{(Y)} = B_m \hat{w}^{(o)} + b_m^{(0)} + A_m \Sigma_m^{(YY)-1} (Y_t^{(i)} - B_m \hat{w}^{(i)} - b_m^{(0)}),   (8)

\tilde{D}_m^{(Y)} = \Sigma_m^{(YY)} - A_m \Sigma_m^{(YY)-1} A_m^\top,   (9)

A_m = \Sigma_m^{(YX)} \Sigma_m^{(XX)-1} \Sigma_m^{(XY)}.   (10)
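The per-frame statistics of Eqs. (8)-(10) compose the two mappings (source into reference, reference into target) in closed form. Below is a minimal numpy sketch of these three formulas for a single mixture component; the variable names mirror the equations, the inputs are assumed to come from an adapted EV-GMM, and the sequence-level optimization of Eqs. (5)-(6) with the window matrix W is deliberately omitted.

```python
import numpy as np

def conditional_stats(Y_i_t, w_i, w_o, B_m, b0_m,
                      Sigma_YX, Sigma_XX, Sigma_XY, Sigma_YY):
    """Per-frame mean and covariance of P(Y_t^(o) | Y_t^(i), m), Eqs. (8)-(10)."""
    # Eq. (10): A_m propagates information through the hidden reference space
    A_m = Sigma_YX @ np.linalg.solve(Sigma_XX, Sigma_XY)
    # Eq. (8): target-adapted mean plus a regression on the source residual
    resid = Y_i_t - (B_m @ w_i + b0_m)
    E_mt = B_m @ w_o + b0_m + A_m @ np.linalg.solve(Sigma_YY, resid)
    # Eq. (9): covariance, reduced by the information carried over from the source
    D_m = Sigma_YY - A_m @ np.linalg.solve(Sigma_YY, A_m.T)
    return E_mt, D_m
```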
3. EV-GMM TRAINING WITH NON-PARALLEL DATA

Inspired by the conversion process of many-to-many EVC, we propose a new training method for the EV-GMM that treats the reference voice as a hidden variable. Figure 2 shows an overview of the proposed training process. In the first step, we train the initial EV-GMM using the existing multiple parallel data sets between a single reference speaker and many pre-stored speakers, in the same manner as described in the previous section. In the second step, we refine the EV-GMM using non-parallel data including a larger number of pre-stored speakers while regarding the reference features corresponding to those non-parallel data as hidden variables. Because this process is performed in a completely text-independent manner,
any pre-stored speech data, i.e., any utterance set of any speaker, can be used for refining the EV-GMM. Therefore, the proposed training method allows us to use a larger amount of training data including more varieties of texts and speakers.

Fig. 2. Overview of the proposed EV-GMM training process: 1) the initial EV-GMM is trained from parallel data sets consisting of the reference speaker and many pre-stored speakers; 2) the EV-GMM is refined with non-parallel data sets consisting of a large number of pre-stored speakers while regarding the reference speaker's voices as hidden variables.

In the second training process, we update the EV-GMM parameters by maximizing the following marginal likelihood:

(\hat{\lambda}, \hat{w}_1^S) = \arg\max_{\lambda, w_1^S} \prod_{s=1}^{S} \prod_{t=1}^{T_s} \int P(X_t, Y_t^{(s)} \mid \lambda, w^{(s)}) \, dX_t = \arg\max_{\lambda, w_1^S} \prod_{s=1}^{S} \prod_{t=1}^{T_s} P(Y_t^{(s)} \mid \lambda, w^{(s)}).   (11)
We can update the speaker-dependent weight vectors w^{(s)} and the EV-GMM parameters related only to the pre-stored speakers' features, i.e., the mixture component weights \alpha_m, the representative vectors B_m, the bias vectors b_m^{(0)}, and the covariance matrices \Sigma_m^{(YY)} of the pre-stored speakers for each mixture component. The ML estimates of these parameters are given by

\hat{w}^{(s)} = \left[ \sum_{m=1}^{M} \bar{\gamma}_m^{(s)} B_m^\top \Sigma_m^{(YY)-1} B_m \right]^{-1} \left[ \sum_{m=1}^{M} B_m^\top \Sigma_m^{(YY)-1} (\bar{Y}_m^{(s)} - \bar{\gamma}_m^{(s)} b_m^{(0)}) \right],   (12)

\hat{\alpha}_m = \sum_{s=1}^{S} \bar{\gamma}_m^{(s)} \Big/ \sum_{m=1}^{M} \sum_{s=1}^{S} \bar{\gamma}_m^{(s)},   (13)

\hat{v}_m = \left[ \sum_{s=1}^{S} \bar{\gamma}_m^{(s)} \hat{W}^{(s)\top} \Sigma_m^{(YY)-1} \hat{W}^{(s)} \right]^{-1} \left[ \sum_{s=1}^{S} \hat{W}^{(s)\top} \Sigma_m^{(YY)-1} \bar{Y}_m^{(s)} \right],   (14)

\hat{\Sigma}_m^{(YY)} = \frac{1}{\sum_{s=1}^{S} \bar{\gamma}_m^{(s)}} \sum_{s=1}^{S} \left[ \bar{V}_m^{(s)} + \bar{\gamma}_m^{(s)} \hat{\mu}_m^{(s)} \hat{\mu}_m^{(s)\top} - \hat{\mu}_m^{(s)} \bar{Y}_m^{(s)\top} - \bar{Y}_m^{(s)} \hat{\mu}_m^{(s)\top} \right],   (15)

where

\hat{v}_m = [\hat{b}_m^{(0)\top}, \hat{b}_m^{(1)\top}, \cdots, \hat{b}_m^{(J)\top}]^\top,   (16)

\hat{W}^{(s)} = [I, \hat{w}_1^{(s)} I, \hat{w}_2^{(s)} I, \cdots, \hat{w}_J^{(s)} I],   (17)

\hat{\mu}_m^{(s)} = \hat{W}^{(s)} \hat{v}_m.   (18)

The sufficient statistics for these estimates are given by

\bar{\gamma}_m^{(s)} = \sum_{t=1}^{T_s} P(m \mid Y_t^{(s)}, \lambda, w^{(s)}),   (19)

\bar{Y}_m^{(s)} = \sum_{t=1}^{T_s} P(m \mid Y_t^{(s)}, \lambda, w^{(s)}) Y_t^{(s)},   (20)

\bar{V}_m^{(s)} = \sum_{t=1}^{T_s} P(m \mid Y_t^{(s)}, \lambda, w^{(s)}) Y_t^{(s)} Y_t^{(s)\top}.   (21)
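For concreteness, here is a compact numpy sketch of one refinement iteration implementing Eqs. (13)-(21), given posteriors computed with the current model as in the earlier E-step sketch. It is our illustrative reading of the update rules, not the authors' implementation; the data layout and function name are assumptions.

```python
import numpy as np

def refine_ev_gmm(Y_sets, gamma_sets, w_hat, Sigma_YY):
    """One M-step of non-parallel EV-GMM refinement, Eqs. (13)-(21).

    Y_sets     : list of S arrays (T_s, D2), pre-stored speakers' features
    gamma_sets : list of S arrays (T_s, M), posteriors P(m | Y_t^(s))
    w_hat      : (S, J) current weight estimates (updated per Eq. (12))
    Sigma_YY   : (M, D2, D2) current covariances
    Returns (alpha, v, Sigma_new), where v[m] stacks [b0; b1; ...; bJ], Eq. (16).
    """
    S = len(Y_sets)
    M, D2, _ = Sigma_YY.shape
    J = w_hat.shape[1]
    I = np.eye(D2)

    # Sufficient statistics, Eqs. (19)-(21)
    g = np.array([gam.sum(axis=0) for gam in gamma_sets])                 # (S, M)
    Ybar = np.array([gam.T @ Y for gam, Y in zip(gamma_sets, Y_sets)])    # (S, M, D2)
    Vbar = np.array([np.einsum('tm,td,te->mde', gam, Y, Y)
                     for gam, Y in zip(gamma_sets, Y_sets)])              # (S, M, D2, D2)

    alpha = g.sum(axis=0) / g.sum()                                       # Eq. (13)

    v = np.empty((M, (J + 1) * D2))
    Sigma_new = np.empty_like(Sigma_YY)
    for m in range(M):
        lhs = np.zeros(((J + 1) * D2, (J + 1) * D2))
        rhs = np.zeros((J + 1) * D2)
        for s in range(S):
            W_s = np.hstack([I] + [w_hat[s, j] * I for j in range(J)])    # Eq. (17)
            P = np.linalg.solve(Sigma_YY[m], W_s)                         # Sigma^{-1} W
            lhs += g[s, m] * W_s.T @ P
            rhs += P.T @ Ybar[s, m]
        v[m] = np.linalg.solve(lhs, rhs)                                  # Eq. (14)
        num = np.zeros((D2, D2))
        for s in range(S):
            W_s = np.hstack([I] + [w_hat[s, j] * I for j in range(J)])
            mu = W_s @ v[m]                                               # Eq. (18)
            num += (Vbar[s, m] + g[s, m] * np.outer(mu, mu)
                    - np.outer(mu, Ybar[s, m]) - np.outer(Ybar[s, m], mu))
        Sigma_new[m] = num / g[:, m].sum()                                # Eq. (15)
    return alpha, v, Sigma_new
```

In practice this M-step would alternate with re-estimation of the posteriors and of w_hat via Eq. (12) until the marginal likelihood of Eq. (11) converges.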
In the conversion process, we employ the same conversion method as in conventional many-to-many EVC. Note that the parameters A_m for each mixture component, given by Eq. (10), are related to the reference features and are therefore not updated in the proposed training method; they are determined from the initial EV-GMM parameters trained by the conventional parallel training.

4. EXPERIMENTAL EVALUATIONS

4.1. Experimental conditions

We compared the performance of the proposed non-parallel training with that of the conventional parallel training. We trained the EV-GMM with the conventional parallel training using one male speaker as the reference speaker and 27 pre-stored speakers including 13 male and 14 female speakers [12]. The trained EV-GMM was further refined with the proposed non-parallel training using 160 pre-stored speakers including 80 male and 80 female speakers. To demonstrate the effectiveness of increasing the number of pre-stored speakers used in the proposed non-parallel training, we varied their number from 27, consisting of the same pre-stored speakers as used in the conventional parallel training, to 160. Each speaker uttered 50 phoneme-balanced sentences as described in [4].

In the evaluation, we used eight speaker pairs (two male-to-male, two female-to-female, two male-to-female, and two female-to-male pairs) selected from four male and five female speakers who were not included in the pre-stored speakers. We used two utterances for adapting the EV-GMM and 21 utterances for evaluation. We used 24-dimensional mel-cepstral coefficients analyzed by STRAIGHT [13] as the spectral feature. The number of representative vectors was 26 and the number of mixture components was 128. We converted the source fundamental frequency F_0 into the target one as follows:

\log \tilde{F}_0 = \frac{\sigma^{(y)}}{\sigma^{(x)}} (\log F_0 - \mu^{(x)}) + \mu^{(y)},   (22)

where \mu^{(x)} and \sigma^{(x)} denote the mean and standard deviation of the log-scaled source F_0, and \mu^{(y)} and \sigma^{(y)} denote those of the log-scaled target F_0. These parameters were calculated from the adaptation data.
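Eq. (22) is a simple mean-variance transform in the log-F0 domain; a minimal sketch follows. Passing zero-valued (unvoiced) frames through unchanged is a conventional assumption on our part, not a detail given in the paper.

```python
import numpy as np

def convert_f0(f0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Log-domain linear F0 conversion, Eq. (22). f0_src in Hz; zeros are
    treated as unvoiced frames and passed through unchanged."""
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    f0_out[voiced] = np.exp(
        sigma_y / sigma_x * (np.log(f0_src[voiced]) - mu_x) + mu_y)
    return f0_out
```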
4.2. Objective evaluation

We evaluated the spectral conversion performance using mel-cepstral distortion. Figure 3 shows the result when varying the number of pre-stored speakers used in the proposed non-parallel training. When using the same 27 pre-stored speakers as used in the conventional parallel training, the proposed non-parallel training method causes a degradation of conversion performance. This is reasonable because the non-parallel training data sets are less informative than the parallel training data sets in this case. It is observed that the proposed non-parallel training yields better conversion performance as the number of pre-stored speakers in the non-parallel data increases. This is because the proposed training method effectively updates the EV-GMM parameters so that the EV-GMM models well the voice characteristics of a larger number of speakers; e.g., the representative vectors are updated so that the sub-space spanned by them widely covers more varieties of speakers. Consequently, the proposed non-parallel training provides significant improvements in conversion performance when using a much larger number of pre-stored speakers than that used in the conventional parallel training.
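Mel-cepstral distortion is the standard objective measure here; below is a minimal sketch of its usual definition over time-aligned mel-cepstra. The paper does not spell out the formula, so the dB constant and the exclusion of the 0th (energy) coefficient are conventional assumptions.

```python
import numpy as np

def mel_cepstral_distortion(mc_conv, mc_ref):
    """Mean mel-cepstral distortion [dB] between aligned sequences.

    mc_conv, mc_ref: (T, 25) mel-cepstra; column 0 (energy) is excluded.
    """
    diff = mc_conv[:, 1:] - mc_ref[:, 1:]
    frame_dist = np.sqrt((diff ** 2).sum(axis=1))
    return (10.0 * np.sqrt(2.0) / np.log(10.0)) * frame_dist.mean()
```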
Fig. 3. Mel-cepstral distortion [dB] as a function of the number of non-parallel training speakers (27 to 160), for the conventional parallel training and the proposed non-parallel training.
4.3. Subjective evaluations

We compared the converted speech samples of the proposed non-parallel training with those of the conventional parallel training, using 27, 80, and 160 speakers for the proposed non-parallel training. We conducted a preference test on speech quality and an XAB test on conversion accuracy for speaker individuality. In the preference test, a pair of two different types of converted speech was presented to listeners, who were then asked which voice sounded better. In the XAB test, a pair of two different types of converted speech was presented after the target speech had been presented as a reference, and the listeners were asked which voice sounded more similar to the reference target. The number of listeners was ten, and each listener evaluated 48 sample pairs in each test.

Fig. 4. Results of subjective evaluations: preference scores [%] on speech quality and on speaker individuality with 95% confidence intervals, for 27, 80, and 160 non-parallel training speakers.

Figure 4 shows the experimental results. In the speech quality test, we can see almost the same tendency as observed in the objective evaluation; i.e., better speech quality is obtained by increasing the number of pre-stored speakers in the proposed non-parallel training, and the proposed non-parallel training yields significant quality improvements in converted speech compared with the conventional parallel training when using 160 pre-stored speakers. On the other hand, in the speaker individuality test, the conversion accuracy of the proposed non-parallel training remains almost equal to that of the conventional parallel training. These results suggest that the proposed non-parallel training yields better converted speech quality than the conventional parallel training, without degrading the conversion accuracy for speaker individuality, by additionally using non-parallel data including a larger number of pre-stored speakers.

5. CONCLUSIONS

This paper has described an EV-GMM training method using non-parallel data sets for many-to-many EVC. In the proposed training method, an initial EV-GMM is trained using parallel data sets consisting of a single reference speaker and multiple pre-stored speakers. Then, the initial EV-GMM is further refined using non-parallel data sets consisting of a larger number of pre-stored speakers while considering the reference speaker's voices as hidden variables. The experimental results have demonstrated that the proposed training method yields significant quality improvements in converted speech by effectively using non-parallel data sets including a larger number of pre-stored speakers.

6. REFERENCES
[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," J. Acoust. Soc. Jpn. (E), vol. 11, no. 2, pp. 71–76, 1990.
[2] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. SAP, vol. 6, no. 2, pp. 131–142, 1998.
[3] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. ASLP, vol. 15, no. 8, pp. 2222–2235, November 2007.
[4] T. Toda, Y. Ohtani, and K. Shikano, "Eigenvoice conversion based on Gaussian mixture model," Proc. ICSLP, pp. 2446–2449, September 2006.
[5] T. Toda, Y. Ohtani, and K. Shikano, "One-to-many and many-to-one voice conversion based on eigenvoices," Proc. ICASSP, vol. IV, pp. 1249–1252, April 2007.
[6] D. Tani, T. Toda, Y. Ohtani, H. Saruwatari, and K. Shikano, "Maximum a posteriori adaptation for many-to-one eigenvoice conversion," Proc. Interspeech 2008, pp. 1461–1464, September 2008.
[7] K. Ohta, Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Regression approaches to voice quality control based on one-to-many eigenvoice conversion," Proc. SSW6, pp. 101–106, August 2007.
[8] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Many-to-many eigenvoice conversion with reference voice," Proc. Interspeech 2009, pp. 1623–1626, September 2009.
[9] T. Masuda and M. Shozakai, "Cost reduction of training mapping function based on multistep voice conversion," Proc. ICASSP, vol. IV, pp. 693–696, April 2007.
[10] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Speaker adaptive training for one-to-many eigenvoice conversion based on Gaussian mixture model," Proc. Interspeech 2007, pp. 1981–1984, August 2007.
[11] R. Kuhn, J. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. SAP, vol. 8, no. 6, pp. 695–707, April 2000.
[12] JNAS: Japanese newspaper article sentences, http://www.milab.is.tsukuba.ac.jp/jnas/instruct.html.
[13] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.