An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets

Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano

Graduate School of Information Science, Nara Institute of Science and Technology, Japan
{daisuke-t, yamato-o, tomoki, sawatari, shikano}@is.naist.jp
Abstract

This paper describes an evaluation of many-to-one voice conversion (VC) algorithms, which convert an arbitrary speaker's voice into a particular target speaker's voice. These algorithms effectively generate a conversion model for a new source speaker using multiple parallel data sets of many pre-stored source speakers and the single target speaker. We conducted experimental evaluations to demonstrate the conversion performance of each of the many-to-one VC algorithms, including not only the conventional algorithms based on a speaker-independent GMM and on eigenvoice conversion (EVC), but also new algorithms based on speaker selection and on EVC with speaker adaptive training (SAT). The results show that an adaptation process for the conversion model significantly improves conversion performance, and that the algorithm based on speaker selection works well even when using a very limited amount of adaptation data.

Index Terms: voice conversion, many-to-one VC, EVC, SAT, speaker selection
1. Introduction

Voice conversion (VC) is a technique for converting an input speaker's voice into another speaker's voice while keeping the linguistic information intact [1]. VC has various applications, such as modification of the synthetic speech of Text-to-Speech [2], bandwidth extension of cellular speech [3], and body-transmitted speech enhancement [4]. One of the most useful VC applications is cross-language VC [5, 6], which converts speaker individuality across two different languages. This technique enables a speech translation system or a CALL (Computer Assisted Language Learning) system to synthesize a non-native language with the user's own voice.

Statistical approaches are often employed in VC, the most popular being the conversion method based on the Gaussian Mixture Model (GMM) [7]. A GMM representing the joint probability density of the source and the target speech parameters is trained in advance using parallel data consisting of utterance pairs of the source and the target speakers. The trained GMM allows the determination of the target speech parameters for given source speech parameters based on minimum mean square error (MMSE) estimation [7] or maximum likelihood estimation (MLE) [8], without any linguistic restrictions. Thus, an arbitrary sentence uttered by the source speaker is rendered as a sentence uttered by the target speaker. One essential problem of this approach is the need for parallel data for model training. Moreover, several tens of phoneme-balanced sentences, whose total duration is around 3 to 5 minutes, are generally required to train the GMM to a sufficient level of conversion performance. These constraints clearly limit VC applications.
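To make the GMM-based mapping concrete, the following sketch implements the classic per-frame MMSE conversion [7] on top of a joint-density GMM fitted with scikit-learn. Note that the paper itself uses the MLE-based trajectory conversion with dynamic features [8], which this simplified example does not cover; all function and variable names here are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def convert_frame_mmse(joint_gmm: GaussianMixture, x: np.ndarray, d: int) -> np.ndarray:
    """Convert one source frame x (dim d) with a joint-density GMM fitted on
    stacked [x; y] vectors of dimension 2*d (covariance_type='full')."""
    mu_x = joint_gmm.means_[:, :d]          # per-mixture source means
    mu_y = joint_gmm.means_[:, d:]          # per-mixture target means
    cov = joint_gmm.covariances_            # (M, 2d, 2d) full covariances
    M = joint_gmm.n_components
    # posterior P(m | x) under the marginal source GMM
    log_p = np.array([multivariate_normal.logpdf(x, mu_x[m], cov[m, :d, :d])
                      for m in range(M)]) + np.log(joint_gmm.weights_)
    post = np.exp(log_p - logsumexp(log_p))
    # E[y | x] = sum_m P(m|x) * (mu_y_m + Syx_m Sxx_m^{-1} (x - mu_x_m))
    y = np.zeros(d)
    for m in range(M):
        sxx, syx = cov[m, :d, :d], cov[m, d:, :d]
        y += post[m] * (mu_y[m] + syx @ np.linalg.solve(sxx, x - mu_x[m]))
    return y
```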
In order to alleviate the problem of parallel training, two main approaches have been proposed. One is parallel data generation from non-parallel data; the other is model adaptation with non-parallel data. The first approach conducts frame alignment between the source and the target voices based on HMM state alignment [9] or unit selection [10]; conventional model training is then applied to the resulting parallel data. This approach requires only the source and the target speakers' voices for model training, but basically the same amount of training data is still necessary. The latter approach, on the other hand, uses other speakers' voices as prior knowledge for training the model for the desired speaker pair. Mouchtaris et al. [11] proposed a non-parallel training method based on maximum likelihood constrained adaptation, in which the GMM trained with an existing parallel data set of a certain source-target speaker pair is adapted separately to the desired source and target speakers. Lee et al. [12] proposed an adaptation method based on maximum a posteriori (MAP) estimation. In order to use more reliable prior knowledge and reduce the amount of adaptation data, Toda et al. [13] proposed eigenvoice conversion (EVC). EVC trains an eigenvoice GMM (EV-GMM) in advance using multiple parallel data sets consisting of utterance pairs of a single speaker and many pre-stored speakers. Effectively exploiting the feature correlations between those speakers, extracted from the pre-stored parallel data sets, enables unsupervised adaptation of the EV-GMM to the desired speaker using only a few arbitrary sentences.

There are two novel VC frameworks to which EVC has been applied: one-to-many VC and many-to-one VC [14]. One-to-many VC converts the particular source speaker's voice into an arbitrary speaker's voice. Many-to-one VC, conversely, converts an arbitrary speaker's voice into the particular target speaker's voice. This paper focuses on many-to-one VC, which enables utterances of any language spoken by an arbitrary speaker to be converted into the specific target speaker's voice. It has been reported that not only the EV-GMM but also the source-independent GMM (SI-GMM) works in many-to-one VC [14]; the SI-GMM requires no adaptation process and is simply trained on the multiple parallel data sets of the many pre-stored source speakers and the single target speaker. In this paper, we propose another many-to-one VC method based on speaker selection [15], and we introduce speaker adaptive training (SAT) [16] into EVC to further improve conversion performance. These many-to-one VC methods are compared with each other in both objective and subjective evaluations.

This paper is organized as follows. In Section 2, we describe the many-to-one VC algorithms. In Section 3, we describe the experimental evaluations. Finally, we summarize this paper in Section 4.
2. Many-to-One VC algorithms
We describe four many-to-one VC algorithms, based on 1) the SI-GMM, 2) speaker selection, 3) EVC, and 4) EVC with SAT. These algorithms differ only in the way the conversion model is constructed for a given new source speaker. Every algorithm employs the MLE-based conversion method [8] using the resulting conversion model.

2.1. Many-to-One VC based on SI-GMM [14]

We use 2D-dimensional acoustic features, $X_t^{(s)} = [x_t^{(s)\top}, \Delta x_t^{(s)\top}]^\top$ (the $s$-th source speaker's) and $Y_t = [y_t^\top, \Delta y_t^\top]^\top$ (the target speaker's), consisting of $D$-dimensional static and dynamic features, where $\top$ denotes transposition of the vector. The joint probability density of $Z_t^{(s)} = [X_t^{(s)\top}, Y_t^\top]^\top$, consisting of time-aligned source and target features determined by DTW, is modeled with a GMM as follows:

$$P(Z_t^{(s)} \mid \lambda) = \sum_{i=1}^{M} \alpha_i \, \mathcal{N}\left(Z_t^{(s)}; \mu_i^{(Z)}, \Sigma_i^{(ZZ)}\right), \quad (1)$$

$$\mu_i^{(Z)} = \begin{bmatrix} \mu_i^{(X)} \\ \mu_i^{(Y)} \end{bmatrix}, \qquad \Sigma_i^{(ZZ)} = \begin{bmatrix} \Sigma_i^{(XX)} & \Sigma_i^{(XY)} \\ \Sigma_i^{(YX)} & \Sigma_i^{(YY)} \end{bmatrix}, \quad (2)$$

where $\mathcal{N}(x; \mu, \Sigma)$ denotes the normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. The $i$-th mixture weight is $\alpha_i$, and the total number of mixtures is $M$. This paper employs diagonal covariance matrices for the individual blocks $\Sigma_i^{(XX)}$, $\Sigma_i^{(XY)}$, $\Sigma_i^{(YX)}$ and $\Sigma_i^{(YY)}$. The SI-GMM is trained with all of the multiple parallel data sets, consisting of utterance pairs of the multiple pre-stored source speakers and the one target speaker, as follows:

$$\lambda^{(0)} = \arg\max_{\lambda} \prod_{s=1}^{S} \prod_{t=1}^{T_s} P(Z_t^{(s)} \mid \lambda), \quad (3)$$

where $S$ is the number of pre-stored source speakers. Figure 1 shows this training process, which is performed in advance. In the conversion stage, the SI-GMM is used directly, without any adaptation process.

Figure 1: Previous training process in many-to-one VC based on SI-GMM.
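As an illustration of this training procedure, here is a minimal sketch that aligns each parallel utterance pair with a naive DTW, stacks the aligned frames into joint vectors $Z_t$, and fits a single GMM over all pre-stored speaker pairs, in the spirit of Eq. (3). It assumes features are already extracted as static+delta arrays and, for simplicity, uses full covariances rather than the per-block diagonal structure described above; names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Minimal DTW over Euclidean frame distances; returns aligned index pairs."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    path, (i, j) = [], (n, m)
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda t: acc[t])
    return path[::-1]

def train_si_gmm(parallel_sets, n_mix=128):
    """parallel_sets: list of (source_frames, target_frames) per pre-stored
    speaker pair, each a (T, 2D) array of static+delta features."""
    z = []
    for x, y in parallel_sets:
        for i, j in dtw_path(x, y):
            z.append(np.concatenate([x[i], y[j]]))   # Z_t = [X_t; Y_t]
    return GaussianMixture(n_components=n_mix, covariance_type='full',
                           reg_covar=1e-6).fit(np.array(z))
```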
2.2. Many-to-One VC based on Speaker Selection

It is well known that the phonemic space of one speaker often overlaps with that of another. Consequently, the SI-GMM can cause conversion errors, especially for source speakers whose voice characteristics differ considerably from the average voice of the pre-stored source speakers. A model adaptation process for each source speaker is therefore useful for alleviating this problem. Speaker selection is one such model adaptation technique. Figure 2 shows the previous training and adaptation processes in many-to-one VC based on speaker selection. The conversion model is trained not with all parallel data sets, but only with those of the pre-stored source speakers whose voice characteristics are similar to those of the given source speaker. In order to considerably reduce the computational cost of this adaptation, we employ a single EM update of the SI-GMM using pre-calculated sets of sufficient statistics for the individual speaker pairs. Note that this process allows unsupervised adaptation.

Figure 2: Previous training and adaptation processes in many-to-one VC based on speaker selection.

2.2.1. Previous training

A set of sufficient statistics for each speaker pair is computed with each parallel data set and the SI-GMM as follows:
$$\bar{\gamma}_i^{(s)} = \sum_{t=1}^{T_s} P\left(m_i \mid Z_t^{(s)}, \lambda^{(0)}\right), \quad (4)$$

$$\bar{Z}_i^{(s)} = \sum_{t=1}^{T_s} P\left(m_i \mid Z_t^{(s)}, \lambda^{(0)}\right) Z_t^{(s)}, \quad (5)$$

$$\bar{V}_i^{(s)} = \sum_{t=1}^{T_s} P\left(m_i \mid Z_t^{(s)}, \lambda^{(0)}\right) Z_t^{(s)} Z_t^{(s)\top}. \quad (6)$$
Moreover, individual source dependent GMMs (SD-GMMs) are trained using each of the calculated sets of sufficient statistics for individual speaker pairs.
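A minimal sketch of Eqs. (4)-(6), assuming the SI-GMM is a fitted scikit-learn GaussianMixture and z_s holds one speaker pair's joint feature sequence:

```python
import numpy as np

def sufficient_statistics(si_gmm, z_s: np.ndarray):
    """Per-speaker-pair sufficient statistics of the SI-GMM (Eqs. 4-6).
    z_s: (T_s, 4D) joint feature sequence of one speaker pair."""
    post = si_gmm.predict_proba(z_s)                 # P(m_i | Z_t, lambda0), (T, M)
    gamma = post.sum(axis=0)                         # Eq. (4), shape (M,)
    z_bar = post.T @ z_s                             # Eq. (5), shape (M, 4D)
    # Eq. (6): second-order statistics, one (4D, 4D) matrix per mixture
    v_bar = np.einsum('tm,ti,tj->mij', post, z_s, z_s)
    return gamma, z_bar, v_bar
```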
2.2.2. Unsupervised adaptation

First, the likelihood of each SD-GMM for the given adaptation data $X^{(org)}$ of the new source speaker is calculated as follows:

$$L^{(s)} = \int P\left(X^{(org)}, Y \mid \lambda^{(s)}\right) dY. \quad (7)$$

Then, the likelihoods of the individual SD-GMMs are sorted and the top $N$ speaker pairs are selected. The conversion model is generated from the $N$-best sets of sufficient statistics for the selected speaker pairs as follows:

$$\hat{\alpha}_i = \frac{\sum_{s \in S_N} \bar{\gamma}_i^{(s)}}{\sum_{m=1}^{M} \sum_{s \in S_N} \bar{\gamma}_m^{(s)}}, \quad (8)$$

$$\hat{\mu}_i = \frac{\sum_{s \in S_N} \bar{Z}_i^{(s)}}{\sum_{s \in S_N} \bar{\gamma}_i^{(s)}}, \quad (9)$$

$$\hat{\Sigma}_i = \frac{\sum_{s \in S_N} \bar{V}_i^{(s)}}{\sum_{s \in S_N} \bar{\gamma}_i^{(s)}} - \hat{\mu}_i \hat{\mu}_i^\top, \quad (10)$$

where $\hat{\alpha}_i$, $\hat{\mu}_i$ and $\hat{\Sigma}_i$ are the updated mixture weight, mean vector and covariance matrix, and $S_N$ is the set of the $N$-best speakers.
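The selection and one-step update of Eqs. (8)-(10) can be sketched as follows, assuming the per-speaker adaptation likelihoods of Eq. (7) have already been computed (their evaluation via the marginal source distribution of each SD-GMM is omitted here):

```python
import numpy as np

def nbest_update(stats, scores, n_best=27):
    """One-step EM update from the N-best sets of sufficient statistics
    (Eqs. 8-10).  stats: list of (gamma, z_bar, v_bar) per speaker pair;
    scores: per-speaker adaptation likelihoods L(s) from Eq. (7).
    n_best=27 follows the setting used in the evaluations (Sec. 3.2)."""
    top = np.argsort(scores)[::-1][:n_best]            # N-best speaker pairs
    gamma = sum(stats[s][0] for s in top)              # (M,)
    z_bar = sum(stats[s][1] for s in top)              # (M, 4D)
    v_bar = sum(stats[s][2] for s in top)              # (M, 4D, 4D)
    alpha = gamma / gamma.sum()                        # Eq. (8)
    mu = z_bar / gamma[:, None]                        # Eq. (9)
    sigma = (v_bar / gamma[:, None, None]
             - np.einsum('mi,mj->mij', mu, mu))        # Eq. (10)
    return alpha, mu, sigma
```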
2.3. Many-to-One VC based on EVC [14]

2.3.1. Eigenvoice Gaussian Mixture Model (EV-GMM)

The mean vector of the EV-GMM for many-to-one VC is given by

$$\mu_i^{(Z)} = \begin{bmatrix} B_i^{(X)} w + b_i^{(X)}(0) \\ \mu_i^{(Y)} \end{bmatrix}. \quad (11)$$

The source mean vector for the $i$-th mixture is represented as a linear combination of a bias vector $b_i^{(X)}(0)$ and representative vectors $B_i^{(X)} = [b_i^{(X)}(1), b_i^{(X)}(2), \cdots, b_i^{(X)}(J)]$. The number of representative vectors is $J$. The source speaker individuality is controlled with only the $J$-dimensional weight vector $w = [w(1), w(2), \cdots, w(J)]^\top$.

2.3.2. Training of EV-GMM

Figure 3 shows the previous training and adaptation processes of many-to-one VC based on EVC. Each SD-GMM is trained by updating only the source mean vectors of the SI-GMM using each of the multiple parallel data sets. As a source-dependent parameter, a supervector for each pre-stored source speaker is constructed by concatenating the source mean vectors of its SD-GMM. The bias and representative vectors, i.e., the eigenvectors, are determined by principal component analysis (PCA) over all source speakers' supervectors. Finally, the EV-GMM is constructed from the resulting bias and representative vectors and the parameters of the SI-GMM.

Figure 3: Previous training and adaptation processes in many-to-one VC based on EVC.

2.3.3. Unsupervised adaptation of EV-GMM

The EV-GMM is adapted to an arbitrary speaker by estimating the optimum weight vector for given speech samples without any linguistic information. The weight vector is estimated so that the likelihood of the marginal distribution for a time sequence of the given source features $X^{(org)}$ is maximized:

$$\hat{w} = \arg\max_{w} \int P\left(X^{(org)}, Y \mid \lambda^{(EV)}\right) dY. \quad (12)$$

This estimation is performed with the EM algorithm.
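A sketch of the supervector construction and PCA step of Section 2.3.2, assuming the source mean vectors of the $S$ SD-GMMs are already available; the reshaping conventions and names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_ev_gmm_vectors(sd_source_means: np.ndarray, n_eigen: int = 159):
    """sd_source_means: (S, M, 2D) source mean vectors of the S SD-GMMs.
    Returns the bias vectors b_i(0) and representative vectors B_i."""
    s, m, d2 = sd_source_means.shape
    sv = sd_source_means.reshape(s, m * d2)       # one supervector per speaker
    pca = PCA(n_components=n_eigen).fit(sv)       # eigenvoice analysis
    bias = pca.mean_.reshape(m, d2)               # b_i(0), per-mixture bias
    reps = pca.components_.T.reshape(m, d2, n_eigen)  # B_i, (M, 2D, J)
    return bias, reps
```

With $S = 160$ pre-stored speakers, up to $J = 159$ eigenvectors are available, which matches the setting used in the experiments (Section 3.1).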
2.4. Many-to-One VC based on EVC with SAT

The tied parameters of the PCA-based EV-GMM are taken from the SI-GMM, and are therefore affected by the acoustic variations among the many pre-stored source speakers. In particular, the source covariance values are much larger than those of an SD-GMM, which can degrade the performance of the adapted EV-GMM. In order to train a more appropriate canonical EV-GMM, we apply speaker adaptive training (SAT) to the EV-GMM training. Figure 4 shows the previous training and adaptation processes of many-to-one VC based on EVC with SAT. The canonical EV-GMM is trained by maximizing the likelihood of the adapted models for the individual pre-stored source speakers,

$$\hat{\lambda}^{(EV)}(\hat{w}_1^S) = \arg\max_{\lambda} \prod_{s=1}^{S} \prod_{t=1}^{T_s} P\left(Z_t^{(s)} \mid \lambda^{(EV)}(w_s)\right), \quad (13)$$

where $\lambda^{(EV)}(w_s)$ denotes the model adapted to the $s$-th pre-stored source speaker with the weight vector $w_s$. SAT estimates both the canonical EV-GMM parameters $\hat{\lambda}^{(EV)}$ and a set of weight vectors for the pre-stored source speakers $\hat{w}_1^S = [w_1, w_2, \cdots, w_S]$. The estimation is performed with the EM algorithm by maximizing the following auxiliary function:

$$Q\left(\lambda^{(EV)}(w_1^S), \hat{\lambda}^{(EV)}(\hat{w}_1^S)\right) = \sum_{s=1}^{S} \sum_{i=1}^{M} \bar{\gamma}_i^{(s)} \log P\left(m_i, Z^{(s)} \mid \hat{\lambda}^{(EV)}(\hat{w}_s)\right), \quad (14)$$
where $\bar{\gamma}_i^{(s)}$, $\bar{Z}_i^{(s)}$ and $\bar{V}_i^{(s)}$ denote the sufficient statistics of Eqs. (4)-(6) recomputed with the adapted model $\lambda^{(EV)}(w_s)$ in place of $\lambda^{(0)}$:

$$\bar{\gamma}_i^{(s)} = \sum_{t=1}^{T_s} P\left(m_i \mid Z_t^{(s)}, \lambda^{(EV)}(w_s)\right),$$

$$\bar{Z}_i^{(s)} = \begin{bmatrix} \bar{X}_i^{(s)} \\ \bar{Y}_i^{(s)} \end{bmatrix} = \begin{bmatrix} \sum_{t=1}^{T_s} P\left(m_i \mid Z_t^{(s)}, \lambda^{(EV)}(w_s)\right) X_t^{(s)} \\ \sum_{t=1}^{T_s} P\left(m_i \mid Z_t^{(s)}, \lambda^{(EV)}(w_s)\right) Y_t^{(s)} \end{bmatrix},$$

$$\bar{V}_i^{(s)} = \sum_{t=1}^{T_s} P\left(m_i \mid Z_t^{(s)}, \lambda^{(EV)}(w_s)\right) Z_t^{(s)} Z_t^{(s)\top}.$$

It is difficult to update all parameters simultaneously because some depend on others. Therefore, each parameter of the EV-GMM is updated in turn, so that the auxiliary function never decreases:

$$Q\left(\lambda^{(EV)}(w_1^S), \lambda^{(EV)}(w_1^S)\right) \leq Q\left(\lambda^{(EV)}(w_1^S), \left(\hat{w}_1^S, B_i^{(X)}, b_i^{(X)}(0), \mu_i^{(Y)}, \alpha_i, \Sigma_i^{(ZZ)}\right)\right)$$
$$\leq Q\left(\lambda^{(EV)}(w_1^S), \left(\hat{w}_1^S, \hat{B}_i^{(X)}, \hat{b}_i^{(X)}(0), \hat{\mu}_i^{(Y)}, \hat{\alpha}_i, \Sigma_i^{(ZZ)}\right)\right)$$
$$\leq Q\left(\lambda^{(EV)}(w_1^S), \left(\hat{w}_1^S, \hat{B}_i^{(X)}, \hat{b}_i^{(X)}(0), \hat{\mu}_i^{(Y)}, \hat{\alpha}_i, \hat{\Sigma}_i^{(ZZ)}\right)\right).$$

The ML estimate of the weight vector for the $s$-th pre-stored source speaker is written as

$$\hat{w}_s = \left(\sum_{i=1}^{M} \bar{\gamma}_i^{(s)} B_i^{(X)\top} P_i^{(XX)} B_i^{(X)}\right)^{-1} \left[\sum_{i=1}^{M} \left\{ B_i^{(X)\top} P_i^{(XY)} \left(\bar{Y}_i^{(s)} - \bar{\gamma}_i^{(s)} \mu_i^{(Y)}\right) + B_i^{(X)\top} P_i^{(XX)} \left(\bar{X}_i^{(s)} - \bar{\gamma}_i^{(s)} b_i^{(X)}(0)\right) \right\}\right], \quad (15)$$

where the precision blocks are defined by

$$\Sigma_i^{(ZZ)-1} = \begin{bmatrix} P_i^{(XX)} & P_i^{(XY)} \\ P_i^{(YX)} & P_i^{(YY)} \end{bmatrix}.$$

The ML estimates of the tied parameters are written as

$$\hat{\alpha}_i = \frac{\sum_{s=1}^{S} \bar{\gamma}_i^{(s)}}{\sum_{i=1}^{M} \sum_{s=1}^{S} \bar{\gamma}_i^{(s)}}, \quad (16)$$

$$\hat{v}_i = \left(\sum_{s=1}^{S} \bar{\gamma}_i^{(s)} \hat{W}_s^\top \Sigma_i^{(ZZ)-1} \hat{W}_s\right)^{-1} \left(\sum_{s=1}^{S} \hat{W}_s^\top \Sigma_i^{(ZZ)-1} \bar{Z}_i^{(s)}\right), \quad (17)$$

$$\hat{\Sigma}_i^{(ZZ)} = \frac{1}{\sum_{s=1}^{S} \bar{\gamma}_i^{(s)}} \sum_{s=1}^{S} \left\{ \bar{V}_i^{(s)} + \bar{\gamma}_i^{(s)} \hat{\mu}_i^{(s)} \hat{\mu}_i^{(s)\top} - \hat{\mu}_i^{(s)} \bar{Z}_i^{(s)\top} - \bar{Z}_i^{(s)} \hat{\mu}_i^{(s)\top} \right\}, \quad (18)$$

where

$$\hat{\mu}_i^{(s)} = \hat{W}_s \hat{v}_i = \begin{bmatrix} \hat{B}_i^{(X)} \hat{w}_s + \hat{b}_i^{(X)}(0) \\ \hat{\mu}_i^{(Y)} \end{bmatrix},$$

$$\hat{v}_i = \left[\hat{\mu}_i^{(Y)\top}, \hat{b}_i^{(X)}(0)^\top, \hat{b}_i^{(X)}(1)^\top, \cdots, \hat{b}_i^{(X)}(J)^\top\right]^\top,$$

$$\hat{W}_s = \begin{bmatrix} 0 & I & \hat{w}_s(1) I & \hat{w}_s(2) I & \cdots & \hat{w}_s(J) I \\ I & 0 & 0 & 0 & \cdots & 0 \end{bmatrix},$$

and the matrix $I$ is the $D \times D$ unit matrix.

Figure 4: Previous training and adaptation processes of many-to-one VC based on EV-GMM with SAT.
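Under the reconstruction of Eq. (15) above, the weight estimate reduces to a $J \times J$ linear system accumulated over mixtures. A numpy sketch (all argument names are illustrative, and this follows the reconstructed equation rather than a reference implementation):

```python
import numpy as np

def estimate_weight(gamma, x_bar, y_bar, B, b0, mu_y, P_xx, P_xy):
    """ML weight vector w_s (Eq. 15, sketch).
    gamma: (M,) occupancies; x_bar, y_bar: (M, d) first-order statistics;
    B: (M, d, J) representative vectors; b0: (M, d) bias vectors;
    mu_y: (M, d) target means; P_xx, P_xy: (M, d, d) precision blocks."""
    m, d, j = B.shape
    A = np.zeros((j, j))
    c = np.zeros(j)
    for i in range(m):
        A += gamma[i] * B[i].T @ P_xx[i] @ B[i]
        c += B[i].T @ (P_xy[i] @ (y_bar[i] - gamma[i] * mu_y[i])
                       + P_xx[i] @ (x_bar[i] - gamma[i] * b0[i]))
    return np.linalg.solve(A, c)   # w_s = A^{-1} c
```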
3. Experimental Evaluations

3.1. Experimental conditions

We used 160 speakers, 80 male and 80 female, included in the Japanese Newspaper Article Sentences (JNAS) database [17] as the pre-stored source speakers. Each of them uttered a set of 50 phonetically balanced sentences. We used a male speaker not included in JNAS as the target speaker in many-to-one VC; he uttered the same sentence sets as the pre-stored speakers. We used 10 test speakers, 5 male and 5 female, who were not included among the pre-stored speakers. They uttered 53 sentences that were also not included in the pre-stored data sets. The number of adaptation sentences was varied from 1 to 32; the remaining 21 sentences were used for the evaluations. More detailed conditions are described in [13, 14].

We used the 1st through 24th mel-cepstral coefficients, obtained from the smoothed spectrum analyzed by STRAIGHT [18], as the spectral parameter. The frame shift was set to 5 ms. A simple linear conversion based on the means and standard deviations of the log-scaled F0 of the source and the target speakers was employed for F0 conversion. In many-to-one VC based on EVC with/without SAT, the number of representative vectors of the EV-GMM was set to 159. In the MLED-based unsupervised adaptation, the first E-step was conducted with the SI-GMM, and the subsequent E-steps were conducted with the adapted EV-GMM. The number of EM iterations was set to 10. Note that only a few EM iterations also work well, because rapid convergence is usually obtained.
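The log-scaled F0 conversion mentioned above is a simple mean/variance matching in the log domain. A sketch, assuming voiced frames are marked by nonzero F0 and the log-F0 statistics of both speakers are given:

```python
import numpy as np

def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Linear log-F0 conversion: match the target speaker's log-F0 mean and
    standard deviation.  Unvoiced frames (F0 == 0) are left unvoiced."""
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    lf0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp(tgt_mean + (lf0 - src_mean) * tgt_std / src_std)
    return f0_out
```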
Figure 5: Mel-cepstral distortion as a function of the number of selected pre-stored speakers in many-to-one VC based on speaker selection. The number of mixtures is set to 32, 64, 128, 256 and 512. Only one sentence is used for the adaptation.

Figure 6: Mel-cepstral distortion as a function of the number of mixtures. The number of adaptation sentences is set to 2. Results of the conventional parallel training [14] are also shown as references.
3.2. Objective evaluations

We conducted objective evaluations using the mel-cepstral distortion between the converted and the target mel-cepstra as the evaluation measure. The average distortion over all test speakers was 8.11 dB before conversion.

Figure 5 shows the mel-cepstral distortion as a function of the number of selected pre-stored speakers in many-to-one VC based on speaker selection. Note that the results when selecting 160 pre-stored speakers are identical to those of many-to-one VC based on the SI-GMM. The adaptation method based on speaker selection improves conversion accuracy compared with the method based on the SI-GMM, because the adapted GMM models the acoustic space of the given source speaker more properly than the SI-GMM does. The best conversion accuracy is achieved when selecting around 20 to 40 pre-stored speakers. Rapid degradation of conversion accuracy is observed when the number of selected speakers is set much smaller: directly using the conversion model of a single other pre-stored speaker obviously does not work, even if his voice characteristics are the most similar to those of the given source speaker among the pre-stored speakers. It is therefore important to construct the conversion model properly, covering the acoustic space of the given source speaker by mixing multiple speakers' data sets. The number of selected speakers is set to 27 in many-to-one VC based on speaker selection in the following evaluations.

Figure 6 shows the mel-cepstral distortion as a function of the number of mixtures. Each many-to-one VC algorithm with an adaptation process outperforms the one based on the SI-GMM. Many-to-one VC based on speaker selection allows the adaptation of every parameter of the conversion model. However, its adaptation mechanism is coarser than that of EVC, in that it mixes the data sets of the selected pre-stored speakers at a constant rate. EVC, on the other hand, estimates the best mixing rate, i.e., the weights for the eigenvectors, in the ML sense, although it allows only the adaptation of the source mean vectors. Consequently, the conversion performance of speaker selection is comparable to that of EVC. SAT optimizes the tied parameters of the EV-GMM while taking the adaptation process into account. Therefore, the performance of EVC is clearly improved by applying SAT to the EV-GMM training.
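The evaluation measure above can be computed from time-aligned mel-cepstra as follows. This sketch uses a common definition of mel-cepstral distortion (excluding the 0th, energy, coefficient, consistent with the 1st-24th coefficients used here); the exact constants used in the paper are not stated, so they are an assumption:

```python
import numpy as np

def mel_cepstral_distortion(mc_conv: np.ndarray, mc_tgt: np.ndarray) -> float:
    """Average mel-cepstral distortion in dB between time-aligned mel-cepstra
    of shape (frames, order+1); the 0th coefficient is excluded."""
    diff = mc_conv[:, 1:] - mc_tgt[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```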
Figure 7: Mel-cepstral distortion as a function of the number of adaptation sentences. The number of mixtures is set to 128.
Figure 7 shows the mel-cepstral distortion as a function of the number of adaptation sentences, varying from 1/32 to 32 sentences. In the 1/32-sentence case, only one of the 32 blocks into which one sentence is divided is used as adaptation data; it includes only around 32 frames, whose total duration is around 0.16 seconds. EVC clearly improves performance compared with the SI-GMM when using one or more adaptation sentences, and introducing SAT into EVC improves performance further. However, when the adaptation data is decreased to less than one sentence, conversion accuracy starts to degrade rapidly due to over-adaptation. Although reducing the number of representative vectors alleviates the over-adaptation problem, similar degradation tendencies were still observed in another experiment, which is not shown here. On the other hand, speaker selection retains its performance improvement over the SI-GMM even when the adaptation data is decreased to less than one sentence. Compared with EVC, speaker selection is more robust with respect to the amount of adaptation data. These results show that the best adaptation method differs according to the amount of adaptation data.
3.3. Subjective evaluation

We conducted a subjective evaluation of the speech quality of the converted voices. The four many-to-one VC methods were evaluated in a preference test. The number of mixtures was set to 128, and the number of adaptation sentences to 2. The number of subjects was 6. We randomly presented a pair of converted voices from two different methods, and the subjects were asked which sample sounded more natural. Each subject evaluated 120 sample pairs, including every pair-combination of the four methods.

Figure 8 shows the result of the preference test. The three adaptation methods significantly improve the converted speech quality compared with the method based on the SI-GMM. Each adaptation process alleviates the unstable sounds of the converted speech sometimes caused by the SI-GMM. This result is consistent with the objective evaluations.

Figure 8: Result of subjective evaluation in many-to-one VC (95% confidence intervals).
4. Conclusions

This paper described many-to-one voice conversion (VC) algorithms that convert an arbitrary speaker's voice into a particular target speaker's voice. We conducted an experimental evaluation of many-to-one VC algorithms, using not only the conventional methods based on the source-independent GMM (SI-GMM) and on EVC, but also two new methods based on speaker selection and on EVC with speaker adaptive training (SAT). Results of objective and subjective evaluations showed that, in many-to-one VC, the adaptation process results in a better conversion model than the SI-GMM. Moreover, the algorithm based on speaker selection worked well even with a very limited amount of adaptation data.
5. Acknowledgements This research was supported in part by MEXT’s (the Japanese Ministry of Education, Culture, Sports, Science and Technology) e-Society project.
6. References

[1] H. Kuwabara and Y. Sagisaka. Acoustic characteristics of speaker individuality: control and conversion. Speech Communication, Vol. 16, No. 2, pp. 165-173, 1995.
[2] A. Kain and M.W. Macon. Spectral voice conversion for text-to-speech synthesis. Proc. ICASSP, pp. 285-288, Seattle, USA, May 1998.
[3] K.-Y. Park and H.S. Kim. Narrowband to wideband conversion of speech using GMM based transformation. Proc. ICASSP, pp. 1843-1846, 2000.
[4] M. Nakagiri, T. Toda, H. Kashioka and K. Shikano. Improving body transmitted unvoiced speech with statistical voice conversion. Proc. INTERSPEECH 2006 - ICSLP, pp. 2270-2273, Pittsburgh, USA, Sep. 2006.
[5] M. Abe, S. Nakamura, K. Shikano and H. Kuwabara. Voice conversion through vector quantization. J. Acoust. Soc. Jpn. (E), Vol. 11, No. 2, pp. 71-76, 1990.
[6] M. Mashimo, T. Toda, H. Kawanami, K. Shikano and N. Campbell. Cross-language voice conversion using bilingual database. IPSJ Journal, Vol. 43, No. 7, pp. 2177-2185, July 2002.
[7] Y. Stylianou, O. Cappé and E. Moulines. Continuous probabilistic transform for voice conversion. IEEE Trans. on Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, Mar. 1998.
[8] T. Toda, A.W. Black and K. Tokuda. Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter. Proc. ICASSP, pp. 9-12, 2005.
[9] H. Ye and S. Young. Quality-enhanced voice morphing using maximum likelihood transformations. IEEE Trans. ASLP, Vol. 14, No. 4, pp. 1301-1312, 2006.
[10] D. Sundermann, H. Hoge, A. Bonafonte, H. Ney, A. Black and S. Narayanan. Text-independent voice conversion based on unit selection. Proc. ICASSP, pp. 81-84, 2006.
[11] A. Mouchtaris, J. Spiegel and P. Mueller. Non-parallel training for voice conversion by maximum likelihood constrained adaptation. Proc. ICASSP, pp. 1-4, May 2004.
[12] C.H. Lee and C.H. Wu. MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training. Proc. ICSLP, pp. 1164-1167, Sept. 2006.
[13] T. Toda, Y. Ohtani and K. Shikano. Eigenvoice conversion based on Gaussian mixture model. Proc. ICSLP, pp. 2446-2449, Sept. 2006.
[14] T. Toda, Y. Ohtani and K. Shikano. One-to-many and many-to-one voice conversion based on eigenvoices. Proc. ICASSP, Vol. 4, pp. 1249-1252, Apr. 2007.
[15] S. Yoshizawa, A. Baba, K. Matsunami, Y. Mera, M. Yamada and K. Shikano. Unsupervised speaker adaptation based on sufficient HMM statistics of selected speakers. Proc. ICASSP, pp. 341-344, May 2001.
[16] T. Anastasakos, J. McDonough, R. Schwartz and J. Makhoul. A compact model for speaker-adaptive training. Proc. ICSLP, Vol. 2, 1996.
[17] JNAS: Japanese Newspaper Article Sentences. http://www.milab.is.tsukuba.ac.jp/jnas/instruct.html
[18] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, Vol. 27, No. 3-4, pp. 187-207, 1999.