IEICE TRANS. INF. & SYST., VOL.E90–D, NO.2, FEBRUARY 2007

PAPER

Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

Junichi YAMAGISHI†a) and Takao KOBAYASHI†b), Members

Manuscript received March 22, 2006. Manuscript revised August 10, 2006.
†The authors are with the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama-shi, 226–8502 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
DOI: 10.1093/ietisy/e90–d.2.533

SUMMARY   In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.
key words: HMM-based speech synthesis, speaker adaptation, speaker adaptive training (SAT), hidden semi-Markov model (HSMM), maximum likelihood linear regression (MLLR), voice conversion

1. Introduction

A speech synthesis system with the ability to arbitrarily change the voice characteristics and prosodic features of synthetic speech would enable many new applications for human-computer interfaces using speech input/output. Several conversion techniques for voice characteristics have been proposed for developing this ability (e.g. [1]). Voice conversion techniques approximately transform the voice characteristics of input speech into those of a new target speaker by using examples from a small amount of target-speaker speech data, and thus might have the potential to synthesize an arbitrary speaker's voice from a realistic and desirable amount of speech data. However, most techniques do not focus on the precise conversion of prosodic features such as fundamental frequency and phone duration, although these features, as well as spectral features, affect speaker characteristics [2], [3].

On the other hand, the hidden Markov model (HMM)-based speech synthesis system proposed in [4], [5] can simultaneously transform the voice characteristics and fundamental frequency (F0) of synthetic speech into those of a target speaker by using a small amount of speech data uttered by the target speaker. In this method, the spectrum and F0 of several training speakers are simultaneously modeled within the HMM framework, and an average voice model, which models the average voice and prosodic characteristics of the training speakers, is constructed [5]. Then, using maximum likelihood linear regression (MLLR) adaptation [4], [6], the average voice model is adapted to a new target speaker based on speech data uttered by the target speaker. After the speaker adaptation, speech is synthesized in the same manner as in the speaker-dependent speech synthesis method [7], [8]. MLLR adaptation can also be used as a speaker normalization technique for the average voice model to reduce the influence of speaker differences and the acoustic variability of the spectral and F0 parameters. This training scheme for the average voice model, which is called speaker adaptive training (SAT) [9], can significantly improve the quality of the model and also the quality of the synthetic speech after model adaptation [5].

In the previous work [5], the HMM-based speaker adaptation and speaker adaptive training were conducted to transform and normalize only the state output probability distributions corresponding to the spectrum and F0 parameters of the speech data. However, each speaker has his/her own phone duration as well as his/her own spectrum and F0. To mimic the speaker's phone duration, the state duration distributions as well as the output distributions of the average voice model should be simultaneously adapted to the target speaker. Additionally, since each training speaker of the average voice model also has a distinctive phone duration, the state duration distributions have a relatively large dependence on the speakers and/or gender ratio included in the training speech database. To obtain higher performance in the speaker adaptation, we should therefore normalize this dependence on the speakers and/or gender in the state duration distributions as well as in the output distributions. Moreover, in the previous work [10], we proposed a model adaptation technique for simultaneously transforming the spectrum, F0, and duration of synthetic speech and confirmed its effectiveness in the adaptation of the speaking style of synthetic speech. The adaptation technique uses a hidden semi-Markov model (HSMM) framework [11]–[13]. The HSMM is an HMM with explicit state-duration probability distributions and enables us to conduct simultaneous adaptation of the output distributions and state duration distributions.


In this paper, we utilize the HSMM-based model adaptation technique for the speaker adaptation. Furthermore, we propose an HSMM-based speaker adaptive training algorithm as a normalization technique for speaker differences and the acoustic variability of the output and state duration distributions of the average voice model. We incorporate these HSMM-based techniques into our HSMM-based speech synthesis system using an average voice model and show their effectiveness from the results of detailed subjective and objective evaluation tests.

This paper is organized as follows. Section 2 gives an overview of the simultaneous modeling of spectrum, F0 and phone duration using the HSMM and describes the conventional training method of the HSMM for reference. Section 3 describes the HSMM-based speaker adaptation technique for simultaneously transforming spectrum, F0, and duration into those of a target speaker using a small amount of speech data uttered by the target speaker. Section 4 describes the HSMM-based speaker adaptive training for simultaneously reducing the influence of speaker- and/or gender-dependent characteristics of the spectral, F0 and phone duration parameters and constructing an average voice model appropriate for the adaptation. Experimental conditions and the results of subjective and objective experiments are described in Sect. 5. Section 6 summarizes our findings.

2. Simultaneous Modeling of Spectrum, F0 and Phone Duration Based on Hidden Semi-Markov Model

In speaker adaptation for speech synthesis, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration, as described above. Therefore, we utilize the framework of the HSMM [12], which is an HMM having explicit state duration distributions instead of transition probabilities, to directly model and control phone durations (see Figs. 1 and 2).†

† It is not straightforward to adapt the state duration in the HMM framework because the original HMM does not have explicit duration distributions. For a comparison of the HMM-based and the HSMM-based duration adaptation algorithms, refer to [10], [14].

Fig. 1: Hidden Markov model.
Fig. 2: Hidden semi-Markov model.

An N-state left-to-right HSMM λ with no skip paths is specified by state output probability distributions {b_i(·)}_{i=1}^N and state duration probability distributions {p_i(·)}_{i=1}^N. We assume that the i-th state output and duration distributions are Gaussian distributions characterized by a mean vector µ_i ∈ R^L with diagonal covariance matrix Σ_i ∈ R^{L×L}, and by a scalar mean m_i and variance σ_i^2, respectively; i.e.,

    b_i(o) = \mathcal{N}(o; \mu_i, \Sigma_i)    (1)

    p_i(d) = \mathcal{N}(d; m_i, \sigma_i^2)    (2)

where o ∈ R^L is an observation vector and d is the duration in state i. The observation probability of the training data O = (o_1, ..., o_T) of length T, given the model λ, can be written as

    P(O|\lambda) = \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \sum_{d=1}^{t} \alpha_{t-d}(j)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)    (3)

where t ∈ [1, T], and α_t(i) and β_t(i) are the forward and backward probabilities defined by

    \alpha_t(i) = \sum_{d=1}^{t} \sum_{j=1, j \neq i}^{N} \alpha_{t-d}(j)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)    (4)

    \beta_t(i) = \sum_{d=1}^{T-t} \sum_{j=1, j \neq i}^{N} p_j(d) \prod_{s=t+1}^{t+d} b_j(o_s)\, \beta_{t+d}(j)    (5)

where α_0(i) = 1 and β_T(i) = 1. The conventional speaker-independent training of the parameter set λ based on the maximum likelihood (ML) criterion can be formulated as follows:

    \tilde{\lambda} = \arg\max_{\lambda} P(O|\lambda).    (6)
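As an illustration of the recursions above, the following is a minimal NumPy sketch of the forward probabilities of Eq. (4) for Gaussian output and duration distributions (Eqs. (1) and (2)). This is not the authors' implementation; variable names and the array layout are illustrative assumptions, and a practical system would work in the log domain and cap the duration d at a sensible maximum.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def hsmm_forward(obs, means, covs, dur_means, dur_vars):
    """obs: (T, L) observations; means/covs: per-state output Gaussians (Eq. (1)),
    covs given as (L, L) matrices; dur_means/dur_vars: Gaussian duration pdfs (Eq. (2))."""
    T, N = len(obs), len(means)
    # b[t, i] = b_i(o_{t+1}) with 0-based frame index t
    b = np.array([[multivariate_normal.pdf(obs[t], mean=means[i], cov=covs[i])
                   for i in range(N)] for t in range(T)])
    p = lambda i, d: norm.pdf(d, loc=dur_means[i], scale=np.sqrt(dur_vars[i]))
    alpha = np.zeros((T + 1, N))
    alpha[0, :] = 1.0                                  # alpha_0(i) = 1
    for t in range(1, T + 1):
        for i in range(N):
            for d in range(1, t + 1):
                emit = np.prod(b[t - d:t, i])          # prod_{s=t-d+1}^{t} b_i(o_s)
                inc = sum(alpha[t - d, j] for j in range(N) if j != i)
                alpha[t, i] += inc * p(i, d) * emit
    return alpha, b
```

The backward probabilities of Eq. (5) follow the same pattern run in reverse; together they give the occupancy posterior of Eq. (11) used in the re-estimation formulas below.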

Re-estimation formulas based on the Baum-Welch algorithm for the parameter set λ are given by

    \mu_i = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)\, d}    (7)

    \Sigma_i = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i) \sum_{s=t-d+1}^{t} (o_s - \mu_i)(o_s - \mu_i)^\top}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)\, d}    (8)

    m_i = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)\, d}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)}    (9)

    \sigma_i^2 = \frac{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)\, (d - m_i)^2}{\sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(i)}    (10)

where ⊤ denotes matrix transpose, and γ_t^d(i) is the probability of being in state i during the period from t−d+1 to t, given O, defined as

    \gamma_t^d(i) = \frac{1}{P(O|\lambda)} \sum_{j=1, j \neq i}^{N} \alpha_{t-d}(j)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i).    (11)
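A small sketch of how the occupancy posterior γ_t^d(i) of Eq. (11) could be computed once the forward and backward probabilities are available (for example, from the forward sketch above and an analogous backward pass). The array layout, including the precomputed duration-likelihood table, is an assumption of this sketch rather than something specified in the paper.

```python
import numpy as np

def occupancy_posterior(alpha, beta, b, p, prob_O):
    """alpha, beta: (T+1, N) forward/backward probabilities (Eqs. (4)-(5));
    b: (T, N) output likelihoods b_i(o_t); p: (N, Dmax+1) duration likelihoods;
    prob_O: P(O|lambda).  Returns gamma[t, d, i] = gamma_t^d(i) of Eq. (11)."""
    T, N = b.shape
    dmax = p.shape[1] - 1
    gamma = np.zeros((T + 1, dmax + 1, N))
    for t in range(1, T + 1):
        for i in range(N):
            for d in range(1, min(t, dmax) + 1):
                emit = np.prod(b[t - d:t, i])            # prod_{s=t-d+1}^{t} b_i(o_s)
                inc = sum(alpha[t - d, j] for j in range(N) if j != i)
                gamma[t, d, i] = inc * p[i, d] * emit * beta[t, i] / prob_O
    return gamma
```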

3. Maximum Likelihood Linear Regression Based on Hidden Semi-Markov Model

First, we apply an HSMM-based MLLR adaptation [10], [15] to transform the state output and duration distributions simultaneously. In the HSMM-based MLLR adaptation, the mean vectors of the state output and duration distributions for the target speaker are obtained by linearly transforming the mean vectors of the state output and duration distributions of the average voice model (Fig. 3) as follows:

    b_i(o) = \mathcal{N}(o; \zeta \mu_i + \epsilon, \Sigma_i) = \mathcal{N}(o; W \xi_i, \Sigma_i)    (12)

    p_i(d) = \mathcal{N}(d; \chi m_i + \nu, \sigma_i^2) = \mathcal{N}(d; X \phi_i, \sigma_i^2)    (13)

where µ_i and m_i are the respective mean vectors of the state output and duration distributions of the average voice model. W = [ζ, ε] ∈ R^{L×(L+1)} and X = [χ, ν] ∈ R^{1×2} are the transformation matrices, which transform the extended mean vectors ξ_i = [µ_i^⊤, 1]^⊤ ∈ R^{L+1} and φ_i = [m_i, 1]^⊤ ∈ R^2, respectively. Note that ζ and ε are an L × L matrix and an L-dimensional vector, respectively, and that both χ and ν are scalars.

Fig. 3: HSMM-based MLLR adaptation.

The HSMM-based MLLR adaptation estimates the set of transformation matrices Λ = (W, X) so that the likelihood of the adaptation data O of length T is maximized. The problem of the HSMM-based MLLR adaptation based on the ML criterion can be expressed as follows:

    \tilde{\Lambda} = (\tilde{W}, \tilde{X}) = \arg\max_{\Lambda} P(O|\lambda, \Lambda)    (14)

where λ is the parameter set of the HSMM. Re-estimation formulas based on the Baum-Welch algorithm for the transformation matrices Λ can be derived as follows:

    w_l = y_l G_l^{-1}    (15)

    X = z K^{-1}    (16)

where w_l ∈ R^{L+1} is the l-th row vector of W. In these equations, y_l ∈ R^{L+1}, G_l ∈ R^{(L+1)×(L+1)}, z ∈ R^2, and K ∈ R^{2×2} are given by

    y_l = \sum_{r=1}^{R_b} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r) \frac{1}{\Sigma_r(l)} \sum_{s=t-d+1}^{t} o_s(l)\, \xi_r^\top    (17)

    G_l = \sum_{r=1}^{R_b} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r)\, d\, \frac{1}{\Sigma_r(l)}\, \xi_r \xi_r^\top    (18)

    z = \sum_{r=1}^{R_p} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r) \frac{1}{\sigma_r^2}\, d\, \phi_r^\top    (19)

    K = \sum_{r=1}^{R_p} \sum_{t=1}^{T} \sum_{d=1}^{t} \gamma_t^d(r) \frac{1}{\sigma_r^2}\, \phi_r \phi_r^\top    (20)

where Σ_r(l) is the l-th diagonal element of the diagonal covariance matrix Σ_r, and o_s(l) is the l-th element of the observation vector o_s.

It is not always possible to estimate W and X for every distribution, because the amount of adaptation data from the target speaker is limited. Therefore, we use tree structures to group the distributions in the model and to tie the transformation matrices within each group, in the same manner as the HMM-based techniques [16]. In Eqs. (17)-(20), R_b and R_p are the numbers of state output and duration distributions belonging to the group, respectively. When 1 ≤ R_b < L and/or R_p = 1, we need to use generalized inverses based on singular value decomposition, since the matrices G_l and/or K become rank-deficient.
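To make the accumulation of Eqs. (19)-(20) and the solve of Eq. (16) concrete, here is a minimal sketch for the duration transform X = [χ, ν] of one regression class. It assumes the occupancy posteriors γ_t^d(r) have already been computed (Eq. (11)) and stored in the layout described below; the output transform W follows the same pattern row by row via Eqs. (15), (17) and (18).

```python
import numpy as np

def estimate_duration_transform(gammas, dur_means, dur_vars):
    """gammas: list over duration distributions r in the class, each an array with
    gammas[r][t, d] = gamma_t^d(r); dur_means/dur_vars: their m_r and sigma_r^2."""
    z = np.zeros(2)
    K = np.zeros((2, 2))
    for r, g in enumerate(gammas):
        phi = np.array([dur_means[r], 1.0])                  # extended mean vector phi_r
        inv_var = 1.0 / dur_vars[r]
        T, dmax = g.shape[0] - 1, g.shape[1] - 1
        for t in range(1, T + 1):
            for d in range(1, min(t, dmax) + 1):
                z += g[t, d] * inv_var * d * phi             # Eq. (19)
                K += g[t, d] * inv_var * np.outer(phi, phi)  # Eq. (20)
    # pseudo-inverse covers the rank-deficient case (R_p = 1) noted above
    return z @ np.linalg.pinv(K)                             # Eq. (16): X = [chi, nu]
```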

4. Speaker Adaptive Training Based on Hidden Semi-Markov Model

To obtain higher performance in the HSMM-based speaker adaptation to a wide variety of target speakers, we need to construct an initial model appropriate for the adaptation. We train an average voice model as the initial model for the adaptation from training data consisting of several speakers' speech. However, the training data of the average voice model includes a lot of speaker- and/or gender-dependent characteristics. To construct an appropriate average voice model from such training data, we should deal with the adverse effects caused by the speaker- and/or gender-dependent characteristics when we estimate the model parameters of the state output and duration distributions. Hence, we propose an HSMM-based speaker adaptive training algorithm [17] for normalizing the influence of speaker differences among the training speakers in the state output and duration distributions. In the speaker adaptive training algorithm (Fig. 4), the speaker difference between a training speaker's voice and the canonical average voice is assumed to be expressed as a simple linear regression of the mean vectors of the state output and duration distributions,

    \mu_i^{(f)} = \zeta^{(f)} \mu_i + \epsilon^{(f)} = W^{(f)} \xi_i    (21)

    m_i^{(f)} = \chi^{(f)} m_i + \nu^{(f)} = X^{(f)} \phi_i    (22)

where µ_i^{(f)} and m_i^{(f)} are respectively the mean vectors of the state output and duration distributions for training speaker f. W^{(f)} = [ζ^{(f)}, ε^{(f)}] and X^{(f)} = [χ^{(f)}, ν^{(f)}] are transformation matrices which represent the speaker difference between training speaker f and the average voice in the state output and duration distributions, respectively. After estimating the transformation matrices for each training speaker, a canonical/average voice model is estimated so that each training speaker's model, obtained by transforming the canonical model with these matrices, maximizes the likelihood of that speaker's training data.

Let F be the total number of training speakers, O = {O^{(1)}, ..., O^{(F)}} be all the training data, and O^{(f)} = (o_{1f}, ..., o_{T_f f}) be the training data of length T_f for speaker f. The HSMM-based speaker adaptive training simultaneously estimates the parameter set λ of the HSMM and the set of transformation matrices Λ^{(f)} = (W^{(f)}, X^{(f)}) for each training speaker so that the likelihood of O is maximized. The problem of the HSMM-based speaker adaptive training based on the ML criterion can be formulated as follows:

    (\tilde{\lambda}, \tilde{\Lambda}) = \arg\max_{\lambda, \Lambda} P(O|\lambda, \Lambda) = \arg\max_{\lambda, \Lambda} \prod_{f=1}^{F} P(O^{(f)}|\lambda, \Lambda^{(f)})    (23)

where Λ = (Λ^{(1)}, ..., Λ^{(F)}) is the set of the transformation matrices for the training speakers. Here, we use a three-step iterative procedure to update the parameters [9]. First, we estimate Λ while keeping λ fixed to the current values; the re-estimation formulas based on the Baum-Welch algorithm for the parameter set Λ are identical to Eqs. (15) and (16). We then estimate the mean vectors of λ by using the updated transformation matrices while keeping the covariance matrices of λ fixed to the current values. Finally, the covariance matrices of λ are estimated using the updated transformation matrices and the updated mean vectors. The re-estimation formulas for the parameter set λ are

    \mu_i = \left[ \sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i)\, d\, \zeta^{(f)\top} \Sigma_i^{-1} \zeta^{(f)} \right]^{-1} \left[ \sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i)\, \zeta^{(f)\top} \Sigma_i^{-1} \sum_{s=t-d+1}^{t} \left( o_{sf} - \epsilon^{(f)} \right) \right]    (24)

    \Sigma_i = \frac{\sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i) \sum_{s=t-d+1}^{t} (o_{sf} - \mu_i^{(f)})(o_{sf} - \mu_i^{(f)})^\top}{\sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i)\, d}    (25)

    m_i = \frac{\sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i)\, \chi^{(f)} (d - \nu^{(f)})}{\sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i)\, \chi^{(f)2}}    (26)

    \sigma_i^2 = \frac{\sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i)\, (d - m_i^{(f)})^2}{\sum_{f=1}^{F} \sum_{t=1}^{T_f} \sum_{d=1}^{t} \gamma_t^d(i)}    (27)

where µ_i^{(f)} = ζ^{(f)} µ_i + ε^{(f)} and m_i^{(f)} = χ^{(f)} m_i + ν^{(f)} are the mean vectors transformed into those of training speaker f using the updated mean vectors and the updated transformation matrices W^{(f)} = [ζ^{(f)}, ε^{(f)}] and X^{(f)} = [χ^{(f)}, ν^{(f)}]. It is straightforward to expand the above global linear regression to a piecewise linear regression using multiple transformation matrices, in the same manner as the HMM-based techniques [16].

Fig. 4: HSMM-based speaker adaptive training.
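The following is a minimal sketch of the SAT updates for the duration mean and variance of a single state (Eqs. (26) and (27)), assuming the per-speaker duration transforms X^{(f)} = [χ^{(f)}, ν^{(f)}] and the occupancy statistics γ_t^d(i) are already available. It is only an illustration of the accumulation pattern; the spectral updates of Eqs. (24) and (25) follow the same structure with matrix quantities.

```python
import numpy as np

def sat_update_duration(occ, transforms):
    """occ: list over training speakers f, each an array with occ[f][t, d] = gamma_t^d(i);
    transforms: list of (chi_f, nu_f) pairs.  Returns updated (m_i, sigma_i^2)."""
    num_m = den_m = 0.0
    for f, g in enumerate(occ):
        chi, nu = transforms[f]
        for t in range(1, g.shape[0]):
            for d in range(1, min(t, g.shape[1] - 1) + 1):
                num_m += g[t, d] * chi * (d - nu)          # Eq. (26) numerator
                den_m += g[t, d] * chi ** 2                # Eq. (26) denominator
    m_i = num_m / den_m                                    # updated mean, then variance
    num_v = den_v = 0.0
    for f, g in enumerate(occ):
        chi, nu = transforms[f]
        m_if = chi * m_i + nu                              # speaker-transformed mean
        for t in range(1, g.shape[0]):
            for d in range(1, min(t, g.shape[1] - 1) + 1):
                num_v += g[t, d] * (d - m_if) ** 2         # Eq. (27) numerator
                den_v += g[t, d]
    return m_i, num_v / den_v
```

Note that, as in the three-step procedure described above, the mean is updated first and the variance is then re-estimated with the speaker-transformed versions of that updated mean.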


5. Experiments

5.1 Experimental Conditions

To show the effectiveness of the simultaneous model adaptation and adaptive training algorithm for spectrum, F0 and duration, we conducted several objective and subjective evaluation tests. We used the ATR Japanese speech database (Set B), which contains a set of 503 phonetically balanced sentences uttered by 6 male speakers (MHO, MHT, MMY, MSH, MTK, MYI) and 4 female speakers (FKN, FKS, FTK, FYM), and a speech database containing the same sentences uttered by an additional female speaker (FTY). Figure 5 shows the average values of the logarithm of F0 and mora/sec of each speaker. For the calculation of mora/sec, we used the manually labeled durations of the utterances of each speaker. We chose the male speaker MTK and the female speaker FTK as target speakers for the speaker adaptation and used the rest as training speakers for the average voice model.

Fig. 5: Distribution of average log F0 and mora/sec of each speaker.

In the modeling of the synthesis units, we used 42 phonemes, including silence and pause, and took the phonetic and linguistic contexts [5] into account. Speech signals were sampled at a rate of 16 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. The feature vectors consisted of 25 mel-cepstral coefficients [18], [19] including the zeroth coefficient, log F0, and their delta and delta-delta coefficients. We used 5-state left-to-right context-dependent HSMMs without skip paths.
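As a rough illustration of how such an observation vector could be assembled, the sketch below appends delta and delta-delta coefficients to the static mel-cepstra, giving the 75-dimensional spectral part. The simple ±1-frame regression windows are an assumption for illustration; the paper does not state the exact window coefficients, and the log F0 stream would need additional handling of unvoiced frames, which this sketch omits.

```python
import numpy as np

def add_dynamic_features(c):
    """c: (T, 25) static mel-cepstra -> (T, 75) [static, delta, delta-delta]."""
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")         # repeat edge frames
    delta = 0.5 * (padded[2:] - padded[:-2])                  # first-order regression
    delta2 = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]    # second-order regression
    return np.concatenate([c, delta, delta2], axis=1)
```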

The basic structure of the HSMM-based speech synthesis system is the same as that of the HMM-based speech synthesis system [5], except that HSMMs are used at all stages instead of HMMs. In the system, gender-dependent average voice models were trained using 453 sentences per training speaker; the total numbers of training sentences were 2265 for the male-speaker average voice model and 1812 for the female-speaker average voice model. In the training stage of the average voice models, the shared-decision-tree-based context clustering algorithm [5] using the minimum description length (MDL) criterion and the speaker adaptive training described in Sect. 4 were applied to normalize the influence of speaker differences among the training speakers. The numbers of leaf nodes of the shared decision trees for the male-speaker average voice model are 2005, 4886, and 2301 for the spectrum, F0, and phone duration parts, respectively; those for the female-speaker average voice model are 1663, 4642, and 2229. We then adapted the average voice model to the target speaker. In the MLLR adaptation and the speaker adaptive training, multiple transformation matrices were estimated based on the shared decision trees constructed in the training stage of the average voice models. The threshold that specifies the expected number of speech samples used for each transformation matrix was determined from preliminary objective experimental results. The transformation matrices were block-diagonal with three blocks corresponding to the static, delta, and delta-delta coefficients.

5.2 Objective Evaluation of Simultaneous Speaker Adaptive Training

First, we evaluated the speaker adaptive training for normalizing the influence of speaker differences among the training speakers in both the state output and duration distributions, as described in Sect. 4. For comparison, we also trained three kinds of average voice models using the conventional speaker-independent training described in Sect. 2 for the state output and/or duration distributions. The speaker-independent training treats the training data, which consists of several speakers' speech, as that of one speaker and makes no distinction among the training speakers of the average voice model. Note that the same topology and number of distributions based on the shared decision trees were used for the speaker-independent training and the speaker adaptive training. We calculated the likelihood of the average voice models, given the adaptation data for the target speaker, as an objective measure of the speaker adaptive training: if the speaker differences included in the average voice model are normalized appropriately, the likelihood for the unknown target speaker increases. The number of adaptation sentences ranged from 5 to 450.

Figure 6 shows the likelihood for the male speaker MTK. In the figure, "None" represents the results for the average voice model using the conventional speaker-independent training for both the state output and duration distributions. "Output" and "Duration" represent the results for the average voice models using the speaker adaptive training for only the state output or only the duration distributions, respectively. "Both" represents the results for the average voice model using the speaker adaptive training for both the state output and duration distributions. From the figure, we can see that the likelihood of the proposed method is higher than that of the conventional speaker-independent training and that of the speaker adaptive training for only the state output or duration distributions. This is because the HSMM-based speaker adaptive training algorithm can reduce the influence of speaker differences in both the output and state duration distributions during the re-estimation process, and can suppress inappropriate transformations in the speaker adaptation.


Fig. 6: Effect of speaker-adaptive training of the output and duration distributions. Target speaker is male speaker MTK.
Fig. 7: Average log F0 and mora/sec of target speakers' speech and of synthetic speech generated from the adapted model using 10 sentences.
Fig. 8: Number of leaf nodes of decision trees for mel-cepstrum.
Fig. 9: Number of leaf nodes of decision trees for log F0.

5.3 Objective Evaluation of the HSMM-Based MLLR Adaptation

Next, we calculated the average log F0 and mora/sec of the synthetic speech generated from the model adapted with 10 adaptation sentences. Fifty test sentences, included in neither the training nor the adaptation data, were used for the evaluation. The average log F0 and mora/sec of the target speakers' speech and of the synthetic speech generated from the adapted model are shown in Fig. 7; those of the average voice, that is, synthetic speech generated from the average voice model, are shown for reference. From the figure, we can see that the average log F0 and mora/sec of the synthetic speech generated from the adapted model are close to those of the target speakers' speech.

Next, we calculated the target speakers' average mel-cepstral distance and the root-mean-square (RMS) errors of log F0 and vowel duration as objective measures. The amount of adaptation data ranged from 5 to 450 sentences. Fifty test sentences, again included in neither the training nor the adaptation data, were used for the evaluation. For the calculation of the average mel-cepstral distance and the RMS error of log F0, the state durations of each HSMM were adjusted after Viterbi alignment with the target speaker's real utterance. For the calculation of the RMS error of vowel duration, we used the manually labeled durations of the target speakers' real utterances as the target vowel durations.

Fig. 10: Number of leaf nodes of decision trees for duration.
Fig. 11: Average mel-cepstral distance of male speaker MTK.
Fig. 12: RMS log F0 error of male speaker MTK.
Fig. 13: RMS error of vowel duration of male speaker MTK.
Fig. 14: Average mel-cepstral distance of female speaker FTK.
Fig. 15: RMS log F0 error of female speaker FTK.
Fig. 16: RMS error of vowel duration of female speaker FTK.

Figures 11 and 14 show the target speakers' average mel-cepstral distance between spectra generated from the adapted model (MLLR) and spectra obtained by analyzing the target speakers' real utterances. For reference, we also show the average distance for spectra generated from a speaker-dependent (SD) model [8] trained on the adaptation data. Figures 8, 9 and 10 show the numbers of leaf nodes of the decision trees of the SD models for the spectrum, F0, and phone duration parts, respectively; for reference, those of the male- and female-speaker average voice models are also shown in these figures. The number of leaf nodes corresponds to the number of Gaussian distributions used in the model. Although the average voice models have many more Gaussian distributions than the SD models, as shown in these figures, the Gaussian distributions of the average voice models are estimated from the speech data of the training speakers, and the amount of target-speaker speech data used with the average voice models is the same as that used for each SD model. Figures 12 and 15 show the RMS log F0 error between the F0 patterns of synthetic and real speech. Figures 13 and 16 show the RMS error between the generated durations and those of the real utterances. Silence and pause regions were eliminated from the mel-cepstral distance calculation. Since F0 is not observed in unvoiced regions, the RMS log F0 error was calculated over the regions where both the generated and the real F0 were voiced.

From these figures, we can see that all features of the synthetic speech generated from the adapted model become closer to the target speakers' features than those of the average voice. The adapted model significantly outperforms the speaker-dependent model, especially when the available data is limited. We can also see that the adapted model gives results comparable to or slightly better than the speaker-dependent model even when sufficient adaptation data is available. Comparing the adaptation of the F0 and spectral parameters, one sees that just a few adaptation sentences give good results for the F0 parameter, whereas about 50 to 150 sentences are needed to obtain good results for the spectral parameters. This is due to the different numbers of parameters for these features: in these experiments we used 75-dimensional spectral parameters including delta and delta-delta parameters, so the transformation matrix for the spectral parameters has many more parameters than the transformation matrix for the one-dimensional F0 parameter. As a result, when the available adaptation data is limited, the estimation accuracy of the spectral transformation matrix decreases compared with that of the transformation matrices for the F0 and duration parameters. In the adaptation of the duration parameters, on the other hand, the required number of adaptation sentences varies with the target speaker.

As preliminary experiments, we also compared synthetic speech generated from the models with and without the HSMM-based speaker adaptive training using the same objective evaluation methods. From the results, we confirmed that the speaker adaptive training leads to a certain improvement in the mel-cepstral distance for both target speakers. We also confirmed the advantage of the speaker adaptive training in the RMS errors of log F0 and vowel duration, although the amount of the improvement depended on the target speaker.
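The following sketch shows how the objective measures used above could be computed from time-aligned features, with silence and pause frames assumed to have been removed beforehand. The dB scaling of the mel-cepstral distance and the exclusion of the zeroth cepstral coefficient are common conventions and are assumptions here, not details stated in the paper.

```python
import numpy as np

def mel_cepstral_distance(c_gen, c_ref):
    """c_gen, c_ref: (T, 25) aligned mel-cepstra; returns mean distance in dB."""
    diff = c_gen[:, 1:] - c_ref[:, 1:]                    # drop zeroth coefficient
    return np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def rms_log_f0_error(f0_gen, f0_ref):
    """f0_gen, f0_ref: (T,) F0 in Hz, 0 for unvoiced frames; uses doubly voiced frames."""
    voiced = (f0_gen > 0) & (f0_ref > 0)
    err = np.log(f0_gen[voiced]) - np.log(f0_ref[voiced])
    return np.sqrt(np.mean(err ** 2))
```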

5.4 Objective Evaluation of the HSMM-Based MLLR Adaptation and Warping Method

Since the state duration distributions of the average voice model are Gaussian pdfs, we may simply control the speaking rate of synthetic speech by using the conventional duration warping method [20], which we briefly review here. When the state duration probability density is modeled by a single Gaussian pdf, the speaking rate of synthetic speech can be controlled via a control parameter ρ. The controlled duration of state k is defined by

    d_k = m_k + \rho \cdot \sigma_k^2    (28)

where m_k and σ_k^2 are the mean and variance of the duration distribution of state k, respectively. When the control parameter ρ is set to zero, the speaking rate of the synthetic speech becomes the average rate; when ρ is set to a negative or positive value, the speaking rate becomes faster or slower, respectively. When the total frame length T of the synthetic speech is known, the optimal value of the control parameter ρ in the sense of the state duration probability is given by

    \rho = \left( T - \sum_{k=1}^{K} m_k \right) \Big/ \sum_{k=1}^{K} \sigma_k^2    (29)

where K is the total number of states visited during the T frames. Note that the state durations are not made equally shorter or longer, because the variability of a state duration depends on the variance of its duration density.

In this section, we compare the duration adaptation method using the HSMM-based MLLR adaptation with the speaking rate control method using the above duration warping. We estimated an average of the control parameters, ρ̄, from the given adaptation data and controlled the speaking rate of the synthetic speech for the test sentences by using ρ̄. In the MLLR adaptation method, on the other hand, ρ was set to zero. The evaluation method and the other experimental conditions were the same as in the objective evaluation of duration described in Sect. 5.3.

Figures 17 and 18 show the RMS error of vowel duration between the generated durations and those of the real utterances. From these figures, we can see that the duration adaptation method significantly outperforms the conventional duration warping method. The results indicate that it is not enough to control only the speaking rate of synthetic speech; it is essential to adapt and transform the duration model to reproduce the speaker characteristics of the target speaker.

where K is the total number of states visited during T frames. Note that the state durations are not made equally shorter or longer because variability of a state duration depends on the variance of the state duration density. In this section, we compared the duration adaptation method using the HSMM-based MLLR adaptation and the speaking rate control method using the above duration warping method. We estimated an average of the control parameters ρ¯ from the given adaptation data, and controlled the speaking rate of synthetic speech for the test sentences by using ρ. ¯ On the other hand, in the MLLR adaptation method, ρ is set to zero. Evaluation method and other experimental conditions were same as the objective evaluation of duration described in Sect. 5.3. Figures 17 and 18 show the RMS error of vowel duration between the generated duration and that of the real utterance. From these figures, we can see that the duration adaptation method significantly outperforms the conventional duration warping method. The results indicate that it is not enough to control only the speaking rate of synthetic speech and that it is essential to adapt and transform the duration model to reproduce the speaker characteristics of the target speaker. 5.5 Subjective Evaluation of the Number of Adaptation Data We then conducted a comparison category rating (CCR) test to assess the effect of the number of the adaptation data. We compared the synthesized speech generated from the adapted models using 5, 10, 20, 50, 100, 150, 250, 350, and 450 sentences of the target speaker. Eight subjects were first presented with the reference speech sample and then with synthesized speech samples generated from the adapted models in random order. The subjects were asked to rate their voice characteristics and prosodic features compared with those of the reference speech. The reference speech was synthesized with a mel-cepstral vocoder. The rating was done on a 5-point scale, that is, 5 for very similar, 4 for similar, 3 for slightly similar, 2 for dissimilar, and 1

YAMAGISHI and KOBAYASHI: AVERAGE-VOICE-BASED SPEECH SYNTHESIS

541

Fig. 17

RMS error of vowel duration of male speaker MTK.

Fig. 19 Subjective Evaluation of the effect of the number of the adaptation data.

Fig. 20 Fig. 18

Subjective evaluation of adaptation effects of each feature.

RMS error of vowel duration of female speaker FTK.

for very dissimilar. For each subject, five test sentences were randomly chosen from a set of 50 test sentences, which were contained in neither the training nor the adaptation data. Figures 19 shows the average score. A 95% confidence interval is also shown in the figure. From this figure, one can see that about 50 to 100 sentences are needed to represent the target speaker model and synthesize speech appropriately. This result corresponds to the objective evaluation results in Sect. 5.3. We therefore utilized the 100 sentences of the target speakers as the adaptation data in the following subjective evaluations. 5.6 Subjective Evaluation of the HSMM-Based MLLR Adaptation Next, we conducted a comparison category rating (CCR) test to assess the effectiveness of the transformations of spectral and prosodic features of synthesized speech. We compared the synthesized speech generated from eight models with or without the adaptation of spectrum, F0, and/or duration. The adaptation data comprised 100 sentences. For reference, we also compared synthesized speech generated from the SD model using 453 sentences of the target speaker. Other experimental conditions were same as the CCR test described in Sect. 5.5. Figure 20 shows the average values of the CCR tests

for the male speaker MTK and the female speaker FTK. A 95% confidence interval is also shown. The values indicate that it is not enough to adapt only voice characteristics or prosodic features and that it is essential to simultaneously adapt both voice characteristics and prosodic features to reproduce the speaker characteristics of the target speaker. The result for the simultaneous adaptation of voice characteristics and prosodic features shows that a combination of spectrum and F0 has a powerful effect on reproduction of the speaker characteristics of the synthesized speech, and furthermore, the synthetic speech generated from the model using the simultaneous adaptation of all features has the most similar speaker characteristics to the those of the target speaker. It is interesting that the synthesized speech generated from the model using the simultaneous adaptation of all features using 100 target-speaker sentences is rated as having more similar speaker characteristics to the target speaker compared with the speaker-dependent models using 453 sentences of the target speaker. One reason is that the F0 parameter generated from the adapted model is more similar to the target speaker’s than that of the speaker-dependent model, as can be seen in Figs. 12 and 15. Another reason is that the average voice model can utilize a large variety of contextual information included in the several speakers’ speech database as a priori information for the speaker adaptation and provide a robust basis for synthesizing the speech of the new target speaker. As a result, synthetic speech with

IEICE TRANS. INF. & SYST., VOL.E90–D, NO.2 FEBRUARY 2007

542

Fig. 21 Subjective evaluation of simultaneous speaker adaptation method.

similar speaker characteristics to those of the target speaker can be robustly obtained even when there are few speech samples available for the target speaker. In fact, the speech database for the average voice model has about three times as many contextual speech units as in the target speaker’s speech database. Thus, we can say that the context-rich speech database and the proposed normalization technique for speaker characteristics yields a good initial model for the speaker adaptation. Finally, we conducted an ABX comparison test to assess and directly compare the effectiveness of the transformations of spectral and prosodic features of synthesized speech. We compared the synthesized speech generated from the adapted models using adaptation of spectrum, a combination of spectrum and F0, and all features. In the ABX test, A and B were a pair of synthesized speech samples generated from the two models randomly chosen from the above combinations, and X was the reference speech. The reference speech was synthesized by a mel-cepstral vocoder. Eight subjects were presented synthesized speech in the order of A, B, X or B, A, X, and asked to select the first or second speech samples as being similar to X. For each subject, five test sentences were randomly chosen from the same set of test sentences. Figure 21 shows the average score of the ABX test for the male speaker MTK and the female speaker FTK. A 95% confidence interval is also shown in the figure. The result firstly confirms that the synthetic speech generated from the model using the simultaneous adaptation of spectrum, F0, and duration has the highest similarity to the target speaker again. It is interesting to note that in this experiments, the duration adaptation is perceived to be as effective as the F0 adaptation unlike the result in Sect. 5.6. Duration of synthetic speech seems to become essential for the comparison and judgment of the similarity of among synthetic speech. 6. Conclusions In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. To achieve simultaneous adaptation of spectrum, F0 and phone duration, we have applied an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we have proposed the HSMMbased adaptive training algorithm to normalize the state output and state duration distributions of the average voice

model simultaneously. We incorporated these HSMMbased techniques into our HSMM-based speech synthesis system, and the results of subjective and objective evaluation tests show the effectiveness of the algorithm for simultaneous model adaptation and adaptive training of spectrum, F0, and duration. An issue of the HSMM-based adaptation technique for state duration distribution is that it might be adapted to a negative mean value of the state duration distribution because it assumes that this distribution is Gaussian. Although the adaptation technique works effectively, the state duration distribution is defined in the positive area, and thus we need to assume a distribution defined in the positive area, such as lognormal, gamma, or Poisson distribution, for more rigorous modeling and adaptation. Our future work will focus on developing adaptation algorithms for the exponential family of distributions, which includes Gaussian, lognormal, gamma, and Poisson distribution, using a generalized linear regression model [21]. Acknowledgments The authors would like to thank Prof. Keiichi Tokuda of the Nagoya Institute of Technology and Dr. Takashi Masuko of the Toshiba Corporation for their valuable discussions with us. A part of this work was supported by JSPS Grant-inAid for Scientific Research (B) 15300055, MEXT Grantin-Aid for Exploratory Research 17650046, and JSPS Research Fellowships for Young Scientists 164633. References [1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conversion through vector quantization,” J. Acoust. Soc. Jpn. (E), vol.11, no.2, pp.71–76, 1990. [2] K. Ito and S. Saito, “Effects of acoustical feature parameters of speech on perceptual identification of speaker,” IEICE Trans. Fundamentals (Japanese Edition), vol.J65-A, no.1, pp.101–108, Jan. 1982. [3] N. Higuchi and M. Hashimoto, “Analysis of acoustic features affecting speaker identification,” Proc. EUROSPEECH-95, pp.435–438, Sept. 1995. [4] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, “Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR,” Proc. ICASSP 2001, pp.805–808, May 2001. [5] J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, “A training method of average voice model for HMM-based speech synthesis,” IEICE Trans. Fundamentals, vol.E86-A, no.8, pp.1956– 1963, Aug. 2003. [6] C. Leggetter and P. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Comput. Speech Lang., vol.9, no.2, pp.171–185, 1995. [7] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” Proc. EUROSPEECH-99, pp.2374–2350, Sept. 1999. [8] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Hidden semi-Markov model based speech synthesis,” Proc. ICSLP 2004, pp.1393–1396, Oct. 2004. [9] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact model for speaker-adaptive training,” Proc. ICSLP-96, pp.1137–1140, Oct. 1996.

YAMAGISHI and KOBAYASHI: AVERAGE-VOICE-BASED SPEECH SYNTHESIS

543

[10] M. Tachibana, J. Yamagishi, T. Masuko, and T. Kobayashi, “A style adaptation technique for speech synthesis using HSMM and suprasegmental features,” IEICE Trans. Inf. & Syst., vol.E89-D, no.3, pp.1092–1099, March 2006. [11] J. Ferguson, “Variable duration models for speech,” Symp. on the Application of Hidden Markov Models to Text and Speech, pp.143– 179, 1980. [12] S. Levinson, “Continuously variable duration hidden Markov models for automatic speech recognition,” Comput. Speech Lang., vol.1, no.1, pp.29–45, 1986. [13] M. Russell and R. Moore, “Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition,” Proc. ICASSP-85, pp.5–8, March 1985. [14] M. Tachibana, J. Yamagishi, T. Masuko, and T. Kobayashi, “Performance evaluation of style adaptation for hidden semi-Markov model,” Proc. EUROSPEECH 2005, pp.2805–2808, Sept. 2005. [15] J. Yamagishi, T. Masuko, and T. Kobayashi, “MLLR adaptation for hidden semi-Markov model based speech synthesis,” Proc. ICSLP 2004, pp.1213–1216, Oct. 2004. [16] J. Yamagishi, M. Tachibana, T. Masuko, and T. Kobayashi, “Speaking style adaptation using context clustering decision tree for HMMbased speech synthesis,” Proc. ICASSP 2004, pp.5–8, May 2004. [17] J. Yamagishi and T. Kobayashi, “Adaptive training for hidden semiMarkov model,” Proc. ICASSP 2005, pp.365–368, March 2005. [18] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” Proc. ICASSP-92, pp.137–140, March 1992. [19] K. Tokuda, T. Kobayashi, T. Fukada, H. Saito, and S. Imai, “Spectral estimation of speech based on mel-cepstral representation,” IEICE Trans. Fundamentals (Japanese Edition), vol.J74-A, no.8, pp.1240– 1248, Aug. 1991. [20] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, “Speech synthesis using HMMs with dynamic features,” Proc. ICASSP-96, pp.389– 392, May 1996. [21] P. McCullagh and J. Nelder, Generalized Linear Models, Chapman & Hall, 1989.

Junichi Yamagishi received the B.E. degree in computer science, M.E. and Dr.Eng. degrees in information processing from Tokyo Institute of Technology, Tokyo, Japan, in 2002, 2003, and 2006, respectively. He was also an intern researcher at ATR spoken language communication Research Laboratories (ATR-SLC) during 2003 - 2006. He is currently a research fellow of the Japan Society for the Promotion of Science (JSPS) at the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology. He is also a visiting researcher at the Centre for Speech Technology Research (CSTR), University of Edinburgh. His research interests include speech synthesis, speech analysis, and speech recognition. He is a member of IEEE, ISCA and ASJ.

Takao Kobayashi received the B.E. degree in electrical engineering, the M.E. and Dr.Eng. degrees in information processing from Tokyo Institute of Technology, Tokyo, Japan, in 1977, 1979, and 1982, respectively. In 1982, he joined the Research Laboratory of Precision Machinery and Electronics, Tokyo Institute of Technology as a Research Associate. He became an Associate Professor at the same Laboratory in 1989. He is currently a Professor of the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Japan. He is a co-recipient of both the Best Paper Award and the Inose Award from the IEICE in 2001, and the TELECOM System Technology Prize from the Telecommunications Advancement Foundation Award, Japan, in 2001. His research interests include speech analysis and synthesis, speech coding, speech recognition, and multimodal interface. He is a member of IEEE, ISCA, IPSJ and ASJ.
