
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 6, AUGUST 2012

Statistical Voice Conversion Based on Noisy Channel Model

Daisuke Saito, Member, IEEE, Shinji Watanabe, Senior Member, IEEE, Atsushi Nakamura, Senior Member, IEEE, and Nobuaki Minematsu, Member, IEEE

Abstract—This paper describes a novel framework of voice conversion that effectively uses both a joint density model and a speaker model. In voice conversion studies, approaches based on the Gaussian mixture model (GMM) with probabilistic densities of joint vectors of a source and a target speaker are widely used to estimate a transformation function between the two speakers. However, to achieve sufficient quality, these approaches require a parallel corpus which contains many utterances with the same linguistic content spoken by both speakers. In addition, the joint density GMM methods often suffer from overtraining effects when the amount of training data is small. To compensate for these problems, we propose a voice conversion framework which integrates the speaker GMM of the target with the joint density model using a noisy channel model. The proposed method trains the joint density model with a few parallel utterances and the speaker model with nonparallel data of the target, independently. This eases the burden on the source speaker. Experiments demonstrate the effectiveness of the proposed method, especially when the parallel corpus is small.

Index Terms—Joint density model, noisy channel model, probabilistic integration, speaker model, voice conversion (VC).

I. INTRODUCTION

VOICE conversion (VC) methods partly modify input voices into other ones while keeping some focused information unchanged. Most typically, VC techniques have been widely used for speaker conversion.

Manuscript received June 16, 2011; revised December 03, 2011; accepted January 26, 2012. Date of publication February 22, 2012; date of current version April 04, 2012. Part of this work was conducted while the first author was an intern at NTT Communication Science Laboratories and the second author was with NTT Communication Science Laboratories, NTT Corporation. It was conducted as a joint research project of NTT Corporation and The University of Tokyo. This work was supported in part by KAKENHI Grant-in-Aid for JSPS Fellows (22-8861). The work of D. Saito was supported by the Japan Society for the Promotion of Science. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Chung-Hsien Wu. D. Saito was with the Department of Electrical Engineering and Information Systems, The University of Tokyo, Tokyo, 113-8656, Japan. He is now with the Department of Information Physics and Computing, The University of Tokyo, Tokyo, 113-0033, Japan (e-mail: [email protected]). S. Watanabe was with NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan. He is now with Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA (e-mail: [email protected]). A. Nakamura is with NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan (e-mail: [email protected]). N. Minematsu is with the Department of Information and Communication Engineering, The University of Tokyo, Tokyo, 113-0033, Japan (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2012.2188628

Speaker conversion is a technique to transform a speaker's speech so that it sounds as if it had been uttered by another speaker, while keeping its linguistic content unchanged. In other words, VC here modifies the speaker identity in the speech. Many VC methods have been proposed to modify the speaker identity of generated utterances [1], [2], and they have mainly been applied to speaker identity control in text-to-speech (TTS) systems [3]. VC in more general terms can be interpreted as a mapping between two different feature spaces, which includes not only the modification of speaker identity, but also conversion from noisy speech to clean speech in speech enhancement [4], [5], from narrowband speech to wideband speech in telecommunication [6], techniques in media conversion for speaking aids [7], [8], etc. Although pitch modification is also important for voice conversion [9], in many cases the mapping focuses on spectral feature spaces, which play an important role in the transmission of linguistic information via speech, and the source and target spaces are in a nonlinear relationship with each other. One of the essential points in implementing VC is, therefore, how to construct a nonlinear mapping from the source to the target spectral feature space.

There are many ways to accomplish this purpose using source and target feature data. Statistical and non-statistical learning approaches have often been used for estimating the mapping, such as artificial neural networks (ANNs) [10], [11], the codebook mapping method [12], and Gaussian mixture models (GMMs) [2], [3]. These methods are roughly divided into two approaches: direct modeling of the nonlinear mapping, and a combination of multiple linear mappings between local feature spaces. Among these examples, approaches based on ANNs belong to the former. ANN models consist of interconnected processing nodes, and the connection between two nodes has a weight associated with it. They can directly design a nonlinear mapping between two whole feature spaces. In contrast, the other examples belong to the latter approaches, which are based on local linearities of the feature space. In the codebook mapping method [12], source and target feature vectors are clustered into local spaces, each of which has a quantized representative vector. Then, the correspondences between the representative vectors of the source and the target are constructed. In the conversion stage, an input vector is converted by quantizing it to the nearest representative vector and applying the corresponding mapping. Thus, this method is based on hard clustering and discrete mapping, and it suffers from large quantization errors. Recently, GMM-based approaches, which are based on soft clustering and continuous mapping, have been widely used.


GMM-based techniques for statistical mapping use a mixture of Gaussians to model the probabilistic densities of source feature vectors [2] or those of joint vectors of the source and the target speakers [3]. These parts correspond to the clustering stage. Both approaches derive the transformation function as a weighted sum of linear mappings, each corresponding to one Gaussian component, where the weights are calculated as posterior probabilities of the source vectors. Compared with the codebook mapping method, in GMM-based techniques input source vectors are softly allocated to each local space, i.e., to each Gaussian component, by using the posterior probabilities of the source vectors. Hence, GMM-based approaches realize a continuous mapping in which neighboring vectors in the source feature space are transformed to neighboring vectors in the target space. In addition, since the probabilistic treatment of GMM-based modeling is powerful and flexible, GMM-based approaches to VC are widely used.

However, the joint density GMM methods require a training corpus which contains many utterances with the same linguistic content from the source and target speakers to achieve sufficient quality. This requirement of a parallel corpus places a large burden on both speakers. In addition, the joint density GMM methods suffer from overtraining effects when the number of utterance pairs for training is small, since the dimensionality of the vector space to be estimated is doubled [3]. To solve these problems, several approaches have been proposed that compensate for them by effective use of nonparallel speech data [13]–[15]. These approaches are inspired by speaker adaptation techniques in speech recognition studies [16]–[19]. They apply parameter adaptation techniques to the parameters of the joint density model. However, these methods mainly focus on modeling of the joint density models, and they still require well-trained joint density models.

In this paper, we propose another method to compensate for these problems. The function of VC is divided into two functions: to ensure the consistency of the linguistic content between the two speakers, and to model the speaker individuality of the target. The proposed method realizes these functions with different and independent models, i.e., a joint density model constructed from a small parallel corpus and a speaker model trained on a nonparallel but large speech corpus of the target speaker. Note that there is no constraint on the linguistic content of the corpus for the speaker model. Finally, the two functions are integrated into one VC function using a noisy channel model, which is widely used in automatic speech recognition (ASR) [20] and statistical machine translation (SMT) [21]. In approaches using a noisy channel model, observed features are regarded as the output obtained through a noisy channel whose input is the target features [see Fig. 1(a)]. Then, as shown in Fig. 1(b), the problem is the reconstruction of the target features from the channel properties, the target properties, and the observed features. In the proposed method, the channel output and the channel input correspond to the source and target speech features, respectively. The two terms in the noisy channel model, the channel itself and the channel input, correspond well to the two functions of VC.
As with other noisy channel model-based approaches, since the proposed method can train the joint density model and the speaker model separately, it has the potential to incorporate precise modeling techniques proposed independently in each research area, such as eigenvoice conversion in VC studies [15], or approaches based on the universal background model [22] or joint factor analysis [23] in speaker recognition studies.


Fig. 1. Generation and decoding process of a noisy channel model. (a) Generation process through a noisy channel. (b) Decoding using input and channel properties.

Our voice conversion framework also eases the burden on the source speaker, since it is expected to work well even when the amount of training data for the joint density model is small. Furthermore, there is no constraint on the linguistic content of the training data for the speaker model. This approach was first proposed in [24]. This paper provides a more analytical treatment with a detailed derivation, and adds further experimental discussion using the CMU ARCTIC database and subjective evaluation.

As mentioned above, the proposed approach can flexibly incorporate more sophisticated modeling techniques into both models. In speech recognition and synthesis studies, dynamic features, which represent the temporal correlation between feature frames, are widely used to improve performance [25]–[27]. Particularly in speech synthesis, they mitigate the degradation of the perceptual quality of generated speech caused by discontinuities in the parameter trajectory [26], [27]. Hence, to achieve better quality, we also derive a parameter generation algorithm using dynamic features for our proposed voice conversion framework.

The remainder of this paper is organized as follows. Section II describes the conventional GMM-based VC approach using the joint density model [3]. Section III describes our proposed voice conversion using both the joint density model and the speaker model. Section IV describes experimental evaluations. Finally, Section VI concludes the paper.

II. GMM-BASED VOICE CONVERSION

In this section, the joint density GMM method is briefly described, because it is one of the fundamental methods of statistical feature mapping [2], [3], [7], [8], [13], [14]. Let $\mathbf{x}$ be a sequence of $D$-dimensional feature vectors characterizing an utterance from the source speaker, and $\mathbf{y}$ be that of the target speaker. Note that the two utterances contain the same linguistic content. The dynamic time warping (DTW) algorithm is applied to align the source vectors to their corresponding vectors in the target sequence.
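For concreteness, the following Python sketch shows this alignment step with a plain dynamic-programming DTW over Euclidean frame distances and then stacks the aligned frames into joint vectors; all function and variable names are ours, not the paper's, and a production system would typically use an optimized DTW implementation.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (T_x x D and T_y x D) with dynamic time
    warping and return the aligned (source, target) frame index pairs."""
    src, tgt = np.asarray(src), np.asarray(tgt)
    Tx, Ty = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j - 1],  # match
                                                  cost[i - 1, j],      # source advance
                                                  cost[i, j - 1])      # target advance
    # Backtrack from the end of both sequences to recover the warping path.
    path, i, j = [], Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def joint_vectors(src, tgt):
    """Stack DTW-aligned source and target frames into 2D-dimensional joint vectors."""
    return np.array([np.concatenate([src[i], tgt[j]]) for i, j in dtw_align(src, tgt)])
```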


Then, a new sequence of $2D$-dimensional joint vectors $\mathbf{z}_\tau = [\mathbf{x}_\tau^\top, \mathbf{y}_\tau^\top]^\top$ is created, where $\top$ denotes transposition of the vector and $\tau$ denotes a new time index after DTW is applied. The joint probability density of the source and the target vectors is modeled by a GMM for the joint vector as follows:

$$P(\mathbf{z}_\tau \mid \Theta^{(Z)}) = \sum_{m=1}^{M} w_m\, \mathcal{N}\!\left(\mathbf{z}_\tau;\, \boldsymbol{\mu}_m^{(Z)}, \boldsymbol{\Sigma}_m^{(Z)}\right) \qquad (1)$$

In (1), $\mathcal{N}(\cdot;\boldsymbol{\mu},\boldsymbol{\Sigma})$ denotes the normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, $m$ is the mixture component index, and the total number of mixture components is $M$. The weight of the $m$-th component is $w_m$, and $\sum_{m=1}^{M} w_m = 1$. $\Theta^{(Z)}$ denotes the parameter set of the GMM, which consists of weights, mean vectors, and covariance matrices for the individual mixture components. Since the feature space of the joint vector includes the feature spaces of the source and the target speakers as its subspaces, $\boldsymbol{\mu}_m^{(Z)}$ and $\boldsymbol{\Sigma}_m^{(Z)}$ are written as

$$\boldsymbol{\mu}_m^{(Z)} = \begin{bmatrix} \boldsymbol{\mu}_m^{(X)} \\ \boldsymbol{\mu}_m^{(Y)} \end{bmatrix}, \qquad \boldsymbol{\Sigma}_m^{(Z)} = \begin{bmatrix} \boldsymbol{\Sigma}_m^{(XX)} & \boldsymbol{\Sigma}_m^{(XY)} \\ \boldsymbol{\Sigma}_m^{(YX)} & \boldsymbol{\Sigma}_m^{(YY)} \end{bmatrix} \qquad (2)$$

where $\boldsymbol{\mu}_m^{(X)}$ and $\boldsymbol{\mu}_m^{(Y)}$ are the mean vectors of the $m$-th component for the source and for the target, respectively. Similarly, the matrices $\boldsymbol{\Sigma}_m^{(XX)}$ and $\boldsymbol{\Sigma}_m^{(YY)}$ are the covariance matrices of the $m$-th component for the source and for the target, respectively. The matrices $\boldsymbol{\Sigma}_m^{(XY)}$ and $\boldsymbol{\Sigma}_m^{(YX)}$ are the cross-covariance matrices of the $m$-th component between the source and the target. These parameters of the GMM are estimated by the EM algorithm using the sequence of joint vectors.

A mapping function to convert the source vector $\mathbf{x}_t$ to the target vector $\mathbf{y}_t$ is derived based on the conditional probability density of $\mathbf{y}_t$ given $\mathbf{x}_t$. This probability density can be represented by the parameters of the joint density model as follows:

$$P(\mathbf{y}_t \mid \mathbf{x}_t, \Theta^{(Z)}) = \sum_{m=1}^{M} P(m \mid \mathbf{x}_t, \Theta^{(Z)})\, P(\mathbf{y}_t \mid \mathbf{x}_t, m, \Theta^{(Z)}) \qquad (3)$$

where

$$P(m \mid \mathbf{x}_t, \Theta^{(Z)}) = \frac{w_m\, \mathcal{N}\!\left(\mathbf{x}_t; \boldsymbol{\mu}_m^{(X)}, \boldsymbol{\Sigma}_m^{(XX)}\right)}{\sum_{n=1}^{M} w_n\, \mathcal{N}\!\left(\mathbf{x}_t; \boldsymbol{\mu}_n^{(X)}, \boldsymbol{\Sigma}_n^{(XX)}\right)} \qquad (4)$$

$$P(\mathbf{y}_t \mid \mathbf{x}_t, m, \Theta^{(Z)}) = \mathcal{N}\!\left(\mathbf{y}_t;\, \mathbf{E}_{m,t}^{(Y)}, \mathbf{D}_m^{(Y)}\right) \qquad (5)$$

$$\mathbf{E}_{m,t}^{(Y)} = \boldsymbol{\mu}_m^{(Y)} + \boldsymbol{\Sigma}_m^{(YX)} \left(\boldsymbol{\Sigma}_m^{(XX)}\right)^{-1} \left(\mathbf{x}_t - \boldsymbol{\mu}_m^{(X)}\right) \qquad (6)$$

$$\mathbf{D}_m^{(Y)} = \boldsymbol{\Sigma}_m^{(YY)} - \boldsymbol{\Sigma}_m^{(YX)} \left(\boldsymbol{\Sigma}_m^{(XX)}\right)^{-1} \boldsymbol{\Sigma}_m^{(XY)} \qquad (7)$$

By minimizing the mean square error, the mapping function is derived as

$$\hat{\mathbf{y}}_t = \sum_{m=1}^{M} P(m \mid \mathbf{x}_t, \Theta^{(Z)})\, \mathbf{E}_{m,t}^{(Y)} \qquad (8)$$

When maximum-likelihood estimation is adopted for parameter generation [26], the covariance matrix of the conditional probability density in (7) is also taken into account, and the target parameters are generated by the following updating equation:

$$\hat{\mathbf{y}}_t = \left[\sum_{m=1}^{M} \gamma_{m,t} \left(\mathbf{D}_m^{(Y)}\right)^{-1}\right]^{-1} \left[\sum_{m=1}^{M} \gamma_{m,t} \left(\mathbf{D}_m^{(Y)}\right)^{-1} \mathbf{E}_{m,t}^{(Y)}\right] \qquad (9)$$

where $\gamma_{m,t}$ denotes the posterior probability of the $m$-th mixture component.
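The conventional mapping in (4), (6), and (8) is straightforward to implement once the joint density GMM has been trained. The following numpy sketch converts one source frame; the function and argument names are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert_frame(x, weights, mu_x, mu_y, S_xx, S_yx):
    """Map one source frame x to the target space with the joint density GMM,
    i.e., the posterior-weighted sum of conditional means as in (4), (6), (8)."""
    M = len(weights)
    # Mixture posteriors P(m | x) as in (4).
    post = np.array([weights[m] * multivariate_normal.pdf(x, mu_x[m], S_xx[m])
                     for m in range(M)])
    post /= post.sum()
    y_hat = np.zeros_like(mu_y[0], dtype=float)
    for m in range(M):
        # Conditional mean of the m-th component, as in (6).
        E_m = mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m])
        y_hat += post[m] * E_m               # MMSE mapping (8)
    return y_hat
```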

Compared with (8), the parameter generation in (9) also contains the precision matrices of the individual Gaussian components, which play the role of a kind of confidence measure for the conditional mean vectors in (6).

III. NOISY CHANNEL MODEL-BASED VOICE CONVERSION

A. Outline

Fundamentally, VC is a technique that allows us to convert the voice characteristics of the source speaker into those of the target speaker without changing the linguistic content. When a VC system is considered as a speech generator for the target speaker, the source speaker's utterance can be regarded as the seed of linguistic content for speech generation. From this point of view, the joint density model in conventional VC systems has two functions: 1) to model the proper correspondence of the source features to the target features so that the linguistic content is kept consistent, and 2) to represent the feature space of the target speaker precisely. The first function reflects one of the purposes of VC applications, namely keeping the linguistic content. For example, when the source speaker says "that was easy for us.", the converted utterance should also be "that was easy for us." To realize the first function, a parallel corpus is absolutely necessary for training a proper model, and conventional VC approaches strongly focus on this function. For the latter function, however, a nonparallel corpus is sufficient, since it only requires constructing the target speaker model. Unlike the training data of the joint density model, there is no constraint on the linguistic content (e.g., sentences, words, phonemes, etc.) of the nonparallel corpus.

In this paper, we realize these two functions with different models: a joint density model for the first and a speaker model for the second. Namely, we interpret the voice conversion model as a probabilistic integration of the joint density model and the speaker model. This is theoretically derived by using a noisy channel model, similarly to automatic speech recognition (ASR) and statistical machine translation (SMT) [20], [21]. In the proposed approach using a noisy channel model, a channel takes the target speech features as input and generates the source speech features as output. The problem then becomes a decoding problem: reconstructing the target features from the channel properties, the target properties, and the observed features. One of the merits of the noisy channel model is that it can be applied even when the training data of the source and the target are not parallel, because the target properties and the channel properties can be modeled separately. In our case, the two terms in the noisy channel model, the channel itself and the channel input, fit the two functions of VC well.


Fig. 2. Overview of the proposed framework.

B. Formulation

First, as with the conventional VC approach, we focus on the conditional probability density of the target vector $\mathbf{y}_t$ given the source vector $\mathbf{x}_t$. From the conditional maximum-likelihood criterion, the optimum output for the target vector is derived as follows:

$$\hat{\mathbf{y}}_t = \arg\max_{\mathbf{y}_t} P(\mathbf{y}_t \mid \mathbf{x}_t) \qquad (10)$$

By using the Bayes rule, in the same manner as in ASR or SMT, (10) is written as

$$\hat{\mathbf{y}}_t = \arg\max_{\mathbf{y}_t} P(\mathbf{x}_t \mid \mathbf{y}_t)\, P(\mathbf{y}_t) \qquad (11)$$

The proposed framework of voice conversion is illustrated in Fig. 2. In (11), the first term corresponds to the function that provides the consistency of the linguistic content between the source and the target speakers, because this inverse model is trained by a parallel corpus. The second term corresponds to the function that models the speaker individuality of the target. For the first term, we use the parameters of the joint density model trained by the parallel corpus. For the second term, we can use the speaker GMM, which is widely used in speaker recognition studies. The speaker GMM is trained on a nonparallel corpus of arbitrary utterances of the target speaker.

Here, we derive an algorithm of voice conversion based on (11). Let $\Theta^{(Z)}$ and $\Theta^{(Y)}$ be the parameters of the joint density model and those of the speaker model, respectively. We define a new likelihood function based on (11) as follows:

$$L(\mathbf{y}_t) = P(\mathbf{x}_t \mid \mathbf{y}_t, \Theta^{(Z)})\, P(\mathbf{y}_t \mid \Theta^{(Y)})^{\alpha} \qquad (12)$$

where the constant $\alpha$ denotes the weight for controlling the balance between the two models, similar to a language model weight in ASR. To obtain the optimum solution that maximizes the function $L(\mathbf{y}_t)$, we derive an auxiliary function with respect to $\mathbf{y}_t$. For the following derivation, similarly to the conventional joint density GMM, $n$ and $N$ are the mixture component index and the total number of mixture components in the speaker GMM, respectively:

(13)

(14)

where the terms in (14) are auxiliary functions as follows:

(15)

(16)

(17)

(18)

To derive (14), we use Jensen's inequality. In GMM-based speaker modeling in the areas of voice conversion or speaker recognition, each Gaussian component broadly corresponds to some phonological unit, such as a phoneme, because the GMM is usually trained on a single speaker. From this viewpoint, the mixture posteriors approximately represent the phonemic identity of each frame. In VC, the linguistic content of the utterance should be preserved. Hence, (15) does not change drastically when $\mathbf{y}_t$ changes, i.e., the derivative of (15) with respect to $\mathbf{y}_t$ can be ignored. However, since this assumption is heuristic, it should be confirmed experimentally.


Finally, we iteratively maximize the following function to optimize $\mathbf{y}_t$:

(19)

By setting the derivative of (19) with respect to $\mathbf{y}_t$ to zero, the following updating equation is derived:

(20)

where $\boldsymbol{\mu}_n^{(Y)}$ and $\boldsymbol{\Sigma}_n^{(Y)}$ are the mean vector and the covariance matrix of the $n$-th component in the speaker GMM, and

(21)

(22)

The notation $(\cdot)^{\dagger}$ denotes the pseudo-inverse of a matrix. The derivation of (20) is described in Appendix A. The algorithm for parameter generation using (18) and (20) is as follows.

1) For the initial values of the iteration, the required statistics are initialized from the joint density model, in order to compensate for the mismatch between the feature spaces of the source and the target.
2) Using the current values and (20), the target vector is calculated.
3) The estimate of the target vector is updated by the value calculated at Step 2.
4) The auxiliary statistics are modified using the updated estimate and (18).
5) Steps 2 to 4 are repeated until the number of repetitions reaches the preset value.

Equation (20) has a similar form to (9), but it becomes a weighted summation of the effects from the joint density model and those from the speaker model. Thus, our proposed method can overcome the sparse parallel data problem by reducing the overestimation effects of the joint density parameters with the speaker model.
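To make the decoding procedure concrete, the following Python sketch mirrors steps 1)–5) above for a single frame. Since the exact update equation (20) is not reproduced here, the per-iteration combination shown is only an illustrative precision-weighted blend of the joint-density conditional statistics and the speaker-GMM statistics, in the spirit of the description of (20) as a weighted summation of the two models' effects; it is a sketch under that assumption, not the paper's update, and all names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def decode_frame(x, joint, spk, alpha=1.0, n_iter=5):
    """Iterative target estimation for one source frame x.
    joint: dict with 'weights', 'mu_x', 'mu_y', 'S_xx', 'S_yx', 'D_y' (cf. (6), (7)).
    spk:   dict with 'weights', 'mu', 'S' for the target speaker GMM.
    alpha: balance weight between the two models (cf. the language model weight in ASR)."""
    M = len(joint['weights'])
    # Step 1: initialize y with the conventional conditional-mean mapping, to
    # compensate for the mismatch between the source and target feature spaces.
    post_x = np.array([joint['weights'][m] *
                       multivariate_normal.pdf(x, joint['mu_x'][m], joint['S_xx'][m])
                       for m in range(M)])
    post_x /= post_x.sum()
    E = [joint['mu_y'][m] +
         joint['S_yx'][m] @ np.linalg.solve(joint['S_xx'][m], x - joint['mu_x'][m])
         for m in range(M)]
    y = sum(p * e for p, e in zip(post_x, E))

    for _ in range(n_iter):                           # Steps 2-5: fixed number of iterations
        # Recompute speaker-GMM posteriors for the current estimate of y.
        post_y = np.array([spk['weights'][n] *
                           multivariate_normal.pdf(y, spk['mu'][n], spk['S'][n])
                           for n in range(len(spk['weights']))])
        post_y /= post_y.sum()
        # Illustrative precision-weighted combination of both models (NOT the exact (20)).
        A = sum(p * np.linalg.inv(D) for p, D in zip(post_x, joint['D_y']))
        b = sum(p * np.linalg.inv(D) @ e for p, D, e in zip(post_x, joint['D_y'], E))
        A = A + alpha * sum(p * np.linalg.inv(S) for p, S in zip(post_y, spk['S']))
        b = b + alpha * sum(p * np.linalg.inv(S) @ mu
                            for p, S, mu in zip(post_y, spk['S'], spk['mu']))
        y = np.linalg.solve(A, b)
    return y
```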

C. Constraint From Dynamic Features

In frame-by-frame mapping, where the temporal correlation of the feature vectors is ignored, the discontinuity of the parameter trajectory becomes a problem. For example, a comparison between parameter trajectories of natural speech and those converted by frame-by-frame conversion is shown in [26]. Compensation for this problem affects the perceptual quality of converted or synthesized speech [26], [27]. In our case, the parameter generation of (20) is also a frame-by-frame mapping. Hence, improvements in perceptual quality are expected if we tackle this problem. To compensate for the discontinuity of a feature sequence, several approaches that smooth the output parameter sequence have been proposed. Chen et al. applied a median filter and a low-pass filter to the parameter generation in VC to smooth the parameter trajectory [28]. Toda et al. proposed maximum-likelihood estimation of the spectral parameter trajectory considering dynamic features [26]. In this paper, we also apply parameter generation considering dynamic features to our proposed approach, in a similar manner to the approaches in [26].

From here, let a time sequence of the source features and that of the target ones be $\mathbf{X}$ and $\mathbf{Y}$, respectively. $\mathbf{X}$ and $\mathbf{Y}$ consist of static and dynamic features. The joint density GMM and the speaker GMM are trained on these features, as in the previous section. In the parameter generation, the explicit relation between the static and dynamic features is utilized. A time sequence of the converted feature vectors is derived as follows:

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}}\; P(\mathbf{X} \mid \mathbf{Y}, \Theta^{(Z)})\, P(\mathbf{Y} \mid \Theta^{(Y)})^{\alpha} \qquad (23)$$

subject to

$$\mathbf{Y} = \mathbf{W}\mathbf{y} \qquad (24)$$

where $\mathbf{W}$ denotes the matrix that extends the static feature sequence to the static and dynamic feature sequence. In a similar manner to that in [26], we derive the following updating equations:

(25)

where

(26)

(27)

(28)

(29)

(30)

Comparing (25) with the MLE-based method in [26], the proposed generation has a similar form, but it includes the effects of the independent speaker GMM. By comparing (25), (28), and (29) with (20), the proposed generation can be regarded as being constrained by the dynamic features. Thus, the proposed generation extends both the MLE-based method and voice conversion based on a noisy channel model without dynamic features.

IV. EXPERIMENTS

A. Experimental Setups

To evaluate the performance of our proposed approach, voice conversion experiments were performed using databases of both English and Japanese sentences. The experiments had two objectives. The first was to verify that the probabilistic integration of the joint density model and the speaker model based on the noisy channel model properly mitigates the sparse data problem which occurs when the amount of parallel data is insufficient. The second was to demonstrate the effectiveness of parameter generation using the constraint of dynamic features.

1) Databases: As the English speech data, we used speech data of four speakers from the CMU ARCTIC database [29] (two male speakers, bdl and rms, and two female speakers, clb and slt).


Voice conversion was performed for 12 speaker pairs, i.e., all ordered pairs of the four speakers. We selected 16 sentences (from a0001 to a0016) for training the joint density models, and training sets consisting of 1, 2, 4, 8, and 16 sentences each were prepared from these 16 sentences. For the speaker models, we prepared two training sets consisting of 8 and 128 sentences. Note that the sentences used for training the joint density models were not included in the sentences used for the speaker GMMs. An evaluation set consisting of 50 sentences was selected.

As the Japanese speech samples, we used speech samples from three male speakers (MSH as the source speaker, MMY and MTK as the target speakers) in the ATR Japanese speech database B-set [30]. This database consists of 503 phonetically balanced sentences and was used for both objectives of the experiments. We selected the last 53 sentences as test data. For training of the joint density models, the number of training sentences was varied from 1 to 8. For the speaker GMMs, we selected two training sets consisting of 2 and 50 sentences from the database.

2) Conditions: Using the above databases, we carried out three kinds of experiments. Table I shows the experimental conditions of each experiment in terms of databases, dynamic features, and the weight $\alpha$ for controlling the balance between the joint density models and the speaker models. The balance weight was fixed for each experimental condition and was selected through preliminary experiments.1 In the experiment using the CMU ARCTIC database (condition A), because we only intended to meet the first objective, i.e., to evaluate the basic performance of our proposed framework, the constraint of dynamic features was not applied in parameter generation. We compared the proposed approach with the conventional VC technique based on (9), which uses only the joint density model.2 Subjective evaluation for condition A was not carried out, since listening tests by nonnative listeners are not reasonable. In the experiments using the Japanese speech samples, we first compared the proposed approach with the conventional method similarly to the case of the CMU ARCTIC database; this is condition B. For the second objective of the experiment, i.e., in order to evaluate the effects of the dynamic features in our proposed method, we also compared the MLE-based methods with and without dynamic features [26], and our proposed methods with and without dynamic features ((25) and (20), respectively). This experiment corresponds to condition C.

1 In preliminary experiments, the value of $\alpha$ affected the mel-cepstral distortion by about 0.05 dB at most. The determination of $\alpha$ could be affected by several factors, e.g., the use of the dynamic features, the language of the source and target speech, the balance of training data for the joint density model and the speaker model, etc. The optimization of $\alpha$ depending on the conditions is an important problem. However, one of the purposes of the experiments is to evaluate the performance of our approach based on the noisy channel model itself. In this paper, $\alpha$ for each condition is carefully selected through the preliminary experiments.

2 This paper used the optimized numbers of mixture components of the joint density model ($M$) and the speaker GMM ($N$) in each combination of training sets, similar to [26]. However, the optimization of model structure in speech modeling is an important problem, and there are several approaches using statistical/machine learning techniques (e.g., the MDL/BIC criterion for speech synthesis [31] and variational Bayes for speech recognition [32]). Therefore, we are investigating the optimization problem of the proposed approach in [33].


TABLE I EXPERIMENTAL CONDITIONS

Fig. 3. Example of a likelihood function as a function of the number of iterations. The source and target speakers are bdl and clb, respectively. The number of sentence pairs for the joint density model is 1, and the number of training data for the speaker model is 128. The optimized numbers of M and N are 8 and 16, respectively. The sentence is a0544 in the CMU ARCTIC database.

In conditions B and C, both objective and subjective evaluations were carried out to evaluate the total performance of our proposed framework.

We used 24-dimensional mel-cepstrum vectors for the spectral representation. These are derived by STRAIGHT analysis [34]. Aperiodic components, which are features to construct STRAIGHT mixed excitation, are not converted in this study, and they are fixed to 30 dB at all frequencies. Prosodic features, i.e., the power coefficient (zeroth cepstral coefficient) and the log-scaled fundamental frequency, were converted in a simple manner that considers only the mean and the standard deviation of the parameters as follows:

$$\hat{p}_t = \frac{\sigma^{(y)}}{\sigma^{(x)}}\left(p_t - \mu^{(x)}\right) + \mu^{(y)} \qquad (31)$$

where $p_t$ and $\hat{p}_t$ are the source and converted values at frame $t$, and $\mu^{(x)}, \sigma^{(x)}$ and $\mu^{(y)}, \sigma^{(y)}$ are the means and the standard deviations of the source and target parameters, respectively. Note that dynamic parameters of the prosodic features were not considered in any of the experiments.

B. Effectiveness of the Proposed Method

1) Objective Evaluations Using CMU ARCTIC Database: For the objective evaluation, we evaluated the conversion performance using the mel-cepstral distortion between the target and converted vectors. The mel-cepstral distortion is defined by the following equation:

$$\mathrm{MCD\ [dB]} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{24}\left(mc_d^{(y)} - mc_d^{(\hat{y})}\right)^2} \qquad (32)$$

where $mc_d^{(y)}$ and $mc_d^{(\hat{y})}$ are the $d$-th coefficients of the target and converted mel-cepstrum vectors, respectively.

Fig. 3 shows an example of the likelihood function as a function of the number of iterations. The dashed line is the value of the auxiliary function of (19), and the dotted line is the value of the auxiliary function of (14).
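The measures in (31) and (32) are straightforward to compute; the following minimal numpy sketch (function names are ours) implements the averaged mel-cepstral distortion and the mean-variance prosody conversion.

```python
import numpy as np

def mel_cepstral_distortion(mc_target, mc_converted):
    """Averaged mel-cepstral distortion in dB between two aligned sequences of
    24-dimensional mel-cepstra (power coefficient excluded), as in (32)."""
    diff = np.asarray(mc_target) - np.asarray(mc_converted)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def convert_prosody(p_src, mean_src, std_src, mean_tgt, std_tgt):
    """Mean-variance conversion of a prosodic parameter (e.g., log F0), as in (31)."""
    return (np.asarray(p_src) - mean_src) / std_src * std_tgt + mean_tgt
```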


Fig. 4. Results of objective evaluations using the CMU ARCTIC database as a function of the number of training sentences of the joint density model. Averaged mel-cepstral distortion was calculated from all the converted samples. The optimized numbers of M and N are indicated in parentheses as (M, N) for each size of the training set. Note that the constraint of the dynamic features is not considered in either the conventional or the proposed method. D denotes the number of training sentences for the speaker GMM in the proposed method.

This result shows that the derivation of (19) is reasonable. It is also observed that five iterations of reestimation are enough from the viewpoint of likelihood. Hence, from here on, in all the experiments, the number of iterations for the updating equations [(20) and (25) to (30)] was fixed to 5.

Fig. 4 shows the result of average mel-cepstral distortion for the test data as a function of the number of training sentences of the joint density model in condition A. The dashed line labeled "Conventional" is the result of the conventional method using (9), and the two solid lines labeled "Proposed" are the results of the proposed method using (20). Compared with the conventional method, our proposed method was significantly better when the number of training sentences for the joint density model was 1, 2, 4, or 8. On the other hand, it was not when the number was 16. This is because of the sparse data problem, i.e., with little parallel data the conventional method could not train the model parameters sufficiently. The proposed method effectively mitigated the problem by using the speaker model, which was trained independently. It is observed that the performance with D = 128 was better than that with D = 8. This result means that better speaker models properly improve the performance of our proposed method. It is also observed in Fig. 4 that the optimum number of mixtures for the joint density model and that for the speaker model are sometimes different. That is to say, our proposed method can flexibly control the complexity of the models within the whole framework. This is another advantage of the noisy channel model based voice conversion, which can flexibly integrate the functions of the joint density model and the speaker model.

2) Evaluations Using ATR Database: Fig. 5 shows the result of average mel-cepstral distortion for the test data using the ATR database as a function of the number of training sentences of the joint density model. The dashed line labeled "Conventional" is the result of the conventional method using only the joint density models, and the two solid lines are those of the proposed method. The respective optimal parameters M and N are selected for each combination of the training sets. As in the previous experiment, it is observed that our proposed method mitigated the sparse data problem of the joint density model.

Fig. 5. Results of objective evaluations using the ATR database as a function of the number of training sentences of the joint density model. The optimized numbers of M and N are indicated in parentheses as (M, N) for each experimental condition. Note that the constraint of the dynamic features is not considered in either the conventional or the proposed method. D denotes the number of training sentences for the speaker GMM in the proposed method.

The proposed method improved on the performance of the conventional one even when both methods used only one pair of parallel sentences. In this experiment, compared with the conventional method, the proposed method totally outperforms the conventional one even when the number of training sentences for the speaker GMM is small. As in the experiment using the CMU ARCTIC database, this shows the effectiveness of the proposed method, which mitigates the sparse data problem by using the speaker model, as discussed in Section III-B.

Generally, our proposed method does not depend on a specific language. However, its effectiveness could differ, because the relative importance of the two functions of VC could differ among languages, for example, between English and Japanese. English has more phonemes than Japanese. Hence, for English, it could be more important to keep the consistency of linguistic content between the source and target speakers. On the other hand, in the case of Japanese, modeling of the target speaker space could be more important.

For the subjective evaluations, a listening test was carried out to evaluate the naturalness of converted speech and the conversion accuracy for speaker individuality. The numbers of training sentences for the joint density model and the speaker model are 1 and 50, respectively. The purpose of the evaluations is to clarify the effects of the speaker model on two issues of VC, naturalness and speaker individuality, which were not clear in the objective evaluations. Note that the number of training sentences of the joint density model is fixed to 1. Hence, degradation of intelligibility caused by the joint density model is not considered in the test. The test was conducted with 15 subjects to compare the utterances converted by the proposed method and those converted by the conventional method. Note that the dynamic features were not used in this test. To evaluate naturalness, a paired comparison was carried out. In this test, pairs of two different types of converted speech samples were presented to the subjects, and each subject judged which sample sounded more natural. To evaluate conversion accuracy, an RAB test was performed. In this test, pairs of the two different types of converted samples were presented after presenting a reference sample of the target speech.


Fig. 6. Results of subjective evaluations in the listening test. Error bars indicate 95% confidence intervals. Note that the dynamic features are not used. The numbers of mixtures (M, N) are 16 and 32, respectively.

Fig. 7. Example of generated parameter trajectories with/without the dynamic features. Several local skips are observed in ellipses.

The number of sample pairs evaluated by each subject was 12 in each test. Fig. 6 shows the preference scores for the listening test. For the individuality of the generated speech samples, the proposed method outperformed the conventional one. This is quite reasonable, because the speaker model corresponds to the function that represents the space of the target speaker. From this viewpoint, the speaker model effectively improved the quality of the converted samples. In addition, for the naturalness of the generated speech samples, the proposed method also outperformed the conventional one, although the improvement was slightly smaller than that for speaker individuality. That is to say, in our framework based on the noisy channel model, the speaker model effectively mitigates the data sparseness problem of the joint density model for both the naturalness and the speaker individuality of the generated samples. From the results of both the objective and subjective evaluations, we demonstrate that the proposed VC method based on the noisy channel model outperforms the conventional method when the parallel corpus is small.

C. Further Improvement by Using Dynamic Features

We evaluated the effects of dynamic features in an experiment using the ATR database. After the delta parameters were added to the training feature vectors, the joint density model and the speaker model including the dynamic features were reconstructed. The other experimental conditions were the same as those of the experiment that did not use the dynamic features. Fig. 7 shows an example of the trajectories converted by the proposed methods with and without the dynamic features. In the result of the proposed method without the dynamic features, inappropriate skips are observed (in ellipses) on the trajectory. On the other hand, the proposed method with the dynamic features obtained a smooth parameter trajectory. Fig. 8 shows the result of average mel-cepstral distortion for the test data as a function of the number of training sentences for the joint density model. From Fig. 8, compared with the conventional MLE-based methods, the results of the proposed method based on the noisy channel model were better. These results firmly show the effectiveness of the proposed VC framework, which appropriately compensates for the sparse data problem of the joint density model.
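The delta parameters added to the feature vectors above can be computed, for example, with a simple two-frame regression window; the exact delta window used in the paper is not specified here, so the following short sketch is an assumption.

```python
import numpy as np

def append_delta(feats):
    """Append simple delta (dynamic) features to a (frames x dims) matrix,
    computed here as 0.5 * (c[t+1] - c[t-1]) with edge frames replicated."""
    feats = np.asarray(feats)
    padded = np.pad(feats, ((1, 1), (0, 0)), mode='edge')
    delta = 0.5 * (padded[2:] - padded[:-2])
    return np.hstack([feats, delta])
```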

Fig. 8. Results of averaged mel-cepstral distortion as a function of the number of training sentences of the joint density model. The optimized numbers of M and N are indicated in parentheses as (M, N) for each experimental condition. In these experiments, the constraint of the dynamic features is considered. The number of training sentences for the speaker GMM in the proposed method is fixed to 50.

In the conventional methods, as mentioned in [26], the dynamic features effectively improved the performance of voice conversion in terms of mel-cepstral distortion. On the other hand, in our proposed methods, an improvement from the constraint of the dynamic features is not observed in terms of mel-cepstral distortion. It is observed that the optimum numbers of mixtures for the speaker models in Fig. 8 are larger than those in Fig. 5. Similar results were also obtained in [26]. This means that a larger number of mixture components is required for the speaker model, not only for the joint density model, when dynamic features are considered.

In order to evaluate the perceptual effects of the dynamic features in our proposed method, a listening test was carried out. The test was conducted with ten subjects to compare the utterances converted by the proposed methods based on the noisy channel model with and without dynamic features. The numbers of training sentences for the joint density model and the speaker model are 1 and 50, respectively. For the method without dynamic features, the number of mixtures of the joint density model M was fixed to 8, and the number of mixtures of the speaker GMM N was 16. For the method with dynamic features, the number of mixtures of the joint density model M was fixed to 16, and the optimal number of mixtures of the speaker GMM N was selected for each speaker. As in the previous listening test, a paired comparison to evaluate naturalness and an RAB test to evaluate the conversion accuracy for speaker individuality were carried out.


VI. CONCLUSION

Fig. 9. Results of subjective evaluations in the listening test. Error bars indicate 95% confidence intervals.

The number of sample pairs evaluated by each subject was 12 in each test. Fig. 9 shows the preference scores as the results of the listening test. According to the figure, the proposed method with dynamic features outperformed that without dynamic features in both naturalness and speaker individuality. The subjective scores show that the discontinuity of the spectral trajectory was mitigated by considering the constraint of dynamic features, and the perceptual quality of the converted speech was improved. Hence, we can say that dynamic features also work well in our framework.3

V. DISCUSSIONS

The requirement of a parallel corpus for VC systems becomes a barrier to flexible application of VC techniques. Therefore, several approaches have been proposed that effectively use other speech data. Mouchtaris et al. proposed an unsupervised training method based on maximum-likelihood constrained adaptation of a GMM trained with an existing parallel data set of a different speaker pair [13]. Lee et al. proposed another approach based on maximum a posteriori (MAP) adaptation [14]. Toda et al. proposed eigenvoice conversion, in which the speaker's feature space is constructed by a weighted sum of eigenvectors and controlled by the weights of these eigenvectors [15]. This method is inspired by the eigenvoice method in speech recognition [19]. Since these approaches mainly focus on flexible control of the speaker individuality, they are inspired by several methods of speaker adaptation in speech recognition studies. On the other hand, our approach uses a noisy channel model as its framework, which is widely used for decoding in ASR and SMT studies. Hence, it can be considered a decoding approach to voice conversion. Since the viewpoints on voice conversion differ, our approach and other approaches using nonparallel speech data do not conflict with each other. Indeed, since the proposed method can train the joint density model and the speaker model separately, it has the potential to incorporate precise modeling techniques proposed independently in VC and speaker recognition studies, including the above methods.

3 Some samples are available at http://www.gavo.t.u-tokyo.ac.jp/~dsk_saito/ieee/ncmvc/.

We have proposed a new voice conversion framework based on a noisy channel model, which integrates the speaker model and the joint density model into one VC function. This approach uses nonparallel data from the target speaker effectively and works well when the amount of parallel data is limited. In this paper, we have also proposed parameter generation using the constraint of the dynamic features for our voice conversion method. By using the dynamic features, the perceptual quality of the generated samples is improved. Since the proposed method can train the joint density model and the speaker model separately, it has the potential to apply more precise modeling to both the joint density model and the speaker model. In addition, since our approach is based on the noisy channel model, which is the same framework as ASR and SMT, more sophisticated integration could be realized by importing knowledge from decoder studies in ASR and SMT. For example, the optimization of the weighting parameter is included in this scope. For further improvements, we are planning to apply more sophisticated models for both the joint density model and the speaker model. For example, in VC studies, several HMM-based voice conversion methods, which can capture the temporal dynamics of speech more precisely, have been proposed [35], [36]. We will investigate the combination of our proposed framework and HMM-based voice conversion. In addition, since prosodic modeling is important for improving conversion quality [9], [37], we should also investigate more sophisticated modeling of the prosodic features for our proposed framework. Finally, we are planning to derive a Bayesian framework for smooth integration of these modeling methods, including model optimization.

APPENDIX A
DERIVATION OF EQUATION (20)

The update equation (20) is derived from the auxiliary function in (19) in a similar manner to [26]. The first term of the auxiliary function of (19) is written as


(33)

where the terms in (33) are described as follows, similarly to (6) and (7):

(34)

(35)

The remaining terms are given by (21) and (22), respectively, and the constants are independent of the target vector. As with the first term, the second term of the auxiliary function of (19) is written as

(36)

(37)

By setting the derivative with respect to the target vector to zero, (20) is derived.

ACKNOWLEDGMENT

The authors would like to thank Dr. T. Toda of NAIST, Japan, for valuable discussion on VC approaches.

REFERENCES

[1] H. Ye and S. Young, "Quality-enhanced voice morphing using maximum likelihood transformations," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1301–1312, Jul. 2006.
[2] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, Mar. 1998.
[3] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, 1998, vol. 1, pp. 285–288.
[4] L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang, "High-performance robust speech recognition using stereo training data," in Proc. ICASSP, 2001, pp. 301–304.
[5] J. Droppo, A. Acero, and L. Deng, "Uncertainty decoding with SPLICE for noise robust speech recognition," in Proc. ICASSP, 2002, pp. 57–60.
[6] K. Y. Park and H. S. Kim, "Narrowband to wideband conversion of speech using GMM based transformation," in Proc. ICASSP, 2000, pp. 1847–1850.
[7] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Speaking aid system for total laryngectomees using voice conversion of body transmitted artificial speech," in Proc. Interspeech, 2006, pp. 1395–1398.
[8] A. Kunikoshi, Y. Qiao, N. Minematsu, and K. Hirose, "Speech generation from hand gestures based on space mapping," in Proc. Interspeech, 2009, pp. 308–311.

[9] A. Mousa, “Voice conversion using pitch shifting algorithm by time stretching with PSOLA and re-sampling,” J. Elect. Eng., vol. 61, no. 1, pp. 57–61, 2010. [10] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, “Transformation of formants for voice conversion using artificial neural networks,” Speech Commun., vol. 16, no. 2, pp. 207–216, 1995. [11] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, “Voice conversion using artificial neural networks,” in Proc. ICASSP, 2009, pp. 3893–3896. [12] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conversion through vector quantization,” in Proc. ICASSP, 1988, pp. 655–658. [13] A. Mouchtaris, J. V. der Spiegel, and P. Mueller, “Nonparallel training for voice conversion based on a parameter adaptation approach,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 952–963, Mar. 2006. [14] C. H. Lee and C. H. Wu, “Map-based adaptation for speech conversion using adaptation data selection and non-parallel training,” in Proc. Interspeech, 2006, pp. 2254–2257. [15] T. Toda, Y. Ohtani, and K. Shikano, “Eigenvoice conversion based on Gaussian mixture model,” in Proc. Interspeech, 2006, pp. 2446–2449. [16] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Comput. Speech Lang., vol. 9, pp. 171–185, 1995. [17] M. J. F. Gales, “Maximum likelihood linear transformations for HMMbased speech recognition,” Comput. Speech Lang., vol. 12, pp. 75–98, 1998. [18] C.-H. Lee, C.-H. Lin, and B.-H. Juang, “A study on speaker adaptation of the parameters of continuous density hidden Markov models,” IEEE Trans. Audio, Speech, Lang. Process., vol. 39, no. 4, pp. 806–814, Apr. 1991. [19] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, “Rapid speaker adaptation in eigenvoice space,” IEEE Trans. Speech Audio Process., vol. 8, no. 6, pp. 695–707, Nov. 2000. [20] F. Jelinek, “Continuous speech recognition by statistical methods,” Proc. IEEE, vol. 64, no. 4, pp. 532–556, Apr. 1976. [21] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, “A statistical approach to machine translation,” Comput. Linguist., vol. 16, no. 2, pp. 79–85, 1990. [22] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., vol. 10, no. 1–3, pp. 19–41, 2000. [23] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study of interspeaker variability in speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 5, pp. 980–988, Jul. 2008. [24] D. Saito, S. Watanabe, A. Nakamura, and N. Minematsu, “Probabilistic integration of joint density model and speaker model for voice conversion,” in Proc. Interspeech, 2010, pp. 1728–1731. [25] S. Furui, “Speaker-independent isolated word recognition using dynamic features of speech spectrum,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 1, pp. 52–59, Feb.. 1986. [26] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2222–2235, Nov. 2007. [27] K. Tokuda, T. Kobaayashi, and S. Imai, “Speech parameter generation from HMM using dynamic features,” in Proc. ICASSP, 1995, pp. 660–663. [28] Y. Chen, M. Chu, E. Chang, J. Jiu, and R. 
Liu, “Voice conversion with smoothed GMM and MAP adaptation,” in Proc. Eurospeech, 2003, pp. 2413–2416. [29] J. Kominek and A. W. Black, “CMU ARCTIC databases for speech synthesis,” Lang. Technol. Inst., Carnegie Mellon Univ., Pittsburgh, PA, 2003 [Online]. Available: http://festvox.org/cmu_arctic/index.html [30] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, “ATR Japanese speech database as a tool of speech recognition and synthesis,” Speech Commun., vol. 9, pp. 357–363, 1990. [31] H. Zen, K. Tokuda, and T. Kitamura, “An introduction of trajectory model into HMM-based speech synthesis,” in Proc. 5th ISCA Speech Synth. Workshop, 2004, pp. 191–196. [32] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, “Variational Bayesian estimation and clustering for speech recognition,” IEEE Trans. Speech Audio Process., vol. 12, no. 4, pp. 365–381, Jul. 2004. [33] D. Saito, S. Watanabe, A. Nakamura, and N. Minematsu, “High accurate model-integration-based voice conversion using dynamic features and model structure optimization,” in Proc. ICASSP, 2011, pp. 4576–4579.


[34] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun., vol. 27, pp. 187–207, 1999. [35] H. Duxans, A. Bonafonte, A. Kain, and J. Van Santen, “Including dynamic and phonetic information in voice conversion systems,” in Proc. ICSLP, 2004, pp. 1193–1196. [36] C. H. Wu, C. C. Hsia, T. H. Liu, and J. F. Wang, “Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1109–1116, Jul. 2004. [37] C.-C. Hsia, C.-H. Wu, and J.-Y. Wu, “Exploiting prosody hierarchy and dynamic features for pitch modeling and generation in HMM-based speech synthesis,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 1994–2003, Nov. 2010.

Daisuke Saito (M’11) received the B.E., M.S., and Dr. Eng. degrees from the University of Tokyo, Tokyo, Japan, in 2006, 2008, and 2011, respectively. From 2010 to 2011, he was a Research Fellow (DC2) of the Japan Society for the Promotion of Science. He is currently an Assistant Professor in the Graduate School of Information Science and Technology, University of Tokyo. He is interested in various areas of speech engineering, including voice conversion, speech synthesis, acoustic analysis, speaker recognition, and speech recognition. Dr. Saito is a member of the International Speech Communication Association (ISCA), the Acoustical Society of Japan (ASJ), the Institute of Electronics, Information, and Communication Engineers (IEICE), the Japanese Society for Artificial Intelligence (JSAI), and the Institute of Image Information and Television Engineers (ITE).

Shinji Watanabe (M’03–SM’12) received the B.S., M.S., and Dr.Eng. degrees from Waseda University, Tokyo, Japan, in 1999, 2001, and 2006, respectively. From 2001 to 2011, he was a Research Scientist at NTT Communication Science Laboratories, Kyoto, Japan. From January to March in 2009, he was a Visiting Scholar at the Georgia Institute of Technology, Atlanta. From 2011, he has been with Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA. His research interests include Bayesian learning, pattern recognition, and speech and spoken language processing. Dr. Watanabe is a member of the Acoustical Society of Japan (ASJ) and the Institute of Electronics, Information, and Communications Engineers (IEICE). He received the Awaya Award from the ASJ in 2003, the Paper Award from the IEICE in 2004, the Itakura Award from ASJ in 2006, and the TELECOM System Technology Award from the Telecommunications Advancement Foundation in 2006.

Atsushi Nakamura (SM'07) received the B.E., M.E., and Dr.Eng. degrees from Kyushu University, Fukuoka, Japan, in 1985, 1987, and 2001, respectively. In 1987, he joined Nippon Telegraph and Telephone Corporation (NTT), where he engaged in the research and development of network service platforms, including studies on the application of speech processing technologies to network services, at Musashino Electrical Communication Laboratories, Tokyo, Japan. From 1994 to 2000, he was with the Advanced Telecommunications Research (ATR) Institute, Kyoto, Japan, as a Senior Researcher, working on spontaneous speech recognition, the construction of spoken language databases, and the development of speech translation systems. Since April 2000, he has been with NTT Communication Science Laboratories, Kyoto, Japan. His research interests include acoustic modeling of speech, speech recognition and synthesis, spoken language processing systems, speech production and perception, computational phonetics and phonology, and the application of learning theories to signal analysis and modeling. Dr. Nakamura is a member of the Machine Learning for Signal Processing (MLSP) Technical Committee, and has served as a Vice Chair of the Signal Processing Society Kansai Chapter. He is also a member of the Institute of Electronics, Information and Communication Engineers (IEICE) and the Acoustical Society of Japan (ASJ). He received the IEICE Paper Award in 2004, and twice received the Telecom-Technology Award of The Telecommunications Advancement Foundation, in 2006 and 2009.

Nobuaki Minematsu (M’08) received the Ph.D. degree in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1995. In 1995, he became a Research Associate in the Department of Information and Computer Science, Toyohashi University of Technology, and in 2000, he became an Associate Professor in the Graduate School of Engineering, University of Tokyo. Since 2012, he has been a Professor there. From 2002 to 2003, he was a Visiting Researcher at the Royal Institute of Technology (KTH), Sweden. He has a wide interest in speech science and engineering, covering phonetics, phonology, language acquisition and learning, speech perception, speech disorders, speech analysis, speech recognition, speech synthesis, and speech applications. He is a board member of SLaTE (The ISCA Special Interest Group on Speech and Language Technology in Education). He is a member of ISCA, IPA, the Computer Assisted Language Instruction Consortium, the Institute of Electronics, Information and Communication Engineers, the Acoustical Society of Japan, the Information Processing Society of Japan, the Japanese Society for Artificial Intelligence, the Research Institute of Signal Processing, the Phonetic Society of Japan, and the Japan Society of Logopedics and Phoniatrics.