Supervisory Data Alignment for Text-Independent Voice Conversion

Jianhua Tao, Member, IEEE, Meng Zhang, Jani Nurminen, Jilei Tian, and Xia Wang
Abstract—We propose new supervisory data alignment methods for text-independent voice conversion which do not need parallel training corpora. Phonetic information is used as a restriction during alignment for mapping the data from the source speaker onto the parameter space of a target speaker. Both linear and nonlinear methods are derived by considering alignment accuracy and topology preservation. For the linear alignment, we consider common phoneme clusters of the source and target space as benchmarks and adapt the source data vector to the target space while maintaining the relative phonetic positions among neighborhood clusters. In order to preserve the topological structure of the source parameter space and improve the stability of conversion and the accuracy of the phonetic mapping, a supervised self-organizing learning algorithm considering phonetic restriction is proposed for iteratively improving the alignment outcome of the previous step. Both the linear and nonlinear methods can also be applied in the cross-lingual case. Evaluation results show that the proposed methods improve the performance of alignment in terms of both alignment accuracy and stability for text-independent voice conversion in intra-lingual and cross-lingual cases.

Index Terms—Data alignment, self-organized learning, supervisory phonetic restriction, text-independent voice conversion.
I. INTRODUCTION
Voice conversion is a technique that is used to transform the voice of one speaker so that it is perceived as the voice of another speaker. There are many existing transformation approaches, such as the use of vector quantization [1]–[3], Gaussian mixture models [4]–[7], pitch-synchronous overlap-add [8], artificial neural networks [9], and multiple functions [10], [11]. All these techniques have two common stages: training and transformation. In the training stage, the voice conversion system gathers information on the voices of the source and target speakers and automatically formulates voice conversion rules. For this purpose, a process called data alignment is required, in which a relationship between the acoustic parameter spaces of the two speakers is estimated.
Manuscript received May 13, 2009; revised November 09, 2009. Current version published June 16, 2010. This work was supported in part by the National Natural Science Foundation of China under Grants 60575032, 60873160, and 90820303, and by the 863 Program under Grants 2006AA01Z138 and 2009AA01Z320. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tomoki Toda.

J. H. Tao and M. Zhang are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]).

J. Nurminen is with Nokia Devices R&D, 33720 Tampere, Finland (e-mail: [email protected]).

J. Tian and X. Wang are with the Nokia Research Center, Beijing 100176, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TASL.2010.2041688
The transformation stage employs the mapping obtained in the training stage to modify the source voice so that it matches the characteristics of the target speaker. The majority of methods proposed in the existing literature assume the availability of parallel training sentences, referred to as a text-dependent corpus, for the source and target speakers. In these approaches, the source and target voices can be aligned using, for example, dynamic time warping [4]. For research purposes, the requirement of having parallel speech databases is not prohibitive, but from the viewpoint of potential practical applications, this requirement is rather inconvenient and sometimes hard to fulfill. Moreover, in some applications it may even be impossible to obtain parallel speech corpora, e.g., in cross-lingual voice conversion, where the source and target speakers speak different languages. To address this problem, text-independent voice conversion techniques using nonparallel databases have been developed.

The major challenge for text-independent voice conversion is how to align the corresponding training data to obtain a mapping function. In text-dependent conversion, aligned data can be derived from parallel speech. This parallelism is necessary because otherwise there is no guarantee that the phonetic content remains unchanged during conversion [12]. A nonparallel speech database, however, has no such inherent time alignment. What is more troublesome is that sometimes the phoneme sets of the source and target voices differ from each other. For instance, in an incomplete or cross-lingual speech database, some phonemes in the source voice are missing for the target and/or vice versa.

In recent years, various approaches for text-independent voice conversion have been proposed. Among these methods, conversion has been attempted using speech recognition [13], unit selection [12], a codebook [14], and vocal tract length normalization (VTLN) [11], among a variety of other techniques. Although some promising results were achieved, it is expensive to use speech recognition or unit selection techniques in voice conversion. VTLN omits the data alignment procedure and uses multifunctional frequency warping to transform the source spectrum directly; it is fast and efficient but lacks accuracy. To obtain better accuracy, data alignment can be regarded as a crucial part of voice conversion. In the conventional framework, many studies align training data on the basis of a similarity measurement, such as the spectral distance [15], [16]. As summarized in [15], the more similar the corresponding source and target vectors are, the less speaker-dependent information can be taken from them. Thus, the use of a similarity criterion has its limitations.

There are two aspects of differences in the characteristics of two speakers: speaker individuality and phonetic content differences.
In the data alignment of voice conversion, phonetic content differences should be isolated so that only speaker individuality is retained for mapping. Phonetic information can then be used to check phonetic content differences and to increase mapping accuracy. Moreover, as the data trajectories of source and target speech are continuous, a mapping that preserves the topological structure of the parameter space is beneficial for ensuring the stability of conversion from the source space to the target space.

In this paper, we propose a new data alignment method for text-independent voice conversion that considers both phonetic accuracy and preservation of the internal topology of the parameter space. To ensure accuracy, we use phonetic labels of the training data as supervisory information, and a phonetic restriction is imposed for supervised alignment. First, a mapping between the source and target parameter spaces is established using weighted linear alignment based on common phoneme clusters. These phoneme clusters shared by the source and target speech are regarded as anchors for the mapping, and several of the phonetic clusters nearest to each vector are taken into account simultaneously to ensure mapping continuity. Furthermore, to fine-tune and improve the alignment, we propose a nonlinear data alignment that uses a self-organizing iterative learning algorithm. The result of the linear alignment is used as the initialization of the iterative learning. The algorithm establishes an optimal balance between the phonetic restriction and the preservation of topology, and it thus maintains alignment accuracy and stability. As the nonlinear alignment results are self-organized, the underlying internal structures of the source and target spaces can be associated. As an extension to cross-lingual voice conversion, a manifold expansion algorithm is used in the nonlinear data alignment for the cross-lingual case with topology preservation.

The remainder of the paper is organized as follows. In Section II, we present the problem of data alignment for text-independent voice conversion and discuss what characteristics the data alignment function should have. In Sections III and IV, we describe our proposed data alignment methods, which consider both phonetic accuracy and the topology of the source and target parameter spaces. Section V discusses the proposed methods. Experimental results are presented in Section VI. Finally, our conclusions are given in Section VII.
II. ASPECTS OF THE PROBLEM

A. Phonetic Distributions for Different Speakers

From a mathematical point of view, the alignment of training data in voice conversion aims to find a binary relation between two vector sets. Considering the definition of voice conversion, the transform should be phonetically motivated so as to retain the content of the input speech. In conventional text-dependent voice conversion using a parallel database, phonetic accuracy is guaranteed by the inherent phonetic parallelism of the training data. For text-independent voice conversion, phonetic labeling information can be useful for removing the differences in phonetic content from the observations.

Fig. 1. F1–F2 distributions for different vowels and speakers.
Fig. 1 shows the distribution of the first formant (F1) and second formant (F2) for different vowels pronounced by four speakers (two males and two females). We see that the phonetic distributions for the different speakers are similar, and much of the literature supports this observation. The F1–F2 plane, also known as the vocalic triangle, has been used as a standard way of comparing vowel quality in many fields [17], [18] since it was first proposed in [19] and [20]. The International Phonetic Alphabet (IPA) vowel diagram [21] describes vowels in terms of the common features of height and backness, which have strong relations with F1 and F2. These works imply that different speakers have similar distributions of phonemes, and the phonetic structure can thus be unified as a common reference. Although the acoustic parameters of phonemes differ among speakers, their distribution is relatively stable. This similar phonetic distribution motivates us to use common phonemes as the baseline for the data alignment of a nonparallel training corpus. With a supervised restriction based on common phonemes, the phonetic distribution between two speakers is preserved and the accuracy of the mapping can be improved. In a nonparallel corpus, according to the different overlaps of the source and target phoneme sets, there are generally five types of combinations:
$$
\begin{aligned}
&\text{1) } \Phi_s = \Phi_t; \qquad
\text{2) } \Phi_s \subset \Phi_t; \qquad
\text{3) } \Phi_s \supset \Phi_t; \\
&\text{4) } \Phi_c \neq \emptyset,\ \bar{\Phi}_s \neq \emptyset,\ \bar{\Phi}_t \neq \emptyset; \qquad
\text{5) } \Phi_c = \emptyset
\end{aligned} \tag{1}
$$
where $\Phi_s$ and $\Phi_t$ denote the phoneme sets of the source and target speakers, respectively, $\Phi_c = \Phi_s \cap \Phi_t$ is their common set, and $\bar{\Phi}_s = \Phi_s \setminus \Phi_c$ and $\bar{\Phi}_t = \Phi_t \setminus \Phi_c$ are the mismatched phoneme sets of the two speakers. Because the conversion is a transformation of data from source to target, the mismatched phonemes of the target do not greatly affect the alignment procedure. Based on this consideration, the relevant combinations can be regrouped into three kinds:
$$
\text{case 1: } \bar{\Phi}_s = \emptyset; \qquad
\text{case 2: } \bar{\Phi}_s \neq \emptyset,\ \Phi_c \neq \emptyset; \qquad
\text{case 3: } \Phi_c = \emptyset \tag{2}
$$
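As a concrete illustration of this case analysis, the following minimal Python sketch (illustrative function names and toy inventories, not from the paper) classifies a source/target phoneme-set pair into the three cases of (2), which are discussed next:

```python
def classify_overlap(source_phonemes, target_phonemes):
    """Classify a nonparallel corpus pair into the three cases of Eq. (2).

    Only source-side mismatches matter, since conversion maps
    source data onto the target space.
    """
    src, tgt = set(source_phonemes), set(target_phonemes)
    common = src & tgt              # common set
    src_mismatched = src - common   # source phonemes missing in the target

    if not common:
        return 3                    # no shared phonemes: similarity-based fallback
    if src_mismatched:
        return 2                    # e.g., cross-lingual conversion
    return 1                        # every source phoneme has a target counterpart

# Toy inventories (illustrative only)
print(classify_overlap({"a", "i", "u"}, {"a", "i", "u", "e"}))  # -> 1
print(classify_overlap({"a", "i", "y"}, {"a", "i", "u"}))       # -> 2
print(classify_overlap({"p", "t"}, {"a", "i"}))                 # -> 3
```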
In case 1 of (2), both the source and target speech cover all phonemes. In case 2, there are mismatched phonemes of the source that have no mapping targets, which is the situation in cross-lingual voice conversion. We can first train a mapping function or build a mapping codebook using speech data of the shared phonemes and then adapt the mapping criterion to the mismatched phonemes. Case 3 is the most difficult: the phonetic content of the speech is useless because there are no shared phonemes between the source and target. For this case, we can align data according to acoustic feature similarity [16]. Since most practical real-world scenarios fall into one of the first two cases, we mainly focus on alignment for these cases in this paper.

B. Alignment Stability

To ensure that the conversion is stable, it is beneficial if the alignment function is continuous so that smoothness can be preserved in the mapping. The targets corresponding to a source vector and its neighboring vectors in the source space should be close to each other in the target space:

$$\| F(x_1) - F(x_2) \| = O\left( \| x_1 - x_2 \| \right) \tag{3}$$

where $F$ is the alignment function and $x_1$ and $x_2$ are source vectors. The O-notation denotes an asymptotic upper bound. The conversion function should not distort a vector's neighborhood structure, so that we can obtain smoothly converted speech. In other words, transforming speech data from the source space to the target space with topology preservation can improve the smoothness and stability of the conversion.

III. LINEAR ALIGNMENT BASED ON PHONEME CLUSTERS

To use phonetic information in data alignment, the training data for each speaker are separated into different phoneme clusters. The distribution of a source phoneme cluster in the common set can be adapted to that of the corresponding target phoneme cluster. Based on this, we create a relation between the common phoneme distributions of the source and target. Fig. 2 illustrates the general idea of linear alignment based on phoneme clusters.

The distribution of each cluster is regarded as a single Gaussian. The acoustic parameter distributions of these clusters are then denoted as $N(\mu_i^s, \Sigma_i^s)$ and $N(\mu_i^t, \Sigma_i^t)$, $i = 1, \dots, C$, where $\mu_i^s$, $\Sigma_i^s$, $\mu_i^t$, and $\Sigma_i^t$ represent the mean and covariance of the source and target common phoneme $i$, respectively, and $C$ denotes the number of common phonemes. The acoustic parameters used in the mapping are represented as line spectral frequencies (LSFs). A perceptually motivated spectral distance between two LSF vectors $f$ and $\hat{f}$ is employed as follows [22]:

$$d(f, \hat{f}) = \sum_{i=1}^{P} w_i \left( f_i - \hat{f}_i \right)^2, \qquad w_i = \frac{1}{f_i - f_{i-1}} + \frac{1}{f_{i+1} - f_i} \tag{4}$$
Fig. 2. Framework for linear alignment based on phoneme clusters. (a) Alignment based on phoneme clusters for case 1. (b) Alignment based on phoneme clusters for case 2.
where $P$ is the dimension of the LSF vectors and $f_0$ and $f_{P+1}$ are fixed boundary values (e.g., $0$ and $\pi$ for normalized frequencies).

In the phoneme-cluster-based alignment procedure, for each source vector $x_n^{s,p}$ (where the superscript $p$ indicates that the frame belongs to phoneme $p$), we find its $M$ nearest source clusters $N(\mu_{k_m}^s, \Sigma_{k_m}^s)$, $m = 1, \dots, M$, $k_m \in \{1, \dots, C\}$. Residual vectors between the vector and the centroids of the selected nearest source clusters are calculated. Each source vector can then be described as

$$x_n^{s,p} = \sum_{m=1}^{M} w_m \left( \mu_{k_m}^s + r_m \right) \tag{5}$$

where

$$w_m = \frac{N(x_n^{s,p}; \mu_{k_m}^s, \Sigma_{k_m}^s)}{\sum_{j=1}^{M} N(x_n^{s,p}; \mu_{k_j}^s, \Sigma_{k_j}^s)} \tag{6}$$

$$r_m = x_n^{s,p} - \mu_{k_m}^s. \tag{7}$$

According to the relation between the source and target clusters, we extend (5) to the target space by adapting the selected source clusters using their corresponding target clusters. Thus, the reference target vector $\hat{x}_n^t$ corresponding to $x_n^{s,p}$ is calculated as

$$\hat{x}_n^t = \sum_{m=1}^{M} w_m \left( \mu_{k_m}^t + \Sigma_{k_m}^t \left( \Sigma_{k_m}^s \right)^{-1} r_m \right). \tag{8}$$

Finally, the data are aligned by finding the target vector closest to the reference result:

$$x_n^t = \arg\min_{y \in Y_t} d\left( y, \hat{x}_n^t \right) \tag{9}$$

where $Y_t$ is the set of target training vectors and $d(\cdot, \cdot)$ is the distance in (4).
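To make the procedure concrete, here is a compact Python sketch of the cluster-based mapping in (4)–(9). It follows our reconstruction above; the Gaussian-posterior weights in (6) and the covariance adaptation in (8) are assumptions of that reconstruction, and all names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def lsf_distance(f, f_hat):
    """Perceptually weighted LSF distance of Eq. (4) (reconstructed form)."""
    # Boundary values for the weight computation (normalized frequency in [0, pi]).
    fb = np.concatenate(([0.0], f, [np.pi]))
    w = 1.0 / (fb[1:-1] - fb[:-2]) + 1.0 / (fb[2:] - fb[1:-1])
    return float(np.sum(w * (f - f_hat) ** 2))

def align_vector(x, src_means, src_covs, tgt_means, tgt_covs, tgt_data, M=3):
    """Map one source LSF vector to the target space, Eqs. (5)-(9)."""
    C = len(src_means)
    # M nearest source clusters by distance to the centroids.
    order = np.argsort([np.linalg.norm(x - src_means[k]) for k in range(C)])[:M]
    # Eq. (6): weights as normalized Gaussian likelihoods (assumed form).
    lik = np.array([multivariate_normal.pdf(x, src_means[k], src_covs[k])
                    for k in order])
    w = lik / lik.sum()
    # Eqs. (7)-(8): residuals adapted cluster by cluster into the target space.
    x_ref = np.zeros_like(x)
    for wm, k in zip(w, order):
        r = x - src_means[k]                              # Eq. (7)
        adapt = tgt_covs[k] @ np.linalg.inv(src_covs[k])  # covariance adaptation
        x_ref += wm * (tgt_means[k] + adapt @ r)          # Eq. (8)
    # Eq. (9): snap the reference vector to the nearest real target vector.
    dists = [lsf_distance(y, x_ref) for y in tgt_data]
    return tgt_data[int(np.argmin(dists))], x_ref
```

Snapping to a real target vector in (9), rather than using the reference vector of (8) directly, avoids the spectral distortion discussed next.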
The proposed alignment function can be used directly as a transformation function. However, the reference target vector $\hat{x}_n^t$ is not used as the final result, because using (8) directly may distort the spectral structure, which may degrade the quality of the converted speech.

Fig. 2 shows the linear data alignment for cases 1 and 2 in (2). Both alignments can be conducted under the same framework by considering different phoneme sets to be the "common phoneme set." In Fig. 2(a), each source vector is mapped to the target space, whereas in Fig. 2(b), the whole mismatched phoneme cluster is mapped to the target space by the same algorithm.

Combining (6) and (8), we obtain

$$\hat{x}_n^t = \sum_{m=1}^{M} P\left( k_m \mid x_n^s \right) \left( \mu_{k_m}^t + \Sigma_{k_m}^t \left( \Sigma_{k_m}^s \right)^{-1} \left( x_n^s - \mu_{k_m}^s \right) \right) \tag{10}$$

where $P(k_m \mid x_n^s) = w_m$ can be regarded as the posterior probability of $x_n^s$ belonging to cluster $k_m$. It is obvious that this equation has a format similar to that of the transformation function generated by the Gaussian mixture model [5]:

$$F(x) = \sum_{i=1}^{L} p_i(x) \left[ \mu_i^t + \Sigma_i^{ts} \left( \Sigma_i^{ss} \right)^{-1} \left( x - \mu_i^s \right) \right] \tag{11}$$

where $\mu_i^s$, $\mu_i^t$, $\Sigma_i^{ss}$, and $\Sigma_i^{ts}$ are the mean vectors and covariance matrices of the $i$th mixture component of the joint vector $z = [x^\top, y^\top]^\top$, and $p_i(x)$ is the posterior probability of the $i$th component. Therefore, the alignment function can be considered a kind of transformation function based on phoneme cluster mixing. As the alignment function allows soft clustering through the consideration of multiple nearest phoneme clusters, it can provide a smooth and stable alignment between the source and target parameter spaces. A similar approach to ours is the vector field smoothing (VFS) method [23], [24], which is usually used to generalize the speaker model for speaker-independent speech recognition or to adapt the acoustic model for multi-speaker parametric speech synthesis.

IV. TOPOLOGY-PRESERVING NONLINEAR ALIGNMENT

To demonstrate the idea of the proposed phoneme-cluster-based linear alignment algorithm, we apply it to an artificial dataset with label information and compare its performance with that of similarity-based alignment [16]. The toy dataset is chosen to be two-dimensional for convenient visual presentation and is shown in Fig. 3. From the results, it can be seen that most source vectors obtain reasonable targets with higher accuracy than that achieved by the similarity-based method. However, we find that source vectors near the cluster boundaries may have worse mapping results than those close to the cluster centroids. These worse mappings are caused mainly by the weighted linear averaging over the selected nearest clusters.
Fig. 3. Effect of phoneme-cluster-based linear alignment. Note that M = 3 in (d). Dash-dotted and dotted lines are the boundaries between classes in the source and target datasets, respectively. (a) Source parameters. (b) Target parameters. (c) Similarity-based alignment. (d) Phoneme-cluster-based alignment.
While the distances of vectors in the cluster-based alignment are measured in Euclidean space, which preserves the relative positions of the clusters, they induce a rigid transform and impose only a weak restriction on cluster labeling in the boundary areas between different clusters. A looser restriction based on topology, which describes the inner connections without a rigid description, is more suitable for representing the relations among training data vectors and helps solve this problem.

A. Topology of the Parameter Space

The topological structure of the parameter space, which is also referred to as the manifold of the data in our work, is intended to encompass an underlying structure hidden in a high-dimensional dataset. A manifold is a mathematical space in which every point has a neighborhood that "resembles" Euclidean space [25]. In data mining for an unknown dataset, we cannot easily obtain a mathematical description of the properties of an object, but there is a method for approximating the manifold that the object dataset lies on: building a data graph.

K-Nearest Neighbor (K-NN) Graph: Let $X = \{x_1, \dots, x_N\}$ be the dataset concerned. The (1-)nearest neighborhood of $x \in X$ is given by

$$NN_1(x) = \left\{ y \in X \setminus \{x\} : d(x, y) \le d(x, z) \ \ \forall z \in X \setminus \{x\} \right\} \tag{12}$$

where $d(x, y)$ is the normal Euclidean distance between vectors $x$ and $y$. For an integer $K > 1$, the $K$-nearest neighborhood of a data vector is defined in a similar way. The K-NN graph puts an edge between each data vector in $X$ and its $K$ nearest neighbors [26].
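The data graph itself is straightforward to build. Below is a short Python sketch (illustrative names; a brute-force O(N²) construction, adequate for the few thousand vectors used later in the paper):

```python
import numpy as np

def knn_graph(X, K=5):
    """Build a K-NN graph as an adjacency set per vector.

    X: (N, D) array of data vectors.
    Returns a list where entry i is the set of indices of the
    K nearest neighbors of X[i] under the Euclidean distance.
    """
    # Pairwise Euclidean distances (brute force).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)      # a vector is not its own neighbor
    # The K smallest distances per row give the neighbor indices.
    return [set(np.argsort(row)[:K]) for row in dist]

# A grid distance between units i and j can then be taken as the
# shortest-path length in this graph (e.g., via breadth-first search).
```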
B. Self-Organizing-Based Iterative Learning

A self-organizing map (SOM) is a type of artificial neural network trained using unsupervised learning, which can be interpreted as a topology-preserving mapping from the input space onto the grid of a map [27]. It consists of a regular low-dimensional grid of map units. Each unit is represented by a model vector, which has the same dimension as the input vectors. Adjacent units are connected by a neighborhood relation. The number of map units, which typically varies from a few dozen to several thousand, determines the accuracy and generalization capability of the SOM. During training, the SOM forms an elastic net that folds onto the "cloud" formed by the input data. Data points lying near each other in the input space are mapped onto nearby map units.

Self-organizing iterative learning is used in this section to build a topology-preserving mapping from the source parameter space onto the target parameter space by defining the source vectors as map units. Through the iterative learning process, an elastic and soft mapping from the source space to the target space is achieved. Fig. 4 presents the general concept of the iterative learning method. In voice conversion, for an existing mapping of training data from source vectors to targets, where each reference vector $m_i$ represents the alignment relation of a source vector $s_i$ and the target space, the source vectors are regarded as units of the map even though the source space is not low-dimensional. The data graph, which is a manifold of the source dataset, is treated as the grid between units in self-organizing learning. The references $m_i$ and the target vectors are considered as the model vectors and input data vectors, respectively. Thus, the target space is used as the input space in the iterative learning process.

1) Iterative Learning: For each $n$-dimensional input vector $x$, we find the nearest model vector according to the Euclidean distance using

$$c = \arg\min_i \| x - m_i \| \tag{13}$$

Fig. 4. General concept of the proposed self-organizing iterative learning alignment.

where $m_c$ is called the best-matching unit (BMU). The task of iterative learning is to define the model vectors in such a way that the mapping from the model vectors to the input data vectors is ordered and descriptive of the distribution of the input vectors. In our work, the task is to define the reference vectors from the source data by considering the distribution of the target data. After the BMU has been found, all model vectors are updated according to this criterion. The BMU itself as well as its topological neighbors are moved closer to the input vector in the input space, i.e., they are attracted to the input vector. At the $t$th iteration, the model vector $m_i(t+1)$ is calculated from the old value $m_i(t)$ and the input data $x(t)$. The update rule is

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t) \left[ x(t) - m_i(t) \right] \tag{14}$$

Here, $\alpha(t)$ is a monotonically decreasing learning coefficient which governs the learning rate. The factor $h_{ci}(t)$ is a kind of smoothing kernel, also called the neighborhood function. The value of $h_{ci}(t)$ decreases with an increase in the distance between $s_c$ and $s_i$ in the grid, where $s_c$ and $s_i$ are the source vectors of the reference vectors $m_c$ and $m_i$, respectively. The spatial width of the kernel in the source space decreases as the iteration index increases. The characteristics and effectiveness of the neighborhood function are described in the next subsection.

Generally, iterative learning starts with random initialization and usually requires thousands of iterations for proper results, which is time consuming. An approximate initial value reduces the training time and makes the process more reasonable. The result of the phoneme-cluster-based linear alignment is therefore employed as the initialization of the iterative learning. As the result of the linear alignment is already approximately "right," effective and efficient output can be obtained without an excessive number of iterations. By updating the model vectors iteratively according to randomly selected input vectors (target vectors), the reference vectors are regulated to agree better with the distribution of the target vectors while maintaining the topology of the source space.

2) Neighborhood Function $h_{ci}(t)$: The neighborhood function is an important factor in iterative learning. If the spatial width of the kernel is small, only the BMU is modified so that it becomes even more similar to the input vector; iterative learning is then a form of competitive learning. If the width is large, not only the BMU but also its neighboring model vectors shift towards the input data vector, and learning is cooperative. Thus, the neighborhood function determines the degree of topology preservation during mapping. It takes into account the distances between the source vectors within the spatial width of the kernel and modulates the corresponding reference vectors simultaneously, so that the topological structure of the source parameter space is considered. In our work, we use a Gaussian function as the neighborhood function:

$$h_{ci}(t) = \exp\left( -\frac{d_g(s_c, s_i)^2}{2\sigma^2(t)} \right) \tag{15}$$

where $\sigma(t)$ decreases with an increase in the iteration index $t$, and $d_g(s_c, s_i)$ denotes the distance in the grid between $s_c$ and $s_i$.
Fig. 5. Effect of self-organized iterative alignment. Note that the dataset is the same as that in Fig. 3.
In using a phonetically guided restriction in iterative learning, source and target vectors belonging to the same phoneme are more likely to be aligned. Thus, a reference vector of source data should be moved towards the cluster of the same phoneme in the target space. To increase the phonetic accuracy during alignment, we define a phonetic function
$$g(x, m_i) = \begin{cases} 1, & p^{x} = p^{m_i} \\ r, & p^{x} \neq p^{m_i} \end{cases} \qquad 0 < r \le 1 \tag{16}$$

The superscripts $x$ and $m_i$ refer to the phoneme indices of the input vector and the model vector, respectively. The smaller $r$ is, the stronger the phonetic restriction becomes. This phonetic function gives different values according to whether the model vector (reference vector) and the input vector (target vector) belong to the same phoneme, so the reference vectors of different phonemes tend to be separated by the boundaries between phoneme clusters. By combining (15) and (16), we get our neighborhood function

$$h_{ci}^{ph}(t) = g\left( x(t), m_i \right) h_{ci}(t). \tag{17}$$

Finally, our update function becomes

$$m_i(t+1) = m_i(t) + \alpha(t)\, g\left( x(t), m_i \right) h_{ci}(t) \left[ x(t) - m_i(t) \right]. \tag{18}$$

3) Effect and Convergence: Fig. 5(a) shows the results of applying the proposed self-organizing nonlinear algorithm to the two-dimensional dataset used in Fig. 3. Fig. 5(b) shows the original topological structure of the converted dataset given by the K-NN graph. Compared with the linear results, the iterative learning regulates the reference targets by considering both topology preservation and labeling accuracy. We can also confirm that preserving the topology is more reasonable than preserving the Euclidean relative positions in data alignment.

The convergence of the SOM has been discussed in many other works [27]–[29]. Generally speaking, the convergence and results of this iterative learning are dominated by the learning rate and the neighborhood function. In a later experiment in Section VI, we demonstrate the convergence when using the phonetic-restriction function in the SOM. The results show that both the quantization error and the topological error decrease as the iteration number increases. We also test the convergence speed using different initializations, from random samples and from the linear alignment results, and find that phoneme-cluster-based linear alignment initialization clearly accelerates convergence.

It should be noted that we are not the first to use a SOM for preserving the topology of acoustic space. Earlier works such as [30] and [31] used it for speaker normalization and speaker adaptation; [32] and [33] also utilized the SOM for voice conversion and achieved promising outcomes, but those methods can only be applied to text-dependent cases, which require a parallel database for training. Compared with these works, the algorithm in this paper employs phonetic information for supervisory learning and achieves good accuracy and stability.
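Extending the earlier SOM sketch with the phonetic factor of (16) only requires gating the neighborhood kernel; a minimal illustration (assuming each unit and each target carries a phoneme label, with an illustrative value of r):

```python
import numpy as np

def phonetic_factor(target_phoneme, unit_phonemes, r=0.2):
    """Eq. (16): damp cross-phoneme updates by a factor r (0 < r <= 1).

    Smaller r means a stronger phonetic restriction; r = 0.2 is purely
    illustrative (the paper studies the effect of r in Fig. 8).
    """
    return np.where(unit_phonemes == target_phoneme, 1.0, r)

# Inside the som_iterate() loop above, Eqs. (17)-(18) then become:
#   h = np.exp(-grid_dist[c] ** 2 / (2.0 * sigma ** 2))   # Eq. (15)
#   h *= phonetic_factor(target_ph[n], unit_ph, r)        # Eq. (17)
#   refs += alpha * h[:, None] * (x - refs)               # Eq. (18)
```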
C. Manifold Expansion

Because the self-organizing learning discussed above requires input data from the target space to regulate the reference vectors, the method is directly suitable only for case 1 in (2). In case 2, there are mismatched phoneme clusters in the source space that have no coordinates in the target space. To map them onto the target parameter space while maintaining the manifold of the data structure in these clusters, we employ a method referred to as manifold expansion, which supplements the self-organizing method for case 2.

Because the manifold resembles Euclidean space locally, a linear transform based on the Euclidean distance, similar to that in Section III, is used in each local area. To expand the alignment relation from the matched data to the mismatched data while maintaining the manifold of the source dataset, the mapping relation is extended to the unknown area of the target space according to the manifold of the source data.

First, the source and target data labeled as common phonemes are aligned using the self-organizing iterative learning method above. The aligned data are then used as benchmarks for the alignment of the remaining data. For each unaligned data vector $x^s$, we find its $K$ nearest neighbors in the grid, $x_1^s, \dots, x_K^s$, of which the first $J$ neighbors have been aligned. In each step, we select the unaligned data vector with the maximum number of aligned neighbors and locate the position of its reference vector in the target space according to the reference vectors of its aligned neighbors. The algorithm is similar to the linear alignment of Section III; the difference is that the anchors are not phoneme clusters but the aligned neighbors of each source vector. We then obtain

$$x^s = \sum_{j=1}^{J} w_j \left( x_j^s + r_j \right) \tag{19}$$
where

$$w_j = \frac{d\left( x^s, x_j^s \right)^{-1}}{\sum_{k=1}^{J} d\left( x^s, x_k^s \right)^{-1}} \tag{20}$$

$$r_j = x^s - x_j^s. \tag{21}$$

The reference vector is calculated using

$$\hat{x}^t = \sum_{j=1}^{J} w_j \left( m_j + r_j \right) \tag{22}$$

where $m_j$ is the reference vector of the aligned neighbor $x_j^s$. After finding the corresponding reference target, $x^s$ is labeled as aligned and can be considered as a labeled neighbor for other source vectors in the next step. The algorithm stops when all the source data vectors are aligned.
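A greedy pass over the unaligned vectors implements this expansion. The sketch below follows our reconstructed (19)–(22); the inverse-distance weights are an assumption of that reconstruction, all names are illustrative, and the loop assumes the source graph is connected so that every step finds at least one aligned anchor:

```python
import numpy as np

def manifold_expand(X_src, neighbors, refs, aligned):
    """Greedy manifold expansion, Eqs. (19)-(22).

    X_src:     (N, D) source vectors.
    neighbors: list of neighbor index lists from the source K-NN graph.
    refs:      (N, D) reference targets; valid only where aligned[i] is True.
    aligned:   (N,) boolean mask (True for common-phoneme data already
               aligned by the self-organizing learning).
    """
    refs, aligned = refs.copy(), aligned.copy()
    while not aligned.all():
        # Pick the unaligned vector with the most aligned neighbors.
        counts = [sum(aligned[j] for j in nb) if not aligned[i] else -1
                  for i, nb in enumerate(neighbors)]
        i = int(np.argmax(counts))
        anchors = [j for j in neighbors[i] if aligned[j]]
        # Eq. (20): inverse-distance weights over the aligned anchors.
        d = np.array([np.linalg.norm(X_src[i] - X_src[j]) for j in anchors])
        w = 1.0 / np.maximum(d, 1e-8)
        w /= w.sum()
        # Eqs. (21)-(22): carry each anchor's residual into the target space.
        refs[i] = sum(wj * (refs[j] + (X_src[i] - X_src[j]))
                      for wj, j in zip(w, anchors))
        aligned[i] = True
    return refs
```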
A problem analogous to data alignment in the cross-lingual case is the out-of-sample problem in machine learning. The corresponding solution is generally to embed new data points into the previously learned result, which is usually a manifold of a dataset [34]. Other related works can be found in [35]–[37].

V. INTERPRETATION

Data alignment for text-independent voice conversion aims to build a mapping relation between the source and target acoustic parameters while maintaining the phonetic content. In a generative alignment algorithm, a separate conversion step can even be left out, since the alignment and the conversion usually have similar formulations and functions. Thus, data alignment for the text-independent case can be considered a kind of unsupervised regression, and the corresponding algorithms and methods can be applied to other fields.

The proposed phoneme-cluster-based linear alignment can also be regarded as a topology-preserving mapping at the level of the phoneme-cluster manifold, since it considers and retains the structure of neighboring clusters. Compared to the phoneme-cluster-based linear alignment, the self-organizing nonlinear alignment does not need to make assumptions about the statistical distribution of the data; the unsupervised distribution of the real training data is employed. The balance between the distance to the neighbors and the distance to the corresponding target for each data vector is self-organized by iterative regulation.

The K-NN graph is based on the nearest neighbors of each data vector according to the local Euclidean distance. However, in a high-dimensional acoustic space, data vectors with different phonetic content may also be very close to one another. For datasets like speech, which is normally denoted as a series (or trajectory) of frames, the order of the data sequence also contains additional information that should be considered. For such datasets, we can connect the data vectors along each trajectory, and a grid is established after all trajectories are processed; we call this a trajectory graph. We can also define a window of width $w$ that moves across the trajectory and links all data within the window at each step.

The SOM was originally used to represent and visualize high-dimensional data using a low-dimensional map. Therefore, the proposed self-organizing alignment can be considered to represent the target data using the source data. Once a suitable representation is found, a reasonable alignment from the source space to the target space is achieved.

Hidden Markov model classification has been used to label data vectors and assign them to phonetic classes in this work. As the requirement on classification accuracy in this algorithm is not strict, this type of classification is not mandatory and could be replaced by less costly classification methods, e.g., classification according to formant information.
VI. EXPERIMENTAL RESULTS

A. Experimental Conditions

Both objective evaluations and listening tests are used to assess our methods. In the objective evaluation, we first present an experiment that demonstrates the convergence of the iterative learning. Similarity-based alignment [16] and the proposed linear and nonlinear alignments are then compared for cases 1 and 2 in terms of the average quantization error, phonetic error, topographic error, and spectral distortion (SD). For the transformation function, which is the other important part of voice conversion, we use the codebook method. It should be noted that the similarity-based method in [16] operated at the state level and used the parameter generation algorithm for synthesizing converted speech; here, the mean vectors of the state pairs generated by the similarity-based method are used as a codebook and the conversion is done at the frame level, so that the three methods are comparable. For the supervised self-organized nonlinear alignment, we test the performance using the topology constructed by a K-NN data graph. The convergence speeds of the iterative learning initialized by random positions and by the linear alignment results are also compared. In the listening tests, the performance of the linear and nonlinear alignments is evaluated using forced-choice ABX and mean opinion score (MOS) methods to assess speaker individuality and speech quality, respectively. The performance of phoneme-cluster-based linear alignment as a function of the number of nearest phonetic anchors has been investigated previously [38].

In most intra-lingual situations, all the phonemes used in both the source and target languages can be considered common phonemes. In the cross-lingual situation, however, some source phonemes are missing in the target language. Thus, in our experiments, we use the intra-lingual and cross-lingual situations to simulate cases 1 and 2. The training data are recorded by four speakers (two male and two female) in British English and, for two of the speakers, also in Mandarin Chinese for the cross-lingual case. While there are 46 phonemes in Chinese speech and 44 phonemes in English speech, only 19 phonemes are common. Each speaker utters 180 sentences per language. Hidden Markov models are used to assign phonetic tags to the data vectors so that phonetic information can be used in the alignment. Afterwards, each training data subset is quantized into about 3000 vectors to reduce the training cost. We use 24-dimensional LSF vectors as the spectral feature. The LSFs are calculated from the spectra obtained using STRAIGHT [39]. The training speech is sampled at 16 kHz.
B. Objective Evaluation

Several criteria are used in the objective evaluation: mean quantization error, phonetic error, topographic error, and spectral distortion.

Mean quantization error (MQE) is the average distance from each data vector to its best-matching reference target:

$$\text{MQE} = \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - m_{c(x_n)} \right\| \tag{23}$$

where $c(\cdot)$ is defined in (13) and $N$ is the number of target vectors.

Phonetic error (PHE) evaluates the phonetic accuracy of the alignment. It is defined as the percentage of target vectors for which the first $K$ BMUs are all located in the wrong phonemes:

$$\text{PHE} = \frac{1}{N} \sum_{n=1}^{N} \prod_{k=1}^{K} u\left( x_n, b_k(x_n) \right) \tag{24}$$

where $b_k(x)$ is the $k$th BMU of $x$ and

$$u(x, m) = \begin{cases} 0, & x \text{ and } m \text{ belong to the same phoneme} \\ 1, & \text{otherwise.} \end{cases}$$

Topographic error (TGE), which originally comes from [40], is defined as the percentage of target vectors whose first $K$ BMUs are not adjacent:

$$\text{TGE} = \frac{1}{N} \sum_{n=1}^{N} v(x_n) \tag{25}$$

where

$$v(x) = \begin{cases} 1, & \text{no adjacent units among the first } K \text{ BMUs of } x \\ 0, & \text{otherwise.} \end{cases}$$

The SD between the converted and target speech is defined as [41]

$$\text{SD} = \sqrt{ \frac{1}{\omega_u - \omega_l} \int_{\omega_l}^{\omega_u} \left( 20 \log_{10} \frac{|S_c(\omega)|}{|S_t(\omega)|} \right)^2 d\omega } \tag{26}$$

where $\omega_l$ and $\omega_u$ denote the lower and upper frequency limits of the integration, and $S_c(\omega)$ and $S_t(\omega)$ are the converted and target spectral responses, respectively.

Fig. 6. Convergence of iterative learning. (a) Quantization error. (b) Topological error.
1) Convergence of Supervised Self-Organizing Learning: The convergence of the supervised self-organizing learning is tested with both MQE and TGE. In the experiment, the learning rate is a linearly decreasing factor set from 0.3 to 0. As the convergence is sensitive to the radius of the kernel, a Gaussian kernel with a radius decreasing from 2 to 1 is used for the neighborhood function. The unit map of the source data is set as a 5-NN data graph, and the phonetically supervised factor $r$ of (16) is applied; the first $K$ BMUs are used for the topographic error. The numbers of source and target vectors used are 3099 and 2985, respectively.

With these parameters, the MQEs and TGEs of the iterative learning initialized with random samples and with the outputs of the phoneme-cluster-based linear alignment are shown in Fig. 6. It is obvious that both the MQE and TGE decrease with an increasing number of iterative steps. Compared to the initialization from the linear alignment, the random initialization achieves a lower TGE but a much higher MQE, which means the reference vectors are likely to assemble in a small area. Such results are far from a reasonable representation of the input dataset, indicating locally optimal results. The MQE and TGE each describe only a part of the performance of the iteration; a reasonable representation of the training dataset is achieved only if both MQE and TGE are low. Fig. 6 also shows that initialization using the proposed linear alignment not only accelerates the convergence but also helps to achieve a reasonable final state. It should be noted that, for linear initialization, the MQE reaches a low range quickly with only a few iterations, and additional iterations make it slightly worse. This is because although the MQE is low at the beginning, the TGE is still high, which may cause unstable alignment. In order to decrease the TGE, the algorithm needs to first sacrifice a little MQE to regulate the topological structure of
Fig. 8. Results of the comparison of the SO-based method with different values of r.
Fig. 7. Comparison of the performances of different alignment methods. Sim refers to the similarity-based alignment, PhC to the phoneme-cluster-based alignment, and SO to the self-organizing alignment. For conversion in the cross-lingual case, SO includes manifold expansion. (a) Alignment performances. (b) Conversion performances.
the alignment, and then pursue both low MQE and TGE simultaneously. The performance of alignment should not be judged by only one factor, because different factors represent different aspects.

2) Comparison of the Alignment Methods: This subsection compares the two data alignment methods proposed in the paper: phoneme-cluster-based linear alignment and self-organizing nonlinear alignment. A previous method based on similarity alignment is also considered as a baseline. The alignment results in terms of MQE, TGE, and PHE are shown in Fig. 7(a), and the conversion results in terms of SD are shown in Fig. 7(b). In the experiment, the numbers of BMUs considered in the topographic error and in the phonetic error are set to 5 and 15, respectively. The number $M$ of nearest phoneme clusters considered in the cluster-based linear alignment is set to 10. The number of iterations in the self-organized learning is kept fixed.

From the results presented in Fig. 7(a), we see that the TGE of similarity-based (Sim) alignment is smaller than that of phoneme-cluster-based (PhC) linear alignment. The reason is that the topological structure of an area can easily be preserved when using only the spectral distance between source and target in similarity-based alignment. However, this method produces higher PHE and MQE than the others, which means the similarity-based method is weaker than the other two methods in describing the characteristics of the speakers. Phoneme-cluster-based alignment uses phonetic information and has a lower PHE than the similarity-based method.
However, it has the highest TGE as a result of considering only the relative positions of the phoneme clusters. From the figure, we see that the supervised self-organizing (SO) nonlinear alignment attains both the highest accuracy and the highest stability, with the lowest MQE, TGE, and PHE simultaneously. However, the final MQE and TGE values depend on the settings of the learning factor and the neighborhood function. Fig. 7(b) shows the SD between the converted speech and the target speech. In both the intra-lingual and cross-lingual cases, the conversion using the self-organizing nonlinear alignment method achieves the smallest SD.

3) Performance of the SO-Based Method With Different $r$: To test the performance of the SO-based method with different values of $r$ in (16), we run a series of experiments with different $r$ settings while keeping all other parameters constant. The results are shown in Fig. 8. When $r$ increases, the PHE increases whereas the MQE and TGE decrease. When $r$ is smaller, the phonetic restriction is stronger, and the reference vectors are more likely to be moved near input vectors belonging to the same phoneme rather than their original neighboring vectors; although the PHE is then low, the TGE and MQE are high, especially the TGE. When $r$ is larger, the algorithm tends to generate a more stable alignment, so the MQE and TGE are lower whereas the PHE is relatively higher.

C. Evaluation Using Listening Tests

In the listening tests for each case, 20 sentences are converted and scored by 14 subjects. The performance of speaker identity conversion is tested using the ABX method and the speech quality using the MOS method. In the ABX test, subjects were asked to score each converted utterance with 0 for similarity to the source and 1 for similarity to the target. In the MOS test, subjects rated the speech quality on a five-point scale (1 for bad, 2 for poor, 3 for fair, 4 for good, 5 for excellent). The final ABX and MOS results are the averages over all subjects.

During the voice conversion, the $F_0$ was converted only through a mean-variance transformation, which converts the pitch contour of the source speech into a converted pitch contour having the characteristics of the target speech. The pitch transform is calculated as

$$\log f_0' = \mu_t + \frac{\sigma_t}{\sigma_s} \left( \log f_0 - \mu_s \right) \tag{27}$$

where $\mu_s$, $\mu_t$ and $\sigma_s^2$, $\sigma_t^2$ are the means and variances of the log-pitch values of the source and target speakers, respectively.
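The mean-variance pitch transform of (27) is a one-liner in practice; a minimal sketch (illustrative names, operating on voiced-frame F0 values in Hz):

```python
import numpy as np

def convert_pitch(f0_src, src_stats, tgt_stats):
    """Mean-variance log-F0 transformation, Eq. (27).

    f0_src:    array of voiced-frame F0 values (Hz) to convert.
    src_stats: (mean, std) of the source speaker's log-F0, from training data.
    tgt_stats: (mean, std) of the target speaker's log-F0.
    """
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    log_f0 = np.log(f0_src)
    # Normalize to the source statistics, then rescale to the target's.
    return np.exp(mu_t + (sigma_t / sigma_s) * (log_f0 - mu_s))
```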
Fig. 9. Results of the ABX test evaluation.
Fig. 10. Results of the MOS test evaluation.
To reduce the influence of the prosody, the listeners were asked not to consider the prosody of the speech.

Fig. 9 shows the ABX test results. These evaluation results agree with the objective results: self-organizing alignment achieves the best performance of the three methods. Generally, the cross-gender conversion results are better than those of the intra-gender case. The similarity-based alignment produced the worst results owing to its lack of a phonetic restriction. The results of the self-organizing methods are better than those of the phoneme-cluster-based method, which confirms that having more supervised information in the local area gives better performance.

From the MOS test results shown in Fig. 10, we see that the performance of the similarity-based method is almost the same as that of the phoneme-cluster-based method. As the similarity-based alignment does not greatly alter the topological structure of the source data, the quality of its converted speech is even slightly better in the intra-lingual case. The quality of speech converted using the self-organizing alignment and manifold expansion is better than that achieved by the other two methods. However, the benefit of self-organizing manifold expansion in the cross-lingual case is not as great as that of the self-organizing method in the intra-lingual case, since the mismatched data are mapped to the target space in a manner similar to that used in phoneme-cluster-based alignment.

To compare the conversion results between a parallel training corpus and a nonparallel training corpus, we performed ABX and MOS experiments in our previous work [16], in which we compared the performance of similarity-based alignment conversion with that of a conventional text-dependent voice conversion (see Fig. 11). Both the parallel and nonparallel corpora contain 180
Fig. 11. Results of the listening test for the conversion between parallel corpus and nonparallel corpus [16]. T-d represents conventional codebook-based text-dependent conversion. Sim means similarity-based conversion using a nonparallel training corpus. (a) ABX test. (b) MOS test.
sentences, the same size as in this paper. Those experimental results showed that the conversion performance of the original text-dependent voice conversion is only marginally better than that of text-independent voice conversion, and that the similarity-based approach gives better speech quality than the original text-dependent codebook-based voice conversion. Note that the similarity-based experiment in [16] was carried out with synthesis by the parameter generation algorithm rather than a codebook, leading to better quality. By considering the results of the previous work (Fig. 11) and the new results (Figs. 9 and 10) presented in this paper, one can see that the performance of conversion based on the new text-independent alignment methods is similar to the performance of conversion based on a parallel database.

VII. CONCLUSION

This paper proposes supervised data alignment for text-independent voice conversion that considers a phonetic restriction between the source and target speech. Both linear and nonlinear methods are derived by considering the alignment accuracy and the preservation of topology during the mapping. For the linear alignment, we use phoneme clusters common to the source and target space to adapt the source data vectors into the target space while maintaining the relative phonetic positions.
Mismatched data in the cross-lingual conversion case can also be mapped into the target space under the same framework by considering different common phoneme sets. To preserve the topological structure of the source parameter space and to improve the stability of conversion and the accuracy of the phonetic mapping, a supervised self-organizing learning method that considers the phonetic restriction is proposed for iteratively regulating the training results. Evaluation results show that the proposed methods improve the alignment both on a visualizable artificial dataset and in real alignment tasks for nonparallel corpora in the intra-lingual and cross-lingual cases.

In our current work, we align the training data of the source and target speech frame by frame, but other extensions are to be studied later. Prosody conversion, which is another important aspect of voice conversion, could benefit from the use of hyper-frame information. A common inner structure at a higher level will be considered in our future work.

ACKNOWLEDGMENT

The authors would like to thank the students of the National Laboratory of Pattern Recognition for their cooperation in the experiments.

REFERENCES

[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," J. Acoust. Soc. Jpn. (E), vol. 11, no. 2, pp. 71–76, 1990.
[2] S. Nakamura and K. Shikano, "Speaker adaptation applied to HMM and neural networks," in Proc. ICASSP, Glasgow, U.K., May 1989, pp. 89–92.
[3] L. M. Arslan and D. Talkin, "Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum," in Proc. Eurospeech'97, Rhodes, Greece, 1997.
[4] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, Mar. 1998.
[5] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, Seattle, WA, May 1998, pp. 285–288.
[6] T. Toda, H. Saruwatari, and K. Shikano, "Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum," in Proc. ICASSP, 2001, pp. 841–944.
[7] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2222–2235, Nov. 2007.
[8] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA technique," Speech Commun., vol. 11, no. 2–3, pp. 175–187, 1992.
[9] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, "Transformation of formants for voice conversion using artificial neural networks," Speech Commun., vol. 16, no. 2, pp. 207–216, 1995.
[10] N. Iwahashi and Y. Sagisaka, "Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks," Speech Commun., vol. 16, no. 2, pp. 139–151, 1995.
[11] D. Suendermann, H. Ney, and H. Hoege, "VTLN-based cross-language voice conversion," in Proc. ASRU'03, Virgin Islands, 2003.
[12] D. Suendermann, H. Hoege, A. Bonafonte, H. Ney, A. Black, and S. Narayanan, "Text-independent voice conversion based on unit selection," in Proc. ICASSP'06, 2006.
[13] H. Ye and S. J. Young, "Voice conversion for unknown speakers," in Proc. ICSLP'04, 2004.
[14] L. M. Arslan, "Speaker transformation algorithm using segmental codebooks," Speech Commun., vol. 28, pp. 211–226, 1999.
[15] D. Sündermann, H. Höge, A. Bonafonte, H. Ney, and J. Hirschberg, "Text-independent cross-language voice conversion," in Proc. ICSLP, 2006.
[16] M. Zhang, J. Tao, J. Tian, and X. Wang, "Text-independent voice conversion based on state mapped codebook," in Proc. ICASSP'08, 2008.
[17] P. Ladefoged, Preliminaries to Linguistic Phonetics. Chicago, IL: Univ. of Chicago Press, 1971.
[18] W. Labov, Principles of Linguistic Change, Vol. II: Social Factors. Oxford, U.K.: Blackwell, 2001.
[19] C. Essner, "Recherche sur la structure des voyelles orales," Archives Néerlandaises de Phonétique Expérimentale, vol. 20, pp. 40–77, 1947.
[20] M. Joos, "Acoustic phonetics," Language, vol. 24, pp. 1–136, 1948.
[21] Handbook of the International Phonetic Association. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[22] R. Laroia, N. Phamdo, and N. Farvardin, "Robust and efficient quantization of speech LSF parameters using structured vector quantizers," in Proc. ICASSP'91, 1991.
[23] J. Takahashi and S. Sagayama, "Vector-field-smoothed Bayesian learning for incremental speaker adaptation," in Proc. ICASSP'95, 1995, vol. 1, pp. 696–699.
[24] J.-I. Takahashi and S. Sagayama, "Vector-field-smoothed Bayesian learning for incremental speaker adaptation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 1995, vol. 1, pp. 696–699.
[25] J. M. Lee, Introduction to Topological Manifolds (Graduate Texts in Mathematics, vol. 202). New York: Springer, 2000.
[26] J. A. Costa and A. O. Hero, "Manifold learning using Euclidean K-nearest neighbor graphs," in Proc. ICASSP, 2004.
[27] T. Kohonen, Self-Organizing Maps. Berlin/Heidelberg, Germany: Springer, 1995, vol. 30.
[28] C. Bouton and G. Pagès, "Convergence in distribution of the one-dimensional Kohonen algorithms when the stimuli are not uniform," Adv. Appl. Probab., vol. 26, no. 1, pp. 80–103, 1994.
[29] J. C. Fort and G. Pagès, "On the a.s. convergence of the Kohonen algorithm with a general neighborhood function," Ann. Appl. Probab., vol. 5, no. 4, pp. 1177–1216, 1995.
[30] L. Knohl and A. Rinscheid, "Speaker normalization and adaptation based on feature-map projection," in Proc. Eurospeech'93, 3rd Eur. Conf. Speech Commun. Technol., 1993, pp. 367–370.
[31] L. Knohl and A. Rinscheid, "Speaker normalization with self-organizing feature maps," in Proc. IJCNN'93-Nagoya, Int. Joint Conf. Neural Netw., 1993, pp. 243–246.
[32] A. Rinscheid, "Voice conversion based on topological feature maps and time-variant filtering," in Proc. ICSLP'96, pp. 1445–1448.
[33] E. Uchino, K. Yano, and T. Azetsu, "A self-organizing map with twin units capable of describing a nonlinear input–output relation applied to speech code vector mapping," Inf. Sci., vol. 177, no. 21, pp. 4634–4644, Nov. 2007.
[34] S. Xiang, F. Nie, Y. Song, C. Zhang, and C. Zhang, "Embedding new data points for manifold learning via coordinate propagation," in Proc. PAKDD'07, 2007 (long version in Knowledge and Information Systems).
[35] Y. Bengio, J. Paiement, and P. Vincent, "Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps and spectral clustering," in Advances in Neural Information Processing Systems, vol. 16. Cambridge, MA: MIT Press, 2004.
[36] M. Law and A. K. Jain, "Incremental nonlinear dimensionality reduction by manifold learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 3, pp. 377–391, Mar. 2006.
[37] O. Kouropteva, O. Okun, and M. Pietikäinen, "Incremental locally linear embedding," Pattern Recognition, vol. 38, no. 10, pp. 1764–1767, 2005.
[38] M. Zhang, J. Tao, J. Nurminen, J. Tian, and X. Wang, "Phonetic anchor based state mapping for text-independent voice conversion," in Proc. ICSP'08, 2008.
[39] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., vol. 27, pp. 187–207, 1999.
[40] K. Kiviluoto, "Topology preservation in self-organizing maps," in Proc. Int. Conf. Neural Netw. (ICNN'96), 1996, pp. 294–299.
[41] K. Paliwal and B. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Trans. Speech Audio Process., vol. 1, no. 1, pp. 3–14, Jan. 1993.
Jianhua Tao (M'98) received the M.S. degree from Nanjing University, Nanjing, China, in 1996 and the Ph.D. degree from Tsinghua University, Beijing, China, in 2001. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing. His current research interests include speech synthesis and recognition, human–computer interaction, and emotional information processing. He has published more than 60 papers in major journals and proceedings, such as the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, ICASSP, Interspeech, ICME, ICPR, ICCV, and ICIP. In 2006, he was elected Vice-Chairperson of the ISCA Special Interest Group of Chinese Spoken Language Processing (SIG-CSLP) and an Executive Committee member of the HUMAINE association. He is an Editorial Board Member for the Journal on Multimodal User Interfaces (JMUI) and the International Journal of Synthetic Emotions (IJSE), and a Steering Committee Member for the IEEE TRANSACTIONS ON AFFECTIVE COMPUTING.
Meng Zhang received the B.S. degree from the Department of Automation, University of Science and Technology of China (USTC), Hefei, in 2005. He is currently pursuing the Ph.D. degree in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing. From September 2006 to March 2007, he was an intern student in the DSP Lab, Chinese University of Hong Kong (CUHK), Hong Kong. His research interests include speech transformation and speech synthesis.
Jilei Tian received the B.S. and M.S. degrees in biomedical engineering from Xi'an Jiaotong University, Xi'an, China, and the Ph.D. degree in computer science from the University of Kuopio, Kuopio, Finland, in 1985, 1988, and 1997, respectively. He was with the Northern Jiaotong University faculty from 1988 to 1994. He has been with the Nokia Research Center as a Senior Research Engineer since 1997, and recently as a Principal Member of Research Staff. He has authored or coauthored over 50 refereed publications, including book chapters, journal articles, and conference papers, and holds 30 granted and pending patents. His research interests include speech and natural language processing, human–user interfaces, data mining, and biomedical signal processing. Dr. Tian has served on technical committees and technical review committees for conferences and workshops, including ICSLP, Eurospeech, and IEEE conferences. He is a member of ISCA.
Xia Wang received the B.S. and M.S. degrees from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1997 and 1999, respectively. She is currently a Research Team Leader with the Nokia Research Center, Beijing. She has been working on voice user interfaces, speech recognition and synthesis technologies, and natural language processing technologies for over ten years. Her current research interests include human–computer interaction, speech recognition and synthesis, multimodal user interfaces, and user experiences.
Jani Nurminen received the M.Sc. degree from the Department of Information Technology, Tampere University of Technology, Tampere, Finland, in 2001. He has worked on speech-related technologies since 1999, first at the Tampere University of Technology until 2002, and after that with Nokia. He has authored or coauthored about 40 research publications, and has over 30 granted or pending patents. Currently, he is a Technology Manager with Nokia Devices R&D. His research interests include speech synthesis, speech and audio processing, language processing, data compression, and multimodal user interfaces.