On the impact of alignment on voice conversion performance

Elina Helander (1), Jan Schwarz (2), Jani Nurminen (3), Hanna Silen (1), Moncef Gabbouj (1)

(1) Department of Signal Processing, Tampere University of Technology, Finland
(2) Institute for Circuit and System Theory, Christian-Albrechts University of Kiel, Germany
(3) Nokia Devices R&D, Tampere, Finland

[email protected], [email protected], [email protected], [email protected]

Abstract

Most current voice conversion systems model the joint density of source and target features using a Gaussian mixture model. An inherent property of this approach is that the source and target features have to be properly aligned for the training. It is intuitively clear that the accuracy of the alignment has some effect on the conversion quality, but this issue has not been thoroughly studied in the literature. Examples of alignment techniques include the usage of a speech recognizer with forced alignment or dynamic time warping (DTW). In this paper, we study the effect of alignment on voice conversion quality through extensive experiments and discuss issues that should be considered. The main outcome of the study is that alignment clearly matters, but with simple voice activity detection, DTW and some constraints we can achieve the same quality as with hand-marked labels.

Index Terms: voice conversion, alignment, DTW

1. Introduction

The aim in voice conversion (VC) is to convert speech from one speaker (the source speaker) to sound like the speech of another particular speaker (the target speaker). VC consists of two phases: training and conversion. Training usually relies on parallel data from the source and target speakers, although some approaches for non-parallel VC data alignment have also been proposed. An interesting framework for voice conversion is offered by voice adaptation with a hidden Markov model (HMM) based speech synthesizer [1] that does not require parallel sentences for training. Many VC systems are based on applying a conversion scheme directly to the source speech or its parametric representation. The most popular conversion scheme is to use a Gaussian mixture model (GMM) to model the joint density of aligned source and target features [2]. Thus, before training the GMM it is necessary to align the training data, i.e. to find a corresponding target frame for each source frame.

The alignment process has received little attention in the literature. Nevertheless, there are several techniques that can be used for carrying out the alignment. The simplest alignment technique is linear interpolation, which works under the assumption that speaking rate variation is only global, not local. Non-linear warping can be obtained using dynamic time warping (DTW), which finds an optimal path through a difference matrix computed between the source and target features. It is also possible to use a speech recognizer with forced alignment. Compared to this solution, DTW has the advantage that the alignment can be done without knowing the content of the sentence. Moreover, there is no need to have a speech recognizer available.

In this study, we consider conventional GMM-based voice conversion that uses parallel sentences for training data generation, and we analyze how the alignment process affects the conversion result. We carry out experiments and point out alignment-related aspects that may improve or degrade the conversion performance. The results support our hypothesis that it is possible to affect the conversion quality through alignment. We also show that a reasonably simple alignment procedure can be used for obtaining a quality level similar to the level that can be obtained using manually annotated labels.

This paper is organized as follows. Section 2 describes DTW in general, whereas the usage of DTW in VC is discussed in Section 3. Experimental results related to alignment accuracy are presented in Section 4. Section 5 presents voice conversion experiments with different alignments and provides an analysis of the results. Section 6 concludes the study.

2. Dynamic time warping in speech alignment

The objective of DTW is to find an optimal alignment between speech patterns X and Y, represented by short-time feature vector sequences. The feature vectors typically relate to the corresponding speech spectra. The overall distortion d(X, Y) is the sum of the local distances d(i_x, i_y) computed over the path, and the optimal alignment minimizes this overall distortion subject to some constraints. Constraints on the warping function are required in order to obtain a meaningful alignment, and they also save computational resources. However, the use of strict constraints can introduce problems if the "correct" path cannot fit into the allowed area. Examples of typical warping constraints include [3]:


• Endpoint constraints define that the alignment starts at the first frame pair d(1, 1) and ends at d(N, M), where N and M are the numbers of source and target frames, respectively.
• Monotonicity constraints do not allow the warping path to have a negative slope.
• Local constraints define the set of allowed predecessors and transitions to the current node.
• Global constraints define the region of nodes that are searched for the optimal path.
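To make the constraints concrete, the following is a minimal sketch of DTW under the endpoint and monotonicity constraints above, with a simple local constraint allowing the predecessors (i-1, j), (i, j-1) and (i-1, j-1) and no global constraint. The function name and the Euclidean local distance are our illustrative choices, not details specified in the paper.

```python
import numpy as np

def dtw_align(X, Y):
    """Align feature sequences X (N x d) and Y (M x d) with DTW.

    Endpoint constraint: the path starts at (0, 0) and ends at (N-1, M-1).
    Local constraint: predecessors (i-1, j), (i, j-1), (i-1, j-1),
    which also enforces monotonicity (no negative slope).
    """
    N, M = len(X), len(Y)
    # Local distance matrix: Euclidean distance between every frame pair.
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Accumulated cost with the endpoint constraint D[0, 0] = d[0, 0].
    D = np.full((N, M), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            preds = []
            if i > 0:
                preds.append(D[i - 1, j])
            if j > 0:
                preds.append(D[i, j - 1])
            if i > 0 and j > 0:
                preds.append(D[i - 1, j - 1])
            D[i, j] = d[i, j] + min(preds)
    # Backtrack from (N-1, M-1) to recover the warping path.
    i, j = N - 1, M - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0:
            candidates.append((D[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((D[i, j - 1], (i, j - 1)))
        if i > 0 and j > 0:
            candidates.append((D[i - 1, j - 1], (i - 1, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    path.reverse()
    return path, D[N - 1, M - 1]
```

In the experiments of Sec. 4, X and Y would be the 12-dimensional MFCC sequences of a parallel source-target sentence pair. A global constraint such as a Sakoe-Chiba band would simply leave cells outside the band at infinite cost, which prunes the search at the risk of excluding the "correct" path.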

3. DTW in voice conversion

Usually the training data in VC is not as corrupted as the signals that speech recognition systems have to deal with. Thus, voice conversion is a relatively easy use case for DTW. Moreover, there is no single "correct" time alignment between the sentences from two different speakers, and thus there may be several acceptable alignments. However, there are still some aspects that should be considered when implementing DTW for voice conversion purposes.

3.1. Alignment features

In alignment, parallel speech waveforms are converted into sequences of features that can be compared with each other. The alignment searches for the source-target feature path that minimizes the overall distortion. The closer the speaker characteristics are to each other, the better the features are likely to be aligned. It is somewhat paradoxical that in VC we try to capture the differences between the source and the target speaker characteristics, but at the same time the alignment process tries to minimize the difference. To minimize the effect of this paradox, the features used in the alignment should be as speaker-independent as possible. In speech recognition, Mel-frequency cepstral coefficients (MFCCs) are commonly used features; thus, they should also be suitable for the alignment. Linear prediction related features such as line spectral frequencies (LSFs) are more complicated to use, since the kth LSF of the source may not correspond to the kth LSF of the target.
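As an illustration, alignment features matching the description in Sec. 4.1 (12 zero-mean MFCCs at 5 ms steps with a 25 ms analysis window, first coefficient omitted) could be extracted as sketched below. The use of librosa and the FFT size are our assumptions, not choices stated in the paper.

```python
import numpy as np
import librosa

def alignment_features(wav_path, sr=16000):
    """12 zero-mean MFCCs (c1..c12) at 5 ms steps with a 25 ms window."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        hop_length=int(0.005 * sr),    # 5 ms frame step
        win_length=int(0.025 * sr),    # 25 ms analysis window
        n_fft=512)                     # FFT size: our assumption
    feats = mfcc[1:, :].T              # drop c0, shape (num_frames, 12)
    return feats - feats.mean(axis=0)  # zero mean per coefficient
```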

3.2. Problems with DTW in the context of VC

There are some potential problems when DTW is used to align signals for VC. These inherent problems, discussed in the remainder of this section, can be dealt with by additional constraints or by removing "bad" frame pairs. Although VC systems need to be able to cope with small amounts of training data, and an increase in the training set size is likely to enhance the conversion quality, single frame pairs can be safely removed from the training data when necessary. Removing badly aligned frames may even increase the resulting speech quality.

3.2.1. Silent segments

When recording a sentence, some silence is usually included before and after the meaningful speech content. Silence segments carry no useful information and can sometimes confuse DTW, depending on the constraints. Exploitation of end points is highly crucial for good DTW performance. However, unreliable estimation of end points is also a problem if strict end point constraints are used. The most severe problems arise when there is silence at the beginning and/or at the end of a speech pattern X but no silence at the beginning or the end of a speech pattern Y. If the silent parts are not removed before the alignment and the end point constraints are strict, the alignment will go wrong, most likely for the whole sentence. Even if the alignment could find its way back to the "correct" path at some point, it will still produce silence-speech frame pairs that should preferably be removed from the training data.

[Figure 1: Example of a distance matrix with the alignments given by the manual labels (circles), DTW (solid line) and linear interpolation (dashed line). Plot not reproduced; axes show frames of the source and target utterances.]

3.2.2. Global optimization

DTW provides a globally optimal alignment through the source-target difference matrix. However, this does not mean that each frame pair represents a good feature pair for GMM training. For example, DTW can handle short silence segments between words even if the silence is present only in either the source or the target speech. Nevertheless, it is questionable whether pairs in which the source part is silence and the target part is speech, or vice versa, should be used for training. Moreover, local performance sometimes has to be sacrificed for the global optimum, i.e. some clearly voiced frames become paired with clearly unvoiced frames. Including such data in GMM training may not be meaningful.

3.2.3. One-to-many and many-to-one mappings

The purpose of DTW is to generate a non-linear warping of the feature sequences along the time axis. This means that a target frame may become mapped to more than one source frame, and likewise a source frame can have more than one target frame mapped onto it. This results in one-to-many and many-to-one mappings, and such data is ambiguous for the GMM. In general, the main problem of GMMs in VC is oversmoothing, and if one-to-many or many-to-one mappings occur systematically (e.g. when the speaking rates of the source and the target differ significantly), the oversmoothing may become worse.

4. Experiments on alignment accuracy

4.1. Database and alignment features

The database consists of a set of the Berlin sentences taken from the German speech database The Kiel Corpus of Read Speech (KCoRS) [4], sampled at 16 kHz. All the sentences have been manually labeled by experts, and in the alignment experiments we assume these labels to be precise. For the alignment, 13 MFCCs were extracted at 5 ms steps with an analysis window of 25 ms. The first MFCC was omitted and the remaining 12 MFCCs were normalized to zero mean.

4.2. Schemes included in comparison There are many alternatives concerning the different constraints for guiding the optimal path search with DTW. In addition, the selection of the alignment features can affect the result as discussed in Sec. 3.1. However, according to our experiments, the use of different local constraints (I, II, V and Itakura) [3] did not make much difference on the alignment performance. Furthermore, we compared the use of only static features (12 MFCCs) with the use of both static and dynamic features (12 MFCCs, their delta and delta-delta coefficients). Incorporating dynamic information had only minor effects on the results. Thus, local constraint type II and 12 MFCCs without dynamic information

3.2.1. Silent segments When recording a sentence, some silence is usually included before and after the meaningful speech content. Silence segments represent non-interesting information and can sometimes confuse the performance of DTW depending on the constraints. Exploitation of end points is highly crucial for good DTW per-

1454
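The following is a minimal sketch of this kind of endpoint trimming. The paper only specifies "a heuristic energy threshold", so the frame-energy computation and the threshold of 10% of the mean energy (borrowed from the data-removal rule in Sec. 5.2) are our assumptions.

```python
import numpy as np

def trim_endpoint_silence(frames, threshold_ratio=0.1):
    """Find the non-silent range of an utterance.

    frames: array of shape (num_frames, frame_length) with speech samples.
    threshold_ratio: heuristic fraction of the mean frame energy below
    which a frame is treated as silence (assumed value, see above).
    Returns the index range (start, end) of the retained frames.
    """
    energy = np.sum(frames ** 2, axis=1)       # short-time energy per frame
    threshold = threshold_ratio * energy.mean()
    active = np.where(energy >= threshold)[0]  # frames above the threshold
    if active.size == 0:                       # no speech detected at all
        return 0, len(frames)
    return active[0], active[-1] + 1           # keep everything in between
```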

4.3. Results on alignment accuracy

As a preliminary step for assessing the alignment in VC, we measured the misalignment given by linear interpolation and different DTW approaches with respect to the manual labels. The alignment analysis comprised 100 sentences spoken by 5 different speakers (2 female (k04, k06) and 3 male (k05, k61, k65)). Table 1 lists the results for the different approaches. It shows the mean misalignment in ms and the percentages of misalignments greater than 20 ms, 50 ms and 100 ms. Linear interpolation (D0) was tested against different DTW approaches that differed in terms of the use of global constraints (GC) and forced end-point constraints (EC). In Table 1, n means that the particular constraint was not used while y means it was in use. Regardless of the use of EC, the DTW algorithm always assumed that the first feature vectors of the source and the target form a pair. In addition to the two constraints, three types of DTW schemes were included in the test. The first one (D1) had no silence removal. The second one (D2) removed silent frames from the beginning and the end before path calculation, using a simple VAD based on an energy threshold. Finally, the third one (D3) used the starting and ending points given by the manually annotated labels. The results are discussed together with the VC results in Sec. 5.4.

Table 1: Misalignment caused by linear interpolation and different DTW approaches.

Algorithm  GC  EC  Mean [ms]  >20 ms [%]  >50 ms [%]  >100 ms [%]
D0         -   -   132.4      91.7        79.9        56.4
D1         y   y   11.0       11.1        3.6         1.7
D1         n   n   63.1       12.1        5.7         4.8
D1         n   y   11.0       11.1        3.6         1.7
D2         y   y   19.4       19.3        10.9        5.4
D2         n   n   107.4      30.2        24.3        19.8
D2         n   y   19.4       19.3        10.9        5.4
D3         y   y   7.2        5.0         1.0         0.6
D3         n   n   7.2        5.1         0.9         0.5
D3         n   y   7.2        5.0         1.0         0.6

5. Experiments on voice conversion performance

5.1. Voice conversion framework

In the analysis and synthesis, a VC framework similar to the one presented in [5] is used, but now for wideband signals (sampling rate 16 kHz). The alignment was performed using several different alignment schemes, and separate GMMs were trained using the different alignments. 12 MFCCs were applied as alignment features. For GMM training, the corresponding 16-dimensional LSF vectors computed from the source and the target signals were used. The joint density of the aligned source and target LSF vectors was modeled with a GMM as explained in [2]. In conversion, all the other speech parameters (voicing information and harmonic amplitudes for the residual spectrum, pitch and energy) were handled in an identical way in all schemes. In addition to the LSF modification, pitch level adjustment and residual spectrum resampling were carried out.

The resulting voice conversion quality was evaluated for 9 different alignment schemes, summarized in Table 2. All 9 alignment schemes were evaluated using objective metrics, and the main techniques (gmm1, gmm2 and gmm3) were also evaluated in a listening test. Global constraints were not used, and bad data removal was used only with gmm2. Cases gmm8 and gmm9 correspond to a fictional situation where the source had silence removed from the beginning and the end while the target did not (gmm8), and vice versa (gmm9).

Speaker pairs k04-k05 (female-male) and k65-k05 (male-male) were used in the evaluation. 70 sentences were used in training the GMM models.

Table 2: Alignment schemes tested with voice conversion.

gmm1  DTW goes through manual labels ("ideal" case)
gmm2  DTW + simple VAD, forced end, data removal
gmm3  Linear interpolation, endpoints from manual labels
gmm4  Linear interpolation, endpoints with simple VAD
gmm5  Same as gmm2 but no data removal
gmm6  DTW + simple VAD, no forced end
gmm7  DTW + no VAD, forced end
gmm8  DTW + silence removed from source, forced end
gmm9  DTW + silence removed from target, forced end
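To make the joint-density training step of Sec. 5.1 concrete, a minimal sketch using scikit-learn is shown below. The use of GaussianMixture and the variable names are our illustrative choices; the actual systems in [2] and [5] may differ in initialization and conversion details.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(source_lsf, target_lsf, n_components=8):
    """Fit a GMM on stacked, time-aligned source/target LSF vectors.

    source_lsf, target_lsf: aligned arrays of shape (num_frames, 16),
    one row per frame pair produced by the alignment step.
    """
    z = np.hstack([source_lsf, target_lsf])  # joint vectors, shape (T, 32)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(z)
    # At conversion time, the converted vector for a source frame x is the
    # conditional expectation E[y | x] under this joint density, as in [2].
    return gmm
```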

5.2. Listening test results

The VC performance achieved using the alignments given by DTW and linear interpolation was evaluated in a listening test. 17 native German listeners were asked to judge the quality of the transformed voice by doing a comparison category rating (CCR). The listeners compared the voice conversion quality of ten sentence pairs not included in the training set, from two speaker pairs, in two different comparisons. DTW with simple VAD and forced endpoints (gmm2) was compared against the "ideal" case that utilizes the manually annotated labels (gmm1). In addition, gmm2 was compared with linear interpolation (gmm3). Additional data removal was applied when training gmm2: unvoiced-voiced pairs were discarded, as well as pairs where at least one of the frames had an energy level less than 10% of the mean energy. The results of the preference test are shown in Table 3.


Table 3: Results of the CCR test.

          gmm1 - gmm2                          gmm2 - gmm3
          gmm1 better  gmm2 better  identical  gmm2 better  gmm3 better  identical
k04-k05   10.0%        4.1%         85.9%      94.1%        1.2%         4.7%
k65-k05   7.7%         4.1%         88.2%      98.8%        0.6%         0.6%
total     8.8%         4.1%         87.1%      96.5%        0.9%         2.6%

5.3. Objective voice conversion results

We also evaluated the voice conversion quality by measuring the spectral distortion (SD) between the converted target and the real target features. The alignment that goes through the manual labels was assumed to be the correct one. We compared the LSFs of the converted target and the real target using mean spectral distortion. The average SD over 30 test sentences not included in the training set was calculated for two different bands (125-3100 Hz and 0-8000 Hz). The results are shown in Table 4. The number of GMM mixtures was 8 in all cases except with gmm7 (4 mixtures); the number of mixtures was chosen to give the lowest SD.
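For reference, a common definition of the mean spectral distortion in dB over a band [omega_1, omega_2] is given below. The paper does not spell out its exact formula, so this is the standard form rather than necessarily the authors' exact implementation:

$$ \mathrm{SD} = \frac{1}{T}\sum_{t=1}^{T} \sqrt{\frac{1}{\omega_2-\omega_1}\int_{\omega_1}^{\omega_2} \left[ 20\log_{10}\frac{|S_t(\omega)|}{|\hat{S}_t(\omega)|} \right]^2 \mathrm{d}\omega }, $$

where $S_t$ and $\hat{S}_t$ are the target and converted spectra (here, LPC spectra derived from the LSF vectors) for frame $t$, and $T$ is the number of frames.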

Table 4: Results for mean spectral distortion (dB).

        k04-k05      k04-k05    k65-k05      k65-k05
        125-3100 Hz  0-8000 Hz  125-3100 Hz  0-8000 Hz
gmm1    4.57         4.63       4.61         4.60
gmm2    4.65         4.69       4.66         4.65
gmm3    5.40         5.31       6.39         6.14
gmm4    6.34         6.15       7.22         6.80
gmm5    4.77         4.79       4.74         4.73
gmm6    5.50         5.35       4.81         4.79
gmm7    4.87         4.92       4.89         4.86
gmm8    11.38        10.39      12.38        11.25
gmm9    6.94         6.46       6.13         5.87
source  8.17         7.25       6.60         7.11

5.4. Analysis of results

The results for subjective and objective voice conversion quality are very consistent with each other. The listeners could not observe a clear difference between the samples generated using the GMM based on the ideal alignment (gmm1) and the GMM based on a simple VAD and data removal (gmm2). Although gmm1 was preferred more often, the difference was not significant according to a two-sample t-test (p=0.08 for k04-k05 and p=0.40 for k65-k05). Similarly, the objective results indicate that gmm1 and gmm2 have roughly equal performance. In contrast, there was a clear difference between the performance of gmm2 and gmm3. It is interesting to note that, according to Table 4, linear interpolation (gmm3) with k04-k05 seems to be slightly more successful than with k65-k05, and the same can be concluded from the subjective results. This may be explained by the fact that speaker k65 had a quite unique speaking style compared to k04 and k05, and thus the global speaking rate assumption was far from valid. The results in Table 1 also confirm that the errors can be rather high. If simple VAD was used instead of the correct starting and ending points for linear interpolation (gmm4), the quality was degraded.

The use of data removal in gmm2 increased the quality over the case where no data removal was employed (gmm5). The use of a simple VAD did not automatically remove all silence-speech pairs, and removing them improved the quality slightly, at least based on the objective results. Some voiced-unvoiced and silence-silence pairs were also removed with gmm2. The performance of gmm8 and gmm9 was very poor. This indicates that silent frames can be problematic for DTW and will degrade the performance significantly if they are not taken into account. In contrast, gmm7 was rather successful. In its training, some problems with covariance matrices occurred due to the high number of silence-silence pairs. Nevertheless, the performance of DTW in that case was good, as could also be predicted from the results shown in Table 1: there was silence with both speakers and no strict constraints were given, which enabled DTW to find a decent path. However, in practice we should at least verify the existence of silence for each source-target sentence pair (compare the cases gmm8 and gmm9). On the contrary, using VAD without forcing the endpoints performed poorly (gmm6), which is also indicated by the results in Table 1.

The spectral distance between the original source and the target is shown in the last row of Table 4. Since the target is male, it was expected that source k65 (male) would be closer to the target than source k04 (female). However, this applied only to the frequency band 125-3200 Hz and not to the whole speech band. The conversion error can also be expressed as the mean-squared error normalized by the difference between the source and the target. For the ideal case (gmm1) this conversion error was 0.35 and 0.46, for gmm2 0.36 and 0.47, and for gmm3 0.46 and 0.80, for the speaker pairs k04-k05 and k65-k05, respectively. It should also be noted that the database had an impact on the results: it did not contain very long sentences or noise, but there were some breathing effects that can affect the results. Finally, it should be noted that the alignment results are speaker-pair specific.
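The normalized conversion error mentioned above is not defined explicitly in the text; one standard formalization, stated here only as a plausible reading, is

$$ E_{\mathrm{norm}} = \frac{\sum_{t} \lVert \hat{\mathbf{y}}_t - \mathbf{y}_t \rVert^2}{\sum_{t} \lVert \mathbf{x}_t - \mathbf{y}_t \rVert^2}, $$

where $\mathbf{x}_t$, $\mathbf{y}_t$ and $\hat{\mathbf{y}}_t$ are the aligned source, target and converted feature vectors for frame $t$.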

6. Conclusions

The experiments presented in this paper have verified that the quality of GMM-based voice conversion can be significantly enhanced by improving the alignment. However, the results also indicate that a combination of DTW and a simple VAD can be used for successful alignment in most cases. It also seems to be beneficial to remove inappropriate data: frame pairs containing clearly non-matching data and silence-silence pairs should be removed from the training data. On a higher level, the main conclusion that can be drawn from our study is that while the main challenges in voice conversion lie elsewhere, alignment is still an important piece of the puzzle that should be taken into account in the development of voice conversion systems.

7. References

[1] J. Yamagishi, K. Ogata, Y. Nakano, J. Isogai, and T. Kobayashi, "HSMM-based model adaptation algorithms for average-voice-based speech synthesis," in Proc. ICASSP, vol. I, 2006, pp. 77-80.
[2] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, vol. 1, 1998, pp. 285-288.
[3] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, Prentice-Hall Inc., 1993.
[4] K. Kohler, "The Kiel Corpus of Read Speech," Institute of Phonetics and Digital Speech Processing, Christian-Albrechts University of Kiel, Kiel, Germany, 1994.
[5] J. Nurminen, V. Popa, J. Tian, and Y. Tang, "A parametric approach for voice conversion," in Proc. TC-STAR Workshop on Speech-to-Speech Translation, 2006, pp. 225-229.
