Voice Conversion


Abstract

In this writeup, we describe our approach to the Voice Conversion Challenge 2018. First, we give a description of the task, followed by a brief and hopefully intuitive introduction to the signal processing involved. An elaborate discussion of the machine learning component follows, covering both the formulation and potential approaches. We conclude this report by providing links to some sample outputs.

1. Introduction - Task Description

Voice conversion is the process of transforming the vocal characteristics of a source speaker to those of a target speaker without altering the content (phonetic content in terms of the language and expressivity in terms of emotions). Voice conversion systems are typically evaluated on a subjective basis. However, the dimensions measured in these analyses are typically speaker similarity and naturalness, both of which can also be measured fairly convincingly using objective metrics such as cepstral distortion. The tasks in the challenge therefore require participants to convert speaker characteristics while keeping the content of each speech utterance constant, among the speakers provided in the training stage.

Figure 1. Overview of Voice Conversion

There are two subtasks in this version of the challenge:

• Hub Task: Parallel Training. A set of 80 recordings by 8 speakers will be provided as the training database. Each speaker will speak all 80 sentences, resulting in a parallel dataset. There will be an equal distribution of source and target speakers as well as male and female speakers. The nomenclature follows the pattern αβN, where α indicates source or target speaker, β indicates male or female speaker, and N indicates the identity of the speaker. Ex: SF1 stands for the source speaker who is female and identified as speaker 1. The task is to generate all possible combinations (16 in this case) of conversions.

• Spoke Task: Non-Parallel Training. All other conditions remaining the same, this task is characterized by the fact that the set of utterances recorded by the speakers will not be the same as in the Hub Task. The task is again to generate all possible combinations (16 in this case) of conversions.
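As a quick sanity check on the combinatorics, the following minimal Python sketch enumerates the source-target pairs. The speaker labels are hypothetical placeholders following the αβN scheme above, not the official challenge identifiers.

    from itertools import product

    # Hypothetical speaker labels (4 source and 4 target speakers);
    # the official challenge identifiers may differ.
    sources = ["SF1", "SF2", "SM1", "SM2"]
    targets = ["TF1", "TF2", "TM1", "TM2"]

    pairs = list(product(sources, targets))
    print(len(pairs))                     # 16 source-target conversion pairs
    for src, tgt in pairs:
        print(src, "->", tgt)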

2. Signal Processing

2.1. Constituents of Speech Signal

In this section, let's take a brief look at the signal processing involved in the problem at hand. A more elaborate version can be found as a blog post here. It might be beneficial to use an analogy - light. Even though sound and light are fundamentally different in their properties, the analogy makes sense from the perspective of basic constituents. We all know that visible light contains a whole spectrum of colors which can be separated using a prism. Similarly, human speech is a rich encoding of several distinct kinds of information - ranging from language and content to speaker identity, age, emotions, etc. There are works which have shown that it is even possible to decipher behavioral patterns and diagnose certain medical conditions from speech. For our discussion, let's limit ourselves to the speaker, the content, and the manner in which the content was spoken. Just as in the case of light, these individual traits can be separated from speech using techniques such as Fourier analysis. Mathematically, a speech utterance U can be seen as a non-linear combination (because linear is a subset of non-linear) of speaker, speech content, emotion and other personality/channel traits as follows:

U = f(s, c, e, θ)    (1)

Here θ is a variable we choose to account for all the extra information present in the signal. It is interesting to note that even in a controlled setting, professional speakers tend to produce variations in these factors (leading to different pronunciations, or different durations for the same pronunciation, etc.). This is also apparent if we look at the distribution of durations used by speakers in the 2016 edition of the challenge.

Figure 2. Bar chart depicting the duration of speech per speaker in minutes. Note that the content of speech is the same across speakers. The variation in duration might point to differences in pronunciation, style or accentuation corresponding to each speaker.
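Such per-speaker duration statistics are straightforward to compute. The sketch below sums the length of all WAV files in a hypothetical per-speaker directory layout using only the Python standard library; the directory names and speaker labels are assumptions.

    import wave
    from pathlib import Path

    def total_duration_minutes(speaker_dir):
        # Sum the duration of all WAV files under one speaker's directory.
        seconds = 0.0
        for path in Path(speaker_dir).glob("*.wav"):
            with wave.open(str(path), "rb") as w:
                seconds += w.getnframes() / w.getframerate()
        return seconds / 60.0

    # Hypothetical layout: one directory of recordings per speaker.
    for spk in ["SF1", "SF2", "SM1", "SM2"]:
        print(spk, round(total_duration_minutes("data/" + spk), 2), "min")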

2.2. Features of Speech Signal

As speech is a continuum of different characteristics, intuition suggests that more than a single feature is involved in accurately accounting for those characteristics. Given a frame (F) of the speech signal, it is reasonable to assume that at least 3 features play a role in deciphering the aforementioned traits. Mathematically,

F = g(spectrum, fundamental frequency, voicing)    (2)

where spectrum and voicing correspond to the content spoken (vowels are voiced and consonants are unvoiced), while fundamental frequency and spectrum correspond to the speaker (formants and pitch are speaker specific). However, this is just a rough approximation, since the content is also present in the spectrum (harmonic structure for voiced sounds and random structure for unvoiced sounds). A presentation covering these topics might provide additional insights.

2.2.1 Spectral Representation

Three representations were chosen to begin with:


• Mel cepstral representation from SPTK.
• Spectral representation from the WORLD vocoder.
• Mel general spectrum extracted from the WORLD spectrum.
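As an illustration, the following sketch extracts the WORLD spectrum, F0 and aperiodicity and converts the spectrum into a 25-dimensional mel-cepstral representation. It assumes the pyworld, pysptk and soundfile packages are available; the file name and analysis settings are placeholders, not the exact configuration used in our system.

    import numpy as np
    import pyworld
    import pysptk
    import soundfile as sf

    # Hypothetical input file; analysis settings are assumptions.
    x, fs = sf.read("SF1_10001.wav")
    x = x.astype(np.float64)

    # WORLD analysis: F0 contour, spectral envelope, and aperiodicity per frame.
    f0, timeaxis = pyworld.harvest(x, fs)
    sp = pyworld.cheaptrick(x, f0, timeaxis, fs)   # spectral envelope (WORLD spectrum)
    ap = pyworld.d4c(x, f0, timeaxis, fs)          # aperiodicity, related to voicing

    # Mel-cepstral representation from the WORLD spectrum via SPTK bindings.
    order = 24                                     # 25 coefficients, as in the baseline
    alpha = pysptk.util.mcepalpha(fs)              # frequency-warping factor for this rate
    mgc = pysptk.sp2mc(sp, order, alpha)

    print(mgc.shape)                               # (num_frames, order + 1)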

3. Machine Learning Component

Given the intuitions from signal processing, the task of voice conversion can be seen as a modeling problem, where we want to discover a mapping function H_f for each frame f such that it transforms the spectrum, F0 and voicing of the source speaker (A) to those of the target speaker (B). Mathematically,

[spectrum_B, fundamental frequency_B, voicing_B] = Σ_f H_f(spectrum_A, fundamental frequency_A, voicing_A)    (3)

This formulation supports discrete-time models such as feed-forward neural networks and GMMs. Alternatively, a joint model H can be learnt that optimizes over the entire sequence of frames (f1 ... fn) using sequence-to-sequence models.
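To make the frame-wise formulation concrete, the following PyTorch sketch trains a small feed-forward network on time-aligned parallel frames. The feature dimensionality and the random tensors standing in for aligned source/target frames are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class FrameMapper(nn.Module):
        # Frame-wise mapping H_f: one per-frame feature vector in, one out.
        # 27 dims is a placeholder (e.g. 25-dim mel-cepstrum + F0 + voicing flag).
        def __init__(self, dim=27, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, source_frames):          # (num_frames, dim)
            return self.net(source_frames)

    model = FrameMapper()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    src = torch.randn(1000, 27)                    # placeholder aligned source frames
    tgt = torch.randn(1000, 27)                    # placeholder aligned target frames
    for _ in range(10):                            # a few illustrative updates
        optimizer.zero_grad()
        loss = loss_fn(model(src), tgt)
        loss.backward()
        optimizer.step()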

3.1. Potential Deep Learning Methods

In this subsection, let's look at the potential deep learning paradigms that can be used for this task.

3.1.1 Nature of the Task - GAN

The task itself is a generative one, so it makes sense to deploy a GAN for it. However, it needs to be decided what the generator and adversarial networks should try to capture. Unlike in vision, a clear demarcation is possible here: one can model the speaker and the other can model the content.
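A compact sketch of such an adversarial setup is given below, with the generator mapping source frames towards the target speaker and the discriminator modeling the target-speaker identity. This is an illustrative assumption of how the roles could be split, not the exact architecture used.

    import torch
    import torch.nn as nn

    # Generator G: maps source frames towards target-like frames.
    # Discriminator D: scores whether a frame looks like a real target-speaker frame.
    # The 27-dim frame size is a placeholder (e.g. mel-cepstrum + F0 + voicing).
    G = nn.Sequential(nn.Linear(27, 256), nn.ReLU(), nn.Linear(256, 27))
    D = nn.Sequential(nn.Linear(27, 256), nn.ReLU(), nn.Linear(256, 1))

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    src = torch.randn(64, 27)            # placeholder source-speaker frames
    tgt = torch.randn(64, 27)            # placeholder real target-speaker frames

    # Discriminator step: real target frames vs. converted (fake) frames.
    fake = G(src).detach()
    loss_d = bce(D(tgt), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator.
    loss_g = bce(D(G(src)), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()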

3.1.2 Nature of Speech - Multi-Task Learning and Feature Discovery

A whole bunch of features can be extracted from a single frame of a speech utterance, but we are sticking to spectrum, F0 and voicing. This is because we will need to reconstruct the signal and therefore need features which do not lose the important traits. However, there might be other features better suited to the task of modeling. In other words, it is intuitive to imagine that a feature representation (f1) which helps in reconstructing the signal is not necessarily the best feature for transferring characteristics among speakers, at which another feature (f2) might be better. Voice conversion using deep learning therefore provides a nice framework to test both these features in a multi-task learning setting. It is also intuitive to see that a paradigm such as an autoencoder or a convolutional neural network can be used to discover a feature (f3) which might either possess the characteristics of both f1 and f2 or supplement one of them.
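The multi-task idea can be sketched as a shared encoder with two heads, one optimizing reconstruction (an f1-like objective) and one optimizing conversion (an f2-like objective), so that the shared representation plays the role of f3. Dimensions and training details below are assumptions.

    import torch
    import torch.nn as nn

    class SharedEncoderMultiTask(nn.Module):
        # A shared encoder produces a representation (f3) that feeds two heads:
        # one reconstructs the input frame (f1-like) and one predicts the
        # corresponding target-speaker frame (f2-like). Dimensions are placeholders.
        def __init__(self, dim=27, latent=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(dim, latent), nn.ReLU())
            self.recon_head = nn.Linear(latent, dim)
            self.convert_head = nn.Linear(latent, dim)

        def forward(self, x):
            z = self.encoder(x)
            return self.recon_head(z), self.convert_head(z)

    model = SharedEncoderMultiTask()
    src = torch.randn(32, 27)            # placeholder source frames
    tgt = torch.randn(32, 27)            # placeholder aligned target frames
    recon, conv = model(src)
    loss = nn.functional.mse_loss(recon, src) + nn.functional.mse_loss(conv, tgt)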

3.1.3 Constraints in the Task - Pretraining

Given that the available data consists of only 80 utterances, which roughly corresponds to 5 minutes of speech, it is very important to exploit the pretraining framework - either (a) using the same dataset, (b) using a similar dataset, or (c) using a completely different dataset.

3.1.4 Multi-Output Nature of the Task: Combination

The task requires participants to submit 16 combinations. It would be deceptively foolish to assume that a single model would work well for all the combinations. In other words, a model that works for the speaker combination (SF1, TF1) might not be ideal for the speaker combination (SF1, TM1). Therefore, it might be necessary to intelligently combine the approaches.

3.1.5 Subjective vs. Objective Evaluation: Combination

The evaluation in the challenge is performed via subjective analysis. Having said that, it is infeasible to run elaborate listening tests while building the system. Therefore, we might need to use objective metrics to fine-tune the systems, and deep learning can play a role in either identifying a suitable metric or optimizing for a chosen one.
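One commonly used objective metric is the mel-cepstral distortion mentioned in the introduction. A minimal NumPy sketch, assuming the mel-cepstra of the converted and reference utterances are already time-aligned (e.g. via DTW), is given below.

    import numpy as np

    def mel_cepstral_distortion(mc_ref, mc_conv):
        # Frame-averaged mel-cepstral distortion in dB. Inputs are
        # (num_frames, order + 1) mel-cepstra, already time-aligned;
        # the 0th (energy) coefficient is excluded, as is conventional.
        diff = mc_ref[:, 1:] - mc_conv[:, 1:]
        return np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

    # Placeholder aligned mel-cepstra (25 coefficients per frame).
    ref = np.random.randn(500, 25)
    conv = np.random.randn(500, 25)
    print(mel_cepstral_distortion(ref, conv), "dB")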

3.2. Baseline Models

3.2.1 Signal Processing Baseline

The motivation behind using a signal processing baseline is to see how far we can go using intuition from speech principles alone. Consider that H is the spectral representation of all the frames from the target speaker and E is the spectral representation of a single utterance from the source speaker.

Figure 3. Frame Replacement procedure

• Substitution by the closest frame in Euclidean space. The idea in this technique is to replace each frame of the source utterance (E) with the closest frame from the target speaker's frames (H) and re-synthesize using the replaced frames. This process can be explained using the expression:

Ê = {H[k] | k = argmin_k ‖H[k] − e‖, ∀ e ∈ E}    (4)
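A minimal NumPy/SciPy sketch of this substitution is given below; it performs a plain exhaustive nearest-neighbour search and omits the voicing conditional used in the step list further down.

    import numpy as np
    from scipy.spatial.distance import cdist

    def substitute_frames(source_frames, target_frames):
        # source_frames: (n_src, dim) spectral frames of one source utterance (E).
        # target_frames: (n_tgt, dim) spectral frames pooled from the target speaker (H).
        # Returns an array of source_frames' shape built entirely from target frames.
        dists = cdist(source_frames, target_frames)   # (n_src, n_tgt) Euclidean distances
        nearest = dists.argmin(axis=1)                # index of the closest target frame
        return target_frames[nearest]

    # Placeholder 25-dimensional spectral frames.
    E = np.random.randn(300, 25)          # source utterance
    H = np.random.randn(5000, 25)         # target speaker frame pool
    converted = substitute_frames(E, H)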

• Substitution by learning a mapping among the frames. The idea in this technique is to first learn a representation of the distribution of the target speaker's spectrum using a regression technique such as an autoencoder. The hypothesis is that the autoencoder, given a spectral representation from the source speaker, can map it into the domain of the target spectrum. This process can be explained using the expression:

Ê = {F(e) | ∀ e ∈ E},  where F = φ(·; w, b) is fit over all h ∈ H    (5)

where w and b are the parameters of the model φ.

• List of Steps
1. Extract the spectrum for both source and target utterances (25 dimensions).
2. Iterate over each frame of a source test sentence and pick the closest target frame using an exhaustive search, with voicing as the first conditional.
3. Perform an F0 mapping to obtain speaker characteristics (bracketed by speaker statistics).
4. Build the full voice using the mixed frames.
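The mapping-based substitution of Equation (5) can be sketched as follows: an autoencoder is fit on the target speaker's spectral frames (H), and source frames (E) are then passed through it, pulling them towards the target spectral distribution. The architecture and training details are assumptions; the F0 mapping and resynthesis steps are omitted.

    import torch
    import torch.nn as nn

    # Autoencoder phi(.; w, b) fit on the target speaker's spectral frames H.
    # Frame dimension (25) matches the extraction step above; everything else
    # (layer sizes, iteration count) is a placeholder.
    autoencoder = nn.Sequential(
        nn.Linear(25, 16), nn.ReLU(),    # encoder
        nn.Linear(16, 25),               # decoder
    )
    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

    H = torch.randn(5000, 25)            # placeholder target-speaker frames
    for _ in range(100):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(autoencoder(H), H)
        loss.backward()
        optimizer.step()

    E = torch.randn(300, 25)             # placeholder source-utterance frames
    converted = autoencoder(E).detach()  # frames pulled towards the target domain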

4. Sample Outputs

• Baseline frame replacement
• Boosting on top of frame replacement
• Boosting + autoencoder
• List of all voices
• DNN baseline (currently being worked on)
