A Time Regularization Technique for Discrete

5 downloads 0 Views 859KB Size Report
Discrete Spectral Envelopes Through Frequency Derivative. Gilles Degottex* with follow-up works with. Luc Ardaillon and Axel Roebel. University of Crete and ...
A Time Regularization Technique for Discrete Spectral Envelopes Through Frequency Derivative

University of Crete Computer Science Department

Gilles Degottex*

Luc Ardaillon and Axel Roebel

with follow-up works with

University of Crete and FORTH, Heraklion, Greece

IRCAM, Paris, France

ABSTRACT

EVALUATION Compared methods Reference TE

• Shape reconstruction • Time regularity

20 10 0 −10

3000

2000

−20

0

0.05

0.1

0.15

0.2 Time [s]

0.25

0.3

0.35

0.4

Amplitude [dB]

Amplitude [dB]

−10

10 0

0

1000

2000

3000 4000 Frequency [Hz]

5000

6000

1000 2000 3000 Frequency [Hz]

20 10 0

1000 2000 3000 Frequency [Hz]

1000 2000 3000 Frequency [Hz]

4000

Reference SDCE−MFA

30 20 10

LETTER ]

4000

0

−0.05

1500

2000 2500 3000 Frequency [Hz]

4000

4500

5000

−35 −40 −45 −50

−0.5

−55

0

0

500

1000

1500

2000 2500 3000 Frequency [Hz]

3500

4000

4500

Model for the Discrete Cepstral Envelope (DCE):

2

1000 2000 3000 Frequency [Hz]

−3.8

−3.6

−3.4

4000

−3.2 −3 −2.8 −2.6 Quefrency [log10 s]

−2.4

−2.2

−2

Relative cepstral error (RC Error) M ∗ X cn − cm,n 1 ρn = c∗ M n

m=1

Critical band

1.6

Weak band

Safe band

Critical band

1.4

Weak band

1 0.8 0.6

−1.5

0.4

cn cos(n2πf /fs)

0.2

(1)

−2 −4

cn: the cepstral coefficients, P : the cepstral order.

−3 Quefrency [log10 ms]

Safe band

−2.5

Critical band

0 −4

−2

0.9

Weak band

(2)

k=1

k: the frame, ak : the log amplitudes, dk : correction term for the AM, uk = [1, . . . , 1]T , c: the cepstral coefficients, B: the Fourier basis.

−3.5

−3 Quefrency [log10 ms]

Safe band

Critical band

0.8

0.5 Cepstral Variance [log10]

kak − dk uk − Bk ck

−3.5

1

For MFA, Shiga et al. [4] suggested to minimize: ǫ=

Order for 323Hz

−1

n=1

K X

Order for 80Hz(~E2)

3

−2.5

−2

Weak band

0.7 0.6

0

RC Excess

E(f ) = c0 + 2

Weak band

1.2

SDCE-MFA[1,2] P X

7

4

5000

AC Error [log10]

−65

Safe band

Reference Harmonics dLI−MFA estimate

−60

(No need of frame pre-alignment!)

3500

−30

Log amplitude [dB]

4. Final alignment on the central frame (K − 1)/2.

dRn: Average of the dLn

0.05

3. Compute integral of the averaged derivated envelope.

Order for 800Hz(~G5)

5

RC Error

Log amplitude derivative

dLn of each frame n

1000

6

Relative Cepstral Excess (RC Excess): Risk of degeneration or Cepstral Variance: The capacity to incoherent resonances. reproduce the global variance: M X 1 M X χn = max({ρm,n, 1}) − 1 1 stdi(¯cn,i) c¯n,i = cm,n,i σ¯ n = M m=1 ∗ stdi(¯cn,i) M m=1 ∗ cn − cm,n ρ = m,n c∗ c¯n,i: the average cepstrum of sample i. n R ESULTS

0.1

500

5

Critical band

Safe band

0 −4

DL INEAR -MFA[ THIS

0

3 4 Quefrency [ms]

1

0

M : the number of frames c∗n: The known VTF.

−0.1

2

6

Extra issue: The frames’ energy follows the amplitude modulation (AM) of the signal (e.g. tremolo in singing voice) ⇒ How to align the frames ?

2. Averaging of K frames across time.

1

7

Absolute cepstral error (AC Error) M X 1 ǫn = |c∗n − cm,n| M m=1

1. Compute derivative of the linear envelope (on log scale).

0

M EASUREMENTS

The methods below use spectral peaks extracted on each frame (each 5ms).

Steps:

2

0

−10 0

Order for 80Hz(~E2)

Order for 323Hz

4

0

10

0

Reference Linear−MFA−LIFT

Order for 800Hz(~G5)

6

20

4000

−10

MFA METHODS

4000

−10

30

0

2500

20

1000 2000 3000 Frequency [Hz]

Reference Linear−MFA

30

Amplitude [dB]

3500

0

0

Reference DLinear−MFA

0

10

8

10

4000

−10

Amplitude [dB]

4000

20

Amplitude [dB]

Frequency [Hz]

4500

SFA sampling MFA sampling Underlying VTF hl=17th harmonic fl=3303Hz

30

Harmonics hl=17th harmonic fl=3303Hz Intermediate freq.

1000 2000 3000 Frequency [Hz]

30

Evaluation shown for singing voice [1,2] MFA size: 2 periods of vibrato ⇒ 400ms

Gaussian noise; VTF from digital acoustic synthesizer.

20

−10 0

Traditional approach: Single Frame Analysis (SFA) Why not using Multiple Frame Analysis (MFA) [4] ?!

freq. in [80, 800]Hz; AM component: low-passed filtered

Reference DCE

30 Amplitude [dB]

Amplitude [dB]

30

Amplitude [dB]

• Undersampled Vocal Tract Filter (VTF)

1000 synthetic signals of 2s; f0: vibrato with central

Amplitude [dB]

Issues in amplitude spectral envelope estimation:

5000

Material

TE=True Envelope (SFA) DCE=Discrete Cepstral Envelope (SFA)

0.5 0.4

−0.5 TE DCE DLinear−MFA Linear−MFA Linear−MFA−LIFT−UOF1.4 SDCE−MFA−UOF1.4

−1

−1.5 −4

−3.5

0.3 0.2 0.1

−3 Quefrency [log10 ms]

−2.5

−2

0 −4

−3.5

−3 Quefrency [log10 ms]

−2.5

−2

Steps:

• RC Error, safe band: MFA divides the error by ∼2 compared to SFA.

1. Compute:

• Critical and weak bands, cepstral Variance < 1 ⇒ averaging of the envelopes (without any statistical modeling!). The DLinear-MFA suffers the most. SDCE-MFA is the best.

c=

K X K X k=1

−1   BlT Bl · BkT ak

(3)

l=1

(No need of frame pre-alignment! Shown in [1,2])

• RC Excess: SFA produces substantial erratic shapes, even in the safe band. Much less problems for the MFA.

2. Final alignment on the central frame.

L ISTENING

We can increase the order ! (e.g. x1.4 in the following experiments)

Material: 1-3s of sustained vibrato; 15 French vowels; 2 voices (!) Method: Harmonic synthesis[6]

L INEAR -MFA-LIFT[1,2] MFA version of the basic linear interpolation, which has already been used for comparison [5]. We only add a low-pass liftering. Steps: 1. Pre-alignement of K successive frames using energy in [0-4]kHz. 2. Linear interpolation of all peaks of the K frames [5] ⇒ Linear-MFA.

TESTS

Pitch scaling: x1.25 and x0.75

Timbre conversion: from

Female singer Male singer

TE

Linear−MFA−LIFT

SDCE−MFA

SDCE−MFA

−1.5

−1

−0.5 0 0.5 1 Comparison Mean Opinion Score (CMOS)

Female singer Male singer

TE

Linear−MFA−LIFT

1.5

−1.5

to

−1

−0.5 0 0.5 1 Comparison Mean Opinion Score (CMOS)

1.5

3. Low-pass lifter of the Linear-MFA to alleviate erratic shapes.

• MFA clearly prefered to TE.

4. Final alignment on the central frame.

• Linear-MFA-LIFT: good results and efficient !

[1] G. Degottex, L. Ardaillon, A. Roebel, "Simple Multi Frame Analysis methods for estimation of Amplitude Spectral Envelope estimation in Singing Voice", in ICASSP, 2016. [2] G. Degottex, L. Ardaillon, A. Roebel, "Multi-Frame Amplitude Envelope Estimation for Modification of Singing Voice", IEEE Transactions on Audio, Speech, and Language Processing, Accepted 2016. [3] M. Campedel-Oudot, O. Cappe, E. Moulines "Estimation of the spectral envelope of voiced sounds using a penalized likelihood approach", IEEE Transactions on Speech and Audio Processing, 2001.

[4] Y. Shiga, S. King, "Estimating the spectral envelope of voiced speech using multi-frame analysis", EUROSPEECH, 2003. [5] T. Wang, T. Quatieri, "High-pitch formant estimation by exploiting temporal change of pitch", IEEE Transactions on Audio, Speech, and Language Processing, 2010. [6] G. Kafentzis, G. Degottex, O. Rosec, Y. Stylianou, "Pitch Modifications of speech based on an Adaptive Harmonic Model", in ICASSP, 2014.

* Now supported by Horizon2020 Marie Sklodowska-Curie Actions (MSCA) - Research Fellowship - 655764 - HQSTS

Suggest Documents