Discrete Spectral Envelopes Through Frequency Derivative. Gilles Degottex* with follow-up works with. Luc Ardaillon and Axel Roebel. University of Crete and ...
A Time Regularization Technique for Discrete Spectral Envelopes Through Frequency Derivative
University of Crete Computer Science Department
Gilles Degottex*
Luc Ardaillon and Axel Roebel
with follow-up works with
University of Crete and FORTH, Heraklion, Greece
IRCAM, Paris, France
ABSTRACT
EVALUATION Compared methods Reference TE
• Shape reconstruction • Time regularity
20 10 0 −10
3000
2000
−20
0
0.05
0.1
0.15
0.2 Time [s]
0.25
0.3
0.35
0.4
Amplitude [dB]
Amplitude [dB]
−10
10 0
0
1000
2000
3000 4000 Frequency [Hz]
5000
6000
1000 2000 3000 Frequency [Hz]
20 10 0
1000 2000 3000 Frequency [Hz]
1000 2000 3000 Frequency [Hz]
4000
Reference SDCE−MFA
30 20 10
LETTER ]
4000
0
−0.05
1500
2000 2500 3000 Frequency [Hz]
4000
4500
5000
−35 −40 −45 −50
−0.5
−55
0
0
500
1000
1500
2000 2500 3000 Frequency [Hz]
3500
4000
4500
Model for the Discrete Cepstral Envelope (DCE):
2
1000 2000 3000 Frequency [Hz]
−3.8
−3.6
−3.4
4000
−3.2 −3 −2.8 −2.6 Quefrency [log10 s]
−2.4
−2.2
−2
Relative cepstral error (RC Error) M ∗ X cn − cm,n 1 ρn = c∗ M n
m=1
Critical band
1.6
Weak band
Safe band
Critical band
1.4
Weak band
1 0.8 0.6
−1.5
0.4
cn cos(n2πf /fs)
0.2
(1)
−2 −4
cn: the cepstral coefficients, P : the cepstral order.
−3 Quefrency [log10 ms]
Safe band
−2.5
Critical band
0 −4
−2
0.9
Weak band
(2)
k=1
k: the frame, ak : the log amplitudes, dk : correction term for the AM, uk = [1, . . . , 1]T , c: the cepstral coefficients, B: the Fourier basis.
−3.5
−3 Quefrency [log10 ms]
Safe band
Critical band
0.8
0.5 Cepstral Variance [log10]
kak − dk uk − Bk ck
−3.5
1
For MFA, Shiga et al. [4] suggested to minimize: ǫ=
Order for 323Hz
−1
n=1
K X
Order for 80Hz(~E2)
3
−2.5
−2
Weak band
0.7 0.6
0
RC Excess
E(f ) = c0 + 2
Weak band
1.2
SDCE-MFA[1,2] P X
7
4
5000
AC Error [log10]
−65
Safe band
Reference Harmonics dLI−MFA estimate
−60
(No need of frame pre-alignment!)
3500
−30
Log amplitude [dB]
4. Final alignment on the central frame (K − 1)/2.
dRn: Average of the dLn
0.05
3. Compute integral of the averaged derivated envelope.
Order for 800Hz(~G5)
5
RC Error
Log amplitude derivative
dLn of each frame n
1000
6
Relative Cepstral Excess (RC Excess): Risk of degeneration or Cepstral Variance: The capacity to incoherent resonances. reproduce the global variance: M X 1 M X χn = max({ρm,n, 1}) − 1 1 stdi(¯cn,i) c¯n,i = cm,n,i σ¯ n = M m=1 ∗ stdi(¯cn,i) M m=1 ∗ cn − cm,n ρ = m,n c∗ c¯n,i: the average cepstrum of sample i. n R ESULTS
0.1
500
5
Critical band
Safe band
0 −4
DL INEAR -MFA[ THIS
0
3 4 Quefrency [ms]
1
0
M : the number of frames c∗n: The known VTF.
−0.1
2
6
Extra issue: The frames’ energy follows the amplitude modulation (AM) of the signal (e.g. tremolo in singing voice) ⇒ How to align the frames ?
2. Averaging of K frames across time.
1
7
Absolute cepstral error (AC Error) M X 1 ǫn = |c∗n − cm,n| M m=1
1. Compute derivative of the linear envelope (on log scale).
0
M EASUREMENTS
The methods below use spectral peaks extracted on each frame (each 5ms).
Steps:
2
0
−10 0
Order for 80Hz(~E2)
Order for 323Hz
4
0
10
0
Reference Linear−MFA−LIFT
Order for 800Hz(~G5)
6
20
4000
−10
MFA METHODS
4000
−10
30
0
2500
20
1000 2000 3000 Frequency [Hz]
Reference Linear−MFA
30
Amplitude [dB]
3500
0
0
Reference DLinear−MFA
0
10
8
10
4000
−10
Amplitude [dB]
4000
20
Amplitude [dB]
Frequency [Hz]
4500
SFA sampling MFA sampling Underlying VTF hl=17th harmonic fl=3303Hz
30
Harmonics hl=17th harmonic fl=3303Hz Intermediate freq.
1000 2000 3000 Frequency [Hz]
30
Evaluation shown for singing voice [1,2] MFA size: 2 periods of vibrato ⇒ 400ms
Gaussian noise; VTF from digital acoustic synthesizer.
20
−10 0
Traditional approach: Single Frame Analysis (SFA) Why not using Multiple Frame Analysis (MFA) [4] ?!
freq. in [80, 800]Hz; AM component: low-passed filtered
Reference DCE
30 Amplitude [dB]
Amplitude [dB]
30
Amplitude [dB]
• Undersampled Vocal Tract Filter (VTF)
1000 synthetic signals of 2s; f0: vibrato with central
Amplitude [dB]
Issues in amplitude spectral envelope estimation:
5000
Material
TE=True Envelope (SFA) DCE=Discrete Cepstral Envelope (SFA)
0.5 0.4
−0.5 TE DCE DLinear−MFA Linear−MFA Linear−MFA−LIFT−UOF1.4 SDCE−MFA−UOF1.4
−1
−1.5 −4
−3.5
0.3 0.2 0.1
−3 Quefrency [log10 ms]
−2.5
−2
0 −4
−3.5
−3 Quefrency [log10 ms]
−2.5
−2
Steps:
• RC Error, safe band: MFA divides the error by ∼2 compared to SFA.
1. Compute:
• Critical and weak bands, cepstral Variance < 1 ⇒ averaging of the envelopes (without any statistical modeling!). The DLinear-MFA suffers the most. SDCE-MFA is the best.
c=
K X K X k=1
−1 BlT Bl · BkT ak
(3)
l=1
(No need of frame pre-alignment! Shown in [1,2])
• RC Excess: SFA produces substantial erratic shapes, even in the safe band. Much less problems for the MFA.
2. Final alignment on the central frame.
L ISTENING
We can increase the order ! (e.g. x1.4 in the following experiments)
Material: 1-3s of sustained vibrato; 15 French vowels; 2 voices (!) Method: Harmonic synthesis[6]
L INEAR -MFA-LIFT[1,2] MFA version of the basic linear interpolation, which has already been used for comparison [5]. We only add a low-pass liftering. Steps: 1. Pre-alignement of K successive frames using energy in [0-4]kHz. 2. Linear interpolation of all peaks of the K frames [5] ⇒ Linear-MFA.
TESTS
Pitch scaling: x1.25 and x0.75
Timbre conversion: from
Female singer Male singer
TE
Linear−MFA−LIFT
SDCE−MFA
SDCE−MFA
−1.5
−1
−0.5 0 0.5 1 Comparison Mean Opinion Score (CMOS)
Female singer Male singer
TE
Linear−MFA−LIFT
1.5
−1.5
to
−1
−0.5 0 0.5 1 Comparison Mean Opinion Score (CMOS)
1.5
3. Low-pass lifter of the Linear-MFA to alleviate erratic shapes.
• MFA clearly prefered to TE.
4. Final alignment on the central frame.
• Linear-MFA-LIFT: good results and efficient !
[1] G. Degottex, L. Ardaillon, A. Roebel, "Simple Multi Frame Analysis methods for estimation of Amplitude Spectral Envelope estimation in Singing Voice", in ICASSP, 2016. [2] G. Degottex, L. Ardaillon, A. Roebel, "Multi-Frame Amplitude Envelope Estimation for Modification of Singing Voice", IEEE Transactions on Audio, Speech, and Language Processing, Accepted 2016. [3] M. Campedel-Oudot, O. Cappe, E. Moulines "Estimation of the spectral envelope of voiced sounds using a penalized likelihood approach", IEEE Transactions on Speech and Audio Processing, 2001.
[4] Y. Shiga, S. King, "Estimating the spectral envelope of voiced speech using multi-frame analysis", EUROSPEECH, 2003. [5] T. Wang, T. Quatieri, "High-pitch formant estimation by exploiting temporal change of pitch", IEEE Transactions on Audio, Speech, and Language Processing, 2010. [6] G. Kafentzis, G. Degottex, O. Rosec, Y. Stylianou, "Pitch Modifications of speech based on an Adaptive Harmonic Model", in ICASSP, 2014.
* Now supported by Horizon2020 Marie Sklodowska-Curie Actions (MSCA) - Research Fellowship - 655764 - HQSTS