A NOVEL FILTERING-BASED F0 ESTIMATION ALGORITHM WITH AN APPLICATION TO VOICE CONVERSION

Nirmesh J. Shah1, Pramod B. Bachhav2 and Hemant A. Patil1
1 Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, India-382007
2 EURECOM, France
Email: {nirmesh88_shah,hemant_patil}@daiict.ac.in, [email protected]

INTRODUCTION

EFFECTIVENESS OF F0 ESTIMATION

SPEECH ANALYSIS/SYNTHESIS

• Objective evaluation: Perceptual Evaluation of Speech Quality (PESQ).

• Proposed algorithm:
– Glottal Closure Instants (GCIs) [1].
– Fundamental frequency (F0).
• Eliminates the assumption that F0 is constant over a short period.
• Free from spectral leakage issues related to windowing.
• State-of-the-art algorithms:
– Yet Another Algorithm for Pitch Tracking (YAAPT) [2].

Figure 5: PESQ scores of VC systems with 95 % confidence interval.

– Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT) [3].

MEAN OPINION SCORE (MOS)

– Pitch Detection Algorithm (PDA) [4].
• Case study: Voice Conversion (VC).

PROPOSED F0 ESTIMATION ALGORITHM

Figure 3: (a) A speech utterance, “Author of the danger trail, Phillips steels, etc.,” from a male speaker, and F0 contours extracted using (b) the proposed method, (c) YAAPT, (d) STRAIGHT, and (e) PDA. The vertical window indicates that the proposed algorithm captures F0 even in a shorter voiced region.

• The proposed F0 estimation algorithm is able to estimate F0 even in short voiced regions.

Figure 6: MOS scores for developed VC systems with 95 % confidence interval from 20 listeners (14 males and 6 females with no known hearing impairments).
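The 95 % confidence intervals reported with the MOS scores can be illustrated with a minimal sketch. The ratings below are made-up placeholders, not the poster's data. The function name `mean_with_ci` and the use of a normal approximation (rather than, say, a Student-t interval) are assumptions made for illustration only.

```python
import math
import statistics


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * sem


# Hypothetical 1-5 ratings from 20 listeners (placeholder values only).
ratings = [4, 3, 4, 5, 3, 4, 4, 3, 4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 4]
mos, half_width = mean_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With only 20 listeners, a Student-t critical value would give a slightly wider interval. The poster does not state which interval was used.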

VOICE CONVERSION

SUMMARY AND CONCLUSIONS

• CMU-ARCTIC Corpus: BDL-RMS, BDL-SLT, CLB-RMS and CLB-SLT speaker-pairs.

• We propose a novel F0 estimation algorithm that uses simple lowpass filtering and peak picking.

• 100 training utterances.
• Joint Density Gaussian Mixture Model (JDGMM)-based VC [5].

• The proposed algorithm is able to track F0 even for low and short-duration voiced segments.

• 48 VC systems developed, covering each speaker-pair and each F0 estimation algorithm.

• A VC application was selected to measure the effectiveness of the proposed method.

• Mel Log Spectrum Approximation (MLSA) analysis/synthesis framework.

• It is free from block processing artifacts.
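The idea of lowpass filtering followed by peak picking, named in the summary, can be sketched on a synthetic signal. This is not the authors' exact procedure: the moving-average filter, its length, the synthetic three-harmonic test signal, and all function names are assumptions chosen to keep the example self-contained. It shows how one F0 value per pitch period can be read from the spacing of successive peaks, without fixed analysis frames or windowing.

```python
import math


def moving_average(x, length):
    """Simple FIR lowpass: mean of each run of `length` consecutive samples."""
    return [sum(x[i:i + length]) / length for i in range(len(x) - length + 1)]


def pick_peaks(y):
    """Indices of positive local maxima (first sample of a flat crest)."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > 0 and y[i - 1] < y[i] >= y[i + 1]]


def f0_from_peaks(peaks, fs):
    """One F0 value per period: sampling rate divided by peak-to-peak spacing."""
    return [fs / (b - a) for a, b in zip(peaks, peaks[1:])]


fs = 8000          # sampling rate in Hz (assumed)
true_f0 = 125.0    # fundamental of the synthetic voiced sound (assumed)
n_samples = 1600   # 0.2 s of signal

# Synthetic "voiced" signal: fundamental plus two weaker harmonics.
signal = [
    math.sin(2 * math.pi * true_f0 * n / fs)
    + 0.5 * math.sin(2 * math.pi * 2 * true_f0 * n / fs)
    + 0.3 * math.sin(2 * math.pi * 3 * true_f0 * n / fs)
    for n in range(n_samples)
]

# Two cascaded moving averages suppress the harmonics above the fundamental.
smoothed = moving_average(moving_average(signal, 32), 32)
peaks = pick_peaks(smoothed)
f0_track = f0_from_peaks(peaks, fs)
print(f0_track[:5])
```

Because each estimate comes from a single peak-to-peak interval, no assumption of constant F0 within a frame is needed. Real speech would require a filter designed for the expected F0 range and a voiced/unvoiced decision, neither of which is covered here.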

ABX TEST


Figure 4: Analysis of ABX tests, namely (a) proposed vs. YAAPT, (b) proposed vs. PDA, and (c) proposed vs. STRAIGHT, with 95 % confidence interval from 20 listeners (14 males and 6 females with no known hearing impairments).

Figure 1: The proposed algorithm for F0 estimation. After [1].

• F0 estimation procedure at various steps.
• Pitch halving and doubling.
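Pitch halving and doubling are octave errors, where an estimate jumps to roughly half or twice the true F0. The poster only names the problem, so the following is a generic illustration, not the authors' correction step. The median-based rule, the thresholds, and the name `fix_octave_errors` are assumptions.

```python
import statistics


def fix_octave_errors(f0, low=0.65, high=1.6):
    """Pull values near half or double the median voiced F0 back by one octave.

    Unvoiced frames are marked with 0 and left untouched.
    """
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return list(f0)
    ref = statistics.median(voiced)
    corrected = []
    for v in f0:
        if v <= 0:
            corrected.append(v)        # unvoiced: keep as is
        elif v > high * ref:
            corrected.append(v / 2)    # likely pitch doubling
        elif v < low * ref:
            corrected.append(v * 2)    # likely pitch halving
        else:
            corrected.append(v)
    return corrected


# Hypothetical contour in Hz with one doubled (244) and one halved (60) value.
contour = [0, 120, 122, 244, 121, 60, 119, 0]
fixed = fix_octave_errors(contour)
print(fixed)
```

A global median suits short utterances with a narrow F0 range. A running median over neighbouring periods would be a natural variant for longer or more expressive speech.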

SELECTED REFERENCES

1. P. B. Bachhav, H. A. Patil, and T. B. Patel, “A novel filtering based approach for epoch extraction,” in International Conference on Acoust., Speech and Signal Process. (ICASSP), Australia, 2015, pp. 4784–4788.
2. S. A. Zahorian and H. Hu, “A spectral/temporal method for robust fundamental frequency tracking,” The J. of the Acoust. Soc. of Amer. (JASA), vol. 123, no. 6, pp. 4559–4571, 2008.
3. H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
4. Y. Medan, E. Yair, and D. Chazan, “Super resolution pitch determination of speech signals,” IEEE Trans. on Signal Proc., vol. 39, no. 1, pp. 40–48, 1991.
5. T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. on Audio, Speech and Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.

ACKNOWLEDGEMENTS

• We acknowledge MeitY, Govt. of India and authorities of DA-IICT, Gandhinagar.
• We also thank all the listeners who took part in the subjective evaluation.

Figure 2: (a) A segment of the speech utterance, (b) detected epochs, (c) F0 contours obtained after step 2, (d) F0 contours obtained after step 3, and (e) F0 contours obtained after step 4 and step 5.

• This poster was presented by Prof. Hemant A. Patil at the 9th IEEE Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (ASC), Kuala Lumpur, Malaysia, Dec. 12-15, 2017.
