2009 IEEE Workshop on Applications of Audio and Acoustics
October 18-21, 2009, New Paltz, NY
SOURCE ENUMERATION OF SPEECH MIXTURES USING PITCH HARMONICS
Keith D. Gilbert
Karen L. Payton
Electrical and Computer Engineering Dept. University of Massachusetts Dartmouth N. Dartmouth, MA 02747 USA
[email protected]
Electrical and Computer Engineering Dept. University of Massachusetts Dartmouth N. Dartmouth, MA 02747 USA
[email protected]
ABSTRACT
This paper proposes a method to simultaneously estimate the number, pitches, and relative locations of individual speech sources within instantaneous and non-instantaneous linear mixtures containing additive white Gaussian noise. The algorithm makes no assumptions about the number of sources or the number of sensors, and is therefore applicable to over-, under-, and precisely-determined scenarios. The method is hypothesis-based and employs a power-spectrum-based FIR filter derived from probability distributions of speech pitch harmonics. This harmonic windowing function (HWF) dramatically improves time-difference-of-arrival (TDOA) estimates over standard cross-correlation at low SNR. The pitch estimation component of the algorithm implicitly performs voiced-region detection and does not require prior knowledge about voicing. Cumulative pitch and TDOA estimates from the HWF form the basis for robust source enumeration across a wide range of SNR.
[Figure 1: Pitch histograms for a minute of a) male and b) female speech, each overlaid with Gaussian and Rayleigh PDFs. Horizontal axis: Frequency Bin (Hz).]

2. TIME-VARYING PITCH
To understand some of the source-specific issues, it is helpful to consider the simplified source-filter model of voiced speech. In this model, voiced speech corresponds to the sound emitted from a tube, representing the vocal tract, and the glottal source is modeled as a periodic, pulsatile excitation at one end. The tube has natural resonances, corresponding to formants, at frequencies that depend on the tube geometry. When the air column in this tube is driven at a repetition rate of f0, some of the harmonics of f0 fall in the general frequency range of the vocal tract resonances and are enhanced by the vocal tract filter. The frequency f0 is known as the fundamental frequency, or pitch, while the harmonics are integer multiples of f0, or fk = (k+1)f0, where k = 1, 2, 3, ... It should be noted that there are many forms of speech production, but the proposed enumeration algorithm is based solely on the voiced segments that fit this model [9],[10].

During conversational speech, a talker's pitch varies due to inflection, excitement, etc.; without this variation, the speech would be deemed monotone and dull. Figure 1 plots histograms of a) a male talker's and b) a female talker's pitch estimates corresponding to more than 1 minute of speech, overlaid with Gaussian and Rayleigh probability density functions (PDFs).
1. INTRODUCTION
Existing source enumeration (SE) algorithms can broadly be classified into two major areas: information-theoretic (e.g. [1],[2]) and matrix-decomposition-based (e.g. [3],[4]). The information-theoretic approach is inherently confined to the over-determined case, although Luengo et al. [5] have extended a minimum description length approach to handle the under-determined case. The matrix-decomposition-based approaches do, in general, provide direction-of-arrival (DOA) information, but neither the information-theoretic nor the matrix-decomposition methods provide simultaneous estimates of the number, pitch, and location of speech sources. The work presented here conceptually extends joint position-pitch (PoPi) estimation, e.g. [6],[7], to explicitly enumerate the estimated sources. The results generated by the harmonic windowing function (HWF) method reported herein are comparable to those of existing PoPi methods, although each method tackles the problem in a fundamentally different way. The method presented here draws from a number of research areas, and due to space limitations, an exhaustive overview is not possible in this paper. The reader is directed to [8] for a more complete review of the relevant topics. Motivation and background for source enumeration using the harmonic windowing function (SE-HWF) are presented below.
978-1-4244-3679-8/09/$25.00 ©2009 IEEE
Index Terms— Source enumeration, pitch harmonics, multi-pitch extraction, linear mixtures, real-time
3. HARMONIC WINDOWING FUNCTION
For the current work, the Gaussian distribution was chosen to represent the time-varying nature of a talker’s pitch since it is not an unrealistic representation, and it is the most tractable. It should be noted that nothing presented here would necessarily preclude the use of any other reasonably fitted distribution.
Using the random variable (R.V.) of pitch, F0 ~ N(f0, σ0), where σ0 is the standard deviation, and the relation between the fundamental frequency and its harmonics, a simple linear transformation of variables produces the kth harmonic R.V. as Fk ~ N(fk, σk) = N((k+1)f0, (k+1)σ0) for k = 1, 2, 3, ... Treating the components of this set of dependent R.V.s individually, frequency magnitude spectra can be constructed to mimic the shape of the probability distributions. Considering these PDFs as filters relaxes constraints; most notably, the peak of each individual power spectrum is set to one, as opposed to the area under the associated PDF curve equaling one. Moreover, the non-overlapping spectra can then be summed together, without loss of generality, to describe the probabilistic relation of pitch harmonics in the power frequency domain. If Hk(f,f0) is considered to be the underlying filter for the kth harmonic, then the harmonic filter, H(f,f0), for K harmonics plus the fundamental is:

$$H(f, f_0) = \sum_{k=0}^{K} H_k(f, f_0) \qquad (1)$$

From this, a harmonic windowing function (HWF) can be defined as the power spectrum of H(f,f0), namely PHH(f,f0):

$$\mathrm{HWF}(f, f_0) = P_{HH}(f, f_0) = |H(f, f_0)|^2 \qquad (2)$$

[Figure 2: The Harmonic Windowing Function (HWF) for two values of the fundamental, f0 (100 Hz and 225 Hz). Axes: log|HWF(f)| (dB) versus Frequency (Hz).]
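As a rough illustration of Eqs. (1) and (2), the following Python sketch builds an HWF on a frequency grid. It is not the authors' implementation: it samples unit-peak Gaussian shapes directly in the frequency domain rather than designing linear-phase FIR filters, and the function name, the value of sigma0, and the grid spacing are assumptions made only for this example. The 50 Hz to 1600 Hz limits follow the range recommended in the text.

```python
import numpy as np

def harmonic_windowing_function(freqs, f0, sigma0, num_harmonics,
                                f_lo=50.0, f_hi=1600.0):
    """Sketch of HWF(f, f0) = |sum_k H_k(f, f0)|^2 for k = 0..K.

    Each H_k is a Gaussian-shaped magnitude response with unit peak,
    centred at (k+1)*f0 with spread (k+1)*sigma0 (hypothetical helper).
    """
    freqs = np.asarray(freqs, dtype=float)
    h = np.zeros_like(freqs)
    for k in range(num_harmonics + 1):       # k = 0 is the fundamental
        center = (k + 1) * f0
        spread = (k + 1) * sigma0
        # Peak set to one, rather than unit area as for a PDF.
        h += np.exp(-0.5 * ((freqs - center) / spread) ** 2)
    hwf = np.abs(h) ** 2                      # Eq. (2): power spectrum of H
    # Restrict the window to the band recommended in the text.
    hwf[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return hwf

# Example: 1 Hz grid, f0 = 100 Hz, assumed sigma0 = 2 Hz, K = 5 harmonics.
freqs = np.arange(0.0, 2001.0, 1.0)
hwf = harmonic_windowing_function(freqs, f0=100.0, sigma0=2.0, num_harmonics=5)
```

Multiplying a frame's power spectrum by `hwf` (sampled on the same frequency grid) and summing gives a single score for the pitch hypothesis f0, which is one simple way such a window could be used to isolate harmonic energy.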
One of the benefits of using the Gaussian distribution is that these individual power spectra can be created quite readily using linear-phase FIR filters; two examples are shown in Fig. 2. Once constructed, the harmonic windowing function can be used to isolate pitch harmonics in the power spectrum of running speech. To reduce the effects of formants and acoustic attenuation, the overall frequency range of the HWF should be chosen to be between 50 Hz and 1600 Hz.

4. SOURCE ENUMERATION USING THE HWF (SE-HWF)
A linear mixture model consisting of P sensors and Q sources with additive noise is defined as [11]:

$$x_p[n] = \sum_{q=1}^{Q} \sum_{k=0}^{M-1} h_{qp}[k]\, s_q[n-k] + \nu[n] \qquad (3)$$

where xp[n] is the signal received at the pth sensor, hqp is the channel filter between the qth source and the pth sensor, M is the length of hqp, sq is the qth source, ν[n] is additive noise, and n is the time index.
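The mixture model of Eq. (3) can be simulated with a short Python sketch such as the one below. The function name, the array layout, and the choice of independent white Gaussian noise at each sensor are assumptions made for illustration; sources are taken to be zero before n = 0.

```python
import numpy as np

def mix_sources(sources, channels, noise_std=0.0, seed=0):
    """Sketch of Eq. (3): convolutive mixing of Q sources onto P sensors.

    sources  : array of shape (Q, N), one row per source s_q[n]
    channels : array of shape (Q, P, M), channels[q, p] is h_qp[k]
    Returns an array of shape (P, N) holding x_p[n].
    """
    sources = np.asarray(sources, dtype=float)
    channels = np.asarray(channels, dtype=float)
    num_sources, num_samples = sources.shape
    num_sensors = channels.shape[1]
    rng = np.random.default_rng(seed)
    x = np.zeros((num_sensors, num_samples))
    for p in range(num_sensors):
        for q in range(num_sources):
            # Inner sum over k: FIR filtering of s_q by h_qp, truncated to N samples.
            x[p] += np.convolve(sources[q], channels[q, p])[:num_samples]
        # Additive white Gaussian noise (assumed independent per sensor).
        x[p] += noise_std * rng.standard_normal(num_samples)
    return x
```

An instantaneous mixture corresponds to M = 1, while a pure inter-sensor delay can be modeled with a channel filter that is a shifted unit impulse.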
For simplicity, the initial description of the SE-HWF method will make two assumptions: 1) the mixture model will contain only two mixtures (P=2), comprised of Q sources (not restricted to 2); 2) the sources will be spatially stationary. A further implicit assumption in this discussion is that the pth mixture, xp[n], is the ith frame of the pth mixture (in a block processing context) consisting of the current L samples of the mixture windowed with an L-point rectangular window. This generalization is necessary, as the ability to overlap successive windows can improve pitch-tracking accuracy over time. Also, a rectangular window is used, since it provides the narrowest passband for a given window length. SE-HWF is a four-step process: 1) identify promising pitch candidates, 2) isolate valid harmonic structures, 3) spatially locate the valid harmonics, and 4) enumerate the sources.

4.1. Simple Pitch Candidates
In order to utilize the HWF formulated in Sect. 3, an initial set of pitch candidates needs to be ascertained. Using the power spectrum of the pth mixture, Ppp(f), a hypothesis test can be formulated by determining peaks in the spectrum that are greater than some threshold, as:
H1 max Ppp ( f ) P
>
= − ln(1 − γ ) μ