DETECTING PHASE COUPLING IN SPEECH SIGNALS

J W A Fackrell and S McLaughlin (see footnote 1)
The linear models for speech production have now been in use for so long that it is sometimes almost forgotten that they do not necessarily completely describe the speech signal. Recent work in Higher Order Statistics (HOS) has provided tools which now enable us to extract more information from a speech signal than was previously possible using spectral-based methods. This information might be useful in the development of improved systems for speech synthesis, coding or recognition. In fact there have already been a number of papers published presenting the application of HOS algorithms to various speech problems, but to date none has comprehensively answered the question "What are the HOS properties of speech signals?" This paper describes ongoing work which aims to answer this question. In particular, the paper describes the application of a bicoherence-based Quadratic Phase Coupling (QPC) detector to a small set of sustained vowel sounds.
Introduction

Recent interest in Higher Order Statistics (HOS) has resulted in several applications of HOS techniques in the fields of speech and image processing. In particular, several papers have introduced new HOS-based algorithms in speech recognition [1-3], speech coding [4,5] and speech pathology detection [6]. However, to date there has been no attempt to comprehensively describe exactly what the HOS properties of speech signals are, and where these properties originate in the speech production process. It is crucial that such an investigation is carried out, as this will provide justification for the unqualified claims often made in papers which use HOS methods for speech analysis. Amongst these claims are

"Third order statistics of speech signals are not identically zero....due to quadratic harmonic coupling produced in the vocal tract" [2]

"..the asymmetry in the probability density function of the voice signal has motivated the user of third-order HOS in speech analysis" [5]

This paper deals with two HOS quantities of interest: the bispectrum and the bicoherence. These quantities have different interpretations for different signal types (stochastic and deterministic signals), and these will be treated separately at first. This leads on to the consideration of the form the bicoherence will take for real speech sounds. This paper concentrates on voiced speech sounds, since these present the most difficulty concerning the interpretation of the HOS measures. Finally some preliminary results are presented.
Definitions

It is assumed that the discrete speech signal $x(n)$, $n = 0, 1, \ldots, N_{tot}$, is quasi-stationary over a frame of length $N_s$ points, provided that $N_s$ is not too large. This stationarity assumption permits us to estimate stochastic quantities by using time averages, and its suitability will be discussed below. The bispectrum can be defined in two ways: as the triple product of the DFTs, or as the double DFT of the third order cumulant $C(m_1, m_2) = E[x(n)\,x(n+m_1)\,x(n+m_2)]$ [7]. Here we consider only the frequency domain representation,
$$B(k,l) \triangleq X(k)\,X(l)\,X^*(k+l) \tag{1}$$
Some interesting properties of the bispectrum are described below.

Footnote 1: Signals and Systems Group, Dept of Electrical Engineering, University of Edinburgh, Edinburgh, EH9 3JL, UK. (c) 1995 The Institution of Electrical Engineers. IEE Colloquium Digest 1995/091 on Speech and Image Processing (London, England), pp. 4/1-4/8, 2 May 1995.
The bispectrum has two frequency indices, $k$ and $l$. This means that it is plotted on a plane. The main region of interest in this plane is an isosceles triangle called the Inner Triangle (IT), which is bounded by the lines $l = 0$, $k = l$ and $k + l = f_s/2$ [8], as shown in Figure 1. The term bifrequency is used to describe a frequency pair $(k, l)$. The bispectrum is complex, and as such contains Fourier magnitude and phase information. The bispectral content at $(k, l)$ depends on the DFT $X(k) = |X(k)|\,e^{j\phi(k)}$ at the three frequencies $k$, $l$ and $k + l$. In particular the following relations follow from equation 1.
$$B(k,l) = |B(k,l)|\,e^{j\phi(k,l)}$$
$$|B(k,l)| = |X(k)|\,|X(l)|\,|X(k+l)|$$
$$\phi(k,l) = \phi(k) + \phi(l) - \phi(k+l)$$

The term biphase is used to describe $\phi(k,l)$, the phase of the bispectrum.

In this paper sustained speech sounds are assumed to conform to a broad signal model. According to this signal model any voiced speech sound will consist of deterministic components $d(n)$, stochastic components $s(n)$ and a measurement noise component $v(n)$, as described by the following equation
$$x(n) = d(n) + s(n) + v(n) \tag{2}$$
The bispectrum has different interpretations for stochastic and deterministic signals, and these will now be treated separately, before considering mixed signals. For both signal types it will be shown that there is a useful HOS measure called the bicoherence which is derived from the bispectrum.
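As a quick numerical check of the magnitude and biphase relations that follow from equation 1, the sketch below evaluates the single-frame bispectrum of a synthetic frame at one bifrequency. The test frame (three cosines at DFT bins 5, 9 and 14 with arbitrarily chosen phases), the frame length N = 64 and the helper names `dft_bin` and `bispectrum` are illustrative assumptions, not part of the analysis described in this paper.

```python
import cmath
import math


def dft_bin(x, k):
    """DFT coefficient X(k) of the sequence x, computed by direct summation."""
    n_pts = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))


def bispectrum(x, k, l):
    """Single-frame bispectrum B(k, l) = X(k) X(l) X*(k + l) of equation 1."""
    return dft_bin(x, k) * dft_bin(x, l) * dft_bin(x, k + l).conjugate()


# Hypothetical test frame: unit-amplitude cosines at bins 5, 9 and 14.
N = 64
phases = {5: 0.4, 9: 1.1, 14: -0.7}   # bin index -> phase in radians (arbitrary)
frame = [sum(math.cos(2 * math.pi * k * n / N + p) for k, p in phases.items())
         for n in range(N)]

B = bispectrum(frame, 9, 5)           # bifrequency (k, l) = (9, 5), so k + l = 14
magnitude = abs(B)                    # |X(9)| |X(5)| |X(14)| = (N / 2) ** 3
biphase = cmath.phase(B)              # phi(9) + phi(5) - phi(14) = 1.1 + 0.4 + 0.7

print(f"|B| = {magnitude:.1f}, biphase = {biphase:.3f} rad")
```

Each unit-amplitude cosine at an exact bin centre contributes a DFT coefficient of magnitude N/2, so the bispectrum magnitude is (N/2)^3 and the biphase is the sum of the two lower phases minus the phase at the sum frequency.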
The Bicoherence of Stochastic Signals

Stochastic signals can be described by expectations, averages and statistical properties. Signals with a Gaussian pdf are completely characterised by their mean and variance (i.e. the first two moments), but many stochastic signals have non-Gaussian pdfs, and hence non-zero higher-order moments. Just as the power spectrum is a measure of the signal variance, so the bispectrum is a measure of the signal skewness $\gamma_3$ (the third-order moment). It is common [9,10] to use a segment-averaging method to obtain a consistent estimate of the bispectrum using the following estimator
$$\hat{B}(k,l) = \sum_{i=1}^{K} X_i(k)\,X_i(l)\,X_i^*(k+l) \tag{3}$$
where the quasi-stationary time series section of length $N_s$ is divided into $K$ segments (or frames) each of length $N$, where $N$ is the DFT size. However, the variance of this bispectral estimate is proportional to the power spectrum of the signal [8]. This means that the estimate in equation 3 is sensitive to second-order statistics (i.e. signal variance) as well as to third-order statistics of the signal. To get round this problem a normalisation scheme is often used to form the bicoherence $b(k,l)$, which is estimated using [9]

$$\hat{b}^2(k,l) = \frac{\left|\sum_i X_i(k)\,X_i(l)\,X_i^*(k+l)\right|^2}{\sum_i \left|X_i(k)\,X_i(l)\right|^2 \; \sum_i \left|X_i(k+l)\right|^2} \tag{4}$$

The properties of this estimated quantity are described extensively elsewhere [8] (see footnote 2), but the main properties of interest here are

- $0 \le b^2(k,l) \le 1$ [11].
- The sum of the squared bicoherence over the IT is approximately proportional to the skewness $\gamma_3$ of the signal [8] (see footnote 3).

Footnote 2: Hinich actually uses a slightly different normalisation called the skewness function [8], but in practice both the bicoherence and the skewness take similar values.
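The segment-averaged estimator of equation 3 and the normalisation of equation 4 can be sketched as follows. This is a minimal illustration rather than the estimator used for the results in this paper: the DFT size N = 16, the number of segments K = 500, the bifrequency (3, 2), the use of non-overlapping rectangular-windowed segments and the two synthetic white-noise inputs (one Gaussian, one exponential with its mean removed) are all assumptions chosen for the example, and `dft_bin` and `squared_bicoherence` are hypothetical helper names.

```python
import cmath
import math
import random


def dft_bin(x, k):
    """DFT coefficient X(k) of the sequence x, computed by direct summation."""
    n_pts = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))


def squared_bicoherence(x, n_fft, k, l):
    """Squared bicoherence of equation 4 at the bifrequency (k, l).

    The series x is cut into len(x) // n_fft non-overlapping segments.
    """
    triple_sum = 0j    # sum_i X_i(k) X_i(l) X_i*(k+l), the estimate of equation 3
    pair_power = 0.0   # sum_i |X_i(k) X_i(l)|^2
    sum_power = 0.0    # sum_i |X_i(k+l)|^2
    for start in range(0, len(x) - n_fft + 1, n_fft):
        seg = x[start:start + n_fft]
        xk, xl, xkl = dft_bin(seg, k), dft_bin(seg, l), dft_bin(seg, k + l)
        triple_sum += xk * xl * xkl.conjugate()
        pair_power += abs(xk * xl) ** 2
        sum_power += abs(xkl) ** 2
    return abs(triple_sum) ** 2 / (pair_power * sum_power)


rng = random.Random(0)   # fixed seed so that the run is repeatable
N, K = 16, 500           # assumed DFT size and number of segments
gaussian = [rng.gauss(0.0, 1.0) for _ in range(N * K)]
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(N * K)]   # zero mean, skewed pdf

b2_gaussian = squared_bicoherence(gaussian, N, 3, 2)
b2_skewed = squared_bicoherence(skewed, N, 3, 2)
print(f"Gaussian noise: {b2_gaussian:.3f}   skewed noise: {b2_skewed:.3f}")
```

Both values necessarily lie between 0 and 1. For the symmetric Gaussian input the estimate should come out close to zero, while the skewed input should give a clearly larger value, in line with the skewness property listed above.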
The Bicoherence of Deterministic Signals

The bicoherence of a deterministic signal is a useful measure of Quadratic Phase Coupling (QPC), which in turn is a strong indication of nonlinearities in the signal production mechanism. For example, a signal formed by passing a simple summation of sinusoids through a quadratic nonlinearity (such as a simple squarer) exhibits QPC. It is easy to show that for such a signal, the bispectrum (and bicoherence) magnitude will peak and the biphase will be zero at a bifrequency related to the frequencies of the signal components. This property forms the basis of the bicoherence's use as a QPC detector [9,11-13].

In some early papers [3,9] only the magnitude of the bicoherence was used as a detector of QPC. This detector only works if a phase randomization (PR) assumption can be made about the signal. If the PR assumption cannot be made, then the phase information should also be used as part of the QPC detector [13,14]. We believe that the PR assumption is not applicable to speech signals, which means that some previously published results [3] do not, in fact, conclusively show that speech exhibits nonlinearities.

Given that the PR assumption cannot be made, it is relevant to ask "What does the magnitude of the bicoherence tell us?". If the signal is purely deterministic, then its bicoherence will take a value of unity at all bifrequencies where there is signal energy [7]. It is then the biphase which carries the useful information pertaining to the presence of QPC. Although theory dictates that no averaging is necessary in the calculation of the bicoherence of deterministic signals, in practice averaging procedures are advisable because they reduce the influence of (ever-present) background noise.
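The argument above can be illustrated numerically. In the sketch below, two noise-free deterministic signals are analysed without any phase randomisation: in the first, the component at the sum frequency is produced by a squarer and is therefore phase coupled; in the second, a component with an unrelated phase is added at the same frequency. All numerical choices (N = 64, a hop of 37 samples between segments, bins 9 and 5, the phases and the 0.5 gain) are assumptions made for the example, and `complex_bicoherence` is a hypothetical helper that normalises the averaged bispectrum so that its squared magnitude matches equation 4.

```python
import cmath
import math


def dft_bin(x, k):
    """DFT coefficient X(k) of the sequence x, computed by direct summation."""
    n_pts = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))


def complex_bicoherence(x, n_fft, hop, k, l):
    """Averaged bispectrum at (k, l), normalised so |result|^2 is equation 4."""
    triple_sum, pair_power, sum_power = 0j, 0.0, 0.0
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = x[start:start + n_fft]
        xk, xl, xkl = dft_bin(seg, k), dft_bin(seg, l), dft_bin(seg, k + l)
        triple_sum += xk * xl * xkl.conjugate()
        pair_power += abs(xk * xl) ** 2
        sum_power += abs(xkl) ** 2
    return triple_sum / math.sqrt(pair_power * sum_power)


N, HOP, TOTAL = 64, 37, 1024        # assumed DFT size, segment hop, record length
K1, K2 = 9, 5                       # component bins; K1 + K2 = 14 lies below N / 2
PHI1, PHI2, PHI3 = 0.4, 1.1, 0.5    # PHI3 is deliberately not PHI1 + PHI2


def tone(k, phase, n):
    """Unit-amplitude cosine at DFT bin k."""
    return math.cos(2 * math.pi * k * n / N + phase)


base = [tone(K1, PHI1, n) + tone(K2, PHI2, n) for n in range(TOTAL)]
# The squarer creates a component at bin K1 + K2 whose phase is PHI1 + PHI2.
coupled = [s + 0.5 * s * s for s in base]
# Same frequencies, but the sum-frequency component has an unrelated phase.
uncoupled = [s + 0.5 * tone(K1 + K2, PHI3, n) for n, s in enumerate(base)]

b_coupled = complex_bicoherence(coupled, N, HOP, K1, K2)
b_uncoupled = complex_bicoherence(uncoupled, N, HOP, K1, K2)
for name, b in (("coupled", b_coupled), ("uncoupled", b_uncoupled)):
    print(f"{name}: magnitude = {abs(b):.3f}, biphase = {cmath.phase(b):.3f} rad")
```

In both cases the magnitude is unity, so the magnitude alone cannot separate the two signals. The biphase is zero for the squarer output and PHI1 + PHI2 - PHI3 = 1.0 rad for the uncoupled signal, which is why the biphase is needed when the PR assumption does not hold.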
Footnote 3: It would be more accurate to use the Hinich skewness function rather than the bicoherence here, since its summed magnitude is more closely proportional to the signal skewness, but the Hinich skewness function does not have the useful property of being upper-bounded by unity [11].

Footnote 4: In fact it is a test for signal symmetry rather than Gaussianity.

Figure 1: Principal Domain of the discrete Bispectrum/Bicoherence showing the Inner Triangle (IT) and the Outer Triangle (OT). [Plot not reproduced; axis limits are marked 0 and $f_s/2$.]

Figure 2: Complex bicoherence plane - QPC is accepted if a bicoherence point falls in the shaded region. [Plot not reproduced; the Real and Imag axes are marked at -1, 0 and 1, and the shaded region is labelled "accept QPC".]

Table 1: Test words used in database collection with approximate phonemic transcriptions for British English in IPA and SAM-PA.

index  word    IPA    SAMPA
1      heat    /i/    i
2      hart    /ɑ/    A
3      hut     /ʌ/    V
4      hit     /ɪ/    I
5      hood    /ʊ/    U
6      hurt    /ɜ/    3
7      head    /ɛ/    E
8      hot     /ɒ/    Q
9      hoot    /u/    u
10     hate    /eɪ/   eI
11     hat     /a/    {
12     caught  /ɔ/    O

The Bicoherence of Speech Signals

It has been argued above that the bicoherence magnitude contains useful information about the skewness of stochastic signals, and that the biphase contains useful information about QPC in deterministic signals. As speech signals consist of a mixture of stochastic and deterministic components, it is necessary to develop a bicoherence analysis scheme which accommodates both signal types. A semi-empirical two-stage QPC test which uses the biphase is described in [12]. This test has been shown [13] to perform well in conditions of low SNR where the background noise $v(n)$ is Gaussian. The test is carried out as follows.

1. Estimate the complex square root of equation 4

$$\hat{b}(k,l) = \frac{\sum_i X_i(k)\,X_i(l)\,X_i^*(k+l)}{\left[\sum_i \left|X_i(k)\,X_i(l)\right|^2 \; \sum_i \left|X_i(k+l)\right|^2\right]^{1/2}} \tag{5}$$

2. At each bifrequency carry out the following two-stage test.

   - Test for significant magnitude: this is Hinich's "Gaussianity" test [8] (see footnote 4), modified slightly for the bicoherence [11]. If $\hat{b}^2(k,l) > c$, where $c$ is the critical threshold for a central $\chi^2(2)$ variable at a chosen level of significance, then reject symmetry and carry out the second part of the test. If $\hat{b}^2(k,l) < c$ then the signal is stochastic symmetric and thus its
bispectrum contains no useful information.
   - Test for zero biphase: this is the QPC test [12]. The biphase is estimated from equation 5 using $\hat{\phi}(k,l) = \arctan \Im[\hat{b}(k,l)]$