IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 3, NO. 1, JANUARY 1995



Markov Model-Based Phoneme Class Partitioning for Improved Constrained Iterative Speech Enhancement John H. L. Hansen and Levent M. Arslan

Abstract- Research has shown that degrading acoustic background noise influences speech quality across phoneme classes in a nonuniform manner. This results in variable quality performance of many speech enhancement algorithms in noisy environments. A phoneme classification procedure is proposed which directs single-channel constrained speech enhancement. The procedure performs broad phoneme class partitioning of noisy speech frames using a continuous mixture hidden Markov model recognizer in conjunction with a perceptually motivated cost-based decision process. Once noisy speech frames are identified, iterative speech enhancement based on all-pole parameter estimation with inter- and intra-frame spectral constraints is employed. The phoneme class-directed enhancement algorithm is evaluated using TIMIT speech data and shown to result in substantial improvement in objective speech quality over a range of signal-to-noise ratios and individual phoneme classes.

Manuscript received June 15, 1993; revised May 25, 1994. This work was supported in part by National Science Foundation Grant No. NSF-IRI90-10536. The associate editor coordinating the review of this paper and approving it for publication was Dr. Xuedong Huang. The authors are with the Robust Speech Processing Laboratory, Department of Electrical Engineering, Duke University, Durham, NC 27708-0291 USA. IEEE Log Number 9406777.

I. INTRODUCTION

Speech enhancement can be described as the processing of speech signals to improve perceptual features associated with quality, intelligibility, and/or listener fatigue. The need for enhancement arises when speech is produced in a noisy acoustic environment, where human-to-human, human-to-machine (speech recognition), or human-to-machine-to-human (voice coding) communication is desired. In this study, it is assumed that i) only the degraded speech signal is available, and ii) the noise is additive and uncorrelated with the speech signal. There has been renewed interest in speech enhancement in recent years, due in part to improved speech production and auditory modeling methods. Several reviews can be found in the literature [3], [9], [5]. In general, three broad methods have emerged: i) short-time spectral amplitude estimation (spectral subtraction), ii) model-based optimal filtering (Wiener filtering), and iii) adaptive noise canceling. The enhancement approach considered here is an extension of a previously formulated class of constrained iterative methods proposed by Hansen and Clements [6], [7]. These methods employ sequential

maximum a posteriori (MAP) estimation of the speech waveform and speech modeling parameters, followed by the application of inter- and/or intra-frame spectral constraints between iterations. Constrained enhancement approaches impose intra- and inter-frame vocal tract spectral constraints on the input speech signal to ensure more speech-like formant trajectories, reduce frame-to-frame pole jitter, and effectively introduce a relaxation parameter into the iterative scheme. Two methods in particular are of interest in the present study. The first is based on spectral constraints applied to the autocorrelation lags across iterations (Auto:I), followed by further spectral constraints based on the line spectral pair (LSP) transformation of the LPC parameters across time (LSP:T) for fixed frame sizes (FF); hence the notation (FF, Auto:I, LSP:T). This method proved to be the most successful of the class of methods considered [7]. An extension was also considered in which LSP spectral constraints were applied on a variable-frame (VF) basis across time (VF, Auto:I, LSP:T). The idea was to apply similar speech spectrum constraints across longer speech segments. This approach, while desirable in theory, produced inconsistent results as signal-to-noise ratio (SNR) decreased, whereas fixed-frame processing provided consistent quality improvement across a much broader SNR range. The loss in performance was attributed to the inability of the segmentation algorithm to perform reliably as SNR decreased.

Separate from the issue of speech enhancement is the impact of additive background noise on speech quality across phonemes. For the problem of single-channel speech enhancement, it is assumed that background noise is additive and stationary across an input utterance. However, since speech energy varies significantly across phoneme classes, the local SNR, as well as the impact of noise distortion on speech quality, will vary locally.
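An inter-frame LSP:T-style constraint of the kind described above can be sketched as a weighted temporal smoothing of each frame's line spectral pair vector. The triangular weights below are an illustrative assumption, not the paper's actual constraint weighting:

```python
import numpy as np

def smooth_lsp_across_time(lsp_frames, weights=(0.25, 0.5, 0.25)):
    """Illustrative inter-frame (LSP:T-style) spectral constraint:
    replace each frame's line spectral pair (LSP) vector with a weighted
    average over its temporal neighbors, suppressing frame-to-frame pole
    jitter. The triangular weights are an assumption."""
    lsp = np.asarray(lsp_frames, dtype=float)        # shape (n_frames, order)
    padded = np.pad(lsp, ((1, 1), (0, 0)), mode="edge")
    w0, w1, w2 = weights
    smoothed = w0 * padded[:-2] + w1 * padded[1:-1] + w2 * padded[2:]
    # LSP frequencies must remain strictly increasing within each frame
    # for a stable all-pole filter; sorting is a simple safeguard here.
    return np.sort(smoothed, axis=1)
```

Because the weights sum to one, a stationary LSP trajectory passes through unchanged, while frame-to-frame jitter is attenuated.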
For example, vowel sections are not distorted as much as transitional and plosive sounds for a given background acoustic noise level. Therefore, for speech enhancement to be successful, it must address the nonuniform effect that noise has on speech across phoneme classes. Improvements in speech enhancement could be achieved if additional a priori speech information is known or estimated prior to the enhancement process to address the variable impact of noise. For example, in one study [4], a hidden Markov model (HMM) recognizer is used to select an improved all-pole Wiener filter for enhancement. This approach can provide improvements if reliable recognition is achieved on a frame-by-frame basis. One possible limitation, however, is that prior speech data is required for training for each speaker, leading to a possibly speaker-dependent enhancement approach. Another alternative is to employ a frame-based vector quantization-directed approach [11], where a formant distance measure is used to select a noise-free entry from a vector quantizer codebook for enhancement. Since the codebook can be trained with multiple speakers, requiring less training data than that needed for an HMM, it can provide further quality improvement versus traditional spectral subtraction or Wiener filtering. Both methods seek to further incorporate a priori knowledge of speech or speaker characteristics. These methods could, in theory, suffer from the same limitation experienced in variable-frame (VF, Auto:I, LSP:T) constrained enhancement: if an error is made in the decision process, significant distortion can be introduced during enhancement. Given that noise affects speech quality in a nonuniform manner, and that a priori estimation of speech characteristics prior to enhancement could potentially improve the enhancement process, an improved speech classification method which directs enhancement is

1063-6676/94$04.00 © 1994 IEEE


[Fig. 1 block diagram: noisy speech enters a continuous mixture hidden Markov model training algorithm; the forward algorithm computes P(X|λ_i), i = 1, 2, ..., 13, for the phoneme class models; IF P(X|λ_opt)/P(X|λ_second) > THRESHOLD, THEN choose PH(i_opt) as the phoneme class corresponding to the frame, ELSE do not classify (i.e., "don't care"); the corresponding constraints are then applied during constrained enhancement, producing the enhanced speech.]

Fig. 1. Framework for the classification-directed (CP, Auto:I, LSP:T) enhancement algorithm.

proposed in this paper. Once noisy speech frames are identified as one of several broad phoneme classes, each frame is enhanced using an appropriate set of constraints in the (FF, Auto:I, LSP:T) algorithm for each phoneme class. This study is organized as follows. In Section II, the phoneme class partitioning (CP)-based constrained enhancement (CP, Auto:I, LSP:T) algorithm is formulated. Particular attention is paid to the cost decision criterion. Next, Section III presents classifier and enhancement evaluations. Detailed results are presented for a single sentence, together with overall performance using a subset of the TIMIT [1] database. Finally, conclusions are summarized in Section IV.
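The cost-based classification decision sketched in Fig. 1 can be illustrated as a likelihood-ratio test over per-class HMM scores. The margin value below is an assumed parameter, not the threshold used in the paper:

```python
import numpy as np

def classify_frame(log_likelihoods, margin=2.0):
    """Sketch of the cost-based decision step: given per-class HMM
    log-likelihoods log P(X | lambda_i) from the forward algorithm,
    accept the best-scoring phoneme class only when it beats the
    runner-up by `margin` (a log likelihood-ratio test); otherwise
    return None, i.e., a "don't care" frame that falls back to
    generic enhancement constraints. `margin` is an assumption."""
    ll = np.asarray(log_likelihoods, dtype=float)
    order = np.argsort(ll)[::-1]          # class indices, best first
    best, runner_up = order[0], order[1]
    if ll[best] - ll[runner_up] > margin:
        return int(best)
    return None
```

Rejecting ambiguous frames in this way trades classification coverage for robustness: a misclassified frame would receive the wrong constraint set, which is precisely the failure mode noted for the earlier decision-directed methods.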

II. ALGORITHM FORMULATION: CLASS PARTITIONING AND ENHANCEMENT

The proposed enhancement algorithm (Fig. 1) is based on a speech class partitioning scheme which directs a spectrally constrained iterative enhancement algorithm. Processing steps are discussed in the following sections.

A. (FF, Auto:I, LSP:T) Constrained Enhancement

Consider a noise-corrupted speech vector $\vec{y}$. It is assumed that the input speech signal can be modeled by a set of all-pole model parameters $\vec{a}$ and gain $g$. A sequential maximum a posteriori (MAP) estimate of the clean speech $\vec{s}$ is obtained given the noisy input speech $\vec{y}$, followed by MAP estimation of the model parameters $\vec{a}$ given $\vec{s}_0$, where $\vec{s}_0$ is the result of the first MAP estimation; all unknowns are assumed to be Gaussian distributed. The process iterates between the following two MAP estimation steps:

i) $\max_{\vec{s}} \; p(\vec{s} \mid \vec{a}_0, g; \vec{y})$

ii) $\max_{\vec{a}} \; p(\vec{a} \mid \vec{s}_0, g; \vec{y})$
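A minimal, unconstrained sketch of the two MAP estimation steps above (in the spirit of the iterative Wiener procedure that the constrained methods extend) alternates all-pole re-estimation from the current speech estimate with Wiener filtering of the noisy frame under that model. The model order, iteration count, and spectral scaling below are assumptions for illustration:

```python
import numpy as np

def lpc_autocorr(x, order=10):
    """All-pole model via the Levinson-Durbin recursion on
    autocorrelation lags; returns coefficients and residual energy."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])) / err
        a[1 : i + 1] += k * np.r_[a[i - 1 : 0 : -1], 1.0]
        err *= 1.0 - k * k
    return a, err

def iterative_map_enhance(noisy, noise_var, order=10, iterations=3):
    """Alternate the two MAP-style steps: (ii) fit an all-pole spectrum
    to the current clean-speech estimate, then (i) Wiener-filter the
    noisy frame with that spectrum. This is the unconstrained version;
    the paper applies Auto:I / LSP:T constraints between iterations."""
    n = len(noisy)
    s = noisy.copy()
    for _ in range(iterations):
        a, g2 = lpc_autocorr(s, order)                # step ii: model update
        A = np.fft.rfft(a, n)                         # A(e^jw) on the FFT grid
        pss = g2 / np.maximum(np.abs(A) ** 2, 1e-12)  # all-pole speech PSD estimate
        H = pss / (pss + noise_var * n)               # per-bin Wiener gain
        s = np.fft.irfft(H * np.fft.rfft(noisy), n)   # step i: waveform update
    return s
```

Each pass sharpens the model toward the speech-dominant spectral regions; the spectral constraints of Section II-A are what keep this iteration from diverging toward overly peaked formants.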
