NEURAL DUAL EXTENDED KALMAN FILTERING: APPLICATIONS IN SPEECH ENHANCEMENT AND MONAURAL BLIND SIGNAL SEPARATION

Eric A. Wan and Alex T. Nelson
Department of Electrical Engineering, Oregon Graduate Institute, P.O. Box 91000, Portland, OR 97291

Abstract

The removal of noise from speech signals has applications ranging from speech enhancement for cellular communications to front ends for speech recognition systems. A nonlinear time-domain method called dual extended Kalman filtering (DEKF) is presented for removing nonstationary and colored noise from speech. We further generalize the algorithm to perform the blind separation of two speech signals from a single recording.

INTRODUCTION

Traditional approaches to noise removal in speech involve spectral techniques, which frequently result in audible distortion of the signal. Recent time-domain nonlinear filtering methods utilize data sets where the clean speech is available as a target signal to train a neural network. Such methods are often effective within the training set, but tend to generalize poorly for actual sources with varying signal and noise levels. Furthermore, the network models in these methods do not fully take into account the nonstationary nature of speech. In the approach presented here, we assume the availability of only the noisy signal. Effectively, a sequence of neural networks is trained on the specific noisy speech signal of interest, resulting in a nonstationary model which can be used to remove noise from the given signal. A noisy speech signal y(k) can be accurately modeled as a nonlinear autoregression with both process and additive observation noise:

\[
x(k) = f(x(k-1), \ldots, x(k-M); \mathbf{w}) + v(k), \quad (1)
\]
\[
y(k) = x(k) + n(k), \quad (2)
\]

where x(k) corresponds to the true underlying speech signal driven by process noise v(k), and f(·) is a nonlinear function of past values of x(k) parameterized by the weight vector w. The speech is only assumed to be stationary over short segments, with each segment having a different model. The available observation is y(k), which contains additive noise n(k). The optimal estimator given the noisy observations Y(k) = {y(k), y(k-1), ..., y(0)} is E[x(k) | Y(k)]. The most direct way to estimate this would be to train on a set of clean data in which the true x(k) may be used as the target to a neural network. Our assumption, however, is that the clean speech is never available; the goal is to estimate x(k) itself from the noisy measurements y(k) alone. In order to solve this problem, we assume that f(·;·) is in the class of feedforward neural network models, and compute the dual estimation of both the states x̂ and the weights ŵ based on a Kalman filtering approach. In this paper we provide a basic description of the algorithm, followed by a discussion of experimental results.
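To make the model concrete, the following is a minimal sketch of the nonlinear autoregression f(·; w) as a small feedforward network. The layer sizes match the networks used in the experiments later in the paper (10 inputs, 4 hidden units, 1 output); the class interface and initialization scale are our own assumptions, not the authors' implementation.

```python
import numpy as np

class NeuralAR:
    """f(x(k-1), ..., x(k-M); w): one tanh hidden layer, linear output."""

    def __init__(self, M=10, H=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((H, M))  # input-to-hidden weights
        self.b1 = np.zeros(H)
        self.W2 = 0.1 * rng.standard_normal(H)       # hidden-to-output weights
        self.b2 = 0.0

    def predict(self, x_lags):
        """One-step prediction of x(k) from the M previous samples."""
        h = np.tanh(self.W1 @ x_lags + self.b1)
        return self.W2 @ h + self.b2
```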

DUAL EXTENDED KALMAN FILTERING

By posing the dual estimation problem in a state-space framework, we can use Kalman filtering methods to perform the estimation in an efficient, recursive manner. At each time point, the Kalman filter provides an optimal estimate by combining a prior prediction with a new observation. Connor et al. [4] proposed using an extended Kalman filter with a neural network to perform state estimation alone. Puskorius and Feldkamp [13] and others have posed the weight estimation in a state-space framework to allow for efficient Kalman training of a neural network. In prior work, we extended these ideas to include the dual Kalman estimation of both states and weights for efficient maximum-likelihood optimization (in the context of robust nonlinear prediction, estimation, and smoothing) [15]. The work presented here develops these ideas in the context of speech processing. A state-space formulation of (1) and (2) is as follows:

\[
\mathbf{x}(k) = F[\mathbf{x}(k-1)] + B v(k), \quad (3)
\]
\[
y(k) = C \mathbf{x}(k) + n(k), \quad (4)
\]

where

\[
\mathbf{x}(k) = \begin{bmatrix} x(k) \\ x(k-1) \\ \vdots \\ x(k-M+1) \end{bmatrix}, \qquad
F[\mathbf{x}(k)] = \begin{bmatrix} f(x(k), \ldots, x(k-M+1); \mathbf{w}) \\ x(k) \\ \vdots \\ x(k-M+2) \end{bmatrix},
\]
\[
C = [\,1 \;\; 0 \;\; \cdots \;\; 0\,], \qquad B = C^T. \quad (5)
\]

If the model is linear, then f(x(k)) takes the form w^T x(k), and F[x(k)] can be written as A x(k), where A is in controllable canonical form. We initially assume the noise terms v(k) and n(k) are white with known variances σ_v^2 and σ_n^2, respectively. Methods for estimating the noise variances directly from the noisy data are described later in this paper.
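For the linear special case just mentioned, the canonical-form matrices can be written down directly; a minimal sketch (the function name and interface are ours):

```python
import numpy as np

def linear_state_space(w):
    """Matrices for the linear case f(x) = w^T x, so that
    x(k) = A x(k-1) + B v(k) and y(k) = C x(k) + n(k)."""
    M = len(w)
    A = np.zeros((M, M))
    A[0, :] = w                 # top row holds the AR coefficients
    A[1:, :-1] = np.eye(M - 1)  # sub-diagonal shifts the lag vector down
    C = np.zeros((1, M))
    C[0, 0] = 1.0
    B = C.T
    return A, B, C
```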

Extended Kalman Filter - State Estimation

For a linear model with known parameters, the Kalman filter (KF) algorithm can be readily used to estimate the states [9]. At each time step, the filter computes the linear least-squares estimate x̂(k) and prediction x̂⁻(k), as well as their error covariances P_x̂(k) and P_x̂⁻(k). In the linear case with Gaussian statistics, the estimates are the minimum mean-square-error estimates. With no prior information on x, they reduce to the maximum-likelihood estimates. When the model is nonlinear, the KF cannot be applied directly, but requires a linearization of the nonlinear model at each time step. The resulting algorithm is called the extended Kalman filter (EKF), and effectively approximates the nonlinear function with a time-varying linear one. The EKF algorithm is as follows:

\[
\begin{aligned}
\hat{\mathbf{x}}^-(k) &= F[\hat{\mathbf{x}}(k-1); \hat{\mathbf{w}}(k-1)] && (6)\\
P_{\hat{x}}^-(k) &= A P_{\hat{x}}(k-1) A^T + B \sigma_v^2 B^T,
\quad \text{where } A = \left.\frac{\partial F[\hat{\mathbf{x}}; \hat{\mathbf{w}}]}{\partial \hat{\mathbf{x}}}\right|_{\hat{\mathbf{x}}(k-1)} && (7)\\
K(k) &= P_{\hat{x}}^-(k) C^T \left( C P_{\hat{x}}^-(k) C^T + \sigma_n^2 \right)^{-1} && (8)\\
P_{\hat{x}}(k) &= \left( I - K(k) C \right) P_{\hat{x}}^-(k) && (9)\\
\hat{\mathbf{x}}(k) &= \hat{\mathbf{x}}^-(k) + K(k)\left( y(k) - C \hat{\mathbf{x}}^-(k) \right). && (10)
\end{aligned}
\]
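A direct NumPy transcription of (6)-(10) might look as follows. The Jacobian routine jac_F and all names are ours, and a production implementation would add numerical safeguards (e.g., symmetrizing the covariance):

```python
import numpy as np

def ekf_state_step(x_hat, P, y_k, F, jac_F, B, C, var_v, var_n):
    """One EKF recursion for the state filter, following (6)-(10).
    F(x) evaluates F[x]; jac_F(x) returns A = dF/dx at x."""
    # Time update, (6)-(7)
    x_pred = F(x_hat)
    A = jac_F(x_hat)
    P_pred = A @ P @ A.T + var_v * (B @ B.T)
    # Measurement update, (8)-(10); the innovation variance is scalar here
    S = C @ P_pred @ C.T + var_n
    K = P_pred @ C.T / S
    x_new = x_pred + (K * (y_k - C @ x_pred)).ravel()
    P_new = (np.eye(len(x_hat)) - K @ C) @ P_pred
    return x_new, P_new, K
```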

Dual Extended Kalman Filter - Weight Estimation

Because the model for the speech is not known, the standard EKF algorithm cannot be applied directly. We approach this problem by constructing a separate state-space formulation for the underlying weights as follows:

\[
\mathbf{w}(k) = \mathbf{w}(k-1), \quad (11)
\]
\[
y(k) = f(\mathbf{x}(k-1); \mathbf{w}(k)) + v(k) + n(k), \quad (12)
\]

where the state transition is simply an identity matrix, and the neural network f(x(k-1); w(k)) plays the role of a time-varying nonlinear observation on w. These state-space equations for the weights allow us to estimate them with a second EKF.

\[
\begin{aligned}
\hat{\mathbf{w}}^-(k) &= \hat{\mathbf{w}}(k-1) && (13)\\
P_{\hat{w}}^-(k) &= P_{\hat{w}}(k-1) && (14)\\
K_{\hat{w}}(k) &= P_{\hat{w}}^-(k) H(k)^T \left( H(k) P_{\hat{w}}^-(k) H(k)^T + \sigma_n^2 + \sigma_v^2 \right)^{-1} && (15)\\
P_{\hat{w}}(k) &= \left( I - K_{\hat{w}}(k) H(k) \right) P_{\hat{w}}^-(k) && (16)\\
\hat{\mathbf{w}}(k) &= \hat{\mathbf{w}}^-(k) + K_{\hat{w}}(k)\left( y(k) - C F(\hat{\mathbf{x}}(k-1); \hat{\mathbf{w}}^-(k)) \right), && (17)
\end{aligned}
\]

where

\[
H(k) = \left. C \frac{\partial F[\hat{\mathbf{x}}; \hat{\mathbf{w}}]}{\partial \hat{\mathbf{w}}} \right|_{\hat{\mathbf{w}}(k-1)}. \quad (18)
\]

The linearization in (18) can be computed as a dynamic derivative [16] to account for the recurrent nature of the state-estimation filter, including the dependence of the Kalman gain K(k) on the weights. The calculation of these derivatives is computationally expensive, and can be avoided by ignoring the dependence of x̂(k) on ŵ.¹ This approximation was used to produce the results in this paper. The use of the full derivatives is currently being investigated by the authors.

¹ This is equivalent to a single step of backpropagation through time [16].
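Under this simplification, H(k) reduces to the gradient of the network output with respect to the weights, and the weight update (13)-(18) can be sketched as below (names and interface ours):

```python
import numpy as np

def ekf_weight_step(w_hat, Pw, y_k, x_prev, f, grad_f_w, var_v, var_n):
    """One weight-filter recursion following (13)-(18), ignoring the
    dependence of x_hat on w (one-step backpropagation through time)."""
    # Time update, (13)-(14): random-walk weights, covariance carried over
    w_pred, Pw_pred = w_hat, Pw
    # H(k): since C selects the first component of F, only the network's
    # output gradient with respect to the weights survives
    H = grad_f_w(x_prev, w_pred)                    # shape (n_weights,)
    S = H @ Pw_pred @ H + var_n + var_v             # innovation variance, (15)
    K = Pw_pred @ H / S                             # Kalman gain, (15)
    w_new = w_pred + K * (y_k - f(x_prev, w_pred))  # measurement update, (17)
    Pw_new = Pw_pred - np.outer(K, H) @ Pw_pred     # covariance update, (16)
    return w_new, Pw_new
```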

[Figure 1: block diagram of the two coupled filters. EKF1 alternates a time update (x̂(k-1) → x̂⁻(k)) with a measurement update (→ x̂(k)) for the states; EKF2 performs the corresponding updates for the weight estimate ŵ(k). Both measurement updates are driven by the observation y(k).]

Figure 1: The Dual Extended Kalman Filter (DEKF). EKF1 and EKF2 represent the filters for the states and the weights, respectively.

We now have EKFs for estimating both the states x and the weights w, resulting in a pair of dual extended Kalman filters (DEKF) run in parallel (see Figure 1). At each time step, the current estimate of x is used by the weight filter, and the current estimate of w is used by the state filter. For finite data sets, the algorithm is run iteratively over the data until the weights converge.
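Putting the two recursions together, one pass of the DEKF over a window of noisy samples could be organized as follows, reusing the two step functions sketched above. The initialization constants and the epoch cap are assumptions (the paper reports convergence in under 20 epochs):

```python
import numpy as np

def dekf(y, F, jac_F, f, grad_f_w, w0, B, C, var_v, var_n, n_epochs=20):
    """Dual EKF over a finite window y, iterated until the weights settle.
    F(x, w) and jac_F(x, w) drive the state filter; f(x, w) and
    grad_f_w(x, w) drive the weight filter."""
    M = C.shape[1]
    w_hat = w0.copy()
    Pw = 0.1 * np.eye(len(w0))          # initial weight covariance (assumed)
    for _ in range(n_epochs):
        x_hat, P = np.zeros(M), np.eye(M)
        x_est = np.zeros(len(y))
        for k in range(len(y)):
            # The weight filter uses the current state estimate ...
            w_hat, Pw = ekf_weight_step(w_hat, Pw, y[k], x_hat,
                                        f, grad_f_w, var_v, var_n)
            # ... and the state filter uses the current weight estimate.
            x_hat, P, _ = ekf_state_step(x_hat, P, y[k],
                                         lambda x: F(x, w_hat),
                                         lambda x: jac_F(x, w_hat),
                                         B, C, var_v, var_n)
            x_est[k] = x_hat[0]         # cleaned sample is the top state entry
    return x_est, w_hat
```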

This approach to dual estimation is related to work done by Nelson [12] in the linear case, and to Matthews' neural approach [11] to the recursive prediction error algorithm [7].² In the speech literature, the method is most closely related to Lim and Oppenheim's approach to fitting LPC models to degraded speech [10]. It also relates to Ephraim's model-based approach [6], but uses nonlinear estimation to fit the given data instead of using a fixed number of prespecified linear models.

Nonstationary White Noise Experiment

The result of applying the DEKF to a speech signal corrupted with simulated nonstationary bursting noise is shown in Figure 2. The method was applied to successive 64 ms (512-point) windows of the signal, with a new window starting every 8 ms (64 points).³ Feedforward networks with 10 inputs, 4 hidden units, and 1 output were used. Weights typically converged in less than 20 epochs. The results in the figure were computed assuming both σ_v^2 and σ_n^2 were known. The average SNR is improved by 9.94 dB, with little resultant distortion.

² An alternative approach is to concatenate both x and w into a joint state vector, and apply the EKF to the resulting nonlinear state equations (see [7] for the linear case, and [17] for an application to recurrent neural networks). This algorithm, however, has been known to have convergence problems.

³ A normalized Hamming window was used to emphasize data in the center of the window and de-emphasize data in the periphery. The standard EKF equations are also modified to reflect this windowing in the weight estimation.
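The outer loop over overlapping, Hamming-weighted windows (see footnote 3) might be organized as follows; the overlap-add recombination of the per-window estimates is our assumption about how the segments are merged, not a detail given in the paper:

```python
import numpy as np

def windowed_denoise(y, run_dekf, win=512, hop=64):
    """Apply a per-window DEKF and recombine the overlapping estimates.
    run_dekf(segment) should return the cleaned segment, e.g. via dekf()."""
    out = np.zeros(len(y))
    norm = np.zeros(len(y))
    ham = np.hamming(win)
    for start in range(0, len(y) - win + 1, hop):
        x_est = run_dekf(y[start:start + win])
        out[start:start + win] += ham * x_est   # weight toward window centers
        norm[start:start + win] += ham
    return out / np.maximum(norm, 1e-12)        # normalize the overlap
```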

[Figure 2: four panels showing the clean speech, the noise, the noisy speech, and the cleaned speech.]

Figure 2: Cleaning Noisy Speech With The DEKF. The speech data was approximately 33,000 points (4 seconds) long. Nonstationary white noise was generated artificially and added to the speech to create the noisy signal y.

We also ran the experiment with σ_n^2 and σ_v^2 estimated using only the noisy signal. This also produced impressive results, with an SNR improvement of 8.50 dB. In comparison, available "state-of-the-art" techniques of spectral subtraction [3] and adaptive RASTA processing [1] achieve SNR improvements of only 0.65 and 1.26 dB, respectively.

Colored Noise

For most real-world speech applications, we cannot assume the noise is white. For colored noise, the state-space equations (3) and (4) need to be adjusted before Kalman filtering techniques can be employed. Specifically, the measurement noise process is given its own state-space equations,

\[
\mathbf{n}(k) = A_n \mathbf{n}(k-1) + B_n v_n(k), \quad (19)
\]
\[
n(k) = C_n \mathbf{n}(k), \quad (20)
\]

where \mathbf{n}(k) is a vector of lagged values of n(k), v_n(k) is white noise, A_n is a simple state transition matrix in controllable canonical form, and B_n and C_n are of the same form as B and C given in (3) and (4). Note that this is equivalent to an autoregressive model of the colored noise, which may be fit from a small section of signal where the speech is not present. With this formulation for the colored noise, it is straightforward to augment both the state x(k) and the weight w(k) with n(k), and write down combined state equations. Specifically, (3) and (4) are replaced by:

\[
\begin{bmatrix} \mathbf{x}(k) \\ \mathbf{n}(k) \end{bmatrix}
= \begin{bmatrix} F[\mathbf{x}(k-1)] \\ A_n \mathbf{n}(k-1) \end{bmatrix}
+ \begin{bmatrix} B & 0 \\ 0 & B_n \end{bmatrix}
\begin{bmatrix} v(k) \\ v_n(k) \end{bmatrix}, \quad (21)
\]
\[
y(k) = [\,C \;\; C_n\,] \begin{bmatrix} \mathbf{x}(k) \\ \mathbf{n}(k) \end{bmatrix}, \quad (22)
\]

and (11) and (12) are replaced by:

\[
\begin{bmatrix} \mathbf{w}(k) \\ \mathbf{n}(k) \end{bmatrix}
= \begin{bmatrix} I & 0 \\ 0 & A_n \end{bmatrix}
\begin{bmatrix} \mathbf{w}(k-1) \\ \mathbf{n}(k-1) \end{bmatrix}
+ \begin{bmatrix} 0 \\ B_n \end{bmatrix} v_n(k), \quad (23)
\]
\[
y(k) = f(\mathbf{x}(k-1); \mathbf{w}(k)) + C_n \mathbf{n}(k) + v(k). \quad (24)
\]

The noise processes in these state equations are now white, and the DEKF algorithm can be used to estimate the signal. Note that the colored noise explicitly affects not only the state estimation, but also the weight estimation.

[Figure 3: three panels showing the clean speech, the noisy speech, and the cleaned speech.]

Figure 3: Removing Colored Noise With The DEKF. The speech data is 3,500 points long. An actual recording of stationary colored noise was added to the speech to create the noisy signal y.

An actual recording of cellular phone noise was added to a speech signal to produce the data shown in Figure 3. The noise was modeled as an AR(10) process to determine A_n and σ_{v_n}^2, using a segment of the data (512 points) where no speech is present. The results in the figure reflect assumed knowledge of σ_v^2, and showed an average SNR improvement of 5.77 dB. When σ_v^2 is estimated, the SNR improvement is 5.71 dB. In this case, spectral subtraction and adaptive RASTA processing produced comparable SNR improvements of 4.87 dB and 5.27 dB, respectively.
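Fitting such an AR noise model from a speech-free segment is a standard Yule-Walker computation; the sketch below is a textbook construction rather than the authors' exact procedure:

```python
import numpy as np

def fit_ar_noise(n_seg, p=10):
    """Fit an AR(p) model to a speech-free noise segment via Yule-Walker,
    returning A_n in controllable canonical form and the innovation variance."""
    N = len(n_seg)
    c = np.correlate(n_seg, n_seg, mode="full") / N
    r = c[N - 1:N + p]                       # autocorrelation at lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:p + 1])       # AR coefficients
    var_vn = r[0] - a @ r[1:p + 1]           # innovation (v_n) variance
    A_n = np.zeros((p, p))
    A_n[0, :] = a                            # canonical form, as for A above
    A_n[1:, :-1] = np.eye(p - 1)
    return A_n, var_vn
```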

MONAURAL BLIND SIGNAL SEPARATION

If the additive noise is colored and highly nonstationary, the distinction between what is signal and what is noise becomes somewhat arbitrary. For this reason, we consider the observation y(k) = x(k) + n(k) to represent the addition of two signals, y(k) = x_1(k) + x_2(k). We simply treat the noise itself as an additional signal that must be estimated. This is a form of blind signal separation, in which, for example, the signals result from the mixing of speakers. The problem differs from recent methods in the literature [2] for blind signal separation in which M signals must be separated from M observations by learning a fixed "inverse weighting" matrix. Instead, we are interested in separating two or more signals from a single (monaural) observation. Previous work on monaural signal separation has primarily been based on harmonic selection and pitch tracking in the frequency domain [5]. In contrast, we estimate each signal by learning a set of short-term models which best separate the signals. Extension of the DEKF framework is straightforward. Specifically, we formulate state equations containing \mathbf{x}_1(k) and \mathbf{x}_2(k):

\[
\begin{bmatrix} \mathbf{x}_1(k) \\ \mathbf{x}_2(k) \end{bmatrix}
= \begin{bmatrix} F_1[\mathbf{x}_1(k-1)] \\ F_2[\mathbf{x}_2(k-1)] \end{bmatrix}
+ \begin{bmatrix} B_1 & 0 \\ 0 & B_2 \end{bmatrix}
\begin{bmatrix} v_1(k) \\ v_2(k) \end{bmatrix}, \quad (25)
\]
\[
y(k) = [\,C_1 \;\; C_2\,] \begin{bmatrix} \mathbf{x}_1(k) \\ \mathbf{x}_2(k) \end{bmatrix}. \quad (26)
\]

These equations are analogous to the colored noise formulation (21) and (22), where \mathbf{n} is replaced by \mathbf{x}_2, which has the corresponding nonlinear model F_2 with parameters \mathbf{w}_2. To estimate the weight parameters \mathbf{w}_1 for the neural network F_1 associated with \mathbf{x}_1, we form the state equations

\[
\begin{bmatrix} \mathbf{w}_1(k) \\ \mathbf{x}_2(k) \end{bmatrix}
= \begin{bmatrix} \mathbf{w}_1(k-1) \\ F_2[\mathbf{x}_2(k-1)] \end{bmatrix}
+ \begin{bmatrix} 0 \\ B_2 \end{bmatrix} v_2(k), \quad (27)
\]
\[
y(k) = f_1(\mathbf{x}_1(k-1); \mathbf{w}_1(k)) + C_2 \mathbf{x}_2(k) + v_1(k), \quad (28)
\]

where the state \mathbf{x}_2 is included so that the associated noise process is white (compare to (23) and (24)). Finally, for estimating the second set of weight parameters \mathbf{w}_2 for F_2 associated with \mathbf{x}_2, we add a third set of state equations

\[
\begin{bmatrix} \mathbf{w}_2(k) \\ \mathbf{x}_1(k) \end{bmatrix}
= \begin{bmatrix} \mathbf{w}_2(k-1) \\ F_1[\mathbf{x}_1(k-1)] \end{bmatrix}
+ \begin{bmatrix} 0 \\ B_1 \end{bmatrix} v_1(k), \quad (29)
\]
\[
y(k) = f_2(\mathbf{x}_2(k-1); \mathbf{w}_2(k)) + C_1 \mathbf{x}_1(k) + v_2(k). \quad (30)
\]
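Structurally, the joint model of (25)-(26) is a block composition of the two per-signal models; a minimal sketch of that composition (interface of our own design), which could then be fed to the state filter sketched earlier:

```python
import numpy as np

def stacked_model(F1, F2, B1, B2, C1, C2, M1):
    """Compose two per-signal models into the joint system of (25)-(26).
    M1 is the dimension of the first signal's state x1."""
    def F(z):                                   # z = [x1; x2] stacked
        return np.concatenate([F1(z[:M1]), F2(z[M1:])])
    B = np.block([[B1, np.zeros((B1.shape[0], B2.shape[1]))],
                  [np.zeros((B2.shape[0], B1.shape[1])), B2]])
    C = np.hstack([C1, C2])                     # y(k) = [C1 C2] [x1; x2]
    return F, B, C
```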

It is straightforward to show that, given a known (linear) model for each signal, both x_1(k) and x_2(k) are observable from the additive observation y(k). However, showing that the model parameters may be jointly learned from the observations alone is a much more difficult problem. Nevertheless, some simple preliminary experiments have been performed which indicate the potential of this approach. Figure 4 illustrates the blind separation of two segments of speech (male /s/ and female /eI/) which have been added together and then separated in this manner. While encouraging, the approach should be viewed as only a starting point for a model-based framework for approaching blind signal separation problems.

[Figure 4: three panels showing a segment of the combined (monaural) recording, the blind estimate of the male speech, and the blind estimate of the female speech.]

Figure 4: Blind Separation. The top panel shows the combined signal supplied to the algorithm. In the bottom two panels, the original speech is shown with a dashed line, and the estimated signal with a solid line. 100 data points are shown.

ESTIMATING NOISE VARIANCES

In the implementation of the DEKF, it is assumed that the variances of v(k) and n(k) in (3) and (4) are known quantities. In practical applications, however, the noise variances must be estimated from the noisy data. We have investigated several approaches for doing this in the speech processing domain.

Additive Noise Statistics: Assuming stationarity of the additive noise, the noise variance σ_n^2 (or its full autocorrelation) may be estimated from segments of the data y(k) that do not contain speech. For slow variations in the noise statistics, Hirsch [8] has proposed an approach based on histograms of spectral magnitudes which does not require explicit segmentation of the data into speech and non-speech segments. For rapidly changing noise (e.g., background chatter, wind noise, and artifacts introduced by automatic gain control), we are interested in short-term estimates of the noise statistics for each window of noisy speech data. An approach we have developed for nonstationary white noise sources re-estimates σ_n^2 as follows. First, we note that the optimal weights for the linear estimator

\[
\hat{x}(k) = \sum_{i=0}^{M} w_i\, y(k-i) = \mathbf{w}^T \mathbf{y}(k) \quad (31)
\]

may be expressed as

\[
\hat{\mathbf{w}}^{*} = \hat{R}_{yy}^{-1}\left( \hat{\mathbf{r}}_{yy} - \sigma_n^2 \mathbf{e}_1 \right), \quad (32)
\]

where \hat{R}_{yy} is the sample autocorrelation matrix of the noisy speech, \hat{\mathbf{r}}_{yy} is the corresponding vector of sample autocorrelations, and \mathbf{e}_1 = [1 \; 0 \; \cdots \; 0]^T. Next, we choose σ̂_n^2 in (32) such that ŵ* leads to a minimum-variance estimator. We have shown that this provides an upper bound on σ_n^2. Starting at this upper bound, σ̂_n^2 is iteratively decreased until ŵ_0 > ŵ_i for all i ≠ 0, which forces the current observation to have the greatest influence on the estimator output relative to the other observations. A new σ̂_n^2 is then re-estimated for each short-term window. This technique was used for the results given with the nonstationary white noise experiment.
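A sketch of this short-term re-estimation follows; the initial upper bound (the raw signal power r_yy(0)) and the fixed shrink factor stand in for the paper's minimum-variance bound and are our assumptions:

```python
import numpy as np

def estimate_noise_var(y, M=10, shrink=0.95, iters=200):
    """Short-term additive-noise variance via (32): start from an upper
    bound on var_n and shrink until the lag-0 weight dominates."""
    N = len(y)
    c = np.correlate(y, y, mode="full") / N
    r = c[N - 1:N + M + 1]                    # autocorrelation at lags 0..M
    R = np.array([[r[abs(i - j)] for j in range(M + 1)]
                  for i in range(M + 1)])
    e1 = np.zeros(M + 1)
    e1[0] = 1.0
    var_n = r[0]                              # var_n <= signal power (bound)
    for _ in range(iters):
        w = np.linalg.solve(R, r - var_n * e1)   # equation (32)
        if np.all(w[0] > w[1:]):                 # w_0 dominates: accept
            break
        var_n *= shrink
    return var_n
```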

Process Noise Variance: To estimate σ_v^2 (assuming an LPC model for the signal), Lim and Oppenheim [10] used an expression for the inverse Fourier transform of the signal power (which is a function of σ_v^2). We have developed an alternative approach by noting that the process noise variance σ_v^2 can be estimated as the mean squared error of a linear AR predictor on the clean data x(k).⁴ Specifically,

\[
\sigma_v^2 = \sigma_x^2 - \mathbf{p}_{xx}^T R_{xx}^{-1} \mathbf{p}_{xx}, \quad (33)
\]

where \mathbf{p}_{xx} is the cross-correlation between the lagged input vector \mathbf{x}(k-1) and the current x(k), and R_{xx} is the autocorrelation matrix of the inputs. In our setup, only the noisy signal y(k), with prediction residual \sigma_x^2 + \sigma_n^2 - \mathbf{p}_{yy}^T R_{yy}^{-1} \mathbf{p}_{yy}, is available. We can approximate the terms in (33) using:

\[
\hat{\sigma}_x^2 = \hat{\sigma}_y^2 - \hat{\sigma}_n^2, \qquad
\hat{\mathbf{p}}_{xx} = \mathbf{p}_{yy} - \hat{\mathbf{p}}_{nn}, \qquad
\hat{R}_{xx} = R_{yy} - \hat{R}_{nn}.
\]

This results in the following estimate:

\[
\hat{\sigma}_v^2 = \left( \hat{\sigma}_y^2 - \hat{\sigma}_n^2 \right) - \hat{\mathbf{p}}_{xx}^T \hat{R}_{xx}^{-1} \hat{\mathbf{p}}_{xx}. \quad (34)
\]

Note that when n(k) is white, the terms in (33) simplify because \hat{\mathbf{p}}_{nn} = 0 and \hat{R}_{nn} = \hat{\sigma}_n^2 I, where the additive noise variance is estimated as above.

⁴ This is exact, assuming the signal is generated by a linear autoregression.
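For the white-noise case, (34) reduces to a few lines; the sample-autocorrelation estimates and the predictor order below are our assumptions:

```python
import numpy as np

def estimate_process_var(y, var_n, M=10):
    """Process-noise variance per (33)-(34) for white additive noise,
    where p_nn = 0 and R_nn = var_n * I."""
    N = len(y)
    c = np.correlate(y, y, mode="full") / N
    r = c[N - 1:N + M + 1]                   # autocorrelation at lags 0..M
    R_yy = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    p_yy = r[1:M + 1]                        # lags 1..M: input-to-target terms
    R_xx = R_yy - var_n * np.eye(M)          # R_xx = R_yy - R_nn
    p_xx = p_yy                              # p_nn = 0 for white noise
    return (r[0] - var_n) - p_xx @ np.linalg.solve(R_xx, p_xx)  # (34)
```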

While these "ad hoc" methods were used in the experiments reported in this paper, estimating the noise variances remains a critical area for future work. Our current direction is to treat σ_v^2 and σ_n^2 as additional parameters which may be optimized within the Kalman and maximum-likelihood framework.

CONCLUSION AND FUTURE WORK

We have presented a DEKF algorithm with preliminary results on its application to speech enhancement in the presence of both nonstationary and colored noise. The approach performs significantly better than the current state of the art on the reduction of nonstationary noise, and performs well on colored noise problems as well. In addition, the application of the approach to monaural blind signal separation was considered as a special case of the nonstationary colored noise problem. Future work will include additional approaches to variance estimation, as well as the coupling of error statistics, windowing aspects, recurrent training implications, and forward-backward methods for smoothing. An additional aspect that is currently under consideration is the minimization of both prediction and estimation errors by the weight filter. While the current implementation minimizes only the prediction error of the model, the full errors-in-variables cost function [14, 15] can be minimized by a two-observation form of the weight filter. This refinement will be discussed in a future paper.

Acknowledgments: This work was sponsored by the NSF under grant ECS-9410823, and by ARPA/AASERT Grant DAAH04-95-1-0485. We would like to thank C. Avendano for providing the spectral subtraction and RASTA processing comparisons.

REFERENCES

[1] C. Avendano, H. Hermansky, M. Vis, A. Bayya. Adaptive speech enhancement based on frequency-specific SNR estimation. Proc. IVTTA 1996, Basking Ridge, NJ, October 1996.
[2] A. Bell, T. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159, 1995.
[3] S. F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, pp. 113-120, April 1979.
[4] J. Connor, R. Martin, L. Atlas. Recurrent neural networks and robust time series prediction. IEEE Trans. on Neural Networks, March 1994.
[5] P. N. Denbigh, J. Zhao. Pitch extraction and separation of overlapping speech. Speech Communication, vol. 11, pp. 119-125, 1992.
[6] Y. Ephraim. Statistical model-based speech enhancement systems. Proceedings of the IEEE, vol. 80, October 1992.
[7] G. Goodwin, K. S. Sin. Adaptive Filtering Prediction and Control. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1994.
[8] H. G. Hirsch. Estimation of noise spectrum and its application to SNR estimation and speech enhancement. Technical Report TR-93-012, International Computer Science Institute, 1993.
[9] F. Lewis. Optimal Estimation. John Wiley & Sons, Inc., New York, 1986.
[10] J. Lim, A. Oppenheim. All-pole modeling of degraded speech. IEEE Trans. Acoust., Speech, Signal Process., vol. 26, June 1978.
[11] M. B. Matthews, G. S. Moschytz. The identification of nonlinear discrete-time fading-memory systems using neural network models. IEEE Transactions on Circuits and Systems-II, 41(11):740-751, November 1994.
[12] L. Nelson, E. Stear. The simultaneous on-line estimation of parameters and states in linear systems. IEEE Trans. on Automatic Control, February 1976.
[13] G. Puskorius, L. Feldkamp. Neural control of nonlinear dynamic systems with Kalman filter trained recurrent networks. IEEE Trans. on Neural Networks, vol. 5, no. 2, 1994.
[14] G. Seber, C. Wild. Nonlinear Regression. John Wiley & Sons, 1989.
[15] E. Wan, A. Nelson. Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In NIPS 96 Proceedings, 1997.
[16] P. Werbos. Backpropagation through time: what it does and how to do it. Proc. IEEE, special issue on neural networks, 78(10):1550-1560, 1990.
[17] R. Williams. Training recurrent networks using the extended Kalman filter. International Joint Conference on Neural Networks, Baltimore, vol. IV, 1992.
