Preprint – to appear at ICNN’97
Neural Speech Enhancement Using Dual Extended Kalman Filtering

Alex T. Nelson
[email protected]
Eric A. Wan
[email protected]
Department of Electrical Engineering, Oregon Graduate Institute
P.O. Box 91000, Portland, OR 97291

Abstract

The removal of noise from speech signals has applications ranging from speech enhancement for cellular communications, to front ends for speech recognition systems. Spectral techniques are commonly used in these applications, but frequently result in audible distortion of the signal. A nonlinear time-domain method called dual extended Kalman filtering (DEKF) is presented that demonstrates significant advantages for removing nonstationary and colored noise from speech.
1. Introduction

We begin by modeling the speech signal as a series of nonlinear autoregressions with both process and additive observation noise:
\begin{align}
x(k) &= f(x(k-1), \ldots, x(k-M); \mathbf{w}) + v(k-1) \nonumber \\
y(k) &= x(k) + n(k), \tag{1}
\end{align}
where x(k) corresponds to the true underlying speech signal driven by process noise v(k), and f(·) is a nonlinear function of past values of x(k) parameterized by w. The speech is assumed to be stationary over short segments, with each segment having a different model. The only available observation is y(k), which contains additive noise n(k). An estimate x̂(k) can be determined through estimation, using observations up to and including time k, or through smoothing, using all observations, past and future. The optimal estimator given the noisy observations Y(k) = {y(k), y(k-1), ..., y(0)} is E[x(k) | Y(k)]. The most direct way to estimate this would be to train on a set of clean data in which the true x(k) may be used as the target for a neural network. Unfortunately, the clean speech x(k) is unavailable; only the noisy measurements y(k) are observed. Note that if we were to use y(k) instead of x(k) as the target, then E[y(k) | Y(k)] = y(k) and no estimation is achieved. To solve this problem, we assume that f(·; ·) is in the class of feedforward neural network models, and compute the dual estimation of both the states x̂ and the weights ŵ based on a Kalman filtering approach. We begin with a description of the algorithm, followed by a discussion of experimental results.
2. A Neural State-Space Approach

By posing the dual estimation problem in a state-space framework, we can use Kalman filtering methods to perform the estimation in an efficient, recursive manner. At each time point, an optimal estimate is obtained by combining a prior prediction with a new observation. Connor et al. [4] proposed using an extended Kalman filter with a neural network to perform state estimation alone. Puskorius and Feldkamp (1994) and others have posed the weight estimation in a state-space framework to allow for efficient Kalman training of a neural network [14, 13]. We have extended these ideas to include the dual Kalman estimation of both states and weights for efficient maximum-likelihood optimization [15]. A state-space formulation of Equation 1 is as follows:
\begin{align}
\mathbf{x}(k) &= F[\mathbf{x}(k-1)] + B v(k-1) \nonumber \\
y(k) &= C \mathbf{x}(k) + n(k), \tag{2}
\end{align}

where

\begin{equation*}
\mathbf{x}(k) = \begin{bmatrix} x(k) \\ x(k-1) \\ \vdots \\ x(k-M+1) \end{bmatrix}, \quad
B = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
F[\mathbf{x}(k)] = \begin{bmatrix} f(x(k), \ldots, x(k-M+1); \mathbf{w}) \\ x(k) \\ \vdots \\ x(k-M+2) \end{bmatrix},
\end{equation*}
and C = Bᵀ. If the model is linear, then f(x(k)) takes the form wᵀx(k), and F[x(k)] can be written as Ax(k), where A is in controllable canonical form [3]. We initially assume the noise terms v(k) and n(k) are white with known variances σ_v² and σ_n², respectively. In practice, the noise variances must be estimated directly from the noisy data (see Section 3.0.2).
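As a concrete illustration of this formulation, the sketch below builds the state-space pieces in NumPy. The model order M, the placeholder linear predictor f, and the helper names (F, A_canonical) are illustrative assumptions; in the paper, f is the feedforward network described in Section 3.

```python
import numpy as np

M = 4  # model order (number of lagged inputs); illustrative choice

def f(x, w):
    """Placeholder predictor f(x; w): a linear AR model standing in for the
    feedforward network used in the paper."""
    return np.dot(w, x)

def F(x, w):
    """Nonlinear state transition F[x(k)]: shift the lag vector and place
    the one-step prediction f(x; w) on top (Equation 2)."""
    x_new = np.empty_like(x)
    x_new[0] = f(x, w)
    x_new[1:] = x[:-1]
    return x_new

# B selects where the scalar process noise enters the state; C = B^T reads out x(k).
B = np.zeros((M, 1)); B[0, 0] = 1.0
C = B.T

def A_canonical(a):
    """Linear case: the transition matrix A in controllable canonical form,
    built from AR coefficients a (first row) and a shifting sub-diagonal."""
    A = np.zeros((len(a), len(a)))
    A[0, :] = a
    A[1:, :-1] = np.eye(len(a) - 1)
    return A
```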
2.1. Extended Kalman Filter

For a linear model with known parameters, the Kalman filter (KF) algorithm can be readily used to estimate the states [8]. At each time step, the filter computes the linear least-squares estimate x̂(k) and prediction x̂⁻(k), as well as their error covariances, P_x̂(k) and P⁻_x̂(k). In the linear case with Gaussian statistics, these are the minimum mean-square estimates. With no prior information on x, they reduce to the maximum-likelihood estimates. A smoothing version of the KF (known as the forward-backward (FB) Kalman filter [8]) is obtained by combining the standard KF with a backwards information filter. Effectively, an inverse covariance is propagated backwards in time to form backwards state estimates that are combined with the forward state estimates. In the case of a known linear model, this produces the maximum-likelihood estimate of each data point based on all observations, not just the previous ones. When the model is nonlinear, the KF cannot be applied directly, but requires a linearization of the nonlinear model at each time step. The resulting algorithm is called the extended Kalman filter (EKF), and effectively approximates the nonlinear function with a time-varying linear one. The EKF algorithm is as follows:
\begin{align}
\hat{\mathbf{x}}^-(k) &= F[\hat{\mathbf{x}}(k-1), \hat{\mathbf{w}}(k-1)] \tag{3} \\
P^-_{\hat{x}}(k) &= A P_{\hat{x}}(k-1) A^T + B \sigma_v^2 B^T, \tag{4} \\
\text{where } A &\triangleq \left. \frac{\partial F[\mathbf{x}, \hat{\mathbf{w}}]}{\partial \mathbf{x}} \right|_{\hat{\mathbf{x}}(k-1)} \tag{5} \\
K(k) &= P^-_{\hat{x}}(k) C^T \left( C P^-_{\hat{x}}(k) C^T + \sigma_n^2 \right)^{-1} \tag{6} \\
P_{\hat{x}}(k) &= (I - K(k) C) P^-_{\hat{x}}(k) \tag{7} \\
\hat{\mathbf{x}}(k) &= \hat{\mathbf{x}}^-(k) + K(k)\left( y(k) - C \hat{\mathbf{x}}^-(k) \right). \tag{8}
\end{align}
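The sketch below implements one pass of Equations 3-8 for the state filter in NumPy. The Jacobian A is obtained here by finite differences for simplicity, whereas the paper linearizes the network analytically; the function names (ekf1_step, jacobian_fd) are illustrative assumptions.

```python
import numpy as np

def jacobian_fd(F, x, w, eps=1e-6):
    """Finite-difference Jacobian of F with respect to x (Eq. 5); a stand-in
    for the analytic linearization of the network used in the paper."""
    M = x.size
    A = np.zeros((M, M))
    Fx = F(x, w)
    for i in range(M):
        dx = np.zeros(M); dx[i] = eps
        A[:, i] = (F(x + dx, w) - Fx) / eps
    return A

def ekf1_step(x, P, y, F, w, B, C, var_v, var_n):
    """One time/measurement update of the state EKF (Equations 3-8)."""
    A = jacobian_fd(F, x, w)
    x_prior = F(x, w)                                   # Eq. 3
    P_prior = A @ P @ A.T + B @ B.T * var_v             # Eq. 4
    S = C @ P_prior @ C.T + var_n                       # innovation covariance
    K = P_prior @ C.T @ np.linalg.inv(S)                # Eq. 6
    P_new = (np.eye(x.size) - K @ C) @ P_prior          # Eq. 7
    x_new = x_prior + (K @ (y - C @ x_prior)).ravel()   # Eq. 8
    return x_new, P_new
```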
Figure 1. The Dual Extended Kalman Filter (DEKF). EKF1 and EKF2 represent the filters for the states and the weights, respectively. The "Time Updates" refer to Equations 3-5 for EKF1, and Equations 10-11 for EKF2. Similarly, the "Measurement Updates" refer to Equations 6-8 for EKF1, and Equations 12-15 for EKF2.

2.2. Dual Extended Kalman Filter

Because the model for the speech is not known, the standard EKF algorithm cannot be applied directly. We approach this problem by constructing a separate state-space formulation for the underlying weights as follows:
\begin{align}
\mathbf{w}(k) &= \mathbf{w}(k-1) \nonumber \\
y(k) &= f(\mathbf{x}(k-1); \mathbf{w}(k)) + v(k-1) + n(k), \tag{9}
\end{align}
where the state transition is simply an identity matrix, and the neural network f(x(k-1); w(k)) plays the role of a time-varying nonlinear observation on w. These state-space equations for the weights allow us to estimate them with a second EKF. This is effectively a second-order parameter estimation method, which (for the linear case) is equivalent to recursive least squares [6].
\begin{align}
\hat{\mathbf{w}}^-(k) &= \hat{\mathbf{w}}(k-1) \tag{10} \\
P^-_{\hat{w}}(k) &= P_{\hat{w}}(k-1) \tag{11} \\
K_{\hat{w}}(k) &= P^-_{\hat{w}}(k) H^T \left( H P^-_{\hat{w}}(k) H^T + \sigma_n^2 + \sigma_v^2 \right)^{-1}, \tag{12} \\
\text{where } H &\triangleq C \left. \frac{\partial F[\hat{\mathbf{x}}, \mathbf{w}]}{\partial \mathbf{w}} \right|_{\hat{\mathbf{w}}(k-1)} \tag{13} \\
P_{\hat{w}}(k) &= (I - K_{\hat{w}}(k) H) P^-_{\hat{w}}(k) \tag{14} \\
\hat{\mathbf{w}}(k) &= \hat{\mathbf{w}}^-(k) + K_{\hat{w}}(k)\left( y(k) - C F(\hat{\mathbf{x}}(k-1), \hat{\mathbf{w}}^-(k)) \right). \tag{15}
\end{align}
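A corresponding NumPy sketch of the weight filter (Equations 10-15) is given below. Here the linearized observation H is again formed by finite differences, and the helper name ekf2_step is illustrative, not from the paper.

```python
import numpy as np

def ekf2_step(w, Pw, y, x_prev, f, var_v, var_n, eps=1e-6):
    """One update of the weight EKF (Equations 10-15). The time update
    (Eqs. 10-11) simply carries over the previous estimate and covariance."""
    w_prior, Pw_prior = w, Pw                            # Eqs. 10-11
    # Linearized observation H = d f(x_prev; w) / d w (Eq. 13; C picks out f)
    H = np.zeros((1, w.size))
    f0 = f(x_prev, w)
    for i in range(w.size):
        dw = np.zeros(w.size); dw[i] = eps
        H[0, i] = (f(x_prev, w + dw) - f0) / eps
    S = H @ Pw_prior @ H.T + var_n + var_v               # innovation covariance
    K = Pw_prior @ H.T / S                               # Eq. 12 (S is scalar here)
    Pw_new = (np.eye(w.size) - K @ H) @ Pw_prior         # Eq. 14
    w_new = w_prior + (K * (y - f0)).ravel()             # Eq. 15
    return w_new, Pw_new
```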
We now have a pair of dual extended Kalman filters (DEKF), shown in Figure 1, that are run in parallel. At each time step, the current estimate of x is used by the weight filter, and the current estimate of w is used by the state filter. For finite data sets, the algorithm is run iteratively over the data until the weights converge. This approach to dual estimation is related to work done by Nelson [12] in the linear case, and to Matthews' neural approach [11] to the recursive prediction error algorithm [5].¹ In the domain of speech, the method is most closely related to Lim and Oppenheim's approach to fitting LPC models to degraded speech [9]. We are currently investigating two improvements to the method: the use of each filter's error statistics in the updates of the other filter, and the use of recurrent procedures to compute the linearization in Equation 13.

¹An alternative to the dual approach is to concatenate both w and x into a joint state vector. The model and time series are then estimated simultaneously by applying the EKF to the nonlinear joint state vector (see [5] for the linear case). This algorithm, however, has been known to have convergence problems. The application of joint estimation in recurrent neural networks has been reported by Williams [16] and Matthews [10].
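Putting the two filters together, the per-sample loop below sketches how the state and weight filters exchange their current estimates. It assumes the ekf1_step and ekf2_step helpers and the F, f, B, C definitions from the sketches above, plus initial x, P, w, Pw, the noise variances, and a noisy window y_noisy; it only illustrates the interleaving, not the full windowed training procedure of Section 3.

```python
import numpy as np

x_hat = np.zeros_like(y_noisy)
for epoch in range(20):                      # iterate over the window until the weights converge
    for k in range(len(y_noisy)):
        x_prev = x.copy()
        # EKF1: the state update uses the current weight estimate w
        x, P = ekf1_step(x, P, y_noisy[k], F, w, B, C, var_v, var_n)
        # EKF2: the weight update uses the current state estimate
        w, Pw = ekf2_step(w, Pw, y_noisy[k], x_prev, f, var_v, var_n)
        x_hat[k] = x[0]                      # x-hat(k) is the first element of the state vector
```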
3. Speech Processing

3.0.1. Windowing.

Application of the DEKF to speech processing involves segmenting the signal into short windows over which the signal is approximately stationary. We used successive 64 ms (512 point) windows of the signal, with a new window starting every 8 ms (64 points). A normalized Hamming window was employed to emphasize the estimates in the center of each window when combining the overlapping segments. The same Hamming window is also used to weight the parameter estimation process. That is, the data in the center of the window are emphasized with respect to learning the model parameters (as is done in weighted least squares). Using the relationship between recursive least squares (RLS) and Kalman filtering, we reformulated the Kalman weight filter to perform this windowing effect. Relatively short window lengths are needed to track the nonstationary speech signals. Unfortunately, this limits the amount of data available for training the neural network. We can mitigate this problem by initializing the weight vectors and covariance matrices of each window with those of the preceding window, thereby increasing the effective size of the training sets. For the experiments presented in this paper, feedforward networks with 10 inputs, 4 hidden units, and 1 output were used. Weights converged in fewer than 20 epochs.
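The overlap-add recombination described above can be sketched as follows. The 512-point window, 64-point hop, and Hamming weighting come from the text (512 points over 64 ms implies 8 kHz sampling); the per-window denoise_window routine is an assumed placeholder for a full DEKF pass over one segment.

```python
import numpy as np

WIN = 512   # 64 ms at 8 kHz
HOP = 64    # a new window every 8 ms

def enhance(y, denoise_window):
    """Run a per-window denoiser and recombine with normalized Hamming weights."""
    ham = np.hamming(WIN)
    out = np.zeros(len(y))
    norm = np.zeros(len(y)) + 1e-12          # accumulated window weight
    for start in range(0, len(y) - WIN + 1, HOP):
        seg = y[start:start + WIN]
        x_hat = denoise_window(seg)          # e.g., one DEKF pass over the segment
        out[start:start + WIN] += ham * x_hat
        norm[start:start + WIN] += ham
    return out / norm                        # normalize so the overlapping weights sum to one
```

A call such as enhance(y_noisy, lambda seg: seg) simply reconstructs the input, which is a convenient sanity check of the overlap-add weighting.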
3.0.2. Estimating the Noise Variances.

In the implementation of the DEKF, it is assumed that the variances of v(k) and n(k) in Equation 2 are known quantities. In practical applications, however, the noise variances must be estimated from the noisy data. We have investigated several approaches for doing this in the speech processing domain. For estimating the additive noise variance σ_n², two methods have been used. The first method assumes we can estimate σ_n² from segments of the signal where no speech is detected. This is a common technique, and works well if the noise is approximately stationary. A second approach we have developed assumes a rapidly varying noise source, in which case σ_n² must be re-estimated in each window. Initially, we choose σ̂_n² to minimize the output variance of a smoother; this provides an upper bound on σ_n². Starting at this upper bound, σ̂_n² is iteratively decreased until the influence of the current observation on the smoother output is maximal relative to the other observations.

The process noise variance σ_v² can be estimated as the mean squared error of a linear AR predictor on the clean data x(k). This MSE is given by σ_x² − p_xxᵀ R_xx⁻¹ p_xx, so we have:

\begin{equation*}
\sigma_v^2 = \sigma_x^2 - \mathbf{p}_{xx}^T R_{xx}^{-1} \mathbf{p}_{xx},
\end{equation*}

where p_xx is the cross-correlation between the lagged input vector x(k-1) and the current x(k), and R_xx is the autocorrelation of the inputs. However, only the noisy signal y(k) is available, with prediction residual σ_x² + σ_n² − p_yyᵀ R_yy⁻¹ p_yy. We can approximate σ_x² − p_xxᵀ R_xx⁻¹ p_xx using:

\begin{equation}
\hat{\sigma}_x^2 = \sigma_y^2 - \hat{\sigma}_n^2, \qquad
\hat{\mathbf{p}}_{xx} = \mathbf{p}_{yy} - \hat{\mathbf{p}}_{nn}, \qquad
\hat{R}_{xx} = R_{yy} - \hat{R}_{nn}. \tag{16}
\end{equation}

This results in the following estimate:

\begin{equation}
\hat{\sigma}_v^2 = (\hat{\sigma}_y^2 - \hat{\sigma}_n^2) - \hat{\mathbf{p}}_{xx}^T \hat{R}_{xx}^{-1} \hat{\mathbf{p}}_{xx}. \tag{17}
\end{equation}
Note that when n(k) is white, the terms in Equation 16 simplify because p̂_nn = σ̂_n² e₁ and R̂_nn = σ̂_n² I, where e₁ = [1 0 ⋯ 0]ᵀ. We will elaborate more on these methods for estimating both the measurement and process noise variances in a future paper.
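Equations 16-17 can be computed directly from sample correlations of the noisy signal once σ̂_n² is available (e.g., from a speech-free segment). The sketch below takes the estimated noise correlations p̂_nn and R̂_nn as inputs rather than hard-coding the white-noise simplification; the function name estimate_var_v and the biased correlation estimator are illustrative assumptions.

```python
import numpy as np

def estimate_var_v(y, var_n_hat, p_nn_hat, R_nn_hat, M=10):
    """Process-noise variance via Equations 16-17 from the noisy signal y.

    var_n_hat : estimate of the additive noise variance
    p_nn_hat  : estimated noise cross-correlation vector (length M)
    R_nn_hat  : estimated noise autocorrelation matrix (M x M)
    """
    y = y - np.mean(y)
    N = len(y)
    # Biased sample autocorrelations r(0), ..., r(M) of the noisy signal
    r = np.array([np.dot(y[:N - lag], y[lag:]) / N for lag in range(M + 1)])
    p_yy = r[1:]                                                      # lags 1..M
    R_yy = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    # Equation 16
    var_x_hat = r[0] - var_n_hat
    p_xx_hat = p_yy - p_nn_hat
    R_xx_hat = R_yy - R_nn_hat
    # Equation 17
    return var_x_hat - p_xx_hat @ np.linalg.solve(R_xx_hat, p_xx_hat)
```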
3.1. Nonstationary Noise Experiment

The result of applying the DEKF to a speech signal corrupted with nonstationary bursting white noise is illustrated in Figure 2. These results were computed assuming both σ_v² and σ_n² were known. The average SNR is improved by 9.94 dB, with little resultant distortion. We also ran the experiment with σ_n² and σ_v² estimated using only the noisy signal. This also produced impressive results, with an SNR improvement of 8.50 dB. In comparison, available "state-of-the-art" techniques of spectral subtraction [2] and adaptive RASTA processing [7, 1] achieve SNR improvements of only 0.65 and 1.26 dB, respectively.
Figure 2. Cleaning noisy speech with the DEKF (waveform panels: Clean Speech, Noise, Noisy Speech, Cleaned Speech). The speech data was approximately 33,000 points (4 seconds) long. Nonstationary white noise was generated artificially and added to the speech to create the noisy signal y.
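For reference, SNR improvements such as those quoted above can be computed from the clean, noisy, and cleaned waveforms; a minimal sketch (assuming time-aligned signals and a conventional SNR definition, which the preprint does not spell out) is:

```python
import numpy as np

def snr_db(clean, signal):
    """SNR of `signal` with respect to the clean reference, in dB."""
    noise_power = np.mean((signal - clean) ** 2)
    return 10.0 * np.log10(np.mean(clean ** 2) / noise_power)

# SNR improvement = output SNR minus input SNR:
# improvement = snr_db(x_clean, x_cleaned) - snr_db(x_clean, y_noisy)
```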
3.2. Colored Measurement Noise

If the measurement noise n(k) is not white, then the state-space equations 2 need to be adjusted before Kalman filtering techniques can be employed. Specifically, the measurement noise process is given its own state-space equations,

\begin{align*}
\mathbf{n}(k) &= A_n \mathbf{n}(k-1) + B_n v_n(k-1) \\
n(k) &= C_n \mathbf{n}(k),
\end{align*}

where the noise state vector n(k) contains lagged values of n(k), v_n(k) is white noise, A_n is a simple state transition matrix in controllable canonical form, and B_n and C_n are of the same form as B and C given in Equation 2. Note that this is equivalent to an autoregressive model of the colored noise. With this formulation for the colored noise, it is straightforward to augment both the state x(k) and the weight w(k) with n(k), and write down combined state equations. Specifically, Equation 2 is replaced by:
\begin{align*}
\begin{bmatrix} \mathbf{x}(k) \\ \mathbf{n}(k) \end{bmatrix}
&= \begin{bmatrix} F[\mathbf{x}(k-1)] \\ A_n \mathbf{n}(k-1) \end{bmatrix}
+ \begin{bmatrix} B & 0 \\ 0 & B_n \end{bmatrix}
\begin{bmatrix} v(k) \\ v_n(k) \end{bmatrix} \\
y(k) &= \begin{bmatrix} C & C_n \end{bmatrix}
\begin{bmatrix} \mathbf{x}(k) \\ \mathbf{n}(k) \end{bmatrix},
\end{align*}

and Equation 9 is replaced by:

\begin{align*}
\begin{bmatrix} \mathbf{w}(k) \\ \mathbf{n}(k) \end{bmatrix}
&= \begin{bmatrix} I & 0 \\ 0 & A_n \end{bmatrix}
\begin{bmatrix} \mathbf{w}(k-1) \\ \mathbf{n}(k-1) \end{bmatrix}
+ \begin{bmatrix} 0 \\ B_n \end{bmatrix} v_n(k) \\
y(k) &= f(\mathbf{x}(k-1); \mathbf{w}(k)) + C_n \mathbf{n}(k) + v(k-1).
\end{align*}

The noise processes in these state equations are now white, and the DEKF algorithm can be used to estimate the signal. Note that the colored noise affects not only the state estimation, but also the weight estimation.
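A sketch of how the augmented matrices might be assembled, given AR coefficients a_n for the noise model, is shown below. The helper names (companion, augment_noise_model) and the restriction to the linear block structure are illustrative assumptions; the signal part of the transition remains the nonlinear F of Section 2.

```python
import numpy as np

def companion(a):
    """Controllable-canonical-form transition matrix from AR coefficients a."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a
    A[1:, :-1] = np.eye(p - 1)
    return A

def augment_noise_model(a_n, M):
    """Build A_n and the block-structured B and C of the augmented model
    [x(k); n(k)] for an AR(p) colored-noise model and a length-M signal state."""
    p = len(a_n)
    A_n = companion(a_n)
    B = np.zeros((M, 1)); B[0, 0] = 1.0
    B_n = np.zeros((p, 1)); B_n[0, 0] = 1.0
    B_aug = np.block([[B, np.zeros((M, 1))],
                      [np.zeros((p, 1)), B_n]])        # [[B, 0], [0, B_n]]
    C_aug = np.hstack([B.T, B_n.T])                    # [C  C_n], since C = B^T
    return A_n, B_aug, C_aug
```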
3.3. Colored Noise Experiment

An actual recording of cellular phone noise was added to a speech signal to produce the data shown in Figure 3. To account for the coloration of the noise signal, the DEKF algorithm was applied using the modified state-space formulation described in Section 3.2. The noise was modeled as an AR(10) process to determine A_n and the variance of v_n(k), using a segment of the data where no speech is present. The results in the figure reflect assumed knowledge of σ_v², and showed an average SNR improvement of 5.77 dB. When σ_v² is estimated (Equation 17), the SNR improvement is 5.71 dB. In this case, spectral subtraction and adaptive RASTA processing produced comparable SNR improvements of 4.87 dB and 5.27 dB, respectively.
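The AR(10) noise model used above can be fit from a speech-free segment with standard Yule-Walker estimation; a minimal sketch follows (the function name fit_ar_noise and the biased correlation estimator are illustrative assumptions).

```python
import numpy as np

def fit_ar_noise(noise_segment, p=10):
    """Yule-Walker fit of an AR(p) model to a speech-free noise segment.
    Returns the AR coefficients a_n and the driving-noise variance var_vn."""
    n = noise_segment - np.mean(noise_segment)
    N = len(n)
    r = np.array([np.dot(n[:N - lag], n[lag:]) / N for lag in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
    a_n = np.linalg.solve(R, r[1:p + 1])        # normal (Yule-Walker) equations
    var_vn = r[0] - a_n @ r[1:p + 1]            # residual variance of the AR fit
    return a_n, var_vn
```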
Figure 3. Removing colored noise with the DEKF (waveform panels: Clean Speech, Noisy Speech, Cleaned Speech). The speech data is 3,500 points long. An actual recording of stationary colored noise was added to the speech to create the noisy signal y.

4. Conclusion and Future Work
This paper has presented preliminary results on the application of the DEKF algorithm to speech enhancement in the presence of both nonstationary and colored noise. The advantages of the method are most evident when the noise is white and nonstationary, although it performs better than competing methods in the presence of stationary colored noise as well. Future work will include the development of methods for nonstationary colored noise, variance estimation, and forward-backward smoothing. While further study is needed, the dual extended Kalman filter offers a powerful new tool for speech enhancement.
Acknowledgments. This work was sponsored by the NSF under grant ECS-9410823, and by ARPA/AASERT Grant DAAH04-95-1-0485. We would like to thank C. Avendano for providing spectral subtraction and RASTA processing comparisons.
References

[1] C. Avendano, H. Hermansky, M. Vis, and A. Bayya. Adaptive speech enhancement based on frequency-specific SNR estimation. In Proceedings IVTTA 1996, Basking Ridge, NJ, October 1996.
[2] S. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-27, pages 113-120, April 1979.
[3] C. Chen. Linear System Theory and Design. Saunders College Publishing, Harcourt Brace College Publishers, 1970.
[4] J. T. Connor, R. D. Martin, and L. E. Atlas. Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, March 1994.
[5] G. C. Goodwin and K. S. Sin. Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ, 1994.
[6] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Upper Saddle River, NJ, 1996.
[7] H. Hermansky, E. Wan, and C. Avendano. Speech enhancement based on temporal processing. In International Conference on Acoustics, Speech, and Signal Processing, 1995.
[8] F. L. Lewis. Optimal Estimation. John Wiley & Sons, New York, 1986.
[9] J. S. Lim and A. V. Oppenheim. All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(3):197-210, June 1978.
[10] M. B. Matthews. A state-space approach to adaptive nonlinear filtering using recurrent neural networks. In Proceedings IASTED International Symposium on Artificial Intelligence Applications and Neural Networks, pages 197-200, 1990.
[11] M. B. Matthews and G. S. Moschytz. The identification of nonlinear discrete-time fading-memory systems using neural network models. IEEE Transactions on Circuits and Systems II, 41(11):740-751, November 1994.
[12] L. W. Nelson and E. Stear. The simultaneous on-line estimation of parameters and states in linear systems. IEEE Transactions on Automatic Control, pages 94-98, February 1976.
[13] E. S. Plumer. Training neural networks using sequential-update forms of the extended Kalman filter. Informal Report LA-UR-95-422, Los Alamos National Laboratory, January 1995.
[14] G. V. Puskorius and L. A. Feldkamp. Neural control of nonlinear dynamic systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 1994.
[15] E. A. Wan and A. T. Nelson. Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In Advances in Neural Information Processing Systems 8, 1997.
[16] R. J. Williams. Training recurrent networks using the extended Kalman filter. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 241-246. IEEE and INNS, 1992.