Denoising using Time-Frequency and image processing methods

D. Nelson Dept. of Defense, Ft. Meade, MD 20755, USA

G. Cristobal1 and V. Kober2 Instituto de Optica (CSIC), Serrano 121, 28006 Madrid, Spain

F. Cakrak and P. Loughlin3 Dept. of Electrical Engineering, Univ. of Pittsburgh. Pittsburgh, PA 15261, USA

L. Cohen4 Dept. of Physics, City University of New York, Hunter College, New York, NY 10021, USA

We present a number of methods that use image and signal processing techniques for removal of noise from a signal. The basic idea is to first construct a time-frequency density of the noisy signal. The time-frequency density, which is a function of two variables, can then be treated as an "image", thereby enabling the use of image processing methods to remove noise and enhance the image. Having obtained an enhanced time-frequency density, one then reconstructs the signal. Various time-frequency densities are used, and a number of image processing methods are investigated. Examples of human speech and whale sounds are given. In addition, new methods are presented for estimation of signal parameters from the time-frequency density.

1 Introduction

One of the fundamental problems in signal analysis is noise removal from a signal and the extraction of signal parameters such as the amplitude and phase modulations. Many approaches have been developed to address these problems. One approach is to obtain a time-frequency density or representation of the signal, cut out or "mask" the signal in the time-frequency plane to eliminate the undesirable parts (e.g., noise), and then reconstruct the signal. Since the noise is usually spread out in the time-frequency plane and the signal is often concentrated in a region, this method generally works well. However, one of its shortcomings is that one has to decide what part to mask and how to mask it, and the reconstruction of the masked signal from the time-frequency plane is usually achieved by a least-squares approximation. Our approach is different in that, while we also first represent the signal in the time-frequency plane, we treat the time-frequency representation as an image, and we enhance it and remove noise using image processing methods. We then reconstruct the signal from the enhanced image using a variety of different methods. We have found that this approach works quite well, particularly for biological signals.

The processing of biological signals raises issues and considerations not encountered with communication signals, radar signals and other man-made signals. Machine-produced signals are frequently driven by processes, such as rotation or controlled vibration, which are nearly periodic, resulting in signal components which are nearly stationary or cyclostationary. Examples are noises produced by drills, milling machines and electric motors, which are driven by processes rotating at a nearly constant rate. Radars are frequently driven by crystals, which are highly stable clocks, and modem signals have stable underlying baud rates and carrier frequencies. Biological signals, by contrast, are generally controlled by tensions and pressures which can change rapidly, resulting in signal structures which have been called quasi-periodic. By this we mean that the signal components, and the excitations which drive them, are not periodic or stationary, but are FM modulated by a bounded FM modulating function. In addition, there may be phase modulation and considerable AM modulation. These signal components can change quite rapidly.

If we think of the speech signal as an example, the primary excitation function is a series of glottal pulses which are produced as the vocal cords open and close. The process is driven by the pressure differential in the lungs and the vocal tract and by the tension of the vocal cords. As these tensions and pressures change, the rate at which the pulses are generated changes. The process is constrained to an octave or so of variation. Speech is observed as resonances in the vocal tract. As the tongue and lips move, the resonant structure of the vocal tract changes rapidly.

1 This research has been supported in part by the following grants: NATO Collaborative Grant No. 950297, EU INCO-DC 961646, EU MAS3-CT97-0122. E-mail: [email protected].
2 On leave from Russian Academy of Sciences, Institute of Information and Transmission Systems, Bolshoi Karetnii 19, 101447 Moscow, Russia. V. Kober expresses his appreciation for the financial support from NATO and the Russian Foundation for Basic Research (grant 99-01-00269). E-mail: [email protected]; [email protected].
3 Funding provided by ONR grant N00014-98-1-0680. E-mail: [email protected].
4 Work supported by the Office of Naval Research, the NASA JOVE, and the NSA HBCU/MI programs.
In this representation, the vocal tract is a "channel" which varies rapidly. The problem apparently solved by the ear is the identification of the channel as it is "sounded" by the glottal pulse excitation function. In effect, humans function much the same as bats, which sound their environment in order to navigate at night. In the case of whale sounds, the signals are produced in much the same way as speech, but the channel consists of the vocal tract of the whale and the transmission characteristics of the ocean. Both of these can change rapidly, and the ocean is a medium with fading multipath. There are two issues: detection and tracking of quasi-stationary signals in noise and interference, and removal of noise and interference and reconstruction of the uncorrupted signal.

2 About the data

There are three sources of data used in developing the analytic methods presented here. The NTIMIT database is one of the standard speech databases used for the development of speech processing algorithms. It consists of clean speech which has been recorded over a telephone line. The only anomalies observed in the NTIMIT data are the channel characteristics of the NYNEX telephone channel and the handset. The primary characteristic of this channel is that its passband is approximately 300 Hz to 3.5 kHz. The data were sampled at 16 kHz with 16-bit resolution.


The second data set consists of read sentences recorded in the cockpit of a helicopter. This data set has fairly high noise and is nearly unintelligible due to the environmental noise in the helicopter cockpit. The sample rate for this set is 8 kHz, with 8-bit resolution.5

The final set of data consists of whale sounds recorded by an underwater hydrophone and supplied by the U.S. Department of the Navy. These data files were recorded in an underwater environment and represent typical underwater acoustic environments.

3 The short time Fourier transform, cross-spectrum and separability

The short time Fourier transform (STFT) was developed to study the temporal variations of a signal. One windows the signal at time $t$ to extract a relatively small portion of the signal around the time of interest and Fourier analyzes that small piece of data. The resulting Fourier transform is called the short time Fourier transform. The squared magnitude of this transform is the energy density spectrum at time $t$. If one then varies $t$ over all time, one obtains a two-dimensional density of time and frequency, which is commonly called the spectrogram. This has been the conventional way of studying the time-varying spectral content of a signal. We aim at developing an alternative approach by considering the short time Fourier transform itself, without taking the absolute square to represent its energy. This of course results in studying a two-dimensional complex function, in contrast to the spectrogram, which is a two-dimensional real function. However, we will show that our approach has considerable advantages for addressing the problems of signal estimation. Our method takes advantage of the spectral phase and magnitude, while methods based on the spectrogram take advantage of the signal magnitude only. Under some conditions the spectrogram is invertible in principle, which is an important factor in signal enhancement. Even in this problem, using the spectral phase provides the advantage of simple reconstruction methods with fewer restrictions on the windowing function. For a signal $f(t)$ and window function $w(t)$ the STFT is

$$F_w(T,\omega) = \int f(t+T)\,w(-t)\,e^{-i\omega t}\,dt. \tag{1}$$

We note that for fixed $\omega_0$, the STFT is a function of time

$$F_w(T,\omega_0) = \int f(t+T)\,w(-t)\,e^{-i\omega_0 t}\,dt. \tag{2}$$

This last expression is easily seen to be the convolution of $f(t)$ and the function

$$w_{\omega_0}(t) = w(t)\,e^{i\omega_0 t}. \tag{3}$$

That is,

$$F_w(T,\omega_0) = f * w_{\omega_0}. \tag{4}$$
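The equivalence between Eq. (2) and Eq. (4) can be checked numerically. The following is a minimal sketch, assuming a sampled signal, a window stored on the offsets $-M,\ldots,M$, and NumPy; the function names `stft_direct` and `stft_by_convolution` are hypothetical and are not taken from the paper. The window is made deliberately asymmetric so that the sign conventions $w(-t)$ and $e^{i\omega_0 t}$ actually matter.

```python
import numpy as np

def stft_direct(f, w, T, omega0):
    """Discrete form of Eq. (2): sum over t of f[T+t] * w(-t) * exp(-i*omega0*t).

    The array w has odd length 2M+1 and stores w(t) for t = -M..M,
    so w(-t) is read as w[M - t].
    """
    M = (len(w) - 1) // 2
    t = np.arange(-M, M + 1)
    return np.sum(f[T + t] * w[M - t] * np.exp(-1j * omega0 * t))

def stft_by_convolution(f, w, omega0):
    """Discrete form of Eq. (4): f convolved with w(t) * exp(+i*omega0*t)."""
    M = (len(w) - 1) // 2
    t = np.arange(-M, M + 1)
    h = w * np.exp(1j * omega0 * t)      # modulated window of Eq. (3), an FIR filter
    full = np.convolve(f, h)             # linear convolution, zero padded at the edges
    return full[M:M + len(f)]            # remove the M-sample offset so index == T

rng = np.random.default_rng(0)
f = rng.standard_normal(64)                      # arbitrary test signal (assumption)
w = np.hanning(9) * np.linspace(0.5, 1.5, 9)     # asymmetric window (assumption)
omega0, T = 0.7, 20                              # rad/sample and an interior time index

direct = stft_direct(f, w, T, omega0)
filtered = stft_by_convolution(f, w, omega0)[T]
print(abs(direct - filtered))                    # difference between the two evaluations
```

Away from the edges of the record, where the zero padding of the convolution plays no role, the two evaluations should agree to rounding error.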

For fixed $\omega_0$, the STFT is a complex time-domain signal, which is a finite impulse response (FIR) filtered representation of the original time-domain signal $f(t)$ by a filter whose impulse response is $w_{\omega_0}$. Since there are no restrictions on the window (filter) $w(t)$, for each value of $\omega_0$ the STFT represents the signal filtered by a filter whose impulse response is the window $w$ translated to frequency $\omega_0$, and this holds for any window. We see that the STFT represents a bank of filters. For each value of $\omega$, we get a time-domain filtered version of the original signal. The first question is: of what signal is it a filtered representation? It is complex valued, even if the input signal is real and the analysis window is real, since the factor $e^{i\omega_0 t}$ is complex. Now, let us consider two signals. If the input signal is the analytic signal, the situation is clear: the filter output is the filtered analytic signal. However, if the input signal is the real signal, the spectrum of the real signal and the analytic signal agree in the positive frequencies,6 which is normally the portion of the spectrum we assume in the STFT. Therefore, the STFT represents the filtered analytic signal at positive frequencies, regardless of whether the analytic or real signal is used as input. At the negative frequencies, the STFT represents the filtered anti-analytic signal.

5 We thank Prof. Amro El-Jaroudi for providing us the helicopter cockpit data.

We may consider the STFT as a distribution of the signal itself, for two reasons. First, as we have just seen, for each $\omega$, the STFT is not an energy distribution. It is actually the original signal filtered by a filter centered at frequency $\omega$. This means that the signal is somehow distributed in frequency, but since the STFT is based on short time observations, the signal is also distributed in time. We could start from the signal spectrum $F(\omega)$ and compute the inverse STFT surface, and obtain a distribution of the signal spectrum which would have the same magnitudes, but the phase would be a function of the STFT phase and the frequency $\omega$. For the second reason, note that the time-domain signal and its Fourier spectrum may be reconstructed from the STFT by summing over frequency to estimate the time-domain signal and Fourier transforming in time to reconstruct the signal spectrum:

$$2\pi f(T)\,w(0) = \int F_w(T,\omega)\,d\omega, \qquad w(0)\neq 0, \tag{5}$$

$$F(\omega)\int w(-t)\,dt = \int F_w(T,\omega)\,e^{-i\omega T}\,dT. \tag{6}$$
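A discrete analogue of Eq. (5) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the STFT is sampled on $N$ frequency bins with an FFT, the window is stored on offsets $-M,\ldots,M$, and the helper name `stft_frame` is hypothetical. In this discrete setting the factor $2\pi$ of Eq. (5) becomes the number of bins $N$, so summing a frame over frequency gives $N\,f[T]\,w(0)$.

```python
import numpy as np

def stft_frame(f, w, T, N):
    """All N frequency bins of the discrete STFT at time index T.

    F[k] = sum_t f[T+t] * w(-t) * exp(-2j*pi*k*t/N) for t = -M..M,
    with w stored on offsets -M..M. N must be at least len(w).
    """
    M = (len(w) - 1) // 2
    t = np.arange(-M, M + 1)
    buf = np.zeros(N, dtype=complex)
    buf[t % N] = f[T + t] * w[M - t]     # negative offsets wrap to the end of the buffer
    return np.fft.fft(buf)

rng = np.random.default_rng(1)
f = rng.standard_normal(200)             # arbitrary real test signal (assumption)
w = np.hanning(33)                       # window with w(0) = w[16] != 0
N, M = 64, 16

# Sum each frame over frequency and divide by N * w(0) to estimate f[T].
times = np.arange(M, len(f) - M)
recovered = np.array([stft_frame(f, w, T, N).sum() / (N * w[M]) for T in times])
max_error = np.max(np.abs(recovered - f[times]))
print(max_error)
```

No least-squares step is involved: each time sample is recovered from its own frame alone, which is the simplicity the text attributes to keeping the STFT phase.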

This means that the signal is truly distributed in the sense that the STFT locally represents the analytic signal jointly in time and frequency. Our use of the STFT and the representation of a TF surface as the distribution of the analytic signal represents a significant departure from convention. Normally the STFT is used to compute the spectrogram, and it is more conventional to represent TF surfaces as energy distributions. Note that for well-behaved windows, no information is lost in the STFT. Since it is invertible, any information contained in the original signal is also contained in the Fourier transform. The advantage in looking at the signal in one particular domain is that the signal components may separate in one domain, but not the other. The question is whether the different representations are equivalent in some sense. If a transformation is invertible, no information is lost, and both domains truly have exactly the same information. If this is the case, they are representations of the same signal. They differ only in perspective. The sum of two sine waves is a mess in the time domain, but clearly looks like two impulses in the frequency domain and is therefore recognizable as the sum of two sine waves. The sum of two chirps looks like a mess in time or frequency, but looks like two chirps in a joint TF representation. If these representations can be freely transformed to each other, they are all equally valid representations of the same signal. Even though we can theoretically discover the nature of the signal from any one of the representations, discovering this nature is not equally obvious in each domain. The advantage of the TF domain is that signal components may separate locally in some orientation on the surface. The problem with the STFT is that it must be computed as a finite duration time integral, resulting in ambiguities in both time and frequency. It is, however, linear, so there are no cross terms.
6 This is not, strictly speaking, true, since the spectral representation is very dependent on the window. What we assume throughout this paper is that the frequency response of the window is very narrowband (i.e., concentrates energy well in frequency). If this is the case, there will be very little aliasing of the positive and negative spectrum in the STFT of the real signal. As long as we avoid the DC and Nyquist frequencies, the positive spectrum of the real signal and the analytic spectrum will be in close agreement. We will also assume that the window has a short impulse response (i.e., concentrates energy well in time). Selecting a window is a compromise, since high resolution in one domain means poorer resolution in the other.

Consider an FM modulated tone as an example,

$$f(t) = e^{i\int \omega(t)\,dt}, \tag{7}$$

where $\omega(t)$ is the instantaneous frequency. Let us assume that we have absolute knowledge that the signal consists of only one signal component and that we know $\omega(t)$ exactly. We further assume that $\omega(t)$ is continuously differentiable. In this case, the signal is represented accurately by the instantaneous frequency alone. If we assume that $\omega(t)$ changes rapidly with time, we would expect to see a response in the Fourier transform in a fairly broad spectral band. Now, we must say what we mean by response. By a response at a given frequency, we mean that there is a significant amount of energy in the Fourier transform coefficient represented by that frequency. In general there will be some energy distributed over the entire spectrum, but if the energy at a particular frequency is very small compared to the energy at some other frequency, we may effectively ignore it. As was mentioned, the FM signal we are assuming will be observed as a broad spectral band of energy in the frequency domain. But the signal is not actually present at all frequencies within the band simultaneously, as the Fourier transform or the STFT seems to indicate. We could decrease the analysis interval, and the spectral band may decrease, or we could increase the analysis interval, and the observed spectral band may increase (or decrease). The observed bandwidth depends in part on the analysis interval of the window used in the Fourier transform, the time duration of the signal, the total frequency excursion of the signal over the observation time, and to some extent the rate of change of the frequency of the signal. In fact, the signal passes through each frequency in a band of the Fourier spectrum at a particular time, but is observed "simultaneously" in a band of adjacent frequencies for two reasons. The first reason is the support interval of the signal. To understand this, we consider a short segment of data where the true signal frequency is constrained to a narrow band.

In this case, we may consider the model of a pulsed sine wave whose support interval is short compared to the length of the analysis interval of the Fourier transform. The observed spectrum in this case is apparently broad, since the observed spectral bandwidth is proportional to the reciprocal of the length of the support interval of the pulsed sine wave. The estimated spectrum is broad in this case only because the signal duration is short. The second reason is that the signal may be moving across the spectrum during the analysis interval of the Fourier transform. In this case, the observed spectrum is broad primarily because the signal frequency is different at the beginning of the analysis interval than it is at the end. Splitting the Fourier transform into two shorter transforms will show that the signal frequency bands are in fact different on the two smaller intervals.

In both the FM modulated tone and the pulsed sine wave cases cited above, the signals consist of only one signal component, so we should be able to recover the time domain signal directly from the local properties of the finite Fourier transform or the STFT, without applying an inverse transform. In fact this is the case, assuming that the FM modulation is monotonic within the analysis interval of the Fourier transform (or, in the case of the STFT, within each Fourier transform used to compute it). By monotonic, we mean that $\omega(t)$ is a monotonic function of $t$. If it is not monotonic, there is an ambiguity in the time associated with the observed frequencies, and the observed spectrum cannot be inverted to recover the time domain signal. That it should be possible to recover the signal directly from the spectrum without inversion may be expected, since the STFT distributes the signal (both amplitude and phase) over the TF surface. If the rate of increase of phase increases, the observed frequency location of the signal component on the surface changes by the same amount.

The frequency reconstruction property is essentially projection of the surface onto the frequency axis. If there is only one monotonic signal component, the projection essentially preserves the phase of the time domain signal, so the time domain signal may be recovered from the group delay function in the frequency domain. This is the analog of estimating the spectrum of the signal by the instantaneous frequency function of the time domain signal. The important thing is that under the monotonic single component FM assumptions, the two representations must be the same.
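The first of the two broadening mechanisms described above, the short support interval of a pulsed sine wave, can be illustrated numerically. The sketch below is only an illustration under assumed values (a 4096-point transform, a tone centred on bin 1000, and pulse durations of 64, 128 and 256 samples); the function name `half_power_width` is hypothetical. It counts the FFT bins lying within 3 dB of the spectral peak, which should roughly halve each time the pulse duration doubles.

```python
import numpy as np

def half_power_width(duration, nfft=4096, k0=1000):
    """Number of FFT bins within 3 dB of the peak for a pulsed complex sine wave.

    The pulse lasts `duration` samples, is centred in frequency on bin k0,
    and is analysed with an nfft-point transform (zero padded).
    """
    n = np.arange(duration)
    pulse = np.exp(2j * np.pi * k0 * n / nfft)
    power = np.abs(np.fft.fft(pulse, nfft)) ** 2
    return int(np.sum(power >= 0.5 * power.max()))

# Observed width for three support intervals, each twice as long as the previous one.
widths = {d: half_power_width(d) for d in (64, 128, 256)}
for d, width in widths.items():
    print(d, width)
```

The analysis interval is the same in all three cases; only the support of the signal changes, so the change in observed bandwidth is due to signal duration alone.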

There are several questions we may ask about the STFT surface. First of all, is there any signal which has a given surface as its STFT? These surfaces are very delicate, in the sense that it is extremely difficult to modify an STFT surface to produce a new surface which is again the STFT of some signal. In most time-frequency processing, the phase is removed and the TF surface is represented as an energy distribution. In such a representation, it is legitimate to ask where the signal energy resides, and how the energy changes with respect to time and frequency. If we consider only an energy distribution surface, and no information (such as the marginals) extracted directly from the signal from which the surface was derived, then (except in special circumstances) there is no known way to determine from the surface alone whether the distribution faithfully represents the underlying signal and its individual components. However, if we include the surface phase, the situation changes completely. We can now look at the phase and ask whether the observed phase of the surface is consistent with the position of the observation. Actually, we cannot determine consistency from only one observed phase, since we can rotate the original signal by an arbitrary phase $\phi$ to create a signal whose STFT has any desired phase at any one point on the STFT:

$$e^{i\phi}F(\omega) = \int e^{i\phi} f(t)\,e^{-i\omega t}\,dt. \tag{8}$$

What we can ask is whether the phase of any point on the STFT surface is consistent with its position on the surface and the phase of the surface elements near it. Instead of asking whether the phase of any surface component is consistent with observations, consider the question of whether the surface component is at the correct frequency. We ask how fast the phase of the surface at a particular point $(T,\omega_0)$ is rotating as a function of time. Since frequency is the derivative of phase with respect to time, and the rate of rotation with respect to time of the surface phase at $(T,\omega_0)$ is also the derivative of phase with respect to time, the time derivative of the surface phase represents a frequency estimate. Since the frequency coordinate $\omega_0$ of the observed point also represents a frequency estimate, under the right circumstances the two estimates should agree. If they do not agree, then one estimate is wrong. The question is which one, if any, is correct.

Let us now fix time and ask whether the phase observations are consistent with the time location $(T,\omega_0)$ of the surface component. As in the frequency estimation case, we cannot ask whether the time at which a particular filter responded to a particular frequency component is correct. We must ask whether the phase of the response is consistent with the observations at neighboring frequencies. If the response is not consistent, then something is wrong.

3.1 Cross-spectrum

We define the cross-spectrum of two signals $f(t)$ and $g(t)$ as the product of the STFT of one of the signals and the conjugate of the STFT of the other signal,

$$F_w(T,\omega)\,G_w^*(T,\omega). \tag{9}$$

This notation is an extension of the usual definition of the cross spectrum as the spectrum of the cross correlation function. For fixed $T$, our surface definition of the cross spectrum is precisely the spectrum of the cross correlation of the two signals windowed with window $w$. We loosely refer to the cross-spectrum of a signal $f(t)$ as the conjugate product of the STFT and the STFT delayed in time, frequency, or both time and frequency. The most general form of this self cross-spectrum is

$$F_w(T,\omega,\epsilon,\nu) = F_w(T+\epsilon,\omega+\nu)\,F_w^*(T,\omega). \tag{10}$$

3.2 Separability

Now let us assume a composite signal of the form

$$f(t) = \sum_n f_n(t) = \sum_n A_n(t)\,e^{i\phi_n(t)}, \tag{11}$$

with the condition that the phase $\phi_n$ of each component has a continuous, bounded derivative and that the magnitude $A_n(t)$ of the component is continuous and slowly varying. We further assume that in some neighborhood of the point $(T_0,\omega_0)$ in the time-frequency plane no component other than $f_n$ has significant energy. By this we mean that, if $F_{nw}(T,\omega)$ represents the STFT of component $f_n$, then we have the separability condition [2,3]

$$|F_{nw}(T,\omega)| \gg |F_{mw}(T,\omega)|, \qquad m \neq n, \quad (T,\omega) \in N(T_0,\omega_0), \tag{12}$$

for some neighborhood $N(T_0,\omega_0)$. Separability is a local condition, which establishes the conditions under which we can effectively isolate a single signal component $F_{nw}$ on a TF surface. Separability is crucial, since it means that there can be only one such component with any significant amount of energy in an entire neighborhood of a point on the surface. If this is the case, and the frequency is monotonic, and the AM contribution is small compared to the FM, the surface can be effectively remapped by the CIF and LGD surfaces presented in the next section. If these conditions are violated, the process does not work.
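The separability condition (12) can be made concrete with two well-separated tones. The following is a rough sketch under assumed values: a 33-sample Hann window, a 64-bin transform, components centred on bins 10 and 30, and a factor of 100 standing in for "much greater than". The helper name `stft_frame` and all of these numbers are illustrative choices, not values from the paper.

```python
import numpy as np

def stft_frame(f, w, T, N):
    """All N bins of the discrete STFT at time T (window stored on offsets -M..M)."""
    M = (len(w) - 1) // 2
    t = np.arange(-M, M + 1)
    buf = np.zeros(N, dtype=complex)
    buf[t % N] = f[T + t] * w[M - t]
    return np.fft.fft(buf)

N, T = 64, 50
w = np.hanning(33)
n = np.arange(128)
f1 = np.exp(2j * np.pi * 10 * n / N)     # component centred on bin 10
f2 = np.exp(2j * np.pi * 30 * n / N)     # component centred on bin 30

# The STFT is linear, so the STFT of f1 + f2 is the sum of the two surfaces;
# separability compares their magnitudes inside a neighborhood of the first one.
F1 = np.abs(stft_frame(f1, w, T, N))
F2 = np.abs(stft_frame(f2, w, T, N))
neighborhood = np.arange(8, 13)          # bins around the first component
separable = bool(np.all(F1[neighborhood] > 100 * F2[neighborhood]))
print(separable)
```

Moving the second tone closer to the first, or shortening the window, would be expected to make the test fail, which is the situation in which the text says the remapping does not work.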

4 Local group delay and channelized IF

In order to address the issue of correctness of the surface representation raised in the previous section, we must define two surfaces which are dependent on the STFT phase. These surfaces are the channelized instantaneous frequency (CIF) and the local group delay (LGD). The CIF surface represents the rate of change of the phase of the STFT with respect to time, and the LGD surface is essentially the rate of change of the phase of the Fourier surface with respect to frequency. If the STFT surface satisfies the separability condition at a point, then at that point the CIF represents an estimate of the frequency of the component, and the LGD represents a relative time offset. The LGD is essentially a timing error estimate for the STFT surface, and the correctness of the estimate is dependent on both separability and the assumption that the frequency of the component is locally monotonic as a function of time. Together the CIF and the LGD represent the phase gradient of the STFT, and together they answer the question an observer on the surface might ask: "Am I at the correct point on the time-frequency surface, based on the rate at which the surface components are spinning?" The CIF is a function of time and frequency, and may be computed as the partial derivative with respect to time of the STFT phase

$$\omega_{IF}(F_w;T,\omega) = \frac{\partial}{\partial T}\arg\{F_w(T,\omega)\}. \tag{13}$$

Similarly, the LGD function may be computed by differentiating the STFT phase with respect to frequency as

$$G(F_w;T,\omega) = -\frac{\partial}{\partial \omega}\arg\{F_w(T,\omega)\}. \tag{14}$$

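Both of these surfaces may be easily estimated from cross-spectra as

$$\omega_{IF}(F_w;T,\omega) = \lim_{\epsilon\to 0}\frac{1}{\epsilon}\arg\{F_w(T+\tfrac{\epsilon}{2},\omega)\,F_w^*(T-\tfrac{\epsilon}{2},\omega)\} \tag{15}$$

and

$$G(F_w;T,\omega) = -\lim_{\epsilon\to 0}\frac{\delta T}{2\pi\epsilon}\arg\{F_w(T,\omega+\tfrac{\epsilon}{2})\,F_w^*(T,\omega-\tfrac{\epsilon}{2})\}, \tag{16}$$

where $\delta T$ is the length of the analysis interval of the Fourier transform.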
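For sampled data, the limits in Eqs. (15) and (16) are replaced by finite differences. The sketch below is an illustrative assumption rather than the authors' code: it uses a one-sample delay in time and a one-bin offset in frequency (one-sided differences instead of the symmetric $\pm\epsilon/2$ form), takes the transform length $N$ in the role of $\delta T$, and uses the hypothetical helper names `stft_frame`, `cif` and `lgd`. A pure tone is used to check the CIF and a single click to check the LGD.

```python
import numpy as np

def stft_frame(f, w, T, N):
    """All N bins of the discrete STFT at time T (window stored on offsets -M..M)."""
    M = (len(w) - 1) // 2
    t = np.arange(-M, M + 1)
    buf = np.zeros(N, dtype=complex)
    buf[t % N] = f[T + t] * w[M - t]
    return np.fft.fft(buf)

def cif(f, w, T, N):
    """Channelized instantaneous frequency (rad/sample) from a one-sample delay."""
    cross = stft_frame(f, w, T + 1, N) * np.conj(stft_frame(f, w, T, N))
    return np.angle(cross)

def lgd(f, w, T, N):
    """Local group delay (samples) from a one-bin frequency offset."""
    F = stft_frame(f, w, T, N)
    cross = np.roll(F, -1) * np.conj(F)          # F[k+1] * conj(F[k]), circular in k
    return -np.angle(cross) * N / (2 * np.pi)

w, N, T = np.hanning(33), 64, 50
n = np.arange(128)

tone = np.exp(1j * 0.9 * n)                      # tone at 0.9 rad/sample, between bins
k_peak = int(np.argmax(np.abs(stft_frame(tone, w, T, N))))
tone_cif = cif(tone, w, T, N)[k_peak]            # frequency estimate at the peak bin

click = np.zeros(128)
click[53] = 1.0                                  # impulse 3 samples after the frame centre
click_lgd = lgd(click, w, T, N)                  # time offset reported by every bin

print(k_peak, tone_cif, click_lgd[:4])
```

In this construction the CIF returns the tone frequency itself rather than the centre frequency of the bin in which it was observed, and the LGD returns the offset of the click from the nominal frame time, which is the sense in which the two surfaces answer the observer's question posed above.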

If we assume a pure FM modulation (i.e., the amplitude is slowly varying with respect to time), the local group delay has the interpretation that, for each frequency $\omega_0$, it represents the time delay of the signal component at that frequency relative to the observation time assigned to the Fourier transform. This is similar to the IF, which represents the instantaneous frequency of the time domain signal at a particular time. We see that the contribution of the STFT at $(T,\omega)$ may be effectively remapped as [2,3]

$$F_{remap}\big(T+\lambda\,G(F_w;T,\omega),\ \omega_{IF}(F_w;T,\omega)\big) = F_w(T,\omega), \tag{17}$$

where $\lambda$ is a constant dependent on the sample rate and transform size for sampled signals. The remapped signal is the basis of the increased accuracy of the cross-spectral methods. What we would ideally like is for the TF surface to be in a state of "equilibrium", in which the analytic signal is distributed over the TF surface $F(T,\omega)$ in such a way that

$$\omega_{IF}(F;T,\omega) \approx \omega \tag{18}$$

and

$$G(F;T,\omega) \approx 0. \tag{19}$$

If this could be accomplished, the surface $F(T,\omega)$ would be consistent with both the CIF and the LGD. In this entire discussion, we have assumed that the AM contribution is small by assuming that the amplitude component is slowly varying. The contribution of the AM component cannot be ignored, since AM modulation of the signal contributes to the phase and magnitude of the Fourier transform and cannot be distinguished from the contribution of FM modulation. If the AM contribution is significant, the phase derivative estimates become biased and unreliable.
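The frequency half of the remapping in Eq. (17) can be sketched for a single frame. This is a simplified illustration under stated assumptions: only the CIF is used (the time shift by the LGD is omitted), the energy $|F_w|^2$ of each bin is moved to the bin nearest its CIF, and the names `stft_frame` and `remap_by_cif` are hypothetical. The test signal is a tone placed between bins so that the ordinary power spectrum is smeared over the main lobe of the window.

```python
import numpy as np

def stft_frame(f, w, T, N):
    """All N bins of the discrete STFT at time T (window stored on offsets -M..M)."""
    M = (len(w) - 1) // 2
    t = np.arange(-M, M + 1)
    buf = np.zeros(N, dtype=complex)
    buf[t % N] = f[T + t] * w[M - t]
    return np.fft.fft(buf)

def remap_by_cif(f, w, T, N):
    """Return (power spectrum, power spectrum remapped to the CIF) for one frame."""
    F_now = stft_frame(f, w, T, N)
    F_next = stft_frame(f, w, T + 1, N)
    omega_if = np.angle(F_next * np.conj(F_now)) % (2 * np.pi)    # CIF in [0, 2*pi)
    target = np.rint(omega_if * N / (2 * np.pi)).astype(int) % N  # nearest bin index
    power = np.abs(F_now) ** 2
    remapped = np.zeros(N)
    np.add.at(remapped, target, power)                            # accumulate energy
    return power, remapped

N, T = 64, 50
w = np.hanning(33)
n = np.arange(128)
tone = np.exp(2j * np.pi * 10.3 * n / N)       # tone at 10.3 bins (assumption)

power, remapped = remap_by_cif(tone, w, T, N)
print(power.max() / power.sum(), remapped[10] / remapped.sum())
```

For a single separable component, essentially all of the energy should end up in the one bin nearest the true frequency, whereas the largest bin of the ordinary power spectrum holds only a fraction of it.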

4.1 Higher order phase derivatives

Sometimes it is useful to be able to compute the higher order phase derivatives of the STFT surface. If we assume a representation of the STFT surface

$$F_w(T,\omega) = A(T,\omega)\,e^{i\phi(T,\omega)}, \tag{20}$$

we may compute the $n$th derivative of the surface phase with respect to time as

$$\phi^{(n)}(T,\omega) = \lim_{\epsilon\to 0}\frac{1}{\epsilon^{n}}\arg\{F_{(n)}(T,\omega)\}, \tag{21}$$

where $F_{(n)}(T,\omega) = F_{(n-1)}(T,\omega)\,F_{(n-1)}^*(T-\epsilon,\omega)$, $n = 1, 2, \ldots$, and $F_{(0)}(T,\omega) = F_w(T,\omega)$. The derivatives with respect to frequency, and the mixed derivatives, may be similarly computed.
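The recursion behind Eq. (21) is short enough to sketch directly. The example below makes several assumptions: the delay $\epsilon$ is one sample, the input is any complex sequence standing in for a single STFT channel $F_w(T,\omega_0)$ sampled in time, and the function name `phase_derivative` is hypothetical. A unit-magnitude sequence with a quadratic phase is used, so that the first difference of the phase is linear and the second is constant.

```python
import numpy as np

def phase_derivative(F, n, eps=1.0):
    """n-th phase difference of a complex sequence via repeated conjugate products.

    Each pass forms F_(m)[T] = F_(m-1)[T] * conj(F_(m-1)[T - 1]); the argument of
    the result divided by eps**n approximates the n-th phase derivative.
    The output is n samples shorter than the input.
    """
    Fn = np.asarray(F, dtype=complex)
    for _ in range(n):
        Fn = Fn[1:] * np.conj(Fn[:-1])
    return np.angle(Fn) / eps ** n

T = np.arange(100)
a, b = 0.2, 0.01                                     # phase = a*T + b*T**2 / 2 (assumption)
channel = np.exp(1j * (a * T + 0.5 * b * T ** 2))    # stand-in for one STFT channel

first = phase_derivative(channel, 1)                 # a + b*(T + 1/2), a linear ramp
second = phase_derivative(channel, 2)                # constant, equal to b
print(first[:3], second[:3])
```

Because only conjugate products and one final argument are taken, the estimate is not affected by phase wrapping as long as the individual differences stay below $\pi$ in magnitude.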

5 Some applications

We apply the cross-spectral methods to the problem of accurately estimating the signal components of clean telephone quality speech from the NTIMIT database, noisy helicopter cockpit speech and whale sounds. The first problem we consider is the accurate estimation of the repetition frequency of the glottal pulse train which excites the vocal tract in voiced speech. This excitation frequency is sometimes called pitch. When each glottal pulse excites the vocal tract, what is observed is the convolution of the glottal pulse and the impulse response of the vocal tract. In the time domain, the resulting waveform is similar to the superposition of several damped oscillations at different resonant frequencies (formants). Because the vocal tract changes very slowly, the responses observed for several consecutive pulses are nearly identical. Although the frequency of the glottal pulses changes fairly rapidly, the waveform is nearly periodic. What is observed in the frequency domain is a harmonic structure, in which Nelson has argued the harmonics may be expected to be phase locked [?].

We segmented the data into frames of length 401 samples (25 milliseconds) and windowed with a prolate-spheroidal window. The windowed data were zero filled to 2048 samples, and the power spectrum and the CIF were computed. The CIF was computed using a delay of one sample. The CIF was then used to remap the spectral energy. The results of both operations are presented in Fig. 1a. The CIF as a function of frequency is presented in Fig. 1b. The autocorrelation function of the complex cross-spectrum was computed, following the method of Nelson and Wysocki [?], and the results are compared to the autocorrelation of the power spectrum in Fig. 1c. The phase of the autocorrelation of the cross-spectrum is essentially the CIF of the spectral correlation function [?]. This function is displayed in Fig. 1d. The purpose of the spectral autocorrelation function in these displays is to produce a bulge at the pitch fundamental frequency, in the case that the fundamental is not observed in the original spectrum.

The same cross-spectral methods were applied to typical segments of the helicopter cockpit data using a 201 sample (25 millisecond) window, and the results of the two correlations are displayed in Figs. 1e and 1f. The ability to recover accurate pitch or formant estimates from noisy speech makes it possible to isolate the speech components on the TF plane.

In processing the whale data, the segment of data selected was not voiced. The data consisted of several whales "talking" in a noisy ocean environment. The CIF and LGD were used to remap the sounds, and the results are presented with the normal spectrogram. Both the spectrogram and the remapped data were computed from the same STFT. The results are shown at two resolutions in Figs. 1g and 1h. The rectangular blocking of the spectrogram represents the TF tiling of the STFT. Each rectangle represents one cell of the STFT. The white +'s represent the remapped spectrogram. No attempt was made to remove or mitigate the noise, interference and cochannel interference in the displayed data.
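The pitch-estimation steps described in this section can be sketched on synthetic data. The example below is a rough illustration, not the authors' procedure or data: the frame length (401 samples), zero fill (2048) and one-sample delay are the values quoted in the text, but the test signal (harmonics 2 to 10 of a hypothetical 180 Hz pitch, with the fundamental absent), the Kaiser window standing in for the prolate-spheroidal window, the 80 to 400 Hz search range and all variable names are assumptions.

```python
import numpy as np

fs, frame_len, nfft = 16000, 401, 2048       # sample rate, frame length, zero fill
f0_true = 180.0                               # hypothetical pitch of the test signal
n = np.arange(frame_len + 1)
# Harmonics 2..10 only: the fundamental itself is missing from the spectrum.
x = sum(np.cos(2 * np.pi * h * f0_true * n / fs + 0.3 * h) for h in range(2, 11))

w = np.kaiser(frame_len, 8.0)                 # stand-in for a prolate-spheroidal window
F_now = np.fft.rfft(x[:frame_len] * w, nfft)
F_next = np.fft.rfft(x[1:frame_len + 1] * w, nfft)   # same frame delayed by one sample
cross = F_next * np.conj(F_now)               # complex cross-spectrum; its phase is the CIF

def spectral_autocorr(c, lag):
    """Autocorrelation of a spectrum along the frequency axis at one lag (in bins)."""
    return np.sum(c[lag:] * np.conj(c[:len(c) - lag]))

lags = np.arange(int(80 * nfft / fs), int(400 * nfft / fs) + 1)   # 80-400 Hz range
R = np.array([spectral_autocorr(cross, m) for m in lags])
best = int(np.argmax(np.abs(R)))

pitch_coarse = lags[best] * fs / nfft                 # quantized to the bin spacing
pitch_fine = np.angle(R[best]) * fs / (2 * np.pi)     # from the phase of the peak
print(pitch_coarse, pitch_fine)
```

The magnitude of the spectral autocorrelation locates the harmonic spacing to within one bin, while the phase at that peak carries the difference between the CIFs of adjacent harmonics, which is why it can give a finer estimate than the bin spacing of about 7.8 Hz.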

6 Denoising using the short-time discrete cosine transform

The objective of this section is to develop a noise suppression technique based on the short-time discrete cosine transform (DCT), to implement a computationally efficient algorithm, and to test its performance in an actual noise environment such as the helicopter cockpit. The concept of short-time signal processing with filtering in the domain of an orthogonal transform is usually used when signals such as speech have inherently infinite length. Moreover, since the speech signal properties (amplitudes, frequencies, and phases) change with time, a single orthogonal transform is not sufficient to describe such signals. The short-time orthogonal transform of a signal $x_k$ is defined as

[Figure 1, panels a-h: plots not reproduced. Panels g and h are titled "Class 26 3.tpf -- multicomponent clicks (broadband) and whistles".]

Figure 1: a. Normal power spectrum of 25 milliseconds of NTIMIT data and spectrum remapped by CIF. b. CIF of 25 milliseconds of NTIMIT data. c. Autocorrelation of power spectrum of NTIMIT data and CIF remapped spectrum. d. Phase of autocorrelation of CIF calibrated in Hz. e. Autocorrelation of normal power spectrum and autocorrelation of spectrum remapped by CIF of 25 milliseconds of helicopter cockpit recording. f. Phase of autocorrelation of helicopter CIF calibrated in Hz. g. Spectrogram of whale sounds and whale spectrum remapped by CIF and LGD. h. Expansion of a portion of the whale spectrogram and the same data remapped by CIF and LGD. Blocking in spectrogram represents quantization of spectrogram.

$$X_m^k = \sum_{l=-\infty}^{\infty} x_{k+l}\, w_l\, \psi(l, m) \tag{22}$$

where $w_l$ is a window sequence and $\psi(l, m)$ are the basis functions of an orthogonal transform. Equation (22) can be interpreted as the orthogonal transform of $x_k$ as viewed through the window $w_l$; $X_m^k$ displays the orthogonal-transform characteristics of the signal around time $k$. Note that while a longer window and the higher resolution it brings are typically beneficial in spectral analysis of stationary data, for time-varying data it is preferable to keep the window short enough that the signal is approximately stationary over the window duration. Next we assume that the window has finite length around $l = 0$ and is unity for all $l \in [-N_1, N_1]$, where $N_1$ is an integer. This leads to signal processing in a running window [5]. In other words, local filters (a bank of filters) in the domain of an orthogonal transform modify the transform coefficients of the signal at each position of a moving window, to obtain an estimate of only the central sample $x_k$ of the window.

The choice of orthogonal transform for short-time signal processing depends on many factors. The DCT is one of the most appropriate transforms with respect to the accuracy of power spectrum estimation from observed data, filter design, and the computational complexity of the filter implementation. For example, linear filtering in the DCT domain followed by inverse transformation is superior to filtering in the domain of the discrete Fourier transform (DFT), because the DCT can be considered as the DFT of a signal evenly extended outside its edges. This attenuates the boundary (temporal aliasing) effects caused by circular convolution that are typical of the DFT domain. In the case of the DFT, speech frames are usually windowed to avoid temporal aliasing and to ensure a smooth transition of filters between successive frames. For filtering in the DCT domain, the windowing operation can be skipped, which further reduces the computational complexity. The short-time DCT of a signal $x_k$ can be written as

$$X_m^k = \sum_{l=-N_1}^{N_1} x_{k+l} \cos\!\left[\pi m\, \frac{l + N_1 + 1/2}{N}\right] \tag{23}$$

where $\{x_k;\ k = \ldots, -N_1, -N_1+1, \ldots, 0, \ldots, N_1, \ldots\}$ is an infinite-length speech sequence, $\{X_m^k;\ m = 0, \ldots, N-1\}$ are the transform coefficients around time $k$, and $N = 2N_1 + 1$ is the length of the DCT. Note that $N$ is an arbitrary integer value determined by the pitch period of the speech; it can be chosen to be approximately twice as large as the maximum expected pitch period for adequate frequency resolution. It has been shown [6] that the algorithm for computing the DCT in a running window requires $2(N-1)$ multiplication operations. The inverse transform after filtering in the DCT domain is performed to compute only the central sample of the window, in the following way:

$$\hat{x}_k = \sum_{m=0}^{N_1} \alpha_m\, \hat{X}_{2m}^k\, (-1)^m \tag{24}$$

where $\{\hat{x}_k\}$ is the processed speech sequence, $\{\hat{X}_m^k;\ m = 0, \ldots, N-1\}$ are the modified transform coefficients around time $k$, and $\{\alpha_0 = 1;\ \alpha_m = 2,\ m = 1, \ldots, N-1\}$ are the scaling coefficients of the DCT. It is interesting to note that only the spectral coefficients with even indices are involved in the computation.

Next we design local filters to enhance noisy speech. We assume that the clean speech signal $\{s_k\}$ is degraded by zero-mean additive noise $\{v_k\}$:

$$x_k = s_k + v_k \tag{25}$$
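Before turning to the filter design, the forward transform of Eq. (23) and the central-sample inverse of Eq. (24) can be checked numerically. The following is a minimal, dependency-free sketch rather than the fast recursive algorithm of [6]; the function names and the test signal are hypothetical. Note one assumption: because the forward transform in Eq. (23) is unnormalized, the sketch divides by $N$ so that unmodified coefficients return $x_k$ exactly.

```python
import math


def short_time_dct(x, k, n1):
    """Coefficients X_m^k of Eq. (23) for the window centred on sample k (direct O(N^2) sum)."""
    n = 2 * n1 + 1  # window length N = 2*N1 + 1
    return [
        sum(x[k + l] * math.cos(math.pi * m * (l + n1 + 0.5) / n)
            for l in range(-n1, n1 + 1))
        for m in range(n)
    ]


def central_sample(coeffs, n1):
    """Central-sample inverse of Eq. (24); only even-indexed coefficients are used.

    The division by N is an assumption added here so that the round trip
    is exact for the unnormalized forward transform above.
    """
    n = 2 * n1 + 1
    total = coeffs[0]  # alpha_0 = 1
    for m in range(1, n1 + 1):
        total += 2.0 * (-1) ** m * coeffs[2 * m]  # alpha_m = 2
    return total / n


# Hypothetical test signal: with no filtering, each central sample is recovered.
x = [math.sin(0.3 * i) + 0.1 * i for i in range(40)]
n1 = 4
positions = range(n1, len(x) - n1)
recovered = [central_sample(short_time_dct(x, k, n1), n1) for k in positions]
max_err = max(abs(r - x[k]) for r, k in zip(recovered, positions))
print(max_err)
```

The odd-indexed coefficients drop out because the cosine basis functions with odd $m$ are zero at the window centre, which is why Eq. (24) needs only about half of the coefficients.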

Let $X_m^k$, $S_m^k$, $V_m^k$, and $\hat{S}_m^k$ be the DCT coefficients in a running window around time $k$ of the noisy speech, clean speech, noise, and filtered signal, respectively. Different criteria can be used for the filter design. In the following analysis we use the minimum mean square error (MMSE) criterion around time $k$, which, taking into account Eq. (24), is defined in the DCT domain as

$$\mathrm{MMSE}_k = E\left\langle \sum_{m=0}^{N_1} \alpha_m^2 \left[ S_{2m}^k - \hat{S}_{2m}^k \right]^2 \right\rangle \tag{26}$$

where $E\langle\cdot\rangle$ denotes the expected value.

As mentioned above, the window length is chosen so that the noise can be considered stationary within the window. Let $P_m^k = E\langle (V_m^k)^2 \rangle$ denote the power spectrum of the noise in the DCT domain. Suppose that $\hat{S}_m^k = X_m^k H_m^k$, where $H_m^k$ is the filter to be designed. By minimizing $\mathrm{MMSE}_k$ with respect to $H_m^k$, we arrive at a version of the Wiener filter in the DCT domain:

$$H_m^k = \frac{|X_m^k|^2 - P_m^k}{|X_m^k|^2} = 1 - \frac{P_m^k}{|X_m^k|^2} \tag{27}$$
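The step from the criterion of Eq. (26) to the gain of Eq. (27) can be filled in as follows. This is a sketch for a single coefficient (indices $m$ and $k$ dropped), since Eq. (26) is a sum of non-negative terms that can be minimized one by one. It relies on two assumptions that are not spelled out above: the speech and noise coefficients are uncorrelated, and the expected power $E\langle X^2 \rangle$ is replaced by the observed $|X|^2$ in the last step.

```latex
% X = S + V, with E<V> = 0 and E<SV> = 0 (assumed), P = E<V^2>
\begin{aligned}
J(H) &= E\langle (S - HX)^2 \rangle, \\
\frac{\partial J}{\partial H} &= -2\,E\langle (S - HX)\,X \rangle = 0
  \;\Rightarrow\; H = \frac{E\langle SX \rangle}{E\langle X^2 \rangle}, \\
E\langle SX \rangle &= E\langle S^2 \rangle = E\langle X^2 \rangle - P, \\
H &= \frac{E\langle X^2 \rangle - P}{E\langle X^2 \rangle}
   \;\approx\; 1 - \frac{P}{|X|^2}.
\end{aligned}
```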

The DCT spectrum estimate of the speech is given by

$$\hat{S}_m^k = \begin{cases} \left(1 - \dfrac{P_m^k}{|X_m^k|^2}\right) X_m^k, & \text{if } |X_m^k|^2 > P_m^k \\[2ex] 0, & \text{otherwise} \end{cases} \tag{28}$$

The resulting filter can be regarded as a spectral subtraction method in the DCT domain. In general, spectral subtraction methods [7], while reducing wide-band noise, also introduce a new "musical" noise due to the remaining spectral peaks. To attenuate the musical noise, one can oversubtract the noise power spectrum by introducing a nonzero power spectrum bias. Finally, the DCT spectrum estimate of the speech signal can be written as

$$\hat{S}_m^k = \begin{cases} \left(1 - \dfrac{P_m^k}{|X_m^k|^2}\right) X_m^k, & \text{if } |X_m^k|^2 > P_m^k + B^k \\[2ex] 0, & \text{otherwise} \end{cases} \tag{29}$$

where $B^k$ is a speech-dependent bias value. The filtered speech signal is then obtained using Eq. (24).

A test speech signal recorded in a helicopter environment is presented in Fig. 2a. The data were sampled at 16 kHz, and a window length of 761 samples was used in our tests. The short-time squared DCT coefficients of the noisy speech, averaged over all positions of the running window, are presented in Fig. 2b. The power spectrum of the noise, obtained by direct measurement of the background noise in intervals where speech is absent, is shown in Fig. 2c. The spectral distributions of the speech and of the helicopter cockpit noise differ, which helps in suppressing the helicopter noise. The result of filtering with the proposed algorithm is shown in Fig. 2d.

In this section we derived a filter for noise suppression on the assumption that speech is always present in the measured data. However, if a given frame of data consists of noise alone, a better suppression filter can be used [8,9]. In general, the algorithm should include a detector of voiced and unvoiced speech signals.
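The two ingredients described above, a noise power spectrum measured in speech-free intervals and the thresholded gain of Eqs. (28) and (29), can be sketched as follows. This is an illustrative sketch only: the function names, the toy coefficient values, and the use of a constant bias are assumptions, and the coefficients are taken as already computed by a short-time DCT.

```python
def estimate_noise_power(noise_frames):
    """Average the squared DCT coefficients over speech-free frames to estimate P_m."""
    count = len(noise_frames)
    size = len(noise_frames[0])
    return [sum(frame[m] ** 2 for frame in noise_frames) / count
            for m in range(size)]


def suppress(coeffs, noise_power, bias=0.0):
    """Apply the gain of Eq. (29) to one frame; bias=0 reduces it to Eq. (28)."""
    out = []
    for x_m, p_m in zip(coeffs, noise_power):
        power = x_m * x_m
        if power > p_m + bias:
            out.append((1.0 - p_m / power) * x_m)  # attenuate by 1 - P/|X|^2
        else:
            out.append(0.0)  # coefficient judged to be noise only
    return out


# Toy values (hypothetical): two noise-only frames and one noisy speech frame.
noise_frames = [[0.5, -1.0, 0.2], [-0.5, 1.0, 0.4]]
p = estimate_noise_power(noise_frames)
noisy = [2.0, 1.1, -0.3]
plain = suppress(noisy, p)             # Eq. (28)
biased = suppress(noisy, p, bias=0.5)  # Eq. (29) with a constant bias
print(p, plain, biased)
```

In this toy frame the second coefficient barely exceeds the noise power, so Eq. (28) keeps a small residual of it, whereas the bias in Eq. (29) zeroes it. Isolated residuals of this kind are what is heard as musical noise.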

Figure 2: a. Time waveform of helicopter speech. b. Average squared DCT magnitude of noisy speech. c. Average squared DCT magnitude of noise. d. Enhanced speech signal.

Figure 3: Implementation diagram of the TFD-based method for estimating the signal AM and FM and removing noise. (Blocks: calculate TFD, estimate IF, estimate IBW, LTV filter; the diagram is divided into an AM-FM analysis stage and an AM-FM synthesis stage, with an inset plotting frequency versus time.)

7 Denoising TFD-Based AM-FM Coherent Demodulation

Noise removal via linear time-invariant (LTI) filtering is most effective when the signal and noise spectra have minimal overlap in frequency. In particular, it can be difficult to extract broadband signals from broadband noise via LTI filtering, because their spectra often overlap. However, many broadband signals are locally narrowband (e.g., AM-FM signals with large FM and moderate to small AM), by which we mean that the instantaneous bandwidth, which is determined by the AM, is small at any given time; the global bandwidth, on the other hand, is determined by both the AM and the FM, and can thus be much larger than the local bandwidth [14]. This characteristic of broadband signals that are locally narrowband can be exploited to improve noise suppression for such signals. In this section, we present a method [?] for extracting locally narrowband signals from broadband noise, based on an AM-FM decomposition of the signal and time-varying filtering. The center frequency and passband of a linear time-varying filter are determined from estimates of the instantaneous frequency and instantaneous bandwidth of the signal, obtained from a TFD. The method is applied to a whale sound recorded in ambient ocean noise, and exhibits greater noise removal than LTI-based filtering.

7.1 AM-FM Signal Model

We model the signal as the real part of a complex signal with amplitude $A(t)$ and phase $\varphi(t)$,

$$s(t) = \mathrm{Re}\{A(t)\, e^{j[\varphi_A(t) + \varphi_F(t)]}\} = \mathrm{Re}\{(A_I(t) + jA_Q(t))\, e^{j\varphi_F(t)}\} \tag{30}$$

where the signal phase is divided into two parts [?], $\varphi(t) = \varphi_A(t) + \varphi_F(t)$, one of which ($\varphi_F(t)$) determines the frequency modulation (FM) of the signal, while the other ($\varphi_A(t)$) is coupled with the amplitude to produce quadrature amplitude modulation (QAM):

$$A(t)\, e^{j\varphi_A(t)} = A_I(t) + jA_Q(t). \tag{31}$$
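The equivalence of the two forms in Eq. (30), with the quadrature amplitudes defined by Eq. (31), can be verified on a synthetic signal. The sketch below is illustrative only; the sampling rate, modulation frequencies, and chirp parameters are hypothetical values chosen for the check.

```python
import cmath
import math


def synthesize(a_i, a_q, phi_f):
    """Right-hand form of Eq. (30): Re{(A_I + j A_Q) exp(j phi_F)}, sample by sample."""
    return [(complex(i, q) * cmath.exp(1j * p)).real
            for i, q, p in zip(a_i, a_q, phi_f)]


fs = 8000.0  # hypothetical sampling rate in Hz
t = [n / fs for n in range(400)]
amp = [1.0 + 0.5 * math.cos(2 * math.pi * 20 * tt) for tt in t]          # A(t)
phi_a = [0.3 * math.sin(2 * math.pi * 15 * tt) for tt in t]              # phi_A(t)
phi_f = [2 * math.pi * (1000 * tt + 0.5 * 4000 * tt * tt) for tt in t]   # chirp phi_F(t)

# Eq. (31): fold phi_A(t) into the in-phase and quadrature amplitudes.
a_i = [a * math.cos(pa) for a, pa in zip(amp, phi_a)]
a_q = [a * math.sin(pa) for a, pa in zip(amp, phi_a)]

left = [a * math.cos(pa + pf) for a, pa, pf in zip(amp, phi_a, phi_f)]  # left form of Eq. (30)
right = synthesize(a_i, a_q, phi_f)
max_diff = max(abs(l - r) for l, r in zip(left, right))
print(max_diff)
```

Written out in real form, the model is $s(t) = A_I(t)\cos\varphi_F(t) - A_Q(t)\sin\varphi_F(t)$, which is the expression a synthesis stage would evaluate.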

Note that we can reconstruct a signal from its FM and QAM. As shown next, time-frequency distributions allow us to obtain noise-robust estimates of these quantities, in particular for locally narrow-band mono-component signals in broadband noise, thereby resulting in enhanced noise removal compared to LTI methods.

7.2 Algorithm for Noise Removal via TFD-based AM-FM Decomposition

Let the observed signal be corrupted by additive broadband noise, $x(t) = s(t) + \eta(t)$. The underlying assumption of the proposed method is that, in the time-frequency plane, the signal $s(t)$ stands out above the noise floor caused by $\eta(t)$,$^7$ so that a robust estimate of the FM part of the phase, $\varphi_F(t)$, can be obtained from a TFD $P(t, \omega)$ of the signal via peak picking:

$$\hat{\varphi}'_F(t) = \arg\max_{\omega} \hat{P}(t, \omega), \qquad \hat{\varphi}_F(t) = \int \hat{\varphi}'_F(\tau)\, d\tau \tag{32}$$
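Peak picking and phase integration as in Eq. (32) can be sketched on a synthetic chirp in white noise. Note that this sketch swaps in an ordinary Hann-window spectrogram as the TFD, not the multi-window spectrogram or iterative estimator used by the authors; the function names, signal parameters, and noise level are all hypothetical.

```python
import cmath
import math
import random


def spectrogram(x, win_len, hop):
    """Squared-magnitude STFT with a Hann window; returns (frame centres, columns)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1)) for n in range(win_len)]
    twiddle = [[cmath.exp(-2j * math.pi * b * n / win_len) for n in range(win_len)]
               for b in range(win_len // 2 + 1)]
    centres, columns = [], []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = [x[start + n] * window[n] for n in range(win_len)]
        columns.append([abs(sum(f * w for f, w in zip(frame, row))) ** 2 for row in twiddle])
        centres.append(start + win_len // 2)
    return centres, columns


def peak_pick_if(columns, fs, win_len):
    """First part of Eq. (32): frequency in Hz of the TFD maximum at each time."""
    return [max(range(len(col)), key=col.__getitem__) * fs / win_len for col in columns]


def integrate_phase(if_hz, dt):
    """Second part of Eq. (32): running sum approximating the phase integral (radians)."""
    phase, out = 0.0, []
    for f in if_hz:
        out.append(phase)
        phase += 2 * math.pi * f * dt
    return out


def true_if(n, n_samples=1024):
    """Instantaneous frequency of the test chirp: 1 kHz rising linearly to 3 kHz."""
    return 1000.0 + 2000.0 * n / n_samples


random.seed(0)
fs, n_samples, win_len, hop = 8000.0, 1024, 64, 16
phase, x = 0.0, []
for n in range(n_samples):
    x.append(math.cos(phase) + random.gauss(0.0, 0.3))  # chirp plus white noise
    phase += 2 * math.pi * true_if(n) / fs

centres, tfd = spectrogram(x, win_len, hop)
if_est = peak_pick_if(tfd, fs, win_len)
phase_est = integrate_phase(if_est, hop / fs)
worst = max(abs(f - true_if(c)) for f, c in zip(if_est, centres))
print(worst)
```

The IF estimate here is quantized to the bin spacing (125 Hz for these settings), so a practical implementation would refine the peak location before integrating it into a phase.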



After estimating the FM part of the phase, we obtain estimates of the in-phase and quadrature amplitudes via time-varying coherent demodulation,

$$\hat{A}_I(t) = \int x(\tau) \cos\{\hat{\varphi}_F(\tau)\}\, h_{lp}(t, \tau)\, d\tau, \qquad \hat{A}_Q(t) = \int x(\tau) \sin\{\hat{\varphi}_F(\tau)\}\, h_{lp}(t, \tau)\, d\tau, \tag{33}$$

where $h_{lp}(t, \tau)$ is the impulse response of a linear time-varying low-pass filter. The corner frequency of the filter is set to 2-3 times the instantaneous bandwidth of the signal, which is obtained from the conditional spectral variance of the TFD of the observed signal,

$$\sigma^2_{\omega|t} = \int (\omega - \varphi'_F(t))^2\, P(\omega|t)\, d\omega. \tag{34}$$

In practice, the estimates are obtained iteratively to improve accuracy, as follows (the procedure is illustrated schematically in Figure 3).

Iterative Algorithm

1. Initialization:
   (a) Estimate the instantaneous frequency (IF) $\hat{\varphi}'_F(t)$ and phase $\hat{\varphi}_F(t)$ of the noisy signal $x(t) = s(t) + \eta(t)$ per Eq. (32), using the iterative method of Çakrak and Loughlin [?] (we used a multi-window spectrogram for the TFD in the examples [?]).
   (b) Calculate the bandwidth of the signal from its Fourier spectrum: $\sigma_\omega^2 = \int (\omega - \langle\omega\rangle)^2\, |S_x(\omega)|^2\, d\omega$.
   (c) Design a (time-invariant) low-pass filter $h_{lp}(t)$ with bandwidth $\sigma_\omega$, and use this filter along with the estimated phase to obtain the in-phase and quadrature components of the amplitude, $\hat{A}_I(t)$ and $\hat{A}_Q(t)$, per Eq. (33).
   (d) Generate an estimate $\hat{s}(t)$ of $s(t)$ per Eq. (30) using the estimated phase and amplitudes.
2. Main loop: Generate a TFD of $\hat{s}(t)$, and estimate the IF and phase (as above) and the instantaneous bandwidth (IBW) per Eq. (34) from the TFD.
3. Design a time-varying low-pass filter $h_{lp}(t, \tau)$ with time-varying bandwidth equal to 2-3 times the estimated IBW.
4. Re-estimate $A_I(t)$ and $A_Q(t)$ per Eq. (33).
5. Repeat (back to step 2) until the error between two successive IBW estimates is smaller than a predetermined threshold.

Using the final estimates of the phase, the in-phase amplitude, and the quadrature amplitude, generate the final estimate $\hat{s}(t)$ of the noise-free signal $s(t)$ per Eq. (30).

$^7$ This assumption is often accurate, particularly for signals that are locally narrow-band.

7.3 Example

We demonstrate the method by applying it to a whale sound recorded in ambient ocean noise.$^8$ The original time series of the recording is shown on the left in Figure 4a, and a multi-window TFD [?] of this signal is shown on the right. (We point out that the noise in this signal was not simulated or added to a clean recording; the noise was genuine and was part of the original recording.) In Figure 4b, we show the reconstructed signal and its TFD, using a linear time-invariant filter to remove as much noise outside the signal band as possible. In Figure 4c, the reconstructed signal obtained by the proposed AM-FM time-varying coherent demodulation procedure is shown, along with its TFD. The difference between this reconstructed signal and the original signal is shown in Figure 4d. The reconstructed signal obtained via the AM-FM approach is virtually noise free.

To quantify the improvement in noise removal, we generated synthetic filtered white noise with power and spectral characteristics similar to those of the ocean noise (as determined from the residual signal in Figure 4d), and added this noise to the AM-FM reconstructed signal of Figure 4c. The SNR of this noisy signal was -2 dB (where we took the reconstructed signal to be the "noise-free" signal in the calculation of the SNR). We then processed this noisy signal as above, i.e., with an LTI filter (passband 2250-4375 Hz) and with the proposed TFD-based AM-FM demodulation method. The LTI filter improved the SNR by just over 3 dB (to an SNR of 1.1 dB), whereas the TFD-based AM-FM method improved the SNR by more than 15 dB (to an SNR of 13.2 dB).
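The demodulation and resynthesis steps of Section 7.2 can also be tried on synthetic data. The sketch below is a simplified stand-in, not the authors' implementation: it uses the true phase in place of a TFD-based estimate, a boxcar moving average in place of a designed low-pass filter, and a constant filter width rather than one driven by the estimated IBW. It also includes the factor 2 and the sign on the quadrature branch that make the demodulated amplitudes consistent with Eq. (30); these are assumptions added here, since Eq. (33) leaves the scaling implicit. All names and parameter values are hypothetical.

```python
import math
import random


def moving_average_tv(u, half_width):
    """Crude time-varying low-pass: a boxcar whose half-width may differ at each sample."""
    out = []
    for n in range(len(u)):
        h = half_width[n]
        lo, hi = max(0, n - h), min(len(u), n + h + 1)
        out.append(sum(u[lo:hi]) / (hi - lo))
    return out


def coherent_demod(x, phi_f, half_width):
    """Eq. (33), with scaling and sign chosen to match Eq. (30) (an assumption)."""
    a_i = moving_average_tv([2.0 * v * math.cos(p) for v, p in zip(x, phi_f)], half_width)
    a_q = moving_average_tv([-2.0 * v * math.sin(p) for v, p in zip(x, phi_f)], half_width)
    return a_i, a_q


def resynthesize(a_i, a_q, phi_f):
    """Real form of Eq. (30): s = A_I cos(phi_F) - A_Q sin(phi_F)."""
    return [i * math.cos(p) - q * math.sin(p) for i, q, p in zip(a_i, a_q, phi_f)]


def snr_db(clean, estimate):
    """SNR in dB of an estimate relative to the clean reference."""
    err = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(sum(c * c for c in clean) / err)


random.seed(1)
fs, n_samples = 8000.0, 2000
phi_f, phase = [], 0.0
for n in range(n_samples):  # linear chirp from 1 kHz to 3 kHz
    phi_f.append(phase)
    phase += 2 * math.pi * (1000.0 + 2000.0 * n / n_samples) / fs
amp = [1.0 + 0.3 * math.cos(2 * math.pi * 5.0 * n / fs) for n in range(n_samples)]  # slow AM
clean = [a * math.cos(p) for a, p in zip(amp, phi_f)]
noisy = [c + random.gauss(0.0, 0.7) for c in clean]

a_i, a_q = coherent_demod(noisy, phi_f, [40] * n_samples)
denoised = resynthesize(a_i, a_q, phi_f)
snr_in, snr_out = snr_db(clean, noisy), snr_db(clean, denoised)
print(round(snr_in, 1), round(snr_out, 1))
```

Because the low-pass filter acts on the demodulated amplitudes, its narrow passband effectively follows the chirp across the band, which no fixed LTI band-pass filter covering the full 1-3 kHz sweep could do. In the full algorithm the per-sample `half_width` would be set from the estimated IBW and the phase would come from the TFD.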

$^8$ Leon Cohen and Pat Loughlin thank Stephen Greineder, Phillip Ainsleigh, Tod Luginbuhl, and Paul Baggenstoss at the Naval Undersea Warfare Center, Newport, RI, for the whale data and interesting discussions.

Figure 4: Left: time series (amplitude versus time in seconds); right: TFD (frequency in Hz versus time in seconds). (a) Original whale sound recorded in ambient ocean noise, (b) sound recovered by LTI filtering, (c) sound recovered by the proposed AM-FM method, (d) residual signal (difference between the original signal and the AM-FM recovered signal).

References

[1] D.J. Nelson, "Estimation of FM Modulation of Multi-Component Signals from the Fourier Phase", in Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, vol. 6, pp. 3421-3424, 1998.

[2] D.J. Nelson and O.P. Kenny, "Invertible Time-Frequency Representations", in Proc. of the SPIE Adv. Sig. Proc. Conf., July, 1998.

[3] D.J. Nelson, "Invertible Time-Frequency Surfaces", in Proceedings of IEEE Conference on Time-Frequency and Time-Scale, Pittsburgh, October, 1998.

[4] D.J. Nelson and W. Wysocki, "Cross-spectral methods with an application to speech processing", in Proc. of the SPIE Adv. Sig. Proc. Conf., July, 1999.

[5] L.P. Yaroslavsky, M. Eden, Fundamentals of Digital Optics, Birkhäuser, Boston, 1996.

[6] V. Kober and G. Cristobal, "Fast recursive algorithms for the short-time discrete cosine transform," Electronics Letters, submitted.

[7] S.F. Boll, "Suppression of acoustic noise in speech using spectral subtraction", IEEE Trans. Acoust. Speech Signal Process. ASSP-27, pp. 113-120, 1979.

[8] J.S. Lim and A.V. Oppenheim, "Enhancement and bandwidth compression of noisy speech", Proc. IEEE 67, pp. 1586-1604, 1979.

[9] R.J. McAulay, M.L. Malpass, "Speech enhancement using a soft-decision noise suppression filter", IEEE Trans. Acoust. Speech Signal Process. ASSP-28, pp. 137-145, 1980.

[10] F. Çakrak and P. Loughlin, "Multiple-window nonlinear time-varying spectral analysis," IEEE Proc. ICASSP'98, vol. 4, pp. 2409-2412, May 1998.

[11] F. Çakrak and P. Loughlin, "Instantaneous frequency estimation of polynomial phase signals," Proc. IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, pp. 549-552, Oct. 1998.

[12] F. Çakrak and P. Loughlin, "A new multi-window time-frequency approach yielding accurate low-order conditional moments," Thirty Third Annual Asilomar Conference on Signals, Systems, and Computers, submitted, May 1999.

[13] L. Cohen, "Time-frequency distributions – A review," Proc. IEEE, vol. 77, no. 7, pp. 941-981, 1989.

[14] L. Cohen, Time-Frequency Analysis, Prentice-Hall, 1995.

[15] P. Loughlin and B. Tacer, "On the amplitude and frequency modulation decomposition of signals," J. Acoust. Soc. Am., vol. 100, no. 3, pp. 1594-1601, 1996.

[16] P. Loughlin and F. Çakrak, "Time-varying coherent AM-FM demodulation and denoising of acoustic signals," 138th mtg. Acoust. Soc. Am., Columbus, OH, Nov. 1999 (to appear).
