
A Novel Approach for Design of a Speech Enhancement System using NLMS Adaptive Filter and ZCR based Pattern Identification

Sivaranjan Goswami, Pinky Deka, Bijeet Bardoloi, Darathi Dutta
Dept. of Electronics and Communication Engineering, Don Bosco College of Engineering and Technology, Guwahati, India

Dipjyoti Sarma
Assistant Professor, Dept. of Electronics and Communication Engineering, Don Bosco College of Engineering and Technology, Guwahati, India

Abstract: Speech signals are often degraded by various types of noise at different stages of speech recording, processing and communication. One major source is background noise, which severely degrades the speech signal quality and decreases the listening comfort. Speech enhancement is the branch of digital speech processing in which the interfering noise is removed and the noise-free speech is estimated from the noisy speech signal. This work proposes a novel approach for the enhancement of a speech signal that has been highly degraded by background noise. The noisy speech signal is passed through two stages. In the first stage, an auto-trained NLMS adaptive filter is applied to reduce the noise level. The auto-trained adaptive filter designs itself automatically for the particular background, without any previous training for that background. The output is then passed through a ZCR based pattern identification stage for further enhancement of the speech signal. It is observed that the proposed system raises the overall output SNR of the signal to about 4 times the input SNR.

Keywords: voiced, unvoiced, zero-crossing rate (ZCR), noise, MSE, SNR, NLMS adaptive filter

I.

INTRODUCTION

With the increasing need for digital electronic devices such as mobile phones and digital audio/video recorders, efficient recording, processing and transmission of speech signals in the digital domain has become very important. During these processes, the speech signal is degraded by various types of noise at various steps. One major source of noise is the background noise, that is, the acoustic sound waves from other sources that enter the system along with the original signal at the source. It is very common that when a person communicates through a mobile phone from a noisy area, the listener at the other end also hears the background noise, and sometimes the background noise even masks the speech. Considerable research is going on all over the world to develop algorithms for speech enhancement. The purpose of such enhancement algorithms is to reduce background noise, improve speech quality or suppress channel or speaker interference [1].

A two-stage approach to speech enhancement is proposed in this work. The work mainly focuses on the enhancement of the speech signal by eliminating the interfering noise from the signal. In the first stage, an auto-trained NLMS adaptive filter is applied to reduce the noise level. The auto-trained adaptive filter designs itself automatically for the particular background, without any previous training for that background. In the second stage, a pattern identification approach is applied to distinguish the samples containing noise-mixed speech from those containing only noise components. The samples containing only noise components are then suppressed in order to further increase the listening comfort and the SNR.

Section II discusses the basics of the speech generation model and the types of excitation, followed by a brief theory of adaptive filters in the field of speech enhancement in section III. Section IV describes the experimental details and the algorithms for the first and the second stage. Section V provides a detailed discussion of the experimental results, and the conclusion is provided in section VI.

II.

SPEECH GENERATION MODEL AND TYPES OF EXCITATION

The human speech production system is a complex mechanical system. The air exhaled by the lungs is modulated by various hard and soft tissues, first by the glottal fold and then by the tissues of the vocal tract such as the tongue, lips, jaw and velum. In digital speech processing, this process is represented by a discrete-time model as shown in figure 1. The lungs and the glottal fold together form the Excitation Generator block. The vocal tract is modeled as a linear system, which is usually an all-zero digital FIR filter. The vocal tract parameters are the parameters of this digital filter.

Fig. 1. Block diagram of speech generation model

Depending upon the excitation signal u(n) in figure 1, speech signals can be of two types: voiced speech and unvoiced speech. For voiced speech, the excitation to the linear system is a quasi-periodic sequence of discrete (glottal) pulses. For unvoiced speech, the linear system is excited by a random number generator that produces a discrete-time noise signal with a flat spectrum [2]. There are four more types of excitation: mixed, plosive, whisper and silence. Silence is the pause between speech when there is no excitation. The remaining three are combinations of voiced, unvoiced and silent [1].

III.

ADAPTIVE FILTER FOR SPEECH ENHANCEMENT

An adaptive filter is a computational device that attempts to model the relationship between two signals in real time in an iterative manner [3]. It can be used to establish the relationship between a speech signal corrupted by some noise and a clean version of the same speech signal. The established relationship can then be used for estimating the clean speech from the noisy speech even when the clean speech is unknown. The process of establishing the relationship is called training. The block diagram of the set-up for training an adaptive filter is shown in figure 2.

Fig. 2. Block Diagram of Adaptive Filter for Speech Enhancement

An adaptive filter has two parts, namely, a digital filter and an update algorithm. Initially, the parameters of the digital filter are set to zero. The input signal x(n) is the clean signal s(n) corrupted by some noise n0(n) that needs to be filtered out. y(n) is the output of the filter. It is subtracted from s(n) to calculate the error e(n). In every step the filter parameter vector W is updated. The process is repeated until the system converges, that is, until the error is minimized. The function of the adaptive algorithm is to update W iteratively to minimize e(n).

A. Adaptive Filter Algorithms

The most commonly used adaptive filtering algorithms for speech processing are [4]:
• Least Mean Square (LMS) algorithm
• Recursive Least Square (RLS) algorithm
• Normalized Least Mean Square (NLMS) algorithm

A comparative discussion of these three algorithms is given by Allam Mousa et al. in [4]. That work shows that NLMS adaptive filters perform best in terms of MSE minimization, convergence time and stability. Therefore the NLMS algorithm is used for the experiments of this work.

B. Normalized Least Mean Square (NLMS) algorithm

It is a modified version of the LMS algorithm. The computational complexity is slightly increased. The update equation is given by equation 1:

$$W(n+1) = W(n) + \frac{\mu \, e(n) \, \bar{x}(n)}{\|\bar{x}(n)\|^{2}} \tag{1}$$

where $\mu$ is the step size, $e(n)$ is the error and $\|\bar{x}(n)\|^{2}$ is the sum of the squares of all the elements of the input vector $\bar{x}(n)$ at instant n.
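The update of equation 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `nlms_train`, the small constant `eps` (added only to avoid division by zero) and the demo signal are hypothetical, and the step size in the demo is chosen for a quick illustration rather than taken from the paper.

```python
import random


def nlms_train(x, d, order, mu, eps=1e-8):
    """Adapt an all-zero FIR filter W so that filtering x approximates d.

    x: noisy input samples, d: desired (clean) samples.
    eps is an assumption of this sketch: a small constant that avoids
    division by zero when the input vector is all zeros.
    Returns (weights, errors).
    """
    w = [0.0] * order            # filter parameters are initialized with zeros
    buf = [0.0] * order          # input vector: x(n), x(n-1), ..., x(n-order+1)
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))   # filter output y(n)
        e = dn - y                                   # error e(n)
        norm = sum(xi * xi for xi in buf) + eps      # sum of squares of the input vector
        # Equation 1: W(n+1) = W(n) + mu * e(n) * x(n) / ||x(n)||^2
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        errors.append(e)
    return w, errors


if __name__ == "__main__":
    # Hypothetical demo: recover a known 3-tap FIR response from input/output data.
    random.seed(0)
    h = [0.5, -0.3, 0.2]
    x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
    d = [sum(h[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(len(x))]
    w, errors = nlms_train(x, d, order=3, mu=0.5)
    print([round(v, 3) for v in w])
```

The normalization by the input energy is what distinguishes NLMS from plain LMS: the effective step shrinks when the input vector is large, so the same value of µ remains usable across loud and quiet passages.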

IV.

EXPERIMENTAL DETAILS

A. First stage of the speech enhancement process: Auto Update NLMS Algorithm

In this stage, an auto-update algorithm for the NLMS adaptive filter is proposed. It involves a training session of the adaptive filter before the processing starts. A clean reference speech signal s0(n) of the same speaker must already be available for training the adaptive filter. s0(n) need not contain the same words as the speech to be enhanced; however, it is preferred that it contains all the sounds (phonemes) that are present in the original noisy speech, so that it covers all the frequency components of the speaker's voice. Let x(n) be the input noisy signal, so that

$$x(n) = s(n) + n(n) \tag{2}$$

where s(n) is the clean speech to be obtained and n(n) is the interfering background noise. The first 0.5 second is assumed to contain only the noise component, which can easily be ensured during the recording of the signal, that is,

$$x(n) = n(n) \quad \forall \, \{ n : 0 \le n \le F_s/2 \} \tag{3}$$

where Fs is the sampling rate. Now, let L0 be the length of the reference speech s0(n). A noisy reference speech x0(n) is generated using equation 4:

$$x_0(n) = s_0(n) + n\left(\mathrm{mod}(n, F_s/2)\right) \tag{4}$$

where

$$\mathrm{mod}(a, b) = a - b \times (k - 1)$$

and k is an integer chosen so that the result lies between 0 and b. In other words, the 0.5 second noise-only segment is repeated periodically over the whole length L0 of the reference speech.

Now we have all the necessary signals for training an adaptive filter: x0(n) is the noisy version of the clean speech s0(n). The filter parameters are initialized with zeros. The training process then finds the filter that passes the frequency components of the speech and stops the frequencies of the noise source that is corrupting the speech. The filter coefficients are updated using the NLMS algorithm as in equation 1 to find the final parameter vector W. The entire input signal x(n) is then filtered with this filter to estimate the clean version of x(n), which is actually unknown. The steps of this stage can be summarized as given below:

1) Create the noise mixed signal x0(n) as shown in equation 4.
2) Initialize all the parameters of the all-zero FIR filter W with zeros.
3) Apply the NLMS algorithm to update the filter parameters using equation 1 until the filter converges.
4) Filter the original input speech signal x(n) with W.
5) Measure the performance and proceed to the second stage.

B. Second stage of the speech enhancement process: Pattern Recognition Based Approach to Suppress Noise

In this stage a pattern recognition based approach is applied for identification of the Voiced, Unvoiced and Silent parts of speech. The decision is taken based on the average ZCR and the average magnitude computed over a short time frame of 20 milliseconds (ms) [5].

1) Calculation of Zero-Crossing Rate (ZCR)

The ZCR of a signal over an interval ∆t is found using equation 5:

$$\mathrm{ZCR}_{average} = \frac{N}{2 \, \Delta t} \tag{5}$$

where N is the number of times the polarity of the signal changes during the time frame.

2) Decision of Voiced, Unvoiced and Silent

For every time frame, the average ZCR is calculated and the power corresponding to the frequency is calculated using the Fourier Transform. The result is then subjected to the threshold conditions given in equations 6 and 7:

Unvoiced:

$$f_N \ge a \ \text{and} \ |x| \le b \tag{6}$$

Voiced:

$$f_N \le c \ \text{and} \ |x| \ge d \tag{7}$$

where the subscript N denotes the normalized value and a, b, c, d are user-defined threshold values.

3) Proposed Algorithm

The threshold conditions of equations 6 and 7 cannot be applied directly to speech that is corrupted by background noise, because the silent frames would also be marked as voiced or unvoiced due to the background noise. The proposed algorithm is based on the following three assumptions:

1) The first 1 second of the signal contains only background noise.
2) The frequency of the noise source is different from the vocal tract frequency or ZCR.
3) The human voice has dominating amplitude, since the mouth is closer to the microphone than the noise source.

Under the third assumption we can conclude that:

• The ZCR of the voiced speech will not be influenced by noise. The noise will only add some local maxima and minima to the signal.
• No noise pulse will be marked as voiced, since its power level is low.
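The ZCR of equation 5 and the threshold conditions of equations 6 and 7 can be illustrated with a short Python sketch. It covers only the direct threshold test, not the noise-reference step of the proposed algorithm. Several details are assumptions made for illustration: the normalization f_N = ZCR_average / (Fs/2), the use of the average magnitude of the frame for |x|, the fallback label "silent" for frames that satisfy neither condition, and the threshold values used in the demo.

```python
import math
import random


def zero_crossings(frame):
    """Count N, the number of polarity changes between consecutive samples."""
    return sum(1 for p, q in zip(frame, frame[1:]) if (p >= 0) != (q >= 0))


def average_zcr(frame, fs):
    """Equation 5: ZCR_average = N / (2 * delta_t), with delta_t = len(frame) / fs."""
    delta_t = len(frame) / fs
    return zero_crossings(frame) / (2.0 * delta_t)


def average_magnitude(frame):
    """Mean absolute value of the samples in the frame."""
    return sum(abs(v) for v in frame) / len(frame)


def classify_frame(frame, fs, a, b, c, d):
    """Apply the threshold conditions of equations 6 and 7 to one frame.

    Assumptions of this sketch: f_N is the ZCR normalized by fs / 2, and a
    frame that meets neither condition is labeled "silent".
    """
    f_n = average_zcr(frame, fs) / (fs / 2.0)
    mag = average_magnitude(frame)
    if f_n >= a and mag <= b:       # equation 6
        return "unvoiced"
    if f_n <= c and mag >= d:       # equation 7
        return "voiced"
    return "silent"


if __name__ == "__main__":
    random.seed(0)
    fs = 8000
    n = int(0.020 * fs)             # one 20 ms frame
    tone = [0.8 * math.sin(2 * math.pi * 200 * k / fs + 0.3) for k in range(n)]
    hiss = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for name, frame in (("tone", tone), ("hiss", hiss), ("zeros", [0.0] * n)):
        # Hypothetical thresholds a, b, c, d chosen only for this demo.
        print(name, classify_frame(frame, fs, a=0.3, b=0.1, c=0.1, d=0.2))
```

For a pure tone, equation 5 returns approximately the tone frequency in Hz, since a sinusoid changes polarity twice per period. This is why a low ZCR together with a high magnitude points to voiced speech, while a high ZCR with a low magnitude points to unvoiced speech.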

The unvoiced and silent frames are obtained by using the noise reference of the first one second of the signal, where the ZCRs are given by equation 5. The samples containing only background noise are then suppressed. All the remaining samples are enhanced speech samples, which are filtered for further enhancement and to increase the comfort of listening.

V.

EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section the results of the experimental work are discussed in detail. The results of each stage of the experiment are provided separately.

A. Evaluation of First Stage

The proposed algorithm has been tested on speech corrupted with real background noise, obtained by mixing a speech signal and a noise signal. The first algorithm is tested for two situations:

1) Keeping step size fixed and varying order
2) Keeping order fixed and varying step size

Fig. 3. Variation of MSE with Filter Order when step size is kept fixed

Figure 3 shows the variation of the MSE of the NLMS output with varying filter order and step size. It is observed that the MSE first decreases with increasing order, reaches a minimum value and then increases. However, the MSE falls below that of the input noisy signal over the widest range of orders when the step size is small. The details are shown in table 1. The MSE of the input noisy signal is mse(x) = 0.0054. From the table it can be concluded that the MSE is minimum for µ=0.0010 and order=50.

Table 1 Variation of MSE of NLMS output with increasing filter order at different step sizes (µ)

Order    µ=0.0010   µ=0.0015   µ=0.0022   µ=0.0032
35       0.0054     0.0065     0.0082     0.0104
40       0.0039     0.0046     0.0061     0.0080
45       0.0037     0.0041     0.0048     0.0062
50       0.0036     0.0040     0.0047     0.0060
55       0.0041     0.0041     0.0044     0.0049
60       0.0041     0.0042     0.0044     0.0049
65       0.0053     0.0056     0.0060     0.0066

Fig. 4. Variation of SNR (in dB) with Filter Order when step size is kept fixed

Figure 4 shows the variation of the SNR with the order and step size of the NLMS adaptive filter. It is observed that the SNR of the filter output is much higher than the input SNR for all orders and step sizes. The SNR of the input noisy speech is 6.493 dB. At the minimum-MSE point discussed above, the SNR is 16.3192 dB, which exceeds the input SNR by about 9.8 dB (around 1.5 times the input value).
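The first-stage procedure of section IV-A and the MSE and SNR measurements used in this evaluation can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code behind the reported figures: all function names are hypothetical, the MSE and SNR formulas are the usual textbook definitions (the exact definitions used in the experiments are not stated), and the demo uses synthetic sinusoids with white noise and an illustrative order and step size rather than the recorded speech and the values of table 1.

```python
import math
import random


def make_noisy_reference(s0, noise_ref):
    """Equation 4: add the noise-only segment, repeated periodically, to s0."""
    period = len(noise_ref)
    return [s0[n] + noise_ref[n % period] for n in range(len(s0))]


def nlms_train(x, d, order, mu, eps=1e-8):
    """NLMS update of equation 1; eps (an assumption) avoids division by zero."""
    w = [0.0] * order
    buf = [0.0] * order
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        e = dn - sum(wi * xi for wi, xi in zip(w, buf))
        norm = sum(xi * xi for xi in buf) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
    return w


def fir_filter(w, x):
    """Filter x with the trained all-zero FIR filter w."""
    buf = [0.0] * len(w)
    out = []
    for xn in x:
        buf = [xn] + buf[:-1]
        out.append(sum(wi * xi for wi, xi in zip(w, buf)))
    return out


def mse(estimate, clean):
    """Mean squared error between an estimate and the clean signal."""
    return sum((p - q) ** 2 for p, q in zip(estimate, clean)) / len(clean)


def snr_db(estimate, clean):
    """SNR in dB: clean signal energy over the energy of the residual."""
    signal_energy = sum(v * v for v in clean)
    residual_energy = sum((p - q) ** 2 for p, q in zip(estimate, clean))
    return 10.0 * math.log10(signal_energy / residual_energy)


def first_stage(x, s0, fs, order, mu):
    noise_ref = x[: fs // 2]                     # first 0.5 s holds only noise (eq. 3)
    x0 = make_noisy_reference(s0, noise_ref)     # step 1
    w = nlms_train(x0, s0, order, mu)            # steps 2 and 3
    return fir_filter(w, x)                      # step 4


if __name__ == "__main__":
    random.seed(1)
    fs = 1000
    # Synthetic stand-ins: s0 is the clean reference, s the unknown clean speech.
    s0 = [math.sin(2 * math.pi * 5 * n / fs) for n in range(4000)]
    s = [0.0] * (fs // 2) + [math.sin(2 * math.pi * 7 * n / fs) for n in range(2500)]
    x = [v + random.gauss(0.0, 0.3) for v in s]
    y = first_stage(x, s0, fs, order=8, mu=0.05)
    print("input  SNR (dB):", round(snr_db(x, s), 2), " MSE:", round(mse(x, s), 4))
    print("output SNR (dB):", round(snr_db(y, s), 2), " MSE:", round(mse(y, s), 4))
```

In a real experiment the clean signal s is not available at run time. It is used here only to measure performance, as in step 5 of the first stage.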

B. Evaluation of Second Stage

The enhanced speech signal obtained from the auto-update NLMS adaptive filter is now fed to the second stage. In the second stage a pattern identification approach is applied to distinguish the samples containing noise-mixed speech from those containing only noise components. The algorithm of the second stage is tested by comparing its results with those obtained by applying the threshold conditions of equations 6 and 7. The results are shown in Table 2.

TABLE 2 PERCENTAGE OF SILENT FRAMES MARKED UNVOICED

Background       1st Algorithm   2nd Algorithm
No Noise         0%              0%
AWGN             80%             30%
Natural Noise    58%             23%

It is observed that the implementation of the second stage increases the accuracy of the system. When this stage is applied after stage 1, the listening comfort of the speech increases. Since noise-only frames are suppressed, the SNR of the signal becomes much higher (28.5282 dB). Thus the system increases the output SNR by around 22 dB over the input SNR. Figure 5 shows the waveform after each stage.

Fig. 5. Stage 1 and Stage 2 output

C. Discussion

The proposed approach is a very simple approach for speech enhancement. It retains the benefits and simplicity of adaptive filter approaches, which are very popular in the field of speech enhancement, and at the same time reduces the need for training an adaptive filter prior to its application in a particular background. It does not require any library of background noises as in [5]. The approach is suitable for enhancement of both real-time speech signals and offline recorded speech, since it does not require any spatial information, unlike multiple channel adaptive filters that are specially designed for real-time application, such as [6]. However, the speed of the filter may be limited by its order and by the speed of the hardware platform used.

VI.

CONCLUSION

It is observed that the two stages of the speech enhancement algorithm reduce the MSE and increase the SNR and the listening comfort of the noisy speech signal to a considerable extent. Another advantage of the system is that, as the data containing only small background noise are converted to pure silence after the second stage, the memory required to store the speech in digital form also decreases. Thus the system provides a better solution for real-time application in different digital electronic devices, for both speech enhancement and optimization of the memory requirement.

REFERENCES

[1] John R. Deller, Jr., John H. L. Hansen and John G. Proakis, "Discrete-Time Processing of Speech Signals", John Wiley and Sons, Inc., New York, 2000
[2] L. R. Rabiner and R. W. Schafer, "Introduction to Digital Speech Processing", Now Publishers Inc., USA, Volume 1 Issue 12, 2007
[3] S. C. Douglas, Markus Rupp, "Convergence Issues in the LMS Adaptive Filter", CRC Press LLC, available at: http://www.dspbook.narod.ru/DSPMW/19.PDF, 2000
[4] A. Mousa, M. Qados, S. Bader, "Speech Signal Enhancement Using Adaptive Noise Cancellation Techniques", Canadian Journal on Electrical and Electronics Engineering, Vol. 3, No. 7, 2012
[5] C. D. Sigg, T. Dikk and J. M. Buhmann, "Speech Enhancement Using Generative Dictionary Learning", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 6, pp. 1698-1712, 2012
[6] K. Li, Y. Guo, Q. Fu and Y. Yan, "A Two Microphone-Based Approach for Speech Enhancement in Adverse Environments", IEEE International Conference on Consumer Electronics (ICCE), 2012
[7] S. Goswami, P. Deka, B. Bardoloi, D. Dutta, D. Sarma, "ZCR Based Identification of Voiced, Unvoiced and Silent Parts of Speech Signal in Presence of Background Noise", Proceedings, IC3A 2013, pp. 134-138, 2013
[8] S. A. Hadei, M. Lotfizad, "A Family of Adaptive Filter Algorithms in Noise Cancellation for Speech Enhancement", International Journal of Computer and Electrical Engineering, Vol. 2, No. 2, 2010
[9] R. G. Bachu, S. Kopparthi, B. Adapa, B. D. Barkana, "Separation of Voiced and Unvoiced using Zero crossing rate and Energy of the Speech Signal", Electrical Engineering Department, School of Engineering, University of Bridgeport, available at: http://audio-fingerprint.googlecode.com/svnhistory/r62/trunk/referencias/ASEE12008 0044 paper.pdf
[10] D. Arifianto, T. Kobayashi, "Voiced/unvoiced determination of speech signal in noisy environment using harmonicity measure based on instantaneous frequency", IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 743-746, 2005
[11] Identification of Voice/Unvoiced/Silence regions of Speech, available at: http://iitg.vlab.co.in/?sub=59&brch=164&sim=613&cnt=2
