SPEECH SIGNAL ENHANCEMENT TECHNIQUES FOR MICROPHONE ARRAYS A DISSERTATION
Submitted to the University of Mumbai in partial fulfillment of the requirements for the degree of
MASTER OF ENGINEERING In INSTRUMENTATION AND CONTROL
By DEEPAK RAMNIK GALA
Under the guidance of
Prof. V. M. MISRA
DEPARTMENT OF INSTRUMENTATION ENGINEERING VIVEKANAND EDUCATION SOCIETY'S INSTITUTE OF TECHNOLOGY SINDHI SOCIETY, CHEMBUR (W) MUMBAI-400 071
UNIVERSITY OF MUMBAI (2010)
VIVEKANAND EDUCATION SOCIETY'S INSTITUTE OF TECHNOLOGY SINDHI SOCIETY, CHEMBUR (W) MUMBAI-400 071
CERTIFICATE
This is to certify that the dissertation entitled “Speech Signal Enhancement Techniques For Microphone Arrays” has been completed successfully by Mr. Deepak Ramnik Gala for the award of the degree of Master of Engineering in Instrumentation and Control from the University of Mumbai.
Prof. V. M. Misra
Guide, Department of Instrumentation and Control
PRINCIPAL
DISSERTATION APPROVAL SHEET
This is to certify that the thesis entitled “Speech Signal Enhancement Techniques for Microphone Arrays” by Mr. Deepak Ramnik Gala is approved for the degree of Master of Engineering in Instrumentation and Control from the University of Mumbai.
Internal Examiner
External Examiner
Head of the Department
Principal
Seal of the Institute
Acknowledgements
I would like to extend my sincere appreciation to my guide, Prof. V. M. Misra, who made valuable contributions to the development of my project through its various phases and without whose support it would not have been possible for me to complete the project work.
I would like to thank Principal Dr. Mrs. Jayalaxmi Nair and H.O.D. Dr. Mrs. Shanta Sondur for the contributions they have made to the project in its various stages of development.
I am especially indebted to Dr. P. P. Vaidya, Prof. J. A. Gokhale, Dr. D. C. Sahni, and Mr. Chandane for their cooperative and helpful attitude.
Last but not least, I am obliged to all my friends and family members, and to the laboratory and library staff, for their encouragement and cooperation in the successful completion of this thesis.
SPEECH SIGNAL ENHANCEMENT TECHNIQUES FOR MICROPHONE ARRAYS
Abstract
In all speech communication settings, the quality and intelligibility of speech are of utmost importance for ease and accuracy of information exchange. The speech processing systems used to communicate or store speech are usually designed for a noise-free environment, but in a real-world environment the presence of background interference, in the form of additive background and channel noise, drastically degrades the performance of these systems, causing inaccurate information exchange and listener fatigue.
The Spectral Subtraction technique can be used to reduce stationary noise, but non-stationary noise still passes through it. Spectral Subtraction also introduces a musical noise which is very annoying to human ears. Beamforming is another possible method of speech enhancement, and the musical noise of Spectral Subtraction can be reduced by Beamforming. Beamforming by itself, however, does not appear to provide enough improvement, and its performance becomes worse if the noise comes from many directions or the speech has strong reverberation.
Therefore, a system has been designed combining the Spectral Subtraction technique followed by the Beamforming technique, reducing stationary as well as residual (musical) noise. Algorithms and associated software have been developed for 1) Spectral Subtraction, 2) the Beamforming technique, and 3) Spectral Subtraction followed by Beamforming. The last technique yields noise-free speech, free of musical noise and reverberation, making the speech intelligible and of good quality. Processing of the signal for Spectral Subtraction, Delay-and-Sum Beamforming and the combined technique was carried out individually for three different experiments (with 3, 6 and 10 microphones) and for four different cases: three different signals, and a fourth signal with Gaussian white noise. The SNR in each case was calculated.
From the results obtained by processing the signal with the newly developed combined algorithm, it can be concluded that 1) the SNR of the signal is improved considerably by Spectral Subtraction, while the Delay-and-Sum Beamforming that follows further enhances the signal and also reduces the residual noise, thereby improving the quality of speech, and 2) increasing the number of microphones in the microphone array results in further improvement in the quality of the speech.
Contents

Abstract
List of Figures
List of Tables

1 INTRODUCTION
1.1 General Description
1.2 Speech Signal Enhancement
1.3 Objective
1.4 Need for Combining Spectral Subtraction and Beamforming Techniques
1.5 Design Approach
1.6 Literature Survey
1.7 Applications

2 SPEECH ENHANCEMENT TECHNIQUES
2.1 Spectral Subtraction Technique
2.1.1 Power Spectral Subtraction
2.1.2 Magnitude Spectral Subtraction
2.1.3 Residual Noise
2.2 Beamforming Technique
2.3 Technique Combining Spectral Subtraction and Beamforming

3 THEORY
3.1 Spectral Subtraction
3.1.1 Segmenting the Data
3.1.2 Taking the Fourier Transform
3.1.3 Compute the Noise Spectrum Magnitude
3.1.4 Frame Averaging
3.1.5 Half-Wave Rectification
3.1.6 Residual Noise Reduction
3.1.7 Attenuate Signal during Non-Speech Activity
3.1.8 Signal Reconstruction
3.2 Beamforming
3.2.1 Delay Calculation
3.2.2 Shifting of Each Microphone Signal
3.2.3 Summation of Delayed Signals
3.3 Combined Technique
3.3.1 Spectral Subtraction
3.3.2 Musical Noise Reduction by Beamforming
3.4 Signal to Noise Ratio (SNR)

4 DEVELOPMENT OF ALGORITHM AND ASSOCIATED SOFTWARE
4.1 Algorithm for the Combined Technique
4.2 Algorithm Implementation
4.2.1 Input-Output Data Buffering and Windowing
4.2.2 Frequency Analysis
4.2.3 Magnitude Averaging
4.2.4 Bias Estimation
4.2.5 Bias Removal and Half-Wave Rectification
4.2.6 Residual Noise Reduction
4.2.7 Additional Noise Suppression During Non-Speech Activity
4.2.8 Synthesis
4.2.9 Delay Calculation
4.2.10 Shifting of Each Signal
4.2.11 Summation of Delayed Signals

5 SOFTWARE TESTING AND RESULTS
5.1 Standard and Gaussian Noise Test Sample
5.2 Test Results
5.3 Details of Results for Test Sample 1
5.3.1 Test Results for Test Sample 1 with 3-Microphone Array
5.3.2 Simulation Results for Test Sample 1 with 3-Microphone Array
5.3.3 Test Results for Test Sample 1 with 6-Microphone Array
5.3.4 Simulation Results for Test Sample 1 with 6-Microphone Array
5.3.5 Test Results for Test Sample 1 with 10-Microphone Array
5.3.6 Simulation Results for Test Sample 1 with 10-Microphone Array
5.4 Details of Results for Test Sample 2
5.4.1 Test Results for Test Sample 2 with 3-Microphone Array
5.4.2 Simulation Results for Test Sample 2 with 3-Microphone Array
5.4.3 Test Results for Test Sample 2 with 6-Microphone Array
5.4.4 Simulation Results for Test Sample 2 with 6-Microphone Array
5.4.5 Test Results for Test Sample 2 with 10-Microphone Array
5.4.6 Simulation Results for Test Sample 2 with 10-Microphone Array
5.5 Details of Results for Test Sample 2 with Positions Changed
5.5.1 Test Results for Test Sample 2 (Changed Positions) with 3-Microphone Array
5.5.2 Simulation Results for Test Sample 2 (Changed Positions) with 3-Microphone Array
5.5.3 Test Results for Test Sample 2 (Changed Positions) with 6-Microphone Array
5.5.4 Simulation Results for Test Sample 2 (Changed Positions) with 6-Microphone Array
5.5.5 Test Results for Test Sample 2 (Changed Positions) with 10-Microphone Array
5.5.6 Simulation Results for Test Sample 2 (Changed Positions) with 10-Microphone Array
5.6 Details of Results for Test Sample 3 with Gaussian Noise
5.6.1 Test Results for Test Sample 3 with Gaussian Noise for 3-Microphone Array
5.6.2 Simulation Results for Test Sample 3 with Gaussian Noise for 3-Microphone Array
5.6.3 Test Results for Test Sample 3 with Gaussian Noise for 6-Microphone Array
5.6.4 Simulation Results for Test Sample 3 with Gaussian Noise for 6-Microphone Array
5.6.5 Test Results for Test Sample 3 with Gaussian Noise for 10-Microphone Array
5.6.6 Simulation Results for Test Sample 3 with Gaussian Noise for 10-Microphone Array

6 CONCLUSIONS AND FUTURE SCOPE
6.1 Conclusion
6.2 Future Scope

Appendix
References
Publications
LIST OF FIGURES

Figure 1.1 Block diagram of the Combined Technique
Fig. 2.1 Block diagram of Spectral Subtraction
Fig. 2.2 Block diagram of delay and sum Beamforming
Fig. 3.1 Algorithm for Spectral Subtraction
Fig. 3.2 Algorithm for Beamforming
Figure 3.3 A Uniform Linear Array with a source in the near-field
Figure 4.1 Algorithm for Combined Technique
Figure 4.2 Data Segmentation and Advance
Figure 5.1 Test Sample 1
Figure 5.2 Test Sample 2
Figure 5.3 Test Sample 3
Figure 5.4 Added Gaussian White Noise
Figure 5.5 Position of 3-Microphone Array with respect to source position
Figure 5.6 Enhanced Signal for 3-Microphone Array
Figure 5.7 Position of 6-Microphone Array with respect to source position
Figure 5.8 Enhanced Signal for Test Sample 1 with 6-Microphone Array
Figure 5.9 Position of 10-Microphone Array with respect to source position
Figure 5.10 Enhanced Signal for Test Sample 1 with 10-Microphone Array
Figure 5.11 SNR improvements for Test Sample 1 with each Technique
Figure 5.12 Position of 3-Microphone Array with respect to source position
Figure 5.13 Enhanced Signal for Test Sample 2 with 3-Microphone Array
Figure 5.14 Position of 6-Microphone Array with respect to source position
Figure 5.15 Enhanced Signal for Test Sample 2 with 6-Microphone Array
Figure 5.16 Position of 10-Microphone Array with respect to source position
Figure 5.17 Enhanced Signal for Test Sample 2 with 10-Microphone Array
Figure 5.18 SNR improvements for Test Sample 2 with each Technique
Figure 5.19 Position of 3-Microphone Array with respect to source position
Figure 5.20 Enhanced Signal for Test Sample 2 with 3-Microphone Array
Figure 5.21 Position of 6-Microphone Array with respect to source position
Figure 5.22 Enhanced Signal for Test Sample 2 with 6-Microphone Array
Figure 5.23 Position of 10-Microphone Array with respect to source position
Figure 5.24 Enhanced Signal for Test Sample 2 with 10-Microphone Array
Figure 5.26 Position of 3-Microphone Array with respect to source position
Figure 5.27 a) Test Sample 3 with added Gaussian Noise
Figure 5.27 b) Enhanced Signal for Test Sample 3 with Gaussian noise with 3-Microphone Array
Figure 5.28 Position of 6-Microphone Array with respect to source position
Figure 5.29 a) Test Sample 3 with added Gaussian Noise
Figure 5.29 b) Enhanced Signal for Test Sample 3 (Gaussian noise) with 6-Microphone Array
Figure 5.30 Position of 10-Microphone Array with respect to source position
Figure 5.31 a) Test Sample 3 with added Gaussian Noise
Figure 5.31 b) Enhanced Signal for Test Sample 3 with 10-Microphone Array
Figure 5.32 SNR improvements for Test Sample 3 with changed positions
LIST OF TABLES

Table 5.1 Results for Test Sample 1 with 3-Microphone Array
Table 5.2 Results for Test Sample 1 with 6-Microphone Array
Table 5.3 Results for Test Sample 1 with 10-Microphone Array
Table 5.4 Results for Test Sample 2 with 3-Microphone Array
Table 5.5 Results for Test Sample 2 with 6-Microphone Array
Table 5.6 Results for Test Sample 2 with 10-Microphone Array
Table 5.7 Results for Test Sample 2 (Changed positions) with 3-Microphone Array
Table 5.8 Results for Test Sample 2 (Changed positions) with 6-Microphone Array
Table 5.9 Results for Test Sample 2 (Changed positions) with 10-Microphone Array
Table 5.10 Results for Test Sample 3 with Gaussian noise with 3-Microphone Array
Table 5.11 Results for Test Sample 3 (Gaussian noise) with 6-Microphone Array
Table 5.12 Results for Test Sample 3 (Gaussian noise) with 10-Microphone Array
Chapter 1 INTRODUCTION

1.1 General Description
1.2 Speech Signal Enhancement
1.3 Objective
1.4 Need for Combining Spectral Subtraction and Beamforming
1.5 Design Approach
1.6 Literature Survey
1.7 Applications
Chapter 1 INTRODUCTION

1.1 General Description
We live in a noisy world! In all applications (telecommunications, hands-free communications, recording, human-machine interfaces, etc.) that require at least one microphone, the signal of interest is usually contaminated by noise and reverberation. As a result, the microphone signal has to be "cleaned" with digital signal processing tools before it is played out, transmitted, or stored. A single microphone picks up too much ambient, electronic, and room reverberation noise. Noise decreases intelligibility and interferes with speech recognition algorithms. Signal processing techniques have their own limitations for removing noise in single-microphone systems.
A microphone array is a set of closely positioned microphones. Microphone arrays achieve better directionality than a single microphone by taking advantage of the fact that an incoming acoustic wave arrives at each of the microphones at a slightly different time. Microphone arrays therefore provide new opportunities for noise reduction and speech enhancement, and recent research in microphone-array systems indicates the promise of various techniques in speech enhancement and hands-free communication applications.
With the increased maturity in speech and speaker processing technologies, and the prevalence of telecommunications, there is a need for effective speech acquisition devices. Microphone arrays have a distinct advantage as they enable hands-free acquisition of speech, and they can also provide information on the location of speakers. The most important objective of a microphone array is to provide a high-quality version of the desired speech signal. Microphone array speech enhancement is achieved by the Spectral Subtraction technique, the Beamforming technique, or a combination of both, to reduce the level of localized and ambient noise while minimizing distortion to speech from the desired direction.
1.2 Speech Signal Enhancement
Speech Signal Enhancement is the science of taking noisy or garbled speech signals and applying various digital signal processing techniques to make the signal more intelligible: to make a speech transmission clearer and easier to understand. Microphone array speech enhancement is achieved by the Spectral Subtraction or Beamforming technique, to reduce the level of localized and ambient noise while minimizing distortion to speech from the desired direction.
Spectral Subtraction is an effective method to reduce additive noise from a single microphone signal. It can perform considerably better than other techniques in enhancing low-SNR signals, and is simple to implement. However, a problem with Spectral Subtraction is that it introduces an unusual residual noise called musical noise, which is very annoying to human ears.
Beamforming is a temporal and spatial filtering process that concentrates the microphone array on sounds coming from only one particular direction. The goal of Beamforming is to sum multiple elements to achieve a narrower response in a desired direction. A Beamformer is a spatial filter that operates on the output of a microphone array in order to enhance the amplitude of a coherent wavefront relative to background noise and directional interference. However, the Beamforming technique has two major problems:
• It does not provide enough improvement to significantly improve speech recognition performance.
• Its performance becomes worse if the noise comes from many directions or the speech has strong reverberation.
1.3 Objective
The objective of the project is to design and develop a new technique combining both single-channel Spectral Subtraction and multichannel Beamforming techniques. This new technique is intended to be well suited to most applications needing speech enhancement.
1.4 Need for Combining Spectral Subtraction and Beamforming Techniques
These two techniques are complementary and can be combined to provide better performance than either method alone. One approach, for example, is to use Spectral Subtraction as a preprocessor in each channel before Beamforming. In this case, the summing process of Beamforming reduces the musical noise generated by Spectral Subtraction. Spectral Subtraction considerably reduces background noise while keeping the residual noise small. Stationary noise is suppressed by Spectral Subtraction, but non-stationary noise still passes through it. The Beamforming technique reduces non-stationary noise coming from directions other than that of the speech, and it also further reduces the residual noise. As a result, speech interfered with by different types of noise, or in different noise fields, is enhanced with similar SNR improvements by this method. Hence there is a need for a new technique combining the two techniques mentioned above and overcoming the deficiencies of both; this is precisely the objective of the present project.
1.5 Design Approach
The signal acquired by each microphone of the microphone array first undergoes single-channel Spectral Subtraction and is then processed using the multichannel Beamforming technique. An Array Toolbox is used for the simulation of the microphone array: placing a sound source at a particular position in a room and recording the resulting sound field with spatially distributed microphones is simulated using the toolbox.
Figure 1.1 Block diagram of the Combined Technique

Figure 1.1 shows the block diagram of the technique combining single-channel Spectral Subtraction and multichannel Beamforming: each microphone signal x1(k), x2(k), ..., xM(k) passes through its own Spectral Subtraction block, and the enhanced channels are combined by Beamforming to produce the output y(k).
1.6 Literature Survey
Single-channel speech enhancement has remained a popular area of research for the past three decades. The defining characteristic of the single-channel problem is that only the degraded speech is available. Many approaches to the single-channel ambient noise problem have been proposed. Some of the proposed algorithms include subspace methods [15, 16, 17, 18, 19], parameterization-synthesis methods [20, 21], statistical-model-based methods [22, 23, 24, 25, 26] and spectral subtractive methods [1, 27, 28, 29]. Although the single-channel problem has attracted a lot of interest in the past, some focus is shifting to the multi-channel problem due to the increasing prevalence of microphone arrays. Many methods based on multi-channel processing have already been proposed [30, 31, 32, 33, 34, 35].
1.7 Applications
Speech signal enhancement algorithms are applied in a diverse range of applications. Examples include communication systems such as mobile phones, speech coding and compression, human-machine interfaces such as automatic speech recognition, and medical devices such as hearing aids. In these applications, the speech enhancement algorithms are employed to 1) improve speech quality, 2) improve speech intelligibility, or 3) reduce listener fatigue.
Array microphones are used in numerous applications besides computer speech recognition. They are sometimes used in recording audio for films and TV shows, where extraneous noise is unacceptable, or to amplify the voices of performers in a play. Some hearing aids use an array of miniature microphones to help wearers focus on a single speaker in a noisy environment. You may also find array microphones in cars, where they are used for hands-free phones. Home automation enthusiasts have been known to use separate arrays in each room, so they can speak a command from anywhere in the house ("Turn kitchen lights on," "Activate force field") that will be carried out by equipment attached to a central computer.
Chapter 2 SPEECH ENHANCEMENT TECHNIQUES

2.1 Spectral Subtraction Technique
2.1.1 Power Spectral Subtraction
2.1.2 Magnitude Spectral Subtraction
2.1.3 Residual Noise
2.2 Beamforming Technique
2.3 Technique Combining Spectral Subtraction and Beamforming
Chapter 2 SPEECH ENHANCEMENT TECHNIQUES

With the increased maturity in speech and speaker processing technologies and the prevalence of telecommunications, there is a need for effective speech acquisition devices. Microphone arrays have the distinct advantage of hands-free acquisition of speech, and they can also provide information on the location of speakers. Spectral Subtraction and Beamforming techniques can be used for microphone array speech enhancement. When the noise process is stationary and speech activity can be detected, Spectral Subtraction is a direct way to enhance the noisy speech. It can perform considerably better than other techniques in enhancing low-SNR signals, and is simple to implement. The multi-channel Beamforming technique takes advantage of the availability of multiple signal inputs. It emphasizes signals coming from a particular direction while attenuating noise from other directions.
2.1 Spectral Subtraction Technique
One of the most popular methods of reducing the effect of background (additive) noise is Spectral Subtraction. It is one of the earliest and longest-standing approaches to noise compensation and speech enhancement, owing in part to its simplicity and versatility. The basic principle of the method is to subtract the magnitude spectrum of the noise from that of the noisy speech; the noise is assumed to be uncorrelated with, and additive to, the speech signal. Spectral Subtraction is implemented by estimating the noise spectrum from regions that are classified as "noise-only" and subtracting it from the rest of the noisy speech signal. It is assumed that the noise remains relatively constant prior to, and during, speech activity.
Fig. 2.1 Block diagram of Spectral Subtraction (windowing of the input x(k), speech/pause detection, FFT, noise spectrum estimation, subtraction of the noise spectrum, IFFT, and overlap-add to produce y(k))
Figure 2.1 shows the block diagram of single-channel Spectral Subtraction; x1(k) is the noisy speech signal acquired from the first microphone of the microphone array and y(k) is the enhanced signal. Speech, suitably low-pass filtered and digitized, is analyzed by windowing data from half-overlapped input data buffers. The magnitude spectra of the windowed data are calculated, and the spectral noise bias calculated during non-speech activity is subtracted off. Resulting negative amplitudes are then zeroed out. Secondary residual noise suppression is then applied. A time waveform is recalculated from the modified magnitude. This waveform is then overlap-added to the previous data to generate the output speech.
Suppose the speech signal s(k) is corrupted by background noise n(k); that is:

x(k) = s(k) + n(k)    (2.1)

Windowing the signal:

x_w(k) = s_w(k) + n_w(k)    (2.2)

Taking the Fourier transform of both sides:

X_w(e^jω) = S_w(e^jω) + N_w(e^jω)    (2.3)

where X_w(e^jω), S_w(e^jω) and N_w(e^jω) are the Fourier transforms of the windowed noisy, speech and noise signals respectively. To simplify the notation, the w subscript is dropped. Multiplying both sides by their complex conjugates we get:

|X(e^jω)|² = |S(e^jω)|² + |N(e^jω)|² + 2|S(e^jω)||N(e^jω)|cos(∆θ)    (2.4)

where ∆θ is the phase difference between speech and noise:

∆θ = ∠S(e^jω) - ∠N(e^jω)    (2.5)

Taking the expected value of both sides we get:

E{|X(e^jω)|²} = E{|S(e^jω)|²} + E{|N(e^jω)|²} + E{2|S(e^jω)||N(e^jω)|cos(∆θ)}
             = E{|S(e^jω)|²} + E{|N(e^jω)|²} + 2E{|S(e^jω)|}E{|N(e^jω)|}E{cos(∆θ)}    (2.6)
In deriving the last equation, two reasonable assumptions are made:
1. The noise and speech magnitude spectrum values are independent of each other.
2. The phases of the noise and speech are independent of each other and of their magnitudes.
2.1.1 Power Spectral Subtraction
In power spectral subtraction it is assumed that E{cos(∆θ)} = 0, hence:

E{|X(e^jω)|²} = E{|S(e^jω)|²} + E{|N(e^jω)|²}    (2.7)

|S(e^jω)|² = |X(e^jω)|² - E{|N(e^jω)|²}    (2.8)

The power spectrum of the noise is estimated during speech-inactive periods and subtracted from the power spectrum of the current frame, resulting in the power spectrum of the speech. Generally, spectral subtraction is suitable for stationary or very slowly varying noises (so that the statistics of the noise can be updated during speech-inactive periods).

2.1.2 Magnitude Spectral Subtraction
In magnitude spectral subtraction it is assumed that E{cos(∆θ)} = 1, hence:

E{|X(e^jω)|²} = E{|S(e^jω)|²} + E{|N(e^jω)|²} + 2E{|S(e^jω)|}E{|N(e^jω)|}
             = ( E{|S(e^jω)|} + E{|N(e^jω)|} )²

E{|X(e^jω)|} = E{|S(e^jω)|} + E{|N(e^jω)|}    (2.9)

The magnitude spectrum of the noise is averaged during speech-inactive periods and, again assuming that the variations of the noise spectrum are tolerable, the magnitude spectrum of the speech is estimated by subtracting the average noise spectrum from each segment:

|S(e^jω)| = |X(e^jω)| - E{|N(e^jω)|}    (2.10)
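As an illustration, the following MATLAB fragment applies magnitude spectral subtraction of the form (2.10) to one windowed frame. The frame x, the 256-point FFT length, and the averaged noise magnitude estimate mu (one value per frequency bin, obtained during a speech pause) are assumptions made for this sketch only.

% Magnitude spectral subtraction on a single windowed frame (sketch).
X     = fft(x, 256);                % spectrum of the noisy frame
mag   = abs(X) - mu;                % subtract the average noise magnitude (2.10)
mag   = max(mag, 0);                % zero out negative amplitudes
S     = mag .* exp(1j*angle(X));    % reuse the noisy phase
s_est = real(ifft(S, 256));         % enhanced time-domain frame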
2.1.3 Residual Noise
The conventional subtraction method substantially reduces the noise level in the noisy speech. However, a problem with Spectral Subtraction is that it introduces an unusual residual noise called musical noise, which is very annoying to human ears. As a result of the fluctuations of the noise spectrum (whether power or magnitude) around its mean (expected) value, there is always some difference between the actual noise and its mean value. Some of the noise remains in the spectrum when the actual noise exceeds its mean, and some of the speech spectrum is removed when the noise estimate exceeds the actual noise; the latter produces negative values in the spectrum. These negative values are prevented or set to a floor (sometimes zero) using different techniques. The overall effect leaves a noise in the output signal known as the residual noise. The narrow-band, relatively long-lived portion of the residual noise is sometimes referred to as musical noise.
To explain the nature of the musical noise, one must realize that peaks and valleys exist in the short-term power spectrum of white noise; their frequency locations for one frame are random, and they vary randomly in frequency and amplitude from frame to frame. When the smoothed estimate of the noise spectrum is subtracted from the actual noise spectrum, all spectral peaks are shifted down while the valleys (points lower than the estimate) are set to zero (minus infinity on a logarithmic scale). Thus, after subtraction, peaks remain in the noise spectrum. Of those remaining peaks, the wider ones are perceived as time-varying broadband noise. The narrower peaks, which are relatively large spectral excursions because of the deep valleys that define them, are perceived as time-varying tones, which we refer to as musical noise.
2.2 Beamforming Technique
Beamforming is a temporal and spatial filtering process using an array of sensors, which emphasizes signals from a particular direction while attenuating noise or interference from other directions. The goal of Beamforming is to sum multiple elements to achieve a narrower response in a desired direction. A Beamformer is a spatial filter that operates on the output of a microphone array in order to enhance the amplitude of a coherent wavefront relative to background noise and directional interference. Using Beamforming, a group of microphones can be calibrated to predominantly receive signals from a chosen angular direction.
The simplest type of Beamforming uses the "delay and sum" concept. Each microphone's signal is delayed by an amount proportional to the distance between a known target and that microphone. All of these delayed signals are then added together, resulting in a large signal component. As long as the noise does not come from exactly the same position as the desired signal, the noise signals will not be coherent and thus will not add up. The total noise power remains approximately the same as for one microphone, while the total signal power is multiplied by the number of microphones in the array.
Fig. 2.2 Block diagram of delay and sum Beamforming
In delay-and-sum Beamforming, delays are inserted after each microphone to compensate for the differences in the arrival times of the speech signal at each microphone (Figure 2.2). The time-aligned signals at the outputs of the delays are then summed together. This has the effect of reinforcing the desired speech signal, while the unwanted off-axis noise signals combine in a more unpredictable fashion. The signal-to-noise ratio (SNR) of the total signal is greater than (or at worst, equal to) that of any individual microphone's signal. For the m-th channel of the system we have:

y_m(k) = x(k - τ_m) + n_m(k)    (2.11)

where x(k) is the desired signal, τ_m the delay applied to the input signal, n_m(k) the noise present in the channel, and y_m(k) the available input of this channel. The overall output of the multi-sensor system is obtained by adding all contributions, with adequate compensating delays in each of them, giving:

x̂(k) = Σ_{m=1}^{M} y_m(k + τ_m)    (2.12)

This delay-and-sum Beamforming process is a very robust scheme: delay estimation errors reduce the SNR gain of the enhancement process, but induce little distortion.
2.3 Technique Combining Spectral Subtraction and Beamforming
Spectral Subtraction is an effective method to reduce stationary additive noise from a single microphone signal. It can perform considerably better than other techniques in enhancing low-SNR signals, and is simple to implement. However, it has a major drawback: it introduces an unusual residual noise called musical noise, which is very annoying to human ears [1].
Beamforming is another possible method of speech enhancement. Beamforming by itself, however, does not appear to provide enough improvement. Further, the performance of Beamforming becomes worse if the noise comes from many directions or the speech has strong reverberation [13, 14]. Even though both Spectral Subtraction and Beamforming can enhance speech, it is not advisable to apply the single-channel algorithm independently to the microphone array signals, as these signals are strongly correlated with each other [4].
A new method combining the advantages of Beamforming and Spectral Subtraction has been designed and developed. Spectral Subtraction reduces the stationary additive noise in each microphone signal, and Beamforming follows to enhance the speech and reduce the residual noise. Even though non-stationary noise passes through Spectral Subtraction, the Beamformer reduces it if the noise direction is different from the speech direction.
Chapter 3 THEORY

3.1 Spectral Subtraction
3.1.1 Segmenting the Data
3.1.2 Taking the Fourier Transform
3.1.3 Compute the Noise Spectrum Magnitude
3.1.4 Frame Averaging
3.1.5 Half-Wave Rectification
3.1.6 Residual Noise Reduction
3.1.7 Attenuate Signal during Non-Speech Activity
3.1.8 Signal Reconstruction
3.2 Beamforming
3.2.1 Delay Calculation
3.2.2 Shifting of Each Microphone Signal
3.2.3 Summation of Delayed Signals
3.3 Combined Technique
3.3.1 Spectral Subtraction
3.3.2 Musical Noise Reduction by Beamforming
3.4 Signal to Noise Ratio (SNR)
Chapter 3 THEORY

3.1 Spectral Subtraction
The method of Spectral Subtraction is widely used for single-channel speech enhancement when the speech signal is corrupted by additive noise. It is based on the manipulation of the magnitude of the noisy-speech spectrum. Spectral Subtraction is implemented by estimating the noise spectrum from regions that are classified as "noise only" and subtracting it from the rest of the noisy speech signal. It is assumed that the noise remains relatively constant prior to, and during, speech activity.

Fig. 3.1 Algorithm for Spectral Subtraction (the input x(k) is Hamming-windowed and transformed by the FFT; the noise spectrum magnitude is computed, the bias subtracted, the result half-wave rectified and residual noise reduced, the signal attenuated during non-speech activity as determined by a speech activity detector, and the IFFT produces the enhanced output)
3.1.1 Segmenting the Data
The data from the signal are segmented and windowed such that, if the sequence is separated into half-overlapped data buffers, the sum of these windowed sequences adds back up to the original sequence [1].
3.1.2 Taking the Fourier Transform
Let s(k) and n(k) represent a windowed speech signal and noise signal respectively. The sum of the two is then denoted by x(k):

x(k) = s(k) + n(k)    (3.1)

Taking the Fourier transform of both sides gives

X(e^jω) = S(e^jω) + N(e^jω)    (3.2)

where x(k) ↔ X(e^jω), with

X(e^jω) = Σ_{k=0}^{L-1} x(k) e^{-jωk}    (3.3)

x(k) = (1/2π) ∫_{-π}^{π} X(e^jω) e^{jωk} dω
3.1.3 Compute Noise Spectrum Magnitude
To obtain the estimate of the noise spectrum, the magnitude |N(e^jω)| of N(e^jω) is replaced by its average value µ(e^jω) taken over the regions classified as "noise-only". For this analysis, the first 25 ms were used as the "noise-only" region. The phase θ_N(e^jω) of N(e^jω) is replaced by the phase θ_X(e^jω) of X(e^jω). Through manipulation and substitution of equation (3.2) we obtain the spectral subtraction estimator Ŝ(e^jω):

Ŝ(e^jω) = [ |X(e^jω)| - µ(e^jω) ] e^{jθ_X(e^jω)}    (3.4)

The error that results from this estimator is given by

ε(e^jω) = Ŝ(e^jω) - S(e^jω) = N(e^jω) - µ(e^jω) e^{jθ_X}    (3.5)
3.1.4 Frame Averaging
In an effort to reduce this error, local averaging is used, because ε(e^jω) is simply the difference between |N(e^jω)| and its mean µ. Therefore |X(e^jω)| is replaced with the average magnitude

|X̄(e^jω)| = (1/M) Σ_{i=0}^{M-1} |X_i(e^jω)|

where X_i(e^jω) is the i-th time-windowed transform of x(k). By substitution in equation (3.4) we have

Ŝ_A(e^jω) = [ |X̄(e^jω)| - µ(e^jω) ] e^{jθ_X(e^jω)}    (3.6)

The spectral error is now approximately

ε(e^jω) = Ŝ_A(e^jω) - Ŝ(e^jω) ≅ |N̄| - µ    (3.7)

where |N̄(e^jω)| = (1/M) Σ_{i=0}^{M-1} |N_i(e^jω)|.

Thus, the sample mean of |N(e^jω)| will converge to µ(e^jω) as a longer average is taken [1].
It has also been noted that averaging over more than three half-overlapped frames will weaken intelligibility.
3.1.5 Half-Wave Rectification
For frequencies where |X(e^jω)| is less than µ(e^jω), the estimator Ŝ(e^jω) becomes negative, so the output at these frequencies is set to zero. This is half-wave rectification. The advantage of half-wave rectification is that the noise floor is reduced by µ(e^jω) [1]. When the speech plus the noise is less than µ(e^jω), however, this leads to an incorrect removal of speech information and a possible decrease in intelligibility.
3.1.6 Residual Noise Reduction
While half-wave rectification zeros out the speech-plus-noise that is less than µ(e^jω), speech-plus-noise above µ(e^jω) still remains. When there is no speech present in a given signal, the difference between N̄ and µe^{jθ_n} is called the noise residual, and it manifests itself as randomly spaced narrow bands of magnitude spikes. Once the signal is transformed back into the time domain, these spikes sound like the sum of tone generators with random frequencies, a phenomenon known as the "musical noise effect". Because the magnitude spikes fluctuate from frame to frame, the audible effects of the noise residual can be reduced by replacing the current value in each frame with the minimum value chosen from the adjacent frames.
The motivation behind this replacement scheme is threefold. First, if the amplitude of Ŝ(e^jω) lies below the maximum noise residual and varies radically from frame to frame, there is a high probability that the spectrum at that frequency is due to noise; therefore, suppress it by taking the minimum value. Second, if Ŝ(e^jω) lies below the maximum but has a nearly constant value, there is a high probability that the spectrum at that frequency is due to low-energy speech; therefore, taking the minimum will retain the information. Third, if Ŝ(e^jω) is greater than the maximum, there is speech present at that frequency; therefore, removing the bias is sufficient [1]. Residual noise reduction is implemented as:

Ŝ_i(e^jω) = Ŝ_i(e^jω)                                for |Ŝ_i(e^jω)| ≥ max|N_R(e^jω)|
Ŝ_i(e^jω) = min{ |Ŝ_j(e^jω)| : j = i-1, i, i+1 }     for |Ŝ_i(e^jω)| < max|N_R(e^jω)|    (3.8)

where N_R(e^jω) = N̄ - µe^{jθ_n}, and max|N_R(e^jω)| is the maximum value of the noise residual measured during non-speech activity.
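The replacement rule (3.8) can be sketched in MATLAB as follows; Smat is assumed to hold the magnitude estimates of consecutive frames in its columns, i an interior frame index (1 < i < number of frames), and maxNR the per-bin maximum noise residual measured during non-speech activity.

% Residual noise reduction (3.8): replace low bins by the 3-frame minimum.
low          = Smat(:, i) < maxNR;               % bins below the residual maximum
Smin         = min(Smat(:, i-1:i+1), [], 2);     % minimum over adjacent frames
Smat(low, i) = Smin(low);                        % larger bins are left unchanged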
3.1.7 Attenuate Signal during Non-Speech Activity
The amount of energy in Ŝ(e^jω) compared to µ(e^jω) supplies an indication of the presence of speech activity within a given analysis frame. Empirically, it was determined that the average (before versus after) power ratio was down at least 12 dB [1]. This offered an estimate for detecting the absence of speech, given by:

T = 20 log₁₀ [ (1/2π) ∫_{-π}^{π} |Ŝ(e^jω)| / µ(e^jω) dω ]

If T is less than -12 dB for a particular frame, the frame is classified as having no speech and is attenuated by a factor c, where 20 log₁₀ c = -30 dB. A value of -30 dB was found to be a reasonable, but not optimal, amount of attenuation [1]. The output of the spectral estimate including signal attenuation is given by:

Ŝ(e^jω) = Ŝ(e^jω)      if T ≥ -12 dB
Ŝ(e^jω) = c X(e^jω)    if T < -12 dB
3.1.8 Signal Reconstruction
After bias removal, half-wave rectification, noise residual reduction and signal attenuation, the inverse transform is taken for each window and overlap-added to form the output speech sequence.
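A sketch of this overlap-add synthesis with the 256-point window and 128-point advance used in this work; S is assumed to hold the modified complex spectra of the nFrames analysis frames in its columns.

% Overlap-add reconstruction of the output speech.
out = zeros((nFrames-1)*128 + 256, 1);
for i = 1:nFrames
    frame    = real(ifft(S(:, i), 256));   % modified frame back to time domain
    idx      = (i-1)*128 + (1:256);
    out(idx) = out(idx) + frame;           % add into the half-overlapped output
end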
3.2 Beamforming Technique
The Beamformer is based on the idea that the output of each sensor is the same, except that each one is delayed by a different amount. So, if the output of each sensor is delayed appropriately and all the outputs are then added together, the signal that was propagating through the array reinforces itself, while the noise tends to cancel.
Fig. 3.2 Algorithm for Beamforming (each microphone signal S1(k), S2(k), ..., SM(k) undergoes delay calculation and shifting, and all the delayed signals are summed to give the output y(k))
3.2.1 Delay Calculation
Since the arrival time of the speech wavefront is different at each microphone, as shown in Figure 3.3, the temporal differences between the microphones must be known so that the signals can be aligned. For example, if the k-th microphone has the longest distance from the source, the signal received at the m-th microphone should be delayed by

d = (D_k - D_m) / c    (3.9)

where D_m denotes the distance from the source to the m-th microphone and c is the speed of sound.

Figure 3.3 A Uniform Linear Array with a source in the near-field
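The delay of (3.9) can be computed as in the following sketch, where src (the source position) and mics (an M-by-3 matrix of microphone positions) are assumed inputs describing the simulated geometry.

% Near-field delay for each of M microphones.
c    = 345;                                          % speed of sound (m/s)
fs   = 8000;                                         % sampling rate (assumed)
dist = sqrt(sum((mics - repmat(src, M, 1)).^2, 2));  % source-to-mic distances
tau  = (max(dist) - dist) / c;                       % delays in seconds, eq. (3.9)
nDel = round(tau * fs);                              % delays in samples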
3.2.2 Shifting of Each Microphone Signal
For the m-th channel of the system we have:

y_m(k) = x(k - τ_m) + n_m(k)    (3.10)

where x(k) is the desired signal, τ_m the delay applied to the input signal, n_m(k) the noise present in the channel, and y_m(k) the available input of this channel.
3.2.3 Summation of Delayed Signals
The overall output is obtained by adding all contributions, with adequate compensating delays in each of them, giving:

x̂(k) = Σ_{m=1}^{M} y_m(k + τ_m)    (3.11)

The signal-to-noise ratio (SNR) of the total signal is greater than (or at worst, equal to) that of any individual microphone's signal. This system makes the array pattern more sensitive to sources from a particular desired direction.
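Sections 3.2.2 and 3.2.3 together amount to the following sketch; X is assumed to be an Nsamp-by-M matrix of microphone signals and nDel the vector of sample delays computed above.

% Delay-and-sum: shift each channel by its delay and sum.
y = zeros(Nsamp, 1);
for m = 1:M
    d = nDel(m);
    y = y + [zeros(d,1); X(1:Nsamp-d, m)];   % channel delayed by d samples
end
y = y / M;                                   % normalize by the number of mics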
3.3 Combined Technique
Spectral Subtraction improves the SNR significantly while keeping the musical noise small. Beamforming follows, to enhance the speech and reduce the residual noise to the point where the musical noise is hardly perceptible. Even though non-stationary noise passes through Spectral Subtraction, the Beamformer reduces it if the noise direction is different from the speech direction. The details of the developed method are described below.
3.3.1 Spectral Subtraction
Suppose the signal s(k) is corrupted by additive noise n(k) such that

y(k) = s(k) + n(k)    (3.12)

and its Fourier transform is

Y(ω) = S(ω) + N(ω)    (3.13)

If the signal and the noise are uncorrelated, the power spectral density S_Y(ω) is

S_Y(ω) = S_S(ω) + S_N(ω)    (3.14)

The power spectrum can be approximated by

|Y(ω)|² ≈ |S(ω)|² + |N(ω)|²    (3.15)

After the power spectrum of the noisy speech is estimated, from (3.15) the power spectrum of the speech can be estimated as

|Ŝ(ω)|² = |Y(ω)|² - |N̂(ω)|²
        = ( 1 - |N̂(ω)|² / |Y(ω)|² ) · |Y(ω)|²    (3.16)
        = ( 1 - 1 / SNR(ω) ) · |Y(ω)|²
        = |H(ω)|² · |Y(ω)|²    (3.17)

where SNR(ω) is the a posteriori SNR, defined as

SNR(ω) = |Y(ω)|² / |N̂(ω)|²    (3.18)

H(ω) is the system response of the Spectral Subtraction filter. The speech can then be estimated as

Ŝ(ω) = |Ŝ(ω)| · e^{j∠Ŝ(ω)} = H(ω) · |Y(ω)| · e^{j∠Ŝ(ω)} = H(ω) · |Y(ω)| · e^{j∠Y(ω)}    (3.19)
     = H(ω) · Y(ω)    (3.20)

and

ŝ(k) = IFFT[ H(ω) · |Y(ω)| · e^{j∠Y(ω)} ]    (3.21)

where

H(ω) = sqrt( 1 - |N̂(ω)|² / |Y(ω)|² )    (3.22)
3.3.2 Musical Noise Reduction by Beamforming
Many approaches have been tried to reduce the musical noise of spectral subtraction. The musical noise must be characterized and understood before a way can be found to eliminate it. For convenience, suppose that the observed signal contains no speech but noise alone, so that Y(ω) = N(ω). Similarly to (3.17), the speech spectrum is estimated as

|Ŝ(ω)|² ≈ |H(ω)|² · |N(ω)|²    (3.23)

where

H(ω) = sqrt( 1 - |N̂(ω)|² / |N(ω)|² )

The estimate Ŝ(ω) then shows the characteristics of the musical noise after spectral subtraction. From (3.19), Ŝ(ω) is represented by

Ŝ(ω) ≈ H(ω) · |N(ω)| · e^{j∠N(ω)}    (3.24)
     = e^{j∠N(ω)} · sqrt( |N(ω)|² - |N̂(ω)|² )    (3.25)

Suppose we have M microphones, so that there are M channels. When the musical noise of the m-th microphone is Ŝ_m(ω), it can be represented as

Ŝ_m(ω) = e^{j∠N_m(ω)} · sqrt( |N(ω)|² - |N̂(ω)|² )    (3.26)

where N_m(ω) is the noise at the m-th microphone. For a delay-sum Beamformer whose beam direction is set perpendicular to the microphone array, the output is

Ŝ(ω) = (1/M) Σ_{m=1}^{M} Ŝ_m(ω)
     = sqrt( |N(ω)|² - |N̂(ω)|² ) · (1/M) Σ_{m=1}^{M} e^{j∠N_m(ω)}    (3.27)

It should be noted that (3.27) can be viewed through the Minkowski inequality:

| (1/M) Σ_{m=1}^{M} e^{j∠N_m(ω)} | ≤ (1/M) Σ_{m=1}^{M} | e^{j∠N_m(ω)} | ≤ 1    (3.28)

where equality holds if and only if e^{j∠N_1(ω)} = e^{j∠N_2(ω)} = ... = e^{j∠N_M(ω)}.

Therefore, the musical noise of spectral subtraction can be reduced by Beamforming whenever the noise does not come from the look direction of the Beamformer. Beamforming eliminates musical noise with even a few microphones, since the musical noise is perceptually annoying but its power is not intense [4].
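Putting the two stages together, the combined technique can be sketched as below; SpecSub is the single-channel routine described in the Appendix, while xMics (an Nsamp-by-M matrix of microphone signals) and nDel (the sample delays from Section 3.2.1) are assumed inputs.

% Combined technique: spectral subtraction per channel, then delay-and-sum.
xe = zeros(Nsamp, M);
for m = 1:M
    e  = SpecSub(xMics(:, m), fs, 0.25);     % enhance each channel first
    n  = min(length(e), Nsamp);
    xe(1:n, m) = e(1:n);                     % guard against small length changes
end
y = zeros(Nsamp, 1);
for m = 1:M
    d = nDel(m);
    y = y + [zeros(d,1); xe(1:Nsamp-d, m)];  % beamform the enhanced channels
end
y = y / M;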
3.4 Signal to Noise Ratio (SNR)
The Signal to Noise Ratio (SNR) is a performance measure used in many applications. Here the original (reference) signal is denoted s(n) and the enhanced speech signal y(n), and the SNR in dB is computed as

SNR = 10 log₁₀ [ Σ_n s²(n) / Σ_n ( s(n) - y(n) )² ]    (3.29)

The use of this measure is most appropriate in situations where the intent is to reproduce the original speech signal exactly.
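In MATLAB, (3.29) is a one-liner, assuming the reference s and the enhanced signal y are column vectors of equal length:

% SNR in dB between the reference and the enhanced signal, eq. (3.29).
snr_dB = 10 * log10( sum(s.^2) / sum((s - y).^2) );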
Chapter 4 DEVELOPMENT OF ALGORITHM AND ASSOCIATED SOFTWARE

4.1 Algorithm for the Combined Technique
4.2 Algorithm Implementation
4.2.1 Input-Output Data Buffering and Windowing
4.2.2 Frequency Analysis
4.2.3 Magnitude Averaging
4.2.4 Bias Estimation
4.2.5 Bias Removal and Half-Wave Rectification
4.2.6 Residual Noise Reduction
4.2.7 Additional Noise Suppression During Non-Speech Activity
4.2.8 Synthesis
4.2.9 Delay Calculation
4.2.10 Shifting of Each Signal
4.2.11 Summation of Delayed Signals
Chapter 4 DEVELOPMENT OF ALGORITHM AND ASSOCIATED SOFTWARE

4.1 Algorithm for the Combined Technique
Based on the development of the previous chapter, a complete analysis-synthesis algorithm is constructed.
Figure 4.1 Algorithm for Combined Technique
4.2 Algorithm Implementation
This section presents the specifications required to implement a technique combining Spectral Subtraction with Beamforming.
4.2.1 Input-Output Data Buffering and Windowing
Speech is segmented and windowed such that, in the absence of spectral modifications, if the synthesis speech segments are added together the resulting overall system reduces to an identity. The data are segmented and windowed using the result [37] that if a sequence is separated into half-overlapped data buffers and each buffer is multiplied by a Hamming window, then the sum of these windowed sequences adds back up to the original sequence. The window length is chosen to be approximately twice as large as the maximum expected pitch period for adequate frequency resolution [38]. For the sampling rate of 8.00 kHz, a window length of 256 points shifted in steps of 128 points was used. Figure 4.2 shows the data segmentation and advance.
Figure 4.2 Data Segmentation and Advance
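As a quick numerical check of the half-overlap property described above, the following sketch sums two Hamming windows shifted by 128 samples; the overlapping region of the sum is nearly constant (the exact flatness depends on the window definition used by MATLAB).

% Two half-overlapped 256-point Hamming windows.
w   = hamming(256);
ola = [w; zeros(128,1)] + [zeros(128,1); w];
% ola(129:256) is approximately constant, so unmodified
% analysis-synthesis adds back up to a scaled copy of the input.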
The function segment chops a signal into overlapping windowed segments. A = SEGMENT(X,W,SP,WIN) returns a matrix whose columns are the segmented and windowed frames of the one-dimensional input signal X. W is the number of samples per window (default W = 256).
SP is the shift percentage (default SP = 0.4). WIN is the window that is multiplied by each segment, and its length should be W. The default window is the Hamming window.

function Seg = segment(signal,W,SP,Window)
if nargin

APPENDIX

Data structure with recording environment parameters. The fields are:
"fs": sampling frequency used to resample the wave file to create "sigin".
"c": (optional) speed of sound in the recorded environment (default is 345 m/s).
"corners": (optional) a 2-column matrix denoting the coordinates of a pair of opposite corners of the rectangular room in which the recording was made. If not present, room reverberation will NOT be simulated.
"reflecc": (optional) a 6-element vector denoting the reflection coefficients (between 0 and 1) of each of the room surfaces. The first 4 elements correspond to the walls going clockwise around the vertex given in the first column of "corners" (looking down from the ceiling); the next element is the ceiling and the last element is the floor. If "corners" is not given, these parameters will not be used. The default values are .8 for all walls and .6 for the floor and ceiling.
"airatten": (optional) the frequency-dependent scale factor, in units of dB per meter-Hz. The default value is -3.2808399e-5. If set to 0, no attenuation due to the air path is applied (the value must be either 0 or negative).
"mpath": (optional) a 3- or 4-row matrix containing the x,y,z (z if 3D is included) coordinates of multipath scatterers in the last 2 or 3 rows. The first row is the isotropic scattering coefficient (unitless and less than 1 for a passive scatterer). Only first-order multipath is considered. If not present, multipath is not generated. This cannot be used if "reflecc" is given; if "corners" is given with "mpath", the image method will be used to simulate the reverberation and mpath will be ignored.

3. WAVWRITE: Write a Microsoft WAVE (".wav") sound file. WAVWRITE(Y,FS,NBITS,WAVEFILE) writes data Y to a Windows WAVE file specified by the file name WAVEFILE, with a sample rate of FS Hz and NBITS bits. NBITS must be 8, 16, 24, or 32. Stereo data should be specified as a matrix with two columns. For NBITS < 32, amplitude values outside the range [-1,+1] are clipped. WAVWRITE(Y,FS,WAVEFILE) assumes NBITS = 16. WAVWRITE(Y,WAVEFILE) assumes NBITS = 16 and FS = 8000 Hz.

4. function output = SpecSub(signal,fs,IS): SIGNAL is the noisy signal, FS is the sampling frequency, and IS is the initial silence (noise-only) length in seconds (default value is .25 sec).

5. function ReconstructedSignal = OverlapAdd2(XNEW,yphase,windowLen,ShiftLen): Y = OverlapAdd(X,A,W,S); Y is the signal reconstructed from its spectrogram. X is a matrix with each column being the FFT of a segment of the signal. A is the phase angle of the spectrum, which should have the same dimensions as X; if it is not given, the phase angle of X is used, which in the case of real values is zero (assuming X is the magnitude). W is the window length of the time-domain segments; if not given, the length is assumed to be twice the FFT window length. S is the shift length of the segmentation process (for example, in the case of non-overlapping segments it is equal to W, and in the case of 50% overlap it is equal to W/2; if not given, W/2 is used). Y is the reconstructed time-domain signal.

6. function [NoiseFlag, SpeechFlag, NoiseCounter, Dist] = vad(signal, noise, NoiseCounter, NoiseMargin, Hangover): Spectral Distance Voice Activity Detector. SIGNAL is the magnitude spectrum of the current frame, which is to be labeled as noise or speech; NOISE is the noise magnitude spectrum template (estimate); NOISECOUNTER is the number of immediately previous noise frames; NOISEMARGIN (default 3) is the spectral distance threshold; HANGOVER (default 8) is the number of noise segments after which SPEECHFLAG is reset (goes to zero). NOISEFLAG is set to one if the segment is labeled as noise. NOISECOUNTER returns the number of previous noise segments; this value is reset (to zero) whenever a speech segment is detected. DIST is the spectral distance.

7. function Seg = segment(signal,W,SP,Window): SEGMENT chops a signal into overlapping windowed segments. A = SEGMENT(X,W,SP,WIN) returns a matrix whose columns are the segmented and windowed frames of the one-dimensional input signal X. W is the number of samples per window (default W = 256). SP is the shift percentage (default SP = 0.4). WIN is the window that is multiplied by each segment, and its length should be W; the default window is the Hamming window.

8. function val = SNR(Data,Start,End): SNR is a simple function to compute the Signal-to-Noise Ratio. val = SNR(Data,Start,End), where Data is the vector that contains the signal and [Start,End] is a segment of the signal that is just noise. The returned val is in dB.
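Since the listing of segment above breaks off after its first lines, a minimal reconstruction consistent with the description in item 7 is given below; the actual toolbox code may differ in detail.

function Seg = segment(signal, W, SP, Window)
% Chop a signal into overlapping windowed segments (reconstruction
% consistent with the description above; not the original listing).
if nargin < 2, W = 256; end
if nargin < 3, SP = 0.4; end
if nargin < 4, Window = hamming(W); end
Window = Window(:);                              % force column orientation
signal = signal(:);
SPn    = fix(W * SP);                            % shift in samples
N      = fix((length(signal) - W) / SPn) + 1;    % number of segments
Index  = repmat((1:W)', 1, N) + repmat((0:N-1) * SPn, W, 1);
Seg    = signal(Index) .* repmat(Window, 1, N);  % windowed frames as columns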
REFERENCES
References
[1] Steven F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction”, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, April 1979.
[2] Alan V. Oppenheim and Ronald W. Schafer, “Discrete-Time Signal Processing”, Prentice Hall, New Jersey, 1989.
[3] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, “Discrete-Time Processing of Speech Signals”, 2nd edition, IEEE Press, 2000.
[4] J. Cho and A. Krishnamurthy, “Speech enhancement using microphone array in moving vehicle environment”, in Proc. IEEE Intelligent Vehicles Symposium, 9-11 June 2003, pp. 366-371.
[5] L. R. Rabiner and B. H. Juang, “Fundamentals of Speech Recognition”, Englewood Cliffs, NJ: Prentice-Hall, 1993.
[6] J. Bitzer, K. U. Simmer, and K. Kammeyer, “Multi-microphone Noise Reduction Techniques as Front-end-Devices for Speech Recognition” ELSEVIER Speech Communication, 2001, pp. 3-12
[7] O. Cappe, “Elimination of Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor” IEEE Trans. SAP, vol.2, No.2, April 1994, pp. 345-349
[8] S. Ogata and T. Shimamura, “Reinforced Spectral Subtraction Method to Enhance Speech Signal”, IEEE Catalogue No. 01CH37239, 2001, pp. 242-245.
[9] N. Virag, “Single Channel Speech Enhancement Based on Masking Properties of Human Auditory System”, IEEE Trans. SAP, vol. 7, no. 2, March 1999, pp. 126-137.
[10] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement using beamforming and nonstationarity with applications to speech”, IEEE Trans. Signal Processing, vol. 49, no. 8, Aug. 2001, pp. 1614-1626.
[11] J. Ryan and R. Goubran, “Near-Field Beamforming for Microphone Arrays”, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), vol. 1, 1997, p. 363.
[12] Ivan Tashev and Henrique S. Malvar “A New Beamformer Design Algorithm for Microphone Arrays” Microsoft Research One Microsoft Way, Redmond, WA 98052, USA
[13] R. Cole, “Survey of the State of the Art in Human Language Technology”, Cambridge University Press, New York, NY, 1997.
[14] S. Haykin, “Adaptive Filter Theory”, Prentice Hall, Upper Saddle River, NJ, 3rd edition, 1995.
[15] Y. Ephraim and H. L. Van Trees, “A signal Subspace Approach for Speech Enhancement”, IEEE Trans. On Speech and Audio Processing, 3(4):251-266, 1995.
[16] P. S. K. Hansen, “Signal Subspace Methods for Speech Enhancement”, Technical University of Denmark, PhD thesis, 1997
[17] Y. Hu and P.C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise” IEEE Trans. Speech Audio Proc., pages 334-341, July 2003.
[18] S. Gazor and A. Rezayee, “An adaptive subspace approach for speech enhancement”, In Proc. ICASSP 2000, pages 1839-1842, 2000.
[19] A. Rezayee and S. Gazor, “An adaptive klt approach for speech enhancement”, IEEE Trans. Speech Audio Processing, 9:87-95, 2001.
[20] B. Yegnanarayana, C. Avendano, H. Hermansky, and P. Satyanarayana Murthy, “Speech Enhancement using Linear Prediction Residual”, Speech Communication, 28:25-42, 1999.
[21] J. S. Lim, “Evaluation of a correlation subtraction method for enhancing speech degraded by additive white noise”, IEEE Trans. Acoust., Speech, Signal Processing, 26:471-472, 1978.
[22] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator”, IEEE Trans. ASSP, 32:1109-1121, 1984.
[23] Y. Ephraim, “Statistical-model-based speech enhancement systems”, Proc. IEEE, 80(10): 1526-1555, 1992
[24] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator”, IEEE Trans. On ASSP, 33(2):443-445, April 1985.
[25] I. Cohen, “Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator”, IEEE Signal Processing Letters, pages 113-116, April 2002
[26] P.C. Loizou, “Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum”, IEEE Trans. Speech Audio Proc., pages 857-869, September 2005.
[27] J.S. Lim and A.V. Oppenheim, “Enhancement and bandwidth compression of noisy speech”, Proc. IEEE, 67:1586-1604, 1979.
[28] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise”, in Proc. IEEE Acoust., Speech Signal Process., pages 208-211, 1979.
[29] B.L. Sim, Y. C. Tong, J.S. Chang and C.T. Tan, “A parametric formulation of the generalized spectral subtraction method”, IEEE Trans. Speech and Audio Proc., 6(4):328-337, July 1998.
[30] Y. Kaneda and J. Ohga, “Adaptive microphone-array system for noise reduction”, IEEE Trans. Acoust., Speech, Signal Proc., 34:1391-1400, December 1986.
[31] J. Kleban and Y. Gong, “HMM adaptation and microphone array processing for distant speech recognition”, in Proc. ICASSP, volume 3, pages 1411-1414, June 2000.
[32] J.G. Ryan and R.A. Goubran, “Array optimization applied in the near field of a microphone array”, IEEE Trans. Speech and Audio Processing, 8(2):173-176, March 2000.
[33] Z. Li and M.W. Hoffman, “Evaluation of microphone arrays for enhancing noisy and reverberant speech for coding”, IEEE Trans. Speech and Audio Proc., 7(1):91-95, January 1999.
[34] T. B. Hughes, K. Hong-Seok, J. H. DiBiase, and H. F. Silverman, “Using a real-time tracking microphone array as input to an HMM speech recognizer”, in Proc. ICASSP, volume 1, pages 249-252, May 1998.
[35] B. Widrow, “A microphone array for hearing aids”, IEEE Circuits and Systems Magazine, 1(2):26-32, 2001.
[36] J. Flanagan, A. Surendran, and E. Jan, “Spatially selective sound capture for speech and audio processing”, Speech Communication, 1993, pp. 207-222.
[37] M. R. Weiss, E. Aschkenasy, and T. W. Parsons, “Study and development of the INTEL technique for improving speech intelligibility”, Nicolet Scientific Corp., Final Rep. NSC-FR/4023,Dec. 1974.
[38] J. Makhoul and J. Wolf, “Linear prediction and the spectral analysis of speech”, Bolt, Beranek, and Newman Inc., BBN Rep.
PUBLICATIONS
Publication
A paper on “Speech Enhancement Combining Spectral Subtraction and Beamforming Techniques for Microphone Array” was presented and published in the International Conference & Workshop on Emerging Trends in Technology (ICWET) 2010, pp. 163-166.
The conference was held in collaboration with ACM, in cooperation with ACM SIGHART & SIGARCH (USA), and in association with CDAC and IEEE (Bombay Section). A total of 222 camera-ready paper copies were received in the following tracks:
Track 1 – ALGORITHMS
Track 2 – ADVANCED COMPUTER APPLICATION
Track 3 – DATABASE ENGINEERING
Track 4 – SOFTWARE ENGINEERING
Track 5 – INTELLIGENT SYSTEMS
Track 6 – COMMUNICATION ENGINEERING
Track 7 – ELECTRONIC, MICROWAVE DEVICES & CIRCUITS
Track 8 – EMBEDDED SYSTEMS & APPLICATIONS
Track 9 – BIOMEDICAL, BIO-INFORMATICS & BIOTECHNOLOGY