Development of a Real Time Hearing Enhancement Algorithm for Crowded Social Environments
by
Yuxiang (Brian) Wang
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto
© Copyright by Yuxiang (Brian) Wang 2014
Development of a Real Time Hearing Enhancement Algorithm for Crowded Social Environments Yuxiang (Brian) Wang Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2014
Abstract

A novel hearing enhancement algorithm was developed specifically for real-time processing within crowded social environments. Binaural microphone array steering provides an initial estimate of the sound source location, and parallel noise gating is then applied to reduce crowd noise. The system adapts to environmental changes through an adaptive noise bank; noise identification and selection utilize amplitude modulation pitch tracking as well as pitch and formant continuity checks. The algorithm has been verified through both signal processing metrics and a formal listening experiment with 10 subjects. It can achieve up to a 10 dB Signal to Noise Ratio improvement on a near-zero dB SNR sound segment containing a target speaker immersed in crowd noise. Signal distortion and digital artifacts are significantly reduced with our method, improving target speech intelligibility. The listening experiment demonstrates a 39% accuracy increase in target speech recognition over the original noised segment, and a 31% accuracy increase over a conventional binary masking noise reduction method. The algorithm's computational speed allows for real-time processing.
Acknowledgments
I would like to express my sincere appreciation to my thesis supervisor and mentor, Professor Willy Wong; you have taught me much and guided me graciously over the years. I am extremely grateful for your kind encouragement, your constructive criticism, and the numerous opportunities that you have given me. Your advice on the project, on research in general, and on the bigger picture of life has helped me grow as a researcher and as a person. I would also like to thank my committee members, Professor Pascal Van Lieshout, Professor Hans Kunov, and my committee chair, Professor Mireille Broucke. Your suggestions and generosity are much appreciated and treasured. Last but not least, I would like to thank the other students in the Sensory Communication Lab; discussions with you have generated much-needed ideas and breakthroughs in my research, and I have learned many precious lessons in various fields of expertise from your help.
Table of Contents

Acknowledgments ..... iii
Table of Contents ..... iv
List of Figures ..... vi
List of Appendices ..... x
Chapter 1 : Introduction ..... 1
Chapter 2 : Background ..... 3
  2.1 Auditory Scene Analysis ..... 3
  2.2 Auditory Segregation Analysis ..... 4
  2.3 Directional Hearing ..... 5
    2.3.1 Beamforming Techniques ..... 6
    2.3.2 Blind Source Separation ..... 9
  2.4 Noise Reduction ..... 10
  2.5 Speech Identification and Tracking ..... 12
    2.5.1 Pitch Tracking ..... 13
    2.5.2 Formant Tracking ..... 18
    2.5.3 Continuity of Pitch and Formant ..... 19
Chapter 3 : Assumptions and Evaluation Metrics ..... 22
  3.1 Auditory Attention and Assumptions ..... 22
  3.2 Objective Evaluation Metrics ..... 23
    3.2.1 Underlying Principles ..... 24
    3.2.2 Experiments on Various Objective Evaluation Metrics ..... 25
Chapter 4 : ..... 34
  4.1 Directional Hearing Implementation ..... 36
  4.2 Pitch Tracking Implementation ..... 40
    4.2.1 AM demodulation Pitch Tracking Implementation ..... 40
    4.2.2 Comparison between Pitch Tracking Methods ..... 46
  4.3 Continuity Check Implementation ..... 53
  4.4 Parallel Noise Gating Implementation ..... 54
    4.4.1 Noise Gating Parameter Optimization ..... 54
    4.4.2 Digital Artifact and Signal Distortion Investigations ..... 58
    4.4.3 Parallel Processing Investigations ..... 62
Chapter 5 : ..... 65
  5.1 Experiment Design ..... 65
  5.2 Experiment Results ..... 67
  5.3 Computational Time ..... 72
Chapter 6 : ..... 74
References and Bibliography ..... 76
Appendix I ..... 85
Appendix II ..... 88
  Experiment Questionnaire ..... 88
List of Figures

Figure 2.3.1 Demonstration of basic beamforming technique for a satellite antenna ..... 7
Figure 2.3.2 Demonstration of fixed beamformer technology ..... 8
Figure 2.3.3 General side-lobe canceller model, and adaptive model for beamformers ..... 9
Figure 2.3.4 General illustration of blind source separation ..... 9
Figure 2.4.1 Brief demonstration of the noise gating process ..... 11
Figure 2.5.1 Demonstration of peak detection in frequency domain ..... 14
Figure 2.5.2 Demonstration of amplitude modulation ..... 16
Figure 2.5.3 Demonstration of amplitude modulation in time domain ..... 17
Figure 2.5.4 A circuit diagram of a basic envelope detector ..... 17
Figure 2.5.5 Demonstration of the effect of envelope detectors on time domain signals ..... 18
Figure 2.5.6 Demonstration of formants especially in relation with higher harmonics ..... 19
Figure 2.5.7 Spectrogram of an Utterance in English ..... 20
Figure 2.5.8 Difficulty in Prolonged Target Speaker Separation ..... 21
Figure 3.2.1 Spectrogram of a vowel-silence-vowel-silence-vowel synthetic sound segment ..... 26
Figure 3.2.2 Spectrogram of the synthetic sound segment shown in Fig 3.2.1 with the addition of crowd noise ..... 26
Figure 3.2.3 Spectrogram of the synthetic sound segment shown in Fig 3.2.1 with the distortion of harmonic contents (signal distortion) ..... 27
Figure 3.2.4 Spectrogram of the synthetic sound segment shown in Fig 3.2.1 with the addition of digital artifacts ..... 27
Figure 3.2.5 Results of various intelligibility evaluation metrics in different noise conditions for a synthetic sound segment ..... 29
Figure 3.2.6 Results of various intelligibility evaluation metrics in different noise conditions for recorded speech signals ..... 31
Figure 3.2.7 Spectrogram of the noise gating processed signal whose parameters are optimized by fwSNR evaluation metric ..... 33
Figure 3.2.8 Spectrogram of the noise gating processed signal whose parameters are optimized by LLR evaluation metric ..... 33
Figure 4.1 Overview of the algorithm (1) ..... 35
Figure 4.2 Overview of the algorithm (2) ..... 36
Figure 4.1.1 Speech weighting for each frequency band as suggested by previous studies ..... 37
Figure 4.1.2 Speech weighted directional response of the 5-microphone delay and sum microphone array system ..... 38
Figure 4.1.3 Speech steering response of a binaural microphone array setup ..... 39
Figure 4.2.1 Frequency representation of a signal modulated with a 4000 Hz carrier signal ..... 41
Figure 4.2.2 Frequency representation of a signal demodulated with an envelope detector ..... 41
Figure 4.2.3 Performance of the envelope detector on a 10 Hz sinusoidal wave amplitude modulated by a 4000 Hz carrier frequency shown in time domain ..... 42
Figure 4.2.4 Frequency representation of a recorded utterance of a vowel at 220 Hz ..... 43
Figure 4.2.5 Frequency representation of the demodulated utterance shown in Fig 4.2.4, with a clear peak at the fundamental frequency ..... 43
Figure 4.2.6 Spectrogram of a synthetic vowel segments compared with pitch tracking results ..... 45
Figure 4.2.7 Pitch tracking results of a speech signal using three different methods: AM demodulation, Yaapt, and Praat autocorrelation ..... 47
Figure 4.2.8 Comparison of pitch tracking methods using various evaluation criteria; 25 ms windows and 10 ms overlap were used for this comparison ..... 48
Figure 4.2.9 20% gross error accuracy comparison between three pitch tracking methods at various processing window sizes ..... 49
Figure 4.2.10 Big error accuracy comparison of three pitch tracking methods at various processing window size ..... 50
Figure 4.2.11 Frequency domain representation of a demodulated speech segment using 10 ms window size ..... 51
Figure 4.2.12 Frequency domain representation of a demodulated speech segment using 25 ms window size ..... 51
Figure 4.2.13 Comparison between the 20% gross error accuracy of three pitch tracking methods in noise added environments with various SNR ..... 52
Figure 4.4.1 LLR scores of noise gating output as a function of noise threshold parameter values ..... 56
Figure 4.4.2 LLR scores of noise gating output as a function of frequency smoothing parameter values, given optimal noise threshold level ..... 57
Figure 4.4.3 LLR scores of noise gating output as a function of continuity step parameter values, given optimal noise threshold level and frequency smoothing ..... 58
Figure 4.4.4 Spectrogram of crowd noise ..... 60
Figure 4.4.5 Frequency domain representation of the crowd noise shown in Fig 4.4.4 ..... 60
Figure 4.4.6 Spectrogram of digital artifact resulted from heavy digital signal processing of a crowd noise sample ..... 61
Figure 4.4.7 Frequency domain representation of the digital artifacts shown in Fig 4.4.6 ..... 61
Figure 4.4.8 Log Likelihood Ratio of noised speech signals at various numbers of parallel processing noise gating ..... 63
Figure 4.4.9 Spectrograms of the same speech signal with various digital processing ..... 64
Figure 5.2.1 Categorized comparison between different processing algorithms on a noised speech signal ..... 68
Figure 5.2.2 Categorized results comparison between female speaker and male speaker stimuli in listening tests ..... 69
Figure 5.2.3 Results comparison between listening tests performed with 0 dB, 5 dB and 10 dB SNR stimuli ..... 71
List of Appendices

Appendix I. Consent Form ..... 83
Appendix II. Experiment Questionnaire ..... 86
Chapter 1 : Introduction

Hearing aids have been receiving increasing attention in recent years. With the aging of the global population, hearing loss has become an increasingly common problem [1]. Current hearing aids, however, are often lacking. For example, many hearing loss patients reported that they could not effectively focus their attention on a particular speaker in a group environment even while wearing their prescribed hearing aids [1-5]. Difficulties in crowded social environments can lead not only to embarrassment for hearing loss patients, but also to misunderstandings and miscommunications. Thus one of the most sought-after improvements to hearing aids today is the addition of selective hearing in a cost-effective and intuitive design. Recent surveys have shown that a significant portion of hearing loss patients are not satisfied with their hearing aids and often do not wear them [5-6].

Selective hearing refers to several key functions that the human auditory system performs, including sound source localization, attenuation of noise, and signal tracking over time [3, 8-11]. These functions are often collectively called Auditory Scene Analysis, and are performed in the human brain by the combined effort of the auditory system including the auditory cortex [1-5, 8-18]. Aging and hearing impairment often introduce degradation in the auditory cortex, making it difficult to perform the aforementioned functions, which in turn introduces difficulties for hearing loss patients in multiple-speaker environments [7-13, 21-22].

While there have been other selective hearing aid designs, boasting most notably directional hearing and sound source localization, currently they all encounter difficulties in either intelligibility improvement or user friendliness [5-6]. More specifically, several designs have utilized microphone array technology or directional microphones to selectively amplify sound coming from a certain direction indicated by the user [27, 29]. These designs can produce significant improvements in signal to noise ratio; however, due to their hardware requirements they are clearly visible and sometimes require additional user manipulation. Brooks concluded from his survey of 184 hearing loss individuals that hearing loss patients dislike visible devices due to social stigma, and many will avoid wearing their hearing aids despite improvements in speech intelligibility in crowded environments [7].

For the above-mentioned reasons, it is necessary to develop a system that enhances speech intelligibility for hearing loss patients in crowded social environments with the following design criteria:

- Allows real time processing, such that the system can provide users with clarified speech during each conversation without significant delay
- Achieves significant improvement in Signal to Noise Ratio (SNR) as well as speech intelligibility
- Allows for minimally visible setups, such that hearing loss patients will better accept the design

Digital signal processing applied to hearing aids and related devices is a potential solution that satisfies all three design criteria. Hence we propose a real-time hearing enhancement algorithm that performs functions akin to the human selective hearing process.
Chapter 2 : Background

2.1 Auditory Scene Analysis

One of the most important features of human hearing is its ability to accurately distinguish and follow various sound sources within a complex acoustic signal. For example, in a crowded cocktail party, a normal hearing listener can converse with particular people in the room and accurately attribute heard speech to each speaker despite heavy interfering signals. This phenomenon of the human auditory system is often termed the "Cocktail Party Effect" or "Auditory Scene Analysis" [1-3]. An "auditory scene" refers to a situation where multiple acoustic sources are present, such that the acoustic signals arrive at the ears as one combined complex signal; in such situations the human auditory system can often separate the received signal into different auditory "streams" and follow each selected "stream" individually [1-3, 10, 14, 18]. Two specific problems then arise: how do humans segregate the complex acoustic signal into separate streams amidst heavy cross-over interference [2,9,10,12,18-19]? And, upon successful segregation, how do humans direct their attention to a particular source of sound while suppressing the surrounding competing sound sources [2,8,15-17]?

Computer scientists and engineers have attempted to model the segregation and focusing functions of the human auditory system for years, but detailed reconstruction of the human auditory system is very difficult due to the influences of top-down processes on the auditory system. Davis et al have reviewed recent findings in neuroscience, linguistics and audiology and concluded that top-down information flow drives and constrains the interpretation of spoken input; they further proposed that top-down interactive mechanisms within auditory networks play an important role in explaining the perception of spoken language [15]. Sussman et al have conducted experiments measuring event-related potentials of individuals given specific instructions in listening tests, such as to ignore the sound or to attend to the pattern; their results demonstrated that top-down effects on the sound representation are maintained in the auditory cortex [16]. In addition, these top-down effects may also influence peripheral auditory processes in the cochlea.
Even though current engineering models of the human auditory system cannot thoroughly explain the details of human hearing as a whole, they are often sufficient in solving certain specific and practical problems --- including those of various voice-recognition technologies, as well as medical problems such as those related to hearing loss [5, 21].
2.2 Auditory Segregation Analysis

Specific functions important to human selective hearing include binaural hearing and directional auditory perception [27-44], target signal identification [23-24], and target signal tracking across time [45-55]. Identifying these aspects of the acoustic environment is called Auditory Segregation Analysis. Translating these functions to the domain of digital signal processing, techniques such as directional microphone systems, noise reduction, pitch tracking, and speech identification can be employed to mimic the aforementioned functions [3, 8-11].

Hearing loss can be caused either by damage to the peripheral auditory organs or by damage to auditory cognition in the brain. Low-severity damage to the peripheral auditory organs (ear canals, ear drums, cochlea, and hair cells) can be treated with band-pass-filtered amplification of input signals, which is the method utilized in most of the hearing aids available today [6]. In high-severity cases of peripheral auditory organ damage (ear drum rupture, etc.), hearing aids are in general of little help to the patients and cochlear implants are most likely required [9-13]; hence this case is not considered for hearing aid design. Studies have shown that most hearing loss patients not only have difficulties receiving sound, but also have difficulties processing these sounds. Shinn-Cunningham et al reviewed recent findings related to hearing loss and arrived at the conclusion that hearing loss patients will likely have more difficulty distinguishing an auditory scene compared to normal hearing people [21]. In other words, most hearing loss patients experience problems in auditory processing functions such as directionality discrimination, signal separation, and target tracking. These problems are very likely the source of their difficulties in tracking a particular speaker in a crowded social environment. To alleviate the loss of auditory cognitive functions in these hearing loss patients, a hearing aid design with an inbuilt algorithm that mimics such essential procedures can provide these functionalities to its users, helping the patients converse with more ease in crowded social environments and other related situations.
A literature review was conducted to examine the main procedures prominent in the human selective hearing process; some of the most important factors include directional hearing, noise reduction, and signal identification and tracking [3, 8-11]. Moore et al in particular reviewed recent studies and further suggested other important factors to consider for signal separation and tracking; these include temporal envelope, fundamental frequency, phase spectrum, and interaural time or intensity differences [17]. Based on this information, we propose an algorithm that mimics each prominent function in the human selective hearing process. The detailed theories behind each of these procedures and the available signal processing methods to represent them are described in the following sections.
2.3 Directional Hearing

Directional hearing refers to the human cognitive process of analyzing the binaural sound input from each ear and identifying the sound source location through the differences between these two channels. Given the same sound source, the differences between the signals received at each ear are collectively known as the Interaural Time Difference and the Interaural Level Difference [10-18]. Acoustic signals from a particular sound source arrive at the two ears differently due to the head shadowing effect, pinna and shoulder echoes, the distance between the sound source and the receiver, as well as environmental effects such as room echoes and sound absorption [10-12, 18]. The normal human auditory system can easily distinguish the location of sound sources given a binaural input, making decisions about sound source locations quickly based on binaural differences. Once the locations of the sound sources are determined, the human auditory system along with the central cognition system work together to amplify signals from a particular direction, effectively "shifting the attention" to only a selected few sound sources while suppressing all other signals. However, hearing loss patients often have difficulties in directional hearing due to losses in their cognitive functions; the loss of directional hearing along with problems in auditory attention makes it very difficult for hearing loss patients to attune to one particular speaker in a crowded environment, as they are overwhelmed by auditory information coming from all directions. Marrone et al performed experiments in which normal hearing and hearing impaired subjects of different age groups were placed in rooms with varying reverberation. The subjects were instructed to localize the direction of a sound played to them from an unknown direction [28]. They found that aging has a significant negative impact on the performance of the subjects.

Several different designs and algorithms have been previously developed, intending to provide directionality to hearing loss patients. Nearly all of these methods utilize the spatial locations of the receivers in relation to each other along with the physical characteristics of sound waves as their theoretical basis [6, 27]. Such methods include blind source separation as well as beamforming techniques. One example is a necklace-shaped microphone array device proposed by Widrow [29]. The general underlying principles of these methods are discussed in more detail below.
2.3.1 Beamforming Techniques
Current hearing aids often employ various methods to improve performance; one of the most popular techniques is the manipulation of sound source location through directional microphones [27]. A popular method is to utilize beamforming in a microphone array to introduce directionality in the received signals. Beamforming is a technique traditionally employed in satellite antenna signal detection, as shown in Figure 2.3.1 below. It assumes that the sound source is distant enough from the receiver that the sound can be treated as a plane wave. The system then selectively amplifies the sound coming from a specific direction through manipulation of temporal delays, effectively "steering" the focus towards a particular direction [25-27]. Using beamforming in the microphone array allows directional variations in the output sound without movement of the microphone array itself. There are two types of beamforming techniques that are widely employed today: fixed beamforming methods and adaptive methods. It should be noted that while some of the previously developed microphone array methods provide noticeable improvements in speech intelligibility, the requirement for external visible hardware brings much discomfort to the users [7]. This is a significant obstacle that warrants our primary consideration.
Figure 2.3.1 Demonstration of basic beamforming technique for a satellite antenna
2.3.1.1 Fixed Beamformers
Fixed beamformers are a class of beamformers designed with fixed parameters that do not adjust their response to the statistics of the incoming signals [25-27]. The simplest structure for fixed beamformers is the "delay and sum" structure, which uses only the differences in the travelling distances of sound waves to selectively increase signal strength in the desired direction. This concept assumes that an array of microphones is placed in a straight line with fixed distances in between; the sound wave radiated from the sound source then arrives at the different microphones at different times depending on its direction. The arrival time difference (or phase difference) of the perceived signals at each microphone can therefore be manipulated such that the signal from a particular direction is enhanced. An illustration of such a situation is shown in Figure 2.3.2 below.
Figure 2.3.2 A scenario where a fixed beamformer is used to identify the target signal

Traditionally, the intended signals in satellite antenna beamforming occupy narrow frequency bands; however, human voice is a broadband signal ranging from 100 Hz to 8000 Hz, and simply using narrowband beamforming would introduce steering difficulties [31-33]. Adjusting the response of these channels for different frequency bands can attenuate the steering difficulties resulting from variations in the frequency spectrum [31-33]. For each frequency sub-band, a different weight is assigned to each input channel depending on its directionality. In this way, the choice of weights can trade off performance between unimportant and important bands. A simple sketch of the delay-and-sum principle is given below.
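To make the delay-and-sum principle concrete, the following minimal Python sketch (an illustration under assumed conditions, not the implementation developed in this thesis) delays each channel of an assumed uniform linear array according to a chosen steering angle and sums the result; frequency-band weighting of the kind described above could be added by scaling each band before summation.

```python
import numpy as np

def delay_and_sum(channels, fs, mic_spacing, steer_deg, c=343.0):
    """Delay-and-sum beamformer for a uniform linear array (illustrative sketch).

    channels: array of shape (n_mics, n_samples), one row per microphone.
    fs: sampling rate in Hz; mic_spacing: distance between adjacent mics in metres.
    steer_deg: steering angle in degrees (0 = broadside); c: speed of sound in m/s.
    """
    n_mics, n_samples = channels.shape
    # Plane-wave assumption: relative delay of each microphone for the chosen direction.
    delays = np.arange(n_mics) * mic_spacing * np.sin(np.radians(steer_deg)) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # Compensate each channel's delay as a linear phase shift in the frequency domain.
        spectrum = np.fft.rfft(channels[m]) * np.exp(-2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spectrum, n=n_samples)
    # Signals arriving from steer_deg add coherently; signals from other directions partially cancel.
    return out / n_mics
```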
2.3.1.2 Adaptive Beamformers
Adaptive beamforming techniques employ signal estimation and cancellation based on incoming signal statistics [34-43]. Adaptive methods build on the basic principles of fixed beamformers (delay and sum), with the addition of memory to the system; this setup allows adaptive algorithms to analyze the signals both temporally and spatially. On top of the fixed beamforming algorithm, the additional memory provides the system with information such as the frequency spectrum of the incoming signals. A design of such an adaptive beamformer system is shown below in Figure 2.3.3.
Figure 2.3.3 General side-lobe canceller model, and adaptive model for beamformers.
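As a rough illustration of the adaptive stage, the sketch below implements a single least-mean-squares (LMS) noise-cancelling filter of the kind found in the adaptive branch of a generalized sidelobe canceller: a primary channel (fixed beamformer output, target plus residual noise) is cleaned using a noise reference supplied by a blocking branch. The filter length and step size are assumed values, and this is not the specific adaptive design discussed in this thesis.

```python
import numpy as np

def lms_cancel(primary, noise_ref, n_taps=32, mu=0.01):
    """Adaptively subtract the component of `primary` correlated with `noise_ref` (LMS)."""
    w = np.zeros(n_taps)                      # adaptive filter weights
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = noise_ref[n - n_taps:n][::-1]     # most recent reference samples, newest first
        y = w @ x                             # estimate of the noise present in the primary channel
        e = primary[n] - y                    # error signal = cleaned output sample
        w += 2 * mu * e * x                   # LMS weight update
        out[n] = e
    return out
```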
2.3.2 Blind Source Separation
Blind source separation is a collective terminology for many different methods targeting different situations. For acoustic signal processing, since the sources of acoustic signals are generally independent of each other, Independent Component Analysis (ICA) is often used. In most cases, the assumption of no time delays or spatial echoes in the system is made to simplify the problem. It should be noted that in general, if N sources are present in the environment, at least N microphones have to be utilized for the success of the method. The general theory behind Independent Component Analysis lies in maximizing the independence of each component and minimizing mutual information [44]. Below is a general illustration of the blind source separation algorithms:
Figure 2.3.4. General illustration of blind source separation
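For illustration only (ICA is ultimately not adopted in this work, as discussed next), the sketch below unmixes two assumed independent sources from two simulated microphone channels using the off-the-shelf FastICA implementation in scikit-learn.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two assumed independent sources mixed onto two microphones (simulated data).
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 220 * t)              # tonal, speech-like source
s2 = np.sign(np.sin(2 * np.pi * 3 * t))       # slowly varying interferer
S = np.c_[s1, s2]
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # unknown mixing matrix
X = S @ A.T                                   # observed microphone signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                  # estimated sources, recovered up to scale and order
```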
Due to its requirement for numerous microphones to identify a corresponding number of sound sources, ICA is not a particularly suitable method for this project. Its widespread application in speech-related projects warrants an investigation; however, the main obstacle to applying blind source separation to this particular project is the nature of our target environment. Crowd noise often comes from an unidentified number of sound sources, and this number changes from one crowded environment to another. Hence, designing a system with a predetermined number of microphones would be a significant challenge.
2.4 Noise Reduction

Noise reduction in auditory signal processing refers to the suppression of environmental noise, including interfering speech. The human auditory system is able to perform such noise reduction on surrounding environmental noise in some situations, such that one single target sound stands out from the rest. For hearing loss patients, however, this function can be compromised [3-5]. One of the most commonly employed auditory noise reduction techniques is noise gating, otherwise known as binary masking [56-57]. Noise gating is an audio post-processing technique that is commonly used to reduce relatively stable environmental noise, such as white Gaussian noise. The input signal is compared to a noise sample in the frequency domain: if at a certain frequency the input signal intensity is less than that of the noise sample, that frequency is eliminated from the signal being processed. Specific implementations of noise gating can vary significantly; however, all share the same fundamental principle as well as the need for a noise sample. Many existing audio processing software packages utilize noise gating in their de-noising algorithms [56-57]. A demonstration of the noise gating process is shown below in Figure 2.4.1, followed by a brief code sketch.

While noise gating is often able to achieve a significant level of noise reduction, one potential problem is that in environments where the noise has a spectral profile similar to the intended signal, residual noise will appear at certain frequencies, resulting in digital artifacts. Furthermore, at certain frequencies the intended signal will be treated as noise and suppressed, resulting in signal distortion. These problems are germane to this particular project due to our targeted crowd environment.
Figure 2.4.1 Brief demonstration of the noise gating process. A) shows the spectral distribution of the noise sample, B) shows the recorded signal containing both intended signal as well as noise, C) frequency bins that do not contain high enough energy compared to the noise profile in the top graph will be zeroed, resulting in the final processed signal shown in the bottom graph
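A minimal sketch of this noise gating (binary masking) process is given below, assuming a pre-recorded noise sample and illustrative window and threshold parameters; the actual parameter optimization is discussed in Chapter 4.

```python
import numpy as np

def noise_gate(signal, noise_sample, frame_len=1024, threshold=1.5):
    """Binary masking: zero frequency bins whose magnitude does not exceed the noise profile."""
    # Average magnitude spectrum of the noise sample (the noise profile).
    n_frames = len(noise_sample) // frame_len
    noise_mag = np.mean([np.abs(np.fft.rfft(noise_sample[i * frame_len:(i + 1) * frame_len]))
                         for i in range(n_frames)], axis=0)
    out = np.zeros(len(signal))
    window = np.hanning(frame_len)
    for start in range(0, len(signal) - frame_len, frame_len // 2):   # 50% overlapping frames
        frame = signal[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mask = np.abs(spec) > threshold * noise_mag                   # keep bins above the gate
        out[start:start + frame_len] += np.fft.irfft(spec * mask, n=frame_len)
    return out
```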
2.5 Speech Identification and Tracking

Speech processing often requires the identification of the intended signal within a short time window based on previous knowledge of such signals. The intended signal in our case is human speech, which is known to have fundamental frequencies between 80 and 260 Hz for adult voiced speech. It is also known that harmonics occur at approximately integer multiples of the fundamental frequency for human speech. Furthermore, the formant structure of voiced speech signals displays particular patterns depending on the vowels spoken. Several studies have demonstrated that the speech fundamental frequency range is one of the most significant identifiers of individual speakers [45, 47-54]. For example, LaRiviere conducted experiments on aural speaker identification, attempting to ascertain the relative contributions of fundamental frequency and formant frequencies [51]. Pitch tracking is therefore an important function that we will investigate. Other possible factors to be investigated include the formant locations within each speech segment obtained from the linear prediction coefficient (LPC) frequency response envelope. Snell has discussed the estimation of formant frequencies and bandwidths from the filter coefficients obtained through LPC analysis of speech [52]. Formant locations have been shown to correspond closely with the spoken vowels in human speech [58-59]; however, it is more difficult to employ them in the identification of individual speakers.

Previous studies have shown that the rate of change of the fundamental frequency and formant frequencies of human speech is limited due to constraints on muscle movements [58-59]. Flege et al performed experiments demonstrating a maximum average tongue velocity of 35.8 m/s when transitioning from one vowel to another [61]. In light of this result, an algorithm can be designed to check the rate of change of individual utterances over time. Checking for continuity in this way can potentially identify the target speaker amongst competing speech signals. Many algorithms today utilize some form of continuity criterion to track speech signals, including pitch tracking algorithms as well as source separation algorithms [45-55]. One particular example is a robust formant tracking algorithm using a time-varying adaptive filter bank to track individual formant frequencies, developed by Mustafa and Bruce [45].

Many indicators of individual speakers involve time-domain data acquisition, such as speech speed, accent, and other characteristics [51-55]. For example, in their study Liu et al mentioned that the performance of a speech recognition system degrades when the speaker's accent is different
from that in the training set, and that both accent-independent and accent-dependent recognition require more training data [53]. Given the real-time computation criterion as well as the need for fast classification of previously unknown speakers, these time-domain characteristics cannot be reasonably incorporated into this model. For example, while the speed of speech can be a powerful indicator of individual speakers, in order to identify such a measure and classify particular speech signals according to saved profiles, at least several windowed signals need to be examined. Applied to a real-time design this would represent a significant loss of speech information at the beginning of an utterance before the individual can be identified, causing further confusion to the listeners. We next identify some current techniques used for speech identification.
2.5.1 Pitch Tracking
Identification of the fundamental frequency or pitch can be summarized into two groups of methods: frequency domain peak finding techniques, and time domain autocorrelation techniques [45-50].
2.5.1.1 Peak Detection Techniques
Detecting a peak within the speech fundamental frequency range in the frequency domain is arguably the most intuitive method of finding the pitch of a speech signal. However, such methods have obvious drawbacks, such as reliance on the frequency resolution of the transformed signal as well as the proper selection of the peak when multiple peaks of varying amplitudes are present in the signal [47-49].
Figure 2.5.1 Demonstration of peak detection in frequency domain. The highest peak in the intended frequency range corresponds to the pitch in that windowed sound segment. Considerations such as harmonics checks and intelligent peak selection are an integral part of frequency domain peak detection.

As shown in Figure 2.5.1, peak detection is the process of identifying a peak within the speech fundamental frequency range in the frequency domain, often by finding the local maxima. Since the human speech fundamental frequency ranges from 80 to 260 Hz, it is possible that higher harmonics of a pitched speech sound lie within the same range as the fundamental frequency itself. Several mechanisms can be utilized to minimize this detection error, including checking for a potential harmonic structure upon each detection [47-49].

A novel pitch tracking method involves first using amplitude modulation (AM) demodulation to extract the time domain envelope of a signal, and then applying peak finding to identify the pitch from the envelope. Hu et al have proposed a speech separation algorithm using pitch tracking alongside amplitude modulation [49]. In this technique, the speech signal is treated as an amplitude modulated signal, and a low pass filter is applied to demodulate the target signal and extract the time domain envelope. A fast Fourier transform is then applied to the envelope. Finally, the algorithm looks for the biggest non-harmonic peak in the frequency range of 80 to 260 Hz. A minimal sketch of the frequency-domain peak-picking step is given below.
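A bare-bones sketch of the peak-picking step is shown below; the 80-260 Hz range follows the assumption stated above, while the harmonic-consistency check and the envelope extraction step are omitted here for brevity (an envelope detector sketch appears in Section 2.5.1.3).

```python
import numpy as np

def pitch_by_peak(frame, fs, f_lo=80.0, f_hi=260.0):
    """Return the frequency of the largest spectral peak inside the speech pitch range."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)          # restrict the search to 80-260 Hz
    return freqs[band][np.argmax(spectrum[band])]     # location of the highest peak
```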
2.5.1.2 Autocorrelation
Autocorrelation refers to the cross-correlation of a signal with itself. In other words, it measures the similarity within the signal as a function of time delay. Autocorrelation methods for pitch identification have been shown by previous studies to achieve high accuracy, and many of the top pitch identification algorithms today utilize a variation of autocorrelation as their primary method [45-50]. For example, the formant tracking algorithms proposed by Mustafa et al [45] and Wu et al [47] both used autocorrelation. The general mathematical concept of autocorrelation is summarized below:

$$R_{xx}(\tau) = \sum_{n} x(n)\, x^{*}(n-\tau)$$

where $x(n)$ is the input signal and $x^{*}(n)$ is the complex conjugate of the input signal.
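A minimal autocorrelation-based pitch estimator consistent with the definition above is sketched here; the lag search is restricted to the assumed 80-260 Hz speech range.

```python
import numpy as np

def pitch_by_autocorr(frame, fs, f_lo=80.0, f_hi=260.0):
    """Estimate pitch as the lag that maximizes the autocorrelation R_xx(tau)."""
    x = frame - np.mean(frame)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]     # R_xx for non-negative lags
    lag_min, lag_max = int(fs / f_hi), int(fs / f_lo)    # lags spanning 260 Hz down to 80 Hz
    best_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / best_lag
```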
2.5.1.3 Amplitude Modulation
Previous study by Hu et al has suggested that human speech has components similar to an AM signal, where the modulating wave contains the fundamental frequency, or pitch [49]. It is therefore plausible to apply amplitude demodulation to speech signals to extract only the pitch information contained, and to use this information for pitch tracking. Amplitude modulation works by varying the amplitude of the transmitted signal in relation to the information being sent. Let $m(t)$ represent an arbitrary waveform that is the message to be transmitted, for example a sinusoid:

$$m(t) = M\cos(\omega_m t)$$

where the constant $M$ represents the largest magnitude of the message and $\omega_m$ is the message frequency. It is assumed that $\omega_m \ll \omega_c$, where $\omega_c$ is the carrier frequency. It is also assumed that $M < A$, where $A$ is the carrier amplitude. Amplitude modulation is then formed by the product:

$$y(t) = \big[A + m(t)\big]\cos(\omega_c t)$$

where $A$ represents the carrier amplitude. $y(t)$ can also be written in the form:

$$y(t) = A\cos(\omega_c t) + \frac{M}{2}\cos\big((\omega_c + \omega_m)t\big) + \frac{M}{2}\cos\big((\omega_c - \omega_m)t\big)$$

The resulting modulated signal has three components: a carrier wave and two sinusoidal waves (sidebands), whose frequencies are slightly above and below $\omega_c$.
The process of amplitude modulation is shown below in Figure 2.5.2 and Figure 2.5.3.
Figure 2.5.2 Demonstration of amplitude modulation. The original signal M(ω) is shown in the top panel in frequency domain. After being amplitude modulated with a carrier frequency of ωc, the resulting signal is shown in the bottom panel.
Figure 2.5.3 Demonstration of amplitude modulation in time domain. The top panel contains a high frequency carrier signal, the middle panel contains the lower frequency modulating sinusoidal wave. After performing amplitude modulation using these two signals, the bottom panel depicts the resulting signal, noticeably having the time domain envelope resembling that of the original modulating signal. Amplitude demodulation is often achieved by applying an envelope detector to the signal. For analog circuitry, a typical circuit diagram is shown below in Figure 2.5.4. For digital purposes, an envelope detector can be achieved through the use of a low-pass filter. The ideal effect of an envelope detector is shown below in Figure 2.5.5.
Figure 2.5.4 A circuit diagram of a basic envelope detector.
Figure 2.5.5 Demonstration of the effect of envelope detectors on time domain signals. The blue wave represents the original signal, and the red line represents the output of the envelope detector.
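In the digital case mentioned above, an envelope detector can be sketched as full-wave rectification followed by a low-pass filter; the cutoff frequency and filter order below are assumed illustrative values.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def envelope(signal, fs, cutoff=300.0, order=4):
    """Digital envelope detector: full-wave rectification followed by low-pass filtering."""
    rectified = np.abs(signal)                  # full-wave rectification
    b, a = butter(order, cutoff / (fs / 2))     # low-pass Butterworth filter (normalized cutoff)
    return filtfilt(b, a, rectified)            # zero-phase filtering preserves envelope timing
```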
2.5.2 Formant Tracking
Formants are the distinguishing or meaningful frequency components of human speech; the information that humans require to distinguish between vowels can be represented by the formant locations [60-61], as demonstrated in Figure 2.5.6 below. In speech science and phonetics, formant is also used to mean an acoustic resonance of the human vocal tract. Formant tracking is an often-used technique in speech analysis: since formant locations correspond to the vowels spoken in human speech, identifying these vowels can be useful in identifying the sound source [60-61]. One of the signal processing tools widely used to extract formant information from a speech segment is linear predictive coding (LPC). LPC represents the spectral envelope of a digital speech signal in compressed form using the information of a linear predictive model [60-76]. Since this technique has been widely adopted and thoroughly researched, it will be applied in our algorithm to obtain formant information in speech segments. A brief sketch of LPC-based formant estimation follows Figure 2.5.6 below.
Figure 2.5.6 Demonstration of formants especially in relation with higher harmonics. Formants are the peaks in the envelope of the harmonic contents within the frequency domain representations of a segment of sound.
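A brief sketch of LPC-based formant estimation is given below: LPC coefficients are obtained from the autocorrelation (Yule-Walker) equations, and the angles of the resulting polynomial roots are converted to frequencies. The LPC order and the near-DC rejection threshold are assumed values, and practical formant trackers add bandwidth checks and continuity constraints.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def formants_lpc(frame, fs, order=12):
    """Estimate candidate formant frequencies from the roots of an LPC polynomial."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]    # autocorrelation lags 0..order
    a = solve_toeplitz(r[:order], r[1:order + 1])                     # Yule-Walker LPC coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))                     # roots of A(z) = 1 - sum(a_k z^-k)
    roots = roots[np.imag(roots) > 0]                                 # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))               # root angles -> frequencies in Hz
    return freqs[freqs > 90]                                          # drop near-DC roots; rest approximate formants
```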
2.5.3 Continuity of Pitch and Formant
Previous studies have shown that the rate of change of speech pitch as well as formant locations is limited by the movement of the relevant vocalization muscles [58-59, 62-75]. For example, Flege et al found a maximum average tongue velocity of 35.8 m/s when transitioning from one vowel to another, as mentioned above [61]. The rate of speech, or the rate of formant transition, is thus limited by the physical changes in the human vocal cavities. In light of this finding, an algorithm can be designed to track an individual utterance over time despite interfering signals. Many algorithms today utilize some form of continuity criterion to track speech signals, including pitch tracking algorithms as well as source separation algorithms [45-55]. However, significantly discontinuous signals such as long-duration human speech (with breaks between words, as well as breaks generated by certain consonants) pose a challenge to applying continuity checks. The difficulty associated with such situations is illustrated below.
Figure 2.5.7 Spectrogram of an Utterance in English

Figure 2.5.7 above demonstrates the continuity of speech features in an utterance. The fundamental frequency and harmonics are noticeably continuous throughout the utterance, with the exception of stop consonants and the gaps between words, an example of which occurs at 1.5 seconds in the spectrogram. This continuity can be crucial for tracking a target speaker amidst interfering speech signals. Previous studies have been able to track and separate multiple speakers using continuity [45, 47, 49, 54, 61-76]; one such example is the formant tracking algorithm proposed by Mustafa and Bruce mentioned before [45]. However, most of these algorithms are limited to continuous speech segments; that is, there is no significant pause or break in the speech segment of the target speaker. For hearing aid applications this ideal situation is next to impossible, since long breaks between utterances are bound to occur repeatedly in a live conversation. The process of identifying and tracking a speaker through a relatively long period of time is difficult because interfering signals can easily be mistaken for the target signal; this difficulty is illustrated by Figure 2.5.8 below.
Figure 2.5.8 Difficulty in Prolonged Target Speaker Separation. The blue line is the intended speech, the red line is the competing speech. A is the actual signal containing both speech signals, B is the mixed sound signal broken into processing windows, and C is what a typical voice identification algorithm would detect.

As shown in Figure 2.5.8 above, a received signal contains both the targeted speech (blue) as well as interfering speech (red). The signals are received in windowed segments to allow real-time sound processing and output, and within each window the target speech signal is extracted using the assumptions and methods mentioned above (for example, loudness). However, during the windowed segments in which the target speaker does not speak, the extraction process would mistake the interfering signal for the target signal, resulting in a final sound output that can be confusing to the subjects. Research on this problem is sparse in the current literature; however, we propose an alternative application of the continuity criterion in the form of adaptive noise banks. This procedure will be discussed in more detail in the system implementation section. A simple frame-to-frame continuity check of the kind referred to above is sketched below.
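One simple form of the continuity criterion referred to above is sketched below: a new pitch estimate is accepted only if it stays within an assumed maximum rate of change of the previously accepted value, and is otherwise treated as coming from an interfering source or a pause. The frame step and rate limit shown are illustrative assumptions rather than values taken from the cited studies.

```python
def continuity_filter(pitch_track, frame_step_s=0.01, max_rate_hz_per_s=400.0):
    """Reject pitch estimates whose frame-to-frame change exceeds a plausible rate.

    pitch_track: per-frame pitch estimates in Hz (0 or None for unvoiced frames).
    Returns a track in which implausible jumps are replaced with None.
    """
    accepted, last = [], None
    for f0 in pitch_track:
        if not f0:                                                    # unvoiced frame: keep the gap
            accepted.append(None)
            continue
        if last is None or abs(f0 - last) <= max_rate_hz_per_s * frame_step_s:
            accepted.append(f0)                                       # continuous with previous estimate
            last = f0
        else:
            accepted.append(None)                                     # likely an interfering speaker
    return accepted
```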
Chapter 3 : Assumptions and Evaluation Metrics

3.1 Auditory Attention and Assumptions

Auditory attention refers to the ability of normal hearing human subjects to quickly "attend" to a particular signal while suppressing all other signals. While the suppression of noise and amplification of signal can also be done digitally, the selection decision, and therefore the intention itself, is very difficult to replicate in an algorithm. Such decisions are the result of complicated cognitive processes that determine the importance of the message and/or a person's subjective interest [8-17], and are very difficult to incorporate into a real-time algorithm that targets universal auditory processing. However, one can argue that the selective hearing process itself is the result of auditory attention; hence a criterion for auditory attention must be assumed in order for our algorithm to provide the intended functionalities. Neuhoff et al performed experiments with participants listening to sentences of different loudness levels, and showed that speech loudness is an important factor that heavily influences auditory attention [13]. Due to the difficulty of analyzing speech content and providing an objective metric for subjective interest, in this study we will assume loudness to be the most prominent determinant of auditory attention: that is, the loudest signal in a mixture of signals is defined to be the intended signal.

Another necessary parameter governed by auditory attention is the desired direction of hearing. It is often the case that when people speak to each other in a crowded environment, their auditory attention is expressed by turning their head to face the speaker they are currently listening to [1-6, 27]. Best et al showed in their research that the effectiveness of auditory spatial attention improves when facing the source, and improves even more as the listener keeps attention focused on a single direction [77]. Pichora-Fuller et al have also shown in their study that aging has a negative impact on spatial hearing [78]. Because of this commonly observed phenomenon, it can be reasonably assumed that the intended signal comes from approximately in front of the subject. Through beamforming techniques and more than a single microphone, it is relatively easy to amplify only the sound coming from in front of the listener.
In order to solidify the evaluation of our final algorithm, we would like to set quantitative goals for each of our criteria. Killion et al have shown that normal hearing subjects require an estimated 5 dB SNR between the targeted speaker and interfering speech signals to sufficiently attend to the target speaker [71]. Their experiment was performed using a female speaker reading 360 IEEE high-predictability sentences at varying SNR, aiming to achieve 50% correctness. While Killion et al found that an SNR of 2 dB is sufficient for normal hearing subjects to achieve 50% accuracy while listening to high-predictability daily sentences, the 5 dB SNR margin is more suitable for our project since we aim for more than 50% accuracy and must accommodate uncommon speech. Bentler has suggested in her study that a typical loud noise environment has an SNR close to 0 dB [72]. She performed speech-in-noise tests on IEEE sentence lists, varying presentation level (53 dB and 83 dB SPL) as well as SNR (0, 5, 10, 15 dB). Zero dB SNR, in reference to human selective hearing, means that the total energy contained in the interfering crowd noise is equal to the sound energy contained in the target speaker signal. Based on these two previous findings, our algorithm will attempt to provide a 5 dB SNR increase to an original sound file with 0 dB SNR. That is, the SNR in our target situation would be near 0 dB, and the resulting processed signal should have an SNR of greater than 5 dB. In addition, to explore the effect of our algorithm in different crowd noise settings, we will also investigate 5 dB and 10 dB SNR scenarios.
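For reference, a 0 dB (or any other) SNR test condition can be constructed by scaling the crowd noise so that its average power stands in the requested ratio to that of the target speech; a minimal sketch with assumed variable names is shown below.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the mixture has the requested speech-to-noise ratio in dB."""
    noise = noise[:len(speech)]                      # assumes the noise sample is at least as long
    p_speech = np.mean(speech ** 2)                  # average power of the target speech
    p_noise = np.mean(noise ** 2)                    # average power of the crowd noise
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: mixture = mix_at_snr(target_speech, crowd_noise, snr_db=0)   # the near 0 dB target case
```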
3.2 Objective Evaluation Metrics

We have investigated using objective signal processing metrics to provide further confirmation of speech intelligibility. Typical signal processing metrics such as the Signal to Noise Ratio (SNR) do not provide a sufficient indication of speech intelligibility in audio signal processing [79-83]; this is because speech comprehension is a complex process, and slight variations in particular locations of a speech segment can often lead to significant misunderstanding. For example, commercial noise reduction algorithms can often provide significant SNR improvements in speech signals, reducing environmental noise at the cost of signal quality. Distortions and digital artifacts are introduced into the signal, and as a result the processed signals are often impossible to comprehend despite having a significant reduction of noise [84].

Several studies have investigated various metrics that can reasonably indicate speech intelligibility in audio signal processing [79-83]. The log likelihood ratio (LLR) has been identified as one of the most robust metrics by these studies. We have conducted small scale experiments on original and altered speech segments and confirmed the validity of using the LLR as an objective signal processing metric for speech intelligibility. The underlying principles of several evaluation metrics as well as details of our experiments are described in the subsections below.
3.2.1 Underlying Principles
Hu et al [79] have conducted experiments to investigate how well various metrics can reflect the intelligibility of speech information in a processed audio signal. Amongst these metrics, Hu has suggested that the log likelihood ratio (LLR) alongside the frequency-weighted segmental signal to noise ratio (fwSNR) are the two metrics that can best reflect intelligibility in a processed speech signal while being relatively easy to implement. The LLR is an LPC-based measure defined as:

$$d_{LLR}(\vec{a}_p, \vec{a}_c) = \log\left(\frac{\vec{a}_p R_c \vec{a}_p^{T}}{\vec{a}_c R_c \vec{a}_c^{T}}\right)$$

where $\vec{a}_c$ is the LPC vector of the original speech signal frame, $\vec{a}_p$ is the LPC vector of the enhanced speech frame, and $R_c$ is the autocorrelation matrix of the original speech signal. Only the smallest 95% of the frame LLR values were used to compute the average LLR value [79]. The segmental LLR values were limited to the range of 0 to 2 to further reduce the number of outliers. The frequency-weighted segmental SNR (fwSNR) was computed using the following equation:

$$\text{fwSNR} = \frac{10}{M}\sum_{m=0}^{M-1}\frac{\displaystyle\sum_{j=1}^{K} W(j,m)\,\log_{10}\frac{X(j,m)^2}{\big(X(j,m)-\hat{X}(j,m)\big)^2}}{\displaystyle\sum_{j=1}^{K} W(j,m)}$$

where $W(j,m)$ is the weight placed on the jth frequency band, $K$ is the number of bands, $M$ is the total number of frames in the signal, $X(j,m)$ is the weighted (by a Gaussian-shaped window) clean signal spectrum in the jth frequency band at the mth frame, and $\hat{X}(j,m)$ is the weighted enhanced signal spectrum in the same band. For the weighting function, we considered the magnitude spectrum of the clean signal raised to a power:

$$W(j,m) = X(j,m)^{\gamma}$$

where $X(j,m)$ is the weighted magnitude spectrum of the clean signal obtained in the jth band at frame m and $\gamma$ is the power exponent, which can be varied for maximum correlation. The spectra $X(j,m)$ were obtained by dividing the signal bandwidth into either 25 bands or 13 bands spaced in proportion to the ear's critical bands. The 13 bands were formed by merging adjacent critical bands. The weighted spectra were obtained by multiplying the previously obtained spectrum with overlapping Gaussian-shaped windows [30-31, 79] and summing up the weighted spectra within each band. Prior to the distance computation, the clean and processed FFT magnitude spectra were normalized to have an area equal to one. This normalization was found by previous studies to be critically important. A frame-level sketch of the LLR computation is given below.
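The sketch below follows the LLR definition above at the level of a single frame; the LPC order and framing are assumed, and in practice the frame selection (keeping the smallest 95% of values) and range limiting described above would be applied when averaging over a whole signal.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

def lpc_and_autocorr(frame, order=10):
    """Return LPC coefficients [1, -a1, ..., -ap] and the (order+1)x(order+1) autocorrelation matrix."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a)), toeplitz(r[:order + 1])

def llr_frame(clean_frame, processed_frame, order=10):
    """Log likelihood ratio between a clean and a processed frame (lower is better)."""
    a_c, R_c = lpc_and_autocorr(clean_frame, order)       # LPC vector and autocorrelation of the clean frame
    a_p, _ = lpc_and_autocorr(processed_frame, order)     # LPC vector of the enhanced frame
    return np.log((a_p @ R_c @ a_p) / (a_c @ R_c @ a_c))
```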
3.2.2 Experiments on Various Objective Evaluation Metrics
We have carried out in-lab experiments by adding various noise conditions to a reference signal and observing the outcome of the evaluation metrics on these altered signals. We first performed experiments on synthetic vowel signals to eliminate the variation caused by the complexity of human speech; we then performed similar experiments on recorded speech segments to observe the effects in a less controlled but more realistic environment.
3.2.2.1 Experiments on Synthetic Vowels
A combined synthetic vowel sound segment is created. The first vowel is generated with a 220 Hz fundamental frequency and a frequency profile that emphasizes formant locations mirroring the vowel "ah"; it is played for one second at a 44100 Hz sampling rate, followed by a 1 second silent interval. The second vowel then follows with a 240 Hz fundamental frequency and a formant structure resembling "ah", playing for 1 second, and is followed by another 1 second silent interval. Finally, the third vowel is created with a 220 Hz fundamental frequency and a formant structure resembling the vowel "o". The combined synthetic sound file is constructed this way to take into consideration the difference between voiced and silent intervals, the difference between human-like sounds of different pitch, and the difference between human-like sounds of different formant structures. The spectrograms of the above mentioned sound segments are shown in Figure 3.2.1.
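A simple way to generate such a test segment is additive harmonic synthesis, where harmonics of the chosen fundamental are weighted by bumps centred on the desired formant frequencies. The sketch below illustrates the idea; the formant frequencies, bandwidths, and spectral floor are textbook-style approximations chosen only for illustration and are not the exact profiles used to produce Figure 3.2.1.

```python
import numpy as np

FS = 44100  # sampling rate (Hz)

def synth_vowel(f0, formants, dur=1.0, fs=FS):
    """Additive synthesis: harmonics of f0 weighted by Gaussian bumps at the formant frequencies."""
    t = np.arange(int(dur * fs)) / fs
    x = np.zeros_like(t)
    for k in range(1, int(0.45 * fs / f0)):            # harmonics up to roughly 0.45 * fs
        f = k * f0
        # A small floor keeps every harmonic weakly present; formant regions are emphasized
        amp = 0.05 + sum(np.exp(-0.5 * ((f - fc) / 120.0) ** 2) for fc in formants)
        x += amp * np.sin(2 * np.pi * f * t)
    return x / np.max(np.abs(x))

silence = np.zeros(FS)                                  # 1 second of silence
segment = np.concatenate([
    synth_vowel(220, [730, 1090, 2440]),                # "ah"-like formants at 220 Hz
    silence,
    synth_vowel(240, [730, 1090, 2440]),                # "ah"-like formants at 240 Hz
    silence,
    synth_vowel(220, [500, 900, 2400]),                 # "o"-like formants at 220 Hz
])
```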
Figure 3.2.1 Spectrogram of a vowel-silence-vowel-silence-vowel synthetic sound segment.
Figure 3.2.2 Spectrogram of the synthetic sound segment shown in Fig 3.2.1 with the addition of crowd noise.
Figure 3.2.3 Spectrogram of the synthetic sound segment shown in Fig 3.2.1 with the distortion of harmonic contents (signal distortion).
Figure 3.2.4 Spectrogram of the synthetic sound segment shown in Fig 3.2.1 with the addition of digital artifacts.
Three different noise conditions are investigated. The first is environmental crowd noise added to the original signal at a 3 dB signal to noise ratio (SNR); the environmental crowd noise is taken from an online sound sample database. The second noise condition takes a noise sample from the above mentioned environmental crowd noise and applies noise gating to the original clean synthetic signal using this noise sample. This manipulation introduces distortion to the signal, zeroing the energy level at various frequencies, as shown in Figure 3.2.3. The third noise condition involves adding digital artifacts to the original clean synthetic signal. The digital artifacts are obtained by applying noise gating to the above mentioned environmental crowd noise using a section of itself as the noise sample, as shown in Figure 3.2.4. Four metrics are then used to score the three noise conditioned synthetic sound segments against the original clean synthetic signal. In addition to the LLR and fwSNR measures suggested by Hu et al, root mean squared error (RMSE) and subjective intelligibility scores given by 5 volunteers are also included. The results are shown below in Figure 3.2.5. Among the three noise conditions, the subjective scores show that signal distortion introduces the biggest impairment to the intelligibility of a speech signal, followed by the environmental noise condition; digital artifacts impair intelligibility the least according to the subjective scores. Based on this observed trend, the LLR scores agree best with the subjective scores. However, this preliminary investigation is of very small scale and was performed on a controlled synthetic speech signal. A further investigation based on larger sample sizes and recorded speech signals is described below.
Figure 3.2.5 Results of various intelligibility evaluation metrics in different noise conditions for a synthetic sound segment.
3.2.2.2 Experiments on Recorded Speech Samples
Several speech segments around 20 seconds in length are taken from audio book samples. Three different noise conditions are added to the original signals, similar to the synthetic sound experiment above: environmental noise, artifacts, and distortion. RMSE, LLR, and fwSNR calculations are performed on the noise conditioned signals. Furthermore, subjective scores from 5 volunteers were also collected for all the noise conditioned signals. The results are shown below in Figure 3.2.6. As can be seen, the metrics comparison performed on recorded speech signals shows results similar to those obtained on synthetic vowels. Signal distortion introduces the most significant intelligibility loss according to the subjective scores, and LLR reflects this fact. Future experiments can be carried out to further investigate the factors that influence the intelligibility of speech, but for the scope of this thesis this result provides enough confidence for us to select LLR as the objective metric in this study.
Figure 3.2.6 Results of various intelligibility evaluation metrics in different noise conditions for recorded speech signals.
3.2.2.3 Comparison between Optimized Cases
While previous studies confirmed LLR as one of the best metrics for reflecting subjective speech intelligibility, we also compared outputs specifically optimized for LLR and for fwSNR to see whether LLR is indeed the better metric in our particular situation. In an investigation into the noise gating technique, we performed experiments in which various noise gating parameters were varied; the details of this experiment are discussed in the next chapter when the implementation of noise gating is explained. For each parameter variation, an LLR score and an fwSNR score were calculated. As the parameters varied, the general trends of the two intelligibility metrics were often similar, but there were also differences. Comparing the spectrogram of the parameter selection that maximized a speech signal's fwSNR score with that of the parameter selection that maximized the same signal's LLR score, we can see the differences qualitatively. The spectrograms of the same processed sound segment optimized by fwSNR and by LLR are shown in Figure 3.2.7 and Figure 3.2.8 respectively. It is apparent from the figures that the fwSNR metric favors output sounds with less digital artifact at the expense of original signal integrity and formant structure, whereas the LLR metric emphasizes the preservation of the formant structure. From subjective listening tests on these two output sound files, we determined that LLR is a slightly better reflection of intelligibility. More experiments and objective data collection can be done in the future to provide more reliable proof for this claim, but for the scope of this thesis such results provide sufficient confidence for our choice of metric.
Figure 3.2.7 Spectrogram of the noise gating processed signal whose parameters are optimized by fwSNR evaluation metric.
Figure 3.2.8 Spectrogram of the noise gating processed signal whose parameters are optimized by LLR evaluation metric.
Chapter 4 : System Implementation
The desired direction and target of auditory attention are first determined based on the two assumptions mentioned in the previous section. The first assumption is that the desired direction of hearing is directly in front of the listener. The second assumption is that the loudest continuous human speech in this desired direction is the target of the subject's auditory attention. The sound received from the microphones is combined according to beamforming principles, assuming the target speaker is in front of the listener. The sound signal is then transformed into the frequency domain and processed through parallel noise gating using noise samples stored in the adaptive noise bank. Pitch tracking and formant tracking are performed on the sound signal, and continuity checks are applied to identify whether it is a target speech signal or noise. If the sound segment is identified as noise, the original unprocessed segment is added to the noise bank and some of the parameters used in the algorithm are slightly adjusted. The final processed sound is output to the listener while the cycle repeats. The overall process is illustrated by Figure 4.1 and Figure 4.2 below, and the detailed implementation of each individual process is discussed in the subsections that follow.
Figure 4.1 Overview of the algorithm (1)
Figure 4.2. Overview of the algorithm (2). Y is the input noisy sound signal, where x is the intended signal and n is the crowd noise. H is the Fourier transformation, and P is the noise gating process. VAD stands for voice activity detection.
4.1 Directional Hearing Implementation
In order to implement a directional sound input module, we first aim to investigate whether a microphone array provides a significant enough enhancement to directional hearing, and whether alternative designs are plausible. A five-element omnidirectional microphone array circuit has been developed in our lab according to Delay and Sum principles [27-28], in accordance with the design criteria listed in the introduction. Due to the varying wavelength of the incoming signal, certain frequencies are influenced by the microphone array more than others. We have conducted a preliminary experiment in the lab to measure the beamformer's steering response. A steering response differs from a directional response, but the two concepts are related. In a measurement of directional response, the beamformer look direction is fixed while the signal source is moved across the arc, thus probing the beamformer's sensitivity to signals from various directions. In contrast, a steering response is generated by changing the look direction of the beamformer while the signal source is fixed.
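For reference, a minimal NumPy sketch of the Delay and Sum principle is given below. It illustrates only the beamforming operation assumed in this section, not the array circuitry developed in the lab: the microphone spacing, sampling rate, far-field linear-array geometry, and delay sign convention are assumptions made here for the example.

```python
import numpy as np

C = 343.0          # speed of sound (m/s)
FS = 44100         # sampling rate (Hz)

def delay_and_sum(mic_signals, mic_positions, look_direction_deg, fs=FS, c=C):
    """Delay-and-sum beamformer for a linear array under a far-field assumption.

    mic_signals:   (n_mics, n_samples) array of time-domain signals
    mic_positions: (n_mics,) positions along the array axis in metres
    """
    n_mics, n_samples = mic_signals.shape
    theta = np.deg2rad(look_direction_deg)
    # Per-mic steering delays (one sign convention); a plane wave from the look
    # direction is time-aligned across microphones before summation.
    delays = mic_positions * np.sin(theta) / c
    delays -= delays.min()                                 # keep all delays non-negative
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spectrum = np.fft.rfft(mic_signals[m])
        spectrum *= np.exp(-2j * np.pi * freqs * delays[m])  # delay as a linear phase shift
        out += np.fft.irfft(spectrum, n=n_samples)
    return out / n_mics

# Example: a 5-element array with 4 cm spacing steered to broadside (0 degrees)
positions = np.arange(5) * 0.04
signals = np.random.randn(5, 4096)          # stand-in for recorded microphone signals
enhanced = delay_and_sum(signals, positions, look_direction_deg=0)
```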
To characterize the performance for speech signals, we computed an Articulation Index [30-31] weighted SNR gain based on the individual frequency bands, with the result shown in Figure 4.1.2. In this method, the SNR gain of each band (measured in dB) is weighted according to the importance of that band's contribution to speech intelligibility; the weighting for each frequency band is shown in Figure 4.1.1 below. The sum of these weighted SNR gains provides a relevant metric for the enhancement of speech signals. This is a currently accepted weighting method for computing SNR gain from band-specific gains [6, 30-31].
Figure 4.1.1 Speech weighting for each frequency band as suggested by previous studies.
According to the speech frequency band weighting in Figure 4.1.1 above, a speech steering response can be obtained for the Delay and Sum microphone array system by combining the responses of each particular frequency band. The combined speech steering response is calculated as

\[
P_{\mathrm{speech}}(\theta_s) \;=\; \sum_{i} a_i\,P_i(\theta_s)
\]

where each index i represents a sub-band, \(a_i\) is the Articulation Index weight of that sub-band, \(P_i(\theta_s)\) is the steering response of sub-band i obtained with the input weight vector \(W_i\), and \(\theta_s\) is the steer direction.
The combined speech steering response for our five-microphone Delay and Sum array setup is shown below in Figure 4.1.2.
Figure 4.1.2 Speech weighted directional response of the 5-microphone delay and sum microphone array system.
From Figure 4.1.2 above, we can see that a Delay and Sum (DS) microphone array system with a five microphone setup is able to provide approximately a 6 dB difference between sound sources in front of the listener and sound sources beside the listener. However, much of this difference lies in the higher frequencies due to the limited distance between microphones. For the lower frequencies where the speech fundamental frequency and the first speech formant lie, the wavelength of the signal may exceed the maximum distance between the microphones, so the directional response provided by this microphone array setup is not significant. It was therefore expected that a DS beamformer would not provide sufficient performance, strictly due to the small distance between microphones. Lower frequency inputs, which have spatial wavelengths much larger than the aperture, are essentially unaffected. Furthermore, because the directional response varies strongly with frequency, there is audible spectral distortion of off-direction signals. As the distance between microphones is a hard constraint in our application for portability reasons, we could not compensate for the lower frequency speech information by increasing the distance. A further concern is the slightly intrusive nature of a hardware based design. While it is plausible to mount a 5-microphone setup on a pair of glasses, such a design is still clearly visible and may introduce embarrassment to its users. Therefore we have also investigated a 2-microphone setup that can be easily adapted to various devices without significant hardware constraints. This binaural setup has a microphone placed at each ear of the user and obeys the same Delay and Sum principles as the 5-microphone setup mentioned above. The largest aperture is identical between the two setups, meaning that for larger wavelength and lower frequency signals the binaural setup should perform as well as the 5-microphone setup. For high frequency content, however, the binaural setup cannot provide as much directional selectivity as the 5-microphone array. Steering responses at various frequencies were measured on the binaural setup using procedures identical to the measurements done for the 5-microphone setup, and the Articulation Index is again used to compute the combined speech response.
Figure 4.1.3 Speech steering response of a binaural microphone array setup.
Figure 4.1.3 above demonstrates the speech steering response of a binaural microphone array setup. There is a maximum of almost 3 dB difference between signals coming from sound sources in front of the user and sound sources to the side of the user. This difference is smaller than the 6 dB difference observed in the 5-microphone setup; however, the lower frequency responses of the two setups are very similar. Considering that portability and a non-intrusive design are important criteria for this study, the less visible and more user-friendly binaural design was chosen for this project over the 5-microphone design.
4.2 Pitch Tracking Implementation
In this section, the implementation of and comparisons between various pitch tracking methods are discussed. Two of the three pitch tracking methods investigated are taken directly from published code. Praat autocorrelation pitch tracking utilizes a modified and widely accepted autocorrelation method [85]. Yaapt is a recent pitch tracking algorithm known for its robustness with telephone speech, and utilizes normalized cross correlation, relying on both spectral and temporal pitch tracking [86]. The third method, AM demodulation based pitch tracking, is a relatively new approach; its implementation is therefore discussed in detail below.
4.2.1 AM Demodulation Pitch Tracking Implementation
An AM demodulation pitch tracking algorithm is implemented according to the theory presented in Chapter 2. The input speech signal is treated as an AM modulated signal; a low pass filter acts as an envelope detector that recovers the modulating wave of the signal. A frequency domain peak detection algorithm is then used to extract the peak within an 80-260 Hz range, these peaks are checked for harmonics, and finally the pitch is determined. To test the performance of the envelope detector, we apply AM modulation to synthetic signals according to the theory described in Chapter 2. We then apply our envelope detector to the AM modulated signal and obtain the modulating wave. As shown below in Fig 4.2.1 and Fig 4.2.2, a synthetic 220 Hz sinusoidal signal is modulated with a carrier frequency of 4000 Hz. The envelope detector is applied to the modulated signal, and the original signal is restored with an identical frequency profile and a different intensity level.
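A minimal SciPy/NumPy sketch of this AM demodulation pitch tracker is given below. The 80-260 Hz search range follows the text; the rectify-and-low-pass envelope detector, the 300 Hz cutoff, and the frame handling are illustrative choices rather than the exact parameters used in our implementation. The short verification at the end mirrors the test described above.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def envelope_detect(x, fs, cutoff_hz=300.0):
    """Envelope detector: rectify, then low-pass filter to recover the modulating wave."""
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, np.abs(x))

def am_pitch(frame, fs, fmin=80.0, fmax=260.0):
    """Pitch estimate of a windowed frame via AM demodulation and spectral peak picking."""
    env = envelope_detect(frame, fs)
    env = env - np.mean(env)                        # remove the DC component of the envelope
    spectrum = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)        # search only the human pitch range
    if not np.any(spectrum[band]):
        return 0.0                                  # no candidate: treat as unvoiced
    return float(freqs[band][np.argmax(spectrum[band])])

# Verification in the spirit of Fig. 4.2.1-4.2.2: a 220 Hz tone amplitude-modulating
# a 4000 Hz carrier should demodulate back to an estimate near 220 Hz.
fs = 44100
t = np.arange(int(0.1 * fs)) / fs
modulated = (1.0 + 0.8 * np.sin(2 * np.pi * 220 * t)) * np.sin(2 * np.pi * 4000 * t)
print(am_pitch(modulated, fs))
```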
Figure 4.2.1 Frequency representation of a signal shown after amplitude modulated with a 4000 Hz carrier signal.
Figure 4.2.2 Frequency representation of the amplitude modulated signal shown in Fig 4.2.1 after demodulation with an envelope detector.
The performance of the envelope detector can also be shown in time domain. Figure 4.2.3 below shows a 10 Hz synthetic sinusoidal signal modulated with a 4000 Hz carrier frequency. An envelope detector is applied to this signal and the original 10 Hz signal is obtained.
Figure 4.2.3 Performance of the envelope detector on a 10 Hz sinusoidal wave amplitude modulated by a 4000 Hz carrier frequency shown in time domain.
Having verified the envelope detector, we next verify the AM demodulation pitch tracking algorithm as a whole. Figure 4.2.4 below is the frequency profile of a windowed sample of a recorded speech signal. The envelope detector is applied to this windowed signal and the resulting demodulated signal has a frequency profile as shown in Fig 4.2.5. The original recorded signal is a vowel spoken at 220 Hz, and the demodulated signal shows a clear peak at 220 Hz.
Figure 4.2.4 Frequency representation of a recorded utterance of a vowel at 220 Hz.
Figure 4.2.5 Frequency representation of the demodulated utterance shown in Fig 4.2.4, with a clear peak at the fundamental frequency (220 Hz).
Next, we apply the AM demodulation pitch tracking algorithm to several synthetic signals, including a sinusoidal wave, a synthetic vowel, and a chirp. As can be seen in the spectrograms and the pitch tracking results shown below in Figure 4.2.6, the algorithm is able to correctly identify the pitch of the signal despite high harmonics and pitch changing over time. The algorithm is next tested on complex recorded speech signals along with two other pitch tracking algorithms (Praat autocorrelation and Yaapt); the results are shown in the next sections.
Figure 4.2.6 Spectrogram of synthetic vowel segments compared with pitch tracking results. A illustrates a signal with complex harmonics, B is the pitch tracking result of the same signal. C illustrates a chirp sound, D is the pitch tracking result of the chirp in C.
4.2.2 Comparison between Pitch Tracking Methods
One of the objectives of this thesis is to determine the most appropriate pitch extraction method for our algorithm, taking into consideration both performance and computational requirements. Previous studies have performed similar experiments to compare various pitch tracking methods, and evaluation metrics have been proposed by Zahorian et al [87]. Zahorian et al suggested evaluating pitch tracking methods against the reference pitch using various error margins and either summing up the error quantitatively (Gross Error) or counting the number of errors (Big Error). An experiment is carried out to compare different pitch tracking methods; the candidate methods are selected from a review of previous comparison studies as well as considerations of ease of implementation [46-50, 85-87]. In particular, the AM demodulation pitch tracking technique, Yaapt, and the autocorrelation pitch tracking used in Praat are taken into consideration based on previous studies. Several speech segments, each lasting approximately 20 seconds, are chosen as the stimulus. All three candidate pitch tracking methods are tested on all the speech segments at various processing window sizes and degrees of overlap. In addition, for each window size and overlap, a reference pitch value is extracted by manually viewing the Fourier transformed frequency profile of each processing window. An example of the pitch tracking results compared with each other as well as with the ground truth pitch is shown below in Figure 4.2.7.
Figure 4.2.7 Pitch tracking results of a speech signal using three different methods: AM demodulation, Yaapt, and Praat autocorrelation
For the evaluation of each method, several different error margins are used, each representing the tolerance for the detected pitch to deviate from the ground truth frequency. Big error is the term used by Zahorian et al for significant detection errors that either detect a pitch when there is none or fail to detect any pitch when a pitch is present. Gross error, on the other hand, evaluates the error at every point of pitch detection, judged against an error margin. Figure 4.2.8 below demonstrates the comparison between the three pitch tracking methods based on the various evaluation criteria. It can be seen that the AM demodulation method has the smallest big error; however, the Praat autocorrelation method performs best in the 5% error margin case. This suggests that the Praat autocorrelation method's detections are closer to the reference pitch than those of the other two methods as long as it successfully identifies the presence of a pitch. The AM demodulation method performs only slightly worse than the Praat autocorrelation method in the 5% error margin case.
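As an illustration of the scoring just described, the sketch below computes frame-wise big-error and gross-error rates from a reference pitch track and a detected pitch track. It is one possible reading of the definitions above (in particular, whether voicing mistakes also count toward the gross error is an assumption made here), with 0 used to mark unvoiced frames.

```python
import numpy as np

def pitch_error_rates(reference, detected, margin=0.20):
    """Big-error and gross-error rates in the spirit of Zahorian et al. [87].

    reference, detected: per-frame pitch values in Hz, with 0 meaning "unvoiced".
    margin: relative tolerance for the gross-error criterion (e.g. 0.20 for 20%).
    """
    reference = np.asarray(reference, dtype=float)
    detected = np.asarray(detected, dtype=float)
    ref_voiced, det_voiced = reference > 0, detected > 0

    # Big error: voicing mistakes (a pitch reported where there is none, or a missed pitch)
    big = np.sum(ref_voiced != det_voiced)

    # Gross error: frames both tracks call voiced whose deviation exceeds the margin
    both = ref_voiced & det_voiced
    gross = np.sum(np.abs(detected[both] - reference[both]) > margin * reference[both])

    return big / len(reference), gross / max(1, np.sum(both))

# Toy example: one octave error and one missed voiced frame
ref = [0, 220, 221, 223, 0, 0, 180, 182]
det = [0, 219, 442, 224, 0, 0, 181, 0]
print(pitch_error_rates(ref, det, margin=0.20))
```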
Figure 4.2.8 Comparison of pitch tracking methods using various evaluation criteria; 25 ms windows and 10 ms overlap were used for this comparison
The processing window size influences the outcome of each pitch tracking method significantly. Figure 4.2.9 and Figure 4.2.10 below demonstrate the comparison between the three methods using 10 ms, 25 ms, and 100 ms window sizes with 50% overlap. It can be seen that both the Praat autocorrelation method and the Yaapt method perform better with smaller window sizes, whereas the AM demodulation method performs slightly better with larger window sizes. This difference is likely due to the underlying difference between time domain autocorrelation and frequency domain peak finding techniques, where a bigger window size ensures higher frequency resolution for peak finding.
Figure 4.2.9 20% gross error accuracy comparison between three pitch tracking methods at various processing window sizes.
Figure 4.2.10 Big error accuracy comparison of three pitch tracking methods at various processing window sizes.
Figure 4.2.11 and Figure 4.2.12 below demonstrate the difference in the frequency domain information due to the window size. As can be seen, with a larger window the peaks are much more prominent and more easily identified. However, once a window is large enough to contain a significant pitch change, identifying the pitch with this method becomes significantly harder because multiple distinct peaks are present. As for the autocorrelation pitch tracker's decreased performance with increasing window size, similar findings are reported and explained in previous studies [85-87].
Figure 4.2.11 Frequency domain representation of a demodulated speech segment using 10 ms window size.
Figure 4.2.12 Frequency domain representation of a demodulated speech segment using 25 ms window size.
Figure 4.2.13 below demonstrates the performance of the three pitch tracking algorithms in the presence of crowd noise. The crowd noise sample is taken from online sources [88, 89, 90] and added to the original speech signal at various SNRs. The pitch tracking accuracy of all three methods is negatively affected by the addition of crowd noise; however, both the AM demodulation based pitch tracking method and the Yaapt pitch tracking method are much more robust to noise than the Praat autocorrelation pitch tracking method. One explanation for the poorer performance of Praat autocorrelation in noisy environments is that noise often smears the autocorrelation function, causing the algorithm to pick a candidate an octave above or below the true pitch.
Figure 4.2.13 Comparison between the 20% gross error accuracy of three pitch tracking methods in noise added environments with various SNR.
The above experimental results show that while all three methods are able to achieve high accuracy depending on the evaluation criteria and parameters, each method has its own strengths. Since our algorithm requires significant frequency domain signal manipulation, the window size in combination with the sampling rate has to be large enough to cover the spectrum of human speech. Taking into consideration its significantly easier implementation, lower computational cost, and better performance in the presence of crowd noise, we decided to adopt the AM demodulation pitch tracking method in our algorithm. Once a fundamental frequency (pitch) is identified in the human speech range of a windowed signal, the algorithm attempts to identify peaks approximately at the multiples of the fundamental frequency in the frequency domain. This is to ensure that the signal is indeed human voice, since many noise signals have fundamental frequencies within the human speech range but lack complex harmonics (pure tones, for example).
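The harmonic check can be as simple as verifying that spectral peaks exist near small integer multiples of the candidate fundamental. The sketch below illustrates this; the number of harmonics checked, the frequency tolerance, and the amplitude ratio are illustrative values rather than the parameters used in our implementation.

```python
import numpy as np

def has_harmonics(spectrum, freqs, f0, n_harmonics=3, tolerance_hz=15.0, ratio=0.1):
    """Check that peaks exist near integer multiples of the candidate fundamental f0.

    spectrum: magnitude spectrum of the windowed signal
    freqs:    frequency axis for the spectrum (Hz)
    """
    floor = ratio * np.max(spectrum)
    for k in range(2, 2 + n_harmonics):
        band = (freqs > k * f0 - tolerance_hz) & (freqs < k * f0 + tolerance_hz)
        if not np.any(band) or np.max(spectrum[band]) < floor:
            return False          # a harmonic is missing: likely not voiced speech
    return True

# Example: a harmonic-rich frame passes, a pure tone at the same frequency fails
fs, n = 44100, 4096
t = np.arange(n) / fs
freqs = np.fft.rfftfreq(n, 1 / fs)
voiced = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 6))
tone = np.sin(2 * np.pi * 220 * t)
print(has_harmonics(np.abs(np.fft.rfft(voiced)), freqs, 220))   # True
print(has_harmonics(np.abs(np.fft.rfft(tone)), freqs, 220))     # False
```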
4.3 Continuity Check Implementation
After the pitch tracking process, LPC analysis is performed on the signal to extract the first 3 formant locations of the windowed signal. The pitch and formant locations found in the above processes are recorded into a separate variable for up to 300 ms. This stored information is used to check for pitch and formant continuity in future processing windows. Continuity of the fundamental and formant frequencies is checked according to absolute position as well as the rate of change over the past 300 ms. A previous study by Tjaden and Weismer examined speaking-rate-induced spectral and temporal variability of formant trajectories for target words produced in a carrier phrase at speaking rates ranging from fast to slow [73]. Based on their findings, the maximum rate of change for pitch was identified as approximately 5 Hz/ms, and the maximum rate of change for the first formant as approximately 4 Hz/ms. The windowed signals that pass the continuity check are labeled as "signal"; those that do not are labeled as "noise" and stored in the adaptive noise bank. One of the difficulties in speech recognition is identifying unvoiced speech. Many algorithms use data banks of consonant information for such classifications; however, due to the real-time criterion of our design we approach this problem differently. In the case where no human speech is found in a windowed signal, the signal is either identified as unvoiced speech (such as "s" or "t") or fed back to the adaptive static noise bank so that the noise reduction procedure in the next iterations more accurately resembles the subject's environment. Determining whether a signal is unvoiced human speech or noise is done through a simple observation of its total energy across the spectrum. While this is not ideal, for the purpose of an adaptive noise bank these errors are tolerated. The combined speech identification module, including AM demodulation pitch tracking, formant location, and continuity checks, has been tested on speech signals and achieves 70%-80% correct identification of "noise" segments. Since errors often occur for consonants, where the formant structure and pitch are not clearly seen, simply using this identification system to extract only the intended signal would result in segmented speech that can confuse listeners. However, for the purpose of building a real-time adaptive noise bank, this accuracy is more than acceptable.
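A minimal sketch of the continuity logic is given below. The 300 ms history length and the 5 Hz/ms (pitch) and 4 Hz/ms (first formant) rate-of-change limits follow the text above; the 10 ms frame step and the simple frame-to-frame comparison are assumptions made for this example.

```python
from collections import deque

class ContinuityChecker:
    """Pitch/formant continuity check over a short history (a sketch of the logic above)."""

    def __init__(self, frame_step_ms=10.0, history_ms=300.0,
                 max_pitch_rate=5.0, max_f1_rate=4.0):
        self.max_pitch_change = max_pitch_rate * frame_step_ms   # Hz allowed per frame
        self.max_f1_change = max_f1_rate * frame_step_ms         # Hz allowed per frame
        n = int(history_ms / frame_step_ms)
        self.pitch_hist = deque(maxlen=n)                        # last ~300 ms of pitch values
        self.f1_hist = deque(maxlen=n)                           # last ~300 ms of F1 values

    def is_continuous(self, pitch, f1):
        """Return True if the new (pitch, F1) pair is consistent with the recent history."""
        ok = True
        if self.pitch_hist and pitch > 0 and self.pitch_hist[-1] > 0:
            ok &= abs(pitch - self.pitch_hist[-1]) <= self.max_pitch_change
        if self.f1_hist and f1 > 0 and self.f1_hist[-1] > 0:
            ok &= abs(f1 - self.f1_hist[-1]) <= self.max_f1_change
        self.pitch_hist.append(pitch)
        self.f1_hist.append(f1)
        return ok

# Example: a 10 Hz pitch step over one 10 ms frame (1 Hz/ms) passes,
# while a 170 Hz jump over one frame (17 Hz/ms) fails.
checker = ContinuityChecker()
checker.is_continuous(220.0, 700.0)
print(checker.is_continuous(230.0, 720.0))   # True
print(checker.is_continuous(400.0, 700.0))   # False
```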
4.4 Parallel Noise Gating Implementation
4.4.1 Noise Gating Parameter Optimization
The strength and performance of noise reduction using noise gating depend on the various parameters set by the algorithm, including thresholds, frequency smoothing, and noise sample selection [91-93]. Increasing the threshold generally increases the strength of noise reduction at the expense of signal quality, while increasing frequency smoothing generally reduces digital artifacts at the expense of signal resolution [56-57, 91-97]. The initial parameter values in this algorithm were found through experiments, varying each parameter and evaluating the processed 0 dB noised signals using LLR as the metric. There are three parameters that we have looked at in particular. The noise threshold level is a primary parameter used in all noise gating algorithms. It sets the threshold of each frequency bin as a multiple of the intensity at the same frequency in the noise profile, and is directly responsible for the strength of the noise reduction. The second parameter, frequency smoothing, is also frequently used in audio processing software. Instead of comparing each individual frequency bin with its counterpart in the noise profile, frequency smoothing allows several frequency bins to be summed in intensity before the comparison with the noise profile. This reduces digital artifacts and signal distortion at the expense of noise reduction strength. The third parameter is the continuity step; this parameter takes previously processed windows into consideration, comparing the current window's frequency profile to that of the previous window. For each frequency bin, the algorithm checks for peak intensities in the surrounding frequency bins, whose range is defined by the continuity step, and a comparison is carried out with a multiplier. If a significant enough peak is detected in the vicinity of the target frequency bin in the previous window, and if this peak is large enough compared to the value contained in the target frequency bin, then the contents of the target frequency bin are preserved; otherwise the bin is treated according to the noise threshold level without bias. The multiplier determines the maximum allowable sudden intensity change in and around a frequency bin; we set it to 80% for our study, and this parameter can be optimized in future studies. The continuity step parameter is used to further reduce digital artifacts in the processed signal. A batch processing procedure was written in MATLAB in which the parameters are each varied in incremental steps. The Branch and Bound optimization method is utilized, where the noise threshold level is optimized first, frequency smoothing next, and the continuity parameters last. An LLR score is computed for the output sound of each variation. It should be noted that the Branch and Bound method has limitations; most notably, it tends to miss additional maxima and minima in a parameter space. Eckstein et al have discussed such problems in their paper, in addition to providing suggestions on improving the branch and bound algorithm [98]. Since the goal of our optimization is to select an approximately optimal set of parameters for practical application, and since the Branch and Bound method is significantly less time consuming than an exhaustive search, we feel justified in choosing this method. Once the optimal parameters are found using the Branch and Bound optimization, each parameter is then individually varied in an exhaustive search within the vicinity of the previously optimized values to obtain higher precision. As the noise gating process is not linear, the parameters are not independent of each other; changing one parameter often shifts the optimized values of the others. However, an approximate trend graph can still be computed to give a set of parameter values that can be adopted for the next steps of the study. Through simulations we found that the two factors that contribute most to a change in intelligibility are the noise threshold and frequency smoothing. Since these two parameters are common to almost all commercial noise gating algorithms, their prominent role is expected. Compared to these two factors, the continuity step contributes only slightly to the intelligibility of the final sound output; this, however, could be a result of the
Branch and Bound optimization method we used. An exhaustive enumeration is needed to fully compare the significance of the parameters against each other, but for our application the Branch and Bound method gives the desired initial parameter values. The LLR scores of the same sound files processed through noise gating while varying all of the above mentioned parameters are shown below: in Figure 4.4.1 the LLR score is plotted as a function of increasing noise threshold values, in Figure 4.4.2 as a function of increasing frequency smoothing values given the optimal noise threshold, and in Figure 4.4.3 as a function of increasing continuity step values given the optimized noise threshold and frequency smoothing parameters.
Figure 4.4.1 LLR scores of noise gating output as a function of noise threshold parameter values.
Figure 4.4.2 LLR scores of noise gating output as a function of frequency smoothing parameter values, given optimal noise threshold level.
Figure 4.4.3. LLR scores of noise gating output as a function of continuity step parameter values, given optimal noise threshold level and frequency smoothing.
From the figures above, we can observe that the noise threshold is likely the most prominent of the factors investigated. This observation comes from the distinct and steady trend of the LLR score as the noise threshold is increased; it also comes from the fact that the variance in Fig 4.4.1 is noticeably smaller than that of Fig 4.4.2 and Fig 4.4.3. Varying frequency smoothing does not seem to produce significant differences in the output sound, as the variance of the LLR score is quite large compared to the changes in the mean score. We can also see in Fig 4.4.3 that a continuity step below 20 Hz per window (with a window length of 25 ms and 10 ms overlap) yields a noticeably lower LLR, most likely because frequency smoothing was optimized at 20 Hz. In light of these optimization experiments, we set our initial noise threshold multiplier to 1.2, frequency smoothing to 20 Hz, and continuity step to 20 Hz.
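To make the three parameters concrete, the sketch below applies a single-frame spectral noise gate using the values just chosen (threshold multiplier 1.2, 20 Hz frequency smoothing, 20 Hz continuity step, 80% continuity multiplier). The grouping of bins and the way the previous frame is consulted are one plausible reading of the description above, not the exact implementation.

```python
import numpy as np

def noise_gate_frame(frame_mag, noise_mag, freqs, prev_mag=None,
                     threshold=1.2, smoothing_hz=20.0, continuity_hz=20.0,
                     continuity_ratio=0.8):
    """Single-frame spectral noise gate (sketch of the threshold/smoothing/continuity parameters).

    frame_mag, noise_mag, prev_mag: magnitude spectra of the current frame, the noise
    profile, and the previous processed frame; freqs is the common frequency axis (Hz).
    """
    gated = frame_mag.copy()
    bin_width = freqs[1] - freqs[0]
    group = max(1, int(round(smoothing_hz / bin_width)))       # bins per smoothing group
    step = max(1, int(round(continuity_hz / bin_width)))       # bins searched for continuity

    for start in range(0, len(freqs), group):
        sl = slice(start, start + group)
        # Frequency smoothing: compare group energies rather than single bins
        if np.sum(frame_mag[sl]) < threshold * np.sum(noise_mag[sl]):
            keep = False
            if prev_mag is not None:
                lo, hi = max(0, start - step), min(len(freqs), start + group + step)
                # Continuity: keep the group if the previous frame had a strong peak nearby
                keep = np.max(prev_mag[lo:hi]) >= continuity_ratio * np.max(frame_mag[sl])
            if not keep:
                gated[sl] = 0.0                                  # gate the group
    return gated

# Example usage on random magnitude spectra (stand-ins for one FFT frame and a noise profile)
fs, n = 44100, 2048
freqs = np.fft.rfftfreq(n, 1 / fs)
frame = np.abs(np.random.randn(len(freqs)))
noise = np.abs(np.random.randn(len(freqs)))
gated = noise_gate_frame(frame, noise, freqs, prev_mag=None)
```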
4.4.2 Digital Artifact and Signal Distortion Investigations
Digital artifacts and signal distortion are major problems in audio signal processing, and especially in noise gating. Due to the discrete, threshold-gated nature of the noise gating algorithm, as well as the fluctuation that often occurs within short windows of speech sample spectra, noise contained in certain frequency bins is sometimes left untouched while the surrounding frequency bins are zeroed, leaving "islands" in the spectra that translate into short pitched sounds in the time domain. These islands form at random locations in the spectra and change from one processing window to the next, resulting in an almost melodic cluster of random pitched sounds that can interfere with the intelligibility of the target signal. These sounds are also very irritating to listen to. These melodic noise sounds are collectively called digital artifacts. By the same principle, the intended signal will often have some of its frequency content zeroed in the noise gating process, resulting in a signal with slightly distorted harmonic content. This distortion is critical to the impairment of speech intelligibility, introducing "muffled" or "electronic" sounding speech that is hard to comprehend. This harmonic distortion by noise gating is a form of signal distortion, as Esch et al have pointed out [99], and for which they proposed an efficient artifact noise suppression system. The initiative to make use of parallel noise gating in our algorithm arose from the fact that digital artifacts and signal distortion are major distractions from the intelligibility of a speech signal despite an increased SNR. We have investigated the behavior of the digital artifacts introduced by noise gating and found that these artifacts often appear at random locations on the spectrum depending on the spectral profile of the original noise source. The details of these investigations are discussed below. Fig 4.4.4 displays the spectrogram of a crowd noise sample taken from an online database [88, 89, 90], and Figure 4.4.5 shows the overall frequency content of this sample. Noise samples are taken from a section of this sound file, and noise gating is applied to the sound file using the above mentioned noise sample to reduce 30 dB of noise. The resulting sound file is full of digital artifacts, as shown in Figure 4.4.6 below. The digital artifacts appear to be quite random by visual inspection and subjective listening. Further analysis shows that the residual digital artifacts combine to form a spectral profile similar in shape and envelope to that of the original signal. This suggests that the digital artifacts are most likely random within the confines of the noise sample profile.
Figure 4.4.4 Spectrogram of crowd noise.
Figure 4.4.5 Frequency domain representation of the crowd noise shown in Fig 4.4.4.
Figure 4.4.6 Spectrogram of digital artifact resulted from heavy digital signal processing of a crowd noise sample.
Figure 4.4.7 Frequency domain representation of the digital artifacts shown in Fig 4.4.6.
In order to reduce random noise, one common method in digital signal processing is the repeated summation of recorded samples, where the intended signal is relatively similar in each sample but the noise varies due to its random nature [94-97]. An example of such a technique is the summation and averaging of EEG samples to enhance the intended signal. We apply a similar strategy to noise gating, where a sound segment is noise gated independently using several different noise profiles. The resulting processed signals each present digital artifacts at random locations in the spectrum, while the intended speech signal is relatively stable across all resulting signals. A summation of the independently processed signals therefore amplifies the overall intensity of the intended signal without amplifying the overall intensity of the random artifacts. This process is referred to as parallel noise gating.
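The following sketch illustrates the combination step on a single frame, using a deliberately simplified threshold-only gate so that the example stays self-contained (a fuller gate with smoothing and continuity was sketched in the previous subsection). The number of profiles and the plain averaging are illustrative choices.

```python
import numpy as np

def simple_gate(frame_mag, noise_mag, threshold=1.2):
    """Minimal threshold-only spectral gate (smoothing and continuity omitted for brevity)."""
    out = frame_mag.copy()
    out[frame_mag < threshold * noise_mag] = 0.0
    return out

def parallel_noise_gate(frame_mag, noise_bank, n_parallel=4, threshold=1.2):
    """Gate one frame against several independent noise profiles and average the results.

    noise_bank: list of magnitude spectra taken from different noise segments. Averaging
    keeps content that survives most gates (the target speech) while the randomly placed
    artifact "islands" of each individual gate are attenuated.
    """
    profiles = noise_bank[:n_parallel]
    gated = [simple_gate(frame_mag, p, threshold) for p in profiles]
    return np.mean(gated, axis=0)

# Example with synthetic spectra: a frame with periodic "harmonic" peaks and four noise profiles
rng = np.random.default_rng(0)
frame = np.abs(rng.normal(size=512)) + 2.0 * (np.arange(512) % 40 == 0)
bank = [np.abs(rng.normal(size=512)) for _ in range(4)]
enhanced = parallel_noise_gate(frame, bank)
```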
4.4.3 Parallel Processing Investigations
Small scale experiments were carried out in order to confirm our hypothesis that parallel noise gating improves the intelligibility of the processed speech signal, as well as to find the optimal number of parallel noise gating procedures. Figure 4.4.8 below shows the LLR scores of speech signals with crowd noise added at 0 dB SNR as a function of the number of parallel noise gating procedures performed. The LLR score is a reflection of speech signal intelligibility, and it rises significantly as soon as noise gating is applied to the noised signal. As the number of parallel noise gating procedures increases, the LLR score also rises, peaking at around 4 parallel procedures and staying relatively constant afterwards. Based on this experiment, we decided to perform 4 parallel noise gating procedures in our algorithm, maximizing the intelligibility improvement while conserving computational resources.
Figure 4.4.8 Log Likelihood Ratio of noised speech signals at various numbers of parallel processing noise gating; noised speech signals were obtained by adding crowd noise to clean signals at 0 dB SNR
Based on the findings from the above figure, our algorithm creates 4 instances of noise gating for each window of the signal. Each instance refers to a different section of noise sample from the adaptive noise bank. The 4 processed signals are combined into a single sound file in which the target signal is strengthened while the randomly distributed digital artifacts, along with the signal distortion of each single instance of noise gating, are relatively reduced.
Figure 4.4.9 below demonstrates the difference between a clean signal, a signal with significant crowd noise added, a noised signal processed using traditional noise gating (or binary masking), and a noised signal processed using quadruple parallel noise gating. It can be seen that the noised signal processed with parallel noise gating is significantly clearer than both the original noised signal and the noised signal processed with traditional noise gating. In particular, parallel noise gating introduces much less digital artifact and signal distortion than traditional noise gating. While each individual instance of noise gating provides approximately a 3-5 dB increase in SNR, parallel noise gating is able to provide a 5-10 dB improvement in the SNR of the processed signal with significantly reduced signal distortion.
Figure 4.4.9 Spectrograms of the same speech signal with various digital processing; A) the clean signal, B) the clean signal with crowd noise added at 0 dB SNR, C) the noised signal processed through traditional noise gating, D) the noised signal processed through parallel noise gating
Chapter 5 : Experiment and Evaluation
5.1 Experiment Design
In order to evaluate the above mentioned algorithm, it is necessary to measure the intelligibility of the output sound against that of the original unprocessed signal. While we have employed LLR as the objective metric for intelligibility, subjective listening tests are still the gold standard for intelligibility measurements. Hence we designed and conducted experiments to measure the intelligibility of speech segments before and after processing. An ethics protocol was submitted to the research ethics board (REB) with protocol ID 29500. We expected the results to demonstrate higher intelligibility for signals processed through our algorithm compared to both the original noisy signals and the same signals processed using a standard commercial noise removal algorithm. Intelligibility of a speech signal is the ability of a target population to correctly identify the linguistic information contained in the signal. While some studies require the subjects to understand the meaning of the speech, in our study it is only necessary for subjects to correctly identify the sounds. This is because complex factors such as vocabulary and individual comprehension would all influence the ability of a subject to grasp the meaning of a sentence, and these factors are outside the scope of our hearing enhancement algorithm. For this reason, we designed our experiment using the low predictability Speech Perception In Noise (SPIN) sentences, which rarely appear in daily conversation; using such sentences minimizes the advantage that individual comprehension would introduce. Scores are assigned for correct identification of both consonants and vowels to prevent the disadvantages that limited vocabulary or language exposure would introduce. The audio stimulus was presented via the sound card to a pair of stereo headphones. The sound pressure level (SPL) of the audio stimulus was calibrated to 70 dB SPL, which approximates normal conversation level and poses no danger to the subjects. Experiments were conducted in locations where the surrounding environmental sound did not exceed 40 dB SPL.
200 low predictability SPIN sentences were used to generate the sample signals; each sentence was 3-5 seconds in length and spoken by both male and female speakers. The crowd noise was taken from several online sound effect databases and normalized in acoustic energy and frequency profile. The crowd noise samples were added to the speech at 0 dB, 5 dB and 10 dB SNR; these sound files are called the "unprocessed noised signals". The unprocessed noised signals were then passed through a commercial noise removal algorithm as well as our algorithm, the outputs of which are called the "commercial processed signals" and the "target processed signals" respectively. Both the commercial noise removal algorithm and our own algorithm were tuned to reduce the noise level by 10 dB; this noise level reduction was achieved by adjusting the threshold and smoothing parameters. Two sets of unprocessed noised signals, commercial processed signals, and target processed signals were prepared for each participant, and none of the sentences in these sets overlapped. Subjects were given a trial run of noised signals, and then given the noised signal test, the traditional noise gating test, and our method test, in that order. The stimulus was presented to the subjects through headphones, with the left ear and the right ear receiving identical sound information. 10 volunteers were recruited for the experiment, 8 males and 2 females. The volunteers were between 21-27 years of age, all with normal hearing and no apparent medical conditions. Each subject filled out a questionnaire while listening to the sound file. The questionnaire consisted of two parts. Part one asked the subject to write down the sentences as he or she listened to them; each sentence was followed by a 15 second interval to allow the participant to write. A score was given only for correct identification of the last word in each sentence, and accuracy was assigned according to the phonetic components rather than the actual spelling. Part two of the questionnaire asked the subject to rate the overall sound file for clarity, intelligibility, and authenticity in comparison to their daily conversations. The definitions of clarity, intelligibility, and authenticity were explained to the subjects: clarity refers to the presence of the signal relative to the noise, intelligibility to the ability to comprehend the speech, and authenticity to the naturalness of the speech signal. Upon completion of the listening tests, scores were assigned to each test based on the accuracy of the answers. We then analyzed the scores and performed standard statistical calculations, including averages and variances, to estimate how much improvement in intelligibility is introduced by our algorithm.
5.2 Experiment Results
The algorithm has been implemented according to the design decisions described above. The obtained results demonstrate that the algorithm works as intended, providing a significant SNR increase in our primary application environments (static noise and interfering speech signals) without significant signal distortion or the addition of digital artifacts. The results of the experiments are shown below in Figure 5.2.1 and Table 5.2.1. Our method allowed subjects to achieve approximately 86.75% accuracy in the listening tests, whereas the original noised signals scored 47.75% accuracy. The traditional noise gating (binary mask) algorithm, on the other hand, provides only a slight improvement over the unprocessed signal, especially when individual variance is taken into consideration. The calculated LLR score matches the experimental data well, with our algorithm scoring significantly higher than both the original unprocessed signal and the traditional noise gating (binary masking) algorithm. For the three qualitative assessments, our algorithm provides noticeably better clarity and intelligibility than the unprocessed signals, while the traditional binary mask algorithm improves only signal clarity and compromises both intelligibility and authenticity.
Figure 5.2.1 Categorized comparison between different processing algorithms on a noised speech signal.
Table 5.2.1 Mean values and variance of scores obtained in the listening test

                        Objective Score   Clarity   Intelligibility   Authenticity
Mean       Our Method   86.75%            84.5%     78.5%             83.5%
           Traditional  55.5%             68.5%     42.5%             47.5%
           Unprocessed  47.75%            48.5%     61%               87.5%
Variance   Our Method   7.14%             8.48%     8.08%             2.78%
           Traditional  8.95%             9.28%     10.18%            9.18%
           Unprocessed  10.44%            8.78%     8.75%             11.38%
Overall, the algorithm is able to significantly suppress non-target signals (a 5-10 dB SNR improvement) while avoiding significant distortion of the target speech signal itself.
Figure 5.2.2 Categorized results comparison between female speaker and male speaker stimuli in listening tests
Table 5.2.2 Mean values of scores obtained in the listening test categorized by gender

                        Objective Score   Clarity   Intelligibility   Authenticity
Female     Our Method   87.5%             85%       76%               82%
           Traditional  60%               72%       47%               58%
           Unprocessed  52%               55%       63%               90%
Male       Our Method   86%               84%       81%               85%
           Traditional  51%               65%       38%               37%
           Unprocessed  43.5%             42%       59%               85%
Another interesting observation is that the female speech stimuli provided slightly better intelligibility and clarity than the stimuli containing a male voice. However, in all categories of comparison the difference is barely noticeable, especially when variance is taken into consideration; this lack of a clear trend is shown in Table 5.2.2 and Figure 5.2.2 above. In our particular experiment, the mean pitch difference between the female voice and the male voice is within approximately 50 Hz, with the female voice generally slightly higher in pitch. A previous study by Susanne found that the higher the pitch during speech, the easier it is for others to comprehend [99]; this conclusion arose from experiments on subjects' comprehension of male versus female speech stimuli. However, in our study we cannot make conclusive statements on this issue because the differences are too small.
Figure 5.2.3 Results comparison between listening tests performed with 0 dB, 5 dB and 10 dB SNR stimuli

Table 5.2.3 Mean values and variance of scores obtained in the listening test categorized by stimulus SNR

                        0 dB SNR   5 dB SNR   10 dB SNR
Mean       Our Method   86.75%     89.5%      93%
           Traditional  55.75%     71%        83.5%
           Unprocessed  47.75%     63.5%      82%
Variance   Our Method   7.14%      2.48%      6.1%
           Traditional  8.64%      12.9%      6.28%
           Unprocessed  10.44%     12.28%     16.6%
Figure 5.2.3 and Table 5.2.3 above show the results obtained from listening tests at varied stimulus SNR. In all three conditions (Our Method, Traditional Noise Gating Method, and Unprocessed Noised Segments) the results demonstrate increased identification of keywords as the SNR increases. The effect is particularly strong in the unprocessed noised segments, which can be compared with previous findings. Killion et al have suggested that normal hearing subjects obtain 50% correct identification of keywords at around 2 dB SNR [71]. While our results do not extrapolate exactly to this value, given the large variance in our results it is well within the confidence interval. Bentler performed speech in noise tests on lists of IEEE sentences [72] and found significant variation even between lists in the same database. Given the variance in both our results and Bentler's, the general trends obtained by the two studies agree quite well. It should be noted that such an increase in SNR is detectable for normal hearing listeners, but because of their fully functioning auditory cognition it does not provide them with as much help as it would provide listeners with hearing loss; for hearing loss patients, we expect our algorithm to provide even more noticeable benefits. The above results demonstrate that our algorithm provides a significant improvement in speech intelligibility and clarity without introducing excessive artifacts or signal distortion, confirming our hypothesis and validating the performance of the algorithm.
5.3 Computational Time
Since the output of the algorithm (processed speech) will likely be perceived together with the original speech signal by the listener, a significant temporal delay caused by the algorithm's processing time will introduce an echo. Litovsky et al mention that previous studies found 30-50 ms to be the temporal delay at which a listener finds the speech echo "annoying" [14]. This echo effect is known as the precedence effect. Since an initial delay is needed for the algorithm to perform the first computational cycle, another aim of our algorithm is to keep the computational time of each cycle below 50 milliseconds. The computational time needed to perform each cycle of the algorithm was investigated using a C implementation. After repeated tests, the average CPU elapsed time of each cycle of our algorithm is 43 ms when written in C and run on a dual core 2.80 GHz processor with 3.0 GB of RAM. This number translates into roughly 120 million clock cycles per iteration. The algorithmic complexity is 5 * n * log(n) + 8 * n, which simplifies to O(n log n) in Big O notation. This computational speed allows the algorithm to perform in real time, achieving one of our initial goals. However, further optimization can be done to reduce the number of fixed point calculations within the algorithm and significantly improve its efficiency.
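The 43 ms figure above was measured on the C implementation; purely as an illustration of how such a per-cycle budget can be checked, the sketch below times a placeholder FFT-based processing function in Python against the 50 ms limit. The window length and the stand-in workload are assumptions for the example, not the actual algorithm.

```python
import time
import numpy as np

def process_window(x):
    """Placeholder DSP workload standing in for one algorithm cycle (FFT-based, O(n log n))."""
    X = np.fft.rfft(x)
    return np.fft.irfft(X, n=len(x))

fs, window_ms = 44100, 25
x = np.random.randn(int(fs * window_ms / 1000))
n_iter = 200
t0 = time.perf_counter()
for _ in range(n_iter):
    process_window(x)
elapsed_ms = (time.perf_counter() - t0) / n_iter * 1000
print(f"average per-cycle time: {elapsed_ms:.3f} ms (budget: 50 ms)")
```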
Chapter 6 : Conclusion
In this thesis project, we have designed a real time hearing enhancement algorithm for crowded social environments that is able to provide a 5-10 dB SNR improvement with relatively little signal distortion. Each critical function within the algorithm, including pitch tracking and parallel noise gating, has been tested using both listening tests and signal processing metrics. The final performance of the combined system has also been tested both in a subjective listening experiment and with objective intelligibility metrics. Design decisions made during the thesis project have all been based on either previous studies in the field or small scale experiments performed in our lab. Based on the experimental results, the resulting algorithm provides a significant intelligibility improvement for a speech signal in a crowded social environment and performs noticeably better than current standard noise removal algorithms. As mentioned earlier in this thesis, the criteria for this algorithm include minimal visibility, real-time computational speed below 50 ms per cycle, and significant intelligibility improvement beyond 5 dB. Our algorithm has been able to achieve all three criteria. Future work on this project may involve the implementation of this algorithm on various platforms, including mobile devices and possibly FPGA boards. In addition, several sections of this project can be expanded into independent studies, including:
- A detailed comparison between time-domain autocorrelation pitch tracking methods and frequency domain peak finding methods. One particular interest is the variation of window size and overlap and their effects on the performance of the algorithms. Another interest is each method's robustness in various noise conditions, varying noise type and noise level.
- An expanded comparison between various signal processing metrics used for the evaluation of speech intelligibility. We would like to vary the conditions and provide sufficient subjective data to validate any claims. We would also like to take into consideration each metric's strengths and explain each metric's behavior through its theoretical background.
- Further investigation into crowd noise, particularly the effect of the number of interfering speakers in the background on the spectra of the crowd noise, along with variations in the spectra of crowd noise due to factors such as ambient noise and room echo.
- An investigation into the nature and behavior of both digital artifacts and signal distortion, providing more concrete associations between their occurrence and their causes.
- A detailed analysis of the parameter optimization of noise gating.
References and Bibliography
[1] S. A. Shamma, C. Micheyl. "Behind the scenes of auditory perception", Curr Opin Neurobiol 20, no. 3 (2010 Jun): 361-6.
[2] J. H. McDermott. "The cocktail party problem", Curr Biol 19, no. 22 (2009 Dec): R1024-7.
[3] A. S. Bregman. Auditory Scene Analysis: the Perceptual Organisation of Sound. Cambridge, MA: MIT Press, 1990.
[4] T. D. Griffiths, J. D. Warren. "What is an auditory object?", Nat Rev Neurosci 5, (2004): 887-892.
[5] B. G. Shinn-Cunningham. "I want to party, but my hearing aids won't let me!", Hearing J 62, (2009): 10-13.
[6] V. Hamacher, J. Chalupper, J. Eggers, E. Fischer, U. Kornagel, H. Puder, U. Rass. "Signal Processing in High-End Hearing Aids: State of the Art, Challenges, and Future Trends", EURASIP Journal on Applied Signal Processing 18, (2005): 2915-2929.
[7] D. N. Brooks. "Some factors influencing choice of type of hearing aid in the UK: behind-the-ear or in-the-ear", British Journal of Audiology 28, (1994): 91-98.
[8] C. Alain. "Breaking the wave: effects of attention and learning on concurrent sound perception", Hear Res 229, (2007): 225-236.
[9] J. S. Snyder, C. Alain. "Toward a neurophysiological theory of auditory stream segregation", Psychol Bull 133, (2007): 780-799.
[10] R. P. Carlyon. "How the brain separates sounds", Trends Cogn Sci 8, (2004): 465-471.
[11] D. Pressnitzer, M. Sayles, C. Micheyl, I. M. Winter. "Perceptual organization of sound begins in the auditory periphery", Curr Biol 18, (2008): 1124-1128.
[12] C. Alain, B. M. Schuler, K. L. McDonald. "Neural activity associated with distinguishing concurrent auditory objects", J Acoust Soc Am 111, (2002): 990-995.
77
[13] J. G. Neuhoff, K Karmer, “Pitch and loudness interact in auditory display: can the data get lost in the map?”, J Exp Psychol Appl 8,no. 1 (2002): 17-25. [14] R. Y. Litovsky, H. S. Colburn, W. A. Yost, S. J. Guzman. “The precedence effect”, J. Acoust. Soc. Am 106, no. 4 (1999): 1. [15] M. H. Davis, I. S. Johnsrude. “ Hearing speech sounds: Top-down influences on the interface between audition and speech perception”, Hearing Research 229, no. 1-2( 2007): 132147. [16] E. Sussman, I. Winkler, M. Huotilainen, W. Ritter, R. Naatanen. “Top-down effects can modify the initially stimulus-driven auditory organization”, Cognitive Brain Research 13, no. 3 (2002): 393-405. [17] B. C. J. Moore, H. Gockel. “Factors Influencing Sequential Stream Segregation”, Acta Acustica united with Acustica 88, no. 3 (2002): 320-333. [18] D. G. Sinex. “Spectral processing and sound source determination”, Int Rev Neurobiol 70, (2005): 371-398. [19] C. Micheyl, A. J. Oxenham. “Pitch, harmonicity and concurrent sound segregation: psychoacoustical and neurophysiological findings”, Hear Res, 2009. [20] Y. Shao, D. L. Wang. “Sequential organization of speech in computational auditory scene analysis”, Speech Communication, 2009. [21] B. G. Shinn-Cunningham, V. Best. “Selective attention in normal and impaired Hearing”, Trends in amplification 12, (2008): 283. [22] T. L. Arbogast, C. R. Mason, G. Kidd Jr. “ The effect of spatial separation on informational masking of speech in normal-hearing and hearing-impaired listeners”, Journal of the Acoustical Society of America 117, (2005): 2169-2180. [23] M. Girolami. “A nonlinear model of the binaural cocktail party effect”, Neurocomputing 22, (1998): 201-215.
78
[24] S. Choi, H. Hong, H. Glotin, F. Berthommier. “Multichannel signal separation for cocktail party speech recognition: A dynamic recurrent network”, Neurocomputing, 2002. [25] S. A. Shelkuno. Antennas: Theory and Practice. John Wiley and Sons Inc, 1952. [26] T. Quatieri. Discrete-Time Speech Signal Processing: Principles and Practice. Prentice Hall, 2001. [27] J. M. Kates, M. R. Weiss. “A comparison of hearing-aid array-processing techniques”, Journal of Acoustical Society of America 99, no. 5 (1996): 3138-3148. [28] N. Marrone, C. R. Mason, G. Kidd. Jr. “The effects of hearing loss and age on the benefit of spatial separation between multiple talkers in reverberant rooms”, J. Acoust. Soc. Am 124, (2008): 3064. [29] B. Widrow. “A microphone array for hearing aids”, Adaptive Systems for Signal Processing, Communications, and Control Symposium, (2000). [30] K. D. Kryter. “Methods for the Calculation and Use of the Articulation Index”, Journal of Acoustical Society of America 34, (1962). [31] J. E. Greenberg, P. M. Peterson, and P. M. Zurek. “Intelligibility-weighted measures of speech-to-interference ratio and speech system performance”. The Journal of the Acoustical Society of America 94, no. 10 (1993): 3009. [32] C. Dolph. “A Current Distribution for Broadside Arrays Which Optimizes the Relationship between Beam Width and Side-Lobe Level”, Proceedings of the IRE 34, (1946): 335-348. [33] D. Cheng. “Optimum scannable planar arrays with an invariant sidelobe level”, Proceedings of the IEEE 56, no. 11 (1968): 1771-1778. [34] P. J. Bevelacqua and C. A. Balanis. “Minimum Sidelobe Levels for Linear Arrays”, IEEE Transactions on Antennas and Propagation 55, (2007): 3442-3449.
79
[35] M. M. Khodier and C. G. Christodoulou. “Sidelobe Level and Null Control Using Particle Swarm Optimization”, IEEE Transactions on Antennas and Propagation 53, no. 8 (2005): 26742679. [36] M. W. Ho_man, T. D. Trine, K. M. Buckley, and D. J. Van Tasell. “Robust adaptive microphone array processing for hearing aids: Realistic speech enhancement”, Journal of Acoustical Society of America 96, no. 2 (1994): 759-770. [37] P. M. Peterson. “Using Linearly-constrained Adaptive Beamforming to Reduce Interference in Hearing Aids from Competing Talkers in Reverberant Rooms:, IEEE, (1987): 2364-2367. [38] L. Gri and C. Jim. “An alternative approach to linearly constrained adaptive Beamforming”, IEEE Transactions on Antennas and Propagation 30, (1982): 27-34. [39] O. Frost. “An Algorithm For Linearly Constrained Adaptive Array Processing”, Proceedings of the IEEE 60, (1972): 926-935. [40] S. Vorobyov, A. B. Gershman, Z.-Q. Luo, and N. Ma. “Adaptive Beamforming With Joint Robustness Against Mismatched Signal Steering Vector and Interference Non-Stationarity”, IEEE Signal Processing Letters 11, (2004): 108-111. [41] O. Hoshuyama, a. Sugiyama, and a. Hirano. “A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters”, IEEE Transactions on Signal Processing 47, no. 10 (1999): 2677-2684. [42] H. Cox. “Robust adaptive beamforming”, IEEE Transactions on Acoustics, Speech and Language Processing 35, (1987): 1365-1376. [43] N. K. Jablon. “Adaptive Beamforming with the Generalized Sidelobe Canceller in the Presence of Array Imperfections”, IEEE Transactions on Antennas and Propagation 34, no. 8 (1986): 996-1012. [44] A. Koutras, E. Dermatas, G. Kokkinakis. “Blind Speech Separation of Moving Speakers in Real Reverberant Environments”, Proceedings of IEEE International Conference on the Acoustics, Speech and Signal Processing vol .02, (2000).
80
[45] K. Mustafa, I. C. Bruce. “Robust Formant Tracking for Continuous Speech With Speaker Variability”, IEEE Transactions on Audio, Speech, and Language Processing 14, no.2 (2006). [46] A. Cheveigne, H. Kawahara. “YIN, a Fundamental Frequency Estimator for Speech and Music”, J. Acoust. Soc. Am. 111, no. 4 (2002). [47] M. Wu, D. Wang, G. J. Brown. “A Multipitch Tracking Algorithm for Noisy Speech”, IEEE Transactions on Speech and Audio Processing 11, no. 3 (2003). [48] T. Kitahara, M. Goto, H. G. Okuno. “Pitch-Dependent Identification of Musical Instrument Sounds”, Applied Intelligence 23, (2005): 267-275. [49] G. Hu, D. Wang. “Monaural Speech Segregation Based on Pitch Tracking and Amplitude Modulation”, IEEE Transactions on Neural Networks 15, no. 5 (2004). [50] M. Goto. “A Real-Time Music-Scene-Description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-World Audio Signals”, Speech Communication 43, (2004): 311-329. [51] C. LaRivere, “Contributions of fundamental frequency and formant frequencies to speaker identification”, Phonetica 31, no. 31 (1974): 97-185. [52] R. C. Snell, “Formant Location From LPC analysis Data”, IEEE Transaction on Speech and Audio Processing 1, (1993): 129-134. [53] W. A. Liu, P. Fung, “Fast accent identification and accented speech recognition”, IEEE Conference on Acoustics, Speech, and Signal Processing 1, (1999): 221-224. [54] R. Hasan, M. Jamil, G. Rabbani, S. Rahman. “Speaker Identification Using Mel Frequency Cepstral Coefficients”, ICECE 2004, (2004): 28-30. [55] Y. Li, D. Wang. “Separation of Singing Voice from Music Accompaniment for Monaural Recordings”, IEEE Transactions on Audio, Speech, and Language Processing 15, no. 4 (2007). [56] M. Baker, D. Logue. “A Comparison of Three Noise Reduction Procedures Applied to Bird Vocal Signals”, Journal of Field Ornithology 78, no. 3 (2007): 240-253.
81
[57] J. Arnold, J. Kahn, G. Pancani. “Audience Design Affects Acoustic Reduction via Production Facilitation”, Psychonomic Bulletin & Review 19, no. 3 (2012): 505-512. [58] S. G. Adams, G. Weismer, R. D. Kent. “Speaking Rate and Speech Movement Velocity Profiles”, Journal of Speech and Hearing Research 36, (1993): 41-54. [59] T. H. Shawker, B. C. Sonies. “Tongue Movement During Speech: A Real-Time Ultrasound Evaluation”, Journal of Clinical Ultrasound 12, no. 3 (1984): 125-133. [60] E. S. Solov’eva, V. A. Konyshev, S. V. Selishchev. “Use of Pitch and Formant Analysis in Speech Biometry”, Biomedical Engineering 41, no. 1 (2007): 34-38. [61] J. E. Flege. “Effects of Speaking Rate On Tongue Position and Velocity of Movement in Vowel Production”, J. Acoust. Soc. Am. 84, (1988): 3. [62] T. Collins. “Perceptual Formant Tracking”, ProQuest Dissertation and Theses, 2008. [63] L. Goffman, A. Smith. “Development and Phonetic Differentiation of Speech Movement Patterns”, Journal of Experimental Psychology: Human Perception and Performance 25, no. 3 (2007): 649-660. [64] M. D. McClean, S. M. Tasko. “Association of Orofacial Muscle Activity and Movement During Chenges in Speech Rate and Intensity”, Journal of Speech, Language, and Hearing Research 46, (2003): 1387-1400. [65] A. S. Medfferd, J. R. Green. “Articulatory-to-Acoustic Relations in Response to Speaking Rate and Loudness Manipulations”, Journal of Speech, Language, and Hearing Research 53, (2010): 1206-1219. [66] A. T. Neel, P. M. Palmer. “Is Tongue Strength an Important Influence on Rate of Articulation in Diadochokinetic and Reading Tasks?”, Journal of Speech, Language, and Hearing Research 55, (2012): 235-246. [67] K. Nohara, Y. Kotani, Y. Sasao, M. Ojima, T. Tachimura, and T. Sakai. “Effect of a Speech Aid Prosthesis on Reducing Muscle Fatigue”, J Dent Res 89, no. 5 (2010): 478-481.
82
[68] P. Prrier, S. Fuchs. “Speed-curvature Relations in Speech Production Challenge the 1/3 Power Law”, Journal of Neurophyssiol 100, (2007): 1171-1183. [69] C. J. Poletto, L. P. Verdum. “Correspondence between laryngeal vocal fold movement and muscle activity during speech and nonspeech gestures”,Journal of Applied physiology 97, (2004): 858-866. [70] S. Sternburg, S. Monsell, R. L. Knoll, C. E. Wright. “The Lantency and Duration of Rapid Movement Sequences: Comparison of Speech and Typewriting”, Information Processing in Motor Control and Learning, chapter 6, (1978): 118-150. [71] M. C. Killion, P. a. Niquette, G. I. Gudmundsen, L. J. Revit, and S. Banerjee. “Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners”,The Journal of the Acoustical Society of America 116, no. 4 (2005): 2395. [72] R. A. Bentler. “List equivalency and test-retest reliability of the speech in noise test”, American Journal of Audiology 9,no. 2 (2000): 84-100. [73] K. Tjaden, G. Weismer. “Speaking-rate-induced variability in F2 trajectories”, J. Speech Lang. Hear. Res. 41, (1998): 976-989. [74] S. M. Stephen, J. R. Westbury. “Speed-curature relations for speech-related articulatory movement”, Journal of Phonetics 32, (2004): 65-80. [75] P. J. Watson, R. S. Schlauch. “Fundamental Frequency Variation With an Electrolarynx Improves Speech Understanding: A Case Study”, American Journal of Speech-Language Pathology 18, (2009): 162-167. [76] D. Rendall, S. Kollias, C. Ney. “Pitch (F0) and formant profiles of human vowels and vowel-like baboon grunts: The role of vocalizer body size and voice-acoustic allometry”, J. Acoust Soc. Am 117, (2005): 944-955. [77] V. Best, E. J. Ozmeral, N. Kopčo, B. G. Shinn-Cunningham. "Object continuity enhances selective auditory attention.",Proceedings of the National Academy of Sciences 105, no. 35 (2008): 13174-13178.
83
[78] K. Pichora-Fuller, M. Kathleen, P. E. Souza. "Effects of aging on auditory processing of speech.", International journal of audiology 42, no. S2 (2003): 11-16. [79] Y. Hu, P. C. Loizou. “Evaluation of Objective Quality Measure for Speech Enhancement”, IEEE Transcation on Audio, Speech, and Language Processing 16, (2008): 229-238. [80] R. E. Crochiere, J. M. Tribolet, L. R. Rabiner. “An Interpretation of the Log Likelihood Ratio as a Measure of Waveform Coder Performance”, IEEE Transcation on Acoustics, Speech, and Signal Processing, vol. ASSP-28, (1980): 318-323. [81] L. R. Bahl, F. Jelinek, R. Mercer. “A Maximum Likelihood Approach to Continuous Speech Recongition”, IEEE Transcation on Pattern Analysis and Machine Intelligence 1, (1983): 179190. [82] A. Sarampalis, S. Kalluri, B. Edwards, E. Hafter. “Objective Measure of Listening Effort: Effects of Background Noise and Noise Reduction”, Journal of Speech, Language, and Hearing Research 52, (2009): 1230-1240. [83] K. Pichora-Fuller, B. A. Schneider, E. MacDonald. “Temporal jitter disrupt speech intelligibility: A simulation of auditory aging”, Hearing Research 223, (2007): 114-121. [84] T. Esch, P. Vary. “Efficient musical noise suppression for speech enhancement system”, ICASSP 2009. IEEE International Conference on Acoustics, Speech and Signal Processing, (2009): 4409 – 4412. [85] P. Boersma, D. Weenik, “Praat – A System for Doing Phonetics by Computer”, Eurospeech CD Software & Courseware, 1999. [86] K. Kasi, S. A. Zahorian. “Yet another algorithm for pitch tracking”, ICASSP 02, pp. 361364. Orlando. [87] S. A. Zahorian, H. Hu. “A spectral/temporal method for robust fundamental frequency tracking”, J. Acoust. Soc. Am 123, (2008): 4559. [88] “PacDV free sound effects”, PacDV, accessed July 19, 2014.www.pacdv.com/sounds/index.html.
84
[89] Sound Jay, accessed July 19, 2014. www.soundjay.com. [90] SoundBible.com,, accessed July 19, 2014. soundbible.com. [91] G. Yu, S. Mallat, E. Bacry. “Audio Denoising by Time-Frequency Block Threshold”, IEEE Transcation on Signal Processing 56, (2008): 1830-1839. [92] D. Wu, W. Zhu. “A Compressive Sensing Method of Noise Reduction of Speech and Audio Signals”, IEEE 54th International Midwest Symposium on Circuits and Systems, (2011): 1-4. [93] T. Rohdenbug, V. Hohmann, B. Kollmeier. “Subband-Based Parameter Optimization in Noise Reduction Schemes by Means of Objective Perceptual Quality Measures”, IWAENC 2006 Paris , (2006): 12-14. [94] T. Esch, M. Rüngeler, F. Heese, P. Vary. "A modified minimum statistics algorithm for reducing time varying harmonic noise." ITG-Fachbericht-Sprachkommunikation 2010 (2010). [95] Y. Takahashi, H. Saruwatari. “Theoretical Musical Noise Analysis and its Generalization for Methods of Integrating Beamforming and Spectral Subtraction Based on Higher-Order Statistics”, IEEE International Conference on Acoustics, Speech and Signal Processing, (2010): 93-96. [96] Y. Takahashi, H. Saruwatari. “Musical Noise Analysis Based on Higher Order Statistics for Microphone Array and Nonlinear Signal Processing”, IEEE International Conference on Acoustics, Speech and Signal Processing, (2009): 229-232. [97] W. Ma, M. Yu, J. Xin, S. Osher. “Reducing Musical Noise in Blind Source Separation by Time-Domain Sparse Filter and Split Bregman Method”, INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, (2010): 26-30. [98] J. Eckstein, N. Goldberg. “An improved branch-and-bound method for maximum monomial agreement”, Technical Report RRR 14, 2009. [99] S. Susanne. “Effects of Stimulus Duration and Type on Perception of Female and Male Speaker Age”, Fonetik 2005 In Proceedings of Fonetik, (2005): 87-90.
85
Appendix I

CONSENT FORM

Title of Research Project: Measurement of Speech Intelligibility before and after Digital Signal Processing

Investigators:
Principal Investigator: Mr. Brian Wang
Phone: 416-629-9904  Email: [email protected]
Supervisor: Dr. Willy Wong
Phone: 416-978-8734  Email: [email protected]

Please feel free to contact the persons above if any question or problem arises. You can contact the Office of Research Ethics at [email protected] or 416-946-3273 should you have questions about your rights as a participant.

Sponsor/Funding: The research is sponsored by the Natural Sciences and Engineering Research Council of Canada and the Institute of Optical Sciences, University of Toronto.

Background & Purpose of Research: The purpose of this study is to gather data on the speech intelligibility of signals within crowded environments and to compare the results before and after digital processing using our algorithm.

Invitation to Participate: You are being invited to participate in this research study.

Eligibility: To participate in this study you must be over 18 years of age, have normal hearing, and be proficient in the English language.
Procedures: You will participate in 6 different listening tests. You are expected to listen to the sound files and fill out a questionnaire at the same time; each sound file should be listened to only once. The questionnaire consists of two parts: part one asks you to write down each sentence that you have listened to; part two asks you to rate the overall sound file on clarity, intelligibility, and authenticity in comparison to your daily conversations. The procedures will be explained to you by the investigator, and you will then complete a trial run to better understand the process. You will be asked to confirm that you understand the procedure before beginning the experiment. Each listening test lasts approximately 1-2 minutes, and the total time required of you is no greater than 1 hour. Most testing will take place in quiet environments around the University of Toronto campus.

Voluntary Participation & Early Withdrawal: Participation in this study is voluntary, and declining to participate will not adversely affect student status, standing, academic evaluation, etc. You may decline to enter or withdraw from this study at any time.

Risks/Benefits: The maximum loudness level you will experience is 70 dB SPL. This level is within the range of normal conversation. Furthermore, we will ensure that you are comfortable with the stimuli before each session. There are no known side effects of the procedure, and we have taken all possible precautions to prevent exposure to excessive sound levels. You will not benefit directly from taking part in this study.

Privacy, Confidentiality, and Publication of Research Findings: Your identity will be kept strictly confidential. All records bearing your name will be kept in a locked file by the Principal Investigator. In any published reports of this work, you will be identified only by a coded number, and no other information will be provided that could reveal your identity.

New Findings: If information becomes available that may be relevant to your willingness to continue to participate in the study, you will be made aware of it in a timely manner by the investigator.

Compensation: There is neither immediate compensation for, nor costs incurred by, participating in this study.
Rights of Participants: You waive no legal rights by participating in this study.

Dissemination of Findings: You may request a copy of the final report.

Copy of Informed Consent for Participant: You are being given a copy of this informed consent form to keep for your own records.
Name (printed): __________________________________

Signature: __________________________________

Date: __________________________

Version Date: 10/01/2013
Appendix II
Experiment Questionnaire

A set of 10 short sentences (each approximately 3-5 seconds long) will be played to you through headphones, with a 10-second interval between sentences. You are expected to write down each sentence immediately after listening to it. Note that accuracy will be judged not by the correct spelling of the words but by the phonetic information. At the end of the questionnaire, we will ask you to qualitatively rate the entire sound file on a detailed scale.

Part One

1. ___________________________________________________________________________
2. ___________________________________________________________________________
3. ___________________________________________________________________________
4. ___________________________________________________________________________
5. ___________________________________________________________________________
6. ___________________________________________________________________________
7. ___________________________________________________________________________
8. ___________________________________________________________________________
9. ___________________________________________________________________________
10. __________________________________________________________________________
Part Two

Question One: Compared to your daily conversations in crowded social environments, how clear is the signal in the sound file? (Please select one)
1. Muffled and/or fragmented, not clear at all
2. Intermediate
3. Similar to real-life situations in a crowded environment
4. Intermediate
5. Similar to real-life situations in a less crowded environment

Question Two: Compared to your daily conversations in crowded social environments, how intelligible is the signal in the sound file? (Please select one)
1. Confusing and not intelligible at all
2. Intermediate
3. Considering the rarity of the testing sentences, I understood as much as I would in real crowded situations
4. Intermediate
5. Considering the rarity of the testing sentences, I understood more than I would in real crowded situations

Question Three: Compared to your daily conversations in crowded social environments, how natural-sounding is the signal in the sound file? (Please select one)
1. Digital and/or robotic, not natural at all
2. Intermediate
3. I can tell the differences, but it's not too bothersome/unpleasant
4. Intermediate
5. I can only tell slight differences