Multi-stream Approach for Robust Speech Recognition in Unknown Noise

Feipeng Li, Phani S. Nidadavolu, Sri Harish Mallidi, and Hynek Hermansky
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218

Abstract—Current automatic speech recognition systems are sensitive to noise due to the lack of redundancy in the system architecture. Inspired by the parallel processing of human speech perception, we developed a prototype multi-stream system for robust speech recognition. It is assumed that unknown noise of arbitrary spectral shape can be approximated by white noise or speech-shaped noise of similar level across many narrow frequency bands. The fullband speech is decomposed into multiple frequency subbands to form independent channels for parallel processing. Subband speech signals are characterized by both long-term temporal modulation and short-term spectral modulation, for the representation of slow-varying and fast-changing speech components respectively. Within each subband an ensemble of neural nets is trained in clean speech and at various signal-to-noise ratios in order to optimize information extraction in unknown noise. The information from multiple subbands is integrated by neural nets to form many processing streams. The N best streams, selected by a performance monitor, are averaged to give a more reliable estimate.

Index Terms—robust speech recognition, additive noise, multi-stream, ensemble net, performance monitor

EDICS Category: SPE-SPER

I. Introduction

The state-of-the-art automatic speech recognition (ASR) systems are still far behind human listeners in their capability to adapt to unknown noise in realistic environments. Systematic comparison between the two types of systems indicates that the machine system requires an additional 12 dB in signal-to-noise ratio (SNR) to achieve the same level of performance on consonant identification [1], and that it degrades dramatically at noise levels that have little effect on human listeners [2].

This work was supported by the Defense Advanced Research Projects Agency (DARPA) RATS project D10PC0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

The fragility of ASR systems can be attributed primarily to two factors: the variability of the speech representation and the lack of redundancy in the system architecture. Much work has been done on generating invariant features for robust speech recognition [3], [4], [5], [6], [7]. Artificial neural networks (ANN) were first introduced in [8], [9] to suppress the effect of noise on speech features. In recent years, deep neural networks (DNN) [10], [11], [12] and convolutional neural networks (CNN) [13], [14] have become widely used in ASR front-end processing, producing more informative speech representations. On the other hand, there has been little change in the structure of mainstream ASR systems. The fullband speech is still treated as a single channel for speech recognition. Distortion at one frequency affects the entire speech feature, which makes the machine system prone to failure in adverse conditions.

The human auditory system developed a parallel and hierarchical scheme for robust speech communication over millions of years of biological evolution. Instead of having only one channel for the fullband speech, the human cochlea has about 40 critical bands which serve as independent channels for speech perception [15]. Partial contamination of speech in certain critical bands has little effect on the overall system, since the central auditory system listens to the most reliable channels [15], [16].

I-A. History of Multi-stream ASR

Inspired by the parallel processing of human speech perception [15], [17], several researchers proposed the idea of multi-stream speech recognition [18], [19], in which the fullband speech is divided into multiple subbands to emulate the critical bands of the human cochlea. Past research has focused on stream formation and stream fusion, specifically: 1) how to represent the speech signal to form multiple independent streams for parallel processing, and 2) how to integrate information from multiple streams to give a reliable estimate. The representation of the speech signal has followed the path of short-term spectral modulation, long-term temporal modulation, and multi-resolution spectro-temporal analysis.

In the early studies [20], [21], [22] the speech signal was parameterized by short-term linear predictive coding (LPC) or cepstral coefficients, which are unsuitable for subband speech in narrow frequency bands. Next, the short-term spectral trajectories were stacked along time to form the parameterized TempoRAl Patterns (TRAPS) [23], [24], [25]. In [26] an analysis technique named frequency domain linear prediction (FDLP) was developed for the approximation of subband envelopes using all-pole models. With a long analysis window, FDLP can generate a relatively invariant representation of reverberant speech by normalizing the gain of the autoregressive model across many narrow frequency bands [27]. The long-term FDLP feature [28] was initially used in a multi-stream system with 21 streams [29]. Recently, the FDLP feature was extended to include both short-term spectral modulation and long-term temporal modulation within a frequency subband [30]. Alternative spectro-temporal representations include mRASTA [31], cortical features [32], and Gabor features [33], [34], [35].

The fusion technique has evolved from simple methods, such as voting and averaging, to more sophisticated methods using neural networks and confidence measures. An easy way to integrate decisions from multiple streams is to pick the one with the highest posterior probability [20]. Alternative methods include the arithmetic and geometric mean of posterior probabilities from multiple streams [36], [37] and techniques for the mixture of experts [38]. Due to the interaction of speech events across different subbands, simple fusion methods generally perform worse than advanced fusion methods that recover the interactive information [39], [40]. In [41], [23], the subband information is integrated by training a neural network for each of all possible combinations of seven subbands. When the speech is corrupted by noise, it is generally helpful to assess the quality of individual streams and leave out the most corrupted stream(s) [42]. The stream quality can be evaluated by the estimated subband signal-to-noise ratio [21], [22], [41] or by the entropy of the posterior probability [43], [44]. A performance monitor [45] was introduced for the evaluation of stream quality [37], [30]. The top N streams are averaged to give a more reliable estimate [46], [37]. The progress on stream formation and fusion techniques has produced multi-stream systems that show good performance in narrow-band noise [37], [30].

I-B. Open Problems

How to deal with unknown noise of arbitrary spectral shape is a long-standing problem in ASR research. Today most ASR systems use artificial neural networks (ANN) for front-end processing, which work well when the test condition matches the training condition. To compensate for the effect of noise, it is common to train the neural network in the expected noise(s) [47], [48], which usually produces a system with better generalization ability.

However, in many realistic environments both the noise type and the signal-to-noise ratio change with time. It is impractical to train a single neural net and expect it to perform well in all noisy conditions.

The multi-stream framework provides a natural solution for the quantization of unknown noise of arbitrary spectral shape. Within a narrow frequency band the time-varying noise spectrum can be approximated by white noise or speech-shaped noise of similar intensity over short time intervals. Thus, the generalization ability of a multi-stream speech recognition system can be improved by training the subband systems in white noise or speech-shaped noise at various levels. In this study we develop a multi-stream system for robust speech recognition in unknown additive noise.

The remainder of this paper is organized as follows. Section II summarizes the most important findings on human speech perception that are related to multi-stream speech recognition. Section III first gives an overview of the multi-stream system and then explains the details of its three main modules. Section IV presents the experiments, followed by the results in Section V. Finally, Section VI presents a summary and a discussion of future research.

II. Human Speech Perception

The human auditory system employs a parallel and hierarchical scheme [49], [50] for speech perception. At the periphery, the cochlea is characterized by a tonotopic organization such that different regions of the basilar membrane are tuned to different frequencies. Simultaneous masking occurs only when the stimulus and the masker fall within the same critical band [51]. The tonotopic map projects through the cochlear nucleus, midbrain, and thalamus to the primary auditory cortex and surrounding areas [52], [53], where neurons are sensitive to harmonics, pitch [54], temporal modulation, and frequency modulation [55], [56]. These lower-level features are assembled in the superior temporal gyrus for the identification of speech sounds [57].

II-A. Temporal Integration

Speech sounds are perceived as distinctive patterns of energy distributed across many critical bands along the basilar membrane [58]. The duration of perceptual cues ranges from tens of ms for consonants [59], [60] to hundreds of ms for vowels [61] and syllables [58]. Accordingly, the human auditory system has two types of windows for temporal integration [62]. The fast-changing components are processed by the left hemisphere, which has a short integration window of 20–50 ms, while the slow-varying components are processed by the right hemisphere, which has a long integration window of 150–250 ms.

Deficits in temporal processing can cause poor speech recognition ability despite normal intelligence and normal hearing [63], [64].

II-B. Frequency Integration

In the 1920s, Fletcher and his colleagues investigated the contribution of different frequency bands to human speech perception [65]. "Nonsense syllables" were high/low-pass filtered and presented to normal-hearing listeners. It was discovered that the average phoneme error rate e of the fullband speech is equal to the product of the error rates for the low-pass filtered stimuli and the complementary high-pass filtered stimuli, for all cutoff frequencies. Further investigation indicated that the two-band product rule can be generalized into a multi-band form

e = e_1 e_2 \cdots e_K    (1)

where e_i denotes the phoneme error rate of speech in band i. In other words, the K frequency bands act as independent channels for speech communication [15], [16]. Obviously, the fullband error e approaches zero if more than two or three bands have small error rates.

II-C. Generalization in Noise

The human auditory system shows excellent generalization ability in unknown noise. Based on the multi-band product rule (Eq. 1) and a large amount of human perceptual data [65], French and Steinberg derived the Articulation Index (AI) of speech intelligibility [66], [15]:

AI = \frac{1}{K} \sum_{k=1}^{K} AI_k    (2)

where AI_k, 1 ≤ k ≤ 20, is the specific AI of the k-th articulation band (1 articulation band ≈ 2 critical bands),

AI_k = \min\left( \tfrac{1}{3} \log_{10}(1 + SNR_k),\ 1 \right)    (3)

where SNR_k = (c\,\sigma_s/\sigma_n)^2 is the effective signal-to-noise ratio in the k-th frequency band and c ≈ 2 is a constant determined by the speech-peak to rms ratio [66]. AI is a number between 0 and 1. It can be interpreted as the total amount of information being transmitted by the speech signal. Given the AI of a communication channel, the phoneme error rate e can be predicted by

e = e_0^{AI}    (4)

where e_0 ≈ 0.02 is the phoneme error rate in the clean condition. The AI model is very accurate in predicting phoneme accuracy on nonsense syllables for almost all types of distortion except band-stop filtering. It is the basis of both the Speech Transmission Index (STI) [67] and the Speech Intelligibility Index (SII) [68].

Fig. 1. Approximation of unknown noise (solid blue curve) by narrow-band white noise (short horizontal red lines)

TABLE I
Comparison of front-end processing between the machine and the human auditory system

               | Machine                                | Human
Structure      | fullband used as a single channel      | fullband speech processed by ≈ 40 critical bands
Feature        | (short-time) spectral template         | temporal fluctuation across many critical bands
Generalization | degrades dramatically in unseen noise  | degrades slowly as a log function of SNR (Eq. 3)
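As a worked illustration of the AI model above, the following is a minimal numerical sketch (not from the paper) that evaluates Eqs. (2)-(4) for a hypothetical set of per-band SNRs; the band SNR values and helper names are invented for illustration.

```python
import numpy as np

def articulation_index(snr_bands):
    """Eqs. (2)-(3): average the per-band specific AI values.

    snr_bands: effective SNR (linear power ratio, already including the
    speech-peak factor c) in each articulation band.
    """
    ai_k = np.minimum(np.log10(1.0 + np.asarray(snr_bands)) / 3.0, 1.0)  # Eq. (3)
    return float(np.mean(ai_k))                                          # Eq. (2)

def predicted_phoneme_error(ai, e0=0.02):
    """Eq. (4): fullband phoneme error rate predicted from the AI."""
    return e0 ** ai

# Hypothetical example: 20 articulation bands, half at 10 dB SNR, half at 0 dB
snr_db = np.array([10.0] * 10 + [0.0] * 10)
ai = articulation_index(10.0 ** (snr_db / 10.0))
print(f"AI = {ai:.3f}, predicted phoneme error = {predicted_phoneme_error(ai):.3f}")
```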

II-D. Machine vs. Human Being

The main differences between ASR front-end processing and human auditory processing are summarized in Tab. I. First, the machine system has no redundancy in its architecture: all frequency components are mixed to form a single stream, and degradation at one frequency affects the entire speech feature. In contrast, the human auditory system has about 40 critical bands from 0.3 to 8 kHz which act as independent channels for speech perception. Deep learning helps improve ASR performance by replacing hand-engineered features with more informative data-driven representations, but it is still uncertain how the biological structure for parallel and hierarchical processing can be learned through (un)supervised learning. Second, the speech feature for machine systems is derived from short-time Fourier transform coefficients, for which the speech signal is divided into short frames of 25 ms taken every 10 ms, while the human auditory system employs a more sophisticated multi-resolution spectro-temporal representation for speech components of different time and frequency scales. Third, machine systems are very sensitive to unseen noise: performance degrades quickly when the testing condition and the training condition mismatch, whereas the human auditory system generalizes well to all kinds of unseen conditions, guaranteeing optimum information extraction at various noise levels within each articulation band.

[Fig. 2 diagram: speech signal → FDLP2 per subband (I: stream formation) → subband ensemble nets (II: adaptation) → fusion ANNs → performance monitor → average of top N streams → decoder → word/phone sequence (III: fusion)]

Fig. 2. Block diagram of the multi-stream system for robust speech recognition. A simple version of the multi-stream system (denoted as MS) has only one processing stream at the fusion stage, which integrates outputs from all subband ensemble nets. A full version of the multi-stream system with performance monitor (denoted as MS-PM) has multiple processing streams, each representing a possible combination of the K subbands.


III. Multi-stream Speech Recognition

The multi-stream system for robust speech recognition emulates the temporal integration, frequency integration, and noise generalization of human speech perception (refer to Sec. II) in the front-end processing. The basic idea is to build a parallel system that can take advantage of the redundancy in the speech signal and recognize speech in narrow frequency bands. It is superior to the conventional single-stream system in that noise is contained in narrow frequency bands, so corruption of one channel has little effect on the overall system. More importantly, the multi-stream framework provides a natural solution to the quantization of non-stationary additive noise (refer to Fig. 1). It is assumed that within a short time interval the non-stationary additive noise can be approximated by white noise or speech-shaped noise of similar level across multiple narrow frequency bands. Therefore, the generalization ability of a multi-stream system can be improved by training the neural nets of each frequency subband at various signal-to-noise ratios, which greatly simplifies the problem of dealing with unknown noise.

The multi-stream system adopts the Hidden Markov Model - Artificial Neural Network (HMM-ANN) paradigm for speech recognition [69]. The system block diagram is depicted in Fig. 2. It consists of three stages: stream formation, adaptation, and fusion. At the first stage, the fullband speech is decomposed into multiple frequency subbands, named band-limited streams, for parallel processing. Within each subband the speech signal is analyzed by a multi-resolution filterbank to facilitate the representation of both short-term spectral modulation and long-term temporal modulation. At the second stage, subband ensemble nets are trained to estimate the posterior probability of a speech sound at various noise levels, followed by a neural net trained to select the posteriors produced by the neural net of the matched condition, so that the subband system can approach optimal information extraction over a wide range of signal-to-noise ratios. At the third stage, the decisions of the subband ensemble nets, based on partial acoustic evidence, are fused by a neural network to give a more reliable decision. In a simple version of the multi-stream system (denoted as MS), the posteriors from K subbands are integrated by a single fusion neural net to form the default processing stream, which covers all frequency subbands. The default processing stream is decoded to get the phone/word sequence. In a full version of the multi-stream system with performance monitor (denoted as MS-PM), the posteriors from K subbands are integrated by a set of fusion neural nets to create many processing streams, each representing a possible combination of the K subbands. A performance monitor is employed to evaluate the quality of the posteriors for each processing stream. The top N streams are averaged to give a more reliable estimate.

III-A. Stream Formation

Stream formation decomposes the fullband speech into K subbands to form multiple independent band-limited streams of equal contribution. Depending on the sampling rate of the speech signal, K is equal to 7 for wide-band speech (sampling rate = 16 kHz) and 5 for narrow-band speech (sampling rate = 8 kHz). Each subband covers about three Barks along the auditory frequency axis. Within a frequency subband, speech sounds are encoded by the subband hybrid feature FDLP2 [30], which extends frequency-domain linear prediction (FDLP) [26] for long-term temporal representation [28]. FDLP provides a parametric representation of the Hilbert envelope of the speech signal across narrow frequency bands.

Fig. 3. Block diagram of the subband hybrid feature FDLP2. The speech signal is analyzed with a DCT and two filterbanks, a high-resolution filterbank W_m^H(f) and a low-resolution filterbank W_m^L(f); FDLP applied to the band-passed coefficients yields the envelopes e_m^H(n) and e_m^L(n), which facilitate both short-term spectral modulation and long-term temporal modulation for the representation of fast-changing and slow-varying speech elements.

The speech signal is decomposed into multiple subbands by multiplying its DCT coefficients with a set of cosine windows, followed by autocorrelation and Levinson-Durbin recursion to derive the poles, which are then interpolated to reconstruct the subband envelopes [26]. The long-term FDLPm feature was optimized for single-stream speech recognition [28]. It has a uniform filterbank of 2.5 Bark bandwidth and a fixed integration window of 250 ms, which may not be optimal for speech events of different time and frequency scales. Moreover, according to a previous study [27], the FDLP processing can be used to compensate for reverberation by increasing the spectral sampling rate to 4 filters/Bark and normalizing the gain of the all-pole models across frequency.
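As a rough illustration of the FDLP analysis described above, the following is a minimal sketch (not the authors' code) that estimates one subband temporal envelope by linear prediction on windowed DCT coefficients; the model order, frame length, and window shape are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_subband_envelope(x, win, order=40, n_points=None):
    """Approximate the Hilbert envelope of one subband of x via FDLP.

    x     : 1-D speech frame (e.g., a few hundred ms of samples)
    win   : cosine window over the DCT coefficients selecting the subband
            (e.g., one row of a filterbank built from Eq. (5))
    order : all-pole model order
    """
    X = dct(x, type=2, norm='ortho')           # frequency-domain representation
    Xs = X * win                               # select one subband
    # autocorrelation of the windowed DCT coefficients (lags 0..order)
    r = np.correlate(Xs, Xs, mode='full')[len(Xs) - 1:len(Xs) + order]
    # Levinson/Yule-Walker: solve the Toeplitz system for the AR coefficients
    a = np.concatenate(([1.0], solve_toeplitz(r[:order], -r[1:order + 1])))
    # the AR power spectrum over the frame approximates the subband envelope
    n_points = n_points or len(x)
    w = np.linspace(0, np.pi, n_points)
    A = np.polyval(a[::-1], np.exp(-1j * w))   # A(e^{-jw}), with a[0] = 1
    return 1.0 / np.abs(A) ** 2                # envelope, up to a gain factor
```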

The subband hybrid feature FDLP2 provides a more informative representation of the subband speech signal. It extends the FDLPm feature [28] by replacing the uniform filterbank with a multi-resolution filterbank of higher spectral sampling rate (4 filters/Bark), so that it can facilitate both short-term spectral modulation and long-term temporal modulation for the representation of fast-changing and slow-varying speech elements respectively [30]. Fig. 3 depicts the block diagram of FDLP2 [30]. The speech signal is analyzed by multiplying the discrete cosine transform (DCT) coefficients with two sets of cosine windows (Fig. 4)

W_m(f) = 0.5 + 0.5 \cos\big(\pi (f - f_m)/B\big)    (5)

where f_m and B are the center frequency and the 6 dB filter bandwidth (i.e., amplitude = 0.5), both on the Bark scale. The high-resolution filterbank W^H has a bandwidth of 1.0 Bark with a spectral sampling rate of 4 filters/Bark, and the low-resolution filterbank W^L has a bandwidth of 2.5 Bark with a spectral sampling rate of 2 filters/Bark. The band-passed DCT coefficients are used to derive two sets of subband envelopes e^{L/H}(n) by applying FDLP [26], [28]. The short-term spectral modulation feature is derived from the high-resolution spectro-temporal representation e^H(n). It has an integration window of 25 ms, which is appropriate for the representation of fast-changing speech elements such as stops and fricatives. The long-term temporal modulation feature is computed on the basis of a multi-resolution representation, specifically e^H(n) down-sampled by 2 and then interleaved with e^L(n). It has an integration window of 250 ms for the representation of slow-varying speech components such as vowels and syllables.

Fig. 4. Multi-resolution filterbank, including a hi-res filterbank W^H (solid) and a lo-res filterbank W^L (dashed)
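The following is a minimal sketch (an illustration, not the released feature code) of how the two cosine-window filterbanks of Eq. (5) could be laid out on the Bark axis; the Bark conversion formula and the exact band placement are assumptions.

```python
import numpy as np

def hz_to_bark(f_hz):
    # Traunmüller's approximation of the Bark scale (an assumption; the paper
    # does not state which Bark formula is used)
    f = np.asarray(f_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def cosine_filterbank(dct_freqs_hz, bandwidth_bark, filters_per_bark):
    """Cosine windows W_m(f) = 0.5 + 0.5*cos(pi*(f - f_m)/B) on the Bark axis.

    dct_freqs_hz    : frequency (Hz) associated with each DCT coefficient
    bandwidth_bark  : B, the 6 dB bandwidth in Bark (1.0 for W^H, 2.5 for W^L)
    filters_per_bark: spectral sampling rate (4 for W^H, 2 for W^L)
    """
    f_bark = hz_to_bark(dct_freqs_hz)
    centers = np.arange(f_bark.min(), f_bark.max(), 1.0 / filters_per_bark)
    windows = []
    for fm in centers:
        d = (f_bark - fm) / bandwidth_bark
        # window support is |f - f_m| <= B; amplitude 0.5 at +/- B/2 (the 6 dB point)
        windows.append(np.where(np.abs(d) <= 1.0, 0.5 + 0.5 * np.cos(np.pi * d), 0.0))
    return np.array(windows)   # shape: (num_filters, num_dct_coefficients)

# Example: windows for the DCT of a 1-second frame of 16 kHz speech
freqs = np.linspace(0.0, 8000.0, 16000)
W_hi = cosine_filterbank(freqs, bandwidth_bark=1.0, filters_per_bark=4)
W_lo = cosine_filterbank(freqs, bandwidth_bark=2.5, filters_per_bark=2)
```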

Fig. 5. Subband phoneme accuracy of FDLPm and FDLP2 (averaged over 10 clean and noisy conditions from the NoiseEx database)

The subband hybrid feature FDLP2 is tested for phoneme recognition on TIMIT speech in the clean condition and in 10 noises of various types and levels from the NoiseEx database. The details of the phoneme recognition system are described in Sec. IV-A. Fig. 5 depicts the phoneme accuracy of the seven subbands. The FDLP2 feature achieves a 20-40% gain in most frequency subbands compared to the FDLPm feature [30]. The combination of the short-term spectral feature and the long-term temporal feature significantly enhances the amount of information extracted from individual subbands [30].

III-B. Adaptation

Stream adaptation optimizes subband information extraction for each band-limited stream in unknown additive noise. The fullband speech signal is decomposed into multiple frequency subbands, and it is assumed that the effect of wide-band unknown noise on speech recognition can be simulated by white noise across many narrow frequency bands. Depending on the spectral shape and intensity of the unknown noise, the instantaneous subband signal-to-noise ratio may change from ≤ 0 dB (heavy noise) to ≥ 30 dB (close to quiet). How can noise over such a wide range be handled within a frequency subband?

In speech recognition a neural net is trained to estimate the a posteriori probability of a speech sound given the acoustic evidence [70]. It performs best when the distribution of the testing data matches that of the training data. When the speech sound is corrupted by noise, the mismatch between the training and testing data causes the performance to drop quickly. Adding noise to the training data helps to improve the generalization ability of a neural net, but it reduces the amount of phonetic and linguistic information carried by the speech signal: the higher the noise level, the less the neural net learns. As a result, a neural network trained in noise generally shows significantly worse performance than one trained and tested in the clean condition. Multi-style training, in which a neural net is trained with different types of noise at various levels, maintains a balance between performance and generalization ability. It gives satisfactory performance within a certain SNR range, but the performance drops significantly in clean and in highly noisy conditions.

To achieve optimum subband information extraction over a wide range of SNRs, we divide the SNR range from 0 dB to the clean condition into several segments, from low, middle, and high SNRs to quiet, and train a neural net for each segment. The case of SNR ≤ 0 dB is ignored because there is little information carried by the speech signal. Fig. 6 depicts the block diagram of the subband ensemble net. A set of N neural nets, including ANN-Q trained in quiet and several others (ANN-L/M/H) trained at various levels of noise, is employed to cover all possible SNR conditions. In this study, we use training speech at quiet, 20, 10, and 5 dB SNR because a neural net trained in noise seems to generalize well within ±5 dB of its training SNR. Instead of white noise, which has a much bigger masking effect at high frequencies than at low frequencies, speech-shaped noise is used because it provides balanced masking across all frequency subbands. For any unknown noise, the posteriors of the neural net that gives the best performance are selected for each frequency subband. Suppose the narrow-band noise within a frequency subband is quantized by SNRs = {Q, L, M, H} dB. The training of the subband ensemble net takes two steps.

Fig. 6. Subband ensemble net for optimal information extraction. ANN-Q/L/M/H are trained in quiet, low, middle, and high levels of noise.

Step 1: Train feed-forward ANN-Q/L/M/H
For each snr in SNRs:
• add speech-shaped noise to all utterances in the clean training set at snr dB SNR;
• compute the FDLP2 feature X for all utterances in the noisy training set;
• train ANN-snr with the FDLP2 feature X using back-propagation.

In order for the ANN selector to learn to pick the best posteriors, i.e., those produced by the ANN of the matched condition, a training set of noisy speech is created by adding noise to each utterance at the Q/L/M/H levels in a random order, so that at any time there is always one ANN working in the matched condition while the position of the ANN to be selected shifts randomly across the four positions. To make sure that the prior probability is equal for all noise conditions, a sequence of pseudo-random numbers with uniform distribution between 1 and N is used as the index of the noise condition to be applied at any given time during training. Next, the synthesized noisy speech is forward-passed through ANN-Q/L/M/H. The outputs of the four neural nets are concatenated and used as the feature for training the ANN selector.

Step 2: Train the ANN selector
For each utterance u in the clean training set:
• randomly pick an snr in SNRs with equal probability;
• add speech-shaped noise to u at snr dB SNR;
• compute the FDLP2 feature x for utterance u;
• forward-pass x through ANN-Q/L/M/H;
• concatenate the ANN outputs y = [y_Q y_L y_M y_H];
• pad temporal context z(t) = [y(t + 10) y(t + 5) y(t) y(t − 5) y(t − 10)];
• train a three-layer perceptron that maps z to the phoneme label l using back-propagation.
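To make Step 2 concrete, here is a minimal sketch (illustrative, not the authors' training code) of how speech-shaped noise could be mixed in at a target SNR and how the selector input z(t) could be assembled from the four subband ANNs; the frame offsets follow the pseudocode above, while the array shapes and helper names are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add (speech-shaped) noise to a clean utterance at a target SNR in dB."""
    noise = np.resize(noise, speech.shape)
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def selector_input(posteriors, offsets=(10, 5, 0, -5, -10)):
    """Build z(t) = [y(t+10) y(t+5) y(t) y(t-5) y(t-10)] from concatenated posteriors.

    posteriors: list of four (T, n_phones) arrays, one per ANN-Q/L/M/H
    returns   : (T, 5 * 4 * n_phones) selector feature
    """
    y = np.concatenate(posteriors, axis=1)         # (T, 4 * n_phones)
    T = y.shape[0]
    cols = []
    for d in offsets:
        idx = np.clip(np.arange(T) + d, 0, T - 1)  # repeat edge frames at the boundaries
        cols.append(y[idx])
    return np.concatenate(cols, axis=1)

# Usage sketch: the posteriors would come from forward passes of ANN-Q/L/M/H
# (random placeholders here); z then feeds the three-layer selector MLP.
T, n_phones = 300, 40
posteriors = [np.random.dirichlet(np.ones(n_phones), size=T) for _ in range(4)]
z = selector_input(posteriors)
print(z.shape)   # (300, 800)
```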

Fig. 7. Phoneme Error Rate (PER) of the subband ensemble net (denoted as “ensemble”) vs. a multi-style trained neural net (denoted as “mixed”) and single neural nets trained in clean or at a fixed SNR

The subband ensemble net is tested for generalization ability in noise on the TIMIT database using the same phoneme recognition system as in the FDLP2 feature test. Results indicate that the subband ensemble net works well over a wide range of SNRs across multiple frequency subbands. Fig. 7 depicts the phoneme error rate of the subband ensemble net for the 4th subband in clean and in various levels of babble noise. For comparison, the phoneme error rates of single neural nets trained in clean, in babble noise at 20, 15, 10, 5, or 0 dB SNR, or with multi-style training (i.e., all conditions mixed) are plotted in the same figure. The neural net trained in clean works very well in clean, but its performance drops quickly as the noise level increases. The neural nets trained at various levels of babble noise show better generalization ability, but their maximum performance decreases noticeably even in mild noise at 20 dB SNR. Multi-style training helps improve the performance in the range of 10 to 20 dB SNR, but it still performs worse in clean conditions and in heavy noise (≤ 10 dB SNR) than the neural nets of matched conditions. The subband ensemble net significantly outperforms all single nets trained on matched conditions, in every frequency subband.

III-C. Fusion

Stream fusion integrates information from multiple band-limited streams (subbands) to give a more reliable hypothesis of the target speech sound. The major challenges are: 1) how to combine the decisions of the subband ensemble nets; and 2) how to evaluate the quality of individual streams.

III-C1. Integration of Subband Information

The information from multiple band-limited streams is integrated by a neural network to account for the interaction of speech events across different frequency subbands. For example, vowels are defined by both F1 and F2, which fall in the first and second band-limited streams respectively. Classification based on the partial evidence of only F1 or F2 causes a high degree of confusion between different vowels.

Linear weighted fusion has little success in restoring the interactive information; neither does the Dempster-Shafer method [44], which works to a certain degree only when the individual streams are highly correlated. The most effective way of integrating information from individual streams is to train an artificial neural network to learn the detailed interaction of speech events across streams through supervised learning. All non-empty combinations of the 7/5 band-limited streams form a total of 127/31 processing streams, produced by the second-stage fusion neural nets. Depending on the intensity and spectral shape of the noise, some processing streams may be more reliable than others. To choose the more reliable processing streams for the final fusion, it is necessary to have a procedure that selects the informative processing streams and removes those corrupted by the noise.

III-C2. Evaluation of Stream Quality

A performance monitor named the M measure [45] is used to evaluate the reliability of individual processing streams based on the posterior probabilities produced by the fusion MLPs. For a given window of T consecutive frames the M measure is given by

M(\Delta t) = \frac{1}{T - \Delta t} \sum_{t=0}^{T - \Delta t} D(P_t, P_{t+\Delta t})    (6)

where D is the symmetric KL divergence between two posterior feature vectors P_t and P_{t+\Delta t}. The fusion neural networks integrate partial information from multiple frequency subbands to give the posterior probability of speech sounds every 10 ms [70]. When the speech signal is clean, the ANN is confident in classifying the speech sound: the distribution of posterior probabilities P_t is peaky, with the probability of the target sound approaching 1 while all other entries are close to zero, which leads to a larger difference between P_t and P_{t+\Delta t}. When the speech signal is corrupted by noise, the neural network is less confident and the posterior distribution P_t is relatively flat, which makes P_t and P_{t+\Delta t} more similar to each other. Thus, the M measure reflects the quality of the speech signal; it is proportional to the signal-to-noise ratio of the speech signal [45].
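The following is a minimal sketch (an illustration under assumptions, not the authors' implementation) of the M measure of Eq. (6) and of using it to average the top N processing streams; the symmetric KL divergence and the equal-weight averaging follow the description above, while the window length, the value of N, and the stream data structures are hypothetical.

```python
import numpy as np

def m_measure(posteriors, delta_t=20, eps=1e-10):
    """Eq. (6): mean symmetric KL divergence between posteriors delta_t frames apart.

    posteriors: (T, n_classes) array of per-frame posterior probabilities
    """
    p = np.clip(posteriors, eps, 1.0)
    q = p[delta_t:]            # P_{t+dt}
    p = p[:-delta_t]           # P_t
    sym_kl = np.sum(p * np.log(p / q) + q * np.log(q / p), axis=1)
    return float(np.mean(sym_kl))

def average_top_n_streams(stream_posteriors, n=5, delta_t=20):
    """Rank processing streams by the M measure and average the top N.

    stream_posteriors: list of (T, n_classes) arrays, one per processing stream
    """
    scores = [m_measure(s, delta_t) for s in stream_posteriors]
    top = np.argsort(scores)[::-1][:n]                 # larger M = more confident stream
    avg = np.mean([stream_posteriors[i] for i in top], axis=0)
    return avg, top, scores

# Usage sketch: the streams would be the outputs of the 127 (or 31) fusion nets.
T, C = 200, 40
streams = [np.random.dirichlet(np.ones(C), size=T) for _ in range(127)]
posterior, selected, scores = average_top_n_streams(streams, n=9)
```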

IV. Experiments

The multi-stream approach is evaluated for robust speech recognition on three different speech corpora. In the first experiment, the full version of the multi-stream system with performance monitor (denoted as MS-PM) is tested for phoneme recognition using the TIMIT speech corpus. In the second experiment, a simple version of the multi-stream system (denoted as MS) is evaluated for phoneme recognition in additive noise and other unknown distortions using telephone speech from the RATS corpus for Arabic Levantine. In the third experiment, the MS system is tested for Large-Vocabulary Continuous Speech Recognition (LVCSR) using the Aurora-4 speech corpus.

IV-A. Experiment I on TIMIT

The first experiment assesses the effectiveness of the multi-stream framework for robust speech recognition in unknown additive noise and the contribution of the subband ensemble net and the performance monitor to system performance.

Speech: The TIMIT corpus of American English consists of 5040 sentences, of which 3696 are used for training and the remaining 1344 for testing. The target phoneme set contains 40 phonemes. The speech is sampled at 16 kHz. The clean speech is mixed with more than 10 types of noise from the NoiseEx database and with 1 kHz pure-tone noise at 0, 5, 10, and 15 dB SNR to create noisy speech for testing.

Condition: The multi-stream system is compared with a single-stream (SS) baseline that is trained in the clean condition. The speech sounds are encoded by three types of speech features: the short-term spectral feature MFCC, the short-term spectral feature with noise reduction PNCC, and the long-term temporal feature FDLPm. The generalization ability of a multi-stream system with the subband ensemble net (denoted as ensemble) is evaluated by comparing its performance with that of a multi-stream system that has only a single neural network per subband, trained either in clean or in multi-style condition (denoted as clean and mixed). Three versions of the multi-stream system are tested in order to assess the contribution of the performance monitor. MS denotes the simple version with only one processing stream, which covers all frequency subbands. MS-PM denotes the full version with a performance monitor for stream selection. To estimate the upper limit of the performance gain from stream selection, we also include a multi-stream system, denoted as MS-Oracle, in which the top N streams are manually selected based on the phoneme error rate after decoding.

System: The multi-stream phoneme recognition system and the single-stream baseline adopt the Artificial Neural Network - Hidden Markov Model (ANN-HMM) paradigm [69]. For simplicity, both systems are built as single-state monophone systems without using any context information. The single-stream baseline consists of a multi-layer perceptron (MLP) trained to classify phonemes, followed by a Viterbi decoder with a state transition matrix of equal probabilities for self-transitions and next-state transitions. The block diagram of the multi-stream system is given in Sec. III.

The subband ensemble nets are trained with babble noise, which has the same long-term spectral shape as the speech signal. The dimensionality of the pre-sigmoid outputs of the subband ensemble nets is reduced to 25 by applying the Karhunen-Loeve Transform (KLT) [9]. The reduced vectors from all subband classifiers are then concatenated [71] to form the input for the fusion neural net, which estimates the final posterior probability of phonemes given the acoustic evidence. All neural nets are three-layer MLPs with a hidden layer of 1000 units.

IV-B. Experiment II on RATS Arabic Levantine

The second experiment evaluates the multi-stream system for phoneme recognition in additive noise and other distortions using the DARPA RATS corpus for Arabic Levantine.

Speech: The RATS corpus of Arabic Levantine consists of 9 channels of telephone speech. Channel src contains the original clean speech, while channels A, B, C, D, E, F, G, and H contain speech with various types of unknown distortion, including additive noise, convolutional noise, low-pass filtering, frequency smearing, and frequency shifting. Each channel has 40 hours of speech. The target phoneme set consists of 38 phonemes. The speech is sampled at 8 kHz. Channels A, D, and G are selected as the unseen channels for testing.

Condition: To assess the generalization ability to unseen distortion, the multi-stream system (denoted as MS) is compared with the single-stream baseline trained either on clean speech from the src channel or on both clean and noisy speech from all channels except the testing channel. Since the phoneme recognition system never "saw" the distortion before, it is expected to perform poorly. For comparison, we also include the results of the single-stream system for matched channels, in which the phoneme recognition system is trained on speech from all channels or from the same channel used for testing (denoted as All and TC respectively). In these two situations the ASR system has already "seen" the type of distortion before testing, so it is expected to show better performance.

System: The multi-stream phoneme recognition system and the single-stream baseline are similar to those of Exp. I. Due to the large amount of computation, only the simple version of the MS system with the default processing stream is tested and compared with the SS baseline. Both the MS system and the SS baseline are built as single-state monophone systems because the phoneme transcription, derived by running forced alignment on the word-level transcription, is inaccurate in the start and end times of many speech sounds. Except for the ANN selectors of the subband ensemble net, all ANNs have the same size of 420×1000×1000×250×38.

The simple MS system with the subband ensemble net is trained on synthesized noisy speech created by mixing speech from the src channel with speech-shaped noise derived from the same channel. The speech-shaped noise required for the training of the subband ensemble net is created by randomly selecting 100 segments of clean speech from channel src and averaging over all segments, so the MS system has never seen any data from the noisy channels.

IV-C. Experiment III on Aurora4

The third experiment evaluates the performance of a multi-stream LVCSR system on the Aurora-4 speech corpus.

Fig. 8. Performance of the MS system with the subband ensemble net vs. MS systems with a single neural network per subband, trained either in clean or multi-style ("mixed") in babble noise

Speech: The Aurora-4 database is derived from the DARPA Wall Street Journal (WSJ0) corpus of read speech. The training set contains 7138 sentences from 83 speakers. The test set contains 330 utterances in 6 different noise and channel-mismatch conditions. There are six different noise types (street, babble, train, car, restaurant, airport) at SNRs ranging from 5 to 15 dB and two different microphones. The speech signal is sampled at 16 kHz.

Condition: The simple version of the multi-stream LVCSR system (denoted as "MS") is compared with the single-stream baseline trained on clean speech from the 1st microphone using the MFCC or FDLPm feature.

System: The multi-stream LVCSR system is built using the Kaldi speech recognition toolkit [72]. A hidden Markov model - Gaussian mixture model (HMM-GMM) system is trained on the input features, transformed using linear discriminant analysis (LDA) and maximum likelihood linear transform (MLLT), with a trigram language model. The HMM-GMM system is then used to generate context-dependent triphone state alignments for the training of deep neural networks (DNN). Due to the large amount of computation involved in training deep neural networks, only the default processing stream, which integrates information from the five subbands, is included in the system. The multi-stream LVCSR system is built as a three-state triphone system with a trigram language model. Except for the ANN selectors, which are 4-layer MLPs of size 420×1024×1024×2024, all other neural nets in the subband ensemble nets and the fusion nets are 7-layer deep neural networks (DNNs) with 1024 units in each hidden layer. The weights of the DNNs are pre-trained using the contrastive divergence learning procedure for Restricted Boltzmann Machines (RBMs) [73]. A single-stream LVCSR system using the 9-frame MFCC or FDLPm feature is adopted as the baseline.

Fig. 9. Comparison between the MS system and the single-stream baselines using the MFCC and FDLPm features in unknown noise. "subway (15)" means subway noise at 15 dB SNR

V. Results

The multi-stream approach substantially increases the robustness of ASR systems to unknown noise.

V-A. Experiment I on TIMIT

Table II lists the phoneme error rate (PER) of the variants of the single-stream and multi-stream systems. The multi-stream system with the subband ensemble net substantially outperforms the MS systems (both MS and MS-PM) that have only a single neural network per subband, whether clean-trained or multi-style trained, for all noisy conditions. The performance monitor is highly effective at rejecting narrow-band noise, e.g., the error rate for the 1 kHz pure tone decreases from 49.28% to 39.87%. Stream selection with the performance monitor generally helps when the band-limited streams are unbalanced. It reduces the phoneme error rate by about 5% relative for the MS system with clean-trained subband neural nets, but the gain diminishes for the MS system with multi-style training. The average phoneme error rate of the MS system is about 40%, 30%, and 28% relative lower than that of the single-stream baseline using MFCC, PNCC, and FDLPm respectively.

TABLE II
Exp. I: Phoneme Error Rate (%) of the multi-stream system on TIMIT

Noise (dB SNR)   | SS MFCC | SS PNCC | SS FDLP | MS clean | MS mixed | MS ens. | MS-PM clean | MS-PM mixed | MS-PM ens. | Oracle clean | Oracle mixed | Oracle ens.
clean            | 33.50 | 33.51 | 31.35 | 30.04 | 31.70 | 32.29 | 29.45 | 30.99 | 31.10 | 27.57 | 28.63 | 28.39
babble (15)      | 65.14 | 58.93 | 57.10 | 47.05 | 39.15 | 36.03 | 45.16 | 38.48 | 34.84 | 42.27 | 36.33 | 32.08
subway (15)      | 57.42 | 49.48 | 46.62 | 38.28 | 35.27 | 33.93 | 36.28 | 34.35 | 32.61 | 33.65 | 31.87 | 30.07
buccaneer1 (15)  | 66.76 | 63.94 | 60.90 | 56.30 | 49.81 | 42.63 | 53.82 | 48.48 | 40.97 | 50.40 | 46.20 | 38.33
factory1 (10)    | 74.65 | 72.05 | 68.10 | 63.74 | 56.46 | 45.91 | 61.56 | 55.26 | 44.64 | 58.52 | 53.39 | 41.99
restaurant (10)  | 72.48 | 67.14 | 63.14 | 58.19 | 49.65 | 42.96 | 55.69 | 48.60 | 41.60 | 52.98 | 46.65 | 38.86
machinegun (10)  | 52.08 | 48.27 | 50.95 | 40.48 | 36.99 | 37.76 | 36.21 | 35.45 | 36.14 | 33.48 | 32.71 | 32.27
street (5)       | 80.98 | 70.26 | 67.26 | 62.56 | 54.72 | 44.41 | 59.37 | 53.92 | 43.36 | 55.72 | 52.20 | 40.65
exhall (5)       | 79.90 | 74.97 | 70.67 | 69.47 | 58.04 | 48.30 | 64.46 | 57.76 | 46.77 | 61.80 | 55.93 | 43.74
music55 (5)      | 82.47 | 78.38 | 73.46 | 75.60 | 65.88 | 62.69 | 72.38 | 64.37 | 59.82 | 68.36 | 61.74 | 55.80
f16 (0)          | 77.32 | 84.31 | 86.10 | 86.54 | 89.31 | 71.27 | 83.83 | 85.07 | 69.56 | 81.29 | 84.66 | 67.23
car (0)          | 90.24 | 52.58 | 54.32 | 40.79 | 35.78 | 34.41 | 35.48 | 34.56 | 33.18 | 33.20 | 31.90 | 30.43
1 kHz tone (0)   | 80.56 | 71.42 | 64.59 | 48.72 | 50.98 | 49.28 | 39.20 | 40.99 | 39.87 | 35.67 | 37.32 | 36.85

SS – single-stream baseline trained in clean; MS – multi-stream system with only one stream including all subbands; “clean” and “mixed” refer to the training conditions for subband neural nets; MS-PM – multi-stream system with a performance monitor; Oracle – multi-stream system with manually selected N best streams.

Generalization in Noise: The subband ensemble net significantly enhances the robustness of the multi-stream system in additive noise. Fig. 8 depicts the PER of the MS system with the subband ensemble net as a function of SNR in babble noise, which is "known" to the system. The results are compared to MS systems that have a single neural net per subband, trained either in clean or in multi-style condition. The three curves start at about the same level, near 30%, in the clean condition. As the noise level increases, the MS system with clean-trained subband neural nets climbs quickly to about 60% PER at 10 dB SNR. The MS system with multi-style trained subband neural nets shows a much slower degradation, but it is still far behind the MS system with the subband ensemble net, which has a phoneme error rate of about 40% at 10 dB SNR.

Figure 9 depicts the PER of the MS system and the single-stream (SS) baselines using the MFCC and FDLPm features in various types of unknown noise. As the noise level increases, the PER of SS(MFCC) climbs quickly from about 30% in the clean condition to more than 90% in car noise at 0 dB SNR. In contrast, the MS system degrades much more slowly in noise. For car noise at 0 dB SNR, a low-frequency noise that corrupts the speech components below 0.5 kHz, the PER of the MS system is very close to that of the clean condition, suggesting that the proposed approach is very effective in dealing with narrow-band noise.

Sources of Gain in Performance: The MS-PM system achieves a 40% reduction in average PER in unknown noise. Most of the contribution comes from the subband ensemble net. Figure 10 depicts the average PER of the multi-stream system and the single-stream baseline in unknown noise, averaged over all noisy conditions from 0 to 15 dB SNR. The single-stream system with the FDLPm feature is used as the baseline because it is much better than the one with the MFCC feature.

Fig. 10. Contribution of subband ensemble net and performance monitor to the gain in performance

With the more informative subband speech feature FDLP2, the multi-stream system with a single clean-trained neural network per subband has a phoneme error rate about 10% relative lower than that of the single-stream baseline. The subband ensemble net, which is better than multi-style training at various noise levels, provides an additional 18% relative gain on top of the multi-stream system with clean-trained subband neural networks.

V-B. Experiment II on RATS Arabic Levantine

The multi-stream approach also works for the DARPA RATS speech, in spite of the various types of unknown distortion in addition to the additive noise. Table III lists the PER of the MS system with the ensemble net and of several variants of the SS system on the RATS corpus for Arabic Levantine. SS(matched) refers to the cases where the single-stream baseline is trained on speech from all channels or from the target channel used for testing (denoted as "All" and "TC").

TABLE III
Exp. II: Phoneme Error Rate (%) of the multi-stream system on RATS Arabic Levantine

Chan       | SS (matched) All | SS (matched) TC | SS (mismatch) src | SS (mismatch) All but TC | MS
src        | 55.9 | 51.8 | 51.8 | 59.2 | 52.7
A          | 73.0 | 69.8 | 96.3 | 75.7 | 73.4
B          | 72.6 | 69.9 | 91.5 | 76.0 | 75.9
C          | 78.1 | 78.2 | 88.9 | 79.3 | 75.9
D          | 64.9 | 60.1 | 83.8 | 69.3 | 68.0
E          | 71.4 | 70.0 | 93.2 | 73.3 | 77.1
F          | 64.0 | 64.1 | 85.5 | 69.4 | 70.5
G          | 60.4 | 58.5 | 61.8 | 62.0 | 55.6
H          | 68.3 | 68.6 | 91.9 | 71.2 | 74.8
mean(A–H)  | 69.1 | 67.4 | 86.6 | 72.0 | 71.4

SS(matched) – single-stream baseline trained on all channels or the testing channel (denoted as "All" and "TC"); SS(mismatch) – single-stream baseline trained on src or all channels except the testing channel (denoted as "src" and "All but TC"); MS – simple version of the multi-stream system.

SS(mismatch) refers to the cases where the single-stream baseline is trained on the src channel or on all channels except the target channel used for testing (denoted as "src" and "All but TC"). In the last column, the MS system with an ensemble net in each subband is trained on clean speech and on speech-shaped noise synthesized from speech in the src channel; the system has never seen any of the noisy channels.

MS vs. SS, mismatched case: The MS system is substantially better than the SS system trained on speech from the src channel only. Due to the heavy noise and unknown types of distortion, the SS system trained on the src channel fails on all noisy channels except channel G, which is corrupted by additive noise at about 15 dB SNR. In contrast, the MS system shows stable performance across all channels. Notice that the MS system is trained on speech from the src channel only; it has never seen the distortions of channels A to H. The MS system (6th column in Tab. III) has an average PER of 71.4% over the noisy channels A to H, which is substantially better than the average PER of 86.6% of the SS baseline (4th column in Tab. III), suggesting that the proposed subband ensemble net is very effective in dealing with noise. Since the noisy channels A to H share certain types of distortion, such as convolutional noise, additive noise, and frequency smearing, training the SS baseline on speech from all channels but the test channel (denoted as "All but TC") helps improve the system's generalization ability. Results show that the MS system also outperforms this type of mismatched SS system for both clean and noisy conditions.

MS vs. SS, matched case: The MS system is comparable to the SS system trained on speech from all channels, including the target channel used for testing. The PER of the MS system is on average about 2% absolute higher than that of the SS system for the noisy channels (2nd column in Tab. III) and about 3% absolute better than the latter for the src channel.

TABLE IV
Exp. III: Word Error Rate (%) of the multi-stream LVCSR system on the Aurora-4 corpus

Condition  | SS (MFCC) mic-1 | SS (MFCC) mic-2 | SS (FDLPm) mic-1 | SS (FDLPm) mic-2 | MS mic-1 | MS mic-2
clean      |  2.8 | 18.1 |  3.1 | 17.2 |  3.1 | 11.7
airport    | 20.8 | 40.0 | 19.6 | 41.5 |  8.6 | 23.8
babble     | 18.8 | 39.3 | 18.6 | 38.7 |  7.2 | 23.9
car        | 11.8 | 27.9 | 13.6 | 33.7 |  4.1 | 16.2
restaurant | 26.4 | 42.6 | 24.3 | 41.9 | 11.0 | 26.9
street     | 25.7 | 44.2 | 25.4 | 43.8 |  8.9 | 24.7
train      | 27.2 | 44.8 | 23.6 | 41.7 |  9.0 | 24.3

mic-1/2 refers to two different recording microphones. Both the SS and MS systems are trained on speech from the 1st microphone.

The MS system is not as good as the SS system trained and tested on the same channel (3rd column in Tab. III), which is not surprising. In the RATS project, the ASR system is supposed to deal with unknown distortion from an unseen channel, so the SS system in the matched case is a "cheating" experiment. Nevertheless, it is still interesting for comparison.

V-C. Experiment III on Aurora4

The multi-stream approach improves the robustness of the LVCSR system in additive noise. Tab. IV shows the results of Exp. III on the Aurora-4 database, which contains speech distorted by both additive noise and convolutional noise introduced by different recording microphones. Both the SS baselines and the MS system are trained on clean speech recorded by the 1st microphone.

Additive Noise: The multi-stream LVCSR system substantially outperforms the single-stream baseline for all noisy conditions. The average WER over all noisy conditions drops from 42.18% for the single-stream system to 15.40% for the multi-stream system; the average WER of the proposed MS LVCSR system is thus about 63% relative lower than that of the single-stream baseline. In addition, the MS LVCSR system is slightly better than the single-stream baseline for the clean condition.

Convolutional Noise: The difference in performance between the two microphones is at a similar level for both the SS and MS systems, indicating that the subband hybrid feature FDLP2 is not effective in compensating for the convolutional noise introduced by the change of recording microphone.

VI. Summary and Discussion

In this study, we describe a multi-stream approach that can deal with unknown additive noise in realistic environments. It is assumed that any unknown noise can be approximated by white noise (or speech-shaped noise) of similar level across multiple frequency subbands. The fullband speech is divided into multiple frequency subbands for parallel processing. Within each subband, the speech signal is represented by the Hilbert envelopes derived using FDLP across many narrow frequency bands. A subband ensemble net is trained to optimize information extraction at various levels of noise, followed by a neural net that selects the suitable net of the matched condition. The subband information is integrated by a set of neural nets to form many processing streams, each representing a possible combination of the seven subbands. A performance monitor is employed to evaluate the quality of the individual streams, and the top N streams are averaged to give a more reliable estimate.

The multi-stream approach is tested on phoneme recognition using the TIMIT and DARPA RATS Arabic Levantine corpora. Experimental results show that the subband ensemble net substantially improves information extraction within each frequency subband in noisy conditions. In a third experiment, the multi-stream approach is validated on LVCSR using the Aurora-4 speech corpus. It is shown that the substantial gain in performance carries over to the LVCSR task as well.

The subband speech signal is represented by the long-term Hilbert envelopes derived using FDLP. In theory, this should generate a relatively invariant representation for reverberant speech. Suppose X(ω, t) and R(ω, t) denote the short-time Fourier coefficients of the original signal x(t) and the reverberant signal r(t) respectively; the effect of reverberation can be modeled as R(ω, t) ≈ X(ω, t)H(ω, t) [27], where H(ω, t) is the short-time Fourier transform of the room impulse response. When the analysis window is longer than the reverberation time (RT60), defined as the time for the reverberant sound r(t) to drop 60 dB below the original level, H(ω, t) turns into a time-invariant transfer function [74], which can be effectively compensated for by dividing the fullband speech into many narrow frequency bands and normalizing the gain of the autoregressive (AR) models in each band [27].

Limitations: In the current version of the multi-stream system, the fullband speech is decomposed into band-limited streams with about 25% overlap between neighboring subbands, so the neighboring band-limited streams are correlated to a certain degree. As a consequence, the multi-band product rule, which applies to human speech perception, does not apply to our MS system. A future version of the multi-stream system may have many narrower band-limited streams, which would allow a more delicate and efficient use of the frequency resources when the speech is corrupted by noise. The multi-stream system is complex; some researchers have suggested stacking all neural nets and running back-propagation to fine-tune the weights. The subband hybrid feature FDLP2 does not shrink the difference between the speech recorded by two different microphones; more research is needed to find solutions for other types of distortion. The performance monitor based on the M measure requires an integration window of at least 1 second, which greatly restricts its use with non-stationary noise. In order to compensate for deficiencies in front-end processing, current ASR systems rely heavily on word and language models to reduce the word error rate. In contrast, the context information for human speech perception is quantitatively equivalent to adding statistically independent channels to those already available [75]. It is still uncertain how to build an ASR system with feedback that can use top-down information as well.

VII. References
[1] J. J. Sroka and L. D. Braida, "Human and machine consonant recognition," Speech Communication, vol. 45, no. 4, pp. 401–423, 2005.
[2] R. P. Lippmann, "Speech recognition by machines and humans," Speech Communication, vol. 22, no. 1, pp. 1–15, 1997.
[3] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[4] S. Greenberg and B. E. Kingsbury, "The modulation spectrogram: In pursuit of an invariant representation of speech," in Proc. IEEE ICASSP, vol. 3, 1997, pp. 1647–1650.
[5] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," in Proc. IEEE ICASSP, 2012, pp. 4101–4104.
[6] S. Ganapathy, J. Pelecanos, and M. K. Omar, "Feature normalization for speaker verification in room reverberation," in Proc. IEEE ICASSP, 2011, pp. 4836–4839.
[7] Q. Li and Y. Huang, "An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1791–1801, 2011.
[8] N. Morgan and H. A. Bourlard, "Neural networks for statistical recognition of continuous speech," Proceedings of the IEEE, vol. 83, no. 5, pp. 742–772, 1995.
[9] H. Hermansky, D. P. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proc. IEEE ICASSP, vol. 3, 2000, pp. 1635–1638.
[10] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[11] A.-r. Mohamed, T. N. Sainath, G. Dahl, B. Ramabhadran, G. E. Hinton, and M. A. Picheny, "Deep belief networks using discriminative features for phone recognition," in Proc. IEEE ICASSP, 2011, pp. 5060–5063.
[12] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[13] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. IEEE ICASSP, 2012, pp. 4277–4280.
[14] L. Deng, O. Abdel-Hamid, and D. Yu, "A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion," in Proc. IEEE ICASSP, 2013, pp. 6669–6673.
[15] J. B. Allen, "How do humans process and recognize speech?" IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 567–577, 1994.
[16] F. Li and J. B. Allen, "Multiband product rule and consonant identification," The Journal of the Acoustical Society of America, vol. 126, no. 1, pp. 347–353, 2009.
[17] G. A. Miller and P. E. Nicely, "An analysis of perceptual confusions among some English consonants," The Journal of the Acoustical Society of America, vol. 27, no. 2, pp. 338–352, 1955.
[18] H. Hermansky, "Multistream recognition of speech: Dealing with unknown unknowns," 2013.
[19] N. Morgan, "Deep and wide: Multiple layers in automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 7–13, 2012.
[20] P. Duchnowski, "A new structure for automatic speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1992.
[21] H. Bourlard, S. Dupont, H. Hermansky, and N. Morgan, "Towards subband-based speech recognition," in Proc. EUSIPCO, 1996, pp. 1579–1582.
[22] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in Proc. ICSLP, vol. 1, 1996, pp. 426–429.
[23] S. R. Sharma, "Multi-stream approach to robust speech recognition," Ph.D. dissertation, Oregon Graduate Institute, 1999.
[24] P. Jain et al., "Temporal patterns of frequency-localized features in ASR," 2003.
[25] F. Grézl and H. Hermansky, "Local averaging and differentiating of spectral plane for TRAP-based ASR," in Proc. INTERSPEECH, 2003.
[26] M. Athineos and D. P. Ellis, "Autoregressive modeling of temporal envelopes," IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5237–5245, 2007.
[27] S. Thomas, S. Ganapathy, and H. Hermansky, "Recognition of reverberant speech using frequency domain linear prediction," IEEE Signal Processing Letters, vol. 15, pp. 681–684, 2008.
[28] S. Ganapathy, S. Thomas, and H. Hermansky, "Temporal envelope compensation for robust phoneme recognition using modulation spectrum," The Journal of the Acoustical Society of America, vol. 128, no. 6, pp. 3769–3780, 2010.
[29] F. Li, S. H. R. Mallidi, and H. Hermansky, "Phone recognition in critical bands using sub-band temporal modulations," in Proc. INTERSPEECH, 2012.
[30] F. Li, "Subband hybrid feature for multi-stream speech recognition," in Proc. IEEE ICASSP, 2014.
[31] H. Hermansky and P. Fousek, "Multi-resolution RASTA filtering for tandem-based ASR," in Proc. INTERSPEECH, 2005, pp. 361–364.
[32] N. Mesgarani, S. Thomas, and H. Hermansky, "A multistream multiresolution framework for phoneme recognition," in Proc. INTERSPEECH, 2010, pp. 318–321.
[33] M. Kleinschmidt, "Localized spectro-temporal features for automatic speech recognition," in Proc. INTERSPEECH, 2003.
[34] S. Y. Zhao and N. Morgan, "Multi-stream spectro-temporal features for robust speech recognition," in Proc. INTERSPEECH, 2008, pp. 898–901.
[35] B. T. Meyer and B. Kollmeier, "Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition," Speech Communication, vol. 53, no. 5, pp. 753–767, 2011.
[36] J. Ming and F. J. Smith, "A probabilistic union model for subband based robust speech recognition," in Proc. IEEE ICASSP, vol. 3, 2000, pp. 1787–1790.
[37] E. Variani, F. Li, and H. Hermansky, "Multi-stream recognition of noisy speech with performance monitoring," in Proc. INTERSPEECH, 2013, pp. 2978–2981.
[38] K. Woods, K. Bowyer, and W. P. Kegelmeyer Jr., "Combination of multiple classifiers using local accuracy estimates," in Proc. IEEE CVPR, 1996, pp. 391–396.
[39] H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on partially corrupted speech," in Proc. Int. Conf. Spoken Language Processing, 1996, pp. 462–465.
[40] A. Hagen, "Robust speech recognition based on multi-stream processing," Ph.D. dissertation, École Polytechnique Fédérale de Lausanne, 2002.
[41] S. Tibrewala and H. Hermansky, "Sub-band based recognition of noisy speech," in Proc. IEEE ICASSP, vol. 2, 1997, pp. 1255–1258.
[42] R. P. Lippmann and B. A. Carlson, "Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise," in Proc. Eurospeech, 1997, pp. 37–40.
[43] H. Misra, H. Bourlard, and V. Tyagi, "New entropy based combination rules in HMM/ANN multi-stream ASR," in Proc. IEEE ICASSP, vol. 2, 2003, pp. II-741.
[44] F. Valente and H. Hermansky, "Combination of acoustic classifiers based on Dempster-Shafer theory of evidence," in Proc. IEEE ICASSP, 2007, pp. 1129–1132.
[45] H. Hermansky, E. Variani, and V. Peddinti, "Mean temporal distance: Predicting ASR error from temporal properties of speech signal," in Proc. IEEE ICASSP, 2013, pp. 7423–7426.
[46] T. Ogawa, F. Li, and H. Hermansky, "Stream selection and integration in multistream ASR using GMM-based performance monitoring," in Proc. INTERSPEECH, 2013, pp. 3332–3336.
[47] R. Lippmann, E. Martin, and D. B. Paul, "Multi-style training for robust isolated-word speech recognition," in Proc. IEEE ICASSP, vol. 12, 1987, pp. 705–708.
[48] L. Deng, A. Acero, M. Plumpe, and X. Huang, "Large-vocabulary speech recognition under adverse acoustic environments," in Proc. INTERSPEECH, 2000, pp. 806–809.
[49] J. E. Peelle, I. S. Johnsrude, and M. H. Davis, "Hierarchical processing for speech in human auditory cortex and beyond," Frontiers in Human Neuroscience, vol. 4, 2010.
[50] E. A. Lopez-Poveda, A. R. Palmer, and R. Meddis, The Neurophysiological Bases of Auditory Perception. Springer, 2010.
[51] H. Fletcher, Speech and Hearing in Communication, ASA ed., J. B. Allen, Ed. Woodbury, NY: Acoustical Society of America, 1995.
[52] L. M. Romanski, J. F. Bates, and P. S. Goldman-Rakic, "Auditory belt and parabelt projections to the prefrontal cortex in the rhesus monkey," Journal of Comparative Neurology, vol. 403, no. 2, pp. 141–157, 1999.
[53] T. Hackett, I. Stepniewska, and J. Kaas, "Thalamocortical connections of the parabelt auditory cortex in macaque monkeys," Journal of Comparative Neurology, vol. 400, pp. 271–286, 1998.
[54] D. Bendor and X. Wang, "Neural coding of periodicity in marmoset auditory cortex," Journal of Neurophysiology, vol. 103, no. 4, pp. 1809–1822, 2010.
[55] J. P. Rauschecker, B. Tian, and M. Hauser, "Processing of complex sounds in the macaque nonprimary auditory cortex," Science, vol. 268, no. 5207, pp. 111–114, 1995.
[56] X. Wang, M. M. Merzenich, R. Beitel, and C. E. Schreiner, "Representation of a species-specific vocalization in the primary auditory cortex of the common marmoset: Temporal and spectral characteristics," Journal of Neurophysiology, vol. 74, no. 6, pp. 2685–2706, 1995.
[57] N. Mesgarani, C. Cheung, K. Johnson, and E. F. Chang, "Phonetic feature encoding in human superior temporal gyrus," Science, vol. 343, no. 6174, pp. 1006–1010, 2014.
[58] R. Drullman, J. M. Festen, and R. Plomp, "Effect of reducing slow temporal modulations on speech reception," The Journal of the Acoustical Society of America, vol. 95, no. 5, pp. 2670–2680, 1994.
[59] F. Li, A. Menon, and J. B. Allen, "A psychoacoustic method to find the perceptual cues of stop consonants in natural speech," The Journal of the Acoustical Society of America, vol. 127, no. 4, pp. 2599–2610, 2010.
[60] F. Li, A. Trevino, A. Menon, and J. B. Allen, "A psychoacoustic method for studying the necessary and sufficient perceptual cues of American English fricative consonants in noise," The Journal of the Acoustical Society of America, vol. 132, no. 4, pp. 2663–2675, 2012.
[61] A. S. House, "On vowel duration in English," The Journal of the Acoustical Society of America, vol. 33, no. 9, pp. 1174–1178, 1961.
[62] D. Poeppel, "The analysis of speech in different temporal integration windows: Cerebral lateralization as asymmetric sampling in time," Speech Communication, vol. 41, no. 1, pp. 245–255, 2003.
[63] P. Tallal, S. L. Miller, G. Bedi, G. Byma, X. Wang, S. S. Nagarajan, C. Schreiner, W. M. Jenkins, and M. M. Merzenich, "Language comprehension in language-learning impaired children improved with acoustically modified speech," Science, vol. 271, no. 5245, pp. 81–84, 1996.
[64] F.-G. Zeng, S. Oba, S. Garde, Y. Sininger, and A. Starr, "Temporal and speech processing deficits in auditory neuropathy," Neuroreport, vol. 10, no. 16, pp. 3429–3435, 1999.
[65] H. Fletcher and R. H. Galt, "The perception of speech and its relation to telephony," The Journal of the Acoustical Society of America, vol. 22, pp. 89–151, 1950.
[66] N. French and J. Steinberg, "Factors governing the intelligibility of speech sounds," The Journal of the Acoustical Society of America, vol. 19, no. 1, pp. 90–119, 1947.
[67] H. J. Steeneken and T. Houtgast, "A physical method for measuring speech-transmission quality," The Journal of the Acoustical Society of America, vol. 67, no. 1, pp. 318–326, 1980.
[68] American National Standards Institute, American National Standard: Methods for Calculation of the Speech Intelligibility Index. Acoustical Society of America, 1997.
[69] H. A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Springer, 1994, vol. 247.
[70] M. D. Richard and R. P. Lippmann, "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural Computation, vol. 3, no. 4, pp. 461–483, 1991.
[71] J. Pinto, B. Yegnanarayana, H. Hermansky, and M. Magimai-Doss, "Exploiting contextual information for improved phoneme recognition," in Proc. IEEE ICASSP, 2008, pp. 4449–4452.
[72] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2011.
[73] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, p. 926, 2010.
[74] C. Avendano, "Temporal processing of speech in a time-feature space," 1997.
[75] A. Boothroyd and S. Nittrouer, "Mathematical treatment of context effects in phoneme and word recognition," The Journal of the Acoustical Society of America, vol. 84, no. 1, pp. 101–114, 1988.

Feipeng Li received his B.S. and M.S. degrees, both in Electrical Engineering, from Wuhan University, China, in 1996 and 1999, respectively. In 1999 he joined the State Key Laboratory for Surveying, Mapping, and Remote Sensing at Wuhan University, where he was an assistant scientist and lecturer. In August 2003 he became a Ph.D. student in the ECE Department of the University of Illinois at Urbana-Champaign and received his Ph.D. degree in 2009. After graduation he spent one and a half years studying auditory neuroscience at the Center for Hearing and Balancing, Johns Hopkins University School of Medicine. He is currently a postdoctoral fellow at the Center for Language and Speech Processing, Johns Hopkins University. His research interests are in signal processing, human speech perception, machine learning, and automatic speech recognition.

Phani Sankar Nidadavolu received his B.E. (2006) in Electrical and Electronics Engineering from Andhra University, India, and his M.Tech. (2009) in Control Engineering from the Indian Institute of Technology Kanpur (IIT-K), India. He worked as a research associate (2012-2013) at the Speech and Image Processing Lab, Dept. of ECE, Indian Institute of Technology Hyderabad (IIT-H), India. He is currently a Ph.D. student at the Center for Language and Speech Processing, Dept. of ECE, Johns Hopkins University, USA. His research interests include robust speech recognition and machine learning.

Sri Harish Mallidi received his B.Tech. (2008) and M.S. (2010) in Electronics and Communications from the International Institute of Information Technology, Hyderabad (IIIT-H), India. He is currently a Ph.D. student at the Center for Language and Speech Processing, Dept. of ECE, Johns Hopkins University, USA. His research interests include signal processing for robust speech applications and machine learning. He has been a student member of the IEEE since 2013.

Hynek Hermansky (Fellow, IEEE) received the Dipl.Ing. degree from Brno University of Technology, Brno, Czech Republic, in 1972, and the Dr.Eng. degree from the University of Tokyo, Tokyo, Japan, in 1985. He is the Julian S. Smith Professor of Electrical Engineering and the Director of the Center for Language and Speech Processing at Johns Hopkins University, Baltimore, MD. He is also a Professor at the Brno University of Technology and an External Fellow at the International Computer Science Institute, Berkeley, CA. He has been working in speech processing for over 30 years, previously as a Director of Research at the IDIAP Research Institute, Martigny, Switzerland; a Titulary Professor at the Swiss Federal Institute of Technology, Lausanne, Switzerland; a Professor and Director of the Center for Information Processing, Oregon Health and Science University (OHSU), Portland; a Senior Member of Research Staff at US WEST Advanced Technologies, Boulder, CO; a Research Engineer at Panasonic Technologies, Santa Barbara, CA; and a Research Fellow at the University of Tokyo. His main research interests are in acoustic processing for speech recognition. Dr. Hermansky is a Fellow of the International Speech Communication Association. He was the General Chair of the 2013 IEEE Automatic Speech Recognition and Understanding Workshop, was in charge of plenary sessions at the 2011 International Conference on Acoustics, Speech, and Signal Processing (ICASSP) in Prague, Czech Republic, was the Technical Chair at the 1998 ICASSP in Seattle, WA, and was an Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING. He is also a Member of the Editorial Board of Speech Communication.
