A Study on Hands-Free Speech/Speaker Recognition


January, 2008

DOCTOR OF ENGINEERING

Longbiao Wang

Toyohashi University of Technology


Abstract

Automatic speech recognition (ASR) systems are known to perform reasonably well when the speech signals are captured using a close-talking microphone. However, there are many environments where the use of a close-talking microphone is undesirable for reasons of safety or convenience. Hands-free speech communication has become more and more popular in environments such as offices and car cabins. In this thesis, robust speaker localization, speech recognition and speaker identification methods for a distant-talking (that is, hands-free) environment are presented.

We first propose a robust speech/speaker recognition method that incorporates estimated speaker position information into Cepstral Mean Normalization (CMN), which is called Position-Dependent CMN (PDCMN). The system measures the transmission characteristics (the compensation parameters for position-dependent CMN) from several grid points in the room a priori. Four microphones are arranged in a T-shape on a plane, and the sound source position is estimated from the Time Delays of Arrival (TDOA) among the microphones using a proposed closed-form solution. The system then adopts the compensation parameter corresponding to the estimated position, applies a channel distortion compensation method to the speech (that is, position-dependent CMN) and performs speech/speaker recognition.

For close-talking speaker recognition, a speaker recognition method combining speaker-specific GMMs with speaker-adapted syllable-based HMMs has been proposed, which is robust to changes of speaking style in a close-talking environment. In this thesis, we extend this combination method to distant speaker recognition and integrate it with the proposed position-dependent CMN. Speaker recognition experiments were conducted on the NTT database (22 male speakers) and the Tohoku University and Panasonic database (20 male speakers). The speaker identification results by GMM showed that the proposed position-dependent CMN achieved relative error reduction rates of 64.0% over the case without CMN and 30.2% over position-independent CMN on the NTT database. By integrating the position-dependent CMN into the combined use of speaker-specific GMMs and speaker-adapted syllable-based HMMs, a further improvement was obtained.

For speech recognition, we propose a robust speech recognition method that combines multiple microphone-array processing with position-dependent CMN.


In a distant environment, the speech signal received by a microphone is affected by the microphone position, the distance and direction from the sound source to the microphone, and the quality of the microphone. Therefore, complementary use of multiple microphones may achieve robust recognition. The features obtained from the multiple channels are integrated by the following two types of processing. The first method uses the maximum vote or the maximum summation likelihood of the recognition results from multiple channels to obtain the final result, which is called multiple decoder processing. The second method calculates the output probability of each input at the frame level, and a single decoder using these output probabilities performs speech recognition, which is called single decoder processing. In this thesis, delay-and-sum beamforming combined with multiple decoder processing or single decoder processing is proposed; this is called multiple microphone-array processing. Furthermore, Position-Dependent CMN (PDCMN) is also integrated with the multiple microphone-array processing.

In a distant-talking environment, the length of the channel impulse response is longer than the short-term spectral analysis window, so conventional short-term spectrum based Cepstral Mean Normalization (CMN) is not effective under these conditions. In this thesis, we also propose a robust speech recognition method that combines a short-term spectrum based CMN with a long-term one. We assume that a static speech segment (such as a vowel) affected by reverberation can be modeled by a long-term cepstral analysis. Thus, the effect of long reverberation on a static speech segment may be compensated by the long-term spectrum based CMN. The cepstra of the static and non-static speech segments are normalized by the corresponding cepstral means. In this thesis, the concept of combining short-term and long-term spectrum based CMN is extended to PDCMN; we call this Variable-Term spectrum based PDCMN (VT-PDCMN). Since PDCMN/VT-PDCMN cannot normalize speaker variations, because a position-dependent cepstral mean contains the average speaker characteristics over all speakers, we also combine PDCMN/VT-PDCMN with conventional CMN in this study. Experiments on the proposed methods were conducted using limited vocabulary (100 words) distant-talking isolated word recognition in a real environment. Combining the multiple microphone-array processing using the single decoder with position-dependent CMN, a 3.0% improvement (a 46.9% relative error reduction rate) over delay-and-sum beamforming with conventional CMN was achieved in a real environment at 1.77 times the computational cost. The proposed variable-term spectrum based PDCMN achieved a relative error reduction rate of 60.9% over the conventional short-term spectrum based CMN and 30.6% over the short-term spectrum based PDCMN.

We also propose a blind dereverberation method based on spectral subtraction with a Multi-Channel Least Mean Square (MCLMS) algorithm for distant-talking speech recognition. In a distant-talking environment, the length of the channel impulse response is longer than the short-term spectral analysis window.


Therefore, the channel distortion is no longer multiplicative in the linear spectral domain but convolutional, and the conventional Cepstral Mean Normalization (CMN) is not effective in compensating for the late reverberation under these conditions. By treating the late reverberation as additive noise, a noise reduction technique based on spectral subtraction is proposed to estimate the power spectrum of the clean speech using the power spectra of the distorted speech and the unknown impulse responses. To estimate the power spectrum of the impulse response, a Variable Step-Size Unconstrained MCLMS (VSS-UMCLMS) algorithm for identifying the impulse response in the time domain is extended to the spectral domain. This method does not require knowledge of the speaker position, that is, it is position-independent. We conducted experiments on distorted speech signals simulated by convolving multi-channel impulse responses with clean speech. An average relative recognition error reduction of 17.8% over conventional CMN under various severe reverberant conditions was achieved using only 0.6 seconds of speech data to estimate the spectrum of the impulse response.


Abstract (translated from Japanese)

In a distant-talking environment, transmission distortion severely degrades the performance of speech recognition and speaker recognition. This study proposes speaker position estimation, speech recognition and speaker identification methods for hands-free environments.

We first propose robust distant-talking speech recognition and speaker recognition methods based on Position-Dependent Cepstral Mean Normalization (position-dependent CMN), which uses compensation parameters estimated in advance for each speaker position. The room is divided into several areas, and the transfer characteristics (position-dependent compensation parameters) from the center of each area to the microphones are measured beforehand. Four microphones are arranged in a T-shape, and the sound source position is estimated from the time delays of arrival between microphone pairs obtained by cross-correlation. The compensation parameter measured in advance for the estimated speaker position is then selected, the speech is compensated by CMN, and distant-talking speech recognition or speaker recognition is performed.

For close-talking speaker recognition, a method combining text-independent speaker recognition based on GMMs (Gaussian Mixture Models) with text-independent speaker recognition based on speaker-adapted syllable HMMs (Hidden Markov Models) has been proposed. This study extends this combination method to distant-talking speech and integrates it with the position-dependent CMN. Speaker recognition was evaluated on the Tohoku University and Panasonic isolated word database and the NTT database. The effect of the distant-talking environment on speaker recognition performance and the effectiveness of each method were analyzed experimentally. The proposed method reduced the error rate by more than half compared with the conventional method.

For speech recognition, we propose distant-talking speech recognition that combines position-dependent cepstral mean normalization with multiple microphone-array processing. Each microphone votes for its recognized word candidate and the word with the most votes is selected as the final result (voting method), or the word likelihoods from the microphones are summed for each word and the word with the maximum likelihood is selected (maximum summation likelihood method). Furthermore, we propose a multiple microphone-array method that integrates the voting method or the maximum summation likelihood method with delay-and-sum beamforming. Before the multiple microphone-array processing, the input of each channel is compensated by position-dependent CMN, and speech recognition is then performed. Isolated word recognition experiments were conducted in simulated and real environments. In the real environment, the combination of the proposed position-dependent CMN and multiple microphone processing achieved an improvement of 3.2% (a relative error reduction of 50.0%) over delay-and-sum beamforming based on conventional utterance-level CMN.

In distant-talking environments, the reverberation time is longer than the short-term analysis window. Conventional utterance-level cepstral mean normalization can remove reverberation within a frame, but cannot compensate for reverberation longer than the frame length. We assume that the effect of reverberation on static speech segments (such as vowels) can be modeled by spectral analysis with a long-term window. Based on this assumption, cepstral means obtained from analysis with both short and long windows are used to compensate for the reverberation in static and non-static speech segments. With conventional utterance-level CMN, the cepstral mean cannot be estimated accurately when the utterance to be recognized is short. To solve this problem, we propose a distant-talking speech recognition method based on position-dependent CMN, in which the speech is compensated using position-dependent cepstral means obtained from analysis with both short and long windows. Since the position-dependent cepstral mean is computed from multiple speakers for each position, it does not contain individual speaker characteristics and therefore cannot compensate for speaker variation. To compensate simultaneously for the transfer characteristics and the speaker characteristics, we realize distant-talking speech recognition based on normalization by a linear combination of the position-dependent cepstral mean obtained from short- and long-term window analysis and the speaker-specific cepstral mean, and evaluated it on the Tohoku University and Panasonic word database. The combined method greatly improved the recognition performance compared with utterance-level CMN and position-dependent CMN.

This study also proposes a blind dereverberation method based on spectral subtraction using a Multi-Channel Least Mean Square (MCLMS) algorithm for distant-talking speech recognition. In distant-talking environments, the reverberation time is longer than the short-term analysis window, so the channel distortion is no longer multiplicative but convolutional in the linear spectral domain. The effect of the late part of the impulse response is regarded as additive noise, and spectral subtraction is used to estimate the power spectrum of the clean speech from the power spectra of the reverberant speech and the impulse response. To estimate the power spectrum of the impulse response, the VSS-UMCLMS (Variable Step-Size Unconstrained MCLMS) algorithm for identifying the impulse response in the time domain is extended to the spectral domain. This method has the advantage that no prior knowledge of the sound source position is required. The proposed method was evaluated on reverberant speech created by convolving clean speech from the Tohoku University and Panasonic word database with multi-channel impulse responses recorded under various reverberant conditions. The proposed method achieved an error reduction rate of about 17.8% compared with conventional CMN.


Acknowledgements

This research work would not have been possible without the kind, continuous support and encouragement from my supervisors, colleagues, friends and family. I would like to take this opportunity to express my appreciation to my supervisor, Prof. Seiichi Nakagawa, who has given me much direction in my research and has been supportive throughout these years, especially for his ideas concerning this study and for his inspiring guidance and persistent encouragement, which contributed a great deal to the progress of this research. I would also like to thank Associate Prof. Norihide Kitaoka for all his guidance, helpful suggestions and valuable discussions on this research. My thanks also go to Prof. Yoshiaki Tadokoro and Associate Prof. Tomoyosi Akiba for all their comments and suggestions on the writing of this thesis. I also wish to thank the various people who gave a lot of help during the course of this work. Many thanks to all my colleagues in the Nakagawa laboratory and to all members of the doctoral meeting for their advice on this research. Thanks to the International Communication Foundation (ICF) and the Ministry of Education, Culture, Sports, Science and Technology of Japan (Monbukagakusho) for awarding me the financial means to complete this project. And finally, great thanks to my wife, my parents, and my relatives and friends who endured this long process with me, always offering support and love.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Purpose and Contributions of This Research
  1.3 Thesis Outline

2 An Overview of Automatic Speech Recognition and Speaker Recognition
  2.1 Introduction
  2.2 Speech Signal Analysis
    2.2.1 Linear Predictive Analysis
    2.2.2 Mel Frequency Cepstral Coefficient
    2.2.3 Temporal cepstral derivative
  2.3 Speech Recognition Based on Hidden Markov Model
    2.3.1 HMM based speech modeling and its application for speech recognition
    2.3.2 The Evaluation Problem: The Forward Algorithm
    2.3.3 The Decoding Problem: The Viterbi Algorithm
    2.3.4 The Estimation Problem: Baum-Welch Algorithm
    2.3.5 Extending to Continuous Mixture Density HMMs
  2.4 Basic Methods for Speaker Recognition
  2.5 Summary

3 Speaker Position Estimation
  3.1 Introduction
  3.2 Method of Speaker Position Localization
  3.3 Experiments of Speaker Position Localization
  3.4 Summary

4 Normalization of Transfer Characteristic in Cepstral Domain
  4.1 Introduction
  4.2 Conventional Cepstral Mean Normalization
  4.3 Position-dependent Cepstral Mean Normalization
    4.3.1 Real-time CMN
    4.3.2 Incorporate Speaker Position Information into Real-time CMN
    4.3.3 Problem and Solution
  4.4 Combination of Short-term and Long-term Spectrum Based CMN/PDCMN
    4.4.1 Combination of Short-term and Long-term Spectrum Based CMN
    4.4.2 Variable-term Spectrum Based Position-dependent CMN
    4.4.3 Static and Non-static Speech Segment Detection
  4.5 Summary

5 Distant-talking Speaker Recognition
  5.1 Introduction
  5.2 Speaker Recognition Based on Position-dependent CMN by Combining Speaker-specific GMM with Speaker-adapted HMM
    5.2.1 Speaker modeling
    5.2.2 Speaker identification procedure
  5.3 Experiments of Distant-talking Speaker Recognition
    5.3.1 Experimental setup
    5.3.2 Speaker recognition results
    5.3.3 Integrating speaker position estimation with speaker recognition
  5.4 Summary

6 Distant-talking Speech Recognition
  6.1 Introduction
  6.2 Multiple Microphone Processing
    6.2.1 Multiple decoder processing
    6.2.2 Single decoder processing
    6.2.3 Multiple microphone-array processing
  6.3 Combination of PDCMN/VT-PDCMN and Conventional CMN
    6.3.1 Fixed-weight combinational CMN
    6.3.2 Variable-weight combinational CMN
  6.4 Experiments of Distant-talking Speech Recognition
    6.4.1 Experimental Setup
    6.4.2 Recognition experiment by single microphone
    6.4.3 Recognition Experiment of Speech Uttered by Humans
    6.4.4 Experimental Results for Multiple Microphone Speech Processing
    6.4.5 Preliminary experimental results based on the combination of short-term and long-term spectrum based CMN
    6.4.6 Experimental results based on the combination of PDCMN/VT-PDCMN and conventional CMN
    6.4.7 Discussion: Analysis of Effect of Compensation Parameter Estimation for CMN on Speech/Speaker Recognition
  6.5 Summary

7 Dereverberation Based on Spectral Subtraction by Multi-channel LMS Algorithm for Distant-talking Speech Recognition
  7.1 Introduction
  7.2 Dereverberation Based on Spectral Subtraction
  7.3 Compensation Parameter Estimation for Spectral Subtraction by Multi-channel LMS Algorithm
    7.3.1 Blind Channel Identification in Time Domain
    7.3.2 Extending VSS-UMCLMS Algorithm to Compensation Parameter Estimation for Spectral Subtraction
  7.4 Experiments
    7.4.1 Experimental setup
    7.4.2 Experimental results of simulated speech data and discussion
    7.4.3 Experimental results of real-world speech data and discussion
  7.5 Summary

8 Conclusions
  8.1 Summary
  8.2 Future Problems and Directions

Bibliography

A List of 200 Words Selected from Tohoku University and Panasonic Isolated Spoken Word Database

List of Publications

List of Figures

2.1 Speech production model
2.2 Block diagram of LPC preprocessor for speech signals
2.3 Flow diagram for MFCC extraction
2.4 Example of HMM
2.5 A trellis in the forward computation
3.1 Microphones arranged for speaker position estimation (d = 20 cm)
3.2 Experimental environment
3.3 Room configuration (room size: (W) 3 m x (L) 3.45 m x (H) 2.6 m)
3.4 Waveforms of Japanese isolated word "ASAHI" emitted from a loudspeaker located at various positions
4.1 Illustration of compensation of transmission characteristics between human and loudspeaker (same microphone)
5.1 Speaker identification by combining speaker-specific GMM with speaker-adapted syllable-based HMMs
5.2 Speaker identification by GMM (isolated word)
5.3 Speaker identification by GMM (NTT database)
5.4 Speaker recognition error rates based on PDCMN by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs for every 3 test utterances (models were adapted/trained using 100 isolated words)
5.5 Speaker recognition error rates based on PDCMN by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs for every 3 test utterances (models were adapted/trained using 50 isolated words)
5.6 Speaker recognition error rates based on PDCMN by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs for every 3 test utterances (models were adapted/trained using 30 isolated words)
6.1 Illustration of multiple microphone processing using multiple decoders (utterance level)
6.2 Illustration of multiple microphone processing using single decoder (frame level)
6.3 Extended area
6.4 Speech recognition and speaker recognition performances using the conventional CMN with different length of utterance
6.5 Comparison of speaker recognition results for different distances between the sound source and the microphone
6.6 Comparison of speech recognition results for different distances between the sound source and the microphone
7.1 Impulse response of the first channel from the 4-th multi-channel impulse responses (sampling frequency 12 kHz)
7.2 Window
7.3 Logarithm estimation error of Japanese isolated word "ASAHI" from the 4-th reverberant speech
7.4 Speech recognition rates based on conventional CMN, the proposed method and the inverse filtering method with different length of utterance to calculate the cepstral mean
7.5 Spectrogram of clean speech, reverberant speech and dereverberant speech of Japanese isolated word "GINKOO"
7.6 Spectrogram of clean speech, reverberant speech and dereverberant speech of Japanese isolated word "SUUGAKU"

List of Tables

3.1 Speaker position estimation result using about 300 ms of speech data
3.2 Speaker position estimation result using one utterance (about 0.6 second)
5.1 A list of 116 syllables
5.2 Four methods for speaker recognition
5.3 Speaker identification error rate by the combination method using position-dependent CMN (3 utterances)
5.4 Speaker recognition error rate (%) (3 utterances); models were adapted/trained using 100 words
5.5 Speaker recognition error rate by optimal α (%)
5.6 Speaker recognition error rate (%) (3 utterances); models were adapted/trained using 30/50 words
5.7 Comparison of speaker recognition error rate of ideal PDCMN with realistic PDCMN (%)
6.1 Recognition results for speech emitted by a loudspeaker (average of results obtained by 4 independent microphones: %)
6.2 Recognition results of human utterances (results obtained by microphone 1 shown in Fig. 3.1: %)
6.3 Comparison of recognition accuracy of a single microphone with multiple microphones using multiple decoders (%)
6.4 Comparison of recognition accuracy of multiple microphone-array processing using a single decoder with that using multiple decoders (%)
6.5 Individual speech recognition results for short-term and long-term cepstra; cepstral means were estimated from 100 isolated words for each speaker (single microphone: %)
6.6 Speech recognition results for the combination of short-term and long-term spectrum based CMN; cepstral means were estimated from 1, 10 and 100 words for each speaker (single microphone: %)
6.7 Speech recognition results for the combination of PDCMN/VT-PDCMN and conventional CMN (%)
6.8 Speech recognition results for the combination of PDCMN/VT-PDCMN and conventional CMN using a single microphone (%)
6.9 Speech recognition results for the combination of PDCMN/VT-PDCMN and conventional CMN using a microphone array (%)
6.10 An example of Euclidean cepstrum distance between different speakers, different vowels and different positions
6.11 Euclidean cepstrum distance
6.12 Speech recognition error rate (%)
6.13 Speaker recognition error rate (%)
6.14 Distances of each area (m)
7.1 Detailed recording conditions for impulse response measurement; "angle": recorded direction between microphone and loudspeaker; "RT60 (second)": reverberation time in the room; "S": small, "L": large
7.2 Baseline results (%)
7.3 Speech recognition performance of a single channel for simulated reverberant speech; 2 or 4 microphones were used to estimate the spectrum of the impulse response; only the speech data of the first channel was evaluated; for the proposed method, the first-channel speech was compensated by the impulse response of the first channel (%)
7.4 Speech recognition performance of multiple channels for simulated reverberant speech; 4 microphones were used to estimate the spectrum of the impulse response; delay-and-sum beamforming was applied to the 4-channel dereverberated speech signals; for the proposed method, each channel was compensated by the corresponding impulse response (%)
7.5 Speech recognition performance of a single channel for real-world reverberant speech; only the speech data of the first channel of each area was evaluated; for the proposed method, the first-channel speech was compensated by the impulse response of the first channel estimated by 4 microphones; the number of reverberation windows D in Eq. (7.4) was set to 2, 4 and 8 (%)
A.1 List of 200 words selected from Tohoku University and Panasonic isolated spoken word database

Chapter 1

Introduction

1.1 Background

Speech communication has been the natural and dominant mode of human-human interaction and information exchange. Automatic speech/speaker recognition has been the object of intense study for more than 50 years [2, 3, 86, 6, 4, 7, 8, 15, 16, 45, 17, 18, 19, 37]. In the last 20 years, speech processing technology has come to perform reasonably well, and more and more commercial systems have been developed. In most of them, speech signals are captured in a noise-free environment using a close-talking microphone worn near the mouth of the speaker. Read speech and similar types of speech, e.g. news broadcasts read from a text, can be recognized with accuracy higher than 95% using state-of-the-art Automatic Speech Recognition (ASR) technology with a close-talking microphone [19]. In many applications, such as in-car or meeting-room systems, hands-free speech/speaker processing technology is desirable for reasons of safety or convenience, in order to implement a more natural and user-friendly speech-based human-machine interface. For example, while operating a vehicle, the very act of wearing a microphone is distracting and dangerous; in Japan, legislation prohibiting drivers from using hand-held cellular phones while driving has been in force since October 2004. In a meeting room, microphones restrict the movement of the participants by tethering them to their seats with wires. It is also unlikely that users of an information kiosk will want to put on a headset before asking for help. In all of these situations, wires can break and tangle, creating a frustrating experience for the user. A better alternative to head-mounted microphones is therefore to place a microphone at some distance from the user. Unfortunately, in a distant-talking environment, reverberation and background noise can drastically degrade the speech signal, which in turn degrades speech/speaker recognition performance.

For a speech signal corrupted by reverberation, the direct speech signal is mixed with reflections arriving from various directions, and additive noise decreases the Signal-to-Noise Ratio (SNR), which makes speech/speaker recognition difficult. Many technologies have been proposed and studied for hands-free speech/speaker recognition [20, 21, 22, 23, 24, 88, 89]. Compensating the input feature is the main way to reduce the mismatch between the real and the training environments. Cepstral Mean Normalization (CMN) is a simple and effective way of normalizing the feature space and thereby reducing channel distortion [4, 34, 55]. CMN reduces the errors caused by the mismatch between test and training conditions, and it is also very simple to implement; it has therefore been adopted in many current systems. However, there exists a tradeoff between recognition error and delay [55], so the usual CMN cannot obtain excellent recognition performance with a short delay. Furthermore, the conventional CMN, in which cepstral means are estimated from the entire current utterance using the short-term analysis window, is not effective in a distant-talking environment because the duration of the impulse response of reverberation usually has a much longer tail than the short-term analysis window. Applying CMN under these conditions is therefore a challenging problem.

Several studies have addressed this problem. Raut et al. [56, 57] use preceding states as units of preceding speech segments, estimate their contributions to the current state using a maximum likelihood function, and adapt the models accordingly. In [61, 62], a multiresolution channel normalization based speech recognition front end was implemented by subtracting the mean of the log magnitude spectrum using a long-term (high frequency resolution; 2 seconds) spectral analysis window. Inverse filtering is a straightforward way to compensate for reverberant speech [105, 32]. In [105], a Multiple-input/output INverse Theorem (MINT) was proposed for realizing exact inverse filtering of acoustic impulse responses in a room. The impulse response can be measured a priori [104] or estimated blindly [110, 111]. Blind Channel Identification (BCI) is desirable in acoustical signal processing systems, since a priori measurement of the impulse response is costly and may be unavailable in some cases. The innovative idea of BCI was first proposed by Sato in [38]. Early studies of blind channel identification and equalization focused primarily on higher (than second) order statistics (HOS) based methods [39]. In [108, 109, 110, 111], adaptive multi-channel least mean square and Newton algorithms for blind channel identification were proposed. However, a non-minimum-phase problem arises when estimating the impulse response or calculating the inverse filter. In [29, 31, 30], a method that estimates an inverse filter based on the harmonic structure of speech was proposed. In [27, 28], a novel dereverberation method utilizing multi-step forward linear prediction was proposed: the linear prediction coefficients are estimated in the time domain, and the amplitude of the late reflections is suppressed using spectral subtraction in the spectral domain. This showed that even if the room impulse response is non-minimum phase, it is possible to recover the late reflection amplitude.

indicated that even if the room impulse response is non-minimum phase, it is possible to recover the late reflection amplitude. Similar to [27, 28], a reverberation compensation method for speaker recognition using spectral subtraction in which the late reververation was treated as additive noise was proposed [87, 88]. However, they estimated the optimum compensation parameter of the late reverberation empirically on a development dataset, which degraded the parctical utility. Many other techniques have been proposed for distant-talking speech/speaker recognition. Beamforming is one of the simplest and the most robust means of spatial filtering, which can discriminate between signals based on the physical locations of the signal sources [22, 23, 76]. In [24], a new approach to microphonearray processing was proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of the correct hypothesis. Since the inverse filtering is very difficult to estimate blindly, training HMMs using reverberant data is an alternate method to handle with reverberant speech [41, 42]. However, in particular, in a reverberant environment with a reverberation time (RT) of more thant 0.5 seconds, the ASR performance cannot be improved even with an acoustic model trained with a matched reverberation condition [27]. In [47], Sehr et al. proposed the feature-domain dereverberation capabilities of a novel approach for ASR in reverberant environments. By combining a network of clean speech HMMs and a reverberation model, the most likely combination of the HMM output and the reverberation model output was found during decoding time ty an extended version of the Viterbi algorithm. For hands-free speech communication, there are many other applications except the distant-talking speech recognition and speaker recognition (including speaker identification and speaker verification). In hands-free environments, sound source localization and tracking [80, 65, 66], blind speech separation [82, 81], speech enhancement [83] etc. have been increasing concern. Although hands-free speech communication has been studies for many years, the performance is still not sufficient, and there remain a lot of challenged task to solve.

1.2 Purpose and Contributions of This Research

Hands-free speech signals contain a lot of information. Using the sound source location, speaking direction, speech content, speaker information, etc., a user-friendly human-machine interface can be realized. The purpose of this thesis is to implement robust and efficient hands-free speech communication systems by developing and integrating a speaker localization system, a distant-talking speech recognition system and a distant-talking speaker recognition system.


In this thesis, we first propose a novel closed-form speaker localization approach and then present a robust speech/speaker recognition method using speaker position information, in which the conventional CMN is modified for more effective performance and lower computational complexity. We then propose a dereverberation method based on spectral subtraction by a multichannel Least Mean Square (LMS) algorithm for robust distant-talking speech recognition.

In a distant-talking environment, channel distortion drastically degrades the speech recognition performance. This is predominantly caused by the mismatch between the real and the training environments. Cepstral Mean Normalization (CMN) is a simple and effective way of normalizing the feature space and thereby reducing channel distortion [4, 34, 55]. For CMN to be effective, the length of the channel impulse response needs to be shorter than the short-term spectral analysis window, which is usually 16-25 ms. However, the duration of the impulse response of reverberation usually has a much longer tail in a distant-talking environment. Therefore, the conventional CMN, in which cepstral means are estimated from the entire current utterance using the short-term analysis window, is not effective under these conditions. In this thesis, we combine a short-term spectrum based CMN with a long-term spectrum based CMN, which we call Variable-Term spectrum based CMN (VT-CMN). We assume that static speech segments (such as vowels, for example) affected by reverberation can be modeled by a long-term cepstral analysis. Thus, the effect of long reverberation on a static speech segment may be compensated by the long-term spectrum based CMN. However, a tradeoff exists between recognition error and delay when using the conventional CMN [55]: a long delay is likely to be unacceptable when the utterance is long, while if the utterance is short, an accurate cepstral mean cannot be estimated. To overcome these problems, we propose a robust speech/speaker recognition method using a new real-time CMN based on speaker position, which we call Position-Dependent CMN (PDCMN). In this method, we measure the transmission characteristics (the compensation parameters for position-dependent CMN) from certain grid points in a room a priori. The system then adopts the compensation parameter corresponding to the estimated position, applies a channel distortion compensation method to the speech (that is, position-dependent CMN) and performs speech/speaker recognition. Furthermore, the concept of variable-term spectrum based CMN is extended to PDCMN; we call this method Variable-Term spectrum based PDCMN (VT-PDCMN). PDCMN or VT-PDCMN can indeed compensate efficiently for the channel transmission characteristics depending on the speaker position, but cannot normalize the speaker variation because a position-dependent cepstral mean does not contain individual speaker characteristics. On the contrary, the conventional CMN can compensate for both the transmission and the speaker variations, but cannot achieve good recognition performance for short utterances because a sufficient phonemic balance cannot be obtained.
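To make the difference between conventional CMN and PDCMN concrete, the following minimal sketch contrasts the two normalizations on a matrix of cepstral features. It is only an illustration under simple assumptions (MFCC-like features, a lookup table of precomputed position-dependent means); the function and variable names such as `position_means` are hypothetical and are not taken from the thesis implementation.

```python
import numpy as np

def conventional_cmn(cepstra):
    """Subtract the cepstral mean computed over the entire current utterance.

    cepstra: (num_frames, num_coeffs) array of cepstral features.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def position_dependent_cmn(cepstra, position_means, position_id):
    """Subtract a cepstral mean measured a priori for the estimated
    speaker position (the compensation parameter of PDCMN).

    position_means: dict mapping a grid-point/area id to a
                    (num_coeffs,) compensation vector.
    """
    return cepstra - position_means[position_id]

# Usage sketch: conventional CMN needs the whole utterance to form its mean,
# whereas PDCMN can be applied frame by frame once the position is estimated.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 12))             # hypothetical 12-dim cepstra
    grid = {0: rng.normal(size=12), 1: rng.normal(size=12)}
    print(conventional_cmn(feats).shape)
    print(position_dependent_cmn(feats, grid, position_id=1).shape)
```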


Thus, the position-dependent cepstral mean is combined with the conventional cepstral mean in order to compensate simultaneously and effectively for the channel distortion and the speaker variation.

In a distant environment, the speech signal received by a microphone is affected by the microphone position and the distance from the sound source to the microphone. If an utterance suffers fatal degradation from such effects, the system cannot recognize it correctly. Fortunately, the transmission characteristics from the sound source to each microphone are different, and the effect of channel distortion on each microphone (which may contain estimation errors) is also different. Therefore, complementary use of multiple microphones may achieve robust recognition. In this thesis, the maximum vote (that is, the Voting Method (VM)) or the maximum summation likelihood (that is, the Maximum-Summation-Likelihood Method (MSLM)) over all channels is used to obtain the final result, which is called multiple decoder processing. This should yield robust performance in a distant environment. However, the computational complexity of multiple decoder processing is K (the number of input streams) times that of a single input. To reduce the computational cost, the output probability of each input is calculated at the frame level, and a single decoder using these output probabilities performs speech recognition, which is called single decoder processing. To utilize the spatial filtering ability, delay-and-sum beamforming is combined with multiple decoder processing or single decoder processing.
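As an illustration of the multiple decoder combination, the sketch below merges per-channel recognition results by the voting method and by summing per-word log-likelihoods. It is a simplified sketch under the assumption that each channel's decoder returns a ranked list of (word, log-likelihood) hypotheses; the data structures and example words are hypothetical and not taken from the thesis software.

```python
from collections import Counter, defaultdict

def voting_method(channel_results):
    """channel_results: one list per channel of (word, log_likelihood) pairs,
    ranked best-first. Each channel votes for its top word; the most voted
    word is returned as the final result."""
    votes = Counter(hyps[0][0] for hyps in channel_results if hyps)
    return votes.most_common(1)[0][0]

def max_summation_likelihood(channel_results):
    """Sum the log-likelihood of each word over all channels and return the
    word with the maximum summed score."""
    score = defaultdict(float)
    for hyps in channel_results:
        for word, loglik in hyps:
            score[word] += loglik
    return max(score, key=score.get)

# Usage sketch with two hypothetical channels.
ch1 = [("asahi", -120.3), ("asuka", -125.9)]
ch2 = [("asuka", -118.7), ("asahi", -119.0)]
print(voting_method([ch1, ch2]))             # tie between the two top words
print(max_summation_likelihood([ch1, ch2]))  # "asahi": -239.3 beats "asuka": -244.6
```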


For speaker recognition, a text-independent speaker recognition method combining speaker-specific Gaussian Mixture Models (GMMs) with syllable-based HMMs adapted to the speakers by MAP was proposed in [36]. The robustness of this speaker recognition method against changes of speaking style in a close-talking environment was evaluated in [36]. In this thesis, we extend this combination method to distant speaker recognition and integrate it with the proposed position-dependent CMN. Our experiments showed that the proposed method improved the speaker recognition performance remarkably in a distant environment.

In this thesis, we also propose a blind reverberation reduction method based on spectral subtraction by an adaptive Multi-Channel Least Mean Square (MCLMS) algorithm for distant-talking speech recognition. Speech captured by distant-talking microphones is distorted by reverberation. With a long impulse response, the spectrum of the distorted speech is approximated by convolving the spectrum of the clean speech with the spectrum of the impulse response. We treat the late reverberation as additive noise, so that a noise reduction technique based on spectral subtraction can easily be applied to compensate for it. By excluding the phase information from the dereverberation operation, as in [26, 28], dereverberation in the power spectral domain provides robustness to certain errors that a conventional, sensitive inverse filtering method cannot achieve. The compensation parameter (that is, the spectrum of the impulse response) for spectral subtraction is required. In [109, 111], an adaptive MCLMS algorithm was proposed to blindly identify the channel impulse response in the time domain. In this thesis, we extend this method to blindly estimate the spectrum of the impulse response for spectral subtraction in the frequency domain.
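The following sketch shows the kind of power-spectral-domain compensation described above: the late reverberation is treated as additive noise and subtracted from the observed power spectrum. It is a minimal single-frame illustration under the assumption that an estimate of the late-reverberation power is already available (in the thesis this comes from the blindly estimated impulse response); the over-subtraction weight, flooring constant and variable names are illustrative only.

```python
import numpy as np

def spectral_subtraction(observed_power, late_reverb_power,
                         over_subtraction=1.0, floor=0.01):
    """Estimate the clean-speech power spectrum of one frame by subtracting
    an estimate of the late-reverberation power.

    observed_power, late_reverb_power: (num_bins,) power spectra.
    over_subtraction: subtraction weight (illustrative parameter).
    floor: fraction of the observed power used as a spectral floor so that
           the result never becomes negative.
    """
    estimate = observed_power - over_subtraction * late_reverb_power
    return np.maximum(estimate, floor * observed_power)

# Usage sketch with random spectra standing in for real frames.
rng = np.random.default_rng(1)
obs = rng.uniform(0.5, 2.0, size=257)        # e.g. 512-point FFT, one-sided
late = 0.3 * rng.uniform(0.5, 2.0, size=257)
print(spectral_subtraction(obs, late).shape)
```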

1.3 Thesis Outline

Chapter 2 presents an overview of the basic approaches to speech recognition and speaker recognition. We first describe feature extraction methods, e.g. Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficients (MFCC), used for speech/speaker recognition. We then introduce automatic speech recognition based on Hidden Markov Models (HMMs). The fundamental theory of HMMs and the solutions to their evaluation, decoding and learning problems are described. Finally, a basic speaker identification approach based on Gaussian Mixture Models (GMMs) is briefly described.

In Chapter 3, we propose a novel sound source localization method using a closed-form solution based on Time Delay of Arrival (TDOA). The TDOA between distinct microphone pairs for sound source localization is first described. Then, we propose a closed-form solution for TDOA based speaker localization utilizing the symmetry of a T-shaped arrangement of four microphones. Evaluation experiments and results are described and discussed at the end of the chapter.

Chapter 4 presents two transfer characteristic normalization methods in the cepstral domain. An environmentally robust, real-time, effective channel compensation method called Position-Dependent CMN (PDCMN) is first proposed. To overcome the problem that the length of the channel impulse response is longer than the short-term spectral analysis window, a long-term spectrum based CMN is proposed to compensate for the effect of late reverberation on static speech segments. Then, a combination of short-term and long-term spectrum based CMN/PDCMN is presented.

In Chapter 5, robust distant speaker recognition based on position-dependent CMN, combining speaker-specific GMMs with speaker-adapted HMMs, is proposed. We briefly describe speaker modeling based on Gaussian Mixture Models (GMMs) and speaker-adapted syllable-based HMMs. Then, the speaker identification procedure combining these two types of speaker models is described, and it is incorporated with the position-dependent CMN. At the end of the chapter, we evaluate and discuss distant-talking speaker identification experiments using the NTT speaker recognition database (22 male speakers) and the Tohoku University and Panasonic isolated spoken word database (20 male speakers).


In Chapter 6, we propose a robust distant-talking speech recognition method by combining the transfer characteristic normalization methods in the cepstral domain proposed in Chapter 4 with multiple microphone-array processing. We describe a novel multiple microphone processing technique using either multiple decoder processing or single decoder processing. Multiple decoder processing uses the maximum vote or the maximum summation likelihood of the recognition results from multiple channels to obtain the final result, while single decoder processing calculates the output probability of each input at the frame level and uses a single decoder over these output probabilities to perform speech recognition, resulting in a lower computational cost. At the end of the chapter, we conduct experiments on our proposed method using limited vocabulary (100 words) distant-talking isolated word recognition in a seminar room environment.

In Chapter 7, we present a reverberation compensation method based on Spectral Subtraction (SS) by a multichannel LMS algorithm. Assuming that the late reverberation behaves like additive noise, a noise reduction technique based on spectral subtraction is applied. Then, a time-domain adaptive multichannel LMS algorithm is described and extended to blindly estimate the optimal parameters for spectral subtraction. At the end of the chapter, we evaluate our method using reverberant speech simulated by convolving clean speech with multichannel impulse responses. An average relative recognition error reduction of 17.8% over conventional CMN under various severe reverberant conditions was achieved.

The conclusions drawn from this research are summarized in Chapter 8, where some directions for future research are presented as well.

Chapter 2

An Overview of Automatic Speech Recognition and Speaker Recognition

Research in automatic speech and speaker recognition has now spanned five decades. Hidden Markov Models (HMMs) have been the most prominent technique used in modern speech recognition systems [1, 8, 15, 16, 25, 18]. The basic HMM theory was first described in the classic paper by Baum and his colleagues [1]. The HMM is a very powerful statistical method for modeling the observed data samples of a discrete-time series. The underlying assumption of the HMM is that the data samples can be well characterized as a parametric random process, and that the parameters of the stochastic process can be estimated in a precise and well-defined framework. For speaker recognition, various types of speaker models have long been studied. HMMs have become the most popular statistical tool for the text-dependent task. The best results have been obtained using continuous density HMMs (CHMMs) for modeling speaker characteristics [46, 51]. For the text-independent task, the temporal sequence modeling capability of the HMM is not required. Therefore, a one-state CHMM, also called a Gaussian Mixture Model (GMM), has been widely used as a speaker model [48, 45]. In this chapter, we review the basic methods for speech recognition based on HMMs and speaker recognition based on GMMs and HMMs. We also describe the speech signal analysis (feature extraction) used in speech/speaker recognition.


2.1 Introduction

2.2 Speech Signal Analysis

It is considered that humans perform a spectral analysis when hearing a speech signal. Thus, a short-time spectral analysis should also be important for speech recognition. A simplified view of speech production is given in Fig. 2.1: the speech wave is produced by a sound source and by articulation determined by the shape of the vocal tract. Therefore, the spectrum of a short-time speech segment is the product of the fine-structure spectrum corresponding to the sound source and the spectral envelope corresponding to the articulation based on the shape of the vocal tract.

[Figure 2.1: Speech production model. A voiced sound source (impulse train) or an unvoiced sound source (white noise) excites the vocal tract (articulation) to produce the speech wave.]

In this thesis, LPC (Linear Predictive Coding) cepstral coefficients and MFCC (Mel-Frequency Cepstral Coefficients) are used as acoustic features.

2.2.1 Linear Predictive Analysis

Fig. 2.2 shows the block diagram of the LPC front-end preprocessor [16]. The basic steps in the processing are as follows (steps 1 to 3 are also used in Mel Frequency Cepstral Coefficient (MFCC) feature extraction):

1. Preemphasis - The digitized speech signal s(n), obtained by an analog-to-digital converter (ADC) with some sampling frequency f_s, is put through a low-order digital system (typically a first-order FIR filter) to spectrally flatten the signal and to make it less susceptible to finite-precision effects later in the signal processing. The preemphasis network used is the fixed first-order system

H(z) = 1 - a z^{-1}, \quad 0.9 \leq a \leq 1.0, \qquad (2.1)

where a is a preemphasis coefficient.

[Figure 2.2: Block diagram of the LPC preprocessor for speech signals: speech signal -> preemphasis -> windowing -> autocorrelation analysis -> Durbin algorithm (LPC coefficients) -> LPC parameter conversion -> LPC cepstral coefficients.]

2. Windowing - The preemphasized speech signal is blocked into frames of N samples, with adjacent frames separated by M samples. The next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. Denoting the window by w(n), a typical window used for the autocorrelation method of LPC is the Hamming window, which has the form

w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \leq n \leq N-1. \qquad (2.2)
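As a concrete illustration of steps 1 and 2, the sketch below applies preemphasis and splits a signal into overlapping Hamming-windowed frames. It is a minimal sketch assuming a mono waveform stored in a NumPy array; the frame length and shift values are illustrative and are not the settings used in the thesis experiments.

```python
import numpy as np

def preemphasize(signal, a=0.97):
    """First-order FIR preemphasis: y(n) = s(n) - a*s(n-1), cf. Eq. (2.1)."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

def frame_and_window(signal, frame_len=400, frame_shift=160):
    """Block the signal into frames of N samples shifted by M samples and
    apply a Hamming window (Eq. 2.2) to each frame."""
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames

# Usage sketch on a synthetic one-second signal.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
frames = frame_and_window(preemphasize(sig))
print(frames.shape)   # (num_frames, frame_len)
```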

3. Autocorrelation Analysis - Each frame of the windowed signal \tilde{x}_l(n) is next autocorrelated to give

\tilde{r}_l(m) = \sum_{n=0}^{N-1-m} \tilde{x}_l(n)\,\tilde{x}_l(n+m), \quad m = 0, 1, \cdots, p, \qquad (2.3)

where the highest autocorrelation value, p, is the order of the LPC analysis; here p is chosen to be 14. A side benefit of the autocorrelation analysis is that the zeroth autocorrelation \tilde{r}_l(0) is the energy of the l-th frame.

4. LPC analysis - The next step is the LPC analysis, which converts each frame of p + 1 autocorrelations into LPC coefficients. The basic idea behind the LPC model is that a given speech sample at time t, s(t), can be approximated as a linear combination of the past p speech samples, such that [16]

s(t) = \sum_{i=1}^{p} a_i\, s(t-i) + G\,u(t), \qquad (2.4)

where u(t) is a normalized excitation, G is the gain of the excitation and \{a_i\} are the LPC coefficients. Durbin's method [16] is used to calculate the LPC coefficients, and the algorithm is as follows:

E^{(0)} = r(0) \qquad (2.5)

k_i = \frac{r(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} r(|i-j|)}{E^{(i-1)}}, \quad 1 \leq i \leq p \qquad (2.6)

\alpha_i^{(i)} = k_i \qquad (2.7)

\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)}, \quad 1 \leq j \leq i-1 \qquad (2.8)

E^{(i)} = (1 - k_i^2)\, E^{(i-1)} \qquad (2.9)

where the subscript l on r_l(m) is omitted. This set of equations is solved recursively for i = 1, 2, \cdots, p, and the final solution is given as

a_m = \alpha_m^{(p)}, \quad 1 \leq m \leq p. \qquad (2.11)
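A direct transcription of the Durbin recursion in Eqs. (2.5)-(2.9) and (2.11) is sketched below. It is a minimal reference implementation for illustration, assuming the autocorrelation values r(0..p) are already available; it is not the thesis code.

```python
import numpy as np

def durbin(r, p):
    """Levinson-Durbin recursion: autocorrelations r[0..p] -> LPC coefficients.

    Returns (a, E) where a[m] corresponds to a_m in Eq. (2.11) (a[0] is unused)
    and E is the final prediction error E^(p).
    """
    a = np.zeros(p + 1)
    E = r[0]                                   # E^(0) = r(0)            (2.5)
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / E                            # reflection coefficient  (2.6)
        a_new = a.copy()
        a_new[i] = k                           # alpha_i^(i) = k_i       (2.7)
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]     # alpha update            (2.8)
        a = a_new
        E = (1.0 - k * k) * E                  # E^(i)                   (2.9)
    return a, E

# Usage sketch: autocorrelation of one synthetic windowed frame, order p = 14.
rng = np.random.default_rng(0)
x = np.hamming(400) * (np.sin(2 * np.pi * 0.05 * np.arange(400))
                       + 0.1 * rng.normal(size=400))
r = np.array([np.dot(x[:len(x) - m], x[m:]) for m in range(15)])
a, E = durbin(r, 14)
print(a[1:5], E)
```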

5. LPC parameter conversion to cepstral coefficients - A very important LPC parameter set for speech/speaker recognition, which can be derived directly from the LPC coefficient set, is the set of LPC cepstral coefficients C_n. The recursion used is [16]

C_1 = -a_1 \qquad (2.12)

C_n = -a_n - \sum_{m=1}^{n-1} \left(1 - \frac{m}{n}\right) a_m\, C_{n-m}, \quad 2 \leq n \leq p \qquad (2.13)

C_n = - \sum_{m=1}^{n-1} \left(1 - \frac{m}{n}\right) a_m\, C_{n-m}, \quad n > p. \qquad (2.14)

The cepstral coefficients, which are the coefficients of the Fourier transform representation of the log magnitude spectrum, have been shown to be a more robust, reliable feature set for speech recognition than the LPC coefficients, the PARCOR coefficients, or the log area ratio coefficients.

6. Mel-scale warping of the cepstral coefficients - Although the LPC-derived cepstral coefficients are a good representation of the speech, they are distributed along a linear frequency axis. This is undesirable because the ability of the human ear to discriminate between frequencies is approximated by a logarithmic function of the frequency [9]. Furthermore, it was demonstrated [10] that mel-scaled coefficients yield superior recognition accuracy compared to linearly scaled ones. Therefore, there is strong motivation for transforming the LPC cepstral coefficients onto a mel-frequency scale. The transformation method proposed in [11] is used in this thesis.
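The conversion in Eqs. (2.12)-(2.14) can be written as a short recursion. The sketch below follows the sign convention of those equations and is only an illustrative companion to the `durbin` sketch above; the function name and toy coefficients are my own, not the thesis code.

```python
def lpc_to_cepstrum(a, num_ceps):
    """Convert LPC coefficients a[1..p] (a[0] unused) into LPC cepstral
    coefficients C[1..num_ceps] following Eqs. (2.12)-(2.14)."""
    p = len(a) - 1
    C = [0.0] * (num_ceps + 1)                  # C[0] unused here
    for n in range(1, num_ceps + 1):
        acc = -a[n] if n <= p else 0.0          # the -a_n term exists only for n <= p
        for m in range(1, n):
            a_m = a[m] if m <= p else 0.0       # a_m = 0 beyond the LPC order
            acc -= (1.0 - m / n) * a_m * C[n - m]
        C[n] = acc
    return C[1:]

# Usage sketch with a toy 2nd-order LPC set (a_1 = 0.9, a_2 = -0.2).
print(lpc_to_cepstrum([0.0, 0.9, -0.2], num_ceps=5))
```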

2.2.2 Mel Frequency Cepstral Coefficient

Another feature extraction method, the MFCC, is introduced in this section. The flow diagram for MFCC extraction is shown in Fig. 2.3 [16, 18]. First, the FFT is performed on a windowed short-time signal. Then, the log power of the spectrum obtained above is mapped onto the mel scale using triangular overlapping windows. Finally, the Discrete Cosine Transform (DCT) is applied to the mel log powers.

[Figure 2.3: Flow diagram for MFCC extraction (speech signal, windowing, mel-scale filter bank, MFCC).]

We define a filterbank with L filters, where filter l (l = 1, 2, \cdots, L) is a triangular filter given by

W(k; l) = \begin{cases} \dfrac{k - k_{lo}(l)}{k_c(l) - k_{lo}(l)}, & k_{lo}(l) \leq k \leq k_c(l) \\ \dfrac{k_{hi}(l) - k}{k_{hi}(l) - k_c(l)}, & k_c(l) \leq k \leq k_{hi}(l), \end{cases} \qquad (2.15)

where k_{lo}(l), k_c(l) and k_{hi}(l) are the lowest, central and highest frequencies of the l-th filter. For adjacent filters, the relation

k_c(l) = k_{hi}(l-1) = k_{lo}(l+1) \qquad (2.16)

holds. The central frequencies k_c(l) are uniformly spaced on the mel scale. The mel scale is given by

Mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \qquad (2.17)

The log-amplitude at the output of each filter is then computed as

m(l) = \ln\left[\sum_{k=k_{lo}(l)}^{k_{hi}(l)} W(k; l)\,|S(k)|\right], \quad l = 1, \cdots, L. \qquad (2.18)

Finally, the mel-frequency cepstrum is the DCT of the L filter outputs. In this thesis, the number of filters L is set to 24, and 10 or 12 MFCCs are used.
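A compact sketch of this pipeline (triangular mel filterbank, log filter outputs, DCT) is given below. It is a simplified illustration assuming a magnitude spectrum |S(k)| is already available for one frame; the FFT size, sampling rate and filter count are example values, and the DCT is written out explicitly rather than taken from a library.

```python
import numpy as np

def mel(f):
    """Mel scale of Eq. (2.17)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters=24, nfft=512, fs=16000):
    """Triangular filters W(k; l) of Eq. (2.15), with centres uniformly
    spaced on the mel scale and shared edges as in Eq. (2.16)."""
    edges = mel_to_hz(np.linspace(0.0, mel(fs / 2.0), num_filters + 2))
    bins = np.floor((nfft / 2 + 1) * edges / (fs / 2.0)).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for l in range(1, num_filters + 1):
        lo, c, hi = bins[l - 1], bins[l], bins[l + 1]
        for k in range(lo, c):
            fbank[l - 1, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fbank[l - 1, k] = (hi - k) / max(hi - c, 1)   # falling slope
    return fbank

def mfcc_from_spectrum(mag_spectrum, fbank, num_ceps=12):
    """m(l) = ln[sum_k W(k;l)|S(k)|] (Eq. 2.18), followed by a DCT."""
    m = np.log(fbank @ mag_spectrum + 1e-10)
    L = len(m)
    n = np.arange(L)
    dct_basis = np.cos(np.pi * np.outer(np.arange(num_ceps), (n + 0.5)) / L)
    return dct_basis @ m

# Usage sketch on one windowed frame.
frame = np.hamming(512) * np.random.default_rng(2).normal(size=512)
spec = np.abs(np.fft.rfft(frame))            # |S(k)|, length nfft/2 + 1
fb = mel_filterbank()
print(mfcc_from_spectrum(spec, fb).shape)    # (12,)
```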

2.2.3 Temporal cepstral derivative

The cepstral representation (LPC or MFCC) of the speech spectrum provides a good description of the local spectral properties of the signal for a given frame. An improved representation can be obtained by extending the analysis to include information about the temporal spectral derivative [12]. Given the k-th cepstral coefficient at time t, C_k(t), a time duration T, and a frame shift for speech analysis \Delta T, the r-th Regressive Cepstral Coefficient (RGC) is calculated as

R_{rk}(t, T, \Delta T, N) = \frac{\sum_{X=1}^{N} P_r(X, N)\, C_k\left(t + \left(\frac{X-1}{N-1} - \frac{1}{2}\right)(T - \Delta T)\right)}{\sum_{X=1}^{N} P_r^2(X, N)} \qquad (2.19)

where N is the number of frames used for the RGC calculation. The weighting functions for the 1st- and 2nd-order regressions are given as

P_1(X, N) = X \qquad (2.20)

P_2(X, N) = X^2 - \frac{1}{12}(N^2 - 1) \qquad (2.21)
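The sketch below computes first- and second-order regression (delta-like) coefficients over a window of N frames in the spirit of Eqs. (2.19)-(2.21). It operates on a cepstral matrix sampled at the frame shift, so the continuous-time argument of C_k is replaced by the nearest available frames; this discretisation and the variable names are my own simplifications, not the thesis implementation.

```python
import numpy as np

def regression_coeffs(cepstra, order=1, N=5):
    """cepstra: (num_frames, num_coeffs) array of C_k values.
    Returns an array of the same shape containing the r-th order regression
    coefficient of each cepstral track, computed over N frames centred on
    the current frame (edges are handled by clipping)."""
    X = np.arange(1, N + 1, dtype=float)
    if order == 1:
        P = X                                   # P_1(X, N) = X        (2.20)
    else:
        P = X ** 2 - (N ** 2 - 1) / 12.0        # P_2(X, N)            (2.21)
    denom = np.sum(P ** 2)
    half = N // 2
    T, D = cepstra.shape
    out = np.zeros_like(cepstra)
    for t in range(T):
        idx = np.clip(np.arange(t - half, t - half + N), 0, T - 1)
        out[t] = P @ cepstra[idx] / denom       # numerator/denominator of Eq. (2.19)
    return out

# Usage sketch on hypothetical 12-dimensional cepstra.
ceps = np.random.default_rng(3).normal(size=(100, 12))
print(regression_coeffs(ceps, order=1, N=5).shape)
```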

[Figure 2.4: Example of an HMM with three states S1, S2, S3 and symbol set {a, b}. Parameters: a11 = 0.3, b11(a) = 1.0, b11(b) = 0.0; a12 = 0.7, b12(a) = 0.5, b12(b) = 0.5; a22 = 0.2, b22(a) = 0.3, b22(b) = 0.7; a23 = 0.8, b23(a) = 0.0, b23(b) = 1.0.]

2.3 Speech Recognition Based on Hidden Markov Model

2.3.1 HMM based speech modeling and its application for speech recognition

An HMM (Hidden Markov Model) [1, 8, 15, 16, 18] is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the problem is to determine the hidden parameters from the observable parameters. For example, speech is represented as a symbol sequence \{s_1, s_2, \cdots, s_T\} after feature extraction and quantization. An HMM is a speech production model which produces such a sequence during its state transitions. An example of an HMM which emits sequences over the symbol set S = \{a, b\} is shown in Fig. 2.4. Here a_{ij} is the transition probability from state S_i to state S_j, and b_{ij}(s_k) is the probability of emitting symbol s_k when state S_i transits to S_j. They satisfy the following conditions, respectively:

\sum_{j} a_{ij} = 1, \qquad \sum_{k} b_{ij}(s_k) = 1. \qquad (2.22)

Here, we consider that the symbol sequence abb is observed. There are two different state sequences, S1 S1 S2 S3 and S1 S2 S2 S3, by which the symbol sequence can be generated by the HMM of Fig. 2.4. Denoting the HMM by M, the probabilities of the two state sequences are

P(S_1 S_1 S_2 S_3 | M) = a_{11} \cdot a_{12} \cdot a_{23} \qquad (2.23)

P(S_1 S_2 S_2 S_3 | M) = a_{12} \cdot a_{22} \cdot a_{23}. \qquad (2.24)

The observation probabilities of the symbol sequence abb given the model M and the two state sequences S1 S1 S2 S3 and S1 S2 S2 S3 are then given by

P(abb | S_1 S_1 S_2 S_3, M) = b_{11}(a) \cdot b_{12}(b) \cdot b_{23}(b) \qquad (2.25)

P(abb | S_1 S_2 S_2 S_3, M) = b_{12}(a) \cdot b_{22}(b) \cdot b_{23}(b). \qquad (2.26)

Therefore, the joint probabilities of the symbol sequence abb and the two state sequences given the model M are

P(abb, S_1 S_1 S_2 S_3 | M) = P(S_1 S_1 S_2 S_3 | M) \cdot P(abb | S_1 S_1 S_2 S_3, M) \qquad (2.27)
= a_{11} \cdot b_{11}(a) \cdot a_{12} \cdot b_{12}(b) \cdot a_{23} \cdot b_{23}(b) \qquad (2.28)

P(abb, S_1 S_2 S_2 S_3 | M) = P(S_1 S_2 S_2 S_3 | M) \cdot P(abb | S_1 S_2 S_2 S_3, M) \qquad (2.29)
= a_{12} \cdot b_{12}(a) \cdot a_{22} \cdot b_{22}(b) \cdot a_{23} \cdot b_{23}(b). \qquad (2.30)

Finally, the probability of the symbol sequence abb given the model M is

P(abb | M) = \sum_{S_{q_1} S_{q_2} S_{q_3} S_{q_4}} P(abb, S_{q_1} S_{q_2} S_{q_3} S_{q_4} | M) \qquad (2.31)
= P(abb, S_1 S_1 S_2 S_3 | M) + P(abb, S_1 S_2 S_2 S_3 | M) \qquad (2.32)
= a_{11} \cdot b_{11}(a) \cdot a_{12} \cdot b_{12}(b) \cdot a_{23} \cdot b_{23}(b) \qquad (2.33)
\quad + a_{12} \cdot b_{12}(a) \cdot a_{22} \cdot b_{22}(b) \cdot a_{23} \cdot b_{23}(b) \qquad (2.34)
= 0.3 \times 1.0 \times 0.7 \times 0.5 \times 0.8 \times 1.0 \qquad (2.35)
\quad + 0.7 \times 0.5 \times 0.2 \times 0.7 \times 0.8 \times 1.0 \qquad (2.36)
= 0.1232, \qquad (2.37)

where q_t is the state number at time t.
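The small computation above can be checked mechanically. The sketch below enumerates the two admissible state sequences for the HMM of Fig. 2.4 and reproduces P(abb|M) = 0.1232; it is a toy illustration with the parameters hard-coded from the figure.

```python
# Transition and emission probabilities of the example HMM (Fig. 2.4).
a = {(1, 1): 0.3, (1, 2): 0.7, (2, 2): 0.2, (2, 3): 0.8}
b = {(1, 1): {"a": 1.0, "b": 0.0}, (1, 2): {"a": 0.5, "b": 0.5},
     (2, 2): {"a": 0.3, "b": 0.7}, (2, 3): {"a": 0.0, "b": 1.0}}

def path_probability(states, symbols):
    """Joint probability of a state sequence and the emitted symbol sequence."""
    prob = 1.0
    for (i, j), s in zip(zip(states, states[1:]), symbols):
        prob *= a[(i, j)] * b[(i, j)][s]
    return prob

obs = "abb"
paths = [(1, 1, 2, 3), (1, 2, 2, 3)]            # the two admissible paths
total = sum(path_probability(p, obs) for p in paths)
print(total)                                    # 0.1232, cf. Eqs. (2.31)-(2.37)
```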

Given the definition of HMMs, there are three problems of interest:

• The Evaluation Problem - Given a model and a sequence of observations, what is the probability that the model generated the observations?

• The Decoding Problem - Given a model and a sequence of observations, what is the most likely state sequence in the model that produces the observations?

• The Learning Problem - Given a model and a set of observations, what should the model's parameters be so that it maximizes the probability of generating the observations?

If we could solve the evaluation problem, we would have a way of scoring the match between a model and an observation sequence, which could be used for pattern recognition, since the likelihood can be used to compute the posterior probability, and the HMM with the highest posterior probability can be determined as the desired pattern for the observation sequence. If we could solve the decoding problem, we could find the best matching state sequence given an observation sequence, which could be used for continuous speech recognition. Last but not least, if we could solve the learning problem, we would have the means to automatically estimate the model parameters from an ensemble of training data. In the next three subsections, we introduce the solutions to these three problems.

[Figure: the trellis of partial forward probabilities over states S1, S2, S3 for the symbol sequence abb, terminating in P(abb|M) = 0.1232.]

Figure 2.5: A trellis in the forward computation

2.3.2 The Evaluation Problem: The Forward Algorithm

To calculate the probability of a symbol sequence given the HMM M, the most intuitive way is to sum up the probabilities of all possible state sequences as described in Sec. 2.3.1. However, the direct evaluation of Eq. (2.31) requires a computational complexity of O(2T \cdot N^T), which is unrealistic. A more efficient algorithm, the so-called forward algorithm, can be used to calculate Eq. (2.31). The forward probability \alpha_t(i) is defined as

\alpha_t(i) = P(s_1, s_2, \ldots, s_t, q_t = i \mid M),    (2.38)

where \alpha_t(i) is the probability that the HMM M is in state S_i at time t having observed the sequence s_1, s_2, \ldots, s_t. The probability can be computed inductively using the forward algorithm as follows:

*Forward algorithm
• Step 1: Initialization
  \alpha_0(i) = \pi_i, \quad 1 \le i \le N,    (2.39)
  where \pi_i is an initial state distribution.
• Step 2: For t = 0, 1, 2, \ldots, perform Step 3.
• Step 3: For all states j,
  \alpha_{t+1}(j) = \sum_{i=1}^{N} \alpha_t(i) a_{ij} b_{ij}(s_{t+1}).    (2.40)

• Step 4: Termination
  P(o \mid M) = \sum_{S_i \in F} \alpha_T(i),    (2.41)

where N is the number of states, T is the length of the symbol sequence, and F is the set of final states. It is easy to show that the complexity of the forward algorithm is O(N^2 T), which is very efficient compared with the original complexity O(2T \cdot N^T). Considering the computation of Eq. (2.31) using the forward algorithm, the inductive computation procedure can be illustrated by the trellis shown in Fig. 2.5.
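As a small illustration of Steps 1-4, the sketch below implements the forward recursion for the same assumed toy model used in the enumeration above; it reproduces the intermediate trellis values of Fig. 2.5 and the final probability 0.1232.

```python
# Forward algorithm (Eqs. (2.39)-(2.41)) for the transition-emission HMM of Fig. 2.4.
# The parameters are the same assumed values as in the enumeration sketch above.
a = {(1, 1): 0.3, (1, 2): 0.7, (2, 2): 0.2, (2, 3): 0.8}
b = {(1, 1): {'a': 1.0, 'b': 0.0},
     (1, 2): {'a': 0.5, 'b': 0.5},
     (2, 2): {'a': 0.3, 'b': 0.7},
     (2, 3): {'a': 0.0, 'b': 1.0}}
states = [1, 2, 3]
pi = {1: 1.0, 2: 0.0, 3: 0.0}        # initial state distribution (start in S1)
final_states = {3}                    # F

def forward(symbols):
    # Step 1: initialization, alpha_0(i) = pi_i
    alpha = dict(pi)
    # Steps 2-3: alpha_{t+1}(j) = sum_i alpha_t(i) * a_ij * b_ij(s_{t+1})
    for s in symbols:
        alpha = {j: sum(alpha[i] * a.get((i, j), 0.0) * b.get((i, j), {}).get(s, 0.0)
                        for i in states)
                 for j in states}
    # Step 4: termination, P(o|M) = sum over final states of alpha_T(i)
    return sum(alpha[i] for i in final_states)

print(forward(['a', 'b', 'b']))       # 0.1232, the same value as the trellis of Fig. 2.5
```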

2.3.3 The Decoding Problem: The Viterbi Algorithm

The forward algorithm effectively computes the probability that an HMM generates an observation sequence by summing up the probabilities of all possible paths. However, it is often desirable to find the optimal path (or state sequence) associated with the given observation sequence. A formal technique for finding the best state sequence based on dynamic programming is the Viterbi algorithm [94, 16, 18]. The Viterbi algorithm can be regarded as a modified forward algorithm: instead of summing up the probabilities of all different paths, it picks the optimal path with the maximum probability. In a manner similar to the forward algorithm, the best path probability f_t(i) is the probability of the most likely state sequence ending in state S_i at time t, and \psi_t(j) records the corresponding predecessor state; the Viterbi probability can then be calculated inductively as:

*Viterbi algorithm
• Step 1: Initialization
  f_0(i) = \pi_i, \quad 1 \le i \le N,    (2.42)
  \psi_0(i) = 0,    (2.43)
  where \pi_i is an initial state distribution.
• Step 2: For t = 0, 1, 2, \ldots, perform Step 3.
• Step 3: For all states j,
  f_{t+1}(j) = \max_{i=1 \ldots N} f_t(i) a_{ij} b_{ij}(s_{t+1}),    (2.44)
  \psi_{t+1}(j) = \mathrm{argmax}_{i=1 \ldots N} f_t(i) a_{ij} b_{ij}(s_{t+1}).    (2.45)

Here, the symbol observation probability along the best path is

\tilde{P}(o \mid M) = \max_{S_i \in F} f_T(i),    (2.46)

q_T^* = \mathrm{argmax}_{S_i \in F} f_T(i),    (2.47)

and the best state sequence q_t^* is obtained by backtracking,

q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1.    (2.48)
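The recursion and the backtracking of Eqs. (2.42)-(2.48) can be sketched directly; the model below is again the assumed toy HMM used in the previous sketches.

```python
# Viterbi decoding (Eqs. (2.42)-(2.48)) for the same assumed model as above.
a = {(1, 1): 0.3, (1, 2): 0.7, (2, 2): 0.2, (2, 3): 0.8}
b = {(1, 1): {'a': 1.0, 'b': 0.0},
     (1, 2): {'a': 0.5, 'b': 0.5},
     (2, 2): {'a': 0.3, 'b': 0.7},
     (2, 3): {'a': 0.0, 'b': 1.0}}
states, pi, final_states = [1, 2, 3], {1: 1.0, 2: 0.0, 3: 0.0}, {3}

def viterbi(symbols):
    f = dict(pi)                       # f_0(i) = pi_i
    backptr = []                       # psi_{t+1}(j), one dict per frame
    for s in symbols:
        new_f, psi = {}, {}
        for j in states:
            # f_{t+1}(j) = max_i f_t(i) a_ij b_ij(s_{t+1}); psi_{t+1}(j) = argmax_i ...
            cand = {i: f[i] * a.get((i, j), 0.0) * b.get((i, j), {}).get(s, 0.0)
                    for i in states}
            psi[j] = max(cand, key=cand.get)
            new_f[j] = cand[psi[j]]
        backptr.append(psi)
        f = new_f
    # Termination and backtracking (Eqs. (2.46)-(2.48))
    q_T = max(final_states, key=lambda i: f[i])
    path = [q_T]
    for psi in reversed(backptr):
        path.append(psi[path[-1]])
    return f[q_T], list(reversed(path))

print(viterbi(['a', 'b', 'b']))   # (0.084, [1, 1, 2, 3]): the best path is S1 S1 S2 S3
```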

2.3.4 The Estimation Problem: Baum-Welch Algorithm

The third problem of HMMs is to estimate the parameter set \theta = \{\pi_i, a_{ij}, b_{ij}(s)\} so as to maximize the probability of generating the observation sequences. There is no known analytical method that maximizes the joint probability of the training data in a closed form. The problem can be solved by the iterative Baum-Welch algorithm, also known as the EM (expectation-maximization) method or the forward-backward algorithm [37, 16, 18]. Before we describe the formal Baum-Welch algorithm, we first define a few useful terms. In a manner similar to the forward probability, we define the backward probability \beta_t(i) as

\beta_t(i) = P(s_{t+1}, s_{t+2}, \ldots, s_T \mid q_t = i, M),    (2.49)

where \beta_t(i) is the probability of the remaining observation sequence s_{t+1}, \ldots, s_T given that the HMM M is in state S_i at time t. The backward probability \beta_t(i) can be calculated inductively as:

*Backward algorithm
• Step 1: Initialization
  \beta_T(i) = 1, \quad S_i \in F.    (2.50)
• Step 2: For t = T-1, T-2, \ldots, perform Step 3.
• Step 3: For all states i,
  \beta_t(i) = \sum_{j=1}^{N} a_{ij} b_{ij}(s_{t+1}) \beta_{t+1}(j).    (2.51)

We then define \gamma(i, j, t), the probability of taking the transition from state i to state j at time t, given the model and the observation sequence, i.e.,

\gamma(i, j, t) = \frac{\alpha_{t-1}(i) a_{ij} b_{ij}(s_t) \beta_t(j)}{\sum_i \alpha_t(i) \beta_t(i)}.    (2.52)

We can iteratively refine the HMM parameter set \theta = \{\pi_i, a_{ij}, b_{ij}(s)\} by increasing the likelihood at each iteration. According to the EM algorithm, the final result of this re-estimation procedure is an ML estimate of the HMM. A set of reasonable re-estimation formulas is

\hat{\pi}_i = \frac{\sum_j \gamma(i, j, 1)}{\sum_{i,j} \gamma(i, j, 1)},    (2.53)

\hat{a}_{ij} = \frac{\sum_t \gamma(i, j, t)}{\sum_j \sum_t \gamma(i, j, t)},    (2.54)

\hat{b}_{ij}(s) = \frac{\sum_{t: s_t = s} \gamma(i, j, t)}{\sum_t \gamma(i, j, t)}.    (2.55)

It should be pointed out that the forward-backward algorithm converges only to a local maximum.
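The re-estimation formulas translate fairly directly into code. The sketch below is a minimal single-utterance, single-iteration version under the transition-emission convention of this section; a practical trainer accumulates the gamma statistics over many training utterances and iterates until the likelihood converges. The denominator of Eq. (2.52) is computed here as P(O|M) = sum_i alpha_t(i) beta_t(i), assumed positive.

```python
from collections import defaultdict

def forward_backward(symbols, states, final_states, pi, a, b):
    """alpha_t(i), beta_t(i) for a transition-emission HMM (Eqs. (2.38), (2.49)-(2.51))."""
    T = len(symbols)
    alpha = [dict(pi)]
    for s in symbols:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * a.get((i, j), 0.0) * b.get((i, j), {}).get(s, 0.0)
                             for i in states) for j in states})
    beta = [None] * (T + 1)
    beta[T] = {i: (1.0 if i in final_states else 0.0) for i in states}
    for t in range(T - 1, -1, -1):
        s = symbols[t]
        beta[t] = {i: sum(a.get((i, j), 0.0) * b.get((i, j), {}).get(s, 0.0) * beta[t + 1][j]
                          for j in states) for i in states}
    return alpha, beta

def baum_welch_step(symbols, states, final_states, pi, a, b):
    """One re-estimation pass over a single sequence, following Eqs. (2.52)-(2.55)."""
    alpha, beta = forward_backward(symbols, states, final_states, pi, a, b)
    p_obs = sum(alpha[0][i] * beta[0][i] for i in states)          # P(O|M), assumed > 0
    gamma = defaultdict(float)                                      # gamma[(i, j, t)]
    T = len(symbols)
    for t in range(1, T + 1):
        for i in states:
            for j in states:
                gamma[(i, j, t)] = (alpha[t - 1][i] * a.get((i, j), 0.0)
                                    * b.get((i, j), {}).get(symbols[t - 1], 0.0)
                                    * beta[t][j]) / p_obs
    # Eq. (2.53): new initial probabilities
    denom_pi = sum(gamma[(i, j, 1)] for i in states for j in states)
    new_pi = {i: sum(gamma[(i, j, 1)] for j in states) / denom_pi for i in states}
    # Eqs. (2.54)-(2.55): new transition and emission probabilities
    new_a, new_b = {}, {}
    for i in states:
        occ_i = sum(gamma[(i, j, t)] for j in states for t in range(1, T + 1))
        for j in states:
            occ_ij = sum(gamma[(i, j, t)] for t in range(1, T + 1))
            if occ_ij > 0.0:
                new_a[(i, j)] = occ_ij / occ_i
                new_b[(i, j)] = {s: sum(gamma[(i, j, t)] for t in range(1, T + 1)
                                        if symbols[t - 1] == s) / occ_ij
                                 for s in set(symbols)}
    return new_pi, new_a, new_b

# Example (using the assumed toy model of the earlier sketches):
# new_pi, new_a, new_b = baum_welch_step(['a', 'b', 'b'], [1, 2, 3], {3},
#                                        {1: 1.0, 2: 0.0, 3: 0.0}, a, b)
```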

2.3.5 Extending to Continuous Mixture Density HMMs

So far, all of the observation probabilities b_{ij}(s_k) were defined as discrete probability distributions. However, if the observations do not come from a finite set but from a continuous space, the discrete output distribution discussed in the previous sections needs to be modified. The most general representation of the probability density function (pdf) is a finite mixture of continuous densities. The observation probability of taking the transition from state i to state j, given the feature vector o, is a finite mixture of the form

b_{ij}(o) = \sum_{m=1}^{M} \lambda_{ijm} b_{ijm}(o),    (2.56)

where \lambda_{ijm} is the mixture coefficient of the m-th mixture component of the transition from state i to state j, and b_{ijm}(o) is the m-th component pdf. Without loss of generality, we assume that b_{ijm}(o) is a Gaussian distribution, so that b_{ijm}(o) becomes

b_{ijm}(o) = \frac{1}{(2\pi)^{n/2} |\Sigma_{ijm}|^{1/2}} \exp\left( -\frac{1}{2} (o - \mu_{ijm})^t \Sigma_{ijm}^{-1} (o - \mu_{ijm}) \right),    (2.57)

where \mu_{ijm} and \Sigma_{ijm} are the mean vector and covariance matrix of the m-th mixture component of the transition from state i to state j, respectively.
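A direct transcription of Eqs. (2.56)-(2.57) is shown below; the two-component mixture parameters are purely illustrative (assumed) values.

```python
import numpy as np

def gaussian_pdf(o, mu, sigma):
    """Multivariate Gaussian density b_ijm(o) of Eq. (2.57)."""
    n = len(mu)
    diff = o - mu
    inv = np.linalg.inv(sigma)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ inv @ diff) / norm)

def mixture_density(o, weights, means, covs):
    """Continuous output probability b_ij(o) of Eq. (2.56): a weighted sum of Gaussians."""
    return sum(w * gaussian_pdf(o, mu, sigma)
               for w, mu, sigma in zip(weights, means, covs))

# Toy two-component mixture in two dimensions (illustrative values only).
o = np.array([0.3, -0.1])
weights = [0.6, 0.4]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_density(o, weights, means, covs))
```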

2.4 Basic Methods for Speaker Recognition

Speaker recognition technology is closely related to speech recognition, where it means automatic speaker (talker) recognition by machine. The general area of speaker recognition includes two fundamental tasks, speaker identification and speaker verification. The speaker identification task is to classify an unlabeled

voice sample as belonging to (having been spoken by) one of a set of N reference speakers (N possible outcomes), whereas the speaker verification task is to decide whether or not an unlabeled voice sample belongs to a specific reference speaker (2 possible outcomes: the sample is either accepted as belonging to the reference speaker or rejected as belonging to an impostor). In this thesis, we study speaker identification under a distant-talking environment. Therefore, we only review the basic methods for speaker identification in this section. Depending on the level of user cooperation and control in an application, the speech used for these tasks can be either text-dependent or text-independent. In a text-dependent application, the speaker is required to speak a predetermined (fixed) utterance. In contrast, text-independent speaker recognition does not rely on a specific text being spoken. Text-independent speaker recognition is more difficult but also more flexible. For speaker recognition, various types of speaker models have long been studied. The invention of the probabilistic HMM and its mathematical foundations gave rise to speaker recognition systems using these models. HMMs have become the most popular statistical tool for the text-dependent task. The best results have been obtained using continuous density HMMs (CHMMs) for modeling speaker characteristics [46]. For the text-independent task, the temporal sequence modeling capability of the HMM is not required. Therefore, a one-state CHMM, also called a Gaussian mixture model (GMM), has been widely used as a speaker model [48]. There are two principal motivations for using Gaussian mixture densities as a representation of a speaker identity. The first motivation is that the individual component densities of a multivariate mixture density, like the GMM, can model an underlying set of acoustic classes. The second motivation for using Gaussian mixture densities for speaker identification is the empirical observation that a linear combination of Gaussian basis functions is capable of representing a large class of sample distributions [49]. One of the powerful attributes of the GMM is its ability to form smooth approximations to arbitrarily-shaped densities. Many GMM-based speaker identification systems have been proposed [48, 45, 36]. Reynolds and Rose [45] proposed effective Gaussian mixture speaker models for robust text-independent speaker identification. The details of the speaker identification technique based on GMMs will be described in Section 5.3.2.1. In GMM modeling techniques, feature vectors are assumed to be statistically independent. Although this is not true, it allows one to simplify the mathematical formulations. To overcome this assumption, models based on segments of feature frames were proposed [34]. One of the disadvantages of the GMM is that the acoustic variability that depends on phonetic events is not directly taken into account. In other words, this modeling is not sufficiently constrained by the phonetic temporal pattern. Therefore, speech recognition techniques have been

used for text-dependent speaker identification [35]. This approach is also used for text-independent speaker identification. Nakagawa et al. [36, 37] proposed a novel speaker recognition method that combines speaker-specific GMMs with speaker-adapted syllable-based HMMs and showed its robustness to changes in speaking style in a close-talking environment. The details of the speaker identification technique based on speaker-adapted HMMs will be introduced in Section 5.2.1.2.

2.5 Summary

This chapter presented the basic methods for speech recognition and speaker recognition. Two feature extraction methods used in this thesis, LPC and MFCC, were described in Section 2.2. In Section 2.3, the principle and implementation of HMM-based speech recognition were briefly described. Finally, the approaches and motivations of HMM-based and GMM-based speaker identification were briefly reviewed in Section 2.4.

Chapter 3

Speaker Position Estimation

3.1 Introduction

Speaker localization using acoustic signals has received significant attention lately [79, 68, 80, 64, 66, 72, 69]. It has found many important applications such as human-robot interaction [65, 71], automatic video steering in videoconferencing [70] and intelligent rooms [73, 72]. A large number of speaker localization algorithms have been proposed in the literature, including Time Delay Of Arrival (TDOA) based [79, 68, 80, 64, 66], steered-beamformer based [79] and high-resolution estimation based [79] methods, with varying degrees of accuracy and computational complexity. In general, the estimation precision depends on a number of factors. Major issues include (1) the quantity and quality of the microphones employed, (2) the microphone placement relative to each other and to the speech source to be analyzed, (3) the ambient noise and reverberation levels, and (4) the number of active sources and their spectral content [80]. Among the various sound source localization methods, the TDOA-based approach is the most popular. The TDOA between any two microphones defines a hyperbolic function with the microphones corresponding to the foci of the hyperbola. Speaker localization based on the TDOA between distinct microphone pairs has been shown to be effectively implementable and to provide good performance even in a moderately reverberant environment and in noisy conditions [79, 68, 80, 64, 66]. Speaker localization in an acoustic environment involves two steps. The first step is the estimation of the time delays between pairs of microphones. The next step is to use these delays to estimate the speaker location. The performance of the TDOA estimation is very important to the speaker localization accuracy. The prevalent technique for TDOA estimation is based upon

Generalized Cross-Correlation (GCC), in which the delay estimate is obtained as the time lag which maximizes the cross-correlation between filtered versions of the received signals [67]. Various investigators [80, 66, 90] have proposed more effective TDOA estimation methods for noisy and reverberant acoustic environments. On the other hand, it is necessary to find the speaker position using the estimated delays. The Maximum Likelihood (ML) location estimate is one of the common methods because of its proven asymptotic consistency. It does not have a closed-form solution for the speaker position because of the nonlinearity of the hyperbolic equations. The Newton-Raphson iterative method [91], the Gauss-Newton method [92], and the Least Mean Squares (LMS) algorithm are among the possible choices to find the solution. However, for these iterative approaches, selecting a good initial guess to avoid a local minimum is difficult, the convergence consumes much computational time, and the optimal solution cannot be guaranteed. Therefore, it is our opinion that an ML location estimate is not suitable for a real-time implementation of a speaker localization system. For a realistic sound source localization system, a stable result with high accuracy, obtained in real time using a small number of microphones, is desirable. We therefore propose a method to estimate the speaker position using a novel closed-form solution that utilizes the symmetry of the microphone pairs. Using this method, the speaker position can be estimated in real time from the TDOAs.

3.2 Method of Speaker Position Localization

We estimate the speaker position as follows. It is assumed that N microphones are located at positions (x_i, y_i, z_i), i = 1, \cdots, N, and that the sound source is located at (x_s, y_s, z_s). The distance between the sound source and the i-th microphone is denoted by

D_i = \sqrt{(x_i - x_s)^2 + (y_i - y_s)^2 + (z_i - z_s)^2}.    (3.1)

The difference in the distances from the sound source to microphone i and microphone j is given by

d_{ij} = D_i - D_j = c \hat{\tau}_{ij},    (3.2)

where c is the velocity of sound and \hat{\tau}_{ij} is the Time Delay Of Arrival (TDOA). The TDOA can be estimated by the Crosspower Spectrum Phase (CSP) [68, 95]. The CSP is defined as follows:

\Phi(\omega, t) = \frac{X_i(\omega, t) X_j^*(\omega, t)}{|X_i(\omega, t)| \, |X_j(\omega, t)|},    (3.3)

C(\tau_{ij}, t) = \int_{-\infty}^{+\infty} \Phi(\omega, t) e^{j 2\pi \omega \tau_{ij}} d\omega,    (3.4)

[Figure: T-shaped arrangement of the four microphones on the x = 0 plane: M1 at (0, 0, 0), M2 at (0, d, 0), M3 at (0, 0, d), M4 at (0, -d, 0).]

Figure 3.1: Microphones arranged for speaker position estimation (d = 20 cm).

where X_i and X_j are the spectra of the sound signals received by a microphone pair, and t indicates the speech frame index. The C(\tau_{ij}, t) are then summed along t:

C(\tau_{ij}) = \sum_{t=0}^{T-1} C(\tau_{ij}, t).    (3.5)

Finally, the TDOA \hat{\tau}_{ij} is given by

\hat{\tau}_{ij} = \mathrm{argmax}_{\tau_{ij}} C(\tau_{ij}).    (3.6)
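A minimal sketch of CSP-based TDOA estimation following Eqs. (3.3)-(3.6) is shown below, assuming NumPy and frame settings similar to those used in Section 3.3 (a 2048-point window and an 80-point shift). The sub-sample interpolation of the correlation peak used in the experiments is omitted, and the optional max_delay argument (a lag bound in samples) is an added convenience, not part of the thesis's formulation.

```python
import numpy as np

def csp_tdoa(x_i, x_j, fs, frame_len=2048, shift=80, max_delay=None):
    """Estimate the TDOA (in seconds) between two channels by CSP (Eqs. (3.3)-(3.6))."""
    num_frames = (min(len(x_i), len(x_j)) - frame_len) // shift + 1
    window = np.hamming(frame_len)
    acc = np.zeros(frame_len)
    for t in range(num_frames):
        seg_i = x_i[t * shift:t * shift + frame_len] * window
        seg_j = x_j[t * shift:t * shift + frame_len] * window
        Xi, Xj = np.fft.fft(seg_i), np.fft.fft(seg_j)
        phi = Xi * np.conj(Xj)
        phi /= np.abs(phi) + 1e-12          # keep only the phase, Eq. (3.3)
        acc += np.real(np.fft.ifft(phi))    # C(tau, t), accumulated over frames (Eq. (3.5))
    # Integer lags corresponding to the FFT bins: 0 .. N/2-1, -N/2 .. -1
    lags = np.fft.fftfreq(frame_len, d=1.0 / frame_len).astype(int)
    if max_delay is not None:
        acc[np.abs(lags) > max_delay] = -np.inf
    tau_samples = lags[np.argmax(acc)]      # Eq. (3.6)
    return tau_samples / float(fs)
```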

In order to estimate the speaker position in a 3-D space, four microphones are theoretically required. The microphones are set on a plane as indicated in Fig. 3.1. We can estimate the speaker position by using the three microphone pairs (M1, M2), (M1, M3), (M1, M4). The first microphone (M1) is regarded as the reference and is placed at the origin of the coordinate system. The other three microphones are placed on the plane at the same distance d from the first microphone (M1). We can derive three equations from Equations (3.1) and (3.2):

\sqrt{(x_1 - x_s)^2 + (y_1 - y_s)^2 + (z_1 - z_s)^2} - \sqrt{(x_2 - x_s)^2 + (y_2 - y_s)^2 + (z_2 - z_s)^2} = c \hat{\tau}_{12},    (3.7)

\sqrt{(x_1 - x_s)^2 + (y_1 - y_s)^2 + (z_1 - z_s)^2} - \sqrt{(x_3 - x_s)^2 + (y_3 - y_s)^2 + (z_3 - z_s)^2} = c \hat{\tau}_{13},    (3.8)

\sqrt{(x_1 - x_s)^2 + (y_1 - y_s)^2 + (z_1 - z_s)^2} - \sqrt{(x_4 - x_s)^2 + (y_4 - y_s)^2 + (z_4 - z_s)^2} = c \hat{\tau}_{14}.    (3.9)

Because of the symmetry of the three microphone pairs, the equations with square roots can be solved in closed form as

y_s = \frac{-b_3 + \sqrt{b_3^2 - 4 a_3 c_3}}{2 a_3},    (3.10)

where

a_3 = \frac{d^2}{(c\hat{\tau}_{14})^2} - \frac{d^2}{(c\hat{\tau}_{13})^2},    (3.11)
b_3 = \frac{d^3}{(c\hat{\tau}_{14})^2} + \frac{d^3}{(c\hat{\tau}_{13})^2} - 2d,    (3.12)
c_3 = \frac{d^4}{4(c\hat{\tau}_{14})^2} - \frac{d^4}{4(c\hat{\tau}_{13})^2} - \frac{(c\hat{\tau}_{14})^2}{4} + \frac{(c\hat{\tau}_{13})^2}{4};    (3.13)

then

z_s = \frac{-b_2 + \sqrt{b_2^2 - 4 a_2 c_2}}{2 a_2},    (3.14)

where

a_2 = -\frac{d^2}{(c\hat{\tau}_{12})^2},    (3.15)
b_2 = \frac{d^3}{(c\hat{\tau}_{12})^2} - d,    (3.16)
c_2 = \frac{d^2}{(c\hat{\tau}_{13})^2} y_s^2 - \left( \frac{d^3}{(c\hat{\tau}_{13})^2} - d \right) y_s + \frac{d^4}{4(c\hat{\tau}_{13})^2} - \frac{d^4}{4(c\hat{\tau}_{12})^2} + \frac{(c\hat{\tau}_{13})^2}{4} - \frac{(c\hat{\tau}_{12})^2}{4};    (3.17)-(3.18)

and finally

x_s = \sqrt{-y_s^2 + a_1 z_s^2 + b_1 z_s + c_1},    (3.19)-(3.20)

where

a_1 = \frac{d^2}{(c\hat{\tau}_{13})^2} - 1,    (3.21)
b_1 = d - \frac{d^3}{(c\hat{\tau}_{13})^2},    (3.22)
c_1 = \frac{d^4}{4(c\hat{\tau}_{13})^2} + \frac{(c\hat{\tau}_{13})^2}{4} - \frac{d^2}{2}.    (3.23)

This method involves a relatively low computational cost, and there is no position estimation error if the TDOA estimation is correct because no assumption is needed for the relative position between the microphones and the sound source.
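Because the exact coefficient expressions of Eqs. (3.10)-(3.23) could not be fully recovered from the source text, the sketch below implements an alternative closed-form solution for the same T-shaped geometry of Fig. 3.1. It is obtained by squaring the range-difference equations (3.7)-(3.9) and eliminating the reference distance D1, and should be read as an equivalent illustration rather than the thesis's exact formulation.

```python
import math

def locate_source(tau12, tau13, tau14, d=0.2, c=340.0):
    """Closed-form source position from the three TDOAs of the T-shaped array.

    Geometry (Fig. 3.1): M1 = (0,0,0), M2 = (0,d,0), M3 = (0,0,d), M4 = (0,-d,0).
    k1j = c * tau1j = D1 - Dj is the range difference of the pair (M1, Mj).
    Degenerate geometries (e.g. a source exactly on the M2-M4 axis) need special handling.
    """
    k12, k13, k14 = c * tau12, c * tau13, c * tau14
    # Squaring D2 = D1 - k12 and D4 = D1 - k14 and eliminating D1 gives y_s directly.
    ys = (k14 - k12) * (k12 * k14 + d * d) / (2.0 * d * (k12 + k14))
    # D1 from the better-conditioned of the two symmetric pairs (k12 = 0 when y_s = d/2).
    if abs(k12) >= abs(k14):
        D1 = (k12 * k12 - d * d + 2.0 * d * ys) / (2.0 * k12)
    else:
        D1 = (k14 * k14 - d * d - 2.0 * d * ys) / (2.0 * k14)
    # The (M1, M3) pair then yields z_s.
    zs = (2.0 * k13 * D1 - k13 * k13 + d * d) / (2.0 * d)
    # Finally x_s follows from D1^2 = x_s^2 + y_s^2 + z_s^2 (the source is in front, x_s > 0).
    xs = math.sqrt(max(D1 * D1 - ys * ys - zs * zs, 0.0))
    return xs, ys, zs
```

Combined with the CSP sketch above, a position estimate would be, for example, locate_source(csp_tdoa(x1, x2, fs), csp_tdoa(x1, x3, fs), csp_tdoa(x1, x4, fs), d=0.2) for four channel signals x1..x4 sampled at fs = 12000 Hz (hypothetical variable names).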

Figure 3.2: Experimental environment

Of course, this approach leads to an estimation error caused by the measurement error of the TDOAs. If there are more than 4 microphones, we can also estimate the location using other combinations of 4 microphones. Thus, we can estimate the location as the average of the estimated locations at only a small additional computational cost.

3.3 Experiments of Speaker Position Localization

We performed the speaker position estimation experiments in the room shown in Fig. 3.2, measuring 3.45 m × 3 m × 2.6 m, without additive noise. The room was divided into the 12 (3 × 4) rectangular areas shown in Fig. 3.3, where the area size is 60 cm × 60 cm. In our experiments, the room was set up as the seminar room shown in Fig. 3.2, with a whiteboard beside the left wall, one table and some chairs in the center of the room, one TV and some other tables, etc. The reverberation time of the room is about 150 ms. Twenty male speakers each uttered 200 isolated words with a close-talking microphone. The 200 isolated words are phonetically balanced common isolated words selected from the Tohoku University and Panasonic isolated spoken word database [78]. The list of 200 isolated words is shown in Appendix A. The average duration of the utterances was about 0.6 seconds. The 4-channel speech signals were simultaneously acquired by a Thinknet DF-2000 real-time 16-channel data recorder.

[Figure: room layout with the T-shaped microphone array near one wall and the 12 speaker areas (each 0.6 m × 0.6 m), numbered 1-12 in a 3 × 4 grid in front of it.]

Figure 3.3: Room configuration (room size: (W) 3 m × (L) 3.45 m × (H) 2.6 m)

The waveforms of one of the test utterances, “ASAHI”, emitted from a loudspeaker located at positions 2, 5 and 9 are shown in Fig. 3.4. The speech signals were degraded by reverberation and attenuation. The greater the distance between the sound source and the microphone, the more severe the degradation. Therefore, in a distant-talking environment, sound source estimation is a challenging task. The system segmented the speech with a 2048-point window and an 80-point shift. One utterance or 20 frames were used for each utterance to estimate the speaker position; that is to say, about 0.6 second (for one utterance) or the first 3568 points (≈ 300 ms) were used for TDOA estimation by CSP. 250-fold interpolation between consecutive samples (of the extracted τ_ij) was used to obtain a more accurate speaker position estimate. To evaluate the speaker position estimation, two measures were defined as follows:

• Relative precision of estimated position (%):

  \left( 1 - \frac{\sqrt{(x_r - x_e)^2 + (y_r - y_e)^2}}{\sqrt{x_r^2 + y_r^2}} \right) \times 100,    (3.24)

  where (x_r, y_r) and (x_e, y_e) are the real position and the estimated position, respectively. Although the z-axis positions z_e were also estimated using the four microphones, they were not used to calculate the relative precision, because the position area can be determined by the x-axis and y-axis positions (x_e, y_e) alone.

• Area estimation rate (%):

  \frac{\sum I(x_e, y_e)}{\#\mathrm{utterances}} \times 100, \qquad
  I(x_e, y_e) = \begin{cases} 1 & \text{if the estimated speaker position } (x_e, y_e) \text{ is within the correct area} \\ 0 & \text{otherwise,} \end{cases}    (3.25)
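The two measures translate directly into code; the area_of helper below is a hypothetical function mapping an (x, y) estimate to the index of the 60 cm × 60 cm grid cell it falls in.

```python
import math

def relative_precision(real_xy, est_xy):
    """Relative precision of the estimated position, Eq. (3.24), in percent."""
    (xr, yr), (xe, ye) = real_xy, est_xy
    return (1.0 - math.hypot(xr - xe, yr - ye) / math.hypot(xr, yr)) * 100.0

def area_estimation_rate(real_areas, est_positions, area_of):
    """Area estimation rate, Eq. (3.25), in percent.

    real_areas:    correct area index for each utterance
    est_positions: estimated (x, y) for each utterance
    area_of:       hypothetical helper mapping (x, y) to an area index
    """
    hits = sum(1 for area, (xe, ye) in zip(real_areas, est_positions)
               if area_of(xe, ye) == area)
    return 100.0 * hits / len(real_areas)
```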

where I(x_e, y_e) denotes an indicator function. The speaker position estimation results for areas 1, 2, 3, 5, 9, 10 and 12 are shown in Tables 3.1 and 3.2. Good speaker position estimation performance was obtained. When using about 300 ms of speech data to estimate the CSP, a relative precision of the estimated position of more than 92% was achieved. At the center of all 12 areas (i.e., area 5), the average estimation error was less than 10 cm, while the distance between the sound source and the microphone was 110 cm. In our position-dependent CMN method proposed in Chapters 4, 5 and 6, the estimated speaker position is used to determine the area (60 cm × 60 cm) in which the speaker is located. Thus, an almost 100% area estimation rate was achieved. Even when the estimated area was incorrect, one of the neighboring areas was always selected in this experiment.

3.4 Summary

For a realistic sound source localization system, a stable result with high accuracy, obtained in real time using a small number of microphones, is desirable. In order to estimate the speaker position in a 3-D space, four microphones arranged in a T-shape were used. In this chapter, we proposed a TDOA-based speaker localization method using a novel closed-form solution that utilizes the symmetry of the microphone pairs. In the first step, the time delays between pairs of microphones were estimated based on the Crosspower Spectrum Phase (CSP). In the second step, these estimated TDOAs were used to estimate the speaker location. We performed the speaker localization experiments in the seminar room shown in Fig. 3.2, measuring 3.45 m × 3 m × 2.6 m, without additive noise. When using about 300 ms of speech data to estimate

the CSP, more than 92% relative precision of the estimated position was achieved. The average estimation error was less than 10 cm when the distance between the sound source and the microphone was 110 cm. This indicates that the proposed speaker localization is accurate enough to determine the area (60 cm × 60 cm) in which the speaker is located.

[Figure: three waveform panels, (a) position 2, (b) position 9, (c) position 10.]

Figure 3.4: Waveforms of Japanese isolated word “ASAHI” emitted from a loudspeaker located at various positions.

Table 3.1: Speaker position estimation results using about 300 ms of speech data

area   Relative precision of        Area estimation   Real position
       estimated position (%)       rate (%)          (x, y, z) (cm)
1      94.3                         100               (50, -60, 0)
2      94.5                         100               (50, 0, 0)
3      95.8                         100               (50, 60, 0)
5      92.1                         100               (110, 0, 0)
9      92.0                         99.4              (170, 60, 0)
10     90.7                         99.2*             (250, -60, 0)
12     85.6                         98.2*             (250, 60, 0)
Ave.   92.1                         99.5              —

* Note: The estimated speaker position is treated as within the correct area if it is nearest to the center of the correct area, even if it falls outside the 60 cm × 60 cm range of that area.

Table 3.2: Speaker position estimation results using one utterance (about 0.6 second)

area   Relative precision of        Area estimation   Real position
       estimated position (%)       rate (%)          (x, y, z) (cm)
1      94.2                         100               (50, -60, 0)
2      94.9                         100               (50, 0, 0)
3      95.9                         100               (50, 60, 0)
5      92.2                         100               (110, 0, 0)
9      92.0                         99.9              (170, 60, 0)
10     93.1                         99.8*             (250, -60, 0)
12     89.5                         99.2*             (250, 60, 0)
Ave.   93.1                         99.8              —

* Note: The estimated speaker position is treated as within the correct area if it is nearest to the center of the correct area, even if it falls outside the 60 cm × 60 cm range of that area.

Chapter 4

Normalization of Transfer Characteristic in Cepstral Domain

4.1 Introduction

Automatic speech recognition (ASR) systems are known to perform reasonably well when the speech signals are captured using a close-talking microphone. However, there are many environments where the use of a close-talking microphone is undesirable for reasons of safety or convenience. Hands-free speech communication [20, 21, 22, 23, 24] has become more and more popular in some special environments such as an office or the cabin of a car. Unfortunately, in a distant-talking environment, channel distortion may drastically degrade speech recognition performance. This is mostly caused by the mismatch between the practical environment and the training environment. In this chapter, we describe two transfer characteristic normalization methods in the cepstral domain to reduce the effect of channel distortion.

Compensating the input features is the main way to reduce a mismatch between the practical environment and the training environment. Cepstral Mean Normalization (CMN) has been used to reduce channel distortion as a simple and effective way of normalizing the feature space [4, 33]. CMN reduces errors caused by the mismatch between test and training conditions, and it is also very simple to implement. Thus, it has been adopted in many current systems. However, the system has to wait until the end of the speech to activate the recognition procedure when adopting the conventional CMN [4]. The other problem is that an accurate cepstral mean cannot be estimated, especially when the utterance is short. However, the recognition of short utterances such as commands, city names, etc. is very important in many applications. In [63], the CMN was modified to estimate compensation parameters from a few past utterances for real-time recognition. But in a distant environment, the transmission characteristics from different speaker positions are very different. This means that the method in [63] cannot track the rapid change of the transmission characteristics caused by a change in the speaker position, and thus cannot compensate for the mismatch in the context of hands-free speech recognition. In this chapter, we propose an effective channel compensation method for robust speech/speaker recognition using a new real-time CMN based on the speaker position, which we call position-dependent CMN. We measured the transmission characteristics (the compensation parameters for position-dependent CMN) from some grid points in the room a priori. Four microphones were arranged in a T-shape on a plane, and the sound source position was estimated by the Time Delay of Arrival (TDOA) among the microphones [67, 68, 66]. The system then adopts the compensation parameter corresponding to the estimated position, applies a channel distortion compensation method to the speech (that is, position-dependent CMN) and performs speech recognition. Speech/speaker recognition uses the input features compensated by the proposed position-dependent CMN. In our method, cepstral means have to be estimated a priori from utterances spoken in each area, but this is costly. The simple solution is to use utterances emitted from a loudspeaker to estimate them. But these cannot be used to compensate for real utterances spoken by a human, because of the effects of the recording and playback equipment. We also solve this problem by compensating for the mismatch between voices from a human and from a loudspeaker using compensation parameters estimated by a low-cost method. In order for CMN to be effective, the length of the channel impulse response needs to be shorter than the short-term spectral analysis window, which is usually 16 ms - 25 ms. However, the duration of the impulse response of reverberation usually has a much longer tail in a distant-talking environment. Therefore, conventional CMN, in which cepstral means are estimated from the entire current utterance using the short-term analysis window, is not effective under these conditions. Several studies have focused on alleviating this problem. Raut et al. [56, 57] use preceding states as units of preceding speech segments, and by estimating their contributions to the current state using a maximum likelihood function, they adapt the models accordingly. In this thesis, we address the effect of long reverberation by a feature-based compensation method that is easier to implement. In [61, 62] a multiresolution channel normalization based speech recognition front end has been implemented by subtracting the mean of the log magnitude spectrum using a long-term spectral analysis window. At first, they used a long time window (high frequency resolution; 2 seconds) analysis and applied channel normalization. Then, they transformed the long-time representation into a short-time representation. Finally, cepstral domain features

were computed for speech recognition. In this paper, we directly normalized the cepstral domain feature based on long-term spectrum corresponding to static speech signal and short-term spectrum corresponding to non-static speech signal for speech recognition in one step. In this chapter, we propose another effective channel compensation method for robust speech recognition by combining a short-term spectrum based CMN with a long-term spectrum based CMN, which we call Variable-Term spectrum based CMN (VT-CMN). We assume that static speech segments (such as vowels, for example) affected by reverberation can be modeled by a long-term cepstral analysis. Thus, the effect of long reverberation on a static speech segment may be compensated by the long-term spectrum based CMN. For speech recognition, short-term and long-term cepstral coefficients are extracted a priori. The cepstral distance of neighboring frames is used to discriminate the static and non-static speech segments. A speech segment with a smaller variance between neighboring frames is detected as a static speech segment. The cepstra of static and non-static speech segments are normalized by the corresponding cepstral means. In conventional CMN, the cepstral mean is previously estimated by averaging along the entire current utterance and is kept constant during the normalization. However, this off-line estimation involves a long delay that is likely to be unacceptable when the utterance is long. If the utterance is short, an accurate cepstral mean cannot be estimated. Various window CMN methods have been used to normalize the feature vectors in an on-line version [54, 55]. However, a tradeoff exists between delay and recognition error [55]. Thus, the usual CMN cannot achieve good recognition performance with a short delay. The position-dependent CMN mentioned above can be performed well without any delay because it measure the compensation parameter a priori, and it will be indicated in Chapters 5 and 6. In this chapter, position-dependent cepstral means are estimated from short-term cepstra using non-static speech segments and from long-term cepstra using static speech segments. The cepstra of the static and non-static speech segments are then subtracted from the corresponding cepstral means depending on the speaker position. We call this method Variable-Term spectrum based PDCMN (VT-PDCMN).

4.2 Conventional Cepstral Mean Normalization

A simple and effective way of channel normalization is to subtract the mean of each cepstral coefficient (CMN) [4, 33, 86], which removes time-invariant distortions caused by the transmission channel and the recording device. When speech s[l] is corrupted by convolutional noise h[l] and additive noise n[l], the observed speech x[l] becomes

36

CHAPTER 4. NORMALIZATION OF TRANSFER CHARACTERISTIC IN CEPSTRAL DOMAIN

x[l] = h[l] ⊗ s[l] + n[l].

(4.1)

We, however, conducted our experiments in a silent seminar room, so the effect of noise is ignored in this thesis. Eq. (4.1) then becomes x[l] = h[l] \otimes s[l]. CMN has been used to compensate for the convolutional distortion. In order for CMN to be effective, the length of the impulse response has to be shorter than the short-term spectral analysis window. However, in a distant-talking environment, the length of the impulse response is longer than the short-term spectral analysis window, and therefore the late part of the impulse response cannot be compensated. To analyze the effect of the impulse response, h[l] can be separated into two parts h_1[l] and h_2[l] as

h_1[l] = \begin{cases} h[l] & l < L \\ 0 & \text{otherwise} \end{cases}, \qquad h_2[l] = \begin{cases} h[l+L] & l \ge 0 \\ 0 & \text{otherwise} \end{cases},    (4.2)

where L is the length of the spectral analysis window, and h[l] = h_1[l] + \delta(l-L) \otimes h_2[l]. \delta(\cdot) is the Dirac delta function (that is, the unit impulse function). Formula (4.1) can then be rewritten as

x[l] = s[l] \otimes h_1[l] + s[l-L] \otimes h_2[l],    (4.3)

where the early effect is within a frame (analysis window), and the late effect spans multiple frames. In [87, 88], the early term of Eq. (4.3) was compensated by the conventional CMN, whereas the late term of Eq. (4.3) was treated as additive noise, and a noise reduction technique based on spectral subtraction was applied. In this thesis, we focus on increasing the length of the analysis window L, which reduces the early effect of the impulse response (that is, the first term of Eq. (4.3)) as much as possible. The cepstrum is obtained by applying the DCT to the logarithm of the power spectrum of the signal (that is, C^x = DCT(\log |DFT(x)|^2)), and thus Eq. (4.1) becomes

C^x = C^h + C^s,    (4.4)

where C^x, C^h and C^s express the cepstra of the observed speech x, the transmission characteristics h, and the clean speech s, respectively. On this basis, the convolutional noise can be considered as an additive bias in the cepstral domain, so the noise (transmission characteristics or channel distortion) can be compensated by CMN in the cepstral domain as:

\tilde{C}_t = C_t^x - \Delta C, \quad (t = 0, \ldots, T),    (4.5)

\Delta C \approx \bar{C}^x - \bar{C}^{train},    (4.6)

where C˜t and Ctx are the compensated and original cepstra at time frame t, and C¯ x and C¯ train are the cepstral means of utterances to be recognized and those to be used to train the speaker-independent acoustical model, respectively.
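A minimal sketch of Eqs. (4.5)-(4.6), assuming the cepstra are held in a NumPy array of shape (frames, dimensions):

```python
import numpy as np

def cmn(cepstra, train_mean):
    """Conventional CMN (Eqs. (4.5)-(4.6)).

    cepstra:    (T, D) array of cepstral vectors C_t^x of the utterance to be recognized
    train_mean: (D,)   cepstral mean of the training utterances, C-bar^train
    """
    delta_c = cepstra.mean(axis=0) - train_mean     # Delta C = C-bar^x - C-bar^train
    return cepstra - delta_c                        # C-tilde_t = C_t^x - Delta C
```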

4.3 Position-dependent Cepstral Mean Normalization

4.3.1 Real-time CMN

When using the conventional CMN, the compensation parameter \Delta C can only be calculated at the end of the input speech. This prevents real-time processing of speech recognition. The other problem of conventional CMN is that accurate cepstral means cannot be estimated, especially when the utterance is short. We solve these problems under the assumption that the channel distortion does not change drastically. In our method, the compensation parameter is calculated from utterances recorded a priori. The new compensation parameter is defined by

\Delta C = \bar{C}^{environment} - \bar{C}^{train},    (4.7)

where \bar{C}^{environment} is the cepstral mean of utterances recorded in the practical environment a priori. Using this method, the compensation parameter can be applied from the beginning of the recognition of the current utterance. Moreover, as the compensation parameter is estimated from a sufficient number of cepstral coefficients, it can compensate for the distortion better than the conventional CMN. We call this method real-time CMN. In our early work [63], the compensation parameter was calculated from past recognized utterances. Thus, the calculation of the compensation parameter for the n-th utterance is

\Delta C^{(n)} = (1-\alpha) \Delta C^{(n-1)} - \alpha \times (\bar{C}^{train} - C^{(n-1)}),    (4.8)

where ∆C (n) and ∆C (n−1) are the compensation parameters for the n-th and (n − 1)-th utterances, respectively, and C (n−1) is the mean of cepstrums of the (n − 1)-th utterance. Using this method, the compensation parameter can be calculated before recognition of the n-th utterance. This method can indeed track the slow changes in transmission characteristics, but the characteristic changes caused by the change in speaker position or speaker are beyond the tracking ability of this method.

4.3.2 Incorporate Speaker Position Information into Real-time CMN

In a real distant environment, the transmission characteristics of different speaker positions are very different because of the distance between the speaker and the microphone, and the reverberation of the room. Hence, the performance of a speech recognition system based on the real-time CMN will be drastically degraded because of the great change of channel distortion.

[Figure: the cepstrum of a distant human utterance is C^s_human + C^h_environment, while that of a distant loudspeaker utterance is C^s_human + C^h_loudspeaker + C^h_environment; removing the loudspeaker terms gives the compensated human cepstrum C-tilde^s_human = C^x_loudspeaker - C^h_loudspeaker - C^h_environment.]

Figure 4.1: Illustration of compensation of transmission characteristics between human and loudspeaker (same microphone)

In this chapter, we incorporate speaker position information into the real-time CMN [93], and use it to compensate for reverberant speech for distant-talking speech/speaker recognition in Chapters 5 and 6. We call this method position-dependent CMN. The new compensation parameter for the position-dependent CMN is defined by

\Delta C = \bar{C}^{position} - \bar{C}^{train},    (4.9)

where \bar{C}^{position} is the cepstral mean of utterances affected by the transmission characteristics between a certain position and the microphone. In our experiments in Chapters 5 and 6, we divide the room into 12 areas as in Fig. 3.3 and measure/prepare the \bar{C}^{position} corresponding to each area.

4.3.3 Problem and Solution

In the position-dependent CMN, the compensation parameters should be calculated a priori depending on the area, but it is not realistic to record a sufficient amount of utterances spoken in each area by a sufficient number of humans, because that would take too much time. Thus, in our experiments, the utterances were emitted from a loudspeaker in each area. However, because the cepstral means were estimated using utterances distorted by the transmission characteristics of the channel including the loudspeaker, they cannot be used directly to compensate for real utterances spoken by a human. In this thesis, we solve this problem by compensating for the mismatch between voices from a human and from a loudspeaker. An observed cepstrum of a distant human utterance is as follows:

C^x_{human} = C^s_{human} + C^h_{environment},    (4.10)

where C^x_{human}, C^s_{human} and C^h_{environment} are the cepstra of the observed human utterance, the emitted human utterance and the transmission characteristics from the human's mouth to the microphone, respectively. In contrast, an observed cepstrum of a distant loudspeaker utterance is as follows:

C^x_{loudspeaker} = C^s_{loudspeaker} + C^h_{environment} = C^s_{human} + C^h_{loudspeaker} + C^h_{environment},    (4.11)

where C^x_{loudspeaker}, C^s_{loudspeaker} and C^h_{loudspeaker} are the cepstra of the observed speech emitted by the loudspeaker, the human utterance emitted by the loudspeaker and the transmission characteristics of the loudspeaker, respectively. That is, the speech emitted by the loudspeaker is human speech corrupted by the transmission characteristics of the loudspeaker. The difference between Equations (4.10) and (4.11) is C^h_{loudspeaker}, and this is independent of other environmental factors such as the speaker position. Thus, the compensation parameter \Delta C in (4.9) is modified as:

∆C = {C¯ position − C¯ train } − {C¯ loudspeaker − C¯ human },

(4.12)

where \bar{C}^{human} and \bar{C}^{loudspeaker} are the cepstral means of close-talking human utterances and of utterances from a close loudspeaker, respectively. We used far fewer human utterances to estimate \bar{C}^{human} than to estimate the position-dependent cepstral means. In addition, we only need close-talking utterances, which are easier to record than distant-talking utterances. A detailed illustration is shown in Fig. 4.1. Examples of Euclidean cepstrum distances between different speakers, different vowels and different positions are described and analyzed in Chapter 6.
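A sketch of the position-dependent compensation of Eq. (4.9), optionally including the loudspeaker correction of Eq. (4.12), is shown below; the dictionary of per-area means and the area index are assumed to be provided by the localization of Chapter 3.

```python
import numpy as np

def pdcmn(cepstra, position_means, area, train_mean,
          loudspeaker_mean=None, human_mean=None):
    """Position-dependent CMN (Eq. (4.9)) with optional loudspeaker correction (Eq. (4.12)).

    cepstra:        (T, D) cepstra of the current utterance
    position_means: dict mapping an area index to C-bar^position measured a priori
    area:           area index selected from the estimated speaker position
    train_mean:     C-bar^train of the acoustic-model training data
    loudspeaker_mean / human_mean: close-talking cepstral means used in Eq. (4.12);
                    if omitted, the plain Eq. (4.9) compensation is applied
    """
    delta_c = position_means[area] - train_mean
    if loudspeaker_mean is not None and human_mean is not None:
        delta_c = delta_c - (loudspeaker_mean - human_mean)
    return cepstra - delta_c
```

Unlike conventional CMN, no statistics of the current utterance are needed, so the compensation can be applied from the first frame.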

4.4 Combination of Short-term and Long-term Spectrum Based CMN/PDCMN

4.4.1 Combination of Short-term and Long-term Spectrum Based CMN

In the traditional method, a short-term cepstral analysis is used. However, the duration of impulse response of reverberation usually has a much longer tail in a distant-talking environment. Therefore, the conventional CMN is not effective under these conditions.

For the static part of the speech signal, the spectrum can be extracted with a long-term analysis window because the speech signal is stationary there. We assume that a static speech segment affected by long reverberation can be modeled by a long-term cepstral analysis. Thus, the effect of long reverberation on a static speech segment may be compensated by the long-term spectrum based CMN. On the other hand, for the non-static part of the speech signal, the Fourier transform cannot usefully be applied over a long-term analysis window because the long-term speech signal is not stationary: a long-term analysis window yields too low a time resolution for transient speech. Thus, the long-term CMN cannot be applied to the non-static part of the speech signal, because a reliable long-term cepstral mean is not available either. In the case of a non-static speech segment, the traditional short-term spectrum based CMN is therefore used. Thus, the combination of short-term and long-term spectrum based CMN is defined as:

\tilde{C}_t = C_t^x - \Delta C =
\begin{cases}
C^{x}_{short,t} - \Delta C_{short} = C^{x}_{short,t} - (\bar{C}^{x}_{short} - \bar{C}^{train}_{short}) & \text{if the } t\text{-th speech segment is non-static} \\
C^{x}_{long,t} - \Delta C_{long} = C^{x}_{long,t} - (\bar{C}^{x}_{long} - \bar{C}^{train}_{long}) & \text{if the } t\text{-th speech segment is static,}
\end{cases}    (4.13)

where C^{x}_{short,t} and C^{x}_{long,t} are the original short-term and long-term cepstra at time frame t, \bar{C}^{x}_{short} and \bar{C}^{x}_{long} are the short-term and long-term cepstral means of the utterance to be recognized, and \bar{C}^{train}_{short} and \bar{C}^{train}_{long} are the short-term and long-term cepstral means of the utterances used to train the speaker-independent acoustic model, respectively.

4.4.2 Variable-term Spectrum Based Position-dependent CMN

We extend the concept of combining short-term and long-term spectrum based CMN to PDCMN, which we call Variable-Term spectrum based PDCMN (VT-PDCMN). C¯ position and C¯ train are both estimated by averaging short-term cepstra obtained from non-static speech segments and long-term cepstra obtained from static speech segments. The cepstrum of the t-th speech segment Ctx is compensated by ∆C = C¯ position − C¯ train , while the corresponding C¯ position and C¯ train are selected as:

\bar{C}^{position} =
\begin{cases}
\bar{C}^{position}_{short} & \text{if the } t\text{-th segment is non-static} \\
\bar{C}^{position}_{long} & \text{if the } t\text{-th segment is static,}
\end{cases}    (4.14)

\bar{C}^{train} =
\begin{cases}
\bar{C}^{train}_{short} & \text{if the } t\text{-th segment is non-static} \\
\bar{C}^{train}_{long} & \text{if the } t\text{-th segment is static,}
\end{cases}    (4.15)

where C¯ position short , C¯ position long are short-term and long-term cepstral means of utterances emitted from a certain position, and C¯ train short and C¯ train long are short-term and long-term cepstral means of utterances to be used to train the speaker-independent acoustical model, respectively.

4.4.3 Static and Non-static Speech Segment Detection

Test and training utterances include both static and non-static speech. In order to estimate the cepstral means of the static and non-static speech segments and to normalize the corresponding cepstral features, the detection of static and non-static speech segments is necessary and important for speech recognition using the proposed method. It is well known that a static speech segment has a smaller variance between neighboring frames than a non-static speech segment. To discriminate static and non-static speech segments, the cepstral distance (Euclidean distance) of neighboring frames is defined as:

D(C_t, C_{t+1}) = \sqrt{\sum_{m=1}^{M} (C_t^m - C_{t+1}^m)^2},    (4.16)

where Ctm is the m-th cepstrum of the t-th frame. A speech segment is more likely to be a static speech segment when the cepstral distance (that is, spectral envelope distance) between neighboring frames is small. For the sake of simplicity, a certain percentage of speech segments with smaller cepstral distances is identified as static speech segments in this chapter and Chapter 6.
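A sketch of the static/non-static decision of Eq. (4.16) and the frame-wise selection of Eq. (4.13) is given below, assuming that short-term and long-term cepstra are computed frame-synchronously; the 30% static ratio is an assumed placeholder, since the thesis only states that a certain percentage of frames is selected.

```python
import numpy as np

def detect_static_frames(cepstra, static_ratio=0.3):
    """Label frames as static/non-static by the neighbouring-frame distance of Eq. (4.16).

    The static_ratio fraction of frames with the smallest distances is treated as static
    (an assumed tuning value, not taken from the thesis)."""
    dist = np.linalg.norm(np.diff(cepstra, axis=0), axis=1)   # D(C_t, C_{t+1})
    dist = np.append(dist, dist[-1])                          # pad so every frame has a score
    threshold = np.quantile(dist, static_ratio)
    return dist <= threshold                                  # True = static frame

def vt_cmn(cep_short, cep_long, delta_short, delta_long, is_static):
    """Variable-term CMN (Eq. (4.13)): subtract the long-term compensation for static
    frames and the short-term compensation otherwise.

    delta_short / delta_long are the compensation vectors Delta C_short and Delta C_long."""
    out = cep_short - delta_short
    out[is_static] = cep_long[is_static] - delta_long
    return out
```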

4.5 Summary

In this chapter, we presented some effective channel normalization techniques for speech/speaker recognition, in which the conventional CMN was modified for more effective performance and lower computational complexity. In Section 4.2, a simple and effective channel normalization method, CMN, was briefly described. To address the tradeoff between recognition error and delay in the conventional CMN, a robust speech/speaker recognition method using a novel real-time CMN based on the speaker position, called Position-Dependent CMN (PDCMN), was proposed in Section 4.3. In distant-talking environments, the duration of the impulse response of reverberation usually has a much longer tail. Therefore, the conventional CMN/PDCMN is not effective under these conditions. In Section 4.4, a combination of a short-term spectrum based CMN with a long-term spectrum based CMN was proposed, and the concept of the

variable-term spectrum based CMN was extended to PDCMN. These novel channel normalization methods in the cepstral domain will be evaluated in Chapters 5 and 6.

Chapter 5

Distant-talking Speaker Recognition

5.1 Introduction

Distant-talking speaker identification and verification [87, 88, 89] have received more and more attention recently. However, in a distant environment, channel distortion may drastically degrade speaker recognition performance. This is mostly caused by the mismatch between the practical environment and the training environment. Over the past few decades, several approaches have been proposed to compensate for the adverse effects of a mismatch between training and testing conditions. One of them is the feature-based compensation technique, which compensates the noisy features to match the training environment before they are fed to the recognizer. The other is the model-based adaptation technique, which adapts/trains the speaker models using data from the noisy condition. In this chapter, we use both the feature-based compensation technique and the model-based adaptation technique to obtain robust distant speaker recognition performance. A feature-based technique which compensates the input features is the main way to reduce a mismatch. Cepstral Mean Normalization (CMN) has been used to reduce the channel distortion as a simple and effective way of normalizing the feature vectors [4, 34]. CMN is sometimes supplemented with variance normalization. Recently, other efficient feature normalization approaches have been proposed for improving speaker recognition performance, mainly feature warping [43] and short-time Gaussianization [52]. Feature warping consists of mapping the observed cepstral feature distribution to a normal distribution over a sliding window, the various cepstral coefficients being processed in parallel streams. The
short-time Gaussianization is similar but applies a linear transformation to the feature before mapping them to a normal distribution. This linear transformation, which can be estimated by the EM algorithm, makes the resulting features better suited to diagonal covariance GMMs. Barras and Gauvain [14] evaluated those feature normalization methods on cellular phone data. The essence of our study is how the transmission characteristics were affected depending on the speaker position. Since it is very easy to implement, CMN has been adopted in many current systems without loss of generality, and we used CMN to verify our idea. The other feature normalization methods were not evaluated in this paper. In the conventional CMN, the cepstral mean is previously estimated by averaging along the entire current utterance and is kept constant during the normalization. However, this off-line estimation involves a long delay that is likely unacceptable when the utterance is long. If the utterance is short, the accurate cepstral mean cannot be estimated. Alternatively, when the conditions (environment, channel, speaker, etc.) do not change for a period of time, the cepstral mean can be estimated from a given set of previous utterances, and so the delay is avoided [63]. Nevertheless, condition change such as speaker and speaking position change etc. is existed in certain applications. In a distant environment, the transmission characteristics from different speaker positions are very different, so the Kitaoka method [63] cannot track the rapid change of the transmission characteristics. Various window CMN methods have been used to normalize the feature vectors in an on-line version [54, 55]. However, there exists a tradeoff between delay and recognition error [55]. Thus, the usual CMN cannot obtain an excellent recognition performance with a short delay. In this chapter, we propose a robust speaker recognition method using a new real-time CMN based on speaker position which is called position-dependent CMN described in Chapter 4. How to normalize intra-speaker variation of likelihood (similarity) values is one of the most difficult problems in speaker verification. Recently, some score normalization techniques have been used for speaker verification, while it is not applicable for speaker identification. The most frequently used among them are the Z-norm and the T-Norm [13, 14]. These two score normalization methods, which normalize the distribution of the scores, have been proven to be quite efficient [13]. Barras and Gauvain [14] indicated that the combination of feature normalization and score normalization improved the verification performance more than that of the individual one since their effects were cumulative. In this chapter, we describe that feature normalization dependent on speaker position compensated the mismatch between test and training conditions efficiently. We did not conduct speaker verification experiments using such a combination in this chapter, but we believe that the combination also achieves good performance on speaker verification. For speaker recognition, various types of speaker models have long been stud-

ied. Hidden Markov models (HMMs) have become the most popular statistical tool for the text-dependent task. The best results have been obtained using continuous density HMMs (CHMMs) for modeling speaker characteristics [46]. For the text-independent task, the temporal sequence modeling capability of the HMM is not required. Therefore, one state CHMM, also called a Gaussian mixture model (GMM), has been widely used as a speaker model [48]. In GMM modeling techniques, feature vectors are assumed to be statistically independent. Although this is not true, it allows one to simplify mathematical formulations. To overcome this assumption, models based on segments of feature frames were proposed [34]. One of the disadvantages of GMM is that the acoustic variability dependent on phonetic events is not directly taken into account. In other words, this modeling is not sufficiently constrained by the phonetic temporal pattern. Therefore, speech recognition techniques have been used for text-dependent speaker identification [35]. This approach is also used for textindependent speaker identification. Nakagawa et al. [36, 37] proposed a new speaker recognition method by combining speaker-specific GMMs with speakeradapted syllable-based HMMs, which showed robustness for the change of the speaking style in a close-talking environment. In this chapter, we extend this combination method to distant speaker recognition and integrate this method with the proposed position-dependent CMN. The MAP-based speaker adaptation method from the speaker-independent GMM (speaker-adapted GMM) was indeed an effective speaker identification method [44], but Nakagawa et al. [36, 37] indicated that the speaker-specific GMM obtained slightly better performance than the speaker-adapted GMM. So we used the speaker-specific GMMs in this chapter. Since the training data were not sufficient for speaker-specific HMMs, we used speaker-independent HMMs adapted to the speakers by MAP (speaker-adapted HMMs) instead of speaker-specific HMMs. Reynolds and Rose [45] also proposed an effective Gaussian mixture speaker models for robust text-independent speaker identification. The essence of this chapter is that the transmission characteristics are distorted depending on different speaker positions and that the proposed position-dependent CMN can address this problem effectively. Furthermore, the position-dependent CMN could also be integrated with various speaker models. The consideration of state-of-the-art speaker models is beyond the scope of this chapter. Thus, in this chapter, syllable-based HMMs adapted to the speakers by MAP and speaker-specific GMMs whose parameters were estimated by the EM algorithm using training data uttered by the corresponding speaker were used for speaker identification.

5.2 Speaker Recognition Based on Position-dependent CMN by Combining Speaker-specific GMM with Speaker-adapted HMM

5.2.1 Speaker modeling

5.2.1.1 Gaussian Mixture Model (GMM)

A GMM is a weighted sum of M component densities and is given by the form

p(x \mid \lambda) = \sum_{i=1}^{M} c_i b_i(x),    (5.1)

where x is a d-dimensional random vector, b_i(x), i = 1, \cdots, M, are the component densities and c_i, i = 1, \cdots, M, are the mixture weights. Each component density is a d-variate Gaussian function of the form

b_i(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\},    (5.2)

with mean vector \mu_i and covariance matrix \Sigma_i. The mixture weights satisfy the constraint

\sum_{i=1}^{M} c_i = 1.    (5.3)

The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities. These parameters are collectively represented by the notation

\lambda = \{c_i, \mu_i, \Sigma_i\}, \quad i = 1, \ldots, M.    (5.4)

In our speaker recognition system, each speaker is represented by such a GMM and is referred to by this model \lambda. For a sequence of T test vectors X = x_1, x_2, \cdots, x_T, the standard approach is to calculate the GMM likelihood in the log domain as

L(X \mid \lambda) = \log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda).    (5.5)

The speaker specific GMM parameters are estimated by the EM algorithm using training data uttered by the corresponding speaker using the HTK toolkit [53].
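A scoring sketch for Eqs. (5.1)-(5.5) with diagonal covariances (as used for the speaker-specific GMMs in Section 5.3.1) is given below; the thesis estimates the GMM parameters with the HTK toolkit, so only the log-likelihood computation and the identification decision are shown here.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """L(X|lambda) of Eq. (5.5) for a diagonal-covariance GMM.

    X:         (T, D) feature vectors
    weights:   (M,)   mixture weights c_i
    means:     (M, D) mean vectors mu_i
    variances: (M, D) diagonal elements of Sigma_i
    """
    D = X.shape[1]
    # log of each weighted component density, shape (T, M)
    log_comp = (np.log(weights)
                - 0.5 * (D * np.log(2.0 * np.pi) + np.log(variances).sum(axis=1))
                - 0.5 * (((X[:, None, :] - means[None, :, :]) ** 2) / variances).sum(axis=2))
    # log p(x_t|lambda) = log-sum-exp over the mixtures, then sum over frames
    m = log_comp.max(axis=1, keepdims=True)
    log_frame = (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()
    return float(log_frame.sum())

def identify_speaker(X, speaker_models):
    """Return the speaker whose GMM gives the highest log-likelihood for X.

    speaker_models: dict mapping a speaker id to (weights, means, variances)."""
    return max(speaker_models,
               key=lambda spk: gmm_log_likelihood(X, *speaker_models[spk]))
```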

5.2.1.2 Speaker-adapted HMM

A parameter set of an HMM is given by \lambda = \{A, B, \pi\}, where A, B and \pi denote the set of state transition probabilities, the set of output probability density functions, and the set of initial state probabilities, respectively. We used context-independent syllable-based HMMs as acoustic models, each of which has a left-to-right topology and consists of 5 states, 4 with pdfs (probability density functions) of output probability. Each pdf consists of four Gaussians with full-covariance matrices. The number of syllables is 114 or 116. A list of the 116 syllables used in this thesis is shown in Table 5.1. Speaker adaptation is performed for B. We briefly describe the adaptation method for a Gaussian distribution. The speaker adaptation by Maximum A Posteriori (MAP) estimation [50, 51] is as follows:

\hat{\mu}_N = \frac{\gamma \mu_0 + \sum_{i=1}^{N} X_i}{\gamma + N} = \frac{(\gamma + N - 1)\hat{\mu}_{N-1} + X_N}{\gamma + N},    (5.6)

where \{X_1, X_2, \cdots, X_N\} denotes the training sample vectors and \gamma corresponds to the reliability of the prior mean vector \mu_0, i.e., the mean vector of the speaker-independent distribution [40]. We set \gamma to 15 [51]. N(\hat{\mu}_N, \hat{\Sigma}_N) denotes the estimated Gaussian model adapted by the training samples.
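Eq. (5.6) reduces to a simple interpolation between the prior mean and the adaptation data. The sketch below adapts a single Gaussian mean; in the full system the samples would be the frames assigned to that Gaussian by aligning the adaptation data, which is omitted here.

```python
import numpy as np

def map_adapt_mean(mu0, samples, gamma=15.0):
    """MAP adaptation of a Gaussian mean, Eq. (5.6).

    mu0:     prior (speaker-independent) mean vector
    samples: (N, D) adaptation vectors X_1 .. X_N assigned to this Gaussian
    gamma:   prior weight (set to 15 in this thesis)
    """
    samples = np.atleast_2d(samples)
    n = samples.shape[0]
    return (gamma * mu0 + samples.sum(axis=0)) / (gamma + n)
```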

5.2.2 Speaker identification procedure

Figure 5.1 shows the procedure of our speaker identification system. In this system, the input speech is analyzed and transformed into a feature vector sequence by the front-end analysis block, and then each test vector is fed to all reference speaker models, i.e., the GMMs and the speaker-adapted syllable-based HMMs, in parallel. Stated mathematically, the GMMs find the most likely speaker S given the observation sequence O, while the HMMs find the most likely unit (syllable) sequence U given a reference speaker S and the observation sequence O. When the HMMs are combined with the GMMs, the combination method finds the jointly most likely unit (syllable) sequence U and speaker S given the observation sequence O. Thus, the speaker can be identified more accurately because the combination method exploits the uttered context. Our proposed combination method is formulated as:

\{\hat{U}, \hat{S}\} = \arg\max_{U,S} P(U \mid S, O) P(S \mid O)    (5.7)
                    = \arg\max_{U,S} P(U, S \mid O),    (5.8)

where the first term on the right-hand side of Equation (5.7) is calculated by the pdfs (probability density functions) of the HMMs for speaker recognition and the second term is calculated by the pdfs of the GMMs.


Table 5.1: A list of the 116 syllables

(vowel only) : a あ, i い, u う, e え, o お
k  : ka か, ki き, ku く, ke け, ko こ
g  : ga が, gi ぎ, gu ぐ, ge げ, go ご
s  : sa さ, su す, se せ, so そ
z  : za ざ, zi じ, zu ず, ze ぜ, zo ぞ
t  : ta た, ti てぃ, te て, to と
ts : tsu つ
d  : da だ, di でぃ, du どぅ, de で, do ど
n  : na な, ni に, nu ぬ, ne ね, no の
h  : ha は, hi ひ, he へ, ho ほ
f  : fa ふぁ, fi ふぃ, fu ふ, fe ふぇ, fo ふぉ
p  : pa ぱ, pi ぴ, pu ぷ, pe ぺ, po ぽ
b  : ba ば, bi び, bu ぶ, be べ, bo ぼ
m  : ma ま, mi み, mu む, me め, mo も
y  : ya や, yu ゆ, yo よ
r  : ra ら, ri り, ru る, re れ, ro ろ
w  : wa わ, wi うぃ
ky : kya きゃ, kyu きゅ, kyo きょ
gy : gya ぎゃ, gyu ぎゅ, gyo ぎょ
sh : sha しゃ, shi し, shu しゅ, she しぇ, sho しょ
j  : ja じゃ, ju じゅ, je じぇ, jo じょ
ch : cha ちゃ, chi ち, chu ちゅ, che ちぇ, cho ちょ
dy : dyu でゅ
ny : nya にゃ, nyu にゅ, nyo にょ
hy : hya ひゃ, hyu ひゅ, hyo ひょ
py : pya ぴゃ, pyu ぴゅ, pyo ぴょ
by : bya びゃ, byu びゅ, byo びょ
my : mya みゃ, myu みゅ, myo みょ
ry : rya りゃ, ryu りゅ, ryo りょ
N  : N ん, qs っ
Short pause within an utterance: sp; pause at the beginning/end of an utterance: SIL


Figure 5.1: Speaker identification by combining speaker-specific GMM with speaker-adapted syllable-based HMMs

the second term is calculated by the pdfs of the GMMs. In the logarithmic domain, the multiplication becomes an addition. In practice, combining the GMMs and the HMMs with a weighting coefficient is a good scheme because of the differences in their training/adaptation methods, etc. The i-th speaker-dependent GMM produces the likelihood L_{GMM}^{i}(X), i = 1, 2, ..., N, where N is the number of registered speakers. The i-th set of speaker-adapted syllable-based HMMs also produces the likelihood L_{HMM}^{i}(X) using a continuous syllable recognizer. All these likelihood values are passed to the likelihood decision block, where they are transformed into a new score L^{i}(X):

L^{i}(X) = (1 - \alpha) L_{GMM}^{i}(X) + \alpha L_{HMM}^{i}(X),     (5.9)

where α denotes a weighting coefficient.
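A minimal sketch of the likelihood decision block of Equation (5.9): per-speaker GMM and HMM log likelihoods are fused with a weighting coefficient α and the best-scoring speaker is selected. The dictionary-based interface and the numerical values are assumptions made purely for illustration.

```python
def combine_scores(gmm_scores, hmm_scores, alpha=0.7):
    """Fuse log likelihoods as L_i(X) = (1 - alpha) * L_GMM + alpha * L_HMM
    (Eq. (5.9)) and return the identified speaker and all fused scores."""
    fused = {spk: (1.0 - alpha) * gmm_scores[spk] + alpha * hmm_scores[spk]
             for spk in gmm_scores}
    return max(fused, key=fused.get), fused

# Hypothetical log likelihoods for three registered speakers.
gmm = {"spk01": -1210.5, "spk02": -1198.2, "spk03": -1225.9}
hmm = {"spk01": -1320.1, "spk02": -1305.7, "spk03": -1330.4}
best, scores = combine_scores(gmm, hmm, alpha=0.7)
```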

5.3 Experiments of Distant-talking Speaker Recognition

5.3.1 Experimental setup

We performed the experiment in a room measuring 3.45 m × 3 m × 2.6 m. The room was divided into 12 (3 × 4) rectangular areas as shown in Fig. 3.3, where each area measures 60 cm × 60 cm. We measured the transmission characteristics (that is, the mean cepstra of utterances recorded a priori) from the center of each area. The purpose of this experiment was to verify the validity of our proposed position-dependent CMN and the combination method, so we evaluated our method using only the speech recorded by a single microphone, M1, as shown in Fig. 3.1. It will be shown in Chapter 6 that multiple microphone processing


based on position-dependent CMN achieved a significant improvement in speech recognition performance over both conventional microphone array processing and single microphone processing. Therefore, our multiple microphone processing method should also be effective for speaker recognition, but the evaluation of this method is left for future work.

In our method, the estimated speaker position should be used to determine the area (60 cm × 60 cm) in which the speaker is located. The study in [95] found that an average location error of less than 10 cm could be achieved using only 4 microphones in a room measuring 6 m × 10 m × 3 m, in which the source positions were uniformly distributed over a 6 m × 6 m area. Chapter 3 also revealed that the speaker position could be estimated with errors of 10–15 cm by the T-shaped 4-microphone system shown in Fig. 3.1 without interpolation between consecutive samples. In Section 5.3.2, therefore, we assumed that the position area was accurately estimated and evaluated only our proposed speaker recognition methods.

Twenty male speakers from the Tohoku University and Panasonic isolated spoken word database [78] uttered 200 isolated words, each with a close-talking microphone. The average duration of all utterances is about 0.6 seconds. For the utterances of each speaker, the first 100 words were used as test data and the rest for estimation of the cepstral means C̄_position in Equation (4.9). The same last 100 training utterances were also used to adapt the syllable-based HMMs and to train the speaker-specific GMM for each speaker. All the utterances were emitted from a loudspeaker located in the center of each area and recorded for testing and for estimation of C̄_position, to simulate utterances spoken at various positions. The sampling frequency was 12 kHz. The frame length was 21.3 ms and the frame shift was 8 ms with a 256-point Hamming window.

Then, 114 Japanese speaker-independent syllable-based HMMs (strictly speaking, mora-unit HMMs [96]) were trained using 27992 utterances read by 175 male speakers (JNAS corpus). Each continuous-density HMM had 5 states, 4 with pdfs of output probability. Each pdf consisted of four Gaussians with full-covariance matrices. The feature space for the syllable-based HMMs comprised 10 mel-frequency LPC cepstral coefficients. First- and second-order derivatives of the cepstra plus first and second derivatives of the power component were also included. The feature space for the speaker-specific GMMs with diagonal matrices comprised 10 MFCCs. First-order derivatives of the cepstra plus the first derivative of the power component were also included. The mixture number of the GMMs is 32. The speaker-independent HMMs were adapted by MAP using 100 isolated words (about 60 seconds) recorded by a close-talking microphone. All the utterances were emitted from a loudspeaker. The GMMs were trained by the ML criterion using the same data. Speaker models were also adapted/trained using 50 isolated words (about 30 seconds) and 10 isolated words (about 6 seconds).


Table 5.2: Four methods for speaker recognition

method     | feature    | model (HMM/GMM)
W/o CMN    | W/o CMN    | W/o CMN
Conv. CMN  | Conv. CMN  | Conv. CMN
PICMN      | PICMN      | A priori CMN
PDCMN      | PDCMN      | A priori CMN

Three types of adaptation/training data were used depending on the experimental conditions. The three types of models obtained were: (1) the W/o CMN Model, using the features without CMN; (2) the Conv. CMN Model, using the features compensated by the conventional CMN; and (3) the A priori CMN Model, using the features compensated by CMN with a cepstral mean measured a priori. The compensation parameter of the features for the A priori CMN Model is defined by:

\Delta C = \bar{C}_{close-loudspeaker} - \bar{C}_{train},     (5.10)

where C̄_close−loudspeaker is the cepstral mean of utterances recorded by a close-talking microphone (that is, only the difference in recording equipment between the training and test environments was compensated). In our experiments, the four methods shown in Table 5.2 were compared. These four methods were defined by combining the corresponding features and models. PICMN (Position-Independent CMN) compensated the features using the compensation parameters averaged over the 12 areas, so the same compensation parameters were used in all areas (position-independent). In PICMN and PDCMN, the compensation parameters ΔC were independent of the speakers.
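The sketch below illustrates how a compensation parameter such as ΔC in Equation (5.10), or the position-dependent ΔC of Eq. (4.9), would be subtracted from the test cepstra. The array shapes, the random placeholder values and the lookup by area index are assumptions for illustration only.

```python
import numpy as np

def compensate_cepstra(cepstra, delta_c):
    """Subtract a precomputed compensation parameter Delta C
    (e.g., C_position - C_train for PDCMN) from every frame."""
    return cepstra - delta_c[None, :]      # cepstra: (T, d), delta_c: (d,)

# Hypothetical a priori table of per-area cepstral means (12 areas, d = 10).
position_means = {area: np.random.randn(10) * 0.1 for area in range(1, 13)}
train_mean = np.zeros(10)

def pdcmn(cepstra, estimated_area):
    """Position-dependent CMN using the mean of the estimated area."""
    delta_c = position_means[estimated_area] - train_mean
    return compensate_cepstra(cepstra, delta_c)
```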

5.3.2 Speaker recognition results

In this section, we assumed that the position area was accurately estimated and evaluated only our proposed speaker recognition methods. It is difficult to identify the correct speaker using only one utterance because the average duration (about 0.6 seconds) of the utterances is too short. Therefore, the connective likelihood of 3 utterances (about 1.8 seconds) was used to identify the speaker. As described in Section 5.3.1, 100 isolated words were used for the test, so we had 33 test samples for each speaker, that is, a total of 660 samples. In this chapter, the standard CMS was applied to a concatenation of the three words.

5.3.2.1 Speaker recognition by GMM

We evaluated our proposed feature-based compensation technique for speaker identification by GMM. The speaker recognition performance for a close-talking


microphone (that is, clean speech data) was 99.5% without CMN and 98.2% with CMN, respectively. The proposed method is referred to as PDCMN (Position-Dependent CMN). The results are shown in Fig. 5.2, in which PDCMN is compared with the baseline (recognition without CMN), the conventional CMN, "CM of area 2", "CM of area 5", "CM of area 12" and PICMN (Position-Independent CMN). Area 2 was nearest the microphones, and "CM of area 2" means that a fixed Cepstral Mean (CM) of the nearest area was used to compensate the input features of all 12 areas. Area 5 was at the center of all 12 areas, and area 12 was the farthest from the microphones. PICMN denotes the method in which the compensation parameters averaged over the 12 areas were used.

Without CMN, the recognition rate degraded drastically with the distance between the sound source and the microphone. For the feature-based compensation technique, since the conventional CMN removed the speaker characteristics and the utterances were too short (about 1.8 seconds), its result was much worse than that without CMN. The transmission characteristics from different speaker positions are very different, so the Cepstral Mean (CM) of each area differs considerably. In our experiment, area 5 is at the center of the whole area, whereas areas 2 and 12 are the nearest to and farthest from the microphones, respectively. We compared the "CM of area 5", "CM of area 2" and "CM of area 12" with the "average CM over all areas" (that is, the Position-Independent Cepstral Mean: PICM). The cepstral distance of the "CM of area 5" from the "PICM" was much smaller than those of the "CM of area 2" and the "CM of area 12". This means that the variation of the distances of the "CM of area 5" from the CMs of the other areas was much smaller than those of the other two. Accordingly, the recognition error rate of "CM of area 5" (4.71%) was significantly smaller than those of "CM of area 2" (5.34%) and "CM of area 12" (5.27%). The variation of the "CM of area 2" from the CMs of the other areas and that of the "CM of area 12" differed little, so the performance of "CM of area 2" and that of "CM of area 12" were very similar. PICMN worked better than the "CM of one fixed area" because it averaged the compensation parameters over all areas and thus obtained a more robust compensation parameter than that of any fixed area. Furthermore, the proposed PDCMN also compensated for the transmission characteristics according to the speaker position, and it worked better than PICMN and the other methods. The proposed PDCMN achieved relative error reduction rates of 64.0% from w/o CMN, 67.2% from Conv. CMN, 46.4% from "CM of area 2", 39.3% from "CM of area 5", 45.7% from "CM of area 12" and 30.2% from PICMN, respectively.

In light of our previous study on speaker localization, we assumed that the position area was accurately estimated. We also report on the impact of localization error on the recognition rate. We simulated this impact under two localization error situations. One situation was that the true "area 1"

Figure 5.2: Speaker identification error rates by GMM (isolated word): w/o CMN 7.94%, Conv. CMN 8.71%, CM of area 2 5.34%, CM of area 5 4.71%, CM of area 12 5.27%, PICMN 4.10%, PDCMN 2.86%

was incorrectly estimated as "area 2", and the other was that the true "area 4" was incorrectly estimated as "area 5". The cepstral mean of the incorrectly estimated area was used to compensate the speech uttered from the true area. We compared the speaker identification error rates for speech uttered from area 1. The rates were 1.37% by "CM of area 2", 2.74% by w/o CMN, 4.57% by the conventional CMN, 1.86% by PICMN and 0.76% by PDCMN, respectively. The speaker identification error rates for speech uttered from area 4 were also compared. The rates were 2.59% by "CM of area 5", 6.40% by w/o CMN, 7.16% by the conventional CMN, 3.20% by PICMN and 1.98% by PDCMN, respectively. The performance with localization error was obviously worse than the ideal case with the position-dependent CMN. However, although the corresponding area was estimated incorrectly, the result obtained with the "CM of the incorrectly estimated neighboring area" was remarkably better than those without CMN and with the conventional CMN, and it was even significantly better than that of the position-independent CMN. Even if the true area is estimated incorrectly as a neighboring area, the proposed method can thus still work much better than the other methods.

— NTT database —

To verify the robustness of the proposed method for distant speaker recognition, we also evaluated our methods on the NTT (Nippon Telegraph and Telephone Corp.) database, which is a standard database for Japanese speaker identification.


Figure 5.3: Speaker identification error rates by GMM (NTT database): w/o CMN 15.63%, Conv. CMN 9.43%, PICMN 9.92%, PDCMN 6.50%

The NTT database consists of recordings of 22 male speakers collected in 5 sessions over 10 months (1990.8, 1990.9, 1990.12, 1991.3 and 1991.6) in a soundproof room. For training the models, the same 5 sentences from one session (1990.8) were used for all speakers. Five other sentences, uttered at normal speed and identical for each speaker, from each of the other four sessions were used as test data (text-independent speaker identification). The average duration of the sentences was about 4 seconds. The input speech was sampled at 16 kHz. 12 MFCCs, their derivatives, and the delta log-power were calculated every 10 ms with a Hamming window of 25 ms.

We also verified the robustness of the proposed method in relation to changes in the environment. The simulated environment of Section 5.3.1 was modified to a more realistic environment in this section. The difference between the simulated and real environments may be described as follows: in the simulated environment, there was nothing in the room except the microphones and a sound source; in the realistic environment, the room was set up as a seminar room with a whiteboard beside the left wall, one table and some chairs arranged in the center of the room, one TV and some other tables, etc.

The speaker recognition performance for a close-talking microphone (that is, clean speech data) was 96.1% without CMN and 96.8% with CMN, respectively. The average results over all 12 areas are shown in Fig. 5.3. The proposed PDCMN


Table 5.3: Speaker identification error rate by the combination method using position-dependent CMN (3 utterances)

HMM   | GMM   | error rate (%)
MFCC  | MFCC  | 1.01
MFCC  | LPC   | 1.33
LPC   | MFCC  | 0.69
LPC   | LPC   | 1.26
MFCC  | —     | 3.30
LPC   | —     | 3.10
—     | MFCC  | 2.86
—     | LPC   | 4.69

achieved a relative error reduction rate of 58.4% from w/o CMN, 31.1% from the conventional CMN, and 34.5% from PICMN, respectively. The proposed PDCMN also worked better than the other methods on the new database under a more realistic environment. In this case, the conventional CMN was better than w/o CMN and PICMN because the average duration of the sentences (about 4 sec) was sufficient to estimate the cepstral mean accurately. This indicated that the proposed method is robust in terms of evaluation data and experimental environment.

5.3.2.2 Speaker recognition by combination method

Nakagawa et al. [36, 37] indicated that the integration of similar methods is less effective than that of different methods. Two speaker models using different features may therefore obtain better results than models using the same features. Using the position-dependent CMN, the speaker identification error rates are shown in Table 5.3. Since the combination of LPC-based HMMs and MFCC-based GMMs improved the speaker identification performance more than the other combinations, we used this setting in this paper. The speaker recognition result based on position-dependent CMN by combining GMMs with HMMs is shown in Fig. 5.4. The combination method improved the speaker recognition result remarkably and produced the best result when the weighting coefficient was α = 0.7. The essence of the improvement by the combination method was not the adjustment of the weighting coefficient α, but its simultaneous maximization of the joint probability over all possible unit sequences and speakers. Even when α was set to 0.5, the relative error reduction rate was more than 70%, and a significant improvement was achieved over the whole α range from 0.1 to 0.9. In the following experiments, the weighting coefficient α was set


Figure 5.4: Speaker recognition error rates based on PDCMN by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs for every 3 test utterances (models were adapted/trained using 100 isolated words)

to 0.7 for the combination method. The experimental results for all 12 areas and the average result are shown in Table 5.4. The relationship between the recognition performance and the distance between the sound source and the microphone is described and discussed in Chapter 6. The proposed combination method improved the speaker recognition performance significantly in all areas. Using the position-dependent CMN, the combination method achieved relative error reduction rates of 75.9% from GMM and 77.7% from HMM. There were three reasons for this: (1) The combination method maximized the joint probability over all possible unit (syllable) sequences and speakers, while the GMM-based technique maximized only over the possible speakers and the HMM-based technique maximized only over the unit (syllable) sequences given the observation sequences for a given speaker. (2) Using the same close-talking data, the speaker-specific GMMs with diagonal matrices were trained by the ML criterion while the speaker-independent HMMs with full-covariance matrices were adapted by MAP, so the combination method may obtain robust output probability density functions. (3) The GMMs cannot express temporal sequences (one state) while the HMMs can express temporal sequences (that is, syllable sequences) dynamically. When the combination method was incorporated with the proposed PDCMN, it achieved a relative error reduction rate of 66.7% from w/o CMN, 91.4% from the conventional CMN, and 28.9% from PICMN.

So far, the weighting coefficient α of the combination method was set to 0.7 for

Table 5.4: Speaker recognition error rate (%) (3 utterances). Models were adapted/trained using 100 words.

     |              GMM                |              HMM                |  Combination method (α = 0.7)
area | no CMN conv.CMN PICMN PDCMN    | no CMN conv.CMN PICMN PDCMN    | no CMN conv.CMN PICMN PDCMN
1    |  2.90    4.57    2.29  1.83    |  1.37    4.73    1.07  1.07    |  0.46    1.07    0.00  0.00
2    |  2.13    3.96    0.91  0.76    |  1.37    4.12    0.91  0.74    |  0.30    1.22    0.30  0.30
3    |  2.74    5.49    1.83  1.22    |  1.83    5.49    1.98  1.37    |  0.46    1.52    0.46  0.15
4    |  6.40    7.16    3.20  2.90    |  3.04   10.21    2.59  2.74    |  1.37    4.27    1.37  0.91
5    |  6.25    8.38    4.12  2.44    |  2.59    7.32    2.13  1.68    |  1.68    3.51    0.61  0.61
6    |  7.16    9.30    3.66  3.51    |  3.51   10.82    3.81  3.81    |  2.13    4.73    0.61  0.61
7    |  9.91    8.38    4.73  2.13    |  4.88   13.11    3.05  3.66    |  1.98    5.18    1.37  0.61
8    |  5.49   10.21    4.12  4.27    |  3.20   12.20    3.20  3.20    |  1.37    5.49    0.91  0.91
9    |  9.60    9.60    4.27  3.05    |  5.79   13.87    3.81  2.74    |  1.83    3.53    0.76  0.46
10   | 12.80    9.30    4.73  4.42    |  7.62   14.18    6.10  5.64    |  3.66    6.40    1.22  1.07
11   | 10.98   14.63    6.55  3.51    | 10.98   23.17    7.32  6.25    |  4.42    8.69    2.13  1.22
12   | 18.90   13.57    8.84  4.27    |  9.30   19.51   14.63  4.12    |  5.18    9.91    6.86  1.37
Ave. |  7.94    8.71    4.10  2.86    |  4.64   11.15    3.53  3.09    |  2.07    4.36    0.97  0.69

Table 5.5: Speaker recognition error rate by optimal α (%)

area   | 1    2    3    4    5    6    7    8    9    10   11   12   | Ave.
α      | 0.7  0.7  0.8  0.6  0.6  0.7  0.7  0.7  0.7  0.6  0.6  0.5  |
result | 0.00 0.30 0.00 0.76 0.46 0.61 0.61 0.91 0.46 0.91 1.07 1.22 | 0.61

Table 5.6: Speaker recognition error rate (%) (3 utterances). Models were adapted/trained using 30/50 words.

training data |          GMM            |          HMM             |   Combination method
(words)       | w/o CMN  PICMN  PDCMN   | w/o CMN  PICMN  PDCMN    | w/o CMN  PICMN  PDCMN
50            |  8.92    4.80   3.33    |  9.50    7.36   6.90     |  4.15    1.87   1.38
30            | 11.85    5.88   4.43    | 14.20   12.58  11.70     |  6.44    3.43   2.63

all areas. A variable weighting coefficient depending on the speaker position was evaluated here. The speaker recognition results of the combination method with the variable weight α are shown in Table 5.5. A relative error reduction rate of 11.6% from that of the fixed weight 0.7 was achieved. The results also show that the greater the distance between the sound source and the microphone, the greater was the contribution of the GMMs (i.e., the smaller the weighting coefficient α). Furthermore, in order to investigate the effect of the size of the adaptation/training data, speaker recognition experiments were also conducted with speaker models adapted/trained using 50 isolated words (about 30 seconds) and 30 isolated words (about 18 seconds). The results of the speaker models adapted/trained using 50 isolated words and 30 isolated words are shown in Table 5.6 and Figures 5.5 and 5.6, respectively. The recognition performance degraded when using the relatively small adaptation/training data. However, the proposed methods were still more effective than the other methods with the small adaptation/training data.

5.3.3 Integrating speaker position estimation with speaker recognition

In this section, speaker position estimation was integrated with our proposed position-dependent CMN. For this experiment, we re-recorded isolated word utterances with the four microphones shown in Fig. 3.1. The same 200 isolated words uttered by twenty male speakers were emitted from a loudspeaker located in the center of areas 1, 5 and 9. The first 100 utterances were used as test samples and the others for parameter estimation. The recording condition was the same as in Section 5.3.1. The speaker position estimation results are shown in Tables 3.1 and 3.2. Almost 100% of the estimated speaker positions could be used to determine the area (60 cm × 60 cm) in which the speaker was located.



Figure 5.5: Speaker recognition error rates based on PDCMN by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs for every 3 test utterances (models were adapted/trained using 50 isolated words)

The connective likelihood of 3 utterances (about 1.8 seconds) was used to identify the speaker. To simulate the 3 utterances as a sentence, the first 300 ms of the first utterance was used to estimate the common speaker position of the 3 utterances. The system adopted the compensation parameter corresponding to the automatically estimated position. The speaker recognition results shown in Table 5.7 are slightly different from those in Table 5.4 because identical recording conditions could not be guaranteed at the two recording times (Jun. 2004 and Dec. 2006); e.g., the loudspeaker volume and the quality of the microphones and recording facilities may have differed. "Ideal PDCMN" means that the compensation parameter for the correct area is adopted (that is, the ideal condition); this method assumes that the position area was accurately estimated. "Realistic PDCMN" means that the compensation parameter for the estimated position area is used (that is, the realistic condition). "Ideal PDCMN" improved the speaker recognition performance remarkably more than PICMN. "Realistic PDCMN" obtained the same recognition performance as "ideal PDCMN" because the estimated position areas were almost always correct, and the errors of position estimation affected the performance little because the compensation parameters of neighboring areas, which approximate the correct one, were used in all cases with area estimation errors.



Figure 5.6: Speaker recognition error rates based on PDCMN by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs for every 3 test utterances (models were adapted/trained using 30 isolated words)

Table 5.7: Comparison of speaker recognition error rate of ideal PDCMN with realistic PDCMN (%)

     |            GMM                      |            HMM                      |        Combination method
area | PICMN  ideal PDCMN  realistic PDCMN | PICMN  ideal PDCMN  realistic PDCMN | PICMN  ideal PDCMN  realistic PDCMN
1    | 2.29   1.07         1.07            | 1.37   1.07         1.07            | 0.15   0.00         0.00
5    | 4.42   2.74         2.74            | 2.44   2.29         2.29            | 0.91   0.30         0.30
9    | 3.96   2.90         2.90            | 5.18   3.81         3.81            | 1.22   0.76         0.76
Ave. | 3.56   2.34         2.34            | 3.00   2.39         2.39            | 0.76   0.35         0.35

5.4 Summary

In a distant environment, speaker recognition performance may degrade drastically because of the mismatch between the training and testing environments. We addressed this problem with a feature-based compensation technique and proposed a robust distant speaker recognition method based on position-dependent CMN. Speaker recognition experiments with twenty male speakers were conducted. The speaker identification results by GMM showed that the proposed position-dependent CMN achieved a relative error reduction rate of 64.0% from w/o CMN and 30.2% from the position-independent CMN. We also integrated the position-dependent CMN into the combined use of speaker-specific GMMs and speaker-adapted syllable-based HMMs. The combination method improved the speaker recognition


performance more than the individual use of either the speaker-specific GMMs or the speaker-adapted syllable-based HMMs. In the combination method, the position-dependent CMN also improved the speaker recognition performance significantly more than the other feature-based methods. Speaker position estimation was also evaluated and then integrated with our proposed position-dependent CMN. Good speaker position estimation performance was achieved, and the same speaker recognition performance as with ideal PDCMN was obtained when speaker position estimation was integrated with our proposed PDCMN. The relationship between the recognition rate and the distance is described and discussed in Chapter 6.

Chapter 6

Distant-talking Speech Recognition

6.1 Introduction

In this chapter, two channel distortion compensation methods proposed in Chapter 4 (that is, position-dependent CMN and the combination of short-term spectrum based CMN and long-term spectrum based CMN) are used to normalize the transfer characteristics for speech recognition.

In a distant environment, the speech signal received by a microphone is affected by the microphone position and the distance from the sound source to the microphone. If an utterance suffers fatal degradation by such effects, the system cannot recognize it correctly. Fortunately, the transmission characteristics from the sound source to every microphone should be different, and the effect of channel distortion for every microphone (which may contain estimation errors) should also be different. Therefore, complementary use of multiple microphones may achieve robust recognition. In this chapter, the maximum vote (that is, the Voting method (VM)) or the maximum summation likelihood (that is, the Maximum-summation-likelihood method (MSLM)) of all channels is used to obtain the final result [75], which is called multiple decoder processing. This should obtain robust performance in a distant environment. However, the computational complexity of multiple decoder processing is K (the number of input streams) times that of a single input. To reduce the computational cost, the output probability of each input is calculated at frame level, and a single decoder using these output probabilities is used to perform speech recognition, which is called single decoder processing.

Even when using multiple channels, each channel obtained from a single microphone is not stable because it does not utilize the spatial information. On


the other hand, beamforming is one of the simplest and most robust means of spatial filtering, which can discriminate between signals based on the physical locations of the signal sources [76]. Therefore, beamforming can not only separate multiple sound sources but also suppress reverberation for the speech source of interest. Many microphone-array-based speech recognition systems have successfully used delay-and-sum processing to improve recognition performance because of its simplicity, and it remains the method of choice for many array-based speech recognition systems [21, 22, 24, 74]. Nevertheless, beams with different properties would be formed depending on the array structure, sensor spacing and sensor quality [77]. Using different sensor arrays, more robust spatial filtering would be obtained in a real environment. In this paper, a delay-and-sum beamforming combined with multiple decoder processing or single decoder processing is proposed. This is called multiple microphone-array processing. Furthermore, the Position-Dependent CMN (PDCMN) is also integrated with the multiple microphone-array processing.

PDCMN or VT-PDCMN (for the sake of convenience, CMN refers to short-term spectrum based CMN, and PDCMN refers to short-term spectrum based PDCMN in this paper) can indeed compensate efficiently for the channel transmission characteristics depending on the speaker position, but cannot normalize the speaker variation because a position-dependent cepstral mean does not contain the individual speaker's characteristics. On the contrary, the conventional CMN can compensate for both the transmission and the speaker variations, but cannot achieve good recognition performance for short utterances because a sufficient phonetic balance cannot be obtained. Both compensations are additive operations in the cepstral domain. Thus, the combination of the position-dependent cepstral mean and the conventional cepstral mean may simultaneously compensate for the channel distortion and the speaker variation effectively. In this paper, the sum of the weights of the position-dependent cepstral mean and the conventional cepstral mean is set to 1 because the transmission characteristics should not be over-normalized. In other words, since both the position-dependent cepstral mean and the conventional cepstral mean contain the channel transmission characteristics, the channel distortion would be normalized twice if each weight were set to 1. Indeed, we also conducted experiments using various weights whose sum was not equal to 1 for the two kinds of cepstral means; the results became worse because the amplitude of the weighted sum of the two cepstral means mismatched the true value.

In this chapter, we propose a robust distant-talking speech recognition method that combines PDCMN/VT-PDCMN with the conventional CMN to address the above problems. The a priori estimated position-dependent cepstral mean is linearly combined with an utterance-wise cepstral mean using the following two combination methods. The first method uses a fixed weighting coefficient over


the whole test data to obtain the combinational CMN, and this is called fixed-weight combinational CMN. However, the optimal weight seems to depend on the speaker position and the length of the utterance to be recognized. Thus, a fixed weighting coefficient does not obtain the optimal result, and a variable weighting coefficient may produce better performance. A single input feature compensated by combinational cepstral means with different weighting coefficients generates multiple input features. Thus, the problem becomes how to obtain the optimal performance for the given multiple input features. Voting on the different hypotheses generated from multiple input features has been studied in [75, 84]. In [85], a new algorithm was proposed to select a suitable channel for speech recognition using the output of the speech recognizer. All the methods discussed above use the output hypotheses generated by multiple decoders to estimate the final result. In Section 6.2.2, we describe the combination of multiple input streams at frame level using a single decoder. Here, we extend this method to the combination of PDCMN/VT-PDCMN and the conventional CMN. The second method for obtaining the combinational CMN involves calculating the output probability of each input feature at frame level, and a single decoder using these output probabilities is used to perform speech recognition. This is called variable-weight combinational CMN and is very easy to implement in both isolated word recognition systems and continuous speech recognition systems.

6.2 Multiple Microphone Processing

The Voting method (VM) and Maximum-summation-likelihood method (MSLM) using multiple decoders (that is, multiple decoder processing) are proposed in Section 6.2.1. To reduce the computational cost of the methods described in Section 6.2.1, a multiple microphone processing using a single decoder (that is, single decoder processing) is proposed in Section 6.2.2. In Section 6.2.3, we combine multiple decoder processing or single decoder processing with the delay-and-sum beamforming.

6.2.1 Multiple decoder processing

In this section, we propose a novel multiple microphone processing using multiple decoders, which is called multiple decoder processing. The procedure of multiple microphone processing using multiple decoders is shown in Fig. 6.1, in which all the results obtained by the different decoders are input to a so-called VM or MSLM decision method to obtain the final result.

6.2.1.1 Voting method

Because of the subtle differences in the features between input streams, different channels may lead to different results for a given utterance. To achieve robust speech recognition with multiple channels, a good method for deciding the final result from the results obtained from these channels is important. The signal received by each channel is recognized independently, and the system votes for a word according to the recognition result. The word which obtains the maximum number of votes is then selected as the final recognition result; this is called the Voting method (VM). The Voting method is defined as:

\hat{W} = \arg\max_{W_R} \sum_{i=1}^{\#channel} I(W_i, W_R),     (6.1)

I(W_i, W_R) = \begin{cases} 1 & \text{if } W_i = W_R \\ 0 & \text{otherwise} \end{cases},     (6.2)

where W_i is the recognition result of the i-th channel, and I(W_i, W_R) denotes an indicator function. If two or more results obtain the same number of votes, the result of the microphone which is nearest to the sound source is selected as the final result. In our proposed position-dependent CMN method, the speaker position is estimated a priori, so it is possible to calculate the distance from the microphone to the speaker.

6.2.1.2 Maximum-summation-likelihood method

The likelihood of each microphone can be seen as a potential confidence score, so combining the likelihoods of all channels should yield a robust recognition result. In this chapter, the maximum summation likelihood is defined as:

\hat{W} = \arg\max_{W_R} \sum_{i=1}^{\#channel} L_{W_R}(i),     (6.3)

where L_{W_R}(i) indicates the log likelihood of W_R obtained from the i-th channel. We call this the Maximum-summation-likelihood method (MSLM). In other words, since log likelihoods are summed, it is a maximum product rule of probabilities.
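A minimal sketch of the two decision rules of Equations (6.1)–(6.3): voting over per-channel hypotheses, with ties broken by the channel nearest to the sound source, and summing per-channel log likelihoods. The input data structures and the example values are assumptions for illustration only.

```python
from collections import Counter

def voting_method(channel_results, channel_distances):
    """Voting method (Eqs. (6.1)-(6.2)): pick the word with the most votes;
    ties are broken by the channel nearest to the sound source."""
    votes = Counter(channel_results)
    best_count = max(votes.values())
    tied = [w for w, c in votes.items() if c == best_count]
    if len(tied) == 1:
        return tied[0]
    # tie: take the result of the nearest channel among the tied hypotheses
    nearest = min((d, i) for i, d in enumerate(channel_distances)
                  if channel_results[i] in tied)[1]
    return channel_results[nearest]

def mslm(channel_loglikes):
    """Maximum-summation-likelihood method (Eq. (6.3)):
    channel_loglikes[i] maps each candidate word to its log likelihood on
    channel i; the word with the largest summed log likelihood wins."""
    words = channel_loglikes[0].keys()
    return max(words, key=lambda w: sum(ll[w] for ll in channel_loglikes))

# Example with three channels and two candidate words.
results = ["asahi", "asahi", "osaka"]
dists = [1.2, 0.8, 1.5]
word_vm = voting_method(results, dists)
word_mslm = mslm([{"asahi": -310.2, "osaka": -315.0},
                  {"asahi": -298.7, "osaka": -305.3},
                  {"asahi": -320.4, "osaka": -318.9}])
```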

6.2.2 Single decoder processing

The multiple microphone processing using multiple decoders may be more robust than a single channel. However, the computational complexity of multiple microphone processing using multiple decoders is K (the number of input channels) times that of a single input. To reduce the computational cost, instead of

Figure 6.1: Illustration of multiple microphone processing using multiple decoders (utterance level)

Figure 6.2: Illustration of multiple microphone processing using a single decoder (frame level)

obtaining multiple hypotheses or likelihoods at the utterance level using multiple decoders, the output probability of each input is calculated at frame level, and a single decoder using these output probabilities is used to perform speech recognition. We call this method single decoder processing, and Fig. 6.2 shows its processing procedure. In a multiple decoder method, a conventional Viterbi algorithm [94] is used in each decoder, and the probability α(t, j, k) of the most likely state sequence at time t which has generated the observation sequence O_k(1) · · · O_k(t) (until time t) of the k-th input (1 ≤ k ≤ K) and ends in state j is defined by

\alpha(t, j, k) = \max_{1 \le i \le S} \left\{\alpha(t-1, i, k) \, a_{ij} \sum_{m} \lambda_{mj} b_{mj}(O_k(t))\right\},     (6.4)

where a_{ij} = P(s_t = j | s_{t-1} = i) is the transition probability from state i to state j, 1 ≤ i, j ≤ S, 2 ≤ t ≤ T; b_{mj}(O_k(t)) is the output probability of the m-th Gaussian mixture component (1 ≤ m ≤ M) for the observation O_k(t) at state j; and λ_{mj} is the


mixture weights. In the multiple decoder method shown in Fig. 6.1, the Viterbi algorithm is performed by each decoder independently, so K (the number of input streams) times the computational complexity is required. Thus, both the calculation of the output probabilities and the rest of the processing cost, such as finding the best path (state sequence), are K times those of a single input. In order to use a single decoder for multiple inputs as shown in Fig. 6.2, we modify the Viterbi algorithm as follows:

\alpha(t, j) = \max_{1 \le i \le S} \left\{\alpha(t-1, i) \, a_{ij} \max_{k} \sum_{m} \lambda_{mj} b_{mj}(O_k(t))\right\}.     (6.5)

In Equation (6.5), the maximum output probability over all K inputs at time t and state j is used, so only one best state sequence is obtained for all K inputs using the maximum output probability of all K inputs. This means that only an extra K − 1 times the calculation of the output probability is required compared to that of a single input. Here, we investigate a further reduction of the computational cost. We assume that the output probabilities of the K features at time t from each Gaussian component are similar to each other. Hence, if the maximum output probability of the 1st input is obtained from the m̂-th component among those in state j, it is highly likely that the maximum output probability of the k-th input will also be obtained from the m̂-th component. Thus, we modify Equation (6.5) as follows:

\alpha(t, j) = \max_{1 \le i \le S} \left\{\alpha(t-1, i) \, a_{ij} \max_{k} b_{\hat{m}j}(O_k(t))\right\}, \qquad \hat{m} = \arg\max_{m} \lambda_{mj} b_{mj}(O_1(t)).     (6.6)

In Equation (6.6), only an extra (M + K − 1)/M − 1 = (K − 1)/M times the calculation of the output probability is required compared to that of a single input. The methods defined by Equations (6.5) and (6.6) both involve multiple microphone processing using the single decoder shown in Fig. 6.2. To distinguish these two methods, the method given by Equation (6.5) is called the full-mixture single decoder method, while the method given by Equation (6.6) is called the single-mixture single decoder method.
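The sketch below contrasts the frame-level state output probability computations of Equations (6.5) and (6.6) for K parallel inputs, using log probabilities and diagonal-covariance mixtures. It is an illustrative re-implementation under simplifying assumptions (diagonal covariances, NumPy arrays), not the thesis decoder itself.

```python
import numpy as np

def gauss_log_pdf(x, mean, var):
    """Log of a diagonal-covariance Gaussian density."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def state_log_output_full(obs_k, weights, means, variances):
    """Full-mixture rule (Eq. (6.5)): evaluate the complete mixture for each
    of the K inputs and keep the maximum over k."""
    scores = []
    for x in obs_k:
        comp = [np.log(w) + gauss_log_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        scores.append(np.logaddexp.reduce(comp))   # log sum_m w_m b_m(x)
    return max(scores)

def state_log_output_single(obs_k, weights, means, variances):
    """Single-mixture rule (Eq. (6.6)): pick the dominant component m_hat
    using the first input only, then evaluate only that component for all
    inputs and keep the maximum over k."""
    first = [np.log(w) + gauss_log_pdf(obs_k[0], m, v)
             for w, m, v in zip(weights, means, variances)]
    m_hat = int(np.argmax(first))
    return max(gauss_log_pdf(x, means[m_hat], variances[m_hat]) for x in obs_k)

# Hypothetical 4-component state, 10-dimensional features, K = 5 inputs.
rng = np.random.default_rng(0)
w = np.full(4, 0.25)
mu = rng.standard_normal((4, 10))
var = np.ones((4, 10))
obs = list(rng.standard_normal((5, 10)))
b_full = state_log_output_full(obs, w, mu, var)
b_single = state_log_output_single(obs, w, mu, var)
```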

6.2.3 Multiple microphone-array processing

Many microphone array-based speech recognition systems have successfully used delay-and-sum processing to improve recognition performance because of its spatial filtering ability and simplicity, so it remains the method of choice for many array-based speech recognition systems [22, 23, 76]. Beamforming can suppress reverberation for the speech source of interest. Beams with different properties would be formed by the array structure, sensor spacing and sensor


quality [77]. As described in Sections 6.2.1 and 6.2.2, the multiple microphone-array processing using multiple decoders or a single decoder should obtain more robust performance than a single channel or a single microphone array, because microphone-array processing alone may still contain estimation errors. We integrate a set of delay-and-sum beamformers with the multiple or single decoder processing. In this chapter, the 4 T-shaped microphones are set as shown in Fig. 3.1. Array 1 (microphones 1, 2, 3), array 2 (microphones 1, 2, 4), array 3 (microphones 1, 3, 4), array 4 (microphones 2, 3, 4) and array 5 (microphones 1, 2, 3, 4) are used as individual arrays, and thus we can obtain 5 input streams using delay-and-sum beamforming. These streams are used as inputs of the multiple or single decoder processing to obtain the final result, as sketched below. We call this method multiple microphone-array processing. These streams can also be compensated by the proposed position-dependent CMN, etc., before they are input into the multiple decoder processing or single decoder processing.
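A minimal sketch of delay-and-sum beamforming for the five sub-arrays, assuming that the steering delays (in whole samples) toward the estimated speaker position are already known. It is an illustrative integer-delay version under those assumptions, not the exact array processing used in the experiments; the microphone indexing and function names are illustrative.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Delay-and-sum beamformer: shift each channel by its steering delay
    (in samples) and average.  signals: list of 1-D arrays, delays: ints >= 0."""
    max_delay = max(delays)
    length = min(len(s) for s in signals) - max_delay
    aligned = [s[d:d + length] for s, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)

# Five sub-arrays built from the four T-shaped microphones (indices 0-3),
# following the grouping described in the text.
SUB_ARRAYS = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 2, 3)]

def multi_array_streams(mic_signals, mic_delays):
    """Return the 5 beamformed input streams used by the multiple or single
    decoder processing (mic_delays[i] is the steering delay of microphone i)."""
    return [delay_and_sum([mic_signals[i] for i in idx],
                          [mic_delays[i] for i in idx]) for idx in SUB_ARRAYS]
```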

6.3 Combination of PDCMN/VT-PDCMN and Conventional CMN

6.3.1 Fixed-weight combinational CMN

To compensate for the channel distortion and speaker characteristics simultaneously, a short-term or variable-term spectrum based position-dependent cepstral mean is combined linearly with the conventional cepstral mean. The new compensation parameter ΔC for the combinational CMN is defined by:

\Delta C = \lambda(\bar{C}_{position} - \bar{C}_{train}) + (1 - \lambda)(\bar{C}_{tx} - \bar{C}_{train}) = \lambda \bar{C}_{position} + (1 - \lambda)\bar{C}_{tx} - \bar{C}_{train},     (6.7)

where λ denotes a weighting coefficient. In this section, C̄_position and C̄_train are estimated by averaging short-term cepstra for PDCMN, as given by Eq. (4.9), and variable-term cepstra for VT-PDCMN, as given by Eqs. (4.14) and (4.15). When a fixed λ is used for the entire test data, the method is called fixed-weight combinational CMN.
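A minimal sketch of the fixed-weight combinational CMN of Equation (6.7): the compensation parameter mixes the a priori position-dependent cepstral mean with the utterance-wise cepstral mean. Variable names mirror the symbols in the equation; the data shapes and placeholder values are assumptions.

```python
import numpy as np

def combinational_cmn(cepstra, c_position, c_train, lam=0.5):
    """Fixed-weight combinational CMN (Eq. (6.7)):
    Delta_C = lam * C_position + (1 - lam) * C_tx - C_train,
    where C_tx is the cepstral mean of the utterance itself."""
    c_tx = cepstra.mean(axis=0)                       # utterance-wise mean
    delta_c = lam * c_position + (1.0 - lam) * c_tx - c_train
    return cepstra - delta_c[None, :]

# Example with hypothetical 10-dimensional MFCC-like features (80 frames).
cep = np.random.randn(80, 10)
c_pos = np.random.randn(10) * 0.1
c_trn = np.zeros(10)
compensated = combinational_cmn(cep, c_pos, c_trn, lam=0.5)
```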

6.3.2 Variable-weight combinational CMN

In Section 6.3.1, a fixed weighting coefficient λ is used to combine PDCMN/VT-PDCMN with the conventional CMN. However, the effect of the channel distortion (that is, the position-dependent cepstral mean) depends on the speaker position, and the confidence of the estimated speaker characteristics (that is, the conventional cepstral mean) depends on the length of the utterance. Therefore, the weighting coefficient λ should be adjusted according to the speaker position and the length of the utterance. A single input feature compensated by combinational cepstral means


with different weighting coefficients generates multiple input features. Thus, the problem becomes how to obtain the optimal performance given the multiple input features. Given a set of variable weights {λ_k}, an automatic decision algorithm for the optimal weighting coefficient λ is required. We extend the method proposed in Section 6.2.2 and modify this algorithm into the so-called variable-weight combinational CMN. Indeed, the proposed variable-weight combinational CMN can automatically select the optimal weighting coefficient at frame level from within the range of the given weighting coefficients. For multiple inputs, a conventional Viterbi algorithm [94] is used for each input stream k. The probability α(t, j, k) of the most likely state sequence at time t which has generated the observation sequence O_k(1) · · · O_k(t) (until time t) of the k-th input (1 ≤ k ≤ K) and ends in state j is defined by:

\alpha(t, j, k) = \max_{1 \le i \le S} \left\{\alpha(t-1, i, k) \, a_{ij} \, b_{j}(O_k(t))\right\},     (6.8)
O_k(t) = \tilde{C}_t - (\lambda_k \bar{C}_{position} + (1 - \lambda_k)\bar{C}_{tx} - \bar{C}_{train}),

where a_{ij} = P(s_t = j | s_{t-1} = i) is the transition probability from state i to state j, 1 ≤ i, j ≤ S, 2 ≤ t ≤ T; b_j(O_k(t)) is the output probability of the observation O_k(t) at state j; and λ_k is the k-th weighting coefficient. In order to use a single decoder for the multiple inputs, we modify Eq. (6.8) as in Eq. (6.5) of Section 6.2.2. In Eq. (6.5), the maximum output probability over all K inputs at time t and state j is used, so only one best state sequence is obtained for all K inputs using the maximum output probability of all K inputs. This means that only an extra K − 1 times the calculation of the output probability is required compared to that of a single input. Furthermore, the derivatives of the K input cepstra (Δcepstrum) compensated by the different combinational cepstral means have the same values. Thus, the calculation depending only on the derivatives can be shared by the input streams.
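The sketch below illustrates the variable-weight combinational CMN: K feature streams are generated with different weights λ_k as in Equation (6.8), and at each frame and state the single decoder keeps the maximum state output probability over the K streams as in Eq. (6.5). The single-Gaussian state model used here is a placeholder assumption, as are the example weights and shapes.

```python
import numpy as np

def make_variable_weight_streams(cepstra, c_position, c_train, lambdas):
    """Generate K input streams
    O_k(t) = C_t - (lam_k*C_position + (1 - lam_k)*C_tx - C_train)
    as in Eq. (6.8), one per weighting coefficient lam_k."""
    c_tx = cepstra.mean(axis=0)
    return [cepstra - (lam * c_position + (1.0 - lam) * c_tx - c_train)[None, :]
            for lam in lambdas]

def frame_state_score(streams, t, state_mean, state_var):
    """Frame-level combination (Eq. (6.5) applied to the K streams): keep the
    maximum log output probability over the K compensated features at frame t,
    here for a single-Gaussian state used as a placeholder."""
    def log_gauss(x):
        return -0.5 * np.sum(np.log(2.0 * np.pi * state_var)
                             + (x - state_mean) ** 2 / state_var)
    return max(log_gauss(s[t]) for s in streams)

# Weights used in the experiments for a single microphone: 0.6, 0.7, 0.8.
cep = np.random.randn(80, 10)
streams = make_variable_weight_streams(cep, np.random.randn(10) * 0.1,
                                        np.zeros(10), [0.6, 0.7, 0.8])
score = frame_state_score(streams, t=0, state_mean=np.zeros(10),
                          state_var=np.ones(10))
```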

6.4 Experiments of Distant-talking Speech Recognition

6.4.1 Experimental Setup

We performed the experiment in the room shown in Fig. 3.2 measuring 3.45 m × 3 m × 2.6 m without additive noise. The room was divided into the 12 (3 × 4) rectangular areas shown in Fig. 3.3, where the area size is 60 cm × 60 cm. We measured the transmission characteristics (that is, the mean cepstrums of utterances recorded a priori) from the center of each area. In our experiments, the room was set up as the seminar room shown in Fig. 3.2 with a whiteboard beside the left wall, one table and some chairs in the center of the room, one TV and some other tables etc.


In our method, the estimated speaker position should be used to determine the area (60 cm × 60 cm) in which the speaker is located. In Chapter 3, we also revealed that the speaker position could be estimated with errors of 10–15 cm by the 4 T-shaped microphone system shown in Fig. 3.1. In the present study, therefore, we assumed that the position area was accurately estimated, and we evaluated only our proposed speech recognition methods.

The Tohoku University and Panasonic isolated spoken word database was used as the training and test set. Twenty male speakers uttered 200 isolated words, each with a close-talking microphone. The average duration of all utterances was about 0.6 seconds. For the utterances of each speaker, the first 100 words were used as test data and the rest, together with the other speakers' utterances, for the estimation of the cepstral mean C̄_position in Equations (4.9) and (4.12). All the utterances were emitted from a loudspeaker located in the center of each area and recorded for testing and for the estimation of C̄_position, to simulate utterances spoken at various positions.

The sampling frequency was 12 kHz. The frame length was 21.3 ms (256 points) for the short-term cepstrum and 37.3 ms (448 points) for the long-term cepstrum. To compensate for the effect of long reverberation, a longer analysis window seems more effective. However, there is a tradeoff between temporal resolution and frequency resolution: the longer the analysis window, the worse the temporal resolution. Furthermore, a speech segment (even a vowel, etc.) is no longer a stationary signal if the analysis window is too long, and the recognition performance would become worse in that case. In this paper, the length of the long-term analysis window was determined empirically. The 448-point window obtained the best performance; its result was almost the same as that of a 512-point window and was significantly better than those of the other window lengths. A frame shift of 8 ms (96 points) was used for both the short-term and long-term cepstra.

Then, 116 Japanese speaker-independent syllable-based HMMs (strictly speaking, mora-unit HMMs [96]) were trained using 27992 utterances read by 175 male speakers (JNAS corpus). Each continuous-density HMM had 5 states, 4 with pdfs of output probability. Each pdf consisted of 4 Gaussians with full-covariance matrices. The feature space comprised 10 MFCCs. First- and second-order derivatives of the cepstra plus first and second derivatives of the power component were also included.

When using the variable-weight combinational CMN, the optimal weighting coefficient was not determined empirically over the entire test data or development data as in the fixed-weight combinational CMN, but was automatically selected at frame level from within the range of the given weighting coefficients. In this paper, the number of weighting coefficients K was set to 3. For the single microphone, λ_1, λ_2 and λ_3 were set to 0.6, 0.7 and 0.8, respectively. For the microphone array, λ_1, λ_2 and λ_3 were set to 0.4, 0.5 and 0.6, respectively.


Table 6.1: Recognition results for speech emitted by a loudspeaker (average of results obtained by 4 independent microphones: %)

area | w/o CMN | Conv. CMN | CM of area 5 | PICMN | PDCMN
1    | 86.3    | 92.8      | 94.2         | 95.0  | 95.7
2    | 95.4    | 95.7      | 97.4         | 97.7  | 97.4
3    | 94.3    | 94.6      | 96.8         | 97.1  | 96.8
4    | 87.4    | 92.9      | 93.1         | 94.6  | 95.6
5    | 92.1    | 93.8      | 96.3         | 96.0  | 96.3
6    | 90.9    | 93.2      | 95.2         | 96.1  | 95.9
7    | 89.0    | 92.4      | 94.3         | 94.8  | 95.7
8    | 91.4    | 91.4      | 93.8         | 94.1  | 94.7
9    | 92.3    | 93.1      | 95.9         | 96.4  | 96.0
10   | 84.9    | 90.0      | 90.5         | 91.8  | 93.5
11   | 86.9    | 90.9      | 91.7         | 93.2  | 94.1
12   | 85.9    | 89.8      | 90.9         | 93.3  | 93.3
Ave. | 89.7    | 92.5      | 94.2         | 94.9  | 95.4

6.4.2 Recognition experiment by single microphone

6.4.2.1 Recognition experiment for speech emitted by a loudspeaker

We conducted a speech recognition experiment on isolated words emitted by a loudspeaker using a single microphone in a distant environment. The recognition results are shown in Table 6.1. The speech recognition performance for clean isolated words was 96.0%. The proposed method is referred to as PDCMN (Position-Dependent CMN). Table 6.1 indicates the average results obtained by the 4 independent microphones shown in Fig. 3.1. In Table 6.1, PDCMN is compared with the baseline (recognition without CMN), the conventional CMN, "CM of area 5" and PICMN (Position-Independent CMN). Area 5 is at the center of all 12 areas, and "CM of area 5" means that a fixed Cepstral Mean (CM) of the central area was used to compensate the input features of all 12 areas. PICMN denotes the method in which the compensation parameters averaged over the 12 areas were used. Without CMN, the recognition rate degraded drastically with the distance between the sound source and the microphone. The conventional CMN could not obtain enough improvement because the average duration of all utterances was too short (about 0.6 seconds). By compensating the transmission characteristics using the compensation parameters measured a priori, "CM of area 5", PICMN and PDCMN all effectively improved the speech recognition performance over w/o CMN and the conventional CMN.


In a distant-talking environment, the reflections may be very strong and may vary greatly depending on the given areas, so the difference in transmission characteristics among the areas should be very large. In other words, obstacles cause complex reflection patterns depending on the speaker position. The proposed PDCMN thus achieved a more effective improvement than "CM of area 5" and PICMN. The PDCMN achieved relative error reduction rates of 55.3% from w/o CMN, 38.7% from the conventional CMN, 20.7% from "CM of area 5" and 9.8% from PICMN, respectively. The experimental results also show that the greater the distance between the sound source and the microphone, the greater was the improvement.

The differences in performance between PDCMN and PICMN / "CM of area 5" were significant, but not very large. If a larger area were assumed, the performance difference should be much larger. We therefore assume the extended area described in Fig. 6.3, in which area 12 of the original area corresponds to the center of the extended area. We used the "CM of area 12" to compensate the utterances emitted from area 1 to simulate the extended area. The result degraded from 94.2% ("CM of area 5") to 92.9%, which is much inferior to that of PDCMN (95.4%). These results indicate that the proposed method works much better in the larger area. The degradation implies a larger variation of the transmission characteristics, and this variation must cause the degradation of the performance of PICMN.

6.4.3 Recognition Experiment of Speech Uttered by Humans

We also conducted experiments with real utterances spoken by humans using a single microphone (that is, microphone 1 in Fig. 3.1 in this case). The utterances were directly spoken by 5 male speakers instead of being emitted from a loudspeaker as in the first experiment. The experimental results are shown in Table 6.2, in which "CMN by human utterances" means the result of CMN with the cepstral means of real utterances recorded along with the test set (i.e., the ideal case). "CMN by utterances from a loudspeaker" means the result of CMN with the cepstral means of utterances emitted by a loudspeaker. The "proposed method" is the result of the proposed CMN given by Equation (4.12), which compensated for the mismatch between human (real) and loudspeaker (simulated) utterances. In the cases of "CMN by human utterances" and the "proposed method", we estimated the compensation parameters for a certain speaker from the utterances of the other 4 persons. We also conducted recognition experiments without CMN and with the conventional CMN. For the "proposed method", the results of Table 6.2 were worse than those of Table 6.1 because only a few utterances (100 utterances × 4 persons = 400 utterances) were used to estimate the position-dependent cepstral mean (in Table 6.1, 2000 utterances, i.e., 100 utterances × 20 persons, were used to estimate the compensation parameters), and the recording environment, the acquisition equipment and the speaker


Figure 6.3: Extended area

of the clean speech were different. Since the utterances were too short (about 0.6 s) to estimate an accurate cepstral mean, the conventional CMN was not robust in this case. In Table 6.1, the utterances were emitted by a loudspeaker whose distortion is relatively large. Hence, the gain from compensating these transmission characteristics is greater than the loss caused by the inaccurate cepstral


Table 6.2: Recognition results of human utterances (results obtained by microphone 1 shown in Fig. 3.1: %)

area | w/o CMN | Conv. CMN | CMN by human utterances | CMN by utterances from a loudspeaker | proposed method
5    | 95.8    | 94.6      | 96.6                    | 96.0                                 | 96.8
9    | 93.4    | 90.6      | 94.4                    | 91.2                                 | 94.2
10   | 84.8    | 83.8      | 89.8                    | 83.0                                 | 90.0
Ave. | 91.3    | 89.7      | 93.6                    | 90.1                                 | 93.7

Table 6.3: Comparison of recognition accuracy of a single microphone with multiple microphones using multiple decoders (%)

          | single     |                        multiple microphones
          | microphone | VM    | MSLM  |      beamforming (array)         | VM +        | MSLM +
          |            |       |       | 1     2     3     4     5        | beamforming | beamforming
w/o CMN   | 89.7       | 91.3  | 91.6  | 90.9  91.3  91.3  91.0  91.4     | 91.9        | 91.9
Conv. CMN | 92.5       | 94.2  | 94.5  | 93.5  93.3  93.3  93.4  93.6     | 94.1        | 94.2
PICMN     | 94.9       | 96.0  | 96.2  | 95.7  96.1  95.8  96.0  96.1     | 96.3        | 96.4
PDCMN     | 95.4       | 96.4  | 96.6  | 96.2  96.3  96.2  96.1  96.4     | 96.7        | 96.8

Table 6.4: Comparison of recognition accuracy of multiple microphone-array processing using a single decoder with that using multiple decoders (%)

                              | multiple decoders (see Table 6.3)       | single decoder
                              | VM + beamforming | MSLM + beamforming   | full-mixture + beamforming | single-mixture + beamforming
recognition rate  w/o CMN     | 91.9             | 91.9                 | 93.0                       | 92.0
                  Conv. CMN   | 94.1             | 94.2                 | 93.9                       | 92.9
                  PICMN       | 96.3             | 96.4                 | 96.5                       | 96.1
                  PDCMN       | 96.7             | 96.8                 | 96.9                       | 96.6
computation ratio             | 5                | 5                    | 3.58                       | 1.77

mean estimated by short utterances. The conventional CMN worked better than without CMN. On the contrary, in Table 6.2, the utterances were spoken by humans, so the transmission characteristics were much smaller than those in Table 6.1. Then the degradation caused by the inaccurately estimated cepstral mean became dominant, and the conventional CMN worked even worse than without CMN. The results show that the proposed method could approximate the CMN with the human cepstral mean and was better than the CMN with the loudspeaker cepstral mean.

6.4.4 Experimental Results for Multiple Microphone Speech Processing

The experiments in Section 6.4.3 showed that the proposed method given by Equation (4.12) could well compensate for the mismatch between voices from humans and those from the loudspeaker. For convenience, we used utterances emitted from a loudspeaker to evaluate the multiple microphone speech processing methods. The recognition results of a single microphone and of multiple microphones are compared in Table 6.3. The multiple microphone processing methods described in Section 6.2.1, which use multiple decoders, were conducted. Both the Voting method (VM) and the Maximum-summation-likelihood method (MSLM) are more robust than single microphone processing. The MSLM achieved a relative error reduction rate of 21.6% over single microphone processing. The VM and MSLM achieved results similar to those of the conventional delay-and-sum beamforming. By combining the MSLM with beamforming based on the position-dependent CMN, an 11.1% relative error reduction rate was achieved over beamforming based on the position-dependent CMN, and a 50% relative error reduction rate was achieved over beamforming with the conventional CMN (that is, a conventional method). The MSLM proved more robust than the VM in almost all cases because the summation of the likelihoods can be seen as the combined confidence of all channels.

The proposed PDCMN achieved a more effective improvement than PICMN when using multiple microphones. In the case of the MSLM combined with beamforming, PDCMN achieved a relative error reduction rate of 11.1% over PICMN. Both PDCMN and PICMN improved the speech recognition performance significantly more than w/o CMN and the conventional CMN. PICMN does not need to estimate the speaker position; therefore, PICMN may also be a good choice because it simplifies the system implementation.

As described in Section 6.2.2, the computational cost of multiple microphone processing using multiple decoders given by Equation (6.8) is 5 (the number of microphone arrays) times that of a single channel. Experiments were also conducted on the full-mixture single decoder processing given by Equation (6.5) and the single-mixture single decoder processing given by Equation (6.6). The computational costs of the full-mixture single decoder processing and the single-mixture single decoder processing are 3.58 times and 1.77 times that of a single channel, respectively. The recognition results of the multiple microphone-array processing using the multiple decoders and the single decoder are shown in Table 6.4. Since the multiple microphone-array processing using the full-mixture single decoder selects the maximum likelihood of each input sequence at every frame, it achieved a slightly greater improvement than the multiple microphone-array processing using the multiple decoders. The multiple microphone-array processing using the single-mixture single decoder reduced the computational cost by about 50% compared with

6.4. EXPERIMENTS OF DISTANT-TALKING SPEECH RECOGNITION

77

Table 6.5: The individual speech recognition results for short-term and long-term cepstra. Cepstral means were estimated from 100 isolated words for each speaker. (single microphone: %)

            short-term spectrum based CMN    long-term spectrum based CMN
Area        short-term cepstrum              short-term cepstrum    long-term cepstrum
Area 10     95.0                             94.3                   94.1
Area 11     95.1                             94.8                   94.4
Area 12     95.2                             93.6                   94.0
Ave.        95.1                             94.2                   94.2

Table 6.6: Speech recognition results for the combination of short-term and long-term spectrum based CMN. Cepstral means were estimated from 1 word, 10 words and 100 words for each speaker. (single microphone: %)

        1 word                  10 words                100 words
Area    Conv. CMN   proposed    Conv. CMN   proposed    Conv. CMN   proposed
10      90.4        88.6        94.4        94.8        95.0        95.7
11      90.9        88.5        94.2        94.9        95.1        96.1
12      89.9        88.7        94.8        94.7        95.2        95.6
Ave.    90.4        88.6        94.5        94.8        95.1        95.8

The multiple microphone-array processing using the single-mixture single decoder reduced the computational cost by about 50% compared with that using the full-mixture single decoder. In theory, the reduction in computational complexity of the single-mixture single decoder processing relative to the multiple microphone processing using multiple decoders is determined by the number of inputs K and the number of Gaussian mixtures M, as described in Section 6.2.2. The larger the number of Gaussian mixtures, the greater the reduction in computational cost. In our experiments, the number of Gaussian mixtures was 4. Comparing the results in Tables 6.3 and 6.4, the delay-and-sum beamforming using the single-mixture single decoder based on the position-dependent CMN achieved a 3.0% improvement (46.9% relative error reduction rate) over the delay-and-sum beamforming based on the conventional CMN at 1.77 times the computational cost.
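To make the channel-combination rules of Section 6.4.4 concrete, the sketch below illustrates the voting method, the maximum-summation-likelihood method, and the frame-level fusion used by the single-decoder processing. The array shapes and function names are assumptions made for this illustration; the exact combination rules are those of Equations (6.5)-(6.8), which are not reproduced here.

```python
import numpy as np
from collections import Counter

def voting_method(best_words):
    """VM: each channel's decoder outputs its best word; the majority vote wins."""
    return Counter(best_words).most_common(1)[0][0]

def max_summation_likelihood(channel_loglikes):
    """MSLM: channel_loglikes is a (K, V) array holding each channel's log
    likelihood for every vocabulary word; sum over channels and pick the best."""
    return int(np.argmax(channel_loglikes.sum(axis=0)))

def full_mixture_fusion(frame_logprobs):
    """Single-decoder fusion: frame_logprobs is a (K, T, S) array of per-channel,
    per-frame log output probabilities of the S HMM states; at every frame the
    maximum over channels is passed to a single decoder."""
    return frame_logprobs.max(axis=0)
```

Note that voting_method and max_summation_likelihood combine whole-utterance decoder outputs, whereas full_mixture_fusion combines per-frame output probabilities before a single decoding pass.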

6.4.5 Preliminary experimental results based on the combination of short-term and long-term spectrum based CMN

We conducted a preliminary speech recognition experiment using a single microphone and the combination of short-term and long-term spectrum based CMN proposed in Section 4.4.1. Utterances emitted by a loudspeaker located in areas 10, 11 and 12, as shown in Fig. 3.3, were used as the test data.


The individual results for the single microphone based on the short-term cepstrum and the long-term cepstrum are compared in Table 6.5. Cepstral means were estimated from 100 isolated words for each speaker, and then CMN was performed. The results based on the long-term cepstrum were worse than those based on the short-term cepstrum because numerous speech segments of the test data were not static signals and could not be analyzed properly by the long-term window (≈ 37.3 ms). For static signals analyzed by the long-term window, both short-term HMMs and long-term HMMs were used as acoustic models. Since a considerable number of the speech segments of the training data were not static and could not be analyzed by the long-term window, the parameters of the long-term HMMs could not be estimated accurately. Thus, the results based on the long-term HMMs were slightly worse than those based on the short-term HMMs. In the remainder of this thesis, the same short-term syllable-based HMMs were used as acoustic models for both the short-term spectrum based CMN and the long-term spectrum based CMN.

The results of combining short-term and long-term spectrum based CMN are shown in Table 6.6. Cepstral means were estimated from 1 word, 10 words and 100 words for each speaker. The 30% of speech segments with the smallest cepstral distances were identified as static speech segments; this proportion was determined empirically. For CMN with 1 word, since the average duration of the static speech signal of each utterance was too short (about 0.6 seconds × 30% = 0.18 seconds), accurate cepstral means could not be estimated for the static speech signal (long-term spectrum). Thus, the combination of short-term and long-term spectrum based CMN did not improve recognition performance for short utterances. For CMN with 10 words or 100 words, the proposed combination method effectively improved the recognition performance. The experimental results also show that the longer the speech data used to estimate the cepstral mean, the greater the improvement. The proposed combination of short-term and long-term spectrum based CMN using 100 words for cepstral mean estimation achieved a 14.3% relative error reduction rate over the conventional short-term spectrum based CMN.
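As a rough illustration of this combination, the sketch below classifies analysis frames as static or non-static and normalizes each group by its own cepstral mean. The use of the short-term/long-term cepstral difference as the "static" criterion and the array layout are assumptions made for this sketch; the exact segment definition is that of Section 4.4.1, which is not reproduced here.

```python
import numpy as np

def combined_cmn(short_cep, long_cep, static_ratio=0.3):
    """Sketch of variable-term CMN under the assumptions stated above.

    short_cep: (T, D) cepstra from the short-term analysis window
    long_cep:  (T, D) cepstra from the long-term analysis window,
               assumed time-aligned with short_cep
    static_ratio: fraction of frames treated as static (0.3 in the text)
    """
    # Small short-term/long-term difference -> locally stationary frame.
    dist = np.linalg.norm(short_cep - long_cep, axis=1)
    is_static = dist <= np.quantile(dist, static_ratio)

    out = np.empty_like(short_cep)
    for mask, cep in ((is_static, long_cep), (~is_static, short_cep)):
        if np.any(mask):
            # Normalize each group by the cepstral mean of that group.
            out[mask] = cep[mask] - cep[mask].mean(axis=0)
    return out
```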

6.4.6 Experimental results based on the combination of PDCMN/VT-PDCMN and conventional CMN

The variable-term spectrum based CMN (that is, the combination of short-term and long-term spectrum based CMN) improved the recognition rate when the length of the utterance to be recognized was long enough. However, the need for a long utterance precludes real-time processing in speech recognition. Furthermore, the variable-term spectrum based CMN degraded the recognition rate when the utterance to be recognized was too short. In this section, we conducted experiments based on a combination of the environmentally robust, real-time PDCMN/VT-PDCMN and the conventional CMN.

Table 6.7: Speech recognition results for the combination of PDCMN/VT-PDCMN and conventional CMN (%)

Method                                    Single microphone    Microphone array
W/O CMN                                   90.1                 91.4
Conv. CMN                                 92.9                 93.6
PDCMN                                     95.8                 96.4
VT-PDCMN                                  96.2                 96.7
PDCMN + Conv. CMN (fixed-weight)          96.3                 96.9
VT-PDCMN + Conv. CMN (fixed-weight)       96.6                 97.1
VT-PDCMN + Conv. CMN (variable-weight)    97.0                 97.5

Both the single microphone and the T-shaped 4-microphone array were used. Short-term cepstral means were estimated from one isolated word (about 0.6 seconds) for the conventional CMN. The average results over all 12 areas based on the combination of PDCMN/VT-PDCMN and the conventional CMN for both the single microphone and the microphone array are summarized in Table 6.7. The detailed experimental results for every area are shown in Table 6.8 for the single microphone and Table 6.9 for the microphone array. By compensating for the transmission characteristics using the compensation parameters measured a priori from sufficient utterances for each area, the short-term spectrum based PDCMN given by Eq. (4.9) effectively improved the speech recognition performance in all 12 areas for both the single microphone and the microphone array, compared to the conventional CMN. For the microphone array, the conventional CMN and the proposed PDCMN were applied after the delay-and-sum beamforming. The proposed method outperformed the conventional CMN (that is, a typical channel normalization method for dereverberation), microphone array processing (that is, spatial filtering for dereverberation) and the combination of the conventional CMN and microphone array processing. Furthermore, CMN-based dereverberation methods can easily be combined with many other dereverberation methods, such as RelAtive SpecTrA (RASTA) filtering and low-pass AutoRegressive Moving Average (ARMA) filtering [98, 99, 100]. RASTA applies a band-pass filter to the energy in each frequency subband in order to smooth over noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel [98, 101].

The band-pass nature of the RASTA filter and the mean subtraction of CMN both result in a feature vector stream with a mean of zero. In many cases, the performance based on CMN was similar to that based on RASTA under convolutional noise [102, 103]. In [102], RASTA filtering and CMN were examined as normalization methods; experiments showed that RASTA filtering results in slightly better performance than CMN on the unconstrained monophone task. In [103], classical RASTA filtering resulted in decreased recognition performance compared to CMN, while phase-corrected RASTA, a technique that consists of classical RASTA filtering followed by a phase correction operation, reached the same performance level as CMN for a medium and large vocabulary continuous speech recognition task. In some cases, the combination of CMN and RASTA can give better results than either technique alone [99, 101]. Therefore, our proposed method is more effective than some typical dereverberation techniques such as the conventional CMN and the delay-and-sum beamforming. Moreover, the proposed method can easily be combined with many other dereverberation methods, such as beamforming, RASTA filtering and ARMA filtering, and a further improvement should then be obtained. Thus, in this thesis, we did not compare our proposed method with other dereverberation methods such as RASTA filtering and ARMA filtering.

Since the effect of long reverberation on a static speech segment could be compensated by the long-term spectrum based CMN, the combination of short-term spectrum based PDCMN and long-term spectrum based PDCMN (that is, Variable-Term spectrum based PDCMN (VT-PDCMN)) further improved the speech recognition performance. VT-PDCMN achieved a relative error reduction rate of 9.5% over PDCMN for the single microphone and 8.3% over PDCMN for the microphone array. The 40% of speech segments with the smallest cepstral distances for the single microphone and the 30% with the smallest cepstral distances for the microphone array were identified as static speech segments; these proportions were determined empirically. The combination of short-term spectrum based PDCMN and conventional CMN with a fixed weight compensated for the channel distortion and the speaker characteristics simultaneously, so an 11.9% relative error reduction rate was achieved over PDCMN for the single microphone and a 13.9% relative error reduction rate over PDCMN for the microphone array. When VT-PDCMN was combined with the conventional CMN using a fixed weight, it achieved a relative error reduction rate of 19.0% over PDCMN for the single microphone and 19.4% over PDCMN for the microphone array. The best average performance was obtained with the weight coefficient λ = 0.7 for the single microphone and λ = 0.5 for the microphone array. Finally, the combination of VT-PDCMN and the conventional CMN with a variable weight achieved the best recognition performance of all the methods because the optimal weighting coefficients were selected at each frame of an utterance.

Table 6.8: Speech recognition results for the combination of PDCMN/VT-PDCMN and conventional CMN using a single microphone (%)

Area   Conv. CMN   PDCMN   VT-PDCMN   PDCMN +            VT-PDCMN +         VT-PDCMN +
                                      Conv. CMN (fixed)  Conv. CMN (fixed)  Conv. CMN (variable)
1      93.1        96.6    97.0       96.9               97.2               97.8
2      96.1        97.8    97.7       97.9               98.1               98.6
3      94.9        96.6    97.2       97.1               97.6               98.1
4      93.6        95.9    96.2       96.2               96.5               97.0
5      94.4        96.6    97.1       97.2               97.6               97.8
6      93.7        96.0    96.4       97.0               97.1               97.5
7      92.4        95.8    95.9       96.4               96.6               97.0
8      91.1        95.1    95.2       95.2               95.7               96.1
9      93.8        96.9    96.9       97.4               97.3               97.7
10     90.4        94.4    94.6       94.4               95.1               95.5
11     90.9        94.0    95.3       94.9               95.4               95.6
12     89.9        93.6    94.6       94.7               95.2               95.5
Ave.   92.9        95.8    96.2       96.3               96.6               97.0

Table 6.9: Speech recognition results for the combination of PDCMN/VT-PDCMN and conventional CMN using a microphone array (%)

Area   Conv. CMN   PDCMN   VT-PDCMN   PDCMN +            VT-PDCMN +         VT-PDCMN +
                                      Conv. CMN (fixed)  Conv. CMN (fixed)  Conv. CMN (variable)
1      94.8        97.2    97.5       97.7               97.8               98.2
2      96.0        98.1    98.2       97.9               98.3               98.6
3      94.7        96.9    97.3       97.7               98.0               98.3
4      93.1        95.9    96.6       96.7               97.3               97.6
5      94.5        97.0    97.3       98.0               97.8               98.4
6      94.6        97.3    97.1       97.7               97.9               98.2
7      94.1        96.8    97.0       96.9               97.0               97.5
8      93.0        95.9    96.4       96.2               96.8               96.9
9      93.8        96.5    97.0       97.2               97.1               97.7
10     91.5        94.8    95.7       95.4               95.6               96.1
11     91.9        95.5    95.7       95.8               95.9               96.7
12     91.4        94.4    94.7       95.2               95.4               95.8
Ave.   93.6        96.4    96.7       96.9               97.1               97.5

Table 6.10: An example of Euclidean cepstrum distance between different speakers, different vowels and different positions

speaker 1 and speaker 2 (vowel /a/, close-talk)    0.227
speaker 1 and speaker 2 (vowel /i/, close-talk)    0.357
vowel /a/ and vowel /i/ (close-talk, speaker 1)    0.673
vowel /a/ and vowel /i/ (close-talk, speaker 2)    0.673
area 5 and area 10 (vowel /a/, speaker 1)          0.059
area 5 and area 10 (vowel /i/, speaker 1)          0.125
area 5 and area 10 (vowel /a/, speaker 2)          0.081
area 5 and area 10 (vowel /i/, speaker 2)          0.168

In other words, when using the variable-weight combinational CMN, the optimal weighting coefficient was not determined empirically over the entire test data or development data, as in the fixed-weight combinational CMN, but was automatically selected at the frame level from within the range of given weight coefficients. For the single microphone, a 4.1% improvement (57.7% relative error reduction rate) over the conventional CMN and a 1.2% improvement (28.6% relative error reduction rate) over PDCMN were achieved. For the microphone array, a 3.9% improvement (60.9% relative error reduction rate) over the conventional CMN and a 1.1% improvement (30.6% relative error reduction rate) over PDCMN were achieved. The computational cost of the variable-weight combinational CMN was only 1.26 times that of the other methods, even when 3 input streams were used.
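A minimal sketch of the combination just described is given below, assuming the combined mean is a convex combination of the position-dependent cepstral mean and the utterance cepstral mean with weight λ; the exact formulation is that of the combination method in Section 6.3, which is not reproduced here, and the candidate weights are illustrative only.

```python
import numpy as np

def fixed_weight_combined_cmn(cep, pos_mean, lam=0.7):
    """Fixed-weight combination of PDCMN and conventional CMN (sketch).
    pos_mean: position-dependent cepstral mean measured a priori;
    the utterance mean is estimated from the (possibly short) input itself."""
    utt_mean = cep.mean(axis=0)
    return cep - (lam * pos_mean + (1.0 - lam) * utt_mean)

def variable_weight_streams(cep, pos_mean, lams=(0.3, 0.5, 0.7)):
    """Variable-weight variant (sketch): candidate weights produce parallel
    feature streams, and the decoder picks, at every frame, the stream with
    the highest output probability (as in the single-decoder processing)."""
    return [fixed_weight_combined_cmn(cep, pos_mean, lam) for lam in lams]
```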

6.4.7 Discussion – Analysis of Effect of Compensation Parameter Estimation for CMN on Speech/Speaker Recognition

There are very few studies analyzing speech recognition and speaker recognition performance in a distant-talking environment [97]. In [97], the relationship between the word recognition rate and the distance between the microphone and the talker was investigated, and it was shown that performance degraded with increasing distance. In this section, we evaluate speech recognition and speaker recognition using the position-dependent CMN. Speech recognition and speaker recognition are two different tasks. The acoustic features required for speech recognition should maximize the inter-phoneme variation while minimizing the intra-phoneme variation in the feature space; for speaker recognition, the same holds for speaker variation instead of phoneme variation. Thus, the effect of PDCMN on speech recognition may differ from its effect on speaker recognition. We also investigate the effects of the experimental environment, the length of the utterance, the distance between the sound source and the microphone, and other factors on speech/speaker recognition. We then discuss solutions for the degradation caused by these factors in the following sections.

6.4.7.1 Analysis of the cepstrum distance

In this section, we investigate the effect of inter-vowel variation and inter-speaker variation on speech recognition and speaker recognition in a distant environment. Five Japanese vowels were uttered by four male speakers at positions 5 and 10 in our experimental environment, shown in Figs. 3.3 and 3.2.

Table 6.11: Euclidean cepstrum distance

Normalization     w/o CMN    CM of corresponding speaker    CM of corresponding position
inter-vowel       0.316      0.316                          0.316
inter-speaker     0.176      0.130                          0.176
inter-position    0.085      0.039                          0.042

We compared examples of the Euclidean cepstrum distance between different speakers, different vowels and different positions in Table 6.10. We also compared the overall inter-vowel, inter-speaker and inter-position Euclidean cepstrum distances in Table 6.11. The inter-vowel distance is larger than the inter-speaker and inter-position distances. Since the acoustic features required for speech recognition should maximize the inter-vowel variation in this case, while those for speaker recognition should maximize the inter-speaker variation in the feature space, speaker recognition may be more sensitive to channel distortion than speech recognition. "CM of corresponding speaker" means that the Cepstral Mean (CM) of a certain speaker at a certain position was used to compensate for the cepstra of that speaker. "CM of corresponding position" means that the cepstral mean of a certain position, estimated from all speakers, was used to compensate for the cepstra of all speakers at that position. Speaker characteristics are sometimes regarded as transmission characteristics, so "CM of corresponding speaker" compensates for the channel distortion and the speaker characteristics simultaneously, and both the inter-speaker and inter-position distances became smaller. That is to say, "CM of corresponding speaker" may be suitable for speech recognition if an accurate CM can be estimated, but may not be very suitable for speaker recognition. However, in a real recognition system, an accurate CM cannot be estimated, especially when the recognition utterance is short. On the other hand, "CM of corresponding position" compensates for the channel distortion but not for the speaker characteristics, so it may be more effective for speaker recognition. Independently of the recognition utterance, "CM of corresponding position" can be estimated a priori from various utterances by different speakers at a certain position. This is why we proposed the position-dependent CMN in Section 4.3.
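The distances in Tables 6.10 and 6.11 are Euclidean distances between cepstral means of the two conditions being compared; a minimal sketch of this computation is given below (the averaging over frames is an assumption of the sketch rather than the thesis' exact procedure).

```python
import numpy as np

def cepstral_mean_distance(cep_a, cep_b):
    """Euclidean distance between the mean cepstra of two conditions,
    e.g. vowel /a/ by speaker 1 vs. vowel /a/ by speaker 2.
    cep_a, cep_b: (T, D) arrays of cepstral frames."""
    return float(np.linalg.norm(cep_a.mean(axis=0) - cep_b.mean(axis=0)))
```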

6.4.7.2 Comparison of speech recognition performance with speaker recognition performance

We conducted distant-talking speech/speaker recognition experiments on isolated words emitted by a loudspeaker, using a single microphone in both simulated and real environments. For speech recognition, the conventional short-term spectrum based CMN, PDCMN, and their combination were performed, and the experimental setup was the same as in Section 6.4.1.

Table 6.12: Speech recognition error rate (%)

                         w/o CMN    Conv. CMN    PICMN    PDCMN
simulated environment    11.80      7.02         4.94     4.76
real environment         9.91       7.22         4.87     4.24

Table 6.13: Speaker recognition error rate (%)

                         w/o CMN    Conv. CMN    PICMN    PDCMN
simulated environment    7.94       13.01        4.10     2.86
real environment         16.02      13.22        6.57     3.81

For speaker recognition, the experimental setup was the same as in Section 5.3.1, except that 116 syllable-based HMMs based on MFCCs were used as acoustic models instead of the 114 syllable-based HMMs based on LPCs used in Section 5.3.1. In the speaker recognition task, it is difficult to identify the correct speaker using only one utterance because the average duration of the utterances (about 0.6 seconds) is too short. Therefore, the connective likelihood of 3 utterances (about 1.8 seconds) was used to identify the speaker. As described in Section 5.3.1, 100 isolated words were used for the test, so we had 33 test samples for each speaker, that is, a total of 660 samples.

In Tables 6.12 and 6.13, PDCMN is compared with w/o CMN, the conventional CMN, and PICMN (Position-Independent CMN). PICMN means the method in which the compensation parameter ∆C in Equation (4.9) was averaged over positions and the averaged parameter was used independently of the speaker position. Without CMN, neither the speech recognition nor the speaker recognition performance was good, degrading with the distance between the sound source and the microphone. The conventional CMN could not obtain enough improvement in either speech recognition or speaker recognition because the average duration of the utterances was too short (about 0.6 seconds). Furthermore, as analyzed in Section 6.4.7.1, since the conventional CMN removed the speaker characteristics, the inter-speaker cepstrum distance became smaller than that without CMN, which made speaker recognition difficult.

Consequently, the speaker recognition result was much worse than that without CMN in the simulated environment, because the negative effect of the loss of speaker characteristics was larger than the positive effect of the compensation for channel distortion. By compensating for the transmission characteristics using the compensation parameters measured a priori, both PICMN and PDCMN effectively improved speech/speaker recognition performance over w/o CMN and the conventional CMN. Since these methods effectively compensated for the distance-dependent channel distortion but not for speaker variation, PICMN/PDCMN were relatively more effective for speaker recognition than for speech recognition. The proposed PDCMN also achieved a greater improvement than PICMN because it compensated more precisely for the channel distortion depending on the speaker position. As described in Section 6.4.7.1, speaker recognition is more sensitive to channel distortion than speech recognition. Thus, for the same difference in transmission characteristics (that is, the difference in cepstral mean between PDCMN and PICMN), PDCMN achieved a greater improvement for speaker recognition than for speech recognition.
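For reference, the difference between the two normalizations can be sketched as follows, assuming the per-area compensation parameters ∆C of Equation (4.9) are subtracted from the cepstra as in CMN; the dictionary layout and function names are hypothetical choices for this sketch.

```python
import numpy as np

def pdcmn(cep, delta_c_by_area, area_id):
    """PDCMN: subtract the compensation parameter measured a priori for the
    estimated speaker area."""
    return cep - delta_c_by_area[area_id]

def picmn(cep, delta_c_by_area):
    """PICMN: subtract the compensation parameter averaged over all areas,
    independently of the estimated speaker position."""
    avg = np.mean(np.stack(list(delta_c_by_area.values())), axis=0)
    return cep - avg
```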

6.4.7.3 The effect of the experimental environment

In the real environment, the reflections may be very strong and may differ greatly between areas, so the differences in transmission characteristics between areas should be very large. In other words, obstacles caused complex reflection patterns (reverberation) that depended on the speaker position. Thus, PDCMN remarkably improved both speech recognition and speaker recognition performance. In the simulated environment (i.e., an empty room), the reflections may be relatively weak, so the differences in transmission characteristics between areas may also be relatively small. Thus, PDCMN could not improve speech recognition performance sufficiently, because the relatively small difference in cepstral mean between PDCMN and PICMN may not be significant compared to the inter-syllable variation. Both PDCMN and PICMN improved speech recognition performance more than w/o CMN and the conventional CMN. PICMN does not need to estimate the speaker position; therefore, PICMN may also be a good choice in this situation because it simplifies the system implementation. On the other hand, speaker recognition was more sensitive to channel distortion than speech recognition. Therefore, PDCMN improved speaker recognition performance remarkably even in the simulated environment, for which the differences in transmission characteristics were relatively small. PDCMN using a single microphone in a real environment is evaluated in the following sections.

6.4.7.4 The effect of the length of utterance

Figure 6.4: Speech recognition and speaker recognition performances using the conventional CMN with different lengths of utterance (x-axis: number of words for CMN; y-axis: recognition error rate (%); curves: Conv. CMN and PDCMN, each for speech recognition and for speaker recognition)

The recognition of short utterances, such as commands and city names, is very important in many applications.

In the conventional CMN (cepstral mean of the recognition utterance), an accurate cepstral mean cannot be estimated when the utterance is short. PDCMN achieved good speech/speaker recognition performance for the isolated word database. On the other hand, when long utterances are available, it is necessary to compare the performance of PDCMN with that of the cepstral mean estimated from a long utterance. We investigated the effect of the length of the utterance on speech/speaker recognition by varying the number of isolated words. In the case of the conventional CMN with a long utterance, the compensation parameter ∆C in Eq. (4.6) was estimated from multiple words including the recognition word, while an isolated word was used for speech recognition and the connective likelihood of three words was used for speaker recognition. The speech/speaker recognition results obtained with cepstral means estimated from different numbers of words are compared with those of PDCMN in Fig. 6.4. Both speech recognition and speaker recognition improve in performance as the length of the utterance (that is, the number of words) increases. For speaker recognition, since the conventional CMN removed the speaker characteristics, even the result of the conventional CMN with 100 words was much worse than that of PDCMN. For speech recognition, the conventional CMN normalized the channel distortion and the speaker characteristics simultaneously, while PDCMN could not normalize the speaker characteristics, so the conventional CMN was better than PDCMN when the utterance was long enough (that is, when accurate cepstral means could be estimated). In our experiment, the conventional CMN achieved the same performance as PDCMN when the number of words was 3.

Table 6.14: Distances of each area (m)

class no.    left side         straight direction    right side
1            0.78 (area 1)     0.50 (area 2)         0.78 (area 3)
2            1.25 (area 4)     1.10 (area 5)         1.25 (area 6)
3            1.80 (area 7)     1.70 (area 8)         1.80 (area 9)
4            2.57 (area 10)    2.50 (area 11)        2.57 (area 12)

It seemed that PDCMN was not very effective compared to the conventional CMN when a long utterance was available. However, the conventional CMN involves a long delay that is likely unacceptable when the utterance is long. To make speech recognition more robust with a short delay, the normalization of speaker characteristics should be incorporated into PDCMN. In this thesis, robust distant speech recognition combining PDCMN with the conventional CMN using a short utterance (that is, one word of about 0.6 seconds) was proposed in Section 6.3. It achieved almost the same performance (recognition error rate: 3.40%) as the conventional CMN with 100 words (recognition error rate: 3.43%).

6.4.7.5 The effect of distance between the sound source and the microphone

We also investigated the effect of the distance between the sound source and the microphone on speaker recognition in the real environment. The distance of each area is shown in Table 6.14, which can be calculated from Fig. 3.3. The speaker recognition results for the different distances are shown in Fig. 6.5. A sound signal is reflected by various obstacles and attenuated with distance. Thus, in most cases, the greater the distance between the sound source and the microphone, the greater the speaker recognition error rate. However, the results for areas 7, 8 and 9 showed a different tendency. The many reflections in some areas may form special transmission characteristics, so the degradation of the signal depends not only on the distance but also on other factors. Even at the same distance, signals from different directions may have different characteristics. The results for the areas on the right side were better than those on the left side at the same distance, since the areas on the left side were relatively near the wall and the whiteboard (see Figs. 3.3 and 3.2), which caused more reflections. The speech recognition results for the different distances are shown in Fig. 6.6. For speech recognition, the performance degraded consistently as the distance between the sound source and the microphone increased, which differed from the tendency of the speaker recognition performance. The reason may be that speaker recognition performance is more sensitive than speech recognition performance in a distant-talking environment.

Figure 6.5: Comparison of speaker recognition results for different distances between the sound source and the microphone (x-axis: distance (m); y-axis: recognition error rate (%); curves: Conv. CMN and PDCMN for areas on the left side, straight direction and right side)

6.5 Summary

In this chapter, we proposed a robust distant speech recognition system based on the position-dependent CMN using multiple microphones. The proposed method improved speech recognition performance over not only the conventional CMN but also the position-independent CMN. We also compensated for the mismatch between the cepstral means of utterances spoken by humans and those emitted from a loudspeaker. Multi-microphone speech processing technologies were used to obtain robust distant speech recognition. Furthermore, we combined delay-and-sum beamforming with single decoder processing. The proposed multiple microphone-array processing using the single decoder achieved a significant improvement over a single microphone array. Combining the multiple microphone-array processing using the single decoder with the position-dependent CMN, a 3.0% improvement (46.9% relative error reduction rate) over the delay-and-sum beamforming with the conventional CMN was achieved in the real environment at 1.77 times the computational cost. We also proposed a robust distant-talking speech recognition method combining a short-term spectrum based CMN with a long-term spectrum based CMN.

Figure 6.6: Comparison of speech recognition results for different distances between the sound source and the microphone (x-axis: distance (m); y-axis: recognition error rate (%); curves: Conv. CMN and PDCMN for areas on the left side, straight direction and right side)

We assumed that a static speech segment affected by reverberation can be modeled by a long-term cepstral analysis. Thus, the effect of long reverberation on a static speech segment may be compensated by the long-term spectrum based CMN. In Chapter 4, the concept of variable-term spectrum based CMN was extended to a robust speech recognition method based on Position-Dependent CMN (PDCMN) to compensate for channel distortion depending on the speaker position. We call this method Variable-Term spectrum based PDCMN (VT-PDCMN). Since PDCMN/VT-PDCMN cannot normalize speaker variation, we further combined PDCMN/VT-PDCMN with the conventional CMN to compensate simultaneously for the channel distortion and the speaker characteristics. We conducted experiments on our proposed methods using limited-vocabulary (100 words) distant-talking isolated word recognition in a real environment. The combination of VT-PDCMN and the conventional CMN achieved a relative error reduction rate of 60.9% over the conventional short-term spectrum based CMN and 30.6% over the short-term spectrum based PDCMN. Finally, we analyzed the influence of the duration of compensation parameter estimation for CMN and of the distance on speech/speaker recognition.

Chapter 7

Dereverberation Based on Spectral Subtraction by Multi-channel LMS Algorithm for Distant-talking Speech Recognition

7.1 Introduction

So far, the proposed position-dependent CMN based method requires the transmission characteristics of each position a priori. Thus, it is necessary to estimate the speaker position before speech processing. In this chapter, we aim to compensate for the channel distortion without knowledge of the position. Compensating the input features is the main way to reduce the mismatch between the practical environment and the training environment. Cepstral Mean Normalization (CMN) has been used to reduce channel distortion as a simple and effective way of normalizing the feature space [4]. For CMN to be effective, the length of the channel impulse response needs to be shorter than the short-term spectral analysis window. However, the impulse response of reverberation usually has a much longer tail in a distant-talking environment. Therefore, the conventional CMN is not effective under these conditions. Several studies have addressed this problem. Raut et al. [56, 57] used preceding states as units of preceding speech segments and, by estimating their contributions to the current state using a maximum likelihood function, adapted the models accordingly.


However, model adaptation from a priori training data makes such methods less practical to use. A blind deconvolution-based approach for the restoration of speech degraded by the acoustic environment was proposed in [58]. The proposed scheme processed the outputs of two microphones using cepstral operations and the theory of signal reconstruction from phase only. In [59], Avendano and Hermansky explored a speech dereverberation technique whose principle was the recovery of the envelope modulations of the original (anechoic) speech; they applied a data-designed filterbank technique to the reverberant speech. A novel approach for multi-microphone speech dereverberation was proposed in [60]. The method was based on the construction of the null subspace of the data matrix in the presence of colored noise, using the generalized singular-value decomposition (GSVD) technique or the generalized eigenvalue decomposition (GEVD) of the respective correlation matrices. A reverberation compensation method for speaker recognition using spectral subtraction, in which the late reverberation was treated as additive noise, was proposed in [87, 88]. However, the drawback of this approach is that the optimum parameters for spectral subtraction are estimated empirically on a development dataset, and the late reverberation cannot be subtracted well since it is not modeled precisely. In [27, 28], a novel dereverberation method utilizing multi-step forward linear prediction was proposed: the linear prediction coefficients were estimated in the time domain, and the amplitude of the late reflections was suppressed by spectral subtraction in the spectral domain.

In this chapter, we propose a blind reverberation reduction method based on spectral subtraction by an adaptive Multi-Channel Least Mean Square (MCLMS) algorithm for distant-talking speech recognition. Speech captured by distant-talking microphones is distorted by reverberation. With a long impulse response, the spectrum of the distorted speech is approximated by convolving the spectrum of the clean speech with the spectrum of the impulse response. We treat the late reverberation as additive noise, so a noise reduction technique based on spectral subtraction can easily be applied to compensate for it. By excluding the phase information from the dereverberation operation, as in [26, 28], reverberation reduction in the power spectrum domain provides a robustness to estimation errors that the conventional, highly sensitive inverse filtering methods cannot achieve. Spectral subtraction requires a compensation parameter, namely the spectrum of the impulse response. In [108, 109, 110, 111], an adaptive MCLMS algorithm was proposed to blindly identify the channel impulse response in the time domain. In this chapter, we extend this method to blindly estimate the spectrum of the impulse response for spectral subtraction in the frequency domain.
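The core operation can be sketched as follows; the estimate of the late-reverberation power (in the proposed method, derived from the impulse-response spectrum blindly identified by MCLMS) and the flooring constant are assumptions of this sketch rather than the thesis' exact formulation.

```python
import numpy as np

def subtract_late_reverberation(power_spec, late_reverb_power, floor=0.01):
    """Treat late reverberation as additive noise in the power-spectrum domain
    and remove it by spectral subtraction, flooring to keep the result positive.

    power_spec:        (T, F) power spectrogram of the observed reverberant speech
    late_reverb_power: (T, F) estimated power of the late reverberation
    """
    cleaned = power_spec - late_reverb_power
    return np.maximum(cleaned, floor * power_spec)
```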

7.2 Dereverberation Based on Spectral Subtraction

When speech s[t] is corrupted by convolutional noise h[t] and additive noise n[t], the observed speech x[t] becomes

x[t] = h[t] ⊗ s[t] + n[t].    (7.1)

In this chapter, additive noise is ignored for simplification, so Eq. (7.1) becomes x[t] = h[t] ⊗ s[t]. To analyze the effect of the impulse response, h[t] can be separated into two parts, h_early[t] and h_late[t], as in [88]:

h_early[t] = { h[t],  t < T
             { 0,     otherwise,

where T denotes the boundary between the early and late parts of the impulse response.
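A minimal sketch of this split, with the boundary index left as a parameter (its value in the thesis is specified later in this chapter and is not reproduced here):

```python
import numpy as np

def split_impulse_response(h, boundary):
    """Split an impulse response into its early part (the first `boundary`
    samples) and its late part (the remaining tail)."""
    h_early = np.zeros_like(h)
    h_late = np.zeros_like(h)
    h_early[:boundary] = h[:boundary]
    h_late[boundary:] = h[boundary:]
    return h_early, h_late
```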