2010 International Conference on Pattern Recognition

Improved Mandarin Keyword Spotting using Confusion Garbage Model

Shilei Zhang, Zhiwei Shuang, Qin Shi and Yong Qin
IBM Research - China, Beijing, China
{slzhang, shuangzw, shiqin, qinyong}@cn.ibm.com

Abstract-This paper presents an improved acoustic keyword spotting (KWS) algorithm for Mandarin conversational speech that uses a novel confusion garbage model. Observing the KWS corpus, we found many words whose pronunciation is similar to that of predefined keywords, although they are written with different Chinese characters and have different meanings; such words easily cause a high false alarm rate. We therefore developed an improved acoustic KWS method with confusion garbage models that absorb the similarly pronounced words confused with specific keywords for a given task. One clear advantage of this method is that it provides a flexible framework for implementing the selection procedure and reducing the false alarm rate effectively for a specific task. The efficiency of the proposed architecture was evaluated under HMM-based confidence measure (CM) methods and demonstrated on a conversational telephone dataset.

Keywords-keyword spotting; similar pronunciation; confusion garbage model; confidence measure

I. INTRODUCTION

Keyword spotting, an important branch of speech recognition, is the task of detecting occurrences of predefined keywords in an unconstrained audio stream. KWS systems have been widely used in practical applications such as command and control, information retrieval, call centers, message classification and automatic operator query systems. Without any a priori knowledge of the statistical grammar of the utterances, keyword spotters can quickly detect the useful information embedded in natural conversational speech, which may be full of emotional sounds, background noise, etc. This makes keyword spotting different from continuous speech recognition. Users can speak in a natural manner and in natural scenarios, such as telephone speech, recordings of meetings, interviews and lectures.

The existing work on keyword spotting can be categorized into three major approaches [1]. The first is the acoustic keyword spotting approach, which is often used for online spotting in an audio stream. In this approach, all words other than the keywords are assumed to be garbage and are represented by garbage models. The second is the Large Vocabulary Continuous Speech Recognition (LVCSR) approach, which requires complete decoding of the speech signal and outputs a completely decoded sentence. The third is a hybrid approach that makes use of a phoneme lattice: keyword spotting searches the phoneme lattice and outputs whether or not a keyword is present in the signal.

In KWS technology, the confidence measure of a recognized keyword candidate is of crucial importance. In previous work [2,3,4], techniques for handling pronunciation variation caused by accent or speaking style have been studied for keyword spotting. The phone confusion network method, determined through either knowledge-based or data-driven approaches, is employed in KWS systems to achieve a high detection rate. However, the more keyword variants there are in the spotting network, the higher the confusability and the false alarm rate become, and in some tasks it is more important to control and lower the false alarm rate effectively. In this paper, we discuss acoustic keyword spotting with a novel architecture, called the confusion garbage model, that decreases the false alarm rate. Words whose pronunciation is similar to that of the predefined keywords are modeled by a confusion garbage model connected in parallel to the keyword models and the conventional garbage models. In the proposed keyword spotting algorithm, extraneous speech is handled both by conventional garbage models composed of all Chinese syllables and by confusion garbage models composed of similarly pronounced words.

The rest of the paper is organized as follows. Section II discusses the baseline keyword spotting paradigm; Section III describes the proposed method in detail; experimental results are presented in Section IV, followed by conclusions in Section V.

1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.901

II. BASELINE SYSTEM OVERVIEW

A. Acoustic Keyword Spotting

In general, acoustic keyword spotting uses a parallel network, as shown in Fig. 1, where the keyword network is modeled by keyword phonetic strings and the garbage network is composed of all 1291 single Chinese syllables with tone types, formed by simple concatenation of phoneme models. The keyword HMM aims to detect the keyword, while the garbage HMM is used to reject all non-keyword speech. Unlike an ASR recognizer, a keyword spotter is designed only to spot the specific keywords and need not recognize every word in the speech. This baseline system is a speaker dependent, null-grammar, time-synchronous Viterbi beam search decoder consisting of a parallel network of keywords and garbage words. For a specific KWS task, it is infeasible to acquire grammar knowledge about word sequence statistics from the available training data; therefore a null grammar, in which any keyword or garbage word can follow any other with equal probability, was used in the KWS system. In real applications, we can also control the detection rate by adjusting the keyword entrance probability C_kw and the garbage entrance probability C_gb.

Figure 1. Null-grammar keyword spotting network.

B. HMM-based Confidence Measures

The confidence measure is of crucial importance: it indicates the degree of certainty with which the algorithm believes that the keyword is spotted in the utterance. The posterior probability based on the standard maximum a posteriori decision rule is a good CM candidate in speech technology, since it is an absolute measure of the degree of confidence [5]. In this section, we review the HMM-based posterior probability methods used to estimate the normalization term in the denominator, which serve as the CM score in the following experiments. For each word $W$, given a sequence of observation vectors $X = x_1, x_2, \dots, x_N$ and the underlying sequence of HMM phone states $Q = q_1, q_2, \dots, q_N$, we compute the word-based log posterior probability $\log P(W \mid X)$ as follows:

$$\log P(W \mid X) = \log \frac{P(X \mid W)\,P(W)}{P(X)} = \frac{1}{N} \sum_{i=1}^{N} \log \frac{p(x_i \mid q_i)\,P(q_i)}{\max_{j=1,\dots,M} p(x_i \mid q_j)\,P(q_j)} \tag{1}$$

where $p(x_i \mid q_i)$ is the probability density of the observation $x_i$ under the model corresponding to the HMM phone state $q_i$. The sum in the denominator is approximated by its maximum over $j = 1, \dots, M$, which runs over the set of all HMM phone state models in the decoding paths at frame $i$. $P(q_i)$ represents the prior probability of the phone state $q_i$, and all phone states are assumed to be equally likely, $P(q_i) = P(q_j)$.

III. PROPOSED ALGORITHM

Two words with different spellings and different meanings may be pronounced similarly in Mandarin Chinese, which can be called "pronunciation similarity". For instance, the Chinese word 技术 (technology) is pronounced "Ji4 Shu4" and 基础 (basis) is pronounced "Ji1 Chu3" in Mandarin, and their pronunciations are similar to each other. When 技术 is defined as a keyword in a KWS task and the confusion word 基础 occurs in conversational speech, 技术 is easily spotted by mistake. Error analysis shows that pronunciation similarity can degrade KWS performance if it is not well accounted for. We propose a novel approach to the pronunciation similarity problem, in which pronunciation confusion words are selected and added to the garbage network as alternative competing hypotheses in order to fit the acoustic data better.

A. Phoneme Confusion Table

Modeling units play a very important role in state-of-the-art speech recognition systems. Their design and selection directly affect the performance of the final speech recognition engine. A straightforward tonal phone set can be derived directly from the Pinyin system, which maps Mandarin phonetics using a set of initials and finals. By grouping the glides with the consonant initials into premes, each syllable can be decomposed into demisyllables [6]. Adding four auxiliary phonemes (undefined, word boundary, inter-word silence, and inter-utterance silence) brings the total number of tone-specific demisyllable phonemes in the set to 162. This tonal demisyllable system provides the basic phonetic units for all systems described in this paper.

Next we evaluate the similarity between two phonemes. The basic idea is to build models for the phonemes and then apply a similarity computation technique to those models. In this paper, Gaussian models trained on conversational telephone speech are employed as the underlying phoneme models. Each phoneme is modeled as a single Gaussian distribution, $c \sim N(\mu, \Sigma)$. The Mahalanobis distance measure [7] is considered appropriate as a similarity measure, since it produces smaller values for more similar phones. The metric for the similarity between two phonemes $i$ and $j$ is:

$$d^2(c_i, c_j) = (\mu_i - \mu_j)^T \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1} (\mu_i - \mu_j) \tag{2}$$
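As an illustration of Eq. (2), the following minimal sketch computes the pairwise distance between single-Gaussian phone models and collects the results in a table. It assumes diagonal covariance matrices, so that the matrix inverse reduces to a per-dimension division; the paper does not state the covariance structure. The function names and the toy two-dimensional models are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def phone_distance(mean_i, var_i, mean_j, var_j):
    """Squared distance of Eq. (2), assuming diagonal covariances.

    With diagonal covariances, inverting (Sigma_i + Sigma_j) / 2 reduces to
    dividing each squared mean difference by the averaged variance.
    """
    return sum(
        (mi - mj) ** 2 / ((vi + vj) / 2.0)
        for mi, vi, mj, vj in zip(mean_i, var_i, mean_j, var_j)
    )


def build_similarity_table(models):
    """Return a symmetric table {(phone_a, phone_b): distance}."""
    table = {(name, name): 0.0 for name in models}
    for a, b in combinations(sorted(models), 2):
        d = phone_distance(*models[a], *models[b])
        table[(a, b)] = table[(b, a)] = d
    return table


# Toy models (hypothetical values): phone -> (mean vector, variance vector).
toy_models = {
    "J": ([0.0, 1.0], [1.0, 1.0]),
    "Q": ([0.5, 1.0], [1.0, 1.0]),
    "SH": ([3.0, -1.0], [1.0, 2.0]),
}
phone_table = build_similarity_table(toy_models)
```

Smaller table entries correspond to more similar phones, in line with the convention stated for the Mahalanobis measure.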

Then we can generate a general phoneme similarity table.

B. Confusion Garbage Word Generation

1) Word similarity computation: Chinese words consist of multiple syllables, each of which can be further broken down into smaller demisyllable phones as mentioned above. The similarity score of a syllable pair, with each syllable decomposed into two demisyllables, can be measured by the scores of the corresponding demisyllable phone pairs. For word pairs of different lengths, an alignment process similar to the one used in recognition performance evaluation can be carried out to find the optimal syllable matching based on the syllable similarity scores. For the deletion or insertion of a syllable in the alignment of a word pair, we define the similarity score as a penalty value, computed as the average of all phone similarity scores. We then compute the word pair similarity as the sum of all corresponding syllable similarity scores, normalized by the length of the longer word.

2) Confusion Word Selection: As our Mandarin knowledge source, we adopt a lookup vocabulary of 107.8K words as the pronunciation lexicon, which covers almost all common words. The length of a word in the lexicon varies from one to fifteen Chinese characters. Although the regular confusion words for each keyword can be obtained from linguistic and acoustic information, the other keywords in the given task need to be taken into account to select the optimal confusion words. For a specific task with N keywords, we select the top 5 confusion word candidates per keyword from the above vocabulary based on word similarity scores. The resulting 5N words compose the group of confusion word candidates. We then refine this group into the final confusion words using the unique similarity method, a measure of how similar a word is to a specific keyword and how dissimilar it is to the other keywords of the task in pronunciation. The confusability of a word increases with its similarity to a certain keyword but is offset by its similarity to the other keywords in the keyword list.
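The word similarity computation of step 1) can be sketched as an edit-distance style alignment over syllables. This is only an illustration under stated assumptions: the syllable score is taken as the sum of the two demisyllable distances (the paper does not give the exact combination), the insertion/deletion penalty is averaged over pairs of distinct phones, and the phone inventory, distance values and function names are hypothetical.

```python
def syllable_score(syl_a, syl_b, phone_table):
    """Assumed combination: sum of the distances of the paired demisyllables."""
    return sum(phone_table[(pa, pb)] for pa, pb in zip(syl_a, syl_b))


def gap_penalty(phone_table):
    """Insertion/deletion penalty: average distance over distinct phone pairs."""
    scores = [d for (a, b), d in phone_table.items() if a != b]
    return sum(scores) / len(scores)


def word_similarity(word_a, word_b, phone_table):
    """Align two syllable sequences by dynamic programming and return the
    total alignment score normalized by the length of the longer word."""
    penalty = gap_penalty(phone_table)
    n, m = len(word_a), len(word_b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * penalty
    for j in range(1, m + 1):
        cost[0][j] = j * penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1][j - 1] + syllable_score(
                word_a[i - 1], word_b[j - 1], phone_table
            )
            cost[i][j] = min(match, cost[i - 1][j] + penalty, cost[i][j - 1] + penalty)
    return cost[n][m] / max(n, m)


# Toy phone table (hypothetical distances; smaller means more similar).
PHONES = ["j", "sh", "ch", "i1", "i4", "u3", "u4"]
CLOSE = {
    frozenset(("sh", "ch")): 1.0,
    frozenset(("i1", "i4")): 0.5,
    frozenset(("u3", "u4")): 0.5,
}
phone_table = {
    (a, b): 0.0 if a == b else CLOSE.get(frozenset((a, b)), 4.0)
    for a in PHONES
    for b in PHONES
}

# Each word is a list of syllables; each syllable is a pair of demisyllables.
JISHU = [("j", "i4"), ("sh", "u4")]  # 技术
JICHU = [("j", "i1"), ("ch", "u3")]  # 基础
score = word_similarity(JISHU, JICHU, phone_table)
```

With these toy values the two syllables of 技术 and 基础 align one to one, so no insertion or deletion penalty is applied.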
In our method, a vector $T_i = (W_{i1}, W_{i2}, \dots, W_{i5})$ denotes the confusion word candidates related to keyword $i$, and its $k$th element $W_{ik}$ denotes the unique similarity value of the $k$th confusion word related to the $i$th keyword, that is, the weighted similarity score, which is calculated as follows:

$$W_{ik} = \frac{(M - 1)\,\mathrm{simscore}_{ik}}{\sum_{j=1}^{M} \mathrm{simscore}_{jk} - \mathrm{simscore}_{ik}} \tag{3}$$

where $M$ is the number of all keywords and $\mathrm{simscore}_{ik}$ denotes the similarity score between keyword $i$ and confusion word $k$. In this study, based on the analysis of and experiments on the development set, we select the top N/5 confusion word candidates with the smallest unique similarity scores as the group of final confusion words for a given task. The confusion garbage models composed of confusion words bring both improvements and deterioration to the KWS system, so the total number of confusion words is very important. In future work, more research is needed to find an optimal algorithm for efficiently deciding the number of confusion words for a specific task.
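The selection step built on Eq. (3) can be sketched as follows. The layout of the score matrix (`sim[j][k]` holds the score between keyword j and candidate word k), the `owners` list mapping each candidate to the keyword it was proposed for, and all numeric values are hypothetical choices made for this illustration.

```python
def unique_similarity(sim, i, k):
    """Eq. (3): score of candidate k for keyword i, relative to the average
    score between candidate k and the other M - 1 keywords."""
    m = len(sim)
    others = sum(sim[j][k] for j in range(m)) - sim[i][k]
    return (m - 1) * sim[i][k] / others


def select_confusion_words(sim, owners, n_select):
    """Return the indices of the n_select candidates with the smallest
    unique similarity values."""
    scored = sorted(
        (unique_similarity(sim, owners[k], k), k) for k in range(len(owners))
    )
    return [k for _, k in scored[:n_select]]


# Toy task: 3 keywords (rows) and 4 candidate words (columns).
sim = [
    [1.0, 2.0, 6.0, 5.0],  # scores against keyword 0
    [4.0, 2.5, 1.5, 6.0],  # scores against keyword 1
    [6.0, 3.0, 4.5, 2.0],  # scores against keyword 2
]
owners = [0, 0, 1, 2]  # keyword for which each candidate was proposed
selected = select_confusion_words(sim, owners, n_select=2)
```

In this toy example, candidate 1 is close to keyword 0 but almost as close to the other keywords, so its unique similarity value is large and it is not selected.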

C. KWS with Confusion Garbage Models

The phonetic strings for the above confusion words are obtained from the lookup dictionary. Each confusion word is modeled by a confusion garbage model, represented as a sequence of phonemes and connected in parallel to the keyword models and the conventional garbage models; this parallel network composes our proposed null-grammar keyword spotting network, as shown in Fig. 2. The confusion garbage model can then handle extraneous speech whose pronunciation is similar to that of the keywords by modeling it explicitly.

Figure 2. Proposed null-grammar keyword spotting network.
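The parallel structure of the proposed network can be pictured with a small data-structure sketch. It only illustrates how keyword, confusion garbage and conventional garbage branches sit side by side; it is not a decoder. The `Branch` type, the function names, the phone strings and the entrance probability values are hypothetical, and giving the confusion garbage branches the garbage entrance probability C_gb is an assumption that the paper does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    label: str               # word or syllable represented by this branch
    kind: str                # "keyword", "confusion" or "garbage"
    phones: tuple            # phone sequence concatenated into the branch HMM
    entrance_logprob: float  # log of the entrance probability


def build_network(keywords, confusion_words, syllables, c_kw, c_gb):
    """Collect all branches of the null-grammar parallel network.

    Assumption: confusion garbage branches share the garbage entrance
    probability c_gb.
    """
    groups = (
        ("keyword", keywords, c_kw),
        ("confusion", confusion_words, c_gb),
        ("garbage", syllables, c_gb),
    )
    return [
        Branch(label, kind, tuple(phones), math.log(prob))
        for kind, lexicon, prob in groups
        for label, phones in lexicon.items()
    ]


def is_keyword_hit(best_branch):
    """Only segments decoded through a keyword branch are reported; segments
    absorbed by confusion or conventional garbage branches are rejected."""
    return best_branch.kind == "keyword"


network = build_network(
    keywords={"技术": ["j", "i4", "sh", "u4"]},
    confusion_words={"基础": ["j", "i1", "ch", "u3"]},
    syllables={"ji4": ["j", "i4"], "shu4": ["sh", "u4"]},
    c_kw=0.5,
    c_gb=0.25,
)
```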

IV. EXPERIMENTS

A. Performance Measures

The possible events in keyword spotting are hit, false alarm and false rejection. Performance is evaluated by presenting the false rejection rate (FRR) as a function of the false alarm rate (FAR), which yields the Detection Error Tradeoff (DET) curve. The FAR and FRR are defined as:

$$\mathrm{FAR} = \frac{\#fa}{\#KW \cdot HR \cdot C} \times 100\%; \qquad \mathrm{FRR} = \frac{\#fr}{N} \times 100\% \tag{4}$$

where #fa is the number of false alarms; #fr is the number of false rejections; #KW is the number of keywords; N is the total number of keyword occurrences; HR is the duration of the test set in hours; and C is a constant that adjusts the scale between FRR and FAR, usually set to 10. The Equal Error Rate (EER), the point at which FAR and FRR have the same value, is used as the evaluation metric.
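The measures of Eq. (4) and the search for the equal error point can be sketched as below. The threshold sweep over CM scores, the function names and the toy detection list are assumptions made for illustration; reference occurrences that are never accepted are counted as false rejections.

```python
def far_frr(n_fa, n_fr, n_keywords, n_occurrences, hours, c=10.0):
    """FAR and FRR in percent, following Eq. (4)."""
    far = n_fa / (n_keywords * hours * c) * 100.0
    frr = n_fr / n_occurrences * 100.0
    return far, frr


def sweep_thresholds(detections, n_keywords, n_occurrences, hours, c=10.0):
    """detections: list of (cm_score, is_true_hit) pairs.

    Returns one (threshold, FAR, FRR) point per distinct score, accepting
    every detection whose score is at or above the threshold.
    """
    points = []
    for threshold in sorted({score for score, _ in detections}):
        accepted = [hit for score, hit in detections if score >= threshold]
        n_hit = sum(1 for hit in accepted if hit)
        n_fa = len(accepted) - n_hit
        far, frr = far_frr(
            n_fa, n_occurrences - n_hit, n_keywords, n_occurrences, hours, c
        )
        points.append((threshold, far, frr))
    return points


def equal_error_point(points):
    """Operating point where FAR and FRR are closest to each other."""
    return min(points, key=lambda p: abs(p[1] - p[2]))


# Toy detections (hypothetical CM scores): one keyword, two true occurrences.
detections = [(-0.2, True), (-0.5, False), (-0.9, True), (-1.5, False)]
points = sweep_thresholds(detections, n_keywords=1, n_occurrences=2, hours=1.0)
threshold, far, frr = equal_error_point(points)
```

On real data the sweep would give many more points, and the EER would be read off where the DET curve crosses the line FAR = FRR.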

B. Experimental Results

1) Data and system setup: We use 700 hours of real data extracted from conversational telephone speech for HMM training and phoneme similarity table generation. The experiments were carried out on the benchmark test of the China 2005 863 national evaluation project [8]. All utterances are spontaneous, accented Mandarin Chinese speech recorded over landline telephone. There is about 1 hour of test data, which defines 100 keywords (76 two-character words and 24 three-character words) with 398 occurrences. For the baseline system, the telephone input signal is coded using 13-dimensional PLP features with a 25 ms window and a 10 ms frame shift. In the front end, LDA, MLLT (maximum likelihood linear transform) and feature-space MPE (fMPE) are applied. Three-state, left-to-right HMMs are used to represent the 162 phones. The HMM states are context-dependent and clustered into equivalence classes using decision trees. The distributions of the 5K states are modeled by a pool of 150K Gaussian densities.

TABLE I. SAMPLES OF CONFUSION WORDS WITH PINYIN.

Keywords | Confusion words
机器 (Ji1 Qi4) | 继续 (Ji4 Xu4)
寝室 (Qin3 Shi4) | 其实 (Qi2 Shi2)
数学 (Shu4 Xue2) | 识别 (Shi2 Bie2)
牺牲 (Xi1 Sheng1) | 形成 (Xing2 Cheng2)
翻译 (Fan1 Yi4) | 发育 (Fa1 Yu4)

TABLE II. EER PERFORMANCE USING HMM-BASED CM

System | EER | Recall | Precision
HMM-based CM baseline | 34.6% | 65.4% | 43.1%
+ Confusion Garbage Model (CGM) | 31.1% | 68.9% | 47.0%

Figure 3. Performance comparison using HMM-based CM.

2) Results: Based on the proposed method, we select 20 confusion words, some of which are listed in Table I. Fig. 3 shows the DET curves of the baseline system and the proposed system under the HMM-based CM method. As shown in Table II, the EER with confusion garbage models achieves a 10.1% relative reduction compared with that of the baseline models under the HMM-based CM method. Error analysis of this experiment shows that, compared with the baseline system, the proposed method with confusion words degrades the FRR and improves the FAR; however, the deterioration caused by adding confusion garbage models is smaller than the improvement. The net result is therefore an improvement in EER. In practical applications, KWS is used in decision-making applications that involve a CM threshold to determine whether the output should be true or false.

V. CONCLUSIONS

Keyword spotting is an important technique for developing a user-friendly speech recognition system that can handle natural conversational speech. The work in this paper brings the confusion garbage model into a modern Mandarin KWS system, which shows better performance and robustness than the system without it. Pronunciation similarity is a natural and unavoidable phenomenon in any language environment. This paper focuses on Mandarin acoustic keyword spotting, and the framework studied here is applicable to other languages and other KWS methods as well.

REFERENCES

[1] S. Igor, S. Petr, M. Pavel, B. Lukas, et al., "Comparison of keyword spotting approaches for informal continuous speech," INTERSPEECH'05, pp. 633-636, 2005.

[2] P. Jinto, A. Lovitt and H. Hermansky, "Exploiting Phoneme Similarities in Hybrid HMM-ANN Keyword Spotting," INTERSPEECH'07, pp. 1817-1820, Antwerpen, Belgium, August 27-31, 2007.

[3] K. M. Knill and S. J. Young, "Speaker dependent keyword spotting for accessing stored speech," Technical Report F-INFENG/TR 193, Cambridge University, 1994.

[4] J. Shao et al., "A fast fuzzy keyword spotting algorithm based on syllable confusion network," INTERSPEECH'07, pp. 2405-2408, Belgium, Aug. 2007.

[5] H. Jiang, "Confidence Measures for Speech Recognition: A Survey," Speech Communication, pp. 455-570, 2005.

[6] S. L. Zhang et al., "Main vowel domain tone modeling with lexical and prosodic analysis for Mandarin ASR," ICASSP'09, pp. 4561-4564, 2009.

[7] H. Beigi et al., "A Distance Measure between Collections of Distributions and Its Application to Speaker Recognition," ICASSP, pp. 753-756, USA, May 1998.

[8] The HTRDP Evaluation Group, "The 2005 HTRDP evaluation guidelines for automatic speech recognition," 2005.
