Recognition of Multiple Language Voice Navigation Queries in Traffic Situations

Gellért Sárosi (1), Tamás Mozsolics (1,2), Balázs Tarján (1), András Balog (1,2), Péter Mihajlik (1,2), and Tibor Fegyó (1,3)

(1) Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, {sarosi,tarjanb,mihajlik,fegyo}@tmit.bme.hu, http://www.tmit.bme.hu
(2) THINKTech Research Center Nonprofit LLC., {tmozsolics,abalog}@thinktech.hu, http://www.thinktech.hu/
(3) Aitia International Inc., http://www.aitia.ai/

Abstract. This paper introduces our work and results related to a multiple language continuous speech recognition task. The aim was to design a system that introduces only a tolerable amount of recognition errors for point of interest words in voice navigational queries, even in the presence of real-life traffic noise. An additional challenge was that no task-specific training databases were available for language and acoustic modeling. Instead, general-purpose acoustic databases were obtained for the acoustic models, and (probabilistic) context-free grammars were constructed for the language models. A public pronunciation lexicon was used for the English language, whereas rule- and exception-dictionary-based pronunciation modeling was applied for French, German, Italian, Spanish and Hungarian. For the last four languages, the classical phoneme-based pronunciation modeling approach was also compared to a grapheme-based pronunciation modeling technique. Noise robustness was addressed by applying various feature extraction methods. The results show that achieving high word recognition accuracy is feasible if cooperative speakers can be assumed. Keywords: point of interest, speech recognition, context free grammar, noise robustness, feature extraction, multiple languages, navigation system

1 Introduction

The main interest of our paper is the design of a speech-based automated guiding service for car drivers and pedestrians. People can ask for help through the public telephone network to find a target destination. The system, shown in Figure 1, supports multiple languages; the required language is selected by a keystroke.

Fig. 1: Overview of the navigation service system (Request → ASR → Content Service Provider → Response → TTS → target GPS coordinates; human help is provided in case of failed recognition).

Incoming calls are directed to a service center featuring a two-level processing system. First, an automated call-center service tries to identify the POI (Point Of Interest) in the incoming call based on ASR (Automatic Speech Recognition) technology. The ASR system matches the incoming utterance to a previously loaded speech recognition network and returns the most likely result. The network represents word sequences expected in real-life navigational situations. If a customer notices that the ASR has failed, the call is rerouted to a human assistant, who answers the request. In either case, the user's navigation system gets back the GPS coordinates of the most probable POI as an answer.

In this paper, we present the design and implementation issues of the ASR part of the system. First, we review related work in the next section. Then, in Section 3, the characteristics of the training and test databases are described. Section 4 details the training process of the ASR system and the feature extraction methods that we used in our experiments. Our results are summarized in Section 5. In the last section, we discuss our findings and draw some pertinent conclusions.
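To make the two-level flow concrete, the sketch below shows how such a dispatch could be structured; every identifier, type and the confidence threshold are illustrative assumptions of ours, not details of the deployed service.

```python
# A minimal sketch of the two-level call handling described above; all
# names and the threshold value are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AsrResult:
    poi: str           # best-matching point of interest
    confidence: float  # decoder score mapped to [0, 1]

CONFIDENCE_THRESHOLD = 0.6  # assumed operating point

def recognize_poi(audio: bytes, language: str) -> Optional[AsrResult]:
    """Stand-in for the ASR pass that matches the utterance against the
    preloaded recognition network of the selected language."""
    return AsrResult(poi="airport", confidence=0.9)  # placeholder result

def ask_human_operator(audio: bytes) -> str:
    """Stand-in for rerouting the call to a human assistant."""
    return "airport"

def poi_to_gps(poi: str) -> Tuple[float, float]:
    """Stand-in for the content service provider's POI database lookup."""
    return (47.4979, 19.0402)  # placeholder coordinates

def handle_call(audio: bytes, language: str) -> Tuple[float, float]:
    result = recognize_poi(audio, language)            # level 1: ASR
    if result is None or result.confidence < CONFIDENCE_THRESHOLD:
        poi = ask_human_operator(audio)                # level 2: human help
    else:
        poi = result.poi
    return poi_to_gps(poi)                             # GPS answer to the user
```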

2 Tasks and Related Works

The integration of speech recognition into car navigation systems is increasingly popular. There are several commercially available solutions enabling voice control for navigation, searching, call management, note or e-mail dictation, etc. However, these applications require sophisticated cell phones or mobile operating systems, and operate only in US English. Speech recognition services are typically server-based solutions, yet these are still device-dependent applications. Our approach was to develop a server-based speech recognition system for a device-independent service. Since ASR systems still require more resources than an everyday cell phone can provide, the optimal solution is to integrate the recognizer into a platform-independent application like a call center; thus, the navigation service can be reached by a simple telephone call. We developed an ASR system for navigational services which operates in six languages, with manual language pre-selection.

One of the leading companies in speech technology, Google, has recently published a study [1] about the language modeling approaches applied in their "Search by voice" service.


In this study, they reported using a training database consisting of 320 billion words of Google text search queries. Processing such a huge training corpus needs special treatment: in order to reduce the vocabulary and language model size, a finite-state text normalization technique and aggressive language model pruning were applied. With the resulting recognition network, around 17% WER was achieved on 10k queries by using 2-pass decoding with lattice rescoring. In another paper by Google [2], a description of the development of the acoustic models used for "Search by voice" can be found. The service was started with a mismatched acoustic model. As the users provided more and more training data, manual transcriptions were first made for supervised training; then, as the traffic increased, they changed to unsupervised training. Every release was reported to improve the overall accuracy of the system. Researchers at Microsoft presented methods for training text normalization and for interpolating transcripts of real calls with a listing database [3]. However, it was also emphasized that there is no better data than more real data. Hence, our task is more challenging, as we lack task-specific training databases.

During dictionary building, we set the number of POI expressions to around 8000 as a compromise between recognition accuracy and POI-expression variety. In [4], there are many more POI words in the language model than in our dictionary; however, no surrounding text is allowed during recognition. This solution simplifies the system, but ignoring the contextual information can make recognition more difficult. In [4], speech enhancement was combined with end-point detection and speech/non-speech discrimination. Time-domain preprocessing stages, such as voice activity detection, may discard noisy but important speech segments; therefore, we relied on feature extraction alone to process the speech signals. The general experience [5], [6] is that every method performs differently depending on the noise conditions and the SNR level. Therefore, we re-evaluated several recently developed and several baseline feature extraction methods to examine which front-ends are more suitable for our real-life noisy recognition task.

3 Databases

For training purposes, we used various SpeechDat-type databases (see [7] and [8]) recorded typically through mobile telephone networks. These contain recordings from 500 to 5000 speakers for each of the required languages. Common features of all training and testing databases are the 8 kHz sample rate, a single channel and 8-bit A-law encoding. Recognition tests were performed on a database consisting of navigational questions or statements spoken by native speakers of various ages and both genders. All of them were recorded through the mobile telephone network, either from the street or from a moving vehicle. The length of the test recordings was in the range of 1-6 seconds. The source identifiers and parameters of the training and test corpora are detailed in Table 1.


Table 1: The source identifiers (see [7]) and the most important features of the acoustic training databases, and the parameters of the test recordings.

                  English  French  German  Hungarian  Italian  Spanish
Training db:
  ELRA ID         S0011    S0061   S0051   –          S0116    S0101
  length [hours]  17.8     57.9    62.1    28.9       93.7     56.5
  # of words      64k      269k    219k    92k        251k     212k
  # of chars      373k     1447k   1586k   630k       1568k    1247k
Test data:
  # of records    58       40      26      291        71       28
  # of speakers   9        5       3       27         9        3
  genders [m/f]   5/4      1/4     3/0     20/7       5/4      2/1

The test database was recorded in the presence of a wide variety of background noises – the callers were asked to walk on the street or travel in a vehicle during the recording. The test subjects had to read out or make up sentences consisting of a question about or a description of a POI – for example, a theater, a diner or a museum. Altogether, 85% of the collected sentences followed the predefined sentence structures, while 15% were constructed freely by the test subjects. Some sentences did not contain any POIs.

4 Speech Recognition Models

4.1 Language Models

Our approach was to apply LVCSR technology to extract POIs from the spoken utterances. Continuous speech recognizers are usually trained on task-specific text corpora. However, collecting a large training database specifically for the investigated speech-controlled guiding system could not fit into the project's financial and time limits. In this section, we present two language modeling techniques that can be used when a theme-specific training corpus is completely unavailable.

The first model presented here is a rule-based grammar, where the expected sentences for the given situation have to be collected manually. Examples of search statements:

'Where is the nearest pizza/POI restaurant/POI?'
'Is there a shoe/POI shop/POI near here?'
'Find the Modern/POI Art/POI Gallery/POI.'
'Take me to the airport/POI.'

where the whole sentences are recognized and the /POI tags help to extract the POI words from the output of the recognizer. Theoretically, there are infinitely many sentence variations, and it is impossible to collect all the potential search requests. However, separating the class of POIs from the sentence structures results in a much more general representation:

'Where is the nearest [poi]?'    'pizza restaurant'
'Is there a [poi] near here?'    'shoe shop'
'Find the [poi].'                'Modern Art Gallery'
'Take me to the [poi].'          'airport'

Replacing the [poi] tags in the predefined sentence structures with the actual POIs in the right column, we get a CFG (Context Free Grammar) model which has N_SENTENCES × N_POI sentences (16 in the example). The sentence structure variations can be efficiently described in the GRA format defined in the Phoenix Parser (see [9]), and it is practical to divide the destinations into subcategories like restaurants, shopping, services, etc.

As mentioned above, building an efficient N-gram language model would require a large, task-specific training text corpus. Fortunately, there is a technique [10] that allows us to utilize the collected sentences to train a stochastic grammar. This method performs a three-way randomization on the original database, where the sentences are varied in length, word order and appearance probability. The resulting corpus is suitable for class N-gram training. The full process is carried out with the Logios language compilation suite [11], which generates class N-gram language models in ARPA format (hereafter referred to as PCFG N-grams: Probabilistic Context Free Grammars, according to [10]) from GRA-format CFGs.
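As a toy illustration of the template/POI separation (our own sketch, not the Logios toolchain itself), the following snippet expands the four example templates with the four example POIs into the 16 sentences mentioned above:

```python
# Expanding sentence templates with a POI class: N_SENTENCES x N_POI sentences.
templates = [
    "Where is the nearest [poi]?",
    "Is there a [poi] near here?",
    "Find the [poi].",
    "Take me to the [poi].",
]
pois = ["pizza restaurant", "shoe shop", "Modern Art Gallery", "airport"]

sentences = [t.replace("[poi]", p) for t in templates for p in pois]
print(len(sentences))   # 16 = 4 templates x 4 POIs
print(sentences[0])     # Where is the nearest pizza restaurant?
```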

Table 2: Details of the grammar models.

                     English  French  German  Hungarian  Italian  Spanish  Average
sentence structures  36.7k    4.1k    68.3k   18.8k      5.4k     17.6k    ≈25k
dictionary size      5.4k     5.9k    7.1k    15k        7.8k     5.4k     ≈7.8k

CFG and PCFG 3-gram models were built for the target languages with the parameters shown in Table 2. There are 8000 POIs used during model training; however, the dictionary sizes are smaller than 8k for all languages except Hungarian (discussed later). POI expressions share words – as in the Museum of Applied Arts and the National Museum – which reduces the dictionary size. The variance between the languages comes from the different numbers of contextual words. Hungarian, however, is a highly agglutinative language, so all POIs have three more alternatives with the '-t', '-ba/-be' and '-hoz/-hez/-höz' suffixes (detailed in Section 4.6), which drastically increases the dictionary size.


4.2 Pronunciation model

In the phoneme-based approach, simple grapheme-to-phoneme rules are applied to each lexicon separately in order to obtain word-to-phoneme mappings. The following phonetic transcribers are used: LIA_PHON [12] for French, TXT2PHO [13] for German and Spanish, and our own transcriber for the rest of the languages (English, Italian, Hungarian). In the Hungarian and English pronunciation models, the automatically derived phonetic transcriptions are corrected by using word exception pronunciation dictionaries. For this purpose, the BEEP [14] dictionary is applied in the English experiments, whereas for Hungarian only the exceptionally pronounced POIs have been collected into an exception list.

The application of phoneme-based acoustic models requires a considerable amount of language-specific knowledge, such as grapheme-to-phoneme rules or manual phonemic transcriptions. Hence, grapheme-based models are also tested on four languages (German, Hungarian, Italian and Spanish), where acoustic models are built directly on letters (or graphemes) instead of phonemes [15]. In this approach, the grapheme "pronunciations" of even foreign, traditional and other morphs are obtained as their linear sequence of alphabetic letters; thus, no alternative pronunciations are allowed. However, the grapheme-based approach can be inaccurate for modeling the untypical pronunciation variations of grapheme sequences. Applying language-specific grapheme-based exception dictionaries – similarly to the phoneme-based ones – can significantly improve recognition accuracy. For example:

Deutsche = d o j c s e ;   (for Hungarian)
Deutsche = d o y c h e ;   (for English)
Auchan   = o s a n ;       (for Hungarian)
Auchan   = o s c h a n ;   (for German)
Toyota   = t o j o t a ;   (for Hungarian)
Toyota   = t e u o t a ;   (for German)

4.3 Context dependency model

As Equation (1) shows, triphone context expansion is performed after the integration of the higher-level knowledge sources. Context dependency is modeled across word boundaries, with respect to inter-word optional silences as well.

4.4 Acoustic Models

Speaker-independent, decision-tree state-clustered, cross-word tied triphone models were trained using ML (Maximum Likelihood) estimation [16]. Three-state left-to-right HMMs were applied with GMMs (Gaussian Mixture Models) associated with the states. The acoustic models were trained for each language from the related database according to Table 1. The number of states was in the range of 800-5200 depending on the actual language, and 10-15 Gaussians were used per state. All the feature types detailed in Section 4.7 were used, and blind channel equalization [17] was also applied.

Table 3: The HMM state numbers of the grapheme- and phoneme-based acoustic models.

            English  French  German  Hungarian  Italian  Spanish
  Phoneme   0.9k     4.8k    5.2k    0.8k       1.2k     3.6k
  Grapheme  –        –       4.8k    1.8k       3.8k     3.8k

Context-dependent grapheme-based acoustic models, called "trigraphones", were also trained, similarly to the phoneme-based triphone acoustic models, for the four languages mentioned in Section 4.2. By default, the phonemic questions used in the decision tree construction were simply converted to graphemic questions as in [15]. The resulting HMM state numbers of the phoneme- and grapheme-based models are shown in Table 3.

4.5 Off-line recognition network construction

The WFST (Weighted Finite State Transducer) [18] recognition network is computed on the HMM level:

\[
H \circ \mathrm{wpush}\Bigl(\mathrm{min}\bigl(\mathrm{det}\bigl(C \circ \mathrm{det}(L \circ G)\bigr)\bigr)\Bigr) \tag{1}
\]

Here, \(L \circ G\) is the phoneme-level model, \(C \circ \mathrm{det}(L \circ G)\) is the triphone-level model, and the full expression is the HMM-level model.

In Equation (1), G (Grammar) denotes the word-level language model, L (Lexicon) is the lexicon of words and their pronunciations, C is the context-dependency transducer and H (HMM dictionary) is the lexicon of triphones and their HMM states. The '∘' symbol denotes the composition operator that carries out cross-level transformations between the models, and 'det', 'min' and 'wpush' denote further optimization steps (determinization, minimization and weight pushing) [18].
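In pseudocode, the construction of Equation (1) proceeds inside-out; compose, determinize, minimize and push_weights below are assumed stand-ins for the corresponding operations of a WFST toolkit (e.g. OpenFst), not a real API:

```python
# Pseudocode mirroring Equation (1); the four operations are placeholders
# for a WFST toolkit's composition, determinization, minimization and
# weight-pushing routines.
def build_recognition_network(H, C, L, G):
    phoneme_level = compose(L, G)                            # L o G
    triphone_level = compose(C, determinize(phoneme_level))  # C o det(L o G)
    return compose(H, push_weights(minimize(determinize(triphone_level))))
```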

4.6 Models for multiple languages

The sequence of WFST operations used for building the recognition networks (1) is language-independent; therefore, our main task was to construct the H, C, L and G transducers for each language. The structure of the H, C and L transducers is well defined in [18] and [19]. Their construction is quite straightforward if the language-specific acoustic models and pronunciation rules are given (see Sections 4.4 and 4.2).


The construction of the language models (G) has been discussed in Section 4.1. However, there are some language-specific subproblems that still have to be handled. For instance, in Hungarian, the target destinations appear as subjects or adverbs in place of the navigation-related keywords. Hence, accusative and adverbial suffixes have to be removed from the ends of POIs, for example: 'moziba' (to the cinema) or 'áruházhoz' (to a store). These suffixes usually have a couple of alternatives ('ba/be', 'hoz/hez/höz') according to the position of the back ('a', 'á', 'o', 'ó', 'u', 'ú') and front vowels ('e', 'é', 'i', 'í', 'ö', 'ő', 'ü', 'ű') in the actual word. The lexical form of POIs can be extracted by a simple rule-based software tool that can choose the right suffix alternative with 95% accuracy for our POI dictionary.
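The sketch below illustrates how such a vowel-harmony rule can be expressed. It is a simplification under our own assumptions ('i'/'í' treated as transparent, exceptions ignored), not the actual tool used in the system:

```python
# Simplified Hungarian vowel harmony for the '-hoz/-hez/-höz' suffix family.
BACK            = set("aáoóuú")
FRONT_ROUNDED   = set("öőüű")
FRONT_UNROUNDED = set("eé")   # 'i'/'í' are treated as transparent here

def allative_suffix(word: str) -> str:
    """Pick 'hoz', 'hez' or 'höz' from the last non-transparent vowel."""
    for ch in reversed(word.lower()):
        if ch in BACK:
            return "hoz"
        if ch in FRONT_ROUNDED:
            return "höz"
        if ch in FRONT_UNROUNDED:
            return "hez"
    return "hez"  # fallback for words containing only transparent vowels

print("áruház" + allative_suffix("áruház"))  # áruházhoz
print("mozi" + allative_suffix("mozi"))      # mozihoz ('i' skipped, 'o' wins)
```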

4.7 Feature Extraction

In order to automatically recognize speech in an environment filled with real-life noises, the choice of the front-end processing stage can be crucial. Multiple feature extraction methods have been developed for this purpose. However, the general experience is that if a technique performs well in certain noise conditions, it can be suboptimal in other noise or high-SNR conditions. So, real-life noises always call for a re-evaluation of acoustic feature extraction techniques. This section presents the advanced and baseline methods included in our comparative test.

The Mel Frequency Cepstral Coefficients (MFCC) front-end is a widely used feature extraction method implemented in multiple ways. We tested the variants included in HTK (the Hidden Markov Model Toolkit) [16], the front-end of the SPHINX [20] speech recognition system, and our own version implemented in the VOXerver recognition software (by Aitia International Inc.), which we also used in the recognition tests. The major difference between the three MFCC front-ends is in the procedure that reduces the convolutive distortions caused by the transmission channel. The HTK and SPHINX systems use CMN (Cepstral Mean Normalization), while our implementation applies an adaptive technique based on BEQ (Blind Equalization) [17].

Perceptual Linear Prediction (PLP) [21] is also a quite popular feature extraction method, because it is considered a more noise-robust solution. Therefore, we added the HTK implementation to our tests.

The Perceptual Minimum Variance Distortionless Response (PMVDR) [22] is based on a procedure that estimates the transfer characteristic from a signal's spectrum by computing an upper spectral envelope. A special transformation called frequency bending is applied to the FFT spectra instead of a filtering step. The MVDR spectrum comes from the LP coefficients, calculated in a similar way to the PLP method. This front-end also uses BEQ to reduce convolutive distortions.
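For illustration, a generic MFCC front-end with CMN can be put together in a few lines. This is a sketch with assumed generic parameters and a hypothetical input file, not the exact settings of HTK, SPHINX or the VOXerver:

```python
# MFCC extraction with cepstral mean normalization on 8 kHz telephone audio.
import librosa

y, sr = librosa.load("query.wav", sr=8000)             # hypothetical test file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=256, hop_length=80)  # 32 ms window, 10 ms shift
mfcc_cmn = mfcc - mfcc.mean(axis=1, keepdims=True)     # CMN: remove channel bias
```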


The Power Normalized Cepstral Coefficients (PNCC) [23] front-end is a recently introduced technique, similar to MFCC, but the Mel-scale transformation is replaced by Gammatone filters [24] simulating the behavior of the cochlea. Furthermore, it includes a step called medium-time power bias removal to increase robustness. The bias vector is calculated using the arithmetic-to-geometric mean ratio, in a way that estimates the speech quality reduction caused by noise.
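The core idea can be demonstrated numerically: the arithmetic-to-geometric mean ratio of frame powers shrinks as an additive noise floor fills the gaps between speech peaks. The sketch below is a deliberate simplification of PNCC's medium-time, per-channel processing, with made-up power values:

```python
# Arithmetic-to-geometric mean ratio as a rough speech-quality indicator.
import numpy as np

def am_gm_ratio(power: np.ndarray) -> float:
    am = power.mean()
    gm = np.exp(np.log(power).mean())   # geometric mean via the log domain
    return am / gm

clean = np.array([1e-4, 1e-3, 5.0, 8.0, 1e-3])  # peaky, speech-like powers
noisy = clean + 1.0                             # additive noise floor
print(am_gm_ratio(clean))   # large ratio: strong peaks over near-silence
print(am_gm_ratio(noisy))   # much closer to 1: noise flattened the frames
```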

4.8 Evaluation

One-pass decoding was performed with the frame-synchronous WFST decoder VOXerver, developed in our laboratories. The RTF (Real Time Factor) of the decoding process was adjusted to be roughly equal (0.2-0.4 on a 2 GHz, 2-core CPU) across all languages using standard pruning techniques. Standard word recognition accuracy (WACC) was measured to evaluate the general performance of each ASR system, whereas the efficiency of POI retrieval was estimated by measuring the word recognition accuracy of the POI-related words (WACC,POI).
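For reference, word recognition accuracy is commonly computed as WACC = (N − S − D − I)/N from a Levenshtein alignment of the reference and hypothesis word sequences. A self-contained sketch follows (our own illustration; the paper does not detail its scoring tool):

```python
# Word accuracy from the minimal number of substitutions/deletions/insertions.
def word_accuracy(ref: list[str], hyp: list[str]) -> float:
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                                   # delete all of ref[:i]
    for j in range(m + 1):
        d[0][j] = j                                   # insert all of hyp[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return 1.0 - d[n][m] / n                          # d[n][m] = min(S + D + I)

ref = "where is the nearest airport".split()
hyp = "where is the airport".split()
print(word_accuracy(ref, hyp))  # 0.8: one deletion out of five reference words
```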

5 Results and discussion

In this section, we discuss the results according to the various aspects of the speech recognition tests. First, we compare the phoneme-based CFG and PCFG 3-gram models for the six languages. Then, grapheme-based pronunciation and acoustic modeling are compared to the classical phoneme-based approach on the four suitable languages. Finally, we discuss the impact of the various feature extraction methods using both CFG and 3-gram models. The average results of the tests were weighted with the relative number of test recordings of each language.

5.1 CFG vs. PCFG 3-gram

In this test, we compared the phoneme-based CFG and PCFG 3-gram models for the six languages in the case of complete match – meaning that there were no OOV (Out Of Vocabulary) words or out-of-grammar expressions in the test recordings. This test can also be interpreted as a comparison of the targeted languages. The results are shown in Table 4, where the highest average scores are emphasized. The complete match condition suggested that the CFG model would perform much better; however, the more flexible PCFG 3-gram model performed nearly as well as the CFG. The applied feature extraction technique was the internal MFCC method of the VOXerver. The results of the different languages were similar, except for the English model, which significantly underperformed the other five. This was probably caused by the relatively small training database we had for acoustic model training.


Table 4: The word recognition accuracy results of the phoneme-based CFG and PCFG 3-gram models (with an average POI vocabulary size of 8k) for the six languages, and the average accuracies.

WACC [%]   English      French       German       Hungarian    Italian      Spanish      Average
           All   POI    All   POI    All   POI    All   POI    All   POI    All   POI    All   POI
CFG        50.4  35.1   74.2  80.0   72.4  89.1   70.9  66.9   77.4  81.8   86.3  84.9   70.7  68.5
PCFG       51.0  50.4   78.5  85.2   74.9  82.6   68.9  64.5   66.1  83.3   81.4  80.3   68.2  68.9

Another possible explanation is that the mapping of training and test words to their phonetic counterparts was obtained independently, using different methods.

We performed another test on the Hungarian CFG and PCFG 3-gram models. Initially, all test sentences were included in – i.e., they were all expected by – the language models. In this test, some of the sentence structures were removed from the training data, so several test sentences were no longer covered by the language models; these became unexpected sentences and caused the matching rate (expected test sentences / all test sentences) to drop. The flexibility of the CFG and PCFG models was evaluated by decreasing this matching rate in three steps. As Figure 2 (a) shows, the recognition accuracy decreased as the test data became more unexpected for the language models. The 3-gram showed an increasing advantage over the CFG model as the matching rate decreased, because the PCFG has a more flexible structure for recognizing unexpected word sequences. The difference between the CFG and PCFG models was only measured to be significant (signed-rank Wilcoxon test) if the matching rate was under 0.8.

Summarizing the comparison of the CFG and PCFG models: the CFG gave a slightly better average recognition result in the case of a perfect match to the test database, while the PCFG N-gram performed better with a smaller sentence structure database and was more flexible with unexpected sentences.

The Hungarian CFG model was also tested using POI dictionary sizes varying linearly from 1000 to 21000 in 20 steps, keeping the complete match condition for all models. A power-law regression function (2) was fitted to the measured recognition accuracy values, see Figure 2 (b).

\[
W_{ACC,POI} \approx 87.1385 \cdot (N_{POI})^{-0.0286}\ [\%] \tag{2}
\]

Fig. 2: a) The performance of the Hungarian CFG and PCFG 3-gram methods at different matching rates (the ratio of the number of test sentence structures covered by the language models to the number of all test sentences); b) The WACC,POI accuracy and its approximating power-law regression function for the Hungarian CFG model in the N_POI range from 1000 to 21000.

According to Equation (2), the POI-related word recognition accuracy shows a slowly decreasing tendency in the range N_POI ∈ [1000, 21000]. During the WFST network construction, the memory requirements grew almost linearly with the number of POIs, but constructing a PCFG model needed significantly more memory than the CFG model.
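Fitting such a power law reduces to ordinary least squares in log-log space; a sketch with made-up data points follows (only the method, not the numbers, matches the experiment):

```python
# Power-law fit: W = a * N^b is linear after taking logarithms.
import numpy as np

n_poi = np.array([1000, 5000, 9000, 13000, 17000, 21000])
wacc  = np.array([71.5, 70.2, 69.6, 69.1, 68.8, 68.5])   # illustrative values

b, log_a = np.polyfit(np.log(n_poi), np.log(wacc), 1)
print(f"W_ACC,POI ~ {np.exp(log_a):.4f} * N_POI^({b:.4f})")  # cf. Equation (2)
```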

5.2 Grapheme-based models vs. phoneme-based models

Grapheme-based language models are tested for German, Hungarian, Italian and Spanish, in comparison with the results of the previous section. The applied feature extraction was again the internal MFCC method of the VOXerver. In line with the negative experiences reported in [15], the English and French languages were excluded from this series of grapheme-based experiments. Table 5 contains the recognition results comparing the grapheme-based models to the phoneme-based ones, and Figure 3 shows the average accuracies of the two methods.

Table 5: The word recognition accuracy results of the grapheme- and phoneme-based CFG models of the four languages, and the average accuracies.

WACC [%]   German       Hungarian    Italian      Spanish      Average
           All   POI    All   POI    All   POI    All   POI    All   POI
Grapheme   63.8  82.6   69.2  58.4   76.6  82.6   79.8  75.8   70.8  65.2
Phoneme    72.4  89.1   70.9  66.9   77.4  81.8   86.3  84.9   73.1  72.0

A significant part of the POIs contains names of multinational enterprises and international brands from all over the world. The ratio of foreign words in the test recordings was negligible for German, Italian and Spanish; therefore, the application of the exception dictionaries did not affect the grapheme-based accuracies. However, our Hungarian test recordings had 15% foreign POIs, for example Erste Bank, McDonald's, Renault, etc. In this case, the grapheme-based exception dictionary (see Section 4.2) significantly improved the Hungarian grapheme-based accuracies (from WACC,All = 66.4% and WACC,POI = 51.1% to the scores in Table 5).


The Hungarian, German and Spanish grapheme-based models performed somewhat worse than their phoneme-based counterparts. However, the Italian model performed better than its phoneme-based alternative; the weaker performance of the phoneme-based Italian model was probably caused by the manually collected and possibly incomplete phonetic transcription rules. According to the average results in Figure 3, the accuracy of the grapheme-based models comes close to that of the phoneme-based ones.

Fig. 3: The average recognition accuracies of the phoneme- and grapheme-based CFG models for the four languages.

Not surprisingly, the POI names are harder to recognize, because they typically contain more "out of language" expressions than ordinary words. The grapheme-based exception dictionaries could help in every language, not just for Hungarian. This could be investigated in the future by obtaining more test recordings that contain foreign POI names.

5.3 Comparison of feature extraction methods

In order to compare the acoustic front-ends, a series of experiments was performed using the standard settings of the previous tests. Both the CFG and the PCFG 3-gram language models were included in the recognition tests. Table 6 shows the word recognition accuracies (WACC) of all words, grouped by the CFG and PCFG language model types. The three highest average word recognition scores are emphasized for both language modeling types to indicate the best-performing front-end techniques. In addition, Figure 4 displays the average word scores for both language models.

The different front-end methods generally performed better with the CFG-based language models, similarly to the results in Section 5.1. For both model types, the same three feature extraction methods gave the best performance. According to Figure 4, the PNCC technique proved to be the best, with the MFCC and PLP front-ends of the HTK only slightly behind. Surprisingly, the noise-robust methods could not outperform the standard MFCC techniques, although the difference can vary greatly across tasks and implementations.


Table 6: Recognition accuracies of the CFG and the PCFG 3-gram models using various feature extraction methods.

WACC,CFG [%]   English  French  German  Hungarian  Italian  Spanish  Average
mfcc-htk       59.8     69.2    71.2    73.3       79.7     78.1     72.5
mfcc-sphinx    48.3     72.8    68.1    70.1       75.8     77.6     68.9
mfcc-vox       50.4     74.2    72.4    70.9       77.4     86.3     70.7
plp-htk        59.5     70.6    68.7    71.7       79.7     83.6     71.9
pmvdr          49.7     68.5    68.7    72.1       78.7     76.0     70.3
pncc           50.8     75.3    65.0    73.5       77.4     83.6     71.9

WACC,PCFG [%]  English  French  German  Hungarian  Italian  Spanish  Average
mfcc-htk       55.0     76.7    70.6    70.8       68.4     74.9     69.4
mfcc-sphinx    42.7     75.6    64.7    68.7       66.6     74.3     66.1
mfcc-vox       51.0     78.5    74.9    68.9       66.1     81.4     68.2
plp-htk        57.8     75.3    71.2    69.9       69.2     81.4     69.6
pmvdr          45.0     75.6    71.2    69.6       68.9     73.8     68.2
pncc           50.6     78.1    72.4    71.4       72.6     83.1     70.4

Fig. 4: Averaged word accuracies of the CFG and the PCFG language models.


The POI recognition rates were also evaluated but are not discussed here, because they correlate well with the global word accuracies and the previous results. A more detailed analysis of the noise robustness of the applied front-end techniques can be found in [25].

6 Conclusions

This paper introduced our work on designing a multiple language continuous speech recognition system for a navigation service. The aim was to achieve good recognition accuracy for point of interest words in voice navigational queries, even in the presence of real-life traffic noise. A serious challenge was that no task-specific training databases were available for language modeling. Instead, we applied and compared two language model construction methods: CFG modeling and PCFG N-grams. Both methods gave suitable solutions for this specific speech recognition task. As expected, the completely matched CFG model yielded the highest recognition accuracies. Increasing the number of POI expressions from 1000 to 21000 in the Hungarian CFG model, we found that a power-law approximation can be applied between the POI word recognition accuracy and the number of POIs in the examined range. The search space was much larger for the PCFG N-gram model than for the CFG approach, which resulted in a minor recognition accuracy reduction in the fully matched tests. However, a significant advantage of the PCFG model was observed when the test contained more out-of-grammar and OOV elements.

The classical phoneme-based pronunciation modeling approach was compared to a customized grapheme-based pronunciation modeling technique for the German, Hungarian, Italian and Spanish languages. The results showed that building an exception dictionary for foreign POI-related words can significantly improve the grapheme-based models, making them almost competitive with the phoneme-based ones.

Noise robustness was addressed by applying various feature extraction methods. The results showed a great variation in the recognition accuracies of the acoustic front-ends. The recognition scores of the PNCC proved to be the highest, with the MFCC and PLP front-ends of the HTK only slightly behind.

The results suggest that achieving high word recognition accuracy is possible if cooperative speakers can be assumed – that is, if the users raise navigational questions in the usual way, so that the ratio of OOV and out-of-grammar expressions is minimal. The cost-efficient language and acoustic modeling techniques and feature extraction methods worked well for all six languages, although the English accuracy should be improved. Hopefully, with the growth of the acoustic model training database, the results of our English model will improve and approach the accuracies of the other five.


Acknowledgment. Our research was partially funded by: OM-00102/2007, OMFB-00736/2005, TÁMOP-4.2.2-08/1/KMR-2008-0007, TÁMOP-4.2.1/B-09/1/KMR-2010-0002, KMOP-1.1.1-07/1-2008-0034.

References

1. Chelba, C., Schalkwyk, J., Brants, T., Ha, V., Harb, B., Neveitt, W., Parada, C., Xu, P.: Query Language Modeling for Voice Search. In: Proceedings of the 2010 IEEE Workshop on Spoken Language Technology (2010)
2. Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Garrett, M., Strope, B.: Google Search by Voice: A Case Study. (2010)
3. Yu, D., Ju, Y.-C., Wang, Y.-Y., Zweig, G., Acero, A.: Automated Directory Assistance System – from Theory to Practice. In: INTERSPEECH 2007, pp. 2709-2712. (2007)
4. Lee, S. H., Chung, H., Park, J. G., Young, H.-Y., Lee, Y.: A Commercial Car Navigation System using Korean Large Vocabulary Automatic Speech Recognizer. In: APSIPA 2009 Annual Summit and Conference, pp. 286-289. (2009)
5. Kim, D.-S., Lee, S.-Y., Kil, R. M.: Auditory Processing of Speech Signals for Robust Speech Recognition in Real-World Noisy Environments. In: IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, pp. 55-69. (1999)
6. Milner, B.: A Comparison of Front-End Configurations for Robust Speech Recognition. In: ICASSP 2002, pp. 797-800. (2002)
7. European Language Resource Association, http://catalog.elra.info/
8. Hungarian Telephone Speech Database (Magyar Telefonos Beszéd Adatbázis), http://alpha.tmit.bme.hu/speech/hdbMTBA.php
9. Center for Spoken Language Research of Colorado: Phoenix parser for spontaneous speech, http://cslr.colorado.edu/~whw/phoenix/
10. Harris, T. K.: Bi-grams Generated from Phoenix Grammars and Sparse Data for the Universal Speech Interface. In: Language and Statistics Class Project, CMU (May 2002)
11. CMU Language Compilation Suite for Dialog Systems, https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/logios/
12. A text phonetization system for the MBROLA system, http://tcts.fpms.ac.be/synthesis/mbrola/tts/French/liaphon.tar.gz
13. A German TTS front-end for the MBROLA system, http://www.sk.uni-bonn.de/forschung/phonetik/sprachsynthese/txt2pho
14. British English pronunciation dictionary, http://mi.eng.cam.ac.uk/comp.speech/Section1/Lexical/beep.html
15. Kanthak, S., Ney, H.: Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition. In: ICASSP 2002, pp. 845-848. (2002)
16. Young, S., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book (for HTK version 3.4), March 2009, http://htk.eng.cam.ac.uk
17. Mauuary, L.: Blind Equalization in the Cepstral Domain for robust Telephone based Speech Recognition. In: Proc. of EUSIPCO'98, Vol. 1, pp. 359-363. (1998)
18. Mohri, M., Pereira, F., Riley, M.: Weighted Finite-State Transducers in Speech Recognition. In: Computer Speech and Language 16(1), pp. 69-88. (2002)
19. Szarvas, M.: Efficient Large Vocabulary Continuous Speech Recognition Using Weighted Finite-state Transducers – The Development of a Hungarian Dictation System. PhD Thesis, Department of Computer Science, Tokyo Institute of Technology, Tokyo. (March 2003)
20. CMU Speech Recognition Engine (SphinxTrain 1.0), http://www.speech.cs.cmu.edu/
21. Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. In: Journal of the Acoustical Society of America, 87(4), pp. 1738-1752. (1990)
22. Yapanel, U. H., Hansen, J. H. L.: A New Perspective on Feature Extraction for Robust In-Vehicle Speech Recognition. In: EUROSPEECH 2003, pp. 1281-1284. (2003)
23. Kim, C., Stern, R. M.: Feature Extraction for Robust Speech Recognition using a Power-Law Nonlinearity and Power-Bias Subtraction. In: INTERSPEECH 2009, pp. 28-31. (2009)
24. Patterson, R. D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C., Allerhand, M. H.: Complex sounds and auditory images. In: Cazals, Y., Demany, L., Horner, K. (eds.) Auditory Physiology and Perception, pp. 429-446. Pergamon Press, Oxford (1992)
25. Sárosi, G., Mozsáry, M., Mihajlik, P., Fegyó, T.: Comparison of Feature Extraction Methods for Speech Recognition in Noise-Free and in Traffic Noise Environment. In: Proc. of the 6th Conference on Speech Technology and Human-Computer Dialogue, Braşov, Romania (2011)
