Distributed Speech Recognition for Information Retrieval on Mobile Devices

Tom Brøndsted, Lars Bo Larsen, Børge Lindberg, Morten Rasmussen, Zheng-Hua Tan, Haitian Xu
Speech and Multimedia Communication (SMC), Department of Communication Technology, Aalborg University, Denmark

{tb,lbl,bli,mr,zt,hx}@kom.aau.dk

ABSTRACT
This paper presents a distributed speech recognition (DSR) system for information retrieval on mobile devices. The overall prototype system applies state-of-the-art DSR techniques and knowledge-based Information Retrieval (IR) processing for spoken question answering. A configurable DSR system is implemented on the basis of the ETSI-DSR advanced front-end and the SPHINX IV recognizer. Acoustic modeling for the Danish language and language modeling for a soccer test domain are presented in detail. Though the prototype system can only answer queries and questions within the chosen domain, it is flexible enough to be ported to other domains.

Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]: Microprocessor/microcomputer applications, real-time and embedded systems, signal processing systems.

General Terms
Distributed speech recognition, intelligent search engines, language models.

Keywords
Distributed speech recognition, intelligent search engines, language models.

1. INTRODUCTION
Distributed Speech Recognition (DSR) has a wide range of applications because of its advantages both in reducing the computational requirements and power consumption of devices at the client side and in facilitating effortless updates of the core part of the recognizer at the server side [1]. The present paper applies DSR to accessing IR services on remote servers from mobile devices such as Personal Digital Assistants (PDAs) and mobile phones.

In collaboration with the Software Intelligence and Security Research Centre (SISRC), Esbjerg, Denmark, a prototype system [2] has been built from two main components: 1) an IR system using sophisticated IR techniques with a specialized question-answering engine, and 2) a DSR system [3] implemented on the basis of the ETSI-DSR advanced front-end [4] and the SPHINX IV recognizer [5]. The IR engine has been designed to perform better than conventional search engines based on, e.g., the vector-space model (i.e. it returns fewer and more relevant web documents), provided that the documents to be retrieved are written in a certain language (in our case Danish) and belong to a certain domain. For the test application, the domain of soccer has been chosen [2]. The DSR system is likewise language-dependent (it employs acoustic models trained on Danish speech), and the acoustic search is currently constrained by a grammar designed explicitly for the domain. Alternative language modeling has been analyzed by comparing various text corpora.

Other systems with a similar aim include [13], which is built from commercial components, and a system for retrieval of Chinese (Mandarin) broadcast news [14], [15]. The SmartWeb project [16], [17] is an effort to access semantic-web information using spoken queries.

This paper focuses on the second component of the overall system, the DSR system. It is organized as follows. Section 2 presents the system architecture, Section 3 describes the speech recognizer, and Section 4 concludes.

2. SYSTEM ARCHITECTURE
The system employs a fully distributed architecture that includes both the speech recognizer and the IR system. The overall architecture of the system is depicted in Fig. 1.


Figure 1: Architecture of the speech-enabled information retrieval system.

The speech recognizer is implemented using the DSR scheme, where the speech recognition processing is split into client-based DSR front-end feature extraction and server-based DSR back-end recognition on a standard PC [1]. Speech features are transmitted from the DSR client to the DSR server. Subsequently, the output, in the form of either the best result or an N-best list, is sent from the remote server back to the client and from there to the IR server. To enable configuration of the DSR back-end recognizer, commands are transmitted back and forth between the DSR client and the DSR server. The distributed speech recognizer is described in Section 3.

The IR server is described in more detail in [2]. It receives the user's questions about the chosen soccer test domain in text form and determines whether a question should be sent to the search engine or can be answered directly using domain-dependent Information Extraction (IE)-like techniques [6]. The server continuously updates its knowledge base by retrieving sports news from the WWW in the form of RSS articles from diverse news providers [7]; these documents are stored in the database for retrieval. Figure 2 shows the application user interface.

Figure 2: Screen shot of the soccer IR service. Input (top left field) can be entered either by voice or by stylus. Search results are presented as text and shown below the input field.
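The paper does not specify the wire protocol beyond ETSI feature packages and a result/command channel, so the round trip can only be illustrated under assumptions. The sketch below invents a length-prefixed framing and a JSON-encoded N-best reply; fake_dsr_server merely stands in for the real back-end, and all names and sizes are illustrative:

import json
import socket
import struct
import threading

def recv_exact(sock, n):
    """Read exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def send_msg(sock, payload):
    # Hypothetical framing: 4-byte big-endian length, then the payload.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(sock):
    return recv_exact(sock, struct.unpack(">I", recv_exact(sock, 4))[0])

def fake_dsr_server(sock):
    # Stand-in for the DSR back-end: consume one utterance worth of encoded
    # features and answer with an N-best list of text hypotheses.
    _features = recv_msg(sock)
    send_msg(sock, json.dumps(["hvem scorede for AaB",
                               "hvem scorede for AGF"]).encode("utf-8"))

client, server = socket.socketpair()
threading.Thread(target=fake_dsr_server, args=(server,), daemon=True).start()
send_msg(client, bytes(1380))  # placeholder for ETSI MFCC+VAD feature packages
nbest = json.loads(recv_msg(client).decode("utf-8"))
print(nbest[0])  # best hypothesis; the client forwards this text to the IR server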

3. DSR SYSTEM
This section describes the DSR system, its acoustic modeling, and its language modeling.

3.1 Distributed Architecture
As illustrated in Fig. 3, the DSR system [3] is developed on the basis of the ETSI-DSR advanced front-end (AFE) [4], [8] and the SPHINX IV recognizer [5]. The AFE client-side module extracts noise-robust Mel-Frequency Cepstral Coefficient (MFCC) features, which, together with Voice Activity Detection (VAD) information, are encoded sequentially and packed into speech feature packages for network transmission.

Figure 3: The DSR architecture.

At the server side, the received speech feature packages are processed by the AFE server-side module. First, upon detection of transmission errors, error concealment is applied to reconstruct the affected features. The error-corrected packages are then decoded into a stream of cepstral features and VAD information. Finally, the cepstral features are processed by the SPHINX IV recognizer, which presents its result (either the best or the N-best hypotheses) at the utterance end, as detected from the VAD information, and transmits it back to the client. To increase the system's usability and flexibility, three typical recognition modes are provided, namely "isolated word recognition", "grammar-based recognition" and "large vocabulary recognition". Each mode is defined by a set of prototype files at the server side, and the settings may differ across groups of end-users.
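The ordering of these server-side steps can be summarized in a short control-flow sketch. All functions below are trivial stand-ins (the actual AFE back-end and SPHINX IV are far more involved); only the sequencing reflects the description above:

from dataclasses import dataclass

@dataclass
class Frame:
    cepstra: tuple   # one decoded MFCC vector
    is_speech: bool  # the VAD flag carried in the feature stream

def conceal_errors(packages):
    return packages  # stand-in: the real code repairs frames hit by channel errors

def decode_features(packages):
    # Stand-in: the real code de-quantizes ETSI feature packages into cepstra+VAD.
    return [Frame((0.0,) * 13, is_speech=i < 80) for i in range(100)]

def recognize(frames, nbest=1):
    return ["hvem scorede for AaB"][:nbest]  # stand-in for the SPHINX IV search

def handle_packages(packages, nbest=1):
    frames = decode_features(conceal_errors(packages))
    # A result is produced once VAD signals the utterance end (trailing
    # non-speech frames); until then, keep waiting for more packages.
    if frames and not frames[-1].is_speech:
        return recognize([f for f in frames if f.is_speech], nbest)
    return None

print(handle_packages(bytes(1380)))  # -> ['hvem scorede for AaB']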

A "Command Processor" is implemented at both the client and the server side to support the interchange of configuration commands. Potential commands include control commands to start or stop recognition, the choice of recognition mode, and commands providing feedback from the server to a client (e.g. the success or failure of a user request). The DSR client has been evaluated on an HP iPAQ h5550 with a 400 MHz XScale CPU and 128 MB of memory. With speech data sampled at 8 kHz, the client conducts the front-end processing at 0.82 times real time, i.e. processing one second of speech takes 0.82 seconds.
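The paper names the command categories but not a concrete syntax. The following sketch invents a minimal command set and a server-side dispatch for it; the command and mode names are illustrative only:

from enum import Enum, auto

class Command(Enum):
    START_RECOGNITION = auto()
    STOP_RECOGNITION = auto()
    SET_MODE = auto()  # choose one of the three recognition modes
    ACK = auto()       # server-to-client feedback: request succeeded
    ERROR = auto()     # server-to-client feedback: request failed

MODES = {"isolated", "grammar", "lvcsr"}  # selects per-mode prototype files

def handle_command(session, cmd, arg=""):
    """Server-side dispatch: update the session state, reply ACK or ERROR."""
    if cmd is Command.SET_MODE:
        if arg not in MODES:
            return Command.ERROR
        session["mode"] = arg
    elif cmd is Command.START_RECOGNITION:
        session["active"] = True
    elif cmd is Command.STOP_RECOGNITION:
        session["active"] = False
    else:
        return Command.ERROR
    return Command.ACK

session = {"mode": "grammar", "active": False}
assert handle_command(session, Command.SET_MODE, "lvcsr") is Command.ACK
assert handle_command(session, Command.SET_MODE, "dictation") is Command.ERROR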

3.2 Acoustic modeling
The acoustic models used in the system are created as part of an ongoing effort to build a Large Vocabulary Continuous Speech Recognition (LVCSR) system for Danish on the basis of the SPHINX IV speech recognition platform. The speech data set consists of 44 hours of recorded sentences selected from the SpeechDat(II) speech database [9], which contains telephone-quality speech sampled at 8 kHz and stored in A-law format. The models are trained using SphinxTrain and the SPHINX IV front-end.

The models trained are word-internal and between-word triphones based on 61 phones (including diphthongs). The triphones are modeled by three-state left-to-right continuous-density Hidden Markov Models (HMMs) with 16 Gaussians per state. They are trained by building binary decision trees that cluster similar triphone states in order to reduce the number of parameters to be trained; the resulting total number of clusters (or states) is approximately 5000. This improves the training of the triphones given the limited amount of training data.

The Language Model (LM) used is a statistical trigram model trained on the Korpus2000 corpus (20M words) [10] and the SpeechDat(II) text corpus (160k words) using Good-Turing discounting. One LM is first trained on each corpus, and the two LMs are then mixed so as to minimize the perplexity on a development text set [11]. Both Korpus2000 and SpeechDat(II) are very general corpora containing texts of various styles and topics; Korpus2000 is described in more detail in Section 3.4.

The above setting results in a baseline system with a word error rate (WER) of 37.6% when tested on 16483 words (1519 sentences) from a general domain.
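The LM mixing mentioned above can be pictured as a one-dimensional search for the interpolation weight that minimizes perplexity on the development text. The sketch below assumes the component models are available as word-probability functions and uses a plain grid search; the toy unigram tables merely stand in for the real trigram models:

import math

def perplexity(dev_words, prob):
    # Perplexity of a word-probability function over a development text.
    log2_sum = sum(math.log2(prob(w)) for w in dev_words)
    return 2.0 ** (-log2_sum / len(dev_words))

def best_mix_weight(dev_words, p1, p2, steps=100):
    # p_mix(w) = lam * p1(w) + (1 - lam) * p2(w); pick lam minimizing perplexity.
    return min((i / steps for i in range(1, steps)),
               key=lambda lam: perplexity(
                   dev_words, lambda w: lam * p1(w) + (1 - lam) * p2(w)))

# Toy unigram stand-ins for the Korpus2000 and SpeechDat(II) models:
def p_k2000(w):
    return {"kamp": 0.020, "mål": 0.010}.get(w, 1e-6)

def p_sdat(w):
    return {"kamp": 0.005, "mål": 0.030}.get(w, 1e-6)

print(best_mix_weight(["kamp", "mål", "kamp"], p_k2000, p_sdat))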

To improve the performance step by step, a number of experiments have been carried out [18]:

1) Add "word phrases", which essentially are sentences, to the training material, resulting in 48 hours of training speech.
2) Train models for filled pauses (a speaker saying "umm", etc.) and speaker noise (coughs, etc.), and include training material containing intermittent noise (door slams, etc.), which was previously excluded, resulting in 62 hours of training material. These context-independent models are likewise three-state left-to-right continuous-density HMMs with 16 Gaussians per state.
3) Force-align the transcriptions in order to select alternative pronunciations of words and to insert silence and speaker-noise markers into the transcriptions used for training.
4) Increase the bandwidth used when creating the cepstral coefficients from 300-3400 Hz to 150-3600 Hz.
5) Create gender-dependent models using only 3000 clustered states.

The WER after each step is shown in Table 1.

Table 1: The WER at each step of improvement.

    Method                      WER
    Baseline system             37.6 %
    Training material           37.4 %
    Filler models               35.2 %
    Force alignment             34.7 %
    Increased bandwidth         33.0 %
    Gender-dependent models     31.9 %

When recognizing 9260 words (843 sentences) from a general domain using the female models, the WER is 30.5%; similarly, the WER when recognizing 7223 words (676 sentences) using the male models is 33.8%. These tests run at 0.7 times real time on a P4 2.8 GHz computer.
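For reference, the WER figures above follow the usual definition: the word-level Levenshtein distance between reference and hypothesis, divided by the number of reference words. A minimal implementation:

def wer(ref, hyp):
    """Word error rate via a one-row edit-distance dynamic program."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            diag, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   diag + (r != h))  # substitution or match
    return d[-1] / len(ref)

print(wer("hvem scorede for AaB".split(), "hvem scorede AaB".split()))  # 0.25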

3.3 Vocabulary and grammar
The target domain for the recognition task is Danish soccer news. The current version of the system employs a rule grammar for modeling user questions. The rule grammar requires rather fixed sentence structures and covers only a subset of the questions that can actually be handled by the IR/IE component of the system. The vocabulary size is around 700 words, including slightly more than 400 player and team names. As player and team names are handled as two syntactic classes (implying that all players, and likewise all teams, are syntactically equivalent and equally likely), the perplexity of the grammar is high in spite of the small vocabulary.

Applying the rule grammar described above for recognition, a sentence accuracy of 90% is achieved when testing on 260 sentences (approximately 14 minutes) recorded by one person using a Plantronics M2500 Bluetooth headset, with the advanced DSR front-end [8] used for both training and test. It should be noted, however, that the embedded microphone of the PDA can also be used as the input device.
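As a toy illustration of why class-based rules yield high perplexity despite a small vocabulary, consider the following invented grammar (the actual rules and name lists of the system are not reproduced here):

import itertools

GRAMMAR = {
    "S":    [["hvem", "scorede", "for", "TEAM"],
             ["hvornår", "spiller", "TEAM", "mod", "TEAM"]],
    "TEAM": [["AaB"], ["Brøndby"], ["FCK"]],
}

def expand(symbol):
    """Yield every word sequence derivable from a grammar symbol."""
    if symbol not in GRAMMAR:  # terminal word
        yield [symbol]
        return
    for rule in GRAMMAR[symbol]:
        expansions = [list(expand(s)) for s in rule]
        for parts in itertools.product(*expansions):
            yield [w for part in parts for w in part]

sentences = [" ".join(ws) for ws in expand("S")]
print(len(sentences))  # 3 + 3*3 = 12 sentences from a 9-word vocabulary
# With ~400 names in a class, every TEAM (or PLAYER) slot multiplies the number
# of equally likely continuations, which is what drives the perplexity up.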

3.4 Alternative language modeling
Alternative linguistic search-space constraints, based either on a more elaborate rule grammar or on conventional statistical language modeling, are under consideration. For this purpose, the documents making up the collection of the IR search engine (called the DB in Fig. 1) have been analyzed to clarify the extent to which they have the characteristics of a distinct sublanguage (a relatively closed vocabulary, etc.). Though there is no simple 1:1 relation between these documents and the input the system expects from the user, certain statistics (unigram, bigram, and trigram counts) can be assumed largely to carry over, since the documents have been collected with the expectation that they contain answers to the questions asked by the users.

In IE, the term sublanguage [12] denotes a language restricted to a certain semantic (typically "technical") domain and characterized by absence of lexical ambiguity, a relatively closed vocabulary, and more. Traditionally, the language encountered in weather reports is considered a distinct sublanguage. Fig. 4 depicts the number of unique words (unigrams) observed as a function of the "running text", i.e. the total number of words encountered, for three text corpora: 1) sentences from the Danish Korpus2000 corpus, 2) the documents making up the collection of the IR/IE component of the current system, and 3) 48 daily weather reports sampled in February-March 2003.




Figure 4: Number of unique words (unigrams) as a function of the total number of words in 1) Korpus2000, 2) the collection of sports news, and 3) the weather reports.

Korpus2000 consists of sentences in random order taken from various sources such as newspapers, publishers, schools, and various organizations [10]. As seen in Fig. 4, Korpus2000 shows no flattening tendency in the number of unique words observed over the first 60000 words of the corpus (which corresponds to the amount of text currently available for sports). The weather reports, by contrast, show a flattening tendency very quickly (after approximately 5000 words, not easily seen in the figure), as expected for a distinct sublanguage. The wording of such reports is stereotyped ("light to moderate", "moderate to fresh", "south to south-east", "east to north-east", etc.), and even the corresponding curve of unique bigrams as a function of the total number of words shows a slight flattening tendency after about 35000 words. Using Korpus2000 and the weather reports as references, we can place the sports news somewhere between a distinct sublanguage and heterogeneous, general Danish usage, unfortunately closer to the latter than to the former. This tendency becomes even more marked when we compute bigram statistics for the three corpora (Fig. 5), analogous to the unigram statistics in Fig. 4.
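The curves in Figs. 4 and 5 are straightforward to compute: scan the running text and record the number of distinct n-grams after every fixed number of words. A sketch, with a deliberately stereotyped stand-in text in place of the actual corpora:

def growth_curve(words, n=1, step=1000):
    """(position, unique n-gram count) after every `step` running words."""
    seen, curve = set(), []
    for i in range(len(words) - n + 1):
        seen.add(tuple(words[i:i + n]))
        if (i + 1) % step == 0:
            curve.append((i + 1, len(seen)))
    return curve

# A weather-report-like text flattens immediately; a varied corpus would not.
text = ("let til jævn vind fra syd til sydøst " * 2000).split()
print(growth_curve(text, n=1, step=4000))  # unique unigrams stay at 7
print(growth_curve(text, n=2, step=4000))  # unique bigrams stay at 8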


Figure 5: Bigram statistics analogous to the unigram statistics in Fig. 4.

We conclude that we have far from sufficient sports-news data for training a bigram or trigram LM for the domain, and tests indicate that the problem cannot be solved even by heavy smoothing techniques (e.g. Good-Turing) or by adaptation of LMs trained on larger corpora (like Korpus2000). In any case, we would have to compensate for the absence in the data of the syntactic patterns found in questions (yes/no and wh-questions). Furthermore, a large number of foreign names are encountered in this domain, and since new players emerge all the time these must be dealt with in some way: foreign names are hard to transcribe automatically, and users may not know how to pronounce them properly. An elaborate rule grammar, manually or semi-automatically designed to accept sentences from the sports data as well as questions derived from those sentences, has so far been ruled out, as it would unquestionably result in a search space of too high perplexity and a network too big for a conventional linear Viterbi search.

4. CONCLUSIONS
In this paper we have presented a mobile information access system with spoken query answering on the basis of DSR technology. The system adopts a distributed architecture in which the speech recognizer and the knowledge-based IR system are located on different servers. The prototype system is capable of answering spoken queries posed via a PDA using a rule grammar. The domain analysis also shows that far more sports-news data would be needed to train a statistical language model for the domain.

5. ACKNOWLEDGEMENTS
The project was supported by the Center for TeleInFrastruktur (CTIF) through the project POSH under CTIF's C3 program. Furthermore, the authors thank their colleagues Henrik Legind Larsen, Daniel Ortiz-Arroyo and Dan Saugstrup at the Software Intelligence and Security Research Centre (SISRC) for providing the IR server of the system.

6. REFERENCES
[1] Tan, Z.-H., Dalsgaard, P., and Lindberg, B.: Automatic speech recognition over error-prone wireless networks, Speech Communication, 47(1-2):220-242, 2005.
[2] Brøndsted, T., Larsen, H.L., Larsen, L.B., Lindberg, B., Ortiz-Arroyo, D., Tan, Z.-H., and Xu, H.: Mobile information access with spoken query answering, COST278 Final Workshop on Applied Spoken Language Interaction in Distributed Environments (ASIDE), Aalborg, Denmark, Nov. 2005.
[3] Xu, H., Tan, Z.-H., Dalsgaard, P., Mattethat, R., and Lindberg, B.: A configurable distributed speech recognition system, Biennial on DSP for in-Vehicle and Mobile Systems, Sesimbra, Portugal, Sep. 2005.
[4] ETSI Standard ES 202 212: Distributed speech recognition; extended advanced front-end feature extraction algorithm; compression algorithm, back-end speech reconstruction algorithm, Nov. 2003.
[5] The CMU Sphinx Group Open Source Speech Recognition Engines. http://cmusphinx.sourceforge.net/html/cmusphinx.php
[6] Larsen, H.L., and Yager, R.R.: The use of fuzzy relational thesauri for classificatory problem solving in information retrieval and expert systems, IEEE Trans. on Systems, Man, and Cybernetics, 23(1):31-41, 1993.
[7] Mathiassen, H., Nielsen, N.N., and Pedersen, A.: Mining tables from domain-specific HTML text, Information Retrieval Project Report, Aalborg University Esbjerg, 2005.
[8] 3GPP TS 26.243: ANSI-C code for the fixed-point distributed speech recognition extended advanced front-end, Dec. 2004.
[9] Lindberg, B.: SpeechDat, Danish FDB 4000 speaker database for the fixed telephone network, pp. 1-98, Mar. 1999.
[10] Korpus2000: www.korpus.dsl.dk, visited April 2006.
[11] Rasmussen, M.H., and Svendsen, M.T.: Large Vocabulary Continuous Speech Recognizer for Danish and Language Model Adaptation, Master's thesis, Aalborg University, 2005.
[12] Harris, Z.: Mathematical Structures of Language, Wiley-Interscience, New York, 1968.
[13] Schofield, E., and Zheng, Z.: A speech interface for open-domain question-answering, Proc. 41st Annual Meeting of the Association for Computational Linguistics, Jul. 2003, pp. 177-180.
[14] Chang, E., Seide, F., Meng, H.M., et al.: A system for spoken query information retrieval on mobile devices, IEEE Trans. Speech and Audio Processing, 10(8):531-541, 2002.
[15] Chen, B., Chen, Y.-T., Chang, C.-H., et al.: Speech retrieval of Mandarin broadcast news via mobile devices, Interspeech 2005, Lisbon, Portugal, Sep. 2005.
[16] Reithinger, N., and Sonntag, D.: An integration framework for a mobile multimodal dialogue system accessing the semantic web, Interspeech 2005, Lisbon, Portugal, Sep. 2005.
[17] SmartWeb: Mobile broadband access to the Semantic Web. http://www.smartweb-project.org
[18] Lindberg, B., Johansen, F.T., Warakagoda, N., et al.: A noise robust multilingual reference recogniser based on SpeechDat(II), ICSLP 2000, Beijing, Oct. 2000.
