An Open-Source Speech Recognizer for Brazilian Portuguese with a Windows Programming Interface

Patrick Silva, Pedro Batista, Nelson Neto, and Aldebaro Klautau

Universidade Federal do Pará, Signal Processing Laboratory, Rua Augusto Corrêa 1, 66075-110 Belém, PA, Brazil
{patrickalves,pedro,nelsonneto,aldebaro}@ufpa.br
http://www.laps.ufpa.br

Abstract. This work is part of an effort to develop a speech recognition system for Brazilian Portuguese. The resources for the training and test stages of this system, such as corpora, a pronunciation dictionary, and language and acoustic models, are publicly available. Here, an application programming interface is proposed in order to facilitate using the open-source Julius speech decoder. Performance tests are presented, comparing the developed systems with commercial software.

Key words: Continuous speech recognition, Brazilian Portuguese resources, application programming interface.

1 Introduction

This contribution describes ongoing work on the development of a speech recognition system for Brazilian Portuguese (BP). The goal is to implement a large-vocabulary continuous speech recognition (LVCSR) system capable of operating in real time using Julius [1]. The Julius speech recognition engine is a high-performance LVCSR decoder for speech-related applications. However, incorporating Julius into an application targeting the Windows operating system is not trivial. This work presents a simple application programming interface (API) to ease the task of developing applications based on speech recognition.

2 Developed Resources

In [2], the authors described a grapheme-to-phoneme converter with stress determination for BP, based on a set of rules. The resulting phonetic dictionary has over 65 thousand words. It is well known that the amount of training data can drastically influence the performance of an automatic speech recognition system. This motivates the inclusion of a brief description of our BP speech and text corpora [3]. LapsStory is a corpus based on audiobooks, with 6 speakers and more than 14 hours of audio. The files were manually segmented to create smaller audio files. The acoustic environment in audiobooks is tightly controlled, so the audio files have no audible noise and a high signal-to-noise ratio. The Spoltech corpus was also used, but it required revision and correction of multiple files, as described in [4]. After revision, the modified Spoltech comprises 477 speakers, corresponding to 4.3 hours of audio. Spoltech's acoustic environment was not controlled, in order to allow for background conditions that would occur in application environments. To complement Spoltech and LapsStory and obtain a benchmark reference corpus for testing BP systems, LapsBenchmark is under construction. Currently, this corpus has data from 35 speakers with 20 sentences each, corresponding to approximately 54 minutes of audio. The LapsBenchmark recordings were made on computers using common desktop microphones, and the acoustic environment was not controlled.
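The rule-based grapheme-to-phoneme conversion of [2] is not reproduced here, but its flavor can be illustrated with a deliberately tiny rule set. The rules and phone symbols below are simplified stand-ins for illustration only, not the actual converter or its phone inventory:

```python
# Tiny illustrative rule-based G2P for BP (simplified; not the converter of [2]).
# Each rule maps a grapheme sequence to a single phone symbol.
RULES = [
    ("lh", "L"),   # "lh" digraph -> palatal lateral
    ("nh", "J"),   # "nh" digraph -> palatal nasal
    ("ch", "S"),   # "ch" digraph -> postalveolar fricative
    ("ss", "s"),   # double "s" -> single voiceless fricative
]

def g2p(word):
    phones = []
    i = 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):   # longest-match rule lookup
                phones.append(phone)
                i += len(graph)
                break
        else:
            phones.append(word[i])          # default: grapheme maps to itself
            i += 1
    return " ".join(phones)

print(g2p("chave"))   # → "S a v e"
print(g2p("ninho"))   # → "n i J o"
```

A real converter such as the one in [2] additionally handles context-dependent rules and stressed-vowel determination, which a dictionary of 65 thousand words requires.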

3 Baseline system

The front-end consists of the widely used 12 mel-frequency cepstral coefficients (MFCCs). The initial acoustic models for the 39 phones (38 monophones and a silence model) used 3-state left-to-right HMMs. After that, cross-word triphone models were built from the monophone models, and a decision tree was designed for tying triphones with similar characteristics. After each step, the models were reestimated with the Baum-Welch algorithm via the HTK tools [5]. A more detailed description of the HMM training process can be found in [4].

The SRILM tools [6] were used to build the n-gram language models. The language models were trained on 1.6 million sentences, approximately 25.8 million words, with around 65 thousand distinct words, drawn from CETENFolha and our BP text corpora. A trigram language model was designed with Kneser-Ney smoothing and achieved a perplexity of 169, measured on a set of 10 thousand sentences unseen during the training phase.

HTK was used for acoustic model training, and two speaker adaptation techniques were adopted [5]. The first is maximum likelihood linear regression (MLLR), which computes a set of transformations that aim to reduce the mismatch between an initial model set and the adaptation data. The second technique uses the maximum a posteriori (MAP) approach. In the HTK implementation, the mean of each Gaussian is updated by MAP using the mean of the prior distribution, the weights of the Gaussians, and the speaker's data.
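As a reminder of how a perplexity figure like the 169 reported above is obtained, the sketch below evaluates a toy model on held-out words. The probabilities are invented for illustration; the paper's value comes from SRILM on the actual held-out sentences:

```python
import math

def perplexity(log_probs, n_words):
    # PP = exp(-(1/N) * sum_i log p(w_i | history)), log_probs in natural log.
    return math.exp(-sum(log_probs) / n_words)

# Toy example: a model that assigns probability 0.1 to each of 4 held-out words.
lp = [math.log(0.1)] * 4
print(round(perplexity(lp, 4), 2))  # → 10.0
```

Lower perplexity means the language model is, on average, less "surprised" by the test sentences, which usually translates into fewer recognition errors.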

4 Proposed application programming interface

While promoting the widespread development of applications based on speech recognition, the authors noted that making resources such as language models available was not enough. These resources are useful for speech scientists, but most programmers demand an easy-to-use API. Hence, it was necessary to complement the documentation and code that are part of the Julius package. An API for Windows was developed in the C++ programming language with the Common Language Runtime specification, which enables communication between the languages supported by the .NET platform. The proposed API allows real-time control of Julius and of the audio interface. Julius is responsible for the speech parameterization and the recognition itself, and was chosen as the decoder used by the proposed API because of its flexible license.

Fig. 1. API interaction model.

Since the API supports Component Object Model (COM) automation, most high-level languages (e.g., C#, Visual Basic, and others) can be used to write applications. As shown in Fig. 1, the API enables applications to control the Julius decoder: an application can load the acoustic and language models to be used, start and stop recognition, and receive events and recognition results with an associated confidence measure. In essence, a speech application must create, load, and activate a grammar, which indicates what type of utterances to recognize, i.e., dictation or a context-free grammar.
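The interaction model of Fig. 1 can be sketched as below. All names (`Recognizer`, `load_models`, `on_result`, the model file names) are hypothetical stand-ins chosen for this illustration, not the actual API, whose real interface is a C++/CLR class consumed from .NET languages:

```python
# Hypothetical sketch of the Fig. 1 interaction model (not the real API).
class Recognizer:
    def __init__(self):
        self.on_result = None    # callback: the application receives events here
        self.running = False

    def load_models(self, acoustic_model, language_model):
        # In the real API, the model files would be handed to Julius.
        self.acoustic_model = acoustic_model
        self.language_model = language_model

    def start(self):
        self.running = True
        # A real engine would stream microphone audio to the decoder;
        # here we fake a single result to show the event flow.
        if self.on_result:
            self.on_result("bom dia", 0.87)  # hypothesized text + confidence

    def stop(self):
        self.running = False

# Application side: register a handler, load models, run recognition.
rec = Recognizer()
results = []
rec.on_result = lambda text, conf: results.append((text, conf))
rec.load_models("laps.am", "laps.lm")
rec.start()
rec.stop()
print(results)
```

The callback-based design mirrors the event flow of Fig. 1: the application stays in control of the session while results arrive asynchronously with their confidence measures.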

5 Experimental results

The acoustic model was initially trained with the LapsStory corpus and then adapted with the MLLR and MAP techniques using the Spoltech corpus [3]. Both techniques were used in supervised (offline) mode. A comparison was made with another decoder, HDecode (part of HTK), and with the commercial software IBM ViaVoice [7]. The LapsBenchmark corpus was used to evaluate the systems. The performance measures used were the correct words rate (CWR) and the real-time factor (xRT). The evaluation was carried out in two stages: speaker independent and speaker dependent models. The results are shown in Table 1. IBM ViaVoice requires a speaker adaptation session. Hence, for the first stage, the speaker adaptation process for ViaVoice was carried out using the voices of six speakers, 3 men and 3 women, corresponding to 10 minutes of audio. We could not measure the xRT value for ViaVoice due to the adopted procedure for invoking the recognizer in batch mode.

Table 1. Systems comparison using speaker independent and dependent models.

              Independent models    Dependent models
Decoder       CWR(%)    xRT         CWR(%)    xRT
Julius        60.42     0.7         77.7      0.7
HDecode       70.63     0.9         84.6      0.8
IBM ViaVoice  70.71     -           82.7      -

Note that HDecode and ViaVoice had almost the same performance, while Julius had worse performance but was faster. Two speakers were used in the speaker dependent stage, with 10 minutes of audio from each voice. The MLLR and MAP adaptation techniques were again used for the models adopted in Julius and HDecode. In the case of IBM ViaVoice, the standard adaptation process (guided by the software) was adopted. HDecode showed satisfactory results when compared to ViaVoice. Although the Julius decoder presented the worst performance in both tests, if tuned more carefully it can eventually outperform HDecode, as described in [8].
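The two measures in Table 1 can be computed as sketched below. The definitions are assumptions for illustration: CWR as an HTK-style "percent correct" (reference words minus deletions and substitutions, over reference words), and xRT as decoding time over audio duration. The counts are toy numbers, not the paper's data:

```python
def correct_words_rate(n_ref_words, n_deletions, n_substitutions):
    # Assumed HTK-style percent correct: insertions are not penalized.
    return 100.0 * (n_ref_words - n_deletions - n_substitutions) / n_ref_words

def real_time_factor(decode_seconds, audio_seconds):
    # xRT < 1.0 means the decoder runs faster than real time.
    return decode_seconds / audio_seconds

# Toy numbers: 1000 reference words, 150 deleted, 100 substituted;
# 7 s of CPU time to decode 10 s of audio.
print(correct_words_rate(1000, 150, 100))  # → 75.0
print(real_time_factor(7.0, 10.0))         # → 0.7
```

An xRT of 0.7, as reported for Julius in Table 1, thus means decoding one second of audio takes 0.7 seconds, leaving headroom for a real-time application.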

6 Final considerations

This paper presented a speech recognition system for BP, including the resources, a Julius-based engine, and an API for Windows. The resources were made publicly available [9] and allow for reproducing results across different sites. Future work includes making Julius compliant with the Microsoft Speech API while using BP.

References

1. Julius, http://julius.sourceforge.jp/en/. Visited in May, 2009.
2. A. Siravenha, N. Neto, V. Macedo, and A. Klautau, "Uso de regras fonológicas com determinação de vogal tônica para conversão grafema-fone em português brasileiro," 7th International Information and Telecommunication Technologies Symposium, 2008.
3. P. Silva, N. Neto, and A. Klautau, "Novos recursos e utilização de adaptação de locutor no desenvolvimento de um sistema de reconhecimento de voz para o português brasileiro," XXVII Simpósio Brasileiro de Telecomunicações, 2009.
4. P. Silva, N. Neto, A. Klautau, A. Adami, and I. Trancoso, "Speech recognition for Brazilian Portuguese using the Spoltech and OGI-22 corpora," XXVI Simpósio Brasileiro de Telecomunicações (SBrT), 2008.
5. S. Young, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department, 2006.
6. A. Stolcke, "SRILM - an extensible language modeling toolkit," Proc. Intl. Conf. on Spoken Language Processing, Denver, Colorado, 2002.
7. IBM ViaVoice, http://www.ibm.com/software/speech/. Visited in September, 2009.
8. T. Rotovnik, M. S. Maucec, B. Horvat, and Z. Kacic, "A comparison of HTK, ISIP and Julius in Slovenian large vocabulary continuous speech recognition," 7th International Conference on Spoken Language Processing (ICSLP), 2002.
9. FalaBrasil, http://www.laps.ufpa.br/falabrasil. Visited in October, 2009.
