A Keyword Based Interactive Speech Recognition System for Embedded Applications

Master's Thesis by
Iván Francisco Castro Cerón
Andrea Graciela García Badillo

June, 2011

School of Innovation, Design and Engineering
Mälardalen University
Västerås, Sweden

Supervisor: Patrik Björkman
Examiner: Lars Asplund

Abstract

Speech recognition has been an important area of research during the past decades. The usage of automatic speech recognition systems is rapidly increasing in different areas, such as mobile telephony, automotive, healthcare and robotics. However, despite the existence of many speech recognition systems, most of them use platform-specific and non-publicly available software. Nevertheless, it is possible to develop speech recognition systems using already existing open source technology. The aim of this master's thesis is to develop an interactive and speaker-independent speech recognition system. The system shall be able to identify predetermined keywords from incoming live speech and, in response, play audio files with related information. Moreover, the system shall be able to provide a response even if no keyword was identified. For this project, the system was implemented using PocketSphinx, a speech recognition library that is part of the open source Sphinx technology by Carnegie Mellon University. During the implementation of this project, the automation of different steps of the process was a key factor for a successful completion. This automation consisted of the development of different tools for the creation of the language model and the dictionary, two important components of the system. Similarly, the creation of the audio files to be played after identifying a keyword, as well as the evaluation of the system's performance, were fully automated. The tests performed show encouraging results and demonstrate that the system is a feasible solution that could be implemented and tested in a real embedded application. Despite the good results, possible improvements can still be made, such as the creation of a different phonetic dictionary to support other languages.

Keywords: Automatic Speech Recognition, PocketSphinx, Embedded Systems

Acknowledgements

This master's thesis represents the completion of a journey we decided to start two years ago. Studying in Sweden has been quite an experience and, for that, we would like to thank MDH for giving us the opportunity to study here. We would also like to express our gratitude to our Professor Lars Asplund for his guidance throughout the development of this project. Moreover, we would like to thank the Consejo Nacional de Ciencia y Tecnología (CONACYT) in México for the financial support provided through the scholarships we were both granted. We would also like to thank our families for their immense support and encouragement during all this time we have been away. Finally, we are grateful to life for having had the great opportunity of being able to accomplish one more goal together.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Outline

2 Background
  2.1 History
  2.2 Main challenges and Current Research Areas

3 Automatic Speech Recognition
  3.1 Speech Variability
    3.1.1 Gender and Age
    3.1.2 Speaker's Origin
    3.1.3 Speaker's style
    3.1.4 Rate of Speech
    3.1.5 Environment
  3.2 ASR Components
    3.2.1 Front End
    3.2.2 Linguistic Models
    3.2.3 Decoder
  3.3 Training for Hidden Markov Models
    3.3.1 Expectation-Maximization Algorithm
    3.3.2 Forward-Backward Algorithm
  3.4 Performance
    3.4.1 Word Error Rate

4 Design and Implementation
  4.1 The CMU Sphinx Technology
    4.1.1 Sphinx
    4.1.2 PocketSphinx
  4.2 System Description
  4.3 Keyword List
  4.4 Audio Files
    4.4.1 Creating the audio files
    4.4.2 Playing the audio files
  4.5 Custom Language Model and Dictionary
    4.5.1 Keyword-Based Language Model Generation
    4.5.2 Custom Dictionary Generation
    4.5.3 Support for other languages
  4.6 Decoding System
    4.6.1 Inputs
    4.6.2 Outputs
    4.6.3 Algorithm
    4.6.4 System Components

5 Tests and Results
  5.1 Main ASR application
    5.1.1 Performance Evaluation and KDA
    5.1.2 Test environment and setup
  5.2 Auxiliary Java Tools
    5.2.1 TextExtractor Tool
    5.2.2 DictGenerator Tool
    5.2.3 RespGenerator Tool
    5.2.4 rEvaluator Tool

6 Summary
  6.1 Conclusions
  6.2 Future Work

Bibliography

A Guidelines
  A.1 Installation of the speech recognizer project
  A.2 Running the speech recognition application
  A.3 Generating a new language model
  A.4 Generating a new dictionary
  A.5 Generating a new set of predefined responses
  A.6 Running the speech recognizer using the newly created files
  A.7 Evaluating the performance and measuring the KDA

B Source Code
  B.1 Recognizer
  B.2 TextExtractor
  B.3 DictGenerator
  B.4 respGenerator
  B.5 rEvaluator

List of Figures

2.1 DARPA Speech Recognition Benchmark Tests
2.2 Milestones and Evolution of the Technology
3.1 Signal processing to obtain the MFCC vectors
3.2 A typical triphone structure
3.3 Three state HMM
3.4 Example of phone and word HMMs representation
3.5 A detailed representation of a Multi-layer Perceptron
3.6 Tree Search Network using the A* algorithm
3.7 Sphinx training procedure
4.1 Keyword File Example
4.2 Sentences for Audio Files Example
4.3 Conversion of Sentences into Audio Files
4.4 Statistical Language Model Toolkit
4.5 TextExtractor and CMUCLMTK
4.6 Dictionary Generator Tool
4.7 SphinxTrain Tool
4.8 Decoding Algorithm
4.9 Component Diagram
5.1 Typical Test Setup
5.2 KDA Test Setup
5.3 Automated Test Process
5.4 Test Sentences
5.5 rEvaluator Tool
5.6 TextExtractor Operation Mode 1
5.7 TextExtractor Operation Mode 2
5.8 DictGenerator Operation Modes 1 and 2
5.9 Generated Dictionary Files
5.10 RespGenerator: Execution Messages
5.11 rEvaluator: Execution Messages
A.1 ASR Project Folder Structure
A.2 Message
A.3 Decoding Example
A.4 Decoding Example
A.5 Keywords and URLs
A.6 Example of a Set of Predefined Responses
A.7 Responses Structure
A.8 OGG Files
A.9 Verification File

List of Tables

3.1 Example of the 3 first words in an ASR dictionary
4.1 Dictionary Words Not Found: Causes and Solutions
4.2 Phonetical Dictionary
5.1 Keyword List
5.2 KDA Report Format
5.3 Overall Test Results
5.4 Computer Specifications
A.1 KDA Report Format

Chapter 1

Introduction

Automatic speech recognition (ASR) systems have caught the attention of researchers since the middle of the 20th century. From the initial attempts to identify isolated words using a very limited vocabulary, to the latest advancements in processing continuous speech composed of thousands of words, ASR technology has grown progressively. The rise of robust speech recognition systems during the last two decades has triggered a number of potential applications for the technology. Existing human-machine interfaces, such as keyboards, can be enhanced or even replaced by speech recognizers that interpret voice commands to complete a task. This kind of application is particularly important because speech is the main form of communication among humans; for this reason, it is much simpler and faster to complete a task by providing vocal instructions to a machine rather than typing commands using a keyboard [21]. The convergence of several research disciplines such as digital signal processing, machine learning and language modeling has allowed ASR technology to mature and currently be used in commercial applications. Current speech recognition systems are capable of identifying and processing commands with many words. However, they are still unable to fully handle and understand a typical human conversation [9]. The performance and accuracy of existing systems allow them to be used in simple tasks like telephone-based applications (call centers, user authentication, etc.). Nevertheless, as the accuracy and the vocabulary size increase, the computational resources needed to implement a typical ASR system grow as well. The amount of computational power required to execute a fully functional speech recognizer can easily be supplied by a general-purpose personal computer, but it might be difficult to execute the same system on a portable device [21]. Mobile applications represent a very important area where ASR systems can be used, for example in GPS navigation systems for cars, where the user can provide navigation commands by voice, or in a speech-based song selector for a music player. However, a large number of existing portable devices lack the processing power needed to execute

a high-end speech recognizer. For this reason, there is a large interest in designing and developing flexible ASR systems that can run on both powerful and resource-constrained devices [8]. Typically, the speech recognition systems designed for embedded devices use restrictive grammars and do not support large vocabularies. Additionally, the accuracy achieved by those systems tends to be lower when compared to a high-end ASR. This compromise between processing power and accuracy is acceptable for simple applications; however, as ASR technology becomes more popular, the expectations regarding accuracy and vocabulary increase [9]. One particular area of interest in the research community is to optimize the performance of existing speech recognizers by implementing faster, lighter and smarter algorithms. For example, PocketSphinx is an open source, continuous speech recognition system for handheld devices developed by Carnegie Mellon University (CMU). As its name suggests, this system is a heavily simplified and optimized version of the existing tool Sphinx, developed by the same university. There are other speech recognizers developed for embedded devices, but most of these systems are not free and their source code is not publicly available. For this reason, it is difficult to use them for experimentation and research purposes [8].

1.1 Outline

This master's thesis report is organized as follows. Chapter 2 reviews part of the history of speech recognition, the main challenges faced during the development of ASR systems, as well as some of the current areas of research. Chapter 3 presents some of the sources of variability within human speech and their effects on speech recognition; it also describes the main components of an ASR system and how they are commonly trained and evaluated. Chapter 4 describes in detail the design and implementation of the Keyword Based Interactive Speech Recognition System. This chapter also discusses the reasons for selecting PocketSphinx as the main decoder for this project. Chapter 5 presents the tests and results obtained after evaluating the implemented Keyword Based Interactive Speech Recognition System. Chapter 6 presents the summary and conclusions of the thesis and discusses possibilities for future work.


Chapter 2

Background

This chapter provides an overview of the history of speech recognition and ASR systems, as well as some of the main challenges regarding ASR systems and the current areas of research.

2.1 History

The idea of building machines capable of communicating with humans using speech emerged during the last couple of decades of the 18th century. However, these initial trials were more focused on building machines able to speak, rather than listening to and understanding human commands. For example, using the existing knowledge about the human vocal tract, Wolfgang von Kempelen constructed an "Acoustic-Mechanical Speech Machine" with the intention of replicating speech-like sounds [10]. During the first half of the 20th century, one of the main research fields related to language recognition was the spectral analysis of speech and its perception by a human listener. Some of the most influential documents were published by Bell Laboratories, and in 1952 they built a system able to identify isolated digits for a single speaker. The system had 10 speaker-dependent patterns, one associated with each digit, which represented the first two vowel formants for each digit. Although this scheme was rather rudimentary, the accuracy levels achieved were quite remarkable, reaching around 95 to 97% [11]. The final part of the 1960s saw the introduction of feature extraction algorithms, such as the Fast Fourier Transform (FFT) developed by Cooley and Tukey, the Linear Predictive Coding (LPC) developed by Atal and Hanauer, and the Cepstral Processing of speech introduced by Oppenheim in 1968. Warping, also known as non-uniform time scaling, was another technique presented to handle differences in the speaking rate and segment length of the input signals. This was accomplished by shrinking and stretching the input


signals in order to match stored patterns. The introduction of Hidden Markov Models (HMMs) brought significant improvements to the existing technology. Based on the work published by Baum and others, James Baker applied the HMM framework to speech recognition in 1975 as part of his graduate work at CMU. In the same year, Jelinek, Bahl and Mercer applied HMMs to speech recognition while they were working for IBM. The main difference was the type of decoding used by the two systems: Baker's system used Viterbi decoding, while IBM's system used stack decoding (also known as A* decoding) [10]. The use of statistical methods and HMMs heavily influenced the design of the next generation of speech recognizers. In fact, the use of HMMs started to spread in the 1980s, and they have since become by far the most common framework used by speech recognizers. During the late 1980s, artificial neural networks (ANNs) were introduced into speech recognition research; the main idea behind this technology was to identify phonemes or even complete words using multilayer perceptrons (MLPs). More modern research has tried to apply MLPs as a complementary tool for HMMs; in other words, some modern ASR systems use hybrid MLP/HMM models in order to improve their accuracy. The Defense Advanced Research Projects Agency (DARPA) in the US played a major role in the funding, creation and improvement of speech recognition systems during the last two decades of the 20th century. DARPA funded and evaluated many systems by measuring their accuracy and, most importantly, their word error rate (WER), as illustrated by Figure 2.1.

Figure 2.1: DARPA Speech Recognition Benchmark Tests.


Many different tasks were created with different difficulty levels. For example, some tasks involved continuous speech recognition using structured grammar, such as in military commands, while other tasks involved recognition of conversational speech using a very large vocabulary with more than 20 thousand words. The DARPA program also helped to create a number of speech databases used to train and evaluate ASR systems. Some of these databases are the Wall Street Journal, the Switchboard and the CALLHOME databases; all of them are publicly available. Driven by the DARPA initiatives, ASR technology evolved and several research programs were created in different parts of the world. Speech recognition systems eventually became very sophisticated and capable of supporting large vocabularies with thousands of words and phonemes. However, the WER for conversational speech can still be considered high, at nearly 40% [9]. During the first decade of the 21st century, the research community has focused on the use of machine learning in order to not only recognize words, but also to interpret and understand human speech. In this regard, text-to-speech (TTS) synthesis systems have become popular in order to develop machines able to speak to humans. Nonetheless, designing and building machines able to mimic a person seems to be a challenge for future work [10]. Figure 2.2 depicts important milestones of speech recognition.

Figure 2.2: Milestones and Evolution of the Technology.


2.2 Main challenges and Current Research Areas

During the past five decades, ASR technology has faced many challenges. Some of these challenges have been solved; others are still present even in the most sophisticated systems. One of the most important roadblocks of the technology is the high WER for large-vocabulary continuous speech recognition systems. Without knowing the word boundaries in advance, the whole process of recognizing words and sentences becomes much harder. Moreover, failing to correctly identify word boundaries will certainly produce incorrect word hypotheses and incorrect sentences. This means that sophisticated language models are needed in order to discard incorrect word hypotheses [21]. Conversational and natural speech also contains co-articulatory effects; in other words, every sound is heavily influenced by the preceding and following sounds. In order to discriminate and correctly determine the phones associated with each sound, the ASR requires complex and detailed acoustic models. However, the use of large language and acoustic models typically increases the processing power needed to run the system. During the 1990s, even some common workstations did not have enough processing power to run a large-vocabulary ASR system [21]. It is well known that increasing the accuracy of a speech recognition system increases its processing power and memory requirements. Similarly, these requirements can be relaxed by sacrificing some accuracy in certain applications. Nevertheless, it is much more difficult to improve accuracy while decreasing computational requirements. Speech variability, in all of its possible categories, represents one of the more difficult challenges for speech recognition technology. Speech variability can be traced to many different sources, such as the environment, the speaker or the input equipment. For this reason, it is critical to create robust systems that are able to overcome speech variability regardless of its source [3]. Another important challenge for the technology is the improvement of the training process. For example, it is important to use training data that is similar to the type of speech used in a real application. It is also valuable to use training data that helps the system to discard incorrect patterns found during the process of recognizing speech. Finally, it is desirable to use algorithms and training sets that can adapt in order to discriminate incorrect patterns. As can be seen, there are many areas of opportunity and ideas for improving existing ASR technology. These areas can be organized into more specific groups, for example:

• It is important to improve the ease of use of existing systems; in other words, users need to be able to use the technology in order to find more applications.


• ASR systems should be able to adapt and learn automatically. For instance, they should be able to learn new words and sounds.

• The systems should be able to minimize and tolerate errors. This can be achieved by designing robust systems that can be used in real-life applications.

• Recognition of emotions could play a very important role in improving speech recognition, as emotion is one of the main causes of variability [3].


Chapter 3

Automatic Speech Recognition

This chapter discusses the main sources of speech variability and their effects on the accuracy of speech recognition. Additionally, it describes the major components of a typical ASR system and presents some of the algorithms used during the training phase, as well as the most common method of evaluation.

3.1 Speech Variability

One major challenge for speech recognition is speech variability. Due to human nature, a person is capable of producing a vast variety of sounds. Since each person has a different vocal tract configuration, shape and length (articulatory variations), it is impossible for two persons to speak alike; not even the same person can reproduce the same waveform after repeating the same word. However, it is not only the vocal tract configuration but several other factors that can create different effects on the resulting speech signal.

3.1.1 Gender and Age

The speaker's gender is one of the main sources of speech variability. It makes a difference in the produced fundamental frequencies, as men and women have different vocal tract sizes. Similarly, age contributes to speech variability: ASR for children becomes particularly difficult since their vocal tracts and folds are smaller compared to those of adults. This has a direct impact on the fundamental frequency, as it becomes higher than adult frequencies. Also, according to [1], it has been shown that children under ten years old increase the duration of vowels, resulting in variations of the formant locations and fundamental frequencies. In addition, children might also lack correct pronunciation and vocabulary, or produce spontaneous speech that is grammatically incorrect.


3.1.2 Speaker's Origin

Variations exist when recognizing native and non-native speech. Speech recognition among native speakers does not present significant acoustic differences and therefore does not have a big impact on the ASR system's performance. However, this might not be the case when recognizing a foreign speaker. Factors such as the non-native speaker's level of knowledge of the language, vocabulary, accent and pronunciation represent variations of the speech signal that could impact the system's performance. Moreover, if the speech models used consider only native speech data, the system might not behave correctly.

3.1.3 Speaker's style

Apart from the origin of the speaker, speech also varies depending on the speaking style. A speaker might reduce the pronunciation of some phonemes or syllables during casual speech. On the other hand, portions of speech containing complex syntax and semantics tend to be articulated more carefully by speakers. Moreover, a phrase can be emphasized and the pronunciation can vary due to the speaker's mood. Thus, the context also determines how a speech signal is produced. Additionally, the speaker can introduce word repetitions or expressions that denote hesitation or uncertainty.

3.1.4 Rate of Speech

Another important source of variability is the rate of speech (ROS), since it increases the complexity, within the recognition process, of mapping the acoustic signal to the phonetic categories. Therefore, timing plays an important role, as an ASR system can have a higher error rate due to a higher speaking rate if the ROS is not properly taken into account. In contrast, a lower speaking rate may or may not affect the ASR system's performance, depending on factors such as whether the speaker over-articulates or introduces pauses within syllables.

3.1.5 Environment

Other sources of speech variability reside in the transmission channel: for example, distortions can be introduced into the speech signal by the microphone arrangement. Also, the background environment can produce noise that is introduced into the speech signal, and even the room acoustics can modify the speech signal received by the ASR system. From physiological to environmental factors, different variations exist when speaking and between speakers, such as gender or age. Therefore, speech recognition deals with


properly overcoming all these types of variations and their effects during the recognition process.

3.2 ASR Components

An ASR system comprises several components, and each of them should be carefully designed in order to have a robust and well-implemented system. This section presents the theory behind each of these components in order to better understand the entire process of designing and developing an ASR system.

3.2.1 Front End

The front end is the first component of almost every ASR system, as it is the first to see the speech signal as it comes into the system. This component is in charge of performing the signal processing on the received speech signal. By the time the input signal arrives at the front end, it has already passed through an acoustic environment, where it might have suffered from diverse effects, such as additive noise or room reverberation [15]. Thus, proper signal processing needs to be performed in order to enhance the signal, suppress possible sources of variation and extract its features to be used by the decoder.

Feature Extraction

The speech signal is usually translated into a spectral representation comprising acoustic features. This is done by compressing the signal into a smaller set of N spectral features, where the size of N most likely depends on the duration of the signal. Therefore, in order to make a proper feature extraction, the signal needs to be properly sampled and treated prior to the extraction. The feature extraction process should be able to extract the features that are critical to the recognition of the message within the speech signal. The speech waveform comprises several features, but the most important feature dimension is the so-called spectral envelope. This envelope contains the main features of the articulatory apparatus and is considered the core of speech analysis for speech recognition [9]. The features within the spectral envelope are obtained from a Fourier transform, a Linear Predictive Coding (LPC) analysis or a bank of bandpass filters. The most commonly used ASR features are the Mel-Frequency Cepstral Coefficients (MFCCs); however, there also exist LPC coefficients, Line Spectral Frequencies (LSFs) and many others. Despite the extensive number of feature sets, they all intend to capture enough spectral information to recognize the spoken phonemes [16].


MFCC vectors

MFCCs can be seen as a way to translate an analog signal into digital feature vectors of typically 39 numbers. However, this process requires the execution of several steps in order to obtain these vectors. Figure 3.1 depicts the series of steps required to obtain the feature vectors from the input speech signal.

Figure 3.1: Signal processing to obtain the MFCC vectors

Due to the nature of human speech, the speech signal is attenuated as the frequencies increase. Also, the speech signal is subject to a falloff of -6 dB when passing through the vocal tract [16]. Therefore, the signal needs to be preemphasized, which means that a preemphasis filter (a high-pass filter) is applied to the signal. This increases the amplitude of the signal for the high frequencies while decreasing the lower-frequency components. Then, given the original speech signal x, a new preemphasized sample at time n is given by the difference

x'_n = x_n - a x_{n-1}    (3.1)

where a is a factor that is typically set to a value near 0.9 [16]. Once the signal has been preemphasized, it is partitioned into smaller frames of about 20 to 30 milliseconds, sampled every 10 milliseconds. The higher the sample rate, the better it is to model fast speech changes [16]. The frames are overlapped in order to avoid missing information that could lie between the limits of each frame. This process is called windowing, and it is used to minimize the effects of partitioning the signal into small windows. The most used window function in ASR is the Hamming window [13]. This function is described by

w_n = α - (1 - α) \cos( 2nπ / (N - 1) )    (3.2)

where w is a window of size N and α has a value of 0.54 [18]. The next step is to compute the power spectrum by means of the Discrete Fourier Transform (DFT), using the Fast Fourier Transform (FFT) algorithm to minimize the required computation. Then, using a mel filter bank, the power spectrum is mapped onto the

mel scale in order to obtain a mel-weighted spectrum. The reason for using this scale is that it is a non-linear scale that approximates the non-uniform human auditory system [9]. A mel filter bank consists of a number of overlapped triangular bandpass filters whose center frequencies are equidistant on the mel scale. This scale is linear up to 1000 Hz and logarithmic thereafter [18]. Next, a logarithmic compression is applied to the data obtained from the mel filter bank, resulting in log-energy coefficients. These coefficients are then orthogonalized by a Discrete Cosine Transform (DCT) in order to compress the spectral data into a set of low-order coefficients, also known as the mel-cepstrum or cepstral vector [18]. This is described by the following:

C_i = \sum_{k=1}^{M} X_k \cos[ (π i / M)(k - 1/2) ],    i = 1, 2, ..., M    (3.3)

where C_i is the i-th MFCC, M is the number of cepstrum coefficients and X_k represents the log-energy output of the k-th mel filter [2]. This vector is later normalized in order to account for the distortions that may have occurred due to the transmission channel. Additionally, it is common to compute the first and second order differentials of the cepstrum sequence to obtain the delta-cepstrum and the delta-delta cepstrum. Furthermore, the delta-energy and delta-delta-energy parameters, which are the first and second order differentials of the power spectrum, are also added to the feature vector [9]. In this way, an ASR feature vector typically consists of 13 cepstral coefficients, 13 delta values and 13 delta-delta values, which gives a total of 39 features per vector [16].
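To make the chain of transformations above concrete, the following is a minimal Python/NumPy sketch of the MFCC pipeline (pre-emphasis, framing, Hamming windowing, FFT power spectrum, mel filter bank, log compression and DCT). It is illustrative only: the frame size, FFT length, filter count and the random test signal are assumptions rather than values taken from this thesis, and the delta and delta-delta features are omitted.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=16000, frame_ms=25, step_ms=10,
         n_filters=26, n_ceps=13, preemph=0.95):
    """Compute MFCC vectors for a 1-D speech signal (simplified sketch)."""
    # 1. Pre-emphasis: boost high frequencies (cf. Eq. 3.1).
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # 2. Framing: ~25 ms windows sampled every ~10 ms (overlapping frames).
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    frames = np.stack([signal[i * step: i * step + frame_len] for i in range(n_frames)])

    # 3. Windowing with a Hamming window (cf. Eq. 3.2).
    frames = frames * np.hamming(frame_len)

    # 4. Power spectrum via the FFT.
    nfft = 512
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # 5. Mel filter bank: triangular filters equally spaced on the mel scale.
    low_mel, high_mel = 0.0, 2595 * np.log10(1 + (sample_rate / 2) / 700)
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # 6. Log-energy of each filter output, then DCT (cf. Eq. 3.3) -> cepstral vector.
    feats = np.log(np.dot(power, fbank.T) + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_ceps]

if __name__ == "__main__":
    # One second of random "audio" yields roughly 98 frames of 13 coefficients each.
    fake_audio = np.random.randn(16000)
    print(mfcc(fake_audio).shape)
```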

3.2.2 Linguistic Models

Language Models and N-grams

After processing the input sound waves in the front end, the ASR generates a series of symbols representing the possible phonemes in a piece of speech. In order to form words using those phonemes, the speech recognition system uses language modeling. This modeling is characterized by a set of rules regarding how each word is related to other words. For example, a group of words cannot be put together arbitrarily; they have to follow the grammatical and syntactical rules of a language. This part of speech processing is necessary in order to determine the meaning of a spoken message during the later stages of speech understanding [9]. Most ASR systems use a probabilistic framework in order to find out how words are related to each other. In a very general sense, the system tries to determine the most recently received word based on a group of previously received words. For instance, what would be the next word in the following sentence?

I would like to make a collect...

Some of the possible words would be "call", "phone-call" or "international", among others. In order to determine the most probable word in the sentence, the ASR needs to use a probability density function P(W), where W represents a sequence of words w_1, w_2, ..., w_n. The density function assigns a probability value to a word sequence depending on how likely it is to appear in a speech corpus [11]. Using the same example as before, the probability of the word "call" occurring after "I would like to make a collect" would be given by:

P(W) = P("I", "would", "like", "to", "make", "a", "collect", "call")    (3.4)

If we substitute the words with mathematical expressions we have:

P(W) = P(w_1, w_2, w_3, ..., w_{n-1}, w_n)    (3.5)

P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) ... P(w_n | w_{n-1}, ..., w_1)    (3.6)

As can be seen, the probability function requires using all of the words in a given sentence. This might represent a big problem when using large vocabularies, especially when dealing with very long sentences. A large speech corpus would be required in order to compute the probability of each word occurring after any combination of other words. In order to minimize the need for a large corpus, the N-gram model is used. This is achieved by approximating the probability function of a given word using a predefined number of previous words (N). For example, using a bigram model, the probability function is approximated using just the previous word. Similarly, the trigram model uses the previous two words, and so on [11]. Using a bigram model it is easier to approximate the probability of occurrence for the word "call", given the previous words:

P(W) = P("call" | "collect") P("collect" | "a") P("a" | "make") ... P("would" | "I")    (3.7)

The trigram model looks slightly more complicated as it uses the previous two words to compute each conditional probability:

P(W) = P("call" | "collect", "a") P("collect" | "a", "make") ... P("like" | "would", "I")    (3.8)

The N-gram model can be generalized using the following form [9]:

P(W) = \prod_{k=1}^{n} P(w_k | w_{k-1}, w_{k-2}, ..., w_{k-N+1})    (3.9)

In general, as N increases, the N-gram approximation becomes more effective; however, most ASR systems use a bigram or trigram model.
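As an illustration of how the bigram case of Equation 3.9 can be estimated from text, the short Python sketch below counts unigrams and bigrams in a toy corpus and applies the chain rule with add-k smoothing. The corpus, the smoothing constant and the sentence markers <s> and </s> are invented for the example and are not part of the thesis.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram counts for P(w_k | w_{k-1}) from a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                 # context counts
        bigrams.update(zip(words[:-1], words[1:]))  # (previous word, word) counts
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams, vocab_size, k=1.0):
    """Apply the bigram chain rule (cf. Eq. 3.7) with add-k smoothing."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        prob *= (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)
    return prob

corpus = [
    "I would like to make a collect call",
    "I would like to make a phone call",
    "please make a collect call",
]
uni, bi = train_bigram(corpus)
vocab = len({w for s in corpus for w in s.lower().split()}) + 2  # plus <s> and </s>
print(sentence_probability("I would like to make a collect call", uni, bi, vocab))
print(sentence_probability("collect a make to like", uni, bi, vocab))  # much lower
```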

Acoustic Model

The main purpose of an acoustic model is to map sounds into phonemes and words. This can be achieved by using a large speech corpus and generating statistical representations of every phoneme in a language. The English language is composed of around 40 different phonemes [6]. However, due to co-articulation effects, each phoneme is affected by the preceding and succeeding phonemes depending on the context. The most common approach to overcome this problem is to use triphones; in other words, each phone is modeled along with its preceding and succeeding phones [9]. For example, the phoneme representation for the word HOUSEBOAT is:

[ hh aw s b ow t ]

In this context, in order to generate a statistical representation for the phoneme [aw], the phonemes [hh] and [s] need to be included as well. This triphone strategy can be implemented using hidden Markov models, where the triphone is represented using three main states (one for each phoneme), plus one initial and one end state. An example of a triphone can be seen in Figure 3.2.

Figure 3.2: A typical triphone structure.

The purpose of the hidden Markov model is to estimate the probability that an acoustic feature vector corresponds to a certain triphone. This probability can be obtained by solving what is called the evaluation problem of HMMs. There is more than one way to solve the evaluation problem; the most used algorithms are the forward-backward algorithm, the Viterbi algorithm and the stack decoding algorithm [16].

Dictionary

The ASR dictionary complements the language model and the acoustic model by mapping written words into phoneme sequences. It is important to notice that the size of the dictionary should correspond to the size of the vocabulary used by the ASR system. In other words, it should contain every single word used by the system and its phonetic representation, as can be seen in Table 3.1.


Word        Phoneme representation
aancor      AA N K AO R
aardema     AA R D EH M AH
aardvark    AA R D V AA R K

Table 3.1: Example of the 3 first words in an ASR dictionary

The representation of these phonemes must correspond to the representation used by the acoustic model in order to identify words and sentences correctly. Similarly, the words included in the dictionary must be the same as the words used by the language model. Furthermore, the dictionary should be as detailed as possible in order to improve the accuracy of the speech recognition system. For example, the dictionary used by the CMU Sphinx system is composed of more than 100,000 words.
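Since the dictionary format of Table 3.1 is essentially one word followed by its phoneme sequence per line, loading it is straightforward. The Python sketch below assumes a CMU-style plain-text dictionary file; the file name cmudict.dict and the WORD(2) convention for alternate pronunciations are assumptions for the example, not details taken from this project.

```python
def load_dictionary(path):
    """Load a CMU-style phonetic dictionary: 'WORD PH1 PH2 ...' per line."""
    pronunciations = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith(";;;"):   # skip blank lines and comments
                continue
            word, *phones = line.split()
            # Alternate pronunciations are often written as WORD(2); merge them.
            word = word.split("(")[0].lower()
            pronunciations.setdefault(word, []).append(phones)
    return pronunciations

if __name__ == "__main__":
    lexicon = load_dictionary("cmudict.dict")        # assumed file name
    for word in ("aancor", "aardvark", "houseboat"):
        print(word, lexicon.get(word, [["<not in dictionary>"]]))
```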

3.2.3 Decoder

Hidden Markov Models

ASR systems typically make use of finite-state machines (FSMs) to overcome the variations found within speech by means of stochastic modeling. Hidden Markov Models (HMMs) are among the most common FSMs used within ASR and were first used for speech processing in the 1970s by Baker at CMU and Jelinek at IBM [16]. This type of model comprises a set of observations and hidden states with self or forward transitions and probabilistic transitions between them. Most HMMs have a left-to-right topology, which in the case of ASR allows modeling the sequential nature of speech [15]. In a Markov model, the probability or likelihood of being in a given state depends only on the immediately prior state, leaving earlier states out of consideration. Unlike a Markov model, where each state corresponds to an observable event, in an HMM the state is hidden; only the output is visible. An HMM can then be described as having states, observations and probabilities [20]:

• States. N hidden interconnected states, with q_t denoting the state at time t:

  S = S_1, S_2, ..., S_N    (3.10)

• Observations. Symbols: M observation symbols per state,

  V = V_1, V_2, ..., V_M    (3.11)

  Sequences: T observations in the sequence,

  O = O_1, O_2, ..., O_T    (3.12)

• Probabilities. The state transition probability distribution A = {a_ij}, where

  a_ij = P[q_{t+1} = S_j | q_t = S_i],    1 ≤ i, j ≤ N    (3.13)

  the observation symbol probability distribution B = {b_j(k)} at state S_j, where

  b_j(k) = P[v_k at t | q_t = S_j],    1 ≤ j ≤ N, 1 ≤ k ≤ M    (3.14)

  and the initial state distribution π = {π_i}, where

  π_i = P[q_1 = S_i],    1 ≤ i ≤ N    (3.15)

Therefore, an HMM can be denoted as:

λ = (A, B, π)    (3.16)
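As a concrete picture of the definition λ = (A, B, π), the following Python/NumPy sketch encodes a small left-to-right discrete HMM with three states and four observation symbols; the probability values are invented for illustration only.

```python
import numpy as np

# State transition probabilities A[i, j] = P(q_{t+1} = S_j | q_t = S_i).
# Left-to-right topology: each state allows only self and forward transitions.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])

# Observation probabilities B[j, k] = P(v_k | q_t = S_j) for M = 4 symbols;
# zeros mean a symbol is never emitted by that state.
B = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.3, 0.5, 0.0],
              [0.1, 0.2, 0.3, 0.4]])

# Initial state distribution pi[i] = P(q_1 = S_i).
pi = np.array([1.0, 0.0, 0.0])

# The model lambda = (A, B, pi); every row must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
print("states:", A.shape[0], "symbols:", B.shape[1])
```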

An example of an HMM is depicted in Figure 3.3. The model has three states with self and forward transitions between them. State q_1 has two observation symbols, state q_2 has three observation symbols and state q_3 has four observation symbols. Each symbol has its own observation probability distribution per state.

Figure 3.3: Three state HMM

An HMM must address three basic problems in order to be useful in real-world applications such as an ASR system [20]. The first problem is selecting a proper method to determine the probability P(O|λ) that the observation sequence O = O_1, O_2, ..., O_T is

produced by the model λ = (A, B, π). The typical approach used to solve this problem is the forward-backward procedure, with the forward step being the most important at this stage. This algorithm is presented in Section 3.3. The second problem is finding the best state sequence Q = q_1, q_2, ..., q_T that produced the observation sequence O = O_1, O_2, ..., O_T. For this, it is necessary to use an optimality criterion and learn about the model's structure. One possibility is to select the states q_t that are individually most likely for each t. However, this approach is not efficient, as it only provides individual results, which is why the common approach to finding the best state sequence is the Viterbi algorithm, presented later in this section. Finally, the third problem deals with the adjustment of the model parameters λ = (A, B, π) in order to maximize the probability P(O|λ) and thus better explain the origin of the observation sequence O = O_1, O_2, ..., O_T. This is considered an optimization problem where the output serves as the decision criterion. In addition, this is the point where training plays an important role in ASR, as it allows adapting the model parameters to observed training data. A common approach is the maximum likelihood optimization procedure, which uses separate training observation sequences O_v to obtain the model parameters for each model λ_v:

P_v* = \max_{λ_v} P(O_v | λ_v)    (3.17)

Typical usage of Hidden Markov Models

In ASR, various models can be built based on HMMs, the most common being models for phonemes and words. An example of these types of models is depicted in Figure 3.4, where phone HMMs are used to construct the phonetic representation of the word one, and the concatenation of these models is used to construct word HMMs. Phoneme models are constrained by pronunciations from a dictionary, while word models are constrained by a grammar. Phone or phoneme models are usually made up of one or more HMM states. Word models, on the other hand, are made up of the concatenation of phoneme models, which at the same time help in the construction of sentence models [15].


Figure 3.4: Example of phone and word HMMs representation

Acoustic Probabilities and Neural Networks / MLPs

The front end of any speech recognition system is in charge of converting the input sound waves into feature vectors. Nonetheless, these vectors need to be converted into observation probabilities in order to decode the most probable sequence of phonemes and words in the input speech. Two of the most common methods to find the observation probabilities are Gaussian probability density functions (PDFs) and, more recently, neural networks (also called multi-layer perceptrons, MLPs). The Gaussian observation-probability method converts an observation feature vector o_t into a probability b_j(o_t) using a Gaussian curve with a mean vector μ and a covariance matrix Σ. In the simpler version of this method, each state in the hidden Markov framework has one PDF. However, most ASR systems use multiple PDFs per state; for this reason, the overall probability function b_j(o_t) is computed using Gaussian mixtures. The forward-backward algorithm is commonly used to compute these probability functions and is also used to train the whole hidden Markov model framework [11]. Having the mean vector and the covariance matrix, the probability function can be computed using the following equation:

b_j(o_t) = \frac{1}{\sqrt{(2π)^D |Σ_j|}} \exp( -\frac{1}{2} (o_t - μ_j)' Σ_j^{-1} (o_t - μ_j) )    (3.18)

where D is the dimension of the feature vector.
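A minimal NumPy sketch of Equation 3.18 for a single Gaussian per state is shown below. Real recognizers use mixtures of Gaussians and compute these likelihoods in the log domain; the dimensions and parameter values here are illustrative assumptions.

```python
import numpy as np

def gaussian_observation_prob(o_t, mu_j, sigma_j):
    """Likelihood b_j(o_t) of feature vector o_t under one Gaussian (cf. Eq. 3.18)."""
    d = len(o_t)
    diff = o_t - mu_j
    inv = np.linalg.inv(sigma_j)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma_j))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

# Example with a 3-dimensional feature vector (real MFCC vectors have 39 dimensions).
o = np.array([0.2, -0.1, 0.4])
mu = np.zeros(3)
sigma = np.diag([0.5, 0.5, 0.5])
print(gaussian_observation_prob(o, mu, sigma))
```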

Artificial neural networks are one of the main alternatives to the Gaussian estimator for computing observation probabilities. The key characteristic of neural networks is that they can arbitrarily map inputs to outputs as long as they have enough hidden layers. For example, having the feature vectors as inputs (o_t) and the corresponding

phone labels as outputs, a neural network can be trained to obtain the observation probability functions b_j(o_t) [10]. Typically, ASR systems that use hidden Markov models and MLPs are classified as hybrid (HMM-MLP) speech recognition systems. The inputs to the neural network are various frames containing spectral features, and the network has one output for each phone in the language. In this case, the most common method to train the neural network is the back-propagation algorithm [15]. Another advantage of using MLPs is that, given a large set of phone labels and their corresponding set of observations, the back-propagation algorithm iteratively adjusts the weights of the MLP until the errors are minimized. Figure 3.5 presents an example of a representation of an MLP with three layers.

Figure 3.5: A detailed representation of a Multi-layer Perceptron.

The two methods to calculate the acoustic probabilities (Gaussians and MLPs) have roughly the same performance. However, MLPs take longer to train and use fewer parameters. For this reason, the neural network method seems to be better suited to ASR applications where processing power and memory are major concerns.
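For illustration, the sketch below runs a forward pass of a small MLP that maps a window of feature frames to phone posterior probabilities. The layer sizes, the number of phone classes and the random weights are placeholder assumptions; in a hybrid system the weights would be trained with back-propagation on labeled frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_phone_posteriors(features, weights, biases):
    """Forward pass: hidden layer uses a sigmoid, the output layer a softmax."""
    activation = features
    for w, b in zip(weights[:-1], biases[:-1]):
        activation = 1.0 / (1.0 + np.exp(-(activation @ w + b)))   # sigmoid
    logits = activation @ weights[-1] + biases[-1]
    exp = np.exp(logits - logits.max())                            # stable softmax
    return exp / exp.sum()

# Input: 9 context frames of 39 features each, flattened; output: 40 phone classes.
n_in, n_hidden, n_phones = 9 * 39, 100, 40
weights = [rng.normal(0, 0.1, (n_in, n_hidden)),
           rng.normal(0, 0.1, (n_hidden, n_phones))]
biases = [np.zeros(n_hidden), np.zeros(n_phones)]

frame_window = rng.normal(size=n_in)
posteriors = mlp_phone_posteriors(frame_window, weights, biases)
print(posteriors.shape, posteriors.sum())   # (40,) probabilities summing to 1
```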


Viterbi Algorithm and Beam Search

One of the most complex tasks faced by speech recognition systems is the identification of word boundaries. Due to co-articulation and fast speech rates, it is often difficult to identify the place where one word ends and the next one starts. This is called the segmentation problem of speech, and it is typically solved using N-gram models and the Viterbi algorithm. Given a set of observed phones o = (o_1, o_2, ..., o_t), the purpose of the decoding algorithm is to find the most probable sequence of states q* = (q_1, q_2, ..., q_n) and, subsequently, the most probable sequence of words in the speech [11]. The Viterbi search is implemented using a matrix where each cell contains the score of the best path after the first t observations. Additionally, each cell contains a pointer to the last state i in the path. For example:

viterbi[t, i] = \max_{q_1, q_2, ..., q_{t-1}} P(q_1, q_2, ..., q_{t-1}, q_t = i, o_1, o_2, ..., o_t)    (3.19)

In order to compute each cell in the matrix, the Viterbi algorithm uses what is called the dynamic programming invariant [9]. In other words, the algorithm assumes that the overall best path for the entire observation goes through state i, but sometimes that assumption might lead to incorrect results. For example, if the best path looks bad initially, the algorithm would discard it and select a different path with a better probability up to state i. However, the dynamic programming invariant is often used in order to simplify the decoding process through a recurrence rule in which each cell in the Viterbi matrix can be computed using information from the previous cells [11].

viterbi[t, j] = \max_i ( viterbi[t-1, i] a_ij ) b_j(o_t)    (3.20)

As can be seen, the cell [t, j] is computed using the previous cell [t-1, i], one emission probability b_j and one transition probability a_ij. It is important to emphasize that this is a simplified model of the Viterbi algorithm; for example, a real speech recognizer that uses HMMs would receive acoustic feature vectors instead of phones. Furthermore, the likelihood probabilities b_j(o_t) would be calculated using Gaussian probability functions or multi-layer perceptrons (MLPs). Additionally, the hidden Markov models are typically divided into triphones rather than single phones. This characteristic of HMMs provides a direct segmentation of the speech utterance [16]. The large number of possible triphones and the use of a large vocabulary make speech decoding a computationally expensive task. For this reason, it is necessary to implement a method to discard low-probability paths and focus on the best ones. This process is called pruning and is usually implemented using a beam search algorithm. The main purpose of this algorithm is to speed up the execution of the search and lower the amount of computational resources used. However, the main drawback of the algorithm is the degradation of the decoding performance [11].
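The recurrences in Equations 3.19 and 3.20 translate directly into the short dynamic-programming routine below, written here for a discrete-symbol HMM with the toy parameters used earlier; a real decoder would work with log probabilities, acoustic feature vectors and beam pruning rather than this exhaustive version.

```python
import numpy as np

def viterbi(observations, A, B, pi):
    """Return the most likely state sequence for a discrete-symbol HMM."""
    n_states, T = A.shape[0], len(observations)
    v = np.zeros((T, n_states))                # v[t, j]: best path score ending in state j
    back = np.zeros((T, n_states), dtype=int)  # back-pointers to the previous best state

    v[0] = pi * B[:, observations[0]]          # initialization
    for t in range(1, T):
        for j in range(n_states):
            scores = v[t - 1] * A[:, j]        # extend every previous best path to state j
            back[t, j] = np.argmax(scores)
            v[t, j] = scores[back[t, j]] * B[j, observations[t]]   # cf. Eq. 3.20

    # Trace back from the best final state to recover the full state sequence.
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), float(v[-1].max())

# Example with the 3-state left-to-right model and 4 observed symbols.
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
B = np.array([[0.5, 0.5, 0.0, 0.0], [0.2, 0.3, 0.5, 0.0], [0.1, 0.2, 0.3, 0.4]])
pi = np.array([1.0, 0.0, 0.0])
print(viterbi([0, 1, 2, 3], A, B, pi))
```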

A* decoding algorithm

The A* decoding algorithm (also known as stack decoding) can be used to overcome the limitations of the Viterbi algorithm. The most important limitation is its reliance on the dynamic programming invariant, which means it cannot be used directly with some language models, such as trigrams [11]. The stack decoding algorithm solves the same problem as the Viterbi algorithm, which is finding the most likely word sequence W given a sequence of observations O [17]. For this reason it is often used as the best alternative to Viterbi's method. In this case, the speech recognition problem can be seen as a tree network search problem in which the branches leaving each junction represent words. As the tree network is processed, more words are appended to the current string of words in order to form the most likely path [17]. An example of this tree is illustrated in Figure 3.6.

Figure 3.6: Tree Search Network using the A* algorithm.


The stack decoder computes the path with the highest probability of occurrence from the start to the last leaf in the sequence. This is achieved by maintaining a priority stack (or queue) with a list of partial sentences, each with a score based on its probability of occurrence. The basic operation of the A* algorithm can be simplified using the following steps [17]:

1. Initialize the stack.

2. Pop the best (highest scoring) candidate off the stack.

3. If the end of the sentence is reached, output the sentence and terminate.

4. Perform acoustic and language model fast matches to obtain new candidates.

5. For each word on the candidate list:

   (a) Perform acoustic and language-model detailed matches to compute the new theory output likelihood.

      i. If the end of the sentence is not reached, insert the candidate into the stack.

      ii. If the end of the sentence is reached, insert it into the stack with the end-of-sentence flag.

6. Go to step 2.

The stack decoding algorithm is based on a criterion that computes the estimated score f*(t) of a sequence of words up to time t. The score is computed by using the known probabilities of the previous words, g_i(t), and a heuristic function to predict the remaining words in the sentence. For example:

f*(t) = g_i(t) + h*(t)    (3.21)

Alternatively, these functions can be expressed in terms of a partial path p instead of time t. In this case f*(p) is the score of the best complete path which starts with the partial path p. Similarly, g_i(p) represents the score from the beginning of the utterance up to the partial path p. Lastly, h*(p) estimates the best extension from the partial path p towards the end of the sentence [11]. Finding an efficient and accurate heuristic function might represent a complex task. Fast matches are heuristic functions that are computationally cheap and are used to reduce the number of next possible word candidates. Nonetheless, these fast match functions must be checked by more accurate detailed match functions [17].
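The following is a deliberately simplified sketch of the stack search loop: partial word sequences are kept in a priority queue ordered by f* = g + h*, where g is an accumulated log score. The candidate generator (standing in for the fast match), the word scores and the zero heuristic are all invented placeholders; none of this reflects the actual PocketSphinx implementation.

```python
import heapq
import itertools

def stack_decode(candidate_words, score_word, heuristic, max_len):
    """Best-first search over partial word sequences using f* = g + h* (cf. Eq. 3.21)."""
    tie = itertools.count()                 # tie-breaker so the heap never compares lists
    stack = [(0.0, 0.0, next(tie), [])]     # entries: (-f*, g, tie, partial sentence)
    while stack:
        neg_f, g, _, sentence = heapq.heappop(stack)      # step 2: pop the best candidate
        if sentence and sentence[-1] == "</s>":
            return sentence, g                            # step 3: end of sentence reached
        if len(sentence) >= max_len:
            continue
        for word in candidate_words(sentence):            # step 4: fast match -> candidates
            new_g = g + score_word(sentence, word)        # step 5: detailed match score
            f_star = new_g + heuristic(sentence + [word])
            heapq.heappush(stack, (-f_star, new_g, next(tie), sentence + [word]))
    return None, float("-inf")

# Toy example with an invented successor table and log-probability word scores.
successors = {(): ["i"], ("i",): ["would"], ("would",): ["like", "</s>"], ("like",): ["</s>"]}
def candidates(sentence):
    return successors.get(tuple(sentence[-1:]), ["</s>"])
def score(sentence, word):
    return {"i": -1.0, "would": -0.5, "like": -0.7, "</s>": -0.1}[word]

print(stack_decode(candidates, score, lambda s: 0.0, max_len=6))
```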


3.3 Training for Hidden Markov Models

One of the most challenging phases of developing an automatic speech recognition system is finding a suitable method to train and evaluate its hidden Markov models. This section provides an overview of the Expectation-Maximization algorithm and presents the Forward-Backward algorithm that is used for training an ASR system.

3.3.1 Expectation-Maximization Algorithm

Typically, the methods used to complete the training task of an ASR system are variations of the Expectation-Maximization (EM) algorithm, presented by Dempster in 1977. The goal of this algorithm is to approximate the transition (a_ij) and emission (b_i(o_t)) probabilities of the HMM using large sets of observation sequences O [11]. The initial values of the transition and emission probabilities can be estimated; as the training of the HMM progresses, those probabilities are re-estimated until their values converge to a good model. The expectation-maximization algorithm uses the steepest-gradient, also known as hill-climbing, method. This means that the convergence of the algorithm is determined by local optimality [16].

3.3.2

Forward-Backward Algorithm

Due to the large number of parameters involved in the training of HMMs, it is preferable to use simple methods. Some of these methods can feature adaptation in order to add robustness to the system by using noisy or contaminated training data. In fact, this has been a major research area in the field of speech recognition during the last couple of decades [14]. Two of the most popular algorithms used to train HMMs are the Viterbi algorithm and the Forward-Backward (FB) algorithm, also known as the Baum-Welch algorithm. The latter computes two main parameters in order to train HMMs, the maximum likelihood estimates and the posterior modes (transition and emission probabilities), using emission observations as training data. The approximation of the HMM probabilities is achieved by combining two quantities: the forward probability (alpha) and the backward probability (beta) [11]. The forward probability is defined as the probability of being in state i after the first t observations (o_1, o_2, o_3, ..., o_t).

α_t(i) = P(o_1, o_2, ..., o_t, q_t = i)    (3.22)

Similarly, the backward probability is defined as the probability of observing the observations from time t + 1 to the end (T), given that the state is j at time t.

β_t(j) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = j)    (3.23)

In both cases, the probabilities are calculated using an initial estimate, and the remaining values are then approximated using an induction step. For example:

α_1(j) = a_{1j} b_j(o_1)    (3.24)

The other values are then calculated recursively:

α_t(j) = [ Σ_{i=2}^{N-1} α_{t-1}(i) a_{ij} ] b_j(o_t)    (3.25)

As can be seen, the forward probability in any given state can be computed using the product of the observation likelihood b_j and the forward probabilities from time t − 1. This characteristic allows the algorithm to work efficiently without drastically increasing the number of computations as N grows [16]. A similar approach is used to calculate the backward probabilities using an iterative formula:

β_t(i) = Σ_{j=2}^{N-1} a_{ij} b_j(o_{t+1}) β_{t+1}(j)    (3.26)

Once the forward and backward probabilities have been calculated, the alpha and beta factors are normalized and combined in order to approximate the new values for the emission and transition probabilities. Although the Baum-Welch algorithm has proven to be an efficient way to train HMMs, the training data needs to be sufficient to avoid ending up with parameters whose probabilities are equal to zero. Furthermore, the forward-backward algorithm helps to train the parameters of an existing HMM, but the structure of the HMM needs to be generated manually. This can be a major concern, as finding a good method to generate the structure of an HMM can be a difficult task [16]. Figure 3.7 depicts an example of a training procedure in which the forward-backward algorithm is used by Sphinx.
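To make the recursion in Equations 3.24 and 3.25 concrete, the following minimal C program computes the forward probabilities for a small, invented HMM. The model values, topology and final-state handling are assumptions made purely for illustration; a real trainer (such as the one used by Sphinx) works with Gaussian emission densities and in log space to avoid numerical underflow.

#include <stdio.h>

#define N 4   /* states; 0 is the initial and N-1 the final, non-emitting state */
#define T 3   /* number of observations */
#define M 2   /* number of distinct observation symbols */

/* Invented transition (a) and discrete emission (b) probabilities. */
static const double a[N][N] = {
    { 0.0, 0.5, 0.5, 0.0 },
    { 0.0, 0.6, 0.3, 0.1 },
    { 0.0, 0.2, 0.6, 0.2 },
    { 0.0, 0.0, 0.0, 1.0 },
};
static const double b[N][M] = {
    { 0.0, 0.0 },
    { 0.7, 0.3 },
    { 0.2, 0.8 },
    { 0.0, 0.0 },
};

int main(void)
{
    int obs[T] = { 0, 1, 1 };          /* an example observation sequence */
    double alpha[T][N] = { { 0.0 } };
    double p = 0.0;
    int i, j, t;

    /* Initialization (Eq. 3.24): enter each emitting state from state 0. */
    for (j = 1; j < N - 1; j++)
        alpha[0][j] = a[0][j] * b[j][obs[0]];

    /* Induction (Eq. 3.25): sum over the predecessor states, then apply
     * the emission probability of the current observation. */
    for (t = 1; t < T; t++)
        for (j = 1; j < N - 1; j++) {
            double sum = 0.0;
            for (i = 1; i < N - 1; i++)
                sum += alpha[t - 1][i] * a[i][j];
            alpha[t][j] = sum * b[j][obs[t]];
        }

    /* Likelihood of the whole sequence: transitions into the final state. */
    for (i = 1; i < N - 1; i++)
        p += alpha[T - 1][i] * a[i][N - 1];
    printf("P(O | model) = %g\n", p);
    return 0;
}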


Figure 3.7: Sphinx training procedure

3.4 Performance

The performance of an ASR system can be measured in terms of its recognition error probability, which is why specific metrics such as the word-error rate are used to evaluate this type of system. This section describes the word-error rate, which is the most common measurement used to evaluate the performance of an ASR system.

3.4.1 Word Error Rate

The word-error rate (WER) has become the standard measurement scheme to evaluate the performance of speech recognition systems [11]. This metric allows calculating the total number of incorrect words in a recognition task. Similar approaches use syllables or phonemes to calculate error rates; however, words are the most commonly used measurement unit. The WER is calculated by counting the number of inserted, deleted or substituted words in the hypothesized speech string with respect to the correct transcript [1].

WER = (Insertions + Substitutions + Deletions) / TotalWords × 100    (3.27)

As can be seen, the number of word insertions is included in the mathematical expression; for this reason, the WER can take values above 100 percent. Typically, a WER lower than 10 percent is acceptable for most ASR systems [10]. Furthermore, the WER is the most common metric used to benchmark and evaluate improvements to existing automatic speech recognition systems, for example, when introducing improved or new algorithms [21]. Although the WER is the most used metric to evaluate the performance of ASRs, it does not provide further insight into the factors that generate the recognition errors. Some other methods have been proposed in order to measure and classify the most common speech recognition errors, for example, the analysis of variance (ANOVA) method, which allows the quantification of the multiple sources of error acting on the variability of speech signals [1]. During the last decade, researchers have tried to predict speech recognition errors instead of just measuring them in order to evaluate the performance of a system. Furthermore, the predicted error rates can be used to carefully select speech data in order to train ASR systems more effectively.
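As a small illustration of Equation 3.27, the function below computes the WER from the error counts of an utterance. It assumes the insertion, substitution and deletion counts have already been obtained from a minimum-edit-distance alignment between the reference transcript and the hypothesis (the alignment itself is omitted); the function name is illustrative and not part of any library.

/* Word error rate as defined in Eq. 3.27. The counts are assumed to come
 * from an edit-distance alignment between reference and hypothesis. */
double word_error_rate(int insertions, int substitutions, int deletions,
                       int total_reference_words)
{
    if (total_reference_words <= 0)
        return 0.0;
    return 100.0 * (insertions + substitutions + deletions)
                 / (double)total_reference_words;
}

For instance, 1 insertion, 3 substitutions and 1 deletion against a 50-word reference transcript give a WER of 10 percent.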


Chapter 4

Design and Implementation

This chapter describes in detail the design and implementation of a Keyword Based Interactive Speech Recognition System using PocketSphinx. A brief description of the CMU Sphinx technology and the reasons for selecting PocketSphinx is provided in this chapter. Also introduced is the creation process of both the language model and the dictionary. Furthermore, several tools were created in order to ease the system's development, and they are introduced in this chapter. More information on these tools and their usage can be found in Appendix A.

4.1 The CMU Sphinx Technology

This section describes the Sphinx speech recognition system that is part of the CMU Sphinx technology. Also described is the PocketSphinx system, which is the ASR decoder library selected for this project.

4.1.1 Sphinx

Sphinx is a continuous-speech and speaker-independent recognition system developed in 1988 at the Carnegie Mellon University (CMU) in order to try to overcome some of the greatest challenges within speech recognition: speaker independence, continuous speech and large vocabularies [7, 13]. This system is part of the open source CMU Sphinx technology, which provides a set of speech recognizers and tools that allow the development of speech recognition systems. This technology has been used to ease the development of speech recognition systems, as it spares developers the need to start from scratch.


The first Sphinx system has been improved over the years and three other versions have been developed: Sphinx2, 3 and 4. Sphinx2 is a semi-continuous HMM based system, while Sphinx3 is a continuous HMM based speech recognition system; both are written in the C programming language [19]. On the other hand, Sphinx4 is a system that provides a more flexible and modular framework written in the Java programming language [23]. Currently, only Sphinx3 and Sphinx4 are still under development. Overall, the Sphinx systems comprise typical ASR components; Sphinx4, for example, has a Front End, a Decoder and a Linguist, as well as a set of algorithms that can be used and configured depending on the needs of the project. They provide the technology that, given an acoustic signal together with a properly created acoustic model, a language model and a dictionary, decodes the spoken word sequence.

4.1.2 PocketSphinx

PocketSphinx is a large vocabulary, semi-continuous speech recognition library based on CMU's Sphinx 2. PocketSphinx was implemented with the objective of creating a speech recognition system for resource-constrained devices, such as hand-held computers [8]. The entire system is written in C with the aim of having fast-response and light-weight applications. For this reason, PocketSphinx can be used in live applications, such as dictation. One of the main advantages of PocketSphinx over other ASR systems is that it has been ported and executed successfully on different types of processors, most notably the x86 family and several ARM processors [8]. Similarly, this ASR has been used on different operating systems such as Microsoft's Windows CE, Apple's iOS and Google's Android [24]. Additionally, the source code of PocketSphinx has been published by the Carnegie Mellon University under a BSD style license. The latest code can be retrieved from SourceForge1. In order to execute PocketSphinx, typically three input files need to be specified: the language model, the acoustic model and the dictionary. For more information about these three components please refer to Section 3.2.2. By default, the PocketSphinx toolkit includes at least one language model (wsj0vp.5000), one acoustic model (hub4wsj_sc_8k) and one dictionary (cmu07a.dic). Nonetheless, these files are intended to support very large vocabularies; for example, the dictionary2 includes around 125,000 words. Although there are not many scientific publications describing how the PocketSphinx library works, the official web page3 contains documentation that describes how to download and compile the source code, as well as how to create example applications.

1 CMU Sphinx repository: http://sourceforge.net/projects/cmusphinx
2 CMU dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
3 PocketSphinx: http://www.pocketsphinx.org

Similarly, there is a discussion forum where new developers are encouraged to ask questions. Probably the main drawback of the web page is that some sections do not seem to be updated regularly and the organization is not intuitive. In this regard, it can be cumbersome to navigate through the web page, as some documentation applies to both PocketSphinx and Sphinx 4, while some other parts apply just to Sphinx 4. At the end of the day, we chose to use PocketSphinx because it provides a development framework that can be adapted to fit our requirements. For example, PocketSphinx has been tested on several embedded devices running different operating systems. For this reason, we consider that it is more feasible to adapt it instead of creating an ASR from scratch. In this case, we can focus on creating a keyword-based language model and a dictionary that maximize the performance of our system. Similarly, we chose to create an application in which we can measure the performance of the ASR and interact with a computer using speech; in the following section we will describe this application in more detail.

4.2 System Description

The development of this project was done considering that the system can be used by a robot that is located at a museum and that can interact with people. The robot shall be able to listen and talk to a person based on keywords identified in the speaker's utterances. However, the robot shall never remain quiet. This means that if no keyword is identified, it shall be able to play additional information about the museum. For this project, the robot was assumed to be located at the Swedish Tekniska Museet4. Therefore, the system needs to be able to identify previously defined keywords from a spoken sentence and it should be speaker independent. Moreover, the system needs to be an interactive speech recognition system, in the sense that it shall be able to react based on an identified keyword. In this case, an audio file with information related to the keyword is played in response. Otherwise, if the system does not identify any keyword from the decoded speech, the system shall play another, default audio file. Furthermore, the system needs to be portable, as its main goal is to be used in an embedded application. Therefore, as explained in Section 4.1.2, PocketSphinx was selected from the CMU Sphinx family of decoders, as it provides the required technology to develop this type of system.

4 Tekniska Museet: http://www.tekniskamuseet.se


4.3 Keyword List

The keyword list comprises the words of relevance that are selected to be identified by the system. This list shall most likely contain the words that are most often pronounced by a speaker; therefore, they shall be selected according to the area of interest. For this project, the keywords were selected according to the current exhibitions at the museum. For the definition of the keywords, a keyword file was created. This file contains the list of all the words that shall be identified from the incoming speech signal. Each line of the file shall contain the keyword, followed by the number of audio files available for playing whenever the keyword is identified by the system. An extra line is also added at the end of this file stating the number of available audio files that are to be played whenever the system is not able to identify a keyword. The word DEFAULT preceded by a hash symbol is used for this purpose. An example of the format of this file can be seen in Figure 4.1. This example has Keywords 1 to n, each with five available audio files, as well as ten default audio files.

Figure 4.1: Keyword File Example
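In plain text, the format described above can be sketched as follows; the keyword names are placeholders, and the single space between a keyword and its count is an assumption, since the exact separator is only shown in the figure.

KEYWORD1 5
KEYWORD2 5
...
KEYWORDn 5
#DEFAULT 10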

4.4 Audio Files

4.4.1 Creating the audio files

The information that is played whenever a keyword is identified by the system comes from selected sentences converted from text to speech. Thus, to convert the sentences into speech, a Text to Speech (TTS) system is required. There are different TTS systems available, such as eSpeak [4] or FreeTTS [5]. However, for this project we have selected Google's TTS, which is accessed via Google's Translate [22] service, as it is easy to use and, more importantly, because of its audio quality.


Therefore, in order to ease the creation of the audio files containing the responses to be given by the system, the RespGenerator tool was developed and written in Java. Prior to using this tool, the desired sentences per keyword, as well as the default ones, shall be written in a text file. The file should first contain a line with the keyword preceded by a hash symbol and then the lines with its associated sentences. An example of the format of this file, containing Keywords 1 to n and their sentences, can be seen in Figure 4.2.

Figure 4.2: Sentences for Audio Files Example

Once the file containing the sentences is ready, the tool reads them from the file, connects to Google's TTS and retrieves the generated audio files, using the GNU wget5 free software package. The generated audio files are placed under a folder with the name of the keyword that they belong to, and they are numerically named from 1 to n, where n is the number of available audio files per keyword. Figure 4.3 illustrates the process of converting the sentences into audio files.

5 GNU wget: http://www.gnu.org/software/wget/


Figure 4.3: Conversion of Sentences into Audio Files

The RespGenerator tool has only one mode of operation, in which the user must specify the path and name of the file containing the sentences to be used as answers. Additionally, the user must specify the desired path for the output audio files. Regarding the wget tool, it needs to be placed in the same folder as the RespGenerator tool in order to generate the audio files correctly. On Linux operating systems, the wget tool is installed by default; on Windows operating systems, however, the user needs to download the executable and place it in the appropriate folder. The following command provides an example of how to execute the RespGenerator tool:

java -jar RespGenerator.jar ..\SentencesFile.txt ..\OutputPath\

As can be seen, the first parameter corresponds to the path and name of the file containing the sentences and the second parameter is the path for the output audio files. In case the user specifies less than two parameters, the tool will display a message indicating that there was an error. For example:

Error: less than 2 arguments were specified

On the other hand, when the two inputs are correctly specified, the tool will generate the output audio files and it will display a message indicating that the tool was executed correctly:

kwords.txt file created succesfully!


Additionally, the RespGenerator tool creates a text file containing a list of keywords and the corresponding number of audio files created per keyword. This file is used by the ASR application in order to know how many possible responses are available per keyword as well as how many default responses can be used.

4.4.2 Playing the audio files

Initially, the ASR system was designed to play MP3 files, but it was later changed to play OGG files, as OGG is a completely open and free format6. Therefore, in order to play the audio files, it is necessary to use an external library. For that, the BASS audio library7 was used. The BASS audio library allows the streaming of different audio file types, including OGG. Moreover, the library is free for non-commercial use and it is available for different programming languages, including C. Furthermore, one of the main advantages of this library is that everything is contained within a small DLL file of only 100 KB.
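As an illustration of how the library can be used from C, the sketch below plays a single OGG response file and blocks until playback finishes. It is a minimal sketch rather than the project's actual playback code: the file path is a placeholder, error handling is reduced to a message, and the calls follow the documented BASS C API (bass.h).

#include <stdio.h>
#include <windows.h>   /* Sleep() */
#include "bass.h"      /* BASS audio library header */

/* Play one OGG response file on the default output device. */
int play_response(const char *path)
{
    HSTREAM stream;

    if (!BASS_Init(-1, 44100, 0, 0, NULL)) {
        fprintf(stderr, "BASS_Init failed (error %d)\n", BASS_ErrorGetCode());
        return -1;
    }
    stream = BASS_StreamCreateFile(FALSE, path, 0, 0, 0);
    if (stream && BASS_ChannelPlay(stream, FALSE)) {
        /* Poll until the response has finished playing. */
        while (BASS_ChannelIsActive(stream) == BASS_ACTIVE_PLAYING)
            Sleep(100);
    }
    BASS_StreamFree(stream);
    BASS_Free();
    return 0;
}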

4.5 Custom Language Model and Dictionary

Although PocketSphinx can be used in applications supporting vocabularies composed of several thousands of words, the best performance in terms of accuracy and execution time can be obtained using small vocabularies [12]. After downloading and compiling the source code for PocketSphinx, we proceeded to run a test program using the default language model, acoustic model and dictionary. Nonetheless, the word accuracy of the application tended to be very low. In other words, most of the speech sentences used as inputs were not recognized correctly. After performing some research in the documentation for PocketSphinx, we found out that it is recommended to use a reduced language model and a custom dictionary with support for a small vocabulary. The smaller the vocabulary, the faster PocketSphinx will decode the input sentences, as the search space of the algorithms used by PocketSphinx gets smaller. Similarly, the accuracy of the ASR becomes higher when the vocabulary is small. For example, there is an example application included with PocketSphinx that uses a vocabulary containing only the numbers from 0 to 9. In this case, the overall accuracy is in the range of 90% to 98%.

On the other hand, it is recommended to use one of the default acoustic models included with PocketSphinx. The main reason is that the default acoustic models have been created using huge amounts of acoustic data containing speech from several persons. In other words, the default acoustic models have been carefully tuned to be speaker independent. If for some reason the user creates a new acoustic model, or adapts an existing one, the acoustic training data needs to be carefully selected in order to avoid speaker dependence. The CMUSphinx toolkit provides methods to adapt an existing acoustic model or even create a new one from scratch. However, in order to create a new acoustic model, it is required to have large quantities of speech data from different speakers and the corresponding transcript for each training sentence. For dictation applications it is recommended to have at least 50 hours of recordings of 200 speakers8.

6 OGG: http://www.vorbis.com/
7 BASS Audio Library: http://www.un4seen.com/bass.html
8 CMUCLMTK toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

4.5.1 Keyword-Based Language Model Generation

The CMUSphinx toolkit provides a set of applications aimed at helping the developer during the creation of new language models; this group of applications is called the Statistical Language Model Toolkit (CMUCLMTK). The purpose of this toolkit is to take a text corpus as input and generate a corresponding language model. The text corpus is a group of delimited sentences that is used to determine the structure of the language and how each word is related to the others. For more information related to language models please refer to Section 3.2.2. Figure 4.4 illustrates the basic usage of the CMUCLMTK.

Figure 4.4: Statistical Language Model Toolkit

Alternatively, there is an online version of the CMUCLMTK toolkit, called lmtool. In order to execute this tool, the user needs to upload a text corpus to a server using a web browser, and the tool generates a language model. However, this tool is intended to be used with small text corpora containing a couple of hundred sentences at the most. This limitation became a problem for us, as most of the time our web browser timed out before the tool was able to generate a language model. For this reason we decided to use the CMUCLMTK toolkit instead. Besides generating a language model from a text corpus, the CMUCLMTK toolkit generates several other useful files, such as the .vocab file, which lists all the words found in the vocabulary. Similarly, the .wfreq file lists all the words in the vocabulary and the number of times they appear in the text corpus. Finally, the toolkit generates the language model in two possible formats: the ARPA format (.arpa), which is a text file, or the CMU binary format (.DMP). Both formats can be used by PocketSphinx; however, the binary format is preferred since the files are much more compact.

There are many ways to generate the input text corpus for the CMUCLMTK toolkit. Probably the simplest way of generating the text corpus is to write a number of sentences by hand. Nevertheless, this method can become unfeasible very easily, especially when the language model generation requires a fairly high number of input sentences. For this reason, it is preferred to use a different method in order to generate the text corpus automatically. In order to optimize the generation of the language models, we created a tool called TextExtractor. This tool creates a text corpus based on web texts from the internet. More specifically, this tool downloads text from web pages based on a set of input Uniform Resource Locators (URLs). Additionally, the TextExtractor filters the raw text, expands abbreviations and acronyms, converts numbers to their word representation and removes special characters, among other things. Furthermore, as illustrated in Figure 4.5, the TextExtractor can receive a list of keywords in order to insert into the text corpus only the sentences containing keywords. This is particularly helpful to keep the language model compact even when the number of sentences in the selected web pages is large.

Figure 4.5: TextExtractor and CMUCLMTK

In general, the TextExtractor tool has two modes of operation. The simplest way to use the tool is by providing two arguments: the list of URLs to be used to generate the text corpus and the name of the output file. For example:

java -jar TextExtractor.jar ..\URL_List.txt ..\TCorpus.txt

In this case, the result of executing the tool will be a text corpus called TCorpus.txt. This file will include a list of sentences downloaded from the list of URLs. Additionally, the sentences will not contain punctuation symbols, such as commas and colons, and all of the numbers will be replaced by their word representations.

In the second mode of operation, the tool receives three arguments: the two arguments previously described and a third one, which is a list of keywords to be used to filter the output text corpus. As was previously explained, the text corpus will only include sentences that contain one or more keywords. For example:

java -jar TextExtractor.jar ..\URL_List.txt ..\TCorpus.txt ..\Keyword_List.txt

In both operation modes, the tool will display an error message if it finds a problem during its execution. Otherwise, a success message will be displayed when the tool has completed its execution successfully. For example:

Output Text file(s) created

Using both the TextExtractor and the CMUCLMTK tools allows generating language models in a quick and easy way. These characteristics gave us the opportunity to evaluate the performance of several language models in order to choose the most appropriate one given the requirements of our application. Please refer to Chapter 5 in order to find out more about the evaluation of different language models. The TextExtractor tool was written entirely in Java in order to be used on different platforms, while the CMUCLMTK toolkit is written in C. The CMUCLMTK toolkit can be compiled for different platforms, most notably Windows, SunOS and Linux. For this reason, it is important to choose the correct version of the toolkit before downloading it from the repository. Currently, there is an online document9 describing the main characteristics of the toolkit. However, it is not very detailed and the procedure to use the tools is inconsistently described in several parts. The TextExtractor tool uses a public library called boilerpipe in order to download text from web pages. This library is publicly available and can be downloaded from the repository for Google code projects10.

4.5.2 Custom Dictionary Generation

Although the default dictionary file (cmu07a.dic) integrated with PocketSphinx includes more than 120,000 words, it is not feasible to use it in a small vocabulary application. For this reason, we generated a custom dictionary with only the subset of words used by our language model. The size of the custom dictionary is around 1/40th of the size of the original dictionary. This allows the ASR application to handle the dictionary file much more easily and to perform the speech decoding faster.

9 CMUCLMTK toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
10 Boilerpipe: http://code.google.com/p/boilerpipe

The dictionary generator tool (DictGenerator) takes a custom language model and the cmu07a dictionary from PocketSphinx as inputs. From these, the tool generates a custom dictionary containing just the subset of words supported by the language model. This process can be seen in Figure 4.6.

Figure 4.6: Dictionary Generator Tool

Besides the custom dictionary file, the DictGenerator tool generates a text file containing a list of words not found in the cmu07a dictionary. When a word is not found in the dictionary, it can be for one of two main reasons, as specified in Table 4.1.

Cause: The word is written incorrectly or is written in another language.
Solution: The text corpus used to generate the language model needs to be edited to remove typos and words in other languages.

Cause: The word is written correctly, but is not included in the dictionary.
Solution: The user needs to edit the custom dictionary to include the missing word manually.

Table 4.1: Dictionary Words Not Found: Causes and Solutions

Most of the time, the words in the language model that are not found in the cmu07a dictionary contain typos. For this reason, there is no need to edit the custom dictionary very often. The DictGenerator has two main operation modes. In the first operation mode, it receives at least two arguments: one indicating the name and path of the input dictionary file (cmu07a.dic) and the second indicating the name and path of a file containing all the words in the custom vocabulary. For example:

java -jar dictGenerator.jar ..\cmu07a.dic ..\TCorpus.vocab


In this operation mode, the tool will generate a custom dictionary using a default name (outputDict.dic) and a file containing words not found in the input dictionary (notfound.txt). By default, the two files will be created in the same folder as the DictGenerator tool. In the second operation mode, the tool receives a third parameter. This allows the user to specify a name and path for the custom dictionary:

java -jar dictGenerator.jar ..\cmu07a.dic ..\TCorpus.vocab ..\TCorpus.dic

The main difference with respect to operation mode one is that the output file will be called TCorpus.dic and it will be stored in the path specified by the user. The other output file will have the same name, but it will be stored in the same location as the custom output dictionary. For both operation modes, the tool will display a message when it has finished its execution. For example:

Custom Dictionary was created succesfully!

The DictGenerator tool was written in Java in order to allow the user to execute it on different platforms. This is consistent with the other tools developed during this project.
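For reference, the entries in the generated custom dictionary follow the CMU dictionary convention of a word followed by its ARPAbet phone sequence. The lines below are illustrative examples only; in practice the phone strings are copied verbatim from the master cmu07a dictionary rather than typed by hand.

TECHNOLOGY T EH K N AA L AH JH IY
TELEPHONE T EH L AH F OW N
ENERGY EH N ER JH IY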

4.5.3 Support for other languages

English is the default language used during the development of this project. However, there are language models, acoustic models and dictionaries available for other popular languages, such as German, French and Mandarin. All these models and dictionaries are publicly available and can be downloaded from the PocketSphinx repository11. In order to support other languages, a new language model, a new acoustic model and a new dictionary need to be created. The language model can be created in the same way as described previously, using the TextExtractor and CMUCLMTK tools. The only difference is that the web pages used as input need to be written in the desired language. The generation of a new phonetic dictionary might be much more complex, as every word supported by the language model needs to be included in the dictionary. It is important to mention that every word needs to be translated into its corresponding ARPAbet representation in order to be used by PocketSphinx. For example, Table 4.2 illustrates the first five words in a phonetic dictionary for Spanish.

11 http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/

Word       ARPAbet representation
ÁBREGO     ABREGO
ÁFRICA     AFRIKA
ÁLVAREZ    ALVARES
ÁLVARO     ALVARO
ÁNFORAS    ANFORAS

Table 4.2: Phonetic Dictionary

Building a detailed phonetic dictionary for a language can be a very time-consuming and challenging task, as the developer needs to have a very good understanding of the language and its pronunciation. In fact, it is recommended to get help from a group of linguists in order to aid the developers during the generation of the dictionary. In contrast, a new acoustic model can be trained automatically, provided there are enough recordings (an acoustic database) and the corresponding transcript for every single sentence is available. PocketSphinx includes a tool called SphinxTrain that is intended to be used to train new acoustic models. The tool receives four inputs, which can be seen in Figure 4.7.

Figure 4.7: SphinxTrain Tool


The inputs include the following files:

1. A set of recordings in .wav format, one file per sentence per speaker.
2. A transcript file containing each sentence in the acoustic database; the same file can be used for all the speakers (an illustrative transcript line is shown after this list).
3. A phonetic dictionary for the desired language.
4. A filler dictionary, which maps non-speech sounds in the recordings into corresponding speech-like sound units.

Besides downloading and compiling the SphinxTrain tool, it is required to have Perl and Python installed on the same computer in order to execute SphinxTrain. For more information regarding the installation and the configuration of the SphinxTrain tool, please refer to the tutorial12 published by the Carnegie Mellon University.
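As an assumption based on the CMU acoustic model training tutorial referenced above, the transcript file conventionally contains one line per recording, with the sentence wrapped in <s> and </s> markers and followed by the recording's file id in parentheses, for example:

<s> what is technology </s> (speaker1_utt001)

The file id shown here is a placeholder; it must match the corresponding entry in the list of recordings used for training.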

4.6 Decoding System

The ASR system was written entirely in C in order to be able to use the PocketSphinx Application Programming Interface (API). Microsoft Visual Studio 2008 was used as the Integrated Development Environment (IDE). This section describes the inputs and outputs, as well as the main algorithm used by the ASR system to perform the decoding of the speech signal. In addition, the interaction between the ASR decoder system and all the tools that were developed during its implementation is explained at the end of this section.

4.6.1 Inputs

The acoustic model, the language model and the dictionary are the three main inputs the ASR system requires in order to perform the decoding of the incoming speech signal. For this project, the default acoustic model included with PocketSphinx is the one used by the system. On the other hand, a new language model as well as a new dictionary were created from scratch following the steps specified in Sections 4.5.1 and 4.5.2.
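To illustrate how these three inputs are handed to the decoder, the sketch below initializes PocketSphinx and decodes one pre-recorded raw audio file. The model paths follow the layout used in Appendix A, the audio file name is a placeholder, and the exact API signatures (for instance, whether ps_start_utt and ps_get_hyp take an utterance id argument) differ slightly between PocketSphinx versions, so this is a sketch rather than the project's actual source code.

#include <stdio.h>
#include <pocketsphinx.h>

int main(void)
{
    /* Configure the decoder with the acoustic model, language model and dictionary. */
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",  "../../model/hmm/en_US/hub4wsj_sc_8k/",
        "-lm",   "../../model/lm/en/tekniska/TextCorpusFiltered.lm.DMP",
        "-dict", "../../model/lm/en/tekniska/TextCorpusFiltered.dic",
        NULL);
    ps_decoder_t *ps = ps_init(config);
    FILE *fh = fopen("utterance.raw", "rb");   /* 16 kHz, 16-bit mono raw audio */
    int16 buf[512];
    size_t nsamp;
    int32 score;
    const char *hyp;

    if (ps == NULL || fh == NULL)
        return 1;

    /* Feed the audio to the decoder in small blocks. */
    ps_start_utt(ps);
    while ((nsamp = fread(buf, sizeof(int16), 512, fh)) > 0)
        ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
    ps_end_utt(ps);

    /* Retrieve the best hypothesis and its score. */
    hyp = ps_get_hyp(ps, &score);
    printf("Hypothesis: %s\n", hyp ? hyp : "(none)");

    fclose(fh);
    ps_free(ps);
    return 0;
}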

4.6.2 Outputs

The ASR system outputs the decoded incoming speech signal by providing a hypothesis that best describes the speaker's utterance. However, since it is a keyword-based system, it also outputs the identified keyword. In addition, a file containing the results obtained during the ASR system's execution is created. The file contains information on the obtained hypothesis and keywords for every speech signal that was decoded.

12 SphinxTrain tool: http://cmusphinx.sourceforge.net/wiki/tutorialam

4.6.3 Algorithm

The following describes the decoding process of the ASR system, which comprises four main states: Ready, Listening, Decoding and Scanning. The states represent the steps needed to transform the live speech signal into a word sequence and properly identify a keyword. A state shall only start after the completion of the previous state. After the final state has been reached, the system will return to the first state without the need of restarting the system. This process is described by the following steps and is illustrated in Figure 4.8.

1. Ready. At the beginning, the system initializes the audio input device in order to start listening to the speaker. The system will also load the acoustic model, the language model and the dictionary in order to later perform the decoding of the signal. In this state, the available keywords and their number of available audio files are also loaded into the system. Once the startup has finished, the system will play the message: “I’m listening” and it will indicate that it is ready to start listening to the speaker. This message is only played during the system's startup; on subsequent iterations only the word Ready is printed on screen.

2. Listening. As soon as the speaker starts talking, the system will listen to the incoming speech and will perform the feature extraction until the speaker has finished. For this, the system will wait for a long silence to occur. This silence is determined by verifying whether one second has elapsed since the last spoken utterance. Once a long silence has been determined, the system will stop listening and the decoding process will then take place. Otherwise, the system will continue listening until the end of the speaker's utterance is determined. The system will indicate when it stops listening by printing Stopped listening on screen.

3. Decoding. In the decoding state, the received speech signal is decoded using the information from the feature vectors and the linguist to produce a hypothesis. For this, the linguist uses the data from the acoustic model, the language model and the dictionary.

   • The acoustic model provides statistical information that is used to map the incoming sounds within the speech signal into phonemes and words.
   • The language model contains statistical information on how likely words are to occur. For this project, a 3-gram language model format was used.
   • The dictionary provides the phonetic representation (pronunciation) of the words within the language model and it is used to map phoneme sequences into written words.

   Please refer to Section 3.2.2 for more information on the linguist.


   Once the signal has been decoded, a hypothesis of what was spoken is given as an outcome. The hypothesis consists of a word sequence that is determined to be the best option to represent the speaker's utterance.

4. Scanning. The provided hypothesis is then scanned in order to verify whether one of its words is a keyword from the predefined list. If a keyword is found, one of its associated audio files is randomly selected and played. Otherwise, an audio file is randomly selected from the default ones and played. A minimal sketch of this scanning step is shown after this list.

After the audio file has been played, the ASR system will be ready to listen to the speaker again. There is no need to restart the program to continue its execution. In order to terminate the program's execution, the speaker can either say the word OUT or manually exit the program. The system will play the message: “Bye” and the results file, including the hypotheses and keywords identified, will be created.
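The sketch below illustrates the scanning step: a substring search over the decoded hypothesis followed by a random choice among the available responses. The keyword table, folder layout and function name are illustrative assumptions based on the description in Section 4.4.1, not the project's actual code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative keyword table: keyword name and number of response files. */
typedef struct {
    const char *name;
    int         num_responses;
} keyword_t;

static const keyword_t keywords[] = {
    { "STEAM ENGINE", 5 },
    { "TECHNOLOGY",   5 },
    { "ENERGY",       5 },
};
static const int num_keywords = sizeof(keywords) / sizeof(keywords[0]);
static const int num_default_responses = 10;

/* Build the path of the audio file to play for a decoded hypothesis:
 * <keyword folder>/<n>.ogg when a keyword is found, otherwise a default
 * response (the DEFAULT folder name is an assumption). */
void select_response(const char *hypothesis, char *path, size_t pathsize)
{
    int i;
    for (i = 0; i < num_keywords; i++) {
        if (strstr(hypothesis, keywords[i].name) != NULL) {
            snprintf(path, pathsize, "%s/%d.ogg", keywords[i].name,
                     1 + rand() % keywords[i].num_responses);
            return;
        }
    }
    snprintf(path, pathsize, "DEFAULT/%d.ogg",
             1 + rand() % num_default_responses);
}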

Figure 4.8: Decoding Algorithm


4.6.4 System Components

In order to better illustrate the interaction between the main system and all of the developed tools, a component diagram is depicted in Figure 4.9. As can be seen in this diagram, the entire system comprises several files, four tools and the ASR decoder. Some of the files are generated by the user and others are the result of the execution of some of the tools.

Figure 4.9: Component Diagram

The ASR decoder, which is the main system, uses the language model, the custom dictionary and the acoustic responses to operate. These outputs are the result of the execution of three tools: the TextExtractor, the DictGenerator and the RespGenerator. At the same time, these tools require that the user properly generates the following files: the keyword file, the URL list, the master dictionary and the file with the response sentences. On the other hand, the rEvaluator tool uses the output of the main system, the LogFile, as well as the verification file in order to execute properly. Finally, this tool generates the performance report that is used to evaluate the system's performance.


Chapter 5

Tests and Results

This chapter provides a detailed description of the testing procedure used to evaluate the performance and accuracy of the Keyword Based Interactive Speech Recognition System. Furthermore, it includes a set of results obtained using different language models and phonetic dictionaries. Additionally, this chapter includes a list of tests performed to verify the correct operation of a set of Java tools developed during this project.

5.1 Main ASR application

5.1.1 Performance Evaluation and KDA

After developing a keyword-based ASR application using PocketSphinx, the system was evaluated in order to quantify its accuracy and error rate. As the main purpose of the system is to identify keywords in the input utterances, it is not practical to measure the word error rate (WER) as in other ASR applications. In this case, we defined a parameter called Keyword Decoding Accuracy (KDA), which represents the percentage of keywords correctly identified by the ASR, given a set of utterances containing keywords. This is represented by the following:

KDA = (Number of keywords decoded correctly / Total number of keywords) × 100    (5.1)

Given the KDA parameter, we focused on optimizing the language model and the phonetic dictionary in order to maximize the accuracy of the ASR application. Many ASR tools developed using PocketSphinx evaluate their accuracy by reading audio streams (files). In other words, the audio files are directly loaded into memory and decoded by the ASR tool. This approach, as depicted in Figure 5.1, can be somewhat unrealistic, as the audio decoded by the ASR is free of ambient noise or distortions caused by the environment.

Figure 5.1: Typical Test Setup

To evaluate the accuracy of our ASR application, the KDA was measured using live audio and the default input device (microphone) in our development workstation. This allowed testing the application in a realistic environment although the accuracy might be slightly lower. Figure 5.2 illustrates the mentioned configuration.

Figure 5.2: KDA Test Setup

In order to ensure repeatability in the process of measuring the KDA, the set of test sentences articulated by the user was recorded and played back using the computer's sound card. This also allowed automating the process of measuring the KDA for several language models without the need of having a person speaking to the microphone each time. Another advantage of this approach is that the ASR application does not need to be modified to alternate between the microphone and the line-in device. Figure 5.3 illustrates the automation of the evaluation process.

(a) 1st Step: Recording of Test Sentences
(b) 2nd Step: Evaluating the Performance

Figure 5.3: Automated Test Process


5.1.2 Test environment and setup

The first step in the testing process was selecting a list of keywords to generate a language model and the dictionary to be used by the ASR application. This was achieved by selecting 20 keywords from the Swedish Tekniska Museet (Museum of Science and Technology) web page. It is important to mention that some keywords are actually composed of two words in order to test the application using more complex phrases. Table 5.1 comprises the list of keywords.

MECHANICAL WORKSHOP     POWER         NASA
STEAM ENGINE            ADVENTURE     MINE
MECHANICAL ALPHABET     ENERGY        RAILWAY
INDUSTRIAL REVOLUTION   SPORTS        WOMEN
RADIO STATION           EXHIBITION    TELEPHONE
INTERACTIVE             EXPERIMENTS   TECHNOLOGY
INSPIRATION             TRAIN SET

Table 5.1: Keyword List

The second step in the testing process was defining a set of test sentences to measure the KDA for the application. In this case, we recorded 5 sentences per keyword, which resulted in a total of 100 test sentences. The same set of sentences was recorded by two persons in order to evaluate the application's speaker independence. Each of the sentences was later used as input to the ASR application in order to evaluate its accuracy. Figure 5.4 shows some of the test sentences used to measure the KDA.

Figure 5.4: Test Sentences


After generating the set of test sentences, the next phase consisted of generating a language model and a dictionary using a list of web pages (URLs). During the first set of tests we only used web pages from the Tekniska Museet in order to build a very simple language model. After evaluating the accuracy of the first set of tests, we proceeded to add more URLs to the list in order to generate more elaborate language models. In this case, several articles from Wikipedia were used, as most articles contain large quantities of well-written sentences. For the sake of automating the measurement of the KDA, we developed a tool called rEvaluator. This tool is written in Java and it receives a log file from the ASR application and the list of keywords included in the test sentences. The tool generates a comma-separated report file in which the KDA metric can be identified for each keyword and for the overall set of test sentences. The comma-separated report can be easily imported into Excel in order to visualize and compare the KDA metrics. Figure 5.5 depicts the operation of this tool.

Figure 5.5: rEvaluator Tool

Table 5.2 depicts an example of the output report using 5 keywords. As can be seen, it lists the number of correctly decoded keywords, the number of sentences per keyword and the KDA.

Keyword          Matches   Sentences   KDA (%)
TECHNOLOGY       5         5           100
TELEPHONE        4         5           80
STEAM ENGINE     3         5           60
RADIO STATION    5         5           100
ENERGY           5         5           100
OVERALL METRICS  22        25          88

Table 5.2: KDA Report Format

Using the rEvaluator tool, we gathered results using different language models and a different number of URLs. Table 5.3 shows the performance of the ASR application.

Test   URLs   Vocab Size (Words)   KDA Speaker 1   KDA Speaker 2   Overall KDA (%)
1      15     1174                 56.5            59              57.75
2      19*    2679                 59              64              61.5
3      19     1650                 68              70              69
4      23     2596                 78              82              80
5      35     3562                 83              87              85

Table 5.3: Overall Test Results

*Note: For test number 2, the input text corpus was not filtered. In other words, it included some sentences that did not contain keywords. This is why the vocabulary size in test number 3 is smaller although the number of URLs remained the same.

As can be seen from the results table, the overall KDA increased as the number of URLs (and the vocabulary size) increased. However, as the vocabulary size increased, the execution time of the ASR application increased as well. For this reason, we decided to keep the vocabulary size around 3,500 words in order to keep the execution time low. During the final round of tests, the ASR application managed to achieve a keyword decoding accuracy of 85%. In other words, around 85 of the 100 keywords in the test sentences were decoded correctly, while maintaining an average execution time of around 0.5 seconds per sentence. Regarding the maximum KDA achieved, it must be clarified that none of the persons participating in the tests speaks English as their first language. As was explained previously, this could adversely affect the overall performance of the application. Finally, the system was executed and tested using two workstations; Table 5.4 illustrates the main characteristics of the systems and their architecture.

System             Architecture   OS          Processor                 RAM
Dell Studio 1555   x86            Vista SP2   Intel Core 2 Duo T6500    3.0 GB
Dell XPS 1640      x86 64         Vista SP2   Intel Core 2 Duo P8600    4.0 GB

Table 5.4: Computer Specifications

Alternatively, during the preliminary stages of development, the application was also compiled and run on Linux using OpenSUSE 11.4 and the same Dell XPS 1640 workstation with an external Hard Drive. However, during the latter stages we continued the development of the application using Windows Vista and Visual Studio 2008.


5.2 Auxiliary Java Tools

In order to verify the correct operation of tools such as the TextExtractor, RespGenerator and DictGenerator, a list of tests was designed. This verification phase involved testing all of the different operation modes for each tool as well as verifying their output files. The following sections describe the different tests performed for each tool.

5.2.1 TextExtractor Tool

For the TextExtractor tool, two operation modes were tested. For the first operation mode, it was verified that the tool was able to generate a text corpus from a list of URLs. Furthermore, it was verified that the list of sentences included in the corpus was not filtered based on a list of keywords. Figure 5.6 shows the list of messages displayed in the command line after executing the tool:

Figure 5.6: TextExtractor Operation Mode 1

Similarly, the second operation mode was tested. In this case, it was verified that the list of sentences included in the text corpus was filtered based on a list of given keywords. Figure 5.7 shows the list of messages displayed in the command line after executing the tool:


Figure 5.7: TextExtractor Operation Mode 2

5.2.2 DictGenerator Tool

The testing for the DictGenerator tool was more straightforward, as there is just one difference between the operation modes. In the first mode, the user does not define a name for the generated dictionary, while in the second mode the user specifies a path and a name for the custom dictionary. In both cases, the set of messages shown in the command line is the same. This can be visualized in Figure 5.8. Similarly, Figure 5.9 shows the custom dictionary generated using each operation mode. As can be seen, operation mode 1 generates a dictionary with a default name (outputDict.dic).

Figure 5.8: DictGenerator Operation Modes 1 and 2


Figure 5.9: Generated Dictionary Files

5.2.3 RespGenerator Tool

A similar test was designed for the RespGenerator tool. In this case, the Java application has only one operation mode. The user needs to provide the name of a file containing the sentences to be used as acoustic responses from the ASR system. Additionally, the user must specify the path where the audio files will be stored after executing the tool. This can be seen in Figure 5.10.

Figure 5.10: RespGenerator: Execution Messages

5.2.4 rEvaluator Tool

Finally, the rEvaluator tool was verified by testing that the KDA report was generated correctly given three input parameters. The first parameter corresponds to a log file generated by the ASR application. The second parameter is a text file listing the keywords corresponding to each decoded utterance. Finally, the third parameter is the name of the generated output report. Figure 5.11 shows the set of messages displayed in the command window when the tool was executed.

Figure 5.11: rEvaluator: Execution Messages


Chapter 6

Summary

This chapter recapitulates the main lessons learned and conclusions gathered during the development of the Keyword Based Interactive Automatic Speech Recognition System. Additionally, it highlights the main areas of improvement and possible topics for future work.

6.1 Conclusions

The Automatic Speech Recognition technology has been in constant development during the last fifty years. The possibility of recognizing and understanding human speech has driven numerous research groups to develop powerful and complex systems able to use and handle extensive vocabularies. Furthermore, a number of commercial tools are now available to be used in automotive and telephone-based applications, among others. However, the day when a machine is able to understand continuous and spontaneous speech still seems to be far away in the future. The accuracy and performance of modern ASR systems are strictly related to the size of their vocabulary, the complexity of the algorithms used to decode the input speech and, finally, the size of the speech corpus utilized to train the system. For these reasons, the most accurate ASR systems are also the ones that require more computational power in order to work correctly. One important research area is related to the development of fast and efficient algorithms that decrease the processing power needed to execute large-vocabulary ASRs. One example is the implementation of hybrid ASR systems that use neural networks and hidden Markov models (MLP-HMM). Another example is the utilization of A* algorithms (stack decoding) instead of the well-known Viterbi and beam-search algorithms to decode speech. Although the task of recognizing and understanding speech is linked to several different research areas such as language modeling and digital signal processing, the main building blocks of an ASR system are now well defined.


Thus, a research group can focus on improving one or more particular areas, such as the front end or the speech decoder. There is a small number of open-source ASR systems used by the research community to implement and compare new solutions in the ASR domain against established benchmarks. Using available tools such as PocketSphinx proved to be a good strategy, as it simplified the process of creating an interactive ASR application. In this regard, even though PocketSphinx is a free open-source library, its documentation is still somewhat incomplete compared to other Sphinx decoders, most notably Sphinx 4. Nonetheless, one of the main advantages of PocketSphinx is that it can be compiled for different platforms, such as Windows, Linux, iOS and Android. Furthermore, PocketSphinx was designed to handle live audio applications such as dictation. For this reason, it is the fastest decoder of the CMU Sphinx family. Nevertheless, it is also one of the least flexible compared to Sphinx 4, which is written in Java. Consequently, trying to change or improve the algorithms used by PocketSphinx can be a troublesome and complex task. In fact, in most applications, the library is used without any changes made by the developers. Although PocketSphinx can be used in large-vocabulary applications, the overall best performance in terms of execution time can be reached using small vocabularies. In general, as the vocabulary gets smaller, the execution time gets lower as well. During the development of our keyword-based ASR application, we tested different language models and phonetic dictionaries in order to maximize the overall accuracy. In the final round of tests, the ASR application managed to get a keyword decoding accuracy of 85% while maintaining an average execution time of around 0.5 seconds per sentence. Since the early stages of development of the ASR system, we aimed to develop tools to automate the process of generating the configuration files needed to run the recognizer. In the end we successfully created tools to generate the language model, the dictionary and the set of predefined responses to be played when a keyword is identified. In a similar fashion, the process of evaluating the accuracy (KDA) was automated. As a result of this automation, it is possible to change the set of keywords and generate the corresponding configuration files in no more than a couple of hours. Furthermore, it is possible to run the application and use the new files without re-compiling the project.

6.2 Future Work

Although we managed to successfully complete the development of the keyword based ASR system, there are some areas that can be improved in order to enhance its operation. For example, the application can be updated to support more languages. In this regard, the main focus would be the creation of a phonetic dictionary for the desired language. Unfortunately, this can become an extensive task, as it requires a deep understanding of the given language.

Regarding the ASR decoder, it would be desirable to experiment with other decoding strategies, such as hybrid hidden Markov models that incorporate multilayer perceptrons (MLP-HMM). Similarly, it is desirable to test more decoding algorithms, such as stack (A*) decoding. However, the lack of flexibility of PocketSphinx might represent a problem when trying to change or improve the current algorithms. Even though the overall accuracy of the speech recognizer application is fairly high, it can still be improved. For this reason, more testing is needed in order to identify areas of improvement. For example, it would be beneficial to quantify the effect of speech variability, such as the speaker's accent, on the overall accuracy of the system. Finally, it would be recommended to cross-compile the ASR project in order to run it on an embedded system. During the development of the project we were able to execute the speech recognition application using Windows and Linux; however, it would be interesting to evaluate the application's performance using other operating systems and different hardware.


Bibliography

[1] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, et al. Automatic speech recognition and speech variability: A review. Speech Communication, 49(10-11):763–786, 2007.

[2] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on, 28(4):357–366, 1980.

[3] H.L. Doe. Evaluating the effects of automatic speech recognition word accuracy. PhD thesis, Virginia Polytechnic Institute and State University, 1998.

[4] eSpeak: Speech Synthesizer. http://espeak.sourceforge.net, May 2011.

[5] FreeTTS. http://freetts.sourceforge.net, May 2011.

[6] J.P. Haton. Speech analysis for automatic speech recognition: A review. In Speech Technology and Human-Computer Dialogue, 2009. SpeD'09. Proceedings of the 5-th Conference on, pages 1–5. IEEE.

[7] X. Huang, F. Alleva, M.Y. Hwang, and R. Rosenfeld. An overview of the sphinx-ii speech recognition system. In Proceedings of the workshop on Human Language Technology, pages 81–86. Association for Computational Linguistics, 1993.

[8] D. Huggins-Daines, M. Kumar, A. Chan, A.W. Black, M. Ravishankar, and A.I. Rudnicky. PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 1, pages I–I. IEEE, 2006.

[9] B.H. Juang and S. Furui. Automatic recognition and understanding of spoken language - a first step toward natural human-machine communication. Proceedings of the IEEE, 88(8):1142–1165, 2000.

[10] B.H. Juang and L.R. Rabiner. Automatic speech recognition - A brief history of the technology development. Encyclopedia of Language and Linguistics, Elsevier, 2005.

[11] D. Jurafsky, J.H. Martin, A. Kehler, K. Vander Linden, and N. Ward. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, volume 163. MIT Press, 2000.

[12] A. Kumar, A. Tewari, S. Horrigan, M. Kam, F. Metze, and J. Canny. Rethinking speech recognition on mobile devices.

[13] K.F. Lee, H.W. Hon, and R. Reddy. An overview of the SPHINX speech recognition system. Acoustics, Speech and Signal Processing, IEEE Transactions on, 38(1):35–45, 1990.

[14] M. Matassoni, M. Omologo, D. Giuliani, and P. Svaizer. Hidden Markov model training with contaminated speech material for distant-talking speech recognition. Computer Speech & Language, 16(2):205–223, 2002.

[15] N. Morgan and H. Bourlard. Continuous speech recognition. Signal Processing Magazine, IEEE, 12(3):24–42, 1995.

[16] D. O'Shaughnessy. Interacting with computers by voice: automatic speech recognition and synthesis. Proceedings of the IEEE, 91(9):1272–1305, 2003.

[17] D.B. Paul. An efficient A* stack decoder algorithm for continuous speech recognition with a stochastic language model. In ICASSP, pages 25–28. IEEE, 1992.

[18] S. Phadke, R. Limaye, S. Verma, and K. Subramanian. On design and implementation of an embedded automatic speech recognition system. In VLSI Design, 2004. Proceedings. 17th International Conference on, pages 127–132. IEEE, 2004.

[19] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler, et al. The 1996 hub-4 sphinx-3 system. In Proc. DARPA Speech recognition workshop, pages 85–89. Citeseer, 1997.

[20] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[21] M.K. Ravishankar. Efficient Algorithms for Speech Recognition. PhD thesis, Citeseer, 2005.

[22] Google Translate. http://translate.google.com, May 2011.

[23] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel. Sphinx-4: A flexible open source framework for speech recognition. Sun Microsystems, Inc., Mountain View, CA, USA, page 18, 2004.

[24] CMUSphinx Wiki. http://cmusphinx.sourceforge.net/wiki, May 2011.

Appendix A

Guidelines

The following sections describe how to use the set of tools developed during the speech recognition project. They also describe how to install the Pocketsphinx project and how to generate the files needed to run the main application. It is strongly recommended to follow the instructions section by section in order to take advantage of the given examples and the main program.

A.1 Installation of the speech recognizer project

In order to install Pocketsphinx and the ASR recognizer application, the user must perform the following steps:

1. Extract the contents of the compressed file containing the complete ASR project into a new folder, for instance C:\ASRProject\. Inside the ASRProject folder, seven new folders will be created, as seen in Figure A.1.

Figure A.1: ASR Project Folder Structure

2. Go to the folder sphinxbase inside the ASRProject folder and open the sphinxbase.sln file using Visual Studio.

3. Once the sphinxbase project is open, go to the Build menu and select Build Solution. After building the project, Visual Studio should show the message "build succeeded".

4. Go to the folder \sphinxbase\bin\debug and copy the file sphinxbase.dll into the folder \pocketsphinx\bin\debug.

5. Now go to the folder pocketsphinx inside the ASRProject folder and open the pocketsphinx.sln file using Visual Studio.

6. Once the pocketsphinx project is open, go to the Build menu and select Build Solution. After building the project, Visual Studio should show the message "build succeeded".

7. The recognizer application is now ready to be run using a language model and a dictionary.
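As an optional alternative to building inside the Visual Studio IDE, the two solutions can also be compiled from a Visual Studio command prompt with MSBuild. The two commands below are only a sketch: they assume the project was extracted to C:\ASRProject\ as in step 1 and that msbuild.exe is available on the PATH.

msbuild C:\ASRProject\sphinxbase\sphinxbase.sln /p:Configuration=Debug
msbuild C:\ASRProject\pocketsphinx\pocketsphinx.sln /p:Configuration=Debug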

A.2 Running the speech recognition application

Once Pocketsphinx has been installed correctly, the user can execute the recognizer by specifying an acoustic model, a language model and a dictionary. The following steps must be performed in order to execute the application:

1. Open a command window, go to \ASRProject\pocketsphinx\bin\debug and type the following command:

recognizer -hmm ../../model/hmm/en_US/hub4wsj_sc_8k/ -lm ../../model/lm/en/tekniska/TextCorpusFiltered.lm.DMP -dict ../../model/lm/en/tekniska/TextCorpusFiltered.dic

Where:
• hub4wsj_sc_8k is the folder containing the acoustic model to be used by the application.
• TextCorpusFiltered.lm.DMP is the language model.
• TextCorpusFiltered.dic is the phonetic dictionary.

2. The command window will display a set of diagnostic messages and the computer will play the sound message "I'm listening". Additionally, the command window will display "READY..." as in Figure A.2.


Figure A.2: Message

3. Now, the user should speak any phrase containing any of the predefined keywords supported by the program. For example: "What is TECHNOLOGY?"

4. The computer should decode the input utterance and identify the specified keyword TECHNOLOGY. Then it will play one of the predefined responses for that keyword, for example: "Technology is the usage and knowledge of tools and systems in order to solve a problem". At the same time, as seen in Figure A.3, some diagnostic messages will be displayed.

Figure A.3: Decoding Example

5. After the computer completes playing the response message, it will show "READY..." in the command window again, and the user can say another phrase. The whole process is repeated until the user commands the computer to close the application, which is achieved by saying the command "out".

6. Once the command is identified, the computer will play an exit message ("bye") and the application will close. This is illustrated in Figure A.4.

Figure A.4: Decoding Example

A.3 Generating a new language model

In order to generate a new language model, the user should create two text files:

• One file should contain the list of keywords to be used by the recognizer application.
• One file should contain a list of URLs where the keywords are used extensively. For example, the word NASA can be found many times in the URL http://en.wikipedia.org/wiki/NASA.

Currently there is no limit on the number of URLs supported; for this reason, the user can specify more than one URL per keyword. An example of these files is seen in Figure A.5, where each keyword has an associated URL.
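For illustration only (the authoritative example is the one shown in Figure A.5, and the entries below are made up), the two files could look as follows, with one keyword per line in Keyword_List.txt and one web page per line in URL_List.txt. According to the TextExtractor source in Appendix B, lines starting with // in the URL list are treated as comments and ignored.

Keyword_List.txt:
TECHNOLOGY
NASA

URL_List.txt:
//Pages used to build the text corpus
http://en.wikipedia.org/wiki/Technology
http://en.wikipedia.org/wiki/NASA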

Figure A.5: Keywords and URLs

For simplification purposes, the files Keyword_List.txt and URL_List.txt can be edited in order to specify keywords and URLs. These two files are located in the root folder of the project. The next step uses the TextExtractor tool and the CMUCLMTK tool to generate a new language model. This can be achieved using the following procedure:

1. Open a command window and go to the \ASRProject\Tools folder.

2. Generate a text corpus using the TextExtractor tool by executing the following command:

java -jar TextExtractor.jar ..\URL_List.txt ..\GeneratedFiles\LanguageModels\TCorpus.txt ..\Keyword_List.txt

Where:
• The first and third arguments specify the input files.
• The second argument specifies the name of the text corpus.

3. After executing the tool, a file called TCorpusFiltered.txt will be created in the LanguageModels folder.

4. Now execute the CMUCLMTK tool with the following command:

GenerateLM.bat ..\GeneratedFiles\LanguageModels\TCorpusFiltered.txt ..\GeneratedFiles\LanguageModels\TCorpus.ccs

Where:
• The first argument corresponds to the text corpus recently created.


• The second parameter is the filler dictionary, which specifies the meaning of the <s> and </s> markers in the text corpus.

Note: The GenerateLM.bat file needs to be edited to specify the paths for the CMUCLMTK and sphinxbase folders.

5. After executing the CMUCLMTK tool, the TCorpusFiltered.lm.DMP file will be created in the LanguageModels folder. This file contains a new language model that uses the set of defined keywords.
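For reference, the filtered text corpus produced in step 3 consists of upper-case sentences wrapped in the <s> and </s> markers mentioned above. A made-up line of TCorpusFiltered.txt would therefore look roughly like this:

<s> TECHNOLOGY IS THE USAGE AND KNOWLEDGE OF TOOLS AND SYSTEMS </s>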

A.4 Generating a new dictionary

The process of generating a new dictionary is very straightforward, as only one command needs to be executed. Assuming that the user has a command window open, the following command has to be executed in the \ASRProject\Tools folder:

java -jar dictGenerator.jar ..\pocketsphinx\model\lm\en_US\cmu07a.dic ..\GeneratedFiles\LanguageModels\TCorpusFiltered.vocab ..\GeneratedFiles\LanguageModels\TCorpusFiltered.dic

Where:
• The first argument is the name of the CMU phonetic dictionary.
• The second argument is the name of the file containing the vocabulary used in the text corpus.
• The third argument is the name of the output dictionary.

It is important to clarify that cmu07a.dic is a file included in the Pocketsphinx package, while TCorpusFiltered.vocab is a file created when the CMUCLMTK tool generates a language model.
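The generated dictionary follows the usual CMU/PocketSphinx convention of one word per line followed by its phone sequence. The entry below is only an illustration of that format, based on the standard CMU pronunciation of the word, and is not actual output of dictGenerator:

TECHNOLOGY    T EH K N AA L AH JH IY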

A.5 Generating a new set of predefined responses

A new set of predefined responses (audio files) can be created with the RespGenerator tool whenever the set of keywords is changed. The tool receives a text file containing a variable number of predefined responses for each keyword. An example of this file is presented in Figure A.6. As can be seen, each keyword is preceded by a hash (#) character and followed by a number of sentences. It is important to mention that a list of DEFAULT sentences is included even though DEFAULT is not a keyword; these sentences are used as responses in case the speech recognizer application is not able to decode any keyword from the given input utterance.
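To make the layout concrete, a minimal response file could look like the fragment below; the first sentence is the TECHNOLOGY response quoted in Section A.2, while the DEFAULT sentence is an invented placeholder (the authoritative example is Figure A.6):

#TECHNOLOGY
Technology is the usage and knowledge of tools and systems in order to solve a problem.
#DEFAULT
Sorry, no keyword was recognized, please try asking in a different way.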


Figure A.6: Example of a Set of Predefined Responses

For simplification purposes, the file Keyword_Responses.txt can be edited in order to specify the set of possible responses for every keyword. This file is located in the root folder of the project. Then, the RespGenerator tool can be used to create the new sound files using the following procedure:

1. Open a command window and go to the \ASRProject\Tools folder.

2. Execute the RespGenerator tool using the following command:

java -jar RespGenerator.jar ..\Keyword_Responses.txt ..\GeneratedFiles\AcousticResponses\

Where:
• The first argument is the file containing the possible responses.
• The second argument is the destination folder for the output sound files.

3. When the tool has finished generating the output files, the AcousticResponses folder should contain a list of folders whose names correspond to each keyword, as in Figure A.7. Inside each folder, there should be a number of mp3 files containing the responses.

Figure A.7: Responses Structure


4. Using a file converter tool such as Audacity, convert all the generated mp3 files into ogg format. Make sure the new files have the same names as the mp3 files.

Figure A.8: OGG Files
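If a batch conversion is preferred over a graphical tool such as Audacity, the mp3-to-ogg step can also be scripted. The one-liner below is only a sketch for the Windows command prompt and assumes that ffmpeg, built with Vorbis support, is installed and on the PATH:

for /R ..\GeneratedFiles\AcousticResponses %f in (*.mp3) do ffmpeg -i "%f" -c:a libvorbis "%~dpnf.ogg"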

A.6 Running the speech recognizer using the newly created files

After generating a new language model and a new dictionary, the files created in the LanguageModels folder need to be copied to the Pocketsphinx folder. Similarly, the acoustic responses need to be incorporated into Pocketsphinx in order to be used by the recognizer application. This can be done using the command line. Assuming that the current folder is \ASRProject\Tools, the user should execute the following commands:

copy ..\GeneratedFiles\LanguageModels\TCorpusFiltered.dic ..\pocketsphinx\model\lm\en\tekniska\
copy ..\GeneratedFiles\LanguageModels\TCorpusFiltered.lm.DMP ..\pocketsphinx\model\lm\en\tekniska\
copy ..\GeneratedFiles\AcousticResponses\kwords.txt ..\pocketsphinx\bin\debug\
xcopy ..\GeneratedFiles\AcousticResponses\* ..\pocketsphinx\bin\debug\Responses /s /i

After the files have been placed in the corresponding Pocketsphinx folder, the recognizer application can be run using the newly created files. First the user should go to the \ASRProject\pocketsphinx\bin\debug folder and then execute the recognizer:

recognizer -hmm ../../model/hmm/en_US/hub4wsj_sc_8k/ -lm ../../model/lm/en/tekniska/TCorpusFiltered.lm.DMP -dict ../../model/lm/en/tekniska/TCorpusFiltered.dic
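Because these copy steps have to be repeated every time new files are generated, they can be collected into a small batch file. The script below is only a sketch that bundles the commands already listed above; the file name deploy_and_run.bat is made up, and it is assumed to be saved in and run from the \ASRProject\Tools folder.

@echo off
rem Copy the generated language model, dictionary and responses into Pocketsphinx
copy ..\GeneratedFiles\LanguageModels\TCorpusFiltered.dic ..\pocketsphinx\model\lm\en\tekniska\
copy ..\GeneratedFiles\LanguageModels\TCorpusFiltered.lm.DMP ..\pocketsphinx\model\lm\en\tekniska\
copy ..\GeneratedFiles\AcousticResponses\kwords.txt ..\pocketsphinx\bin\debug\
xcopy ..\GeneratedFiles\AcousticResponses\* ..\pocketsphinx\bin\debug\Responses /s /i
rem Launch the recognizer with the newly created files
cd ..\pocketsphinx\bin\debug
recognizer -hmm ../../model/hmm/en_US/hub4wsj_sc_8k/ -lm ../../model/lm/en/tekniska/TCorpusFiltered.lm.DMP -dict ../../model/lm/en/tekniska/TCorpusFiltered.dic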


A.7 Evaluating the performance and measuring the KDA

Each time the speech recognizer application is run and closed, a log file is created. This file contains each decoded sentence as well as the keyword identified for each utterance. Using the rEvaluator tool, this file can be used to measure the accuracy (KDA) for the complete set of utterances. First, the user needs to create a verification file containing the list of keywords used in each utterance that the recognizer tool had to decode. For example, suppose that just four utterances were used the last time the recognizer tool was run. The verification file should then look like Figure A.9.
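For the four-utterance example above, and using the keywords that appear in the report of Table A.1, the verification file would simply list one expected keyword per utterance, in the order they were spoken (the authoritative layout is the one shown in Figure A.9):

TECHNOLOGY
TELEPHONE
WOMEN
MECHANICAL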

Figure A.9: Verification File

Note: This file should be placed in the ResultsEvaluation folder.

Now assume that the speech recognizer tool generated a log file called HypLogFile_May25.txt the last time it was run. In order to generate the results file, the user needs to follow this sequence of steps:

1. Copy the log file from the pocketsphinx folder to the ResultsEvaluation folder:

copy Logfiles\HypLogFile_May25.txt ..\..\..\GeneratedFiles\ResultsEvaluation\

2. Execute the rEvaluator tool using the following command:

java -jar rEvaluator.jar ..\GeneratedFiles\ResultsEvaluation\HypLogFile_May25_004516.txt ..\GeneratedFiles\ResultsEvaluation\key_utterances.txt ..\GeneratedFiles\ResultsEvaluation\Results.csv

Where:
• The first argument is the log file created the last time the recognizer was run.
• The second argument is the verification file containing the list of keywords used for each input utterance.
• The third argument is the name of the comma-separated report file.


After running the rEvaluator tool, the newly generated report file should contain measurements regarding the accuracy of the speech recognizer, as in the following example:

Keyword            Matches    Sentences    KDA(%)
TECHNOLOGY         1          1            100
TELEPHONE          1          1            100
WOMEN              1          1            100
MECHANICAL         0          1            0
ALPHABET           -          -            -
OVERALL METRICS    3          4            75

Table A.1: KDA Report Format
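The KDA column in this report is simply the ratio of matched keywords to test utterances expressed as a percentage, that is, KDA(%) = (Matches / Sentences) × 100; for the overall row above, 3/4 × 100 = 75%. This reading is inferred from the numbers in Table A.1 rather than stated explicitly by the tool's output.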

Appendix B

Source Code

The following pages of this appendix include the source code for the main program and all the tools created during the development of this project:

• Recognizer - Main Application • TextExtractor - Tool • DictGenerator - Tool • RespGenerator - Tool • rEvaluator - Tool


B.1 Recognizer

/*
File: recognizer.c
Description: Adapted from pocketsphinx_continuous/continuous.c in order to develop an
interactive ASR system. The ASR system reacts based on identified "keywords" in the
incoming speech signal. An *.OGG audio file with information related to the identified
keyword is played. Additional information is played whenever a keyword is not identified.
Author(s): Ivan Castro, Andrea Garcia
Email(s): [email protected], [email protected]
*/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "pocketsphinx.h"
#include "err.h"
#include "ad.h"
#include "cont_ad.h"
#include "bass.h"

#if !defined(_WIN32_WCE)
#include <signal.h>
#include <setjmp.h>
#endif
#if defined(WIN32) && !defined(GNUWINCE)
#include <time.h>
#else
#include <sys/types.h>
#include <sys/time.h>
#endif

#define NUM_KWORDS 20 //Number of available keywords

static const arg_t cont_args_def[] = {
    POCKETSPHINX_OPTIONS,
    /* Argument file. */
    { "-argfile", ARG_STRING, NULL, "Argument file giving extra arguments." },
    { "-adcdev", ARG_STRING, NULL, "Name of audio device to use for input." },
    CMDLN_EMPTY_OPTION
};

static ad_rec_t *ad;
static ps_decoder_t *ps;

//Keyword structure
typedef struct keyword{
    char *kWordName;  //keyword
    int numSounds;    //available audio files per keyword
}Keyword;

char *kWordList[NUM_KWORDS];  //List of existing keywords
int soundList[NUM_KWORDS];    //Number of sound recordings per keyword
char *unknownkWord;    //Name of the dir with recordings to play in case of an unknown keyword
int numUnknownSound;   //Number of recordings to play in case of an unknown keyword
const char *keyFile = "kwords.txt";   //Name of the file containing the keywords
const char *soundDir = "Responses/";  //Directory containing the recordings


const char *startSoundFile = "Responses/EXTRA/EXTRA1.ogg"; //Recording to be played at startup
const char *endSoundFile = "Responses/EXTRA/EXTRA2.ogg";   //Recording to be played at the end
FILE *hypLog;          //File used to log information about the obtained hypothesis
const char *logFile;   //String containing the log filename

// Methods
Keyword getKeyword(char const *hyp);
static void loadKeywords();
void playOGG(char const *oggFile);
char const *selectOGG(char const *keyword, int numOGG);
void terminate(int sig);
static void sighandler(int signo);
static void sleep_msec(int32 ms);
static void utterance_loop();
int main(int argc, char *argv[]);

/* Play an OGG audio file */ void playOGG(char const *oggFile){ DWORD chan,act,time; QWORD pos; // check the correct BASS was loaded if (HIWORD(BASS_GetVersion())!=BASSVERSION) { printf("An incorrect version of BASS was loaded\n"); return; } // setup output - default device if (!BASS_Init(-1,44100,0,0,NULL)){ printf("Error(%d): Can’t initialize device\n",BASS_ErrorGetCode()); BASS_Free(); exit(0); } if(!(chan=BASS_StreamCreateFile(FALSE, oggFile, 0, 0, BASS_SAMPLE_FLOAT))){ printf("Error(%d): Can’t play the file %s\n",BASS_ErrorGetCode(),oggFile); BASS_Free(); return; } // play the audio file BASS_ChannelPlay(chan,FALSE); while (act=BASS_ChannelIsActive(chan)){ pos=BASS_ChannelGetPosition(chan,BASS_POS_BYTE); time=BASS_ChannelBytes2Seconds(chan,pos); sleep_msec(50); } BASS_ChannelSlideAttribute(chan,BASS_ATTRIB_FREQ,1000,500); sleep_msec(300); BASS_ChannelSlideAttribute(chan,BASS_ATTRIB_VOL,-1,200); while (BASS_ChannelIsSliding(chan,0)) sleep_msec(1); BASS_Free(); } /* Select an audio file based on the specified keyword */ char const *selectOGG(char const *keyword, int numOGG){ int selFile, len, rNum; char fileNum[3]; char *oggFile;


static int first_time = 1;

if( first_time ){ first_time = 0; srand( (unsigned int)time( NULL ) ); } rNum = rand() % (numOGG + 1); selFile = max(rNum,1); itoa(selFile,fileNum,10); len = strlen(soundDir)+strlen(keyword)*2+strlen(fileNum)+6; oggFile = (char *)malloc(len * sizeof(char)); strcpy(oggFile,soundDir); strcat(oggFile,keyword); strcat(oggFile,"/"); strcat(oggFile,keyword); strcat(oggFile,fileNum); strcat(oggFile,".ogg"); oggFile[len-1]=’\0’; return oggFile; } /* Load predefined keywords into an array */ static void loadKeywords(){ FILE *key; char line[50]; char number[2]; int len, wordLen, i, j, k; int buffer = 50; int numKeywords = 0; key = fopen (keyFile, "r"); if (key == NULL) { fprintf(stderr, "Error when opening the keyword file!\n"); exit(1); } while( fgets(line, buffer, key) != NULL){ //Remove ’\n’ from the read line len = strlen(line); if( line[len-1] == ’\n’ ){ line[len-1] = ’\0’; } if(line[0] != ’#’){ wordLen = strcspn(line,","); //Add the word into the keyword list kWordList[numKeywords] = (char *)malloc((wordLen+1) * sizeof(char)); strncpy(kWordList[numKeywords],line,wordLen); kWordList[numKeywords][wordLen]=’\0’; //Number of recordings of the current keyword i = wordLen+1; j = 0; while (line[i] != NULL){ number[j] = line[i]; i++; j++;


} soundList[numKeywords] = atoi(number); numKeywords++; } else{ wordLen = strcspn(line,",")-1; //Don’t count the # //Add the word into the keyword list unknownkWord = (char *)malloc((wordLen+1) * sizeof(char)); i = 1; j = 0; k = 0; while (line[i] != NULL){ if (iread_ts; /* Decode utterance until end (marked by a "long" silence, >1sec) */ for (;;) { /* Read non-silence audio data, if any, from continuous listening module */ if ((k = cont_ad_read(cont, adbuf, 4096)) < 0) E_FATAL("cont_ad_read failed\n"); if (k == 0) { /* * No speech data available; check current timestamp with most recent * speech to see if more than 1 sec elapsed. If so, end of utterance. */ if ((cont->read_ts - ts) > DEFAULT_SAMPLES_PER_SEC) break; } else {


/* New speech data received; note current timestamp */ ts = cont->read_ts; } /* * Decode whatever data was read above. */ rem = ps_process_raw(ps, adbuf, k, FALSE, FALSE); /* If no work to be done, sleep a bit */ if ((rem == 0) && (k == 0)) sleep_msec(20); } /* * Utterance ended; flush any accumulated, unprocessed A/D data and stop * listening until current utterance completely decoded */ ad_stop_rec(ad); while (ad_read(ad, adbuf, 4096) >= 0); cont_ad_reset(cont); printf("Stopped listening, please wait...\n"); fflush(stdout); /* Finish decoding, obtain and print result */ ps_end_utt(ps); clock_start=clock(); hyp = ps_get_hyp(ps, &score, &uttid); elapsed=(double)(clock()- clock_start)/CLOCKS_PER_SEC; printf("\n************************************\n"); printf("Elapsed time: %f seconds\n",elapsed); printf("%s: %s (%d)\n", uttid, hyp, score); /* Exit if the first word spoken was OUT */ if (hyp){ sscanf(hyp, "%s", word); if (strcmp(word, "OUT") == 0){ printf("************************************\n"); playOGG(endSoundFile); break; } } /***************************************************************************/ // Obtain the keyword from the current hypothesis kword = getKeyword(hyp); if (strcmp(kword.kWordName,"UNKNOWN") != 0){ printf("Keyword: %s\n", kword.kWordName); //Select and play information related to the keyword oggFile = selectOGG(kword.kWordName,kword.numSounds); printf("Selected file: %s\n",oggFile); printf("************************************\n"); playOGG(oggFile); } else{ //Select and play additional information printf("No keyword identified!\n"); oggFile = selectOGG(unknownkWord,numUnknownSound); printf("Playing additional information...\n"); printf("Selected file: %s\n",oggFile);


printf("************************************\n"); playOGG(oggFile); } // Print the result data into the logfile fprintf(hypLog,"%s: %s (%d) %f %s\n", uttid, hyp, score, elapsed, kword); /***************************************************************************/ fflush(stdout); /* Resume A/D recording for next utterance */ if (ad_start_rec(ad) < 0) E_FATAL("ad_start_rec failed\n"); } fclose (hypLog); cont_ad_close(cont); printf("\nExiting...\n"); printf("Results were written to: %s\n", hypLogFile); } static jmp_buf jbuf; static void sighandler(int signo){ longjmp(jbuf, 1); } int main(int argc, char *argv[]){ cmd_ln_t *config; char const *cfg; /* Make sure we exit cleanly (needed for profiling among other things) */ /* Signals seem to be broken in arm-wince-pe. */ #if !defined(GNUWINCE) && !defined(_WIN32_WCE) signal(SIGINT, &sighandler); #endif if (argc == 2) { config = cmd_ln_parse_file_r(NULL, cont_args_def, argv[1], TRUE); } else { config = cmd_ln_parse_r(NULL, cont_args_def, argc, argv, FALSE); } /* Handle argument file as -argfile. */ if (config && (cfg = cmd_ln_str_r(config, "-argfile")) != NULL) { config = cmd_ln_parse_file_r(config, cont_args_def, cfg, FALSE); } if (config == NULL) return 1; ps = ps_init(config); if (ps == NULL) return 1; if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"), (int)cmd_ln_float32_r(config, "-samprate"))) == NULL) E_FATAL("ad_open_dev failed\n"); if (setjmp(jbuf) == 0) { utterance_loop(); }


ps_free(ps); ad_close(ad); return 0; }


B.2 TextExtractor

/* File: tExtract.java Description: This program generates a filtered text corpus based on a set of URLs. The output format is compatible with the CMUCLMTK tool. Author(s): Ivan Castro, Andrea Garcia Email(s): [email protected], [email protected] */

import java.io.*;
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleSentencesExtractor;
import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class tExtract {
    public static void main(String[] args) throws Exception {

        String rawText = "";
        String tmpText = "";
        String inputFile = "";
        String outputFile = "";
        String keywordsFile = "";

// Process command line arguments if (args.length == 0){ // Display an error, at least one argument should be input System.err.println ("Error: No input arguments were specified"); System.exit(1); } else if(args.length == 1){ // The user provides the path and name of the input file inputFile = args[0]; inputFile = inputFile.replace("\\", "/"); // Create the output file in the same folder as the input file outputFile = inputFile.substring(0, inputFile.lastIndexOf("/")+1) + "outputCorpus.txt"; } else if(args.length == 2){ // The user provides the path and name of the input and output files, // but no keywords file inputFile = args[0]; inputFile = inputFile.replace("\\", "/"); outputFile = args[1]; outputFile = outputFile.replace("\\", "/"); } else if(args.length > 2){ // The user provides the path and name of the input, output and keywords // files inputFile = args[0]; inputFile = inputFile.replace("\\", "/"); outputFile = args[1]; outputFile = outputFile.replace("\\", "/"); keywordsFile = args[2]; keywordsFile = keywordsFile.replace("\\", "/"); } try{


File urlListFile = new File(inputFile); BufferedReader urlBuf = new BufferedReader(new FileReader(urlListFile)); String line = null; ArrayList urlList = new ArrayList(); // Read all of the lines in the file while ((line = urlBuf.readLine()) != null) { // Discard commented lines in the input file if(!line.startsWith("//", 0)) urlList.add(line); } URL[] url = new URL[urlList.size()]; for (int ind = 0; ind < urlList.size(); ind++) { System.out.println("Processing url number ".concat(Integer.toString(ind+1))); url[ind] = new URL(urlList.get(ind)); //Append the last web page into the text string rawText = rawText + ArticleSentencesExtractor.INSTANCE.getText(url[ind]) + "\n"; } // Remove new line characters //tmpText = rawText.replace("\r" , " ").replace("\n" , " "); // Remove possessive form (’s) tmpText = rawText.replace("’s" , ""); // Remove various grammatical symbols, like comma, :, ;, ", ’,etc... tmpText = tmpText.replaceAll("[,|\"|/|:|;|’|(|)|!|?]", " "); // Remove acronyms like i.e and e.g tmpText = tmpText.replace("i.e.", " ").replace("e.g.", " "); // Replace &, % and @ tmpText = tmpText.replace("%", " percent "); tmpText = tmpText.replace("&", " and "); tmpText = tmpText.replace("@", " at "); // Replace currency symbols tmpText = tmpText.replace(Character.toString((char)163), " pounds "); tmpText = tmpText.replace(Character.toString((char)128), " euro "); tmpText = tmpText.replace(Character.toString((char)36), " dollars "); // Replace stressed vowels tmpText = tmpText.replace("´ a", tmpText = tmpText.replace("´ e", tmpText = tmpText.replace("´ ı", tmpText = tmpText.replace("´ o", tmpText = tmpText.replace("´ u",

"a"); "e"); "i"); "o"); "u");

//letter //letter //letter //letter //letter

a e i o u

with with with with with

acute acute acute acute acute

// Replace left and right quotation marks (single and double)
tmpText = tmpText.replace(Character.toString((char)145), " ");
tmpText = tmpText.replace(Character.toString((char)146), " ");
tmpText = tmpText.replace(Character.toString((char)147), " ");
tmpText = tmpText.replace(Character.toString((char)148), " ");

// Replace known acronyms like US and UK
tmpText = tmpText.replace(" US ", " United States ");
tmpText = tmpText.replace("U.S.", " United States ");
tmpText = tmpText.replace(" UK ", " United Kingdom ");
tmpText = tmpText.replace("U.K.", " United Kingdom ");
tmpText = tmpText.replace("A.M.", " before noon ");
tmpText = tmpText.replace(" AM ", " before noon ");
tmpText = tmpText.replace("a.m.", " before noon ");

tmpText = tmpText.replace("P.M.", " after noon "); tmpText = tmpText.replace(" PM ", " after noon "); tmpText = tmpText.replace("p.m.", " after noon ");

// Remove square brackets and their contents tmpText = tmpText.replaceAll("\\[.*?\\]", ""); // Replace numbers with it’s text representation String numregex = "(\\b\\d+)|(\\.\\d+)"; String match = ""; String numStr = ""; Pattern p = Pattern.compile(numregex); Matcher m = p.matcher(tmpText); int numCnt = 0; while (m.find()){ numCnt++; match = tmpText.substring(m.start(),m.end()); if (match.contains(".")){ numStr = "point "; match = match.replace(".", ""); } numStr = numStr + NumTranslate.translate(Integer.parseInt(match)) ; tmpText = tmpText.replaceFirst(match, numStr); numStr = ""; m = p.matcher(tmpText); } // Replace ... tmpText = tmpText.replace("...", ""); // Replace some decimal points with spaces tmpText = tmpText.replaceAll("\\b\\.\\b", " "); // Remove some decimal points with end of line characters tmpText = tmpText.replaceAll("[\\s]*\\.[\\s]*", "\n"); // Remove tmpText = tmpText = tmpText = tmpText = tmpText =

dashes tmpText.replace(Character.toString((char)45), " "); // hypen tmpText.replace(Character.toString((char)150), " "); // En dash tmpText.replace(Character.toString((char)151), " "); // Em dash tmpText.replace(Character.toString((char)8211), " "); // En dash tmpText.replace(Character.toString((char)8212), " "); // Em dash

// remove weird spaces tmpText = tmpText.replace(Character.toString((char)160), " "); // Replace Roman Numbers tmpText = tmpText.replace(" tmpText = tmpText.replace(" tmpText = tmpText.replace(" tmpText = tmpText.replace(" tmpText = tmpText.replace(" tmpText = tmpText.replace(" tmpText = tmpText.replace(" tmpText = tmpText.replace(" tmpText = tmpText.replace("


II ", "second"); III ", "third"); IV ", "fourth"); V ", "fifth"); VI ", "sixth"); VII ", "seventh"); VIII ", "eigth"); IX ", "ninth"); X ", "tenth");

// space

// Finally, remove anything that is not a word tmpText = tmpText.replaceAll("[^a-zA-Z\\s\\n]", " "); // Remove extra spaces between words tmpText = tmpText.replaceAll("\\b\\s{2,}\\b", " "); // Convert all the words to upper case tmpText = tmpText.toUpperCase(); // Optional. Add the and characters to each sentence tmpText = " " + tmpText; // First Line tmpText = tmpText.replaceAll("\\n", " \n "); tmpText = tmpText.substring(0, tmpText.length() -5); try{ // Optional file containing raw text extracted from the web pages System.out.println("Generating Raw Text File"); String rawOutputFile = outputFile.substring(0, outputFile.lastIndexOf("/")+1) + "rawOutputCorpus.txt"; Writer out1 = new OutputStreamWriter(new FileOutputStream(rawOutputFile)); out1.write(rawText); out1.close(); // Not-Filtered Corpus file System.out.println("Generating Corpus Text File"); Writer out = new OutputStreamWriter(new FileOutputStream(outputFile)); out.write(tmpText); out.close(); // Filtered Corpus File if (keywordsFile.length() > 0){ System.out.println("Generating Filtered Text File"); String filteredOutputFile = outputFile.substring(0, outputFile.lastIndexOf(".")) + "Filtered.txt"; filtCorpus.filterKeywords(outputFile, keywordsFile, filteredOutputFile); } // Context file (.ccs) used by the language model generator System.out.println("Generating Context Text File"); String OutputContextFile = outputFile.substring(0, outputFile.lastIndexOf(".")+1) + "ccs"; Writer contextFile = new OutputStreamWriter(new FileOutputStream(OutputContextFile)); contextFile.write("\n"); contextFile.close(); System.out.println("Output Text file(s) created"); }catch(Exception e) { System.err.println ("Error while writing the output file"); System.exit(1); } }catch(Exception e) { System.err.println ("Error while opening the input file"); System.exit(1); } } }


B.3 DictGenerator

/*
File: dGenerator.java
Description: This program generates a custom dictionary based on the cmu07.dic dictionary
and a given vocabulary. The purpose of this tool is to generate a dictionary including
just the words contained in a given language model.
Author(s): Ivan Castro, Andrea Garcia
Email(s): [email protected], [email protected]
*/

import java.io.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.util.Collections;

public class dGenerator {
    public static void main(String[] args) throws Exception {

        String masterDictFile = "";
        String inputVocabFile = "";
        String outputDictFile = "";
        String notFoundFile = "";

// Process command line arguments if (args.length < 2){ // Display an error, at least one argument should be input System.err.println ("Error: Less than two arguments were specified"); System.exit(1); } else if(args.length == 2){ // The user provides the path and name of the dictionary file and the // input vocabulary masterDictFile = args[0]; masterDictFile = masterDictFile.replace("\\", "/"); inputVocabFile = args[1]; inputVocabFile = inputVocabFile.replace("\\", "/"); // Create the output file in the same folder as the input vocabulary outputDictFile = inputVocabFile.substring(0, inputVocabFile.lastIndexOf("/")+1) + "outputDict.dic"; notFoundFile = inputVocabFile.substring(0, inputVocabFile.lastIndexOf("/")+1) + "notfound.txt"; } else if(args.length >= 3){ // The user provides the path and name of the input and output files masterDictFile = args[0]; masterDictFile = masterDictFile.replace("\\", "/"); inputVocabFile = args[1]; inputVocabFile = inputVocabFile.replace("\\", "/"); // Create the output file in the same folder as the input vocabulary outputDictFile = args[2]; outputDictFile = outputDictFile.replace("\\", "/"); notFoundFile = outputDictFile.substring(0,


inputVocabFile.lastIndexOf("/")+1) + "notfound.txt"; } try{ File inputDict = new File(masterDictFile); File vocab = new File(inputVocabFile); String line = null; int wordCounter = 0; int matchCounter = 0; BufferedReader readerVocab = new BufferedReader(new FileReader(vocab)); BufferedReader readerDict = new BufferedReader(new FileReader(inputDict)); StringBuffer contentsDict = new StringBuffer(); ArrayList rowList = new ArrayList(); ArrayList notFoundList = new ArrayList(); ArrayList filteredList = new ArrayList(); int endOfLine = 0; Pattern p; Matcher m; // get a matcher object // Read all of the lines in the dictionary System.out.println("Reading Master Dictionary"); while ((line = readerDict.readLine()) != null) { contentsDict.append(line).append("\n"); } // Read all of the lines in the vocabulary System.out.println("Building Custom Dictionary"); while ((line = readerVocab.readLine()) != null) { wordCounter++; matchCounter = 0; if (wordCounter % 100 == 0){ System.out.println(Integer.toString(wordCounter) + " words have been processed"); } // For each word in the vocabulary, look for its pronunciation in the // master dictionary line = "\n" + line.toLowerCase(); // add a new line character p = Pattern.compile(line + "(\\((.*?)\\))*" + "\t"); m = p.matcher(contentsDict); // get a matcher object while(m.find()) { matchCounter++; endOfLine = contentsDict.indexOf("\n", m.start() +1); rowList.add(contentsDict.substring(m.start()+1, endOfLine+1).toUpperCase()); } if (matchCounter == 0){ // Current word was not found in the dictionary notFoundList.add(line); } } System.out.println(Integer.toString(wordCounter) + " words found in the vocabulary file"); // Eliminate repetitions: for(int i=0;i contentsRes.size()) { System.err.println ("More Keywords than utterances were found!"); System.exit(1); } else if (keywords.size() < contentsRes.size()) { System.err.println ("Less Keywords than utterances were found!"); System.exit(1); } // Count the number of non-repeated keywords for(int i=0;i
