2013 International Symposium on Computational and Business Intelligence

Speaker Independent Speech Recognition Implementation with Adaptive Language Models

Anukriti, Sushant Tiwari, Tanmay Chatterjee, Mahua Bhattacharya
Information Technology, ABV-IIITM, Gwalior, India
[email protected], [email protected], [email protected], [email protected]

Abstract: A speech recognition system implements protocols and code that enable a machine to take sound waves as input, interpret them, and produce the desired output. In this paper, a method is proposed for building a speech recognition system that can interpret speech in any language (tested with Hindi and Bengali) and for implementing it on the Windows 7 platform; the system is speaker independent, i.e., it recognizes speech irrespective of the speaker.

Keywords: Speech Recognition, Language Models, Windows 7, Hidden Markov Model, phones


I INTRODUCTION

In the past few years a number of systems have emerged that support large-vocabulary multilingual speech recognition with high word recognition accuracy. Although some systems have concentrated on isolated word input, or have been trained to individual speakers, most current large-vocabulary recognition systems aim to recognize a fluent input string (continuous speech) by any talker (speaker-independent systems). With the growing number of smart devices, the area of speech recognition has grown enormously. The ability of a device to be stimulated by sound waves, interpret them and carry out a corresponding action is termed speech recognition, and such devices are called audio-narrated devices. Over the past few years, many IT companies have introduced voice recognition features in their products, such as Siri by Apple Inc., Aisha by Micromax and S-Voice by Samsung, along with Windows 7 and the Dragon system.
II BACKGROUND AND PREVIOUS WORK

A technical definition of Automatic Speech Recognition (ASR) is given by Jurafsky [1], who defines ASR as building systems that map acoustic signals to strings of words. He further defines automatic speech understanding (ASU) as extending this goal to producing some sort of understanding of the sentence. The area of ASR is about 75 years old. In 1939, AT&T introduced the first electronic speech synthesizer, named Voder. There were many gradual developments and advancements in ASR until Lenny Baum introduced the Hidden Markov Model (HMM) approach to speech recognition in the early 1970s. The HMM is a complex mathematical pattern-matching strategy that was eventually adopted by all the leading speech recognition companies, including IBM, AT&T, Dragon Systems and others. The field has since advanced to multi-modal and cross-lingual/multi-lingual ASR, which uses various probabilistic and statistical techniques such as HMMs, neural networks, SVMs and others [2]. Most ASR systems are speaker-independent or speaker-dependent, but the most effective ASR is speaker-adaptive [3]. Various speaker-adaptation techniques [4]-[7] have been developed that integrate speaker adaptation into speaker-independent continuous speech recognition systems. A weakness common to these techniques is that they do not distinguish the sources of variation in a speaker's speech spectra, and they require several minutes of adaptation speech to achieve a 25% reduction in word error rate. One effort to counter these weaknesses is an adaptation technique [8] that models factor-induced spectral variations in the source of the speech spectra; it substantially reduces the word error rate when only a few seconds of adaptation speech are taken from each speaker. Another effort is the development of a modular connectionist neural network [9], whose structure allows regular spectral variation to be modeled by the appropriate network module. Channel degradation and background noise severely constrain the performance of ASR; various approaches exist to add robustness to speech recognition [10].

III EXISTING SPEECH RECOGNITION ALGORITHM

Continuous input speech, when spoken as words, contains average gaps of 10-20 milliseconds. The main aim of any speech recognizer is to break a particular speech wave into its constituent phones, use speech signal processing techniques to form discrete smaller utterances from the phones, pass the feature vectors obtained from this processing through a Hidden Markov Model (HMM) based search, and map them against the dictionary. Before the HMM phase, the feature vectors are quantized, and this quantization consumes most of the execution time. The input audio is compared with the phones present in the engine so that the most probable string is taken as input.
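To make the quantization step concrete, the following is a minimal Java sketch (not the paper's code) that maps each feature vector to the index of its nearest codeword under squared Euclidean distance; the codebook itself is assumed to be trained offline, e.g. with k-means.

/** Minimal vector-quantization sketch: each feature vector is replaced
 *  by the index of the nearest codeword in a pre-trained codebook. */
public class VectorQuantizer {
    private final double[][] codebook; // [codewords][dimensions], trained offline

    public VectorQuantizer(double[][] codebook) {
        this.codebook = codebook;
    }

    /** Returns the index of the codeword closest to the given feature vector. */
    public int quantize(double[] featureVector) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < codebook.length; i++) {
            double dist = 0.0;
            for (int d = 0; d < featureVector.length; d++) {
                double diff = featureVector[d] - codebook[i][d];
                dist += diff * diff; // squared Euclidean distance
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }
}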

A dynamic programming algorithm is followed in the HMM phase to map the utterances to the phones in the dictionary. As the audio input arrives, all possible candidate phones are considered for the mapping. Initially, prefixes are compared with the input, and as more utterances are taken, the list of prefixes is gradually filtered down to the most appropriate one. Because a word or sentence contains a large number of utterances, an indefinite number of phones can be considered for comparison to improve accuracy. Using multiple codebooks while quantizing the audio input is another method of minimizing error.
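The actual HMM search scores hypotheses probabilistically with dynamic programming; purely to illustrate the prefix-filtering idea described above, the following hypothetical sketch keeps only those dictionary words whose phone sequences still match the phones observed so far.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of prefix filtering: as phones are decoded one by one,
 *  dictionary entries whose phone sequences no longer start with the
 *  observed prefix are discarded, so the candidate list shrinks. */
public class PrefixFilter {
    // word -> its phone sequence, e.g. "namaste" -> ["n","a","m","a","s","t","e"]
    private final Map<String, List<String>> dictionary;

    public PrefixFilter(Map<String, List<String>> dictionary) {
        this.dictionary = dictionary;
    }

    /** Returns the words still compatible with the phones observed so far. */
    public List<String> filter(List<String> observedPhones) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : dictionary.entrySet()) {
            List<String> phones = entry.getValue();
            if (phones.size() >= observedPhones.size()
                    && phones.subList(0, observedPhones.size()).equals(observedPhones)) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }
}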

IV LIMITATIONS OF THE ALGORITHM

Due to the frequent use of the HMM for searching the phones, the SPHINX engine has a larger memory footprint than other engines, along with a higher probability of misinterpretation. Depending on the size of the dictionary, and because multiple language models are used, SPHINX has a comparatively large working database, which hurts cache locality and thus reduces the overall efficiency of the engine.

The speech processing algorithm described above proceeds in specific stages, and owing to the continuous use of the HMM, the major bottleneck in processing time is the searching phase. Cache storage and even branch prediction do not mesh efficiently with the searching phase.

V OPTIMIZATION THROUGH MULTI-EXECUTION

Most of the processing time in speech recognition goes to feature vector quantization and to searching the phones. Hence, multiple execution of the speech recognition stages is one way to improve efficiency. The concept of multiple execution involves segmenting the signal processing mechanism to isolate the searching phase from the remaining stages, thereby introducing a degree of thread-level parallelism. Each stage uses its own buffer while its thread runs, and hence has its own time complexity, which removes the bottleneck of the searching phase. The stages of speech signal processing are: speech processing, distance calculation, vector quantization and searching.
Figure 1: Model for multiple execution
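As a rough illustration of the model in Figure 1, the sketch below (an assumption about the structure, not the authors' implementation) decouples the stages with bounded buffers so each thread runs at its own pace; the stage bodies are placeholders.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    // Placeholder for the nearest-codeword search of the quantization stage.
    static int quantize(double[] featureVector) { return 0; }

    public static void main(String[] args) {
        // Each stage owns a bounded buffer, so the slow searching phase
        // back-pressures the earlier stages instead of serializing them.
        BlockingQueue<double[]> features = new ArrayBlockingQueue<>(64);
        BlockingQueue<Integer> codewords = new ArrayBlockingQueue<>(64);

        // Stages 1-2: speech processing and distance calculation.
        Thread frontEnd = new Thread(() -> {
            try {
                while (true) {
                    double[] frame = new double[13]; // would come from the audio front end
                    features.put(frame);             // blocks only when the buffer is full
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 3: vector quantization runs concurrently with the front end.
        Thread quantizer = new Thread(() -> {
            try {
                while (true) {
                    codewords.put(quantize(features.take()));
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 4: the HMM search consumes codewords at its own pace.
        Thread searcher = new Thread(() -> {
            try {
                while (true) {
                    int observation = codewords.take();
                    // placeholder: advance the HMM search with this observation
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        frontEnd.start();
        quantizer.start();
        searcher.start();
    }
}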


VI IMPLEMENTATION AND RESULTS

The multi-threaded execution can be realized with the SPHINX speech engine, which combines the HMM algorithm with parallelization for better efficiency. The Java-based application is coded in Eclipse using JDK 1.7. The basic steps followed to implement the above-mentioned algorithm are:

• Developing the dictionary of basic words mapped to their corresponding phones. These phones are built by concatenating the phones of each specific alphabet in sequence.
• Developing the grammar on the basis of the syntactic knowledge of the English, Hindi and Bengali languages. Since semantic knowledge varies across languages, the grammar structure is designed accordingly.
• Using Sphinx 4.1.0 as the speech engine, which runs on Java. Java code is developed to capture words through the microphone, disintegrate the words and use the HMM algorithm to match them with the corresponding phones (a minimal usage sketch follows this list).
• Providing, since this is a multi-lingual speech recognition tool, a user interface for choosing the language to be detected by the tool.
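The following is a minimal sketch of how a Sphinx-4 recognizer is typically driven from Java; the configuration file name hindi.config.xml is hypothetical, standing in for the XML file that ties together the language-specific dictionary, grammar and acoustic model described above.

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class MultiLingualRecognizer {
    public static void main(String[] args) {
        // Load the XML configuration for the chosen language
        // ("hindi.config.xml" is a hypothetical name; one such file
        // would exist per supported language).
        ConfigurationManager cm = new ConfigurationManager(
                MultiLingualRecognizer.class.getResource("hindi.config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Capture words through the microphone.
        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (!microphone.startRecording()) {
            System.err.println("Cannot start the microphone.");
            recognizer.deallocate();
            return;
        }

        while (true) {
            // Decode one utterance against the active grammar and dictionary.
            Result result = recognizer.recognize();
            if (result != null) {
                System.out.println("You said: " + result.getBestFinalResultNoFiller());
            }
        }
    }
}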

VII CONCLUSION

Hidden Markov Model rules have been implemented to build a multi-lingual acoustic speech recognition system with flexible grammar and dictionary models. The tool can be used to study the behavior of a system while detecting the same words with different phones. In the near future, by enlarging the dictionary and grammar database, the tool can be ported to the Android platform.

REFERENCES

[1] A.A.M. Abushariah, T.S. Gunawan, O.O. Khalifa and M.A.M. Abushariah, "English digits speech recognition system based on Hidden Markov Models," in Proc. ICCCE (Kuala Lumpur, Malaysia), May 2010, pp. 1-5.
[2] Garfinkel (1998). Retrieved 15 December 2012 from http://www.dragon-medicaltranscription.com/historyspeechrecognition.html
[3] N. Mrvaljevic and Y. Sun, "Comparison between speaker dependent mode and speaker independent mode for voice recognition," in Proc. IEEE 35th Annual Northeast Bioengineering Conference, April 2009, pp. 1-2.
[4] S.-C. Yin, R. Rose and P. Kenny, "Adaptive score normalization for progressive model adaptation in text independent speaker verification," in Proc. ICASSP, April 2008, pp. 4857-4860.
[5] W. Rozzi and R. Stern, "Speaker adaptation in continuous speech recognition via estimation of correlated mean vectors," in Proc. ICASSP (Toronto, Canada), May 1991, pp. 865-868.
[6] W.-L. Zhang, W.-Q. Zhang, B.-C. Li, D. Qu and M.T. Johnson, "Bayesian speaker adaptation based on a new hierarchical probabilistic model," IEEE Trans. Audio, Speech and Language Processing, vol. 20, no. 7, pp. 2002-2015, Sept. 2012.
[7] O. Schmidbauer and J. Tebelskis, "An LVQ based reference model for speaker adaptive speech recognition," in Proc. ICASSP (San Francisco, CA), Mar. 1992, pp. 1441-1444.
[8] Y. Zhao, "A new speaker adaptation technique using very short calibration speech," in Proc. ICASSP II (Minneapolis, MN), Apr. 1993, pp. 562-565.
[9] R.L. Watrous, "Source decomposition of acoustic variability in a modular connectionist network," in Proc. ICASSP (Toronto, Canada), May 1991, pp. 129-131.
[10] V. Mitra, H. Franco, M. Graciarena and A. Mandal, "Normalized amplitude modulation features for large vocabulary noise-robust speech recognition," in Proc. ICASSP (Menlo Park, CA, USA), Mar. 2012, pp. 4117-4120.
[11] S. Franzini and J. Ben-Arie, "Speech recognition by indexing and sequencing," in Proc. International Conference on Soft Computing and Pattern Recognition (SoCPaR), Dec. 2010, pp. 93-98.
[12] L. Baugh, J. Renau and J. Tuck, "Sphinx Parallelization," in Proc. ICASSP, 2002.

Figure 2: Snapshots of the application: sample dictionary model with phones, sample grammar structure, and outputs in the Hindi and Bengali languages.
