USER IDENTIFICATION BASED ON LITHUANIAN DIGITS RECOGNITION

Rytis Maskeliunas1, Algimantas Rudzionis1, Kastytis Ratkevicius1, Vytautas Rudzionis2

1 Kaunas University of Technology, Speech Research Laboratory, Studentu str. 65, LT-51369, Kaunas, Lithuania, e-mail: [email protected], [email protected], [email protected]
2 Vilnius University, Kaunas Humanities Faculty, Dept. of Informatics, Muitines str. 8, LT-44280, Kaunas, Lithuania, e-mail: [email protected]
Abstract. The paper deals with the recognition of Lithuanian digits and sequences of Lithuanian digits by recognition engines of other languages. The preparation of telephony applications on Microsoft Office Communications Server 2007 Speech Server is examined. Results of Lithuanian digit recognition by English and Spanish recognition engines are presented.

Keywords: speech recognition, speech server, recognition accuracy.
1 Introduction
The advantages of speech-based interfaces for information or telecommunication services are well known. Speech input interfaces based on speech recognition technology have already been applied in many applications. Particular benefits could be achieved by disabled people: automatic speech recognition (ASR) is potentially of enormous benefit to people with severe physical disabilities. The tremendous richness of human speech communication gives the user many degrees of freedom for control and input, and the speed of speech recognition also gives it a potential advantage over other input methods commonly employed by physically disabled people [1].
Recent years were marked by the appearance of several commercial toolkits for the development of applications using speech recognition and other speech processing technologies. Toolkits such as Microsoft Speech Server [2], IBM WebSphere Voice Server [2] and several others make it possible to implement economically viable speech-enabled applications. Typically such toolkits include a phonemic model database and an application programming interface for manipulating the phonemic models in order to achieve the desired application functionality. The main value and know-how of speech application development tools lie in their phonemic databases. The development of a phonemic database requires much time and considerable human and financial resources. That is why Microsoft and IBM provide phonemic databases for “big” languages (languages that have large numbers of native speakers or are widely used as a means of international communication): such languages have bigger market potential. For example, the latest version of Microsoft Speech Server possesses speech recognition models only for English, Spanish, French and German.
Each human language is unique in its phonetic content and syntactic rules. That is why a characteristic property of speech technologies is the inability to use recognition tools developed for another language directly: at least some sort of adaptation has to be carried out. Hence the users of “smaller” languages in principle have two options:
- to develop a phonemic database for their spoken language from scratch;
- to try to adapt phonemic models trained on a foreign language.
Both approaches are used in practice, but here we want to concentrate on the second one. Such activities have been carried out by various researchers, and some of them should be mentioned. Zgank and colleagues [3] applied CD-HMM trained triphone models for several European languages to the recognition of spoken Slovenian. Later, under the MASPER research initiative, they extended their efforts to several other languages (Hungarian, Slovak, Danish, etc.) [4]. Weng and colleagues [5] used the English and Swedish versions of the ATIS speech corpora to train bilingual acoustic models sharing speech recognition models, trying to obtain better overall speech recognition accuracy by increasing the amount of data available for training. However, all those investigations were carried out using different speech corpora and laboratory tools to organize speech recognition. Such an approach complicates attempts to implement these systems in practical applications.
This paper presents our attempt to investigate the possibilities of using speech recognition engines oriented towards several foreign languages for the recognition of spoken Lithuanian digits.
The experiments included a search for an optimal transformation of phonetic transcriptions to enable the use of a foreign language recognizer for Lithuanian speech recognition. The experimental study shows that the Spanish recognition engine has better capabilities for adaptation to Lithuanian speech recognition than the English recognition engine and has the potential to achieve recognition accuracy acceptable for some applications.
2 Speech for telephony – Office Communications Server Speech Server
The part of the Microsoft Office Communications Server (MOCS) package responsible for the speech interface is Microsoft Speech Server (MSS 2007) [6]. Speech Server is an interactive voice response (IVR) platform that integrates with Visual Studio 2005 or Visual Studio 2008. Speech Server provides tools for developing applications that run over a telephone, or telephony applications. For example, telephony applications let you check your bank balance via a telephone or receive an automated call from your doctor's office reminding you of your next appointment. Speech Server is to a telephony application what a web server such as Internet Information Services (IIS) is to a web application. Speech Server applications can have the following capabilities:
• speech recognition allows users to respond to application prompts;
• Touch-Tone capabilities, called dual-tone multi-frequency (DTMF), let users respond to application prompts via the telephone keypad;
• text-to-speech (TTS) capabilities allow applications to read written text aloud to users.
MSS 2007 supports a powerful Microsoft .NET Framework-based application programming interface (Voice Response and Windows Workflow) for low-level access to core Speech Server functionality (the latest version of MSS also supports VoiceXML and SALT, Speech Application Language Tags) [7]. By utilizing Windows Workflow, a software developer can write managed code and actually see the entire call flow. Microsoft's .NET Framework is a runtime environment and class library that dramatically simplifies the development and deployment of modern, component-based applications. In Visual Studio, the developer may build the Voice Response Workflow application framework in the Dialog Workflow Designer (Figure 1).
Figure 1. The view of the Dialog Workflow Designer window
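The basic prompt/answer turn behind such a call flow can be illustrated in managed code. The sketch below uses the desktop System.Speech API rather than the MSS 2007 Voice Response Workflow API, so it is only an analogue of what a Speech Server dialog activity does; the prompt text and the yes/no grammar are illustrative.

using System;
using System.Speech.Recognition;
using System.Speech.Synthesis;

// Rough desktop analogue of a single prompt/answer turn: speak a prompt, then wait
// for a spoken reply that matches a small grammar. Requires a reference to System.Speech.
class PromptAnswerTurn
{
    static void Main()
    {
        using (var tts = new SpeechSynthesizer())
        using (var asr = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US")))
        {
            asr.LoadGrammar(new Grammar(new GrammarBuilder(new Choices("yes", "no"))));
            asr.SetInputToDefaultAudioDevice();

            tts.Speak("Do you want to check your balance?");
            RecognitionResult answer = asr.Recognize(TimeSpan.FromSeconds(5));
            Console.WriteLine(answer != null ? answer.Text : "no answer");
        }
    }
}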
The developer must write managed code to interact with the designer objects in the code-beside, similarly to the code-behind in a typical ASP.NET application. The latest MSS version (2007) supports a simplified tool for building grammars, the Conversational Grammar Builder (CGB) (Figure 2). CGB is most appropriate for developing grammars where the user's answers are single keywords. The Conversational Grammar Builder can also be used to develop grammars that use natural language understanding (conversational grammars).
Figure 2. The view of the Conversational Grammar Builder window
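As an illustration of such a single-keyword grammar, here is a minimal sketch using the desktop System.Speech API (not CGB itself); the ASCII spellings of the ten Lithuanian digit words follow the forms used later in Section 3.

using System.Speech.Recognition;

// Sketch of a single-keyword grammar: the caller says exactly one Lithuanian digit word.
class DigitKeywordGrammar
{
    static Grammar Build()
    {
        var digit = new Choices("nulis", "vienas", "du", "trys", "keturi",
                                "penki", "sesi", "septyni", "astuoni", "devyni");
        return new Grammar(new GrammarBuilder(digit)) { Name = "LithuanianDigit" };
    }
}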
A more advanced Speech Grammar Editor (SGE) is provided for developing grammars where the user's answers are more complex than a single phrase, for example, a grammar used to recognize a PIN code spoken as Lithuanian digits (Figure 3). Another application of the Speech Grammar Editor is speech recognition scenarios involving mixed-initiative user responses, where the user can provide information that the system has not yet asked for.
Figure 3. The view of the Speech Grammar Editor window
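A PIN-code grammar of this kind can be approximated as a digit alternative repeated a fixed number of times. The sketch below again uses the desktop System.Speech API rather than the Speech Grammar Editor output, and the 11-digit length simply anticipates the identification code used in Section 3.4.

using System.Speech.Recognition;

// Sketch of a PIN-code grammar: one of the ten digit words, repeated exactly 11 times.
class PinGrammarSketch
{
    static Grammar BuildPinGrammar()
    {
        var digit = new Choices("nulis", "vienas", "du", "trys", "keturi",
                                "penki", "sesi", "septyni", "astuoni", "devyni");
        // Wrap the alternative in a GrammarBuilder and repeat it 11 times.
        var pin = new GrammarBuilder(new GrammarBuilder(digit), 11, 11);
        return new Grammar(pin) { Name = "LithuanianPin" };
    }
}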
It is also possible to customize the recognized pronunciations in speech applications running on Speech Server. Lexicons are supported by both the Speech Grammar Editor and the Conversational Grammar Builder. Below is a multiple-pronunciation example of the word „nulis“ written in the phonemes [8] and prosodic symbols used by the speech recognition engine (“.” means a syllable break, S1 denotes primary stress; nulis means zero in Lithuanian) (Figure 4):

nulis
  N UH L IH S
  N UH . L IH S
  N S1 UH L IH S
Figure 4. Part of a lexicon file with multiple transcriptions of the digit “0”
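A related mechanism exists in the desktop System.Speech SRGS API, where an explicit pronunciation can be attached to a grammar token. The sketch below is an analogue of the lexicon entry above, not the MSS lexicon format itself; the phone string mirrors the paper's example and may need to be adjusted to the phone set and casing expected by the target engine.

using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

// Sketch: attach an explicit pronunciation to the token "nulis" inside an SRGS grammar.
class PronunciationSketch
{
    static Grammar BuildNulisGrammar()
    {
        var nulis = new SrgsToken("nulis") { Pronunciation = "N UH L IH S" }; // illustrative phone labels
        var rule = new SrgsRule("zero", nulis);
        var doc = new SrgsDocument();
        doc.PhoneticAlphabet = SrgsPhoneticAlphabet.Sapi; // assumed; UPS or IPA can be selected instead
        doc.Rules.Add(rule);
        doc.Root = rule;
        return new Grammar(doc);
    }
}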
3 Investigation of Lithuanian digits recognition
Ten Lithuanian digits, from 0 to 9, were chosen to investigate recognition accuracy. The speech recognition engines Microsoft English Recognizer v5.1, Microsoft English (U.S.) v6.1, Microsoft English (U.S. Telephony) v7.0 Server and Microsoft English (U.S. Wideband) v7.0 Server were used in the first experiment. The first recognizer is distributed with the Speech SDK, the second with Microsoft Office 2003, and the last two with the Speech Application SDK v1.1.
3.1 Comparison of English speech recognizers

Each digit was spoken 100 times by a single female speaker and the recognition accuracy was measured. The experiment was conducted in a noisy computer room environment using a headset microphone. The initial transcriptions of the Lithuanian digits were: „nulis, vienas, du, trys, keturi, penki, sheshi, septyni, ashtuoni, devyni“. Words containing Lithuanian letters were transcribed with English ones, i.e., „šeši“ (six) as „sheshi“ and „aštuoni“ (eight) as „ashtuoni“. The results are displayed in Table 1.

Table 1. Accuracy of Lithuanian digits recognition by English recognizers

Recognizer                                       Accuracy, %
Microsoft English Recognizer v5.1                27,6
Microsoft English (U.S.) v6.1 Recognizer         84,4
Microsoft English (U.S. Telephony) v7.0 Server   78,0
Microsoft English (U.S. Wideband) v7.0 Server    50,3
In order to improve the accuracy of Lithuanian digit recognition by English recognition engines, it is possible to use English transcriptions of the Lithuanian words. In this way the English speech recognition engine interprets the spoken word as an English one, and sometimes the improvement is quite noticeable. The English transcriptions of the Lithuanian digits were chosen using the Microsoft English speech synthesizer: each Lithuanian digit was synthesized and the English transcription most similar to the Lithuanian pronunciation was selected. In order to reduce the scope of the experiment, we selected only four digits, „trys, keturi, aštuoni, devyni“, and replaced their transcriptions with „trees, kehtoori, ashtuohni, deveehni”. Such transcriptions were used in our earlier experiments with Microsoft Speech Server 2004, when the recognition accuracy of the digits „ashtuohni“ and „deveehni“ was improved to 100%, but the other two digits, „trees“ and „kehtoori“, were not recognized at all [9]. The two best recognizers, Microsoft English (U.S.) v6.1 and Microsoft English (U.S. Telephony) v7.0 Server, were used in our next experiments. Averaged results are shown in Table 2.
Table 2. Recognition accuracy dependence on transcriptions for two recognizers

                     Accuracy, %
Recognizer           trys, keturi, ashtuoni, devyni    trees, kehtoori, ashtuohni, deveehni
v6.1                 71,25                             81,5
Telephony v7.0       55,75                             63,75
A significant improvement in recognition accuracy was achieved for the digit „trees“, but the recognition accuracy of the digits „keturi“ and „kehtoori“ was the same for both transcriptions.
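The selection procedure described before Table 2 (synthesize each Lithuanian digit with the English TTS voice and keep the English spelling that sounds closest) can be scripted. A minimal sketch using the desktop System.Speech synthesizer follows; the candidate spellings are the paper's own, and the "closest-sounding" judgement is still made by ear.

using System.Speech.Synthesis;

// Sketch: audition candidate English spellings of Lithuanian digits with the English TTS voice.
class TranscriptionAudition
{
    static void Main()
    {
        using (var tts = new SpeechSynthesizer())
        {
            foreach (var candidate in new[] { "trys", "trees", "keturi", "kehtoori" })
            {
                tts.Speak(candidate); // listen and keep the spelling closest to the Lithuanian pronunciation
            }
        }
    }
}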
3.2 Investigation of recognition accuracy dependence on recognizer settings

All previous experiments were performed using the default speech profile with the following settings: background adaptation turned off, low pronunciation sensitivity, high accuracy and slow recognition response time. Pronunciation sensitivity controls whether the system rejects a command if it is not certain what was said (high sensitivity) or always acts upon the command even if it is unsure (low sensitivity). The ratio “accuracy vs. recognition response time” controls whether the system recognizes the speech with high accuracy but a slower response time, or with lower accuracy and a faster response time. Background adaptation automatically adapts to the speaker during the recognition process.

The next experiments were performed using the digits 3, 4, 8, 9 with the transcriptions „trees, kehtoori, ashtuohni, deveehni”. The recognition accuracy of these digits was measured with background adaptation turned on and turned off using the Microsoft English (U.S.) v6.1 recognizer. The accuracy of recognizing the four digits was 65% with background adaptation turned on and 81,5% with background adaptation turned off.

The dependence of recognition accuracy on pronunciation sensitivity was investigated without background adaptation using three pronunciation sensitivity levels: low, middle and high. The accuracy of recognizing the four digits was 11,25% when pronunciation sensitivity was set to “high”, 77,5% when it was set to “middle” and 81,5% when it was set to “low”. The dependence of recognition accuracy on the ratio “accuracy vs. recognition response time” was investigated without background adaptation, using low pronunciation sensitivity and the three ratio levels mentioned above: low/fast, middle/middle and high/slow. The accuracy of recognizing the four digits was 81,5% when the ratio was set to “high/slow”, 89,0% when it was set to “middle/middle” and 90,25% when it was set to “low/fast”.

A new speaker profile was added and trained prior to the next experiment, which was performed without background adaptation, using low pronunciation sensitivity and the “low/fast” ratio of accuracy and recognition response time. The accuracy of recognizing the four digits was 90,25% with the default speech profile and 95,75% with the speaker's own profile.

Similar experiments were performed using the Microsoft English (U.S. Telephony) v7.0 Server recognizer. In order to shorten the long-lasting experiments, audio recordings of the Lithuanian digits with an 11 kHz sampling rate and 16-bit resolution were used for testing the Microsoft English (U.S. Telephony) v7.0 Server recognizer. Unfortunately, the experiments indicated that the audio recordings were not suitable for testing the Microsoft English (U.S.) v6.1 recognizer on the test computer due to end-pointing errors or aborting of the testing program. The investigation of recognition accuracy dependence on recognizer settings for the Microsoft English (U.S. Telephony) v7.0 Server recognizer showed that the best recognizer settings are low pronunciation sensitivity and the low/fast ratio of “accuracy vs. recognition response time”. Background adaptation and the speech profile do not influence the recognition accuracy of the Microsoft English (U.S. Telephony) v7.0 Server recognizer when audio recordings are used for testing.
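For SAPI-based desktop recognizers, settings of this kind can also be changed programmatically. The sketch below is hedged: the setting names are assumptions ("AdaptationOn" appears in Microsoft samples for the desktop recognizers; "ResponseSpeed" mirrors the accuracy-vs-response-time trade-off discussed above but may not be exposed by every engine), so the calls are wrapped defensively.

using System.Speech.Recognition;

// Sketch: adjust engine settings programmatically; names and values are engine-dependent assumptions.
class RecognizerSettingsSketch
{
    static void Configure(SpeechRecognitionEngine engine)
    {
        try
        {
            engine.UpdateRecognizerSetting("AdaptationOn", 0);  // turn background adaptation off
            engine.UpdateRecognizerSetting("ResponseSpeed", 0); // assumed: favour fast response over accuracy
        }
        catch (System.Exception)
        {
            // The engine does not expose one of these settings; keep its defaults.
        }
    }
}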
3.3 Investigation of recognition accuracy dependence on the number and type of transcriptions

As was mentioned in Section 2, it is possible to customize the recognized pronunciations in speech applications running on Speech Server 2007. A similar mechanism is available in SAPI-based applications through the XML grammar component PRON and the ARPAbet alphabet [10]. A SAPI-based list of Lithuanian digit transcriptions was prepared and used in the next experiments; it covered the following ten digit words:
vienas
du
trys
keturi
penki
sesi
septyni
astuoni
devyni
nulis
A server-based list of Lithuanian digit transcriptions was prepared by eliminating the gaps and stress marks from the SAPI-based list. Such a server-based list can be used on both the MSS 2004 and MSS 2007 servers. Lithuanian diphthongs can be represented in several ways using the ARPAbet alphabet; for example, the diphthong /ai/ can be described by ten transcriptions: [ah ih], [ah iy], [ah y], [aa ih], [aa iy], [aa y], [ax ih], [ax iy], [ax y], [ay]. In the same way, two lists of Lithuanian digit transcriptions (SAPI-based and server-based) were prepared containing multiple transcriptions for each digit (from 2 transcriptions for the digit “2” to 10 transcriptions for the digit “1”; the total number of transcriptions was 25). Averaged results are shown in Table 3 and Figure 5.

Table 3. Recognition accuracy dependence on the number and type of transcriptions for Microsoft English (U.S.) v6.1 Recognizer
Transcription                                 Accuracy, %   Errors, %
One SAPI-based transcription for digit        94,2          “0” – 49%, “5” – 9%
One server-based transcription for digit      99,0          “0” – 8%, “5” – 2%
Many SAPI-based transcriptions for digit      95,8          “0” – 41%, “5” – 1%
Many server-based transcriptions for digit    99,8          “0” – 1%, “5” – 1%
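The ten variants of /ai/ listed above are simply the cross product of the candidate first and second ARPAbet elements plus the single-phone rendering [ay]. A small enumeration sketch (illustrative code, not from the paper) makes the count explicit:

using System;
using System.Collections.Generic;

// Enumerate the ten ARPAbet renderings of the Lithuanian diphthong /ai/ used in the paper's example.
class DiphthongVariants
{
    static void Main()
    {
        string[] first  = { "ah", "aa", "ax" }; // candidate first elements
        string[] second = { "ih", "iy", "y" };  // candidate second elements

        var variants = new List<string> { "ay" }; // single-phone variant
        foreach (var f in first)
            foreach (var s in second)
                variants.Add(f + " " + s);

        // Prints: [ay], [ah ih], [ah iy], [ah y], [aa ih], ... (ten variants in total)
        Console.WriteLine(string.Join(", ", variants.ConvertAll(v => "[" + v + "]")));
    }
}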
Figure 5. Recognition accuracy dependence on the number and type of transcriptions for Microsoft English (U.S. Telephony) v7.0 Server Recognizer
A very high averaged accuracy of 99,8% for Lithuanian digit recognition by the Microsoft English (U.S.) v6.1 recognizer was achieved using many server-based transcriptions for each digit, a trained speaker profile and the previously estimated recognizer settings.
3.4 Investigation of recognition accuracy dependence on the length of digit sequence

User identification could be implemented by recognizing the identification code of the user (PIN code), consisting of 11 digits. The accuracy of recognizing such a sequence of digits should be equal to the accuracy of single digit recognition raised to the power of the length of the digit sequence [11]; for example, if the accuracy of single digit recognition is 95%, the accuracy of recognizing an 11-digit sequence should be about 57%.

Two applications were implemented on Microsoft Speech Server 2007: one for measuring the accuracy of single digit recognition and another for measuring the accuracy of digit sequence recognition. One server-based transcription for each digit was used when measuring the accuracy of the English recognition engine, and a lexicon with multiple pronunciations for each digit was used when measuring the accuracy of the Spanish recognition engine. Four speakers took part in the experiment: each digit was spoken 100 times through a mobile telephone when measuring the accuracy of single digit recognition, and 9 sequences of 11 digits were spoken through a mobile telephone when measuring the accuracy of digit sequence recognition.
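In symbols, assuming each of the n digits in the code is recognized independently with the same single-digit accuracy p, the expected sequence accuracy is

P_seq = p^n, e.g. 0.95^{11} ≈ 0.57 (about 57%).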
Table 4. Recognition accuracy dependence on the length of digit sequence
Speaker      Spanish engine                                      English engine
             Single digit accuracy, %   Digit sequence accuracy, %   Single digit accuracy, %
RM, man      98,8                       97,3                         64,2
VR, man      93,5                       59,0                         52,1
KR, man      84,5                       82,2                         61,0
MR, woman    71,0                       55,6                         41,0
The recognition accuracy achieved by speaker VR approximately corresponds to the above-mentioned dependence of recognition accuracy on the length of the digit sequence: the accuracy of single digit recognition was 93,5% and the accuracy of digit sequence recognition was 59,0%. The low recognition accuracy achieved by the English recognition engine indicates that the transcriptions used are not suitable for Lithuanian digit recognition. The Universal Phone Set (UPS) [7] should be used instead of the ARPAbet alphabet for Lithuanian digit transcriptions.
4 Conclusions
A very high averaged accuracy of 99,8% for Lithuanian digit recognition by the Microsoft English (U.S.) v6.1 recognizer was achieved using many server-based transcriptions for each digit, a trained speaker profile and experimentally chosen recognizer settings, i.e., carefully and properly selected recognizer settings and digit transcriptions warrant good recognition results. Analyzing the results of the last experiments, we can conclude that the developer who created the new digit recognition models for the Lithuanian language achieved the best recognition accuracy (Table 4, speaker RM), meaning that during the development phase the system was trained and adapted to his voice. Future models will be trained and adapted to a much wider range of voices, improving the recognition accuracy for other people.
References
[1] Hawley M.S., Green P., Enderby P., Cunningham S., Moore R.K. Speech Technology for e-Inclusion of People with Physical Disabilities and Disordered Speech. Proc. Interspeech, Lisbon, 2005, pp. 445-448.
[2] Xiaole Song. Comparing Microsoft Speech Server 2004 and IBM WebSphere Voice Server V4.2. Retrieved December 19, 2008, from http://www.developer.com/voice/article.php/3381851.
[3] Zgank A., Kacic Z., Horvat B. Comparison of Acoustic Adaptation Methods in Multilingual Speech Recognition Environment. TSD 2003, pp. 245-250.
[4] Zgank A., Kacic Z., Vicsi K., Szaszak G., Diehl F., Juhar J., Lihan S. Crosslingual Transfer of Source Acoustic Models to Two Different Target Languages. Proc. of COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, University of East Anglia, Norwich, UK, August 30-31, 2004.
[5] Weng F., Bratt H., Neumeyer L., Stolcke A. A Study of Multilingual Speech Recognition. Proceedings of Eurospeech'97, pp. 209-212.
[6] Dunn M. Pro Microsoft Speech Server 2007: Developing Speech Enabled Applications with .NET. Apress, June 2007, 275 pp. ISBN-13: 978-1-59059-902-0.
[7] Microsoft .NET Speech Technologies. Retrieved December 19, 2008, from http://www.microsoft.com/speech.
[8] Phoneme Table for English (United States). Retrieved December 19, 2008, from http://msdn.microsoft.com/en-us/library/bb813894.aspx.
[9] Rudžionis A., Maskeliūnas R., Ratkevičius K., Rudžionis V. Investigation of voice servers application for Lithuanian language. Electronics and Electrical Engineering, ISSN 1392-1215, 2007, no. 6(78), pp. 43-46.
[10] Jurafsky D., Martin J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, New Jersey, 2000.
[11] Rahmel H. Strategies for Optimizing Alphanumeric Recognition Accuracy. Retrieved December 19, 2008, from http://www.microsoft.com/speech/community/newsletter/articles/124004art/index.html.