APPLICATION OF SPEECH TECHNOLOGY IN THE MULTILINGUAL SPEEDATA PROJECT

U. Ackermann (1), F. Brugnara (2), M. Federico (2), H. Niemann (1)

(1) Bavarian Center for Knowledge Based Systems (FORWISS), Am Weichselgarten 7, 91058 Erlangen, Germany, Phone: +49-9131-691-194, Fax: +49-9131-691-185
(2) Istituto per la Ricerca Scientifica e Tecnologica (IRST), Via Sommarive, 38050 Povo (Trento), Italy, Phone: +39-461-314552, Fax: +39-461-302040

email:[email protected]

Abstract

In this paper we present a new application for speech technology. In the SpeeData project, a user of the system speaks in any of the provided languages, which at present are German and Italian. The system analyses the utterance and generates a data base entry. At the same time, relevant information is translated into the other language. This paper presents an overview of multilingual speech recognition using a single recognition module.

1 Introduction

Recent developments [1, 5, 12] have shown that speech recognition technology has reached a point where it may be used to facilitate work. In the SpeeData project, entering data into a data base shall be facilitated by means of speech recognition. Speech technology can be useful in situations where it replaces another input mode and provides higher efficiency in terms of required time or error rate. Until now, speech recognition systems have been restricted to understanding only one language. Many applications can be found where parallel understanding of several languages is to the user's advantage, e.g. if users are multilingual or of different origins. In order to provide speech technology to a range of inexperienced users, a high level of comfort must be guaranteed.

In the Autonomous Region of Trentino-Alto Adige/Südtirol (RATAA), there are two official languages: Italian and German. In this region, land register entries have been made since 1897. The land register data are kept in so-called master books and are written in Italian, German, or both languages. These land register books will be stored electronically. At the offices, land register experts will extract data from the master books. According to RATAA, a human effort of 285 man-years is calculated for entering the data in the traditional way. When entering data via speech, it is estimated that the manpower for data entry is reduced by 10 to 20%.

This paper describes the application of multilingual speech recognition in the land register offices of the Autonomous Region of Trentino-Alto Adige/Südtirol, see Figure 1. The languages to be recognized are Italian and German. The basis of the recognition system consists of a recognition system for each of the languages [1, 11].

The paper is organized as follows: In Section 2 the user requirements will be presented. Multilinguality aspects will be treated in Section 3, starting with the differences between the modeled languages, Italian and German, and proceeding to the consequences that may lead to different realisations of speech recognition systems. Section 4 presents the architecture of the complete data entry system, and Section 5 describes the speech recognizer employed for multilingual speech recognition. Section 6 shows the current work, i.e. the definition of evaluation criteria. The final section outlines the work to be done in the near future.

Figure 1: Sketch of the application of the SpeeData project. (A land register expert dictates master book entries from the historic master books via speech, mouse, and keyboard; the entries are stored in electronic master books.)

2 Application Domain

In the SpeeData project, speech recognition technology is a medium for text processing and data entry tasks. The domain for which this speech recognition system is designed is the acquisition of land register data from the master book, which was introduced to the region in 1897. All important information kept in the master book shall be put into a data base. One constraint is made for the entry: only relevant information from the historic master book is to be taken. Relevance in this context is defined by laws and represents information that has some impact on today's master book entries. For this work, experts are needed to decide on the importance of the given information.

Today's situation in the land register offices of the region Trentino-Alto Adige/Südtirol involves two people working on data entry at the same time: the expert extracts information and dictates it to a secretary, who types it into the corresponding data base fields. Then the data base records are printed and corrected by the expert.

The content of a data base record shows a variety of formats:

1. Numbers: they appear mostly when a date or an amount of money is mentioned. They may be ordinal, cardinal, or fractional numbers.

2. Fixed texts: these texts are limited in their variety. The name of a right or law may be cited; thus, only between 10 and 100 different words are possible at a time.

3. Free texts: this type appears in descriptions of lands or houses. It may also describe some rights among owners and neighbors, like "The real estate of X can be crossed by the cattle of Y."

Most of the lexicon for this application, however, consists of proper names, which include town names, street names, and names of owners and inhabitants. Thus, about 70% of the lexicon consists of names.

To enter data into a data base, the information must be structured. A master book entry, called a land group, consists of 4 sheets according to items like rights, obligations, etc. Within the sheets, information can be structured into categories like name, birth date, and type of right. Some of the information must be entered in both languages; these fields may be fixed or free text.

When using speech as the data entry medium, it is useful for the expert, who now enters the data himself, to utter a keyword and then the corresponding information. The user requirements for the data entry system are as follows:

1. Processing speed: the users require a system with a fast response time. For the speech recognition module, a real-time factor between 1.3 and 1.5 shall be realized.

2. Reliability: a speaker-adaptive system with high accuracy will be provided. The system will adapt to different dialects and accents. A word accuracy of 94% is envisaged. For occasional errors, a correction facility will be provided.

3. Types of data entry: data can be entered by speech, mouse (by clicking a word from a set of alternatives), and keyboard. The users prefer entering data by the first two techniques.

4. Language: the speaker indicates in some way the language in which he/she will enter data, for example by using a keyword in a particular language.

5. Flexibility of data entry: since the information is not always structured in the master book in the same way, the land register expert has the possibility of entering data in different orders. To avoid ambiguities, he can enter a keyword to define in which field the information is stored.

6. Feedback: it will be provided on the screen. All available information concerning the current land group will be shown.

7. Guidance: the system always shows which fields may be filled next and which fields are obligatory.

8. Comfort: a high level of comfort must be provided, since most of the users are inexperienced with text processing and concentrate on the master books during work.

9. Translation: some fields must be filled in both languages. For those fixed-text fields, automatic translation will be provided.

Users will work with the system in a quiet room, where no disturbances will arise from telephones or people coming in. Still, turning pages etc. will cause some noise while speaking.
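The two quantitative targets above, real-time factor and word accuracy, have standard definitions: the real-time factor is processing time divided by audio duration, and word accuracy follows from the minimum edit distance between the reference and the recognized word sequence. The following minimal sketch illustrates both; the sample word sequences are invented for illustration.

```python
# Word accuracy via minimum edit distance between reference and
# hypothesis word sequences; real-time factor compares processing
# time against audio duration.

def word_errors(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def word_accuracy(ref, hyp):
    return 100.0 * (1 - word_errors(ref, hyp) / len(ref))

def real_time_factor(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

ref = "das Grundstück von X".split()
hyp = "das Grundstück vor X".split()
print(word_accuracy(ref, hyp))     # 75.0: one substitution in four words
print(real_time_factor(6.5, 5.0))  # 1.3: within the 1.3-1.5 target
```

A recognizer that needs 6.5 seconds to process 5 seconds of speech thus meets the lower end of the stated real-time requirement.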

3 Multilinguality

When working on a multilingual speech recognition system, a good deal of attention must be paid to the languages to be recognized by the system. In this application, a recognition system for Italian and German is built; thus, the properties of both languages are of importance.

One characteristic of German is its great variety of inflections and cases [3]: number and gender agreement is required for nouns, adjectives, verbs, and articles. Sometimes two correct forms of one inflection exist [2], like the genitive of masculine or neuter words like Hof (= farm), which can be either Hofs or Hofes. Also, the word order in a sentence can be chosen quite freely compared to other languages, whereas the verb has a fixed position. On the other hand, the verb can be split into two parts within one sentence. Homophones result when one word has different roles, such as noun or verb. In written German, the role of words is distinguished by using capital and small letters, which enlarges the lexicon. Another property of German is the generation and use of compound words. In contrast to other languages, specifications or refinements of definitions are made by creating new words. This happens by making a new word

out of concatenating two nouns or adjectives, instead of generating phrases with prepositions like of in English and di in Italian. All these effects lead to a large lexicon and especially to a high out-of-vocabulary rate, see Table 1.

  Newspaper corpus          German    English   Italian
  # words in corpus         31 M      37 M      25.7 M
  # distinct words          650 K     165 K     200 K
  OOV rate (lexicon 20 K)   10 %      2.5 %     3.7 %

Table 1: Lexical coverage of German, English, and Italian corpora [8].

Another aspect of this application is the high occurrence of proper names, which must be treated differently from other words. Proper names have been shown to differ from other word types in a number of properties [2]. For example, the stress generally shifts from the stem syllable to the first one ('Erlangen (German town) vs. er'langen (= to reach)). Town names like Vezzan are also influenced by neighboring countries and languages. Depending on the concrete task, words and structures from different epochs must be known, as well as regional differences like dialects. For the German language, it can be said that pronunciation and lexicon depend strongly on the region. South Tyrolean German uses the words Speis, Promenadenweg, Fußsteig, Jänner, whereas in standard German the words Eßzimmer, Gehweg, Gehsteig, Januar are mostly used (= dining room, footpath, sidewalk, January).

The most striking observation is that in a multilingual application, words of both languages may occur within one sentence. In this application, as already mentioned in Section 2, the user chooses the language for the next entry and speaks in that language. But there are always words, mostly proper names, that occur in the other language. For example, when information is entered, the owner of a property may have a German name while a company mentioned may have an Italian name. The land register experts have either Italian or German as their mother tongue and may thus have an accent whenever they enter data in a non-native language.
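The OOV rates in Table 1 follow the usual definition: the fraction of running words in a corpus that fall outside a lexicon built from the most frequent word types. A small sketch of that computation, on a toy corpus rather than the newspaper corpora of [8]:

```python
from collections import Counter

def oov_rate(corpus_tokens, lexicon_size):
    """Percentage of running words not covered by the most
    frequent `lexicon_size` word types of the corpus."""
    counts = Counter(corpus_tokens)
    lexicon = {w for w, _ in counts.most_common(lexicon_size)}
    oov = sum(1 for w in corpus_tokens if w not in lexicon)
    return 100.0 * oov / len(corpus_tokens)

# Toy corpus: with a 2-word lexicon, the single rare word is OOV.
tokens = "der der der Hof Hof Weichselgarten".split()
print(oov_rate(tokens, 2))  # 16.67%: 1 of 6 running words
```

With a fixed lexicon size, German's productive compounding inflates the number of distinct word types, which is exactly why its OOV rate in Table 1 is so much higher than that of English or Italian.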
Therefore, the recognition system must not only cope with dialectal variations, but also with a certain amount of accent on the part of the speaker. To build a speech recognition system for both languages that accepts speech from the other language, the properties of both languages must be modeled properly. In this approach, the challenges arising from the characteristics of both languages will be handled in a common module. The speakers have differing knowledge of the other language; thus, utterances in the foreign language show a wide range of accents, possibly in different dialects. Different pronunciations by different speakers will be modeled by adaptation.

4 System Architecture

In this section, the overall system architecture is shown. This comprises the modules and their interconnections. The tasks and properties of the speech recognition module will be described here only with regard to its relation to the other modules; the next section will describe the speech recognizer in detail. In the remainder of this section, the modules that can be seen in Figure 2, namely the central manager, the user interface, the data base interface, and the speech recognizer, will be described.

The central manager is the core of the architecture. Its tasks are to execute the user's commands, to control and pass information to the other modules, and to maintain the status and

the context information of the interaction.

Figure 2: System architecture for the SpeeData data entry module. (Screen, keyboard, mouse, and microphone are connected to the user interface; the central manager links the user interface, the data base interface, and the speech recognizer, which draws on acoustic models and language models.)

During data entry, the manager receives speech inputs from the user interface and forwards them to the speech recognizer together with the specification of the active LM. When the recognizer has returned a sequence of keyword/field assignments, the manager updates the status and the context information. It also tells the user interface module to update the data form on the screen. Moreover, the manager forwards requests to the data base interface when data have to be stored in or retrieved from the data base.

The data base module is an interface to an external data base management system. The formats that the data entry system and the data base management system use to store entered data need not be the same. The data entry system only stores a view of the data in the current record, and sends it as a whole to the data base manager for insertion into the data base after user confirmation. During a data entry session, queries for specific information in the data base can be generated by the central manager. The role of the data base interface is thus to translate queries or update requests into the data base language. In the Land Register application, the data base system will be Oracle, since this is the system of choice in the existing keyboard-based system. However, with the data base management interface isolated in one module, good portability towards alternative systems is provided.

The user interface is the most important unit for the user of the system. Therefore, it must be designed precisely according to the user requirements. The user interface manages four devices: screen, microphone, mouse, and keyboard. The last three devices are used for data entry. Using mouse and keyboard, fields can be selected, and data can be entered either by typing or by selecting them from a menu (if possible). Using speech, the user can perform the same actions.
Moreover, in many forms the user will be allowed to fill in several fields with a single utterance and without specifying the individual fields. The three input modalities can usefully complement each other: for example, the mouse can be used to select a certain field, while the information is given via speech.

The screen fulfills several tasks: feedback, context, and guidance. First of all, it provides feedback for the user about the entered data. This comprises both the possibility of error correction and knowledge about which data have been entered during the current session. Furthermore,

the screen shows stored information about the current land register entry resulting from previous entries or related to another stored land register entry. The screen also provides a medium for guidance: the user finds at any time a set of proposals for which information he might enter next. Additionally, before finishing an entry, he will see whether all necessary information is given or whether something is missing.

The task of the speech recognition module is to process speech events. This module will be treated in detail in the next section. Information flows in two directions between the speech recognition module and the central manager. A stream of speech events reaches the speech recognizer, where the information is transformed into a word sequence. Additionally, it is decided in which field(s) the data are stored. If there is a keyword, it is transformed into information concerning the field. If no keyword is entered, the system has to choose based on the content of the speech itself and on context-specific rules.
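The keyword-to-field assignment described above can be sketched as a simple routing step over the recognized word sequence: a keyword switches the active field, and subsequent words fill it. This is only an illustrative sketch; the field names and keywords below are invented, not the project's actual vocabulary, and the real system additionally applies context-specific rules when no keyword is present.

```python
# Minimal sketch of keyword-driven field assignment. Keywords and
# field names are illustrative assumptions.

KEYWORDS = {
    "Name": "owner_name", "Nome": "owner_name",
    "Geburtsdatum": "birth_date", "Recht": "right_type",
}

def route(words, active_field=None):
    """Assign recognized words to data base fields. A keyword switches
    the active field; the words that follow fill that field."""
    assignments = {}
    field = active_field
    buffer = []
    for w in words:
        if w in KEYWORDS:
            if field and buffer:
                assignments[field] = " ".join(buffer)
            field, buffer = KEYWORDS[w], []
        else:
            buffer.append(w)
    if field and buffer:
        assignments[field] = " ".join(buffer)
    return assignments

print(route(["Name", "Maria", "Rossi", "Geburtsdatum", "1897"]))
# {'owner_name': 'Maria Rossi', 'birth_date': '1897'}
```

Passing an `active_field` covers the case where the user has already selected a field with the mouse and then dictates only the content.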

5 The Multilingual Speech Recognizer

The technology that serves as a basis at the institutes involved in the project will be described briefly in this section. In the remainder of the section, concepts for the design of the multilingual speech recognition system will be presented.

Both IRST and FORWISS have developed a speech recognition system [9, 11]. Since 1988, IRST has been engaged in research and development in several fields of speech technology: machine dictation, speech understanding, and speaker recognition. IRST has developed state-of-the-art speech recognition technology. An example is the AReS system, a 10,000-word vocabulary dictation system for radiological reporting that has recently been put on the market. The recognizer is speaker-adaptive, works with continuous speech, employs a bigram language model, and provides real-time response with a word accuracy close to 94%. Research is currently carried out on the IRST recognition engine in order to cope with vocabularies of tens of thousands of words and to improve the robustness of the system with respect to environmental noise, different input devices (e.g. microphones or telephone handsets), "badly" performing speakers, and spontaneous-speech phenomena. Research is also being performed in the field of speech understanding to develop information query systems (e.g. flight information), phone services (a project for automated "collect call" services), and spoken language translation (project C-STAR II). Finally, research activity is carried out in the fields of microphone arrays and computer vision, which will provide the basis for more sophisticated and robust human-machine interaction, e.g. speech processing without hand-held or body-worn microphones, lip reading, tele-conferencing, etc.

FORWISS technology is obtained from cooperation with the Chair of Pattern Recognition of the University of Erlangen.
The recognition system that serves as a base in the SpeeData project is used in the InterCity train timetable information system EVAR (Erkennen-Verstehen-Antworten-Rückfragen, i.e. recognize-understand-answer-inquire) [4]. The Chair of Pattern Recognition also participates in the Verbmobil project [13], where the recognizer is used for the recognition of spontaneous speech. The recognition system is also used in the SQEL project for languages from Eastern Europe [7]. Results on train timetable inquiries uttered in spontaneous speech via the public telephone network led to a recognition rate of 79% [6].

The projects presented in the last paragraphs show a variety of differences regarding the specifications they were made for. Among other things, the domains differ; furthermore, the size of the lexicon depends on the task, and the need for semantic analysis is not the same.

In the prototype, a speech recognizer derived from the IRST dictation system will be integrated. The core of the system is a time-synchronous, beam-search Viterbi decoder where language models are represented by means of finite state networks. Up to now, this system has already been successfully employed in dictation systems for different domains. However, two peculiar features of the data entry task, namely multilinguality and language model switching, will require the system to be adapted.
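The idea behind a time-synchronous beam-search Viterbi decoder can be sketched in a few lines: at each frame, every active hypothesis is extended along the transitions of the finite state network, and hypotheses whose score falls too far below the current best are pruned. The sketch below uses a toy two-state network with invented log-probabilities and flat emission scores; it is not the IRST decoder, only an illustration of the search principle.

```python
import math

# Time-synchronous Viterbi beam search over a small finite state
# network. Transition and emission scores are toy values.

def viterbi_beam(transitions, emit, start, n_frames, beam=5.0):
    """transitions: {state: [(next_state, log_prob), ...]}
    emit(state, t): log-likelihood of frame t in that state."""
    active = {start: 0.0}
    for t in range(n_frames):
        new = {}
        for s, score in active.items():
            for s2, lp in transitions.get(s, []):
                cand = score + lp + emit(s2, t)
                if cand > new.get(s2, -math.inf):
                    new[s2] = cand
        best = max(new.values())
        # Beam pruning: keep only hypotheses close to the current best.
        active = {s: v for s, v in new.items() if v >= best - beam}
    best_state = max(active, key=active.get)
    return best_state, active[best_state]

# Two-state toy network: "a" loops on itself or moves on to "b".
trans = {"a": [("a", math.log(0.6)), ("b", math.log(0.4))],
         "b": [("b", math.log(1.0))]}
emit = lambda s, t: 0.0  # flat emission scores for the sketch
state, score = viterbi_beam(trans, emit, "a", 3)
print(state)  # "b": looping in "a" keeps paying log 0.6 per frame
```

In a real decoder the network states carry HMM emission densities and the beam width trades accuracy against the real-time requirement from Section 2.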

5.1 Multilingual Acoustic Modeling

The first issue is the main topic of the IRST-FORWISS collaboration, which aims to provide a multilingual speech recognizer in which the different languages are integrated as smoothly and transparently as possible. Language differences come into play both at the acoustic and the linguistic level. From a phonetic point of view, the first step of a unified approach is to adopt the SAMPA phonetic alphabet, which covers the lexicon of both languages with a unique set of phonetic units. The translation of SAMPA units into actual recognition units (HMMs) is, however, not straightforward. It is known, for example, that the impact of context-dependent units on word accuracy is language-dependent, and in particular it differs for the two languages considered. Previous work at IRST showed that satisfactory performance in Italian dictation can be achieved with context-independent units, and that the performance improvement gained by the introduction of context-dependent units is not big enough to compensate for the increased resource requirements. On the other hand, the FORWISS experience shows that the same is not true for the German language, where context-dependent units are almost necessary to achieve good performance. The choice of an optimal set of context-dependent units with respect to a given task and training set has in fact been an important research topic at the Chair of Pattern Recognition at the University of Erlangen, which proposed the polyphone technology. It is thus conceivable that, while keeping the common paradigm of the SAMPA base units, different levels of detail will be chosen when designing the acoustic models for the two languages, trying in any case to avoid too big an impact on the overall system complexity.
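The shared-inventory idea can be made concrete with a small sketch: both lexica are transcribed with one SAMPA-style unit set, the union of which defines the pool of base recognition units. The transcriptions below are rough illustrations, not the project's actual lexica, and the unit-to-HMM mapping is a placeholder for where language-specific detail (e.g. German polyphones) would refine the model set.

```python
# Sketch of a shared SAMPA-based unit inventory for two languages.
# Transcriptions are illustrative assumptions.

german_lexicon = {"Hof": ["h", "o:", "f"],
                  "Januar": ["j", "a", "n", "u", "a:", "6"]}
italian_lexicon = {"via": ["v", "i:", "a"],
                   "Trento": ["t", "r", "E", "n", "t", "o"]}

# The union of all units used by either lexicon is the shared pool.
units = set()
for lex in (german_lexicon, italian_lexicon):
    for phones in lex.values():
        units.update(phones)

# One base HMM per shared unit; context-dependent refinement (e.g.
# polyphones for German) can later replace entries in this mapping.
hmm_for_unit = {u: f"hmm_{u}" for u in sorted(units)}
print(len(units))  # 15 distinct shared units in this toy example
```

Units that occur in both lexica (here "a", "n", "t") are modeled once, which is what makes the multilingual integration "smooth and transparent" at the acoustic level.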

5.2 Language Model Switching

Language model switching is needed because utterances corresponding to different data fields will have different linguistic constraints, i.e. different language models. This is in contrast to a dictation system, where the LM is usually fixed. To cope with this problem, the system must be able to efficiently switch the active LM according to the data entry system state. This could be done by simply loading, at system startup, a set of predefined LMs, one for every possible interaction state, and then selecting one among these alternatives every time the speech recognizer is invoked. After a preliminary analysis, however, this approach was abandoned, because it would require the compilation of many different LMs which could be only slightly different from each other. Instead, it has been decided that there will be a core set of "primitive" LMs, from which the actual LMs to be applied in the different states will be built on the fly by a specific module. This approach, though requiring more work for the adaptation of the recognition system, will allow much greater flexibility and economy of resources.
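The on-the-fly composition of state-specific LMs from a core set of primitives can be sketched as follows. Here a "primitive" LM is reduced to a plain word set and composition to a union, whereas the real system composes finite state networks; all names and word lists are invented for illustration.

```python
# Sketch of on-the-fly LM composition from "primitive" LMs.
# Primitive names, state names, and word lists are illustrative.

PRIMITIVE_LMS = {
    "dates": {"ersten", "Januar", "1897"},
    "names": {"Maria", "Rossi", "Huber"},
    "rights": {"Wegerecht", "servitù"},
}

# Each data entry state declares which primitives it needs.
STATE_SPEC = {
    "owner_field": ["names"],
    "right_field": ["rights", "dates"],
}

def build_lm(state):
    """Combine the primitive LMs required by the interaction state."""
    words = set()
    for name in STATE_SPEC[state]:
        words |= PRIMITIVE_LMS[name]
    return words

print(sorted(build_lm("right_field")))
# ['1897', 'Januar', 'Wegerecht', 'ersten', 'servitù']
```

Only the small primitive LMs need to be stored; any of the many, largely overlapping state-specific LMs is assembled when the recognizer is invoked, which is exactly the flexibility and resource economy argued for above.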

5.3 User Adaptation

Another feature of a data entry task is that the typical user is not occasional. There will be people who use the system for a long period, so it will not be necessary to rely on speaker-independent units only. While speaker-independent units should provide a satisfactory baseline for a new user, a speaker adaptation module will be available, allowing the system to better focus

the acoustic modeling on the particular speaker. There is already an established technology for speaker adaptation at IRST that should be easily exploitable for this task as well. The user is required to utter some phonetically balanced sentences, which are unrelated to the task, and this material is used to modify the parameters of the acoustic models. An adaptation session requires about fifteen minutes of user engagement. For bilingual speakers, utterances in both languages will be exploited to adapt the set of acoustic units.

Adaptation will concern language models as well. In this case, the adaptation will take place with respect to different sites. There will be two kinds of LM adaptation. First, since the LMs will be class-based, site adaptation could consist in filling a general class (e.g. proper name) with site-dependent data (e.g. the set of the most frequent proper names of a specific district). Secondly, LM probabilities could be adjusted as well, once enough sample data become available at the target site. As with acoustic adaptation, there are several LM adaptation methods already developed at IRST that could be applied.
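The first kind of LM adaptation, filling a general class with site-dependent data, is easy to illustrate for a class-based model, where the probability of a word factors into the class probability given the history and the word's probability within the class. The numbers and names below are invented; this is only a sketch of the mechanism, not IRST's adaptation method.

```python
import math

# Class-based LM: P(word | history) = P(class | history) * P(word | class).
# Site adaptation replaces the members of a class without touching
# the class n-gram itself. All probabilities here are illustrative.

class_bigram = {("<s>", "NAME"): 0.3}                   # P(class | history)
class_members = {"NAME": {"Rossi": 0.5, "Huber": 0.5}}  # generic members

def word_logprob(history, word, cls):
    """log P(word | history) under the class-based factorization."""
    return math.log(class_bigram[(history, cls)]) + \
           math.log(class_members[cls][word])

# Site adaptation: fill the NAME class with the district's most
# frequent proper names; the class-level model stays unchanged.
class_members["NAME"] = {"Gruber": 0.6, "Pichler": 0.4}

print(round(word_logprob("<s>", "Gruber", "NAME"), 3))  # -1.715
```

The second kind of adaptation, re-estimating the class probabilities themselves once enough site data are available, would adjust `class_bigram` in the same spirit.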

6 Evaluation and Measurements

Evaluation and measurement will be a key activity of the SpeeData project, as it will guide the development of the final demonstrator. Four general objectives of evaluation are foreseen for the SpeeData project:

- diagnostic, which will look for design errors and bugs in the software architecture;
- performance, which will assess the performance of the employed speech technology;
- prototype adequacy, which will evaluate the demonstrator with respect to the user;
- application adequacy, which will evaluate the demonstrator with respect to the market.

In the following, only the performance and prototype adequacy evaluations will be briefly introduced.

6.1 Performance Evaluation

SpeeData will require refinements of the existing speech recognition technology owned by the partners. Performance tests will be carried out to monitor both internal improvements and the state of the art in the field. For this reason, domain-independent test suites will also be considered. Evaluations will focus on all the individual components of the speech recognizer: acoustic models, language models, search algorithm, language adaptation algorithm, speaker adaptation algorithm, etc. Two main quality characteristics will be addressed: accuracy and efficiency. In fact, the ultimate goal for each component will be to improve the accuracy of the recognizer without trading off efficiency, i.e. consumption of computational resources (time and space).

6.2 Prototype Adequacy

The ultimate goal of adequacy evaluation is to assess the demonstrator with respect to the target users' requirements. A list of quality characteristics has been identified that can express such adequacy for a generic system.


6.2.1 Utility

The aim of the system is to allow data entry of Land Register books by speech. The demonstrator must allow a single user to carry out this job and to store the same information as with a traditional keyboard-based system. The functionality of the system will first be analysed through empirical methods and finally, i.e. during the on-site assessment phase, through objective measurements. In particular, the task-coverage rate will be measured by computing the percentage of data entry tasks that can be managed with the system.

6.2.2 Usability

Usability refers to all those aspects of the demonstrator that involve human-computer interaction. Usability must be considered carefully, as the most important features of the system lie in its speech-recognition-based user interface. Usability attributes for user interfaces can be found in [10]. For the SpeeData project, learnability, efficiency, memorability, error rate, and satisfaction are the attributes that are analysed. The analysis can take place by means of heuristic evaluation, thinking-aloud tests, performance measures, and questionnaires.

7 Acknowledgements

This work is supported by the European Union under project reference number LE 1999, dep. DG XIII. Partners in the SpeeData project are Informatica Trentina (Italy), IRST (Istituto per la Ricerca Scientifica e Tecnologica, Italy), FORWISS (Bavarian Research Center for Knowledge Based Systems, Germany), Regione Autonoma di Trentino Alto Adige (Italy), and Bundesministerium für Justiz (Austria).

8 Outlook

This contribution presented a new application for speech recognition technology. In the SpeeData project, speech is used for data entry. A user has three media for inputting data into a data base: keyboard, mouse, and speech. With speech as the medium, the user may use his hands for other things while entering data. In the region where this application will be used, people have two different mother tongues, and data entry will be possible in both languages. Furthermore, there is a variety of pronunciations, since several dialects are present and users will sometimes be speaking in a language other than their native one.

In the near future, acoustic data will be collected to obtain material on the dialects in the region Trentino-Alto Adige/Südtirol. In the next step of the project, a single system will be realized that recognizes both languages. In a following step, dialects and accents will be modeled in the system. A large part of this project will deal with the portability of recognition systems. As a consequence, there will be another application showing how the know-how of this recognition system can be transferred to other domains and languages.

References

[1] B. Angelini, G. Antoniol, F. Brugnara, M. Cettolo, M. Federico, R. Fiutem, and G. Lazzari. Radiological Reporting by Speech Recognition: The A.Re.S System. In Int. Conf. on Spoken Language Processing, volume 3, pages 1267-1270, Yokohama, September 1994.

[2] K. Belhoula. Rule-based Grapheme-to-Phoneme Conversion of Names. In Proc. European Conf. on Speech Communication and Technology, volume 2, pages 881-884, Berlin, September 1993.

[3] H. Bußmann. Lexikon der Sprachwissenschaft. Alfred Kröner Verlag, Stuttgart, 2nd edition, 1990.

[4] W. Eckert, T. Kuhn, H. Niemann, S. Rieck, A. Scheuer, and E.G. Schukat-Talamazzini. A spoken dialogue system for German intercity train timetable inquiries. In Proc. European Conf. on Speech Communication and Technology, volume 2, pages 1871-1874, Berlin, September 1993.

[5] R. El-Khoury. Evaluating a Speech Interface System for an ICU. Master's thesis, McGill University, Montreal, 1994.

[6] F. Gallwitz, E.G. Schukat-Talamazzini, and H. Niemann. Integrating Large Context Language Models into a Real-Time Word Recognizer. In 3rd Slovenian-German and 2nd SDRV Workshop, Ljubljana, April 1996. (to appear).

[7] I. Ipsic, F. Mihelic, N. Pavesic, and E. Nöth. Slovenian Word Recognition. In 3rd Slovenian-German and 2nd SDRV Workshop, Ljubljana, April 1996. (to appear).

[8] L. Lamel and R. De Mori. Speech recognition of European languages. In Proceedings of the IEEE Automatic Speech Recognition Workshop, Snowbird, 1995.

[9] G. Lazzari. Automatic speech recognition and understanding at IRST. In H. Niemann, R. de Mori, and G. Hanrieder, editors, Progress and Prospects of Speech Research and Technology: Proc. of the CRIM/FORWISS Workshop (München, Sept. 1994), pages 149-157, infix, Sankt Augustin, 1994.

[10] J. Nielsen. Usability Engineering. Academic Press, Boston, 1993.

[11] E.G. Schukat-Talamazzini and H. Niemann. ISADORA: A Speech Modelling Network Based on Hidden Markov Models. Computer Speech & Language, 1993.

[12] S. Shiffman, A. Wu, A. Poon, C. Lane, B. Middleton, R. Miller, F. Masarie, G. Cooper, E. Shortliffe, and L. Fagan. Building a Speech Interface to a Medical Diagnostic System. IEEE Expert, 2:41-50, 1991.

[13] W. Wahlster. Verbmobil: Translation of Face-to-Face Dialogs. In Proc. European Conf. on Speech Communication and Technology, volume "Opening and Plenary Sessions", pages 29-38, Berlin, September 1993.
