Speech Technology for Universal Access in Interactive Systems?

Régis Privat1, Nadine Vigouroux1, Caroline Bousquet1, Philippe Truillet1,2 & Bernard Oriola1

1 IRIT, UMR CNRS, 118 Route de Narbonne, F-31062 Toulouse Cedex - France
{privat,vigourou,bousquet,oriola}@irit.fr

2 CENA DGAC - Division PII, 7 Av. Edouard-Belin, BP 4005, 31055 Toulouse Cedex - France
[email protected]

Abstract

The problem discussed in this paper is how research on spoken interaction can provide a solution for universal access for all. The paper first reports on the state of the art of speech technology for older people. Second, it describes the case study of selecting a French dictation system for a fundamental evaluation by elderly people. This work must be considered a case study. We position our approach with respect to the NIST (http://www.nist.gov/speech/) and EAGLES [Gibbon 1997] evaluation contexts. Finally, we give preliminary results, mainly on the effect of the speakers' age.

1. Introduction

Speech is a natural, efficient and flexible means of human-human communication. Is the use of Speech Recognition (SR) technology still science fiction, or is it an opportunity for universal access to ubiquitous human-computer interaction? Research in speech technologies has been under way for decades, and great advances have been made in reducing the word error rate and in handling natural, spontaneous speech. On the other hand, recent studies demonstrate that machine performance is still quite far from human performance across a wide variety of factors: vocabulary size, noisy environments, bandwidth, familiarity with speech technology, age, social and cultural characteristics of the speaker, speech disorders, etc. [Lippmann 1997] reported a machine word error rate of 43% vs. 4% for human performance on switchboard tasks. Research is in progress to avoid specific adaptation to application domains: [Padmanabhan 2000] proposes to develop a generic speech recognition system that can deal with linguistic as well as acoustic problems from different domains/tasks. The maturity of speech technologies1 offers great opportunities in the way people work with each other and/or with information through Interactive Voice Systems (IVS). The providers of continuous speech technologies are trying to offer speaker-independent, large-vocabulary SR systems [WWW 2001]. This is one of the challenges of IVS such as Web-Galaxy [Lau 1997] and the ARISE project [Baggia 1999] for general public use. For users with speech or language disabilities, or for normally aging elderly people, the challenge could be to focus not on the technology but on the needs of the individual user with regard to the technology's accuracy. Of the many challenges to overcome in developing human-like performance, we believe that spontaneous speech, across age groups, is one of the most encompassing and serious.
As reported by various studies of human-human speech, the main problem is that spontaneous speech contains variations (faltering, false starts, ungrammatical sentences, emotional speech such as laughter, slowing down, etc.) which are attenuated in read speech. However, some identical phenomena are observed in correlation with the speaker's age (http://www.pitt.edu/~hennon/speech.htm). The objective of this work is to start investigating spoken interaction as a solution for the concept of universal access for all [Emiliani 2000]. Speech technologies can be assessed by means of the EAGLES/ISO methodology, which covers software quality characteristics (accuracy2, recoverability, usability3 and changeability) [Canelli 2000] but also quality in use (effectiveness, satisfaction).

1 There are many speech recognition products for dictation (IBM, Dragon Systems, Lernout & Hauspie, Philips, etc.). There are also companies such as Nuance, SpeechWorks, AT&T Bell Labs, Philips, IBM, etc. developing IVS packages over the telephone or the Internet.
2 ISO definition of accuracy: "the capability of the software product to provide the right or agreed results or effects".
3 ISO definition of usability: "the capability of the software product to be understood, learned, used and attractive to the user, when used under specified conditions".

2. Use of voice dictation systems by elderly people

Few works have studied the use of speech technology by handicapped and/or elderly people. [Kambeyanda 1996] reported problems associated with the use of isolated-word speech recognition products: difficulties in maintaining constant pitch, volume and inflection while dictating word by word. Elderly people do have needs for IVS. The first study was conducted by [Wilpon 1996], who reported that recognition error rates increase for speakers over 70 years old. [Yato 1999] described an experiment conducted on 469 elderly Japanese people (age range 59 to 85). The IBM ViaVoice system was run on 600 place-name utterances. Here too, the average error seems to increase with the age of the person. Another interesting experiment was conducted by [Anderson 1999a,b], who studied the effect of "user training", that is to say the ability of a speaker to improve the accuracy rate of the system by modifying his way of speaking. The results showed small changes, but some users discovered solutions on their own (speaking slowly or loudly) to decrease their word error rate. However, as the authors noted, the study did not take into account speaker fatigue, which could also interfere with accuracy. Generally, Speech Recognition Dictation Systems (SRDS) are used to input text by "standard4" speakers.
The IRIT research goals for the use of speech technologies by "all" are: a) to identify the characteristics of SRDS that need to be considered for universal use by "all", mainly during the training phase (lexicon personalization, adaptation of the language model, reduction of the training duration) and for different elocution modes, to avoid problems caused by improper use of SRDS; b) to study whether accuracy problems are independent of either the technology or the user; c) to understand/identify the influence of age-related physical and social changes on spoken communication; d) finally, to propose a methodology of SRDS configuration that reduces both the training time and the cognitive load (fatigue, stress) for elderly people, and to specify intelligent assistants in IVS according to the user's needs.

3. The case study

3.1 The objectives

The challenge is: can SRDS be used by elderly persons to input text and/or to interact through an information service? The standard evaluation methodology would have been to conduct a series of experiments focused on the evaluation of the main SR systems5 for dictation by two age classes of speakers: standard and elderly persons (elderly meaning more than 60 years old). As recommended by NIST (National Institute of Standards and Technology, http://www.nist.gov/speech/) and EAGLES [Gibbon 1997], ASR evaluation requires a large number of speakers and large corpora to yield reliable results. The aim of this paper is not to evaluate the accuracy of the SRDS (the commercial tests already give these results) but to try to delimit the effect of some specific use conditions on accuracy. To avoid a lengthy accuracy evaluation process for several SRDS, we decided to conduct a case study to determine which SRDS for the French language gives the best results for our purpose. This study is performed with a limited number of speakers.
It consists in measuring/evaluating the adaptation facilities (acoustic model, language model, lexicon), the compatibility with the Speech Application Programming Interface (SAPI version 4, http://www.microsoft.com/speech) for integration in an interactive application, and the robustness of accuracy with regard to the speaker-dependent model but also under non-standard elocution (spontaneous speech, elderly persons, speech disorders…).

3.2 Method

3.2.1 Definition of study factors

The three main commercial SRDS5 for the French language were selected for this case study. Two factors were retained: the training effect and the elocution mode. These SRDS are speaker-dependent, which is why the speaker's acoustic model must be adapted before the recognition phase. Two states of training models were defined: 1) the implicit acoustic model provided by the SRDS, merely smoothed by the signal-to-noise ratio (SNR) process applied in the use context (environment and speaker), named the reference model; 2) the acoustic speaker-dependent model, named SDM.

4 Standard means male or female speakers of 25-60 years old.
5 L&H Voice Xpress Professional V4; IBM ViaVoice Pro, Millennium Edition, release 7; Dragon NaturallySpeaking Preferred V4.

To measure the effect of the elocution mode on accuracy, three different modes of elocution were defined: text reading with (EM1) and without punctuation markers (EM2), and spontaneous speech (EM3), as natural as possible. These three modes were suggested both by the results of [Kambeyanda 1996], who reported that SRDS accuracy depends on the elocution, and by the IRIT hypothesis that it is reasonable to use SRDS as a spoken interaction mode with a computer application. Each speaker therefore performed the training sequence facing a computer (the texts were presented on the screen) for the three modes. For this case study, only two computer science students performed these test series, but the tests extensively covered the different elocution modes (three modes) and speaker-dependent acoustic models (three models). We decided to test all the speaker-dependent acoustic models for each SRDS with the two users in order to determine the real effects of the training phase: firstly, to evaluate whether the training phase benefits accuracy; secondly, to evaluate the deviation in accuracy of an SRDS trained for one speaker and tested with another (the third column of each SRDS sub-table in Table 1). This value can be interpreted as an indicator of the system's ability to recognize different voices without SNR smoothing and speaker training, which matters because vocal input for IVS has to be independent of the speaker. For each SRDS, 18 tests were run. Each test took about fifty minutes for the training phase (adjustment, elocution phases and the automatic training smoothing by the SRDS). For the purpose of this paper, the three systems will remain undifferentiated.

3.2.2 Apparatus

Tests were run on a portable Compaq Armada M-700, PII 366 with 128 MB of RAM, under Microsoft Windows 95.
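The 18 runs per SRDS follow directly from crossing the three study factors; a minimal enumeration (labels are illustrative, not the authors'):

```python
from itertools import product

# Illustrative labels: the paper crosses three acoustic models (the implicit
# reference model plus the two speaker-dependent models), three elocution
# modes and two speakers.
models = ["reference", "SDM speaker 1", "SDM speaker 2"]
modes = ["EM1", "EM2", "EM3"]
speakers = ["speaker 1", "speaker 2"]

tests = list(product(models, modes, speakers))
print(len(tests))  # → 18, the number of tests run per SRDS
```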
The training and test voice corpora were recorded at a sampling rate of 22 kHz through an Andrea Technology ANC-600 headset on a Sony Digital Audio Tape recorder. They were transmitted to the SRDS by a cable plugged directly into the microphone input of the sound card. As speech variability must not interfere with the results, the test corpora were recorded so as to compare recognition on the same input. The test corpus is a phonetically calibrated text [Cadilhac 1997] of 196 words.

3.2.3 Accuracy

Each SRDS output is compared to the orthographic reference transcription by means of an alignment tool. The accuracy metric is based on the ratio between the number of words correctly/wrongly transcribed and the number of words in the text. The results are given as correct-word-recognition rates.

3.3 Results and discussion

            |          SRDS 1             |          SRDS 2             |          SRDS 3
            | Reference    SDM      SDM   | Reference    SDM      SDM   | Reference    SDM      SDM
            |   model  (speaker) (other)  |   model  (speaker) (other)  |   model  (speaker) (other)
Spk 1  EM1  |  76.28%    +3.83%  -10.71%  |  84.44%    -1.53%   -5.61%  |  82.14%    -2.30%  -14.03%
       EM2  |  78.32%    +2.04%  -14.80%  |  79.59%    -0.77%   -3.83%  |  81.38%    -4.59%  -14.29%
       EM3  |  68.11%    +4.85%  -14.29%  |  74.49%    -3.06%   -7.40%  |  79.34%   -10.20%  -22.96%
Spk 2  EM1  |  61.48%    +3.57%  +14.03%  |  76.02%    +5.61%   -9.18%  |  63.78%   +12.76%   +9.44%
       EM2  |  61.99%    +2.30%  +14.29%  |  71.68%    -3.06%  -19.90%  |  58.93%   +15.82%  +12.24%
       EM3  |  42.86%    +8.93%  +20.41%  |  68.62%    -4.85%  -19.64%  |  52.55%    +8.67%  +11.73%

Table 1: Some of the case study results
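To make the metric and the table concrete, the sketch below is a simplified stand-in for the alignment tool of section 3.2.3 (the example sentences are invented), followed by a reading of the EM1-to-EM3 drops off the reference-model columns of Table 1:

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Approximate correct-word rate: 1 - edit_distance / reference length.
    A simplified proxy for the alignment-based metric, not the authors' tool."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # current row of the edit-distance table
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return 1.0 - prev[-1] / len(ref)

# One substitution and one deletion over six reference words -> 4/6 correct.
print(word_accuracy("le chat dort sur le tapis", "le chien dort sur tapis"))

# Reference-model rates (%) from Table 1, ordered [EM1, EM2, EM3].
ref_rates = {
    ("SRDS 1", "Spk 1"): [76.28, 78.32, 68.11],
    ("SRDS 1", "Spk 2"): [61.48, 61.99, 42.86],
    ("SRDS 2", "Spk 1"): [84.44, 79.59, 74.49],
    ("SRDS 2", "Spk 2"): [76.02, 71.68, 68.62],
    ("SRDS 3", "Spk 1"): [82.14, 81.38, 79.34],
    ("SRDS 3", "Spk 2"): [63.78, 58.93, 52.55],
}
drops = {k: em[0] - em[2] for k, em in ref_rates.items()}  # EM1 minus EM3
avg_drop = sum(drops.values()) / len(drops)
# Every speaker/system pair loses accuracy in spontaneous speech; the average
# reference-model drop is about 9.7 points (the 10-13% range quoted in the
# discussion also covers the trained models).
print(f"average reference-model drop EM1->EM3: {avg_drop:.1f} points")
```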

The results (Table 1) are given in terms of word-recognition rate for the reference model. The change in recognition rate due to the training phase is then given for the two speakers (SDM for the speaker who performed the training phase, and SDM for the other speaker). Two interesting points can be noticed. The first is the quite comparable results for the first two elocution modes, especially with SRDS 1 (average increase of 0.13% for the elocution mode without punctuation) and SRDS 3 (average decrease of 2.25%). The second is that the results seem to confirm that the recognition rate depends on the speaking mode (an average decrease of between 10 and 13% from EM1 to EM3 can be noticed), and this with all the SRDS.

4. Towards an analytic methodology of SRDS evaluation

Our purpose is to build an analytic methodology for the evaluation of SRDS in order to determine whether they can be integrated in interactive systems as a solution for enabling speech input. The case study confirms the needs regarding the selected SRDS:

a) to investigate the use of speech technologies by "all", mainly regarding the importance of tools to adapt/personalize the linguistic knowledge (lexicon personalization taking into account phonological variants of pronunciation, adaptation of the language model);
b) to study the effect of age on the elocution mode, speech rate, sound energy and phonological transcription;
c) to analyze whether accuracy problems are independent of either the technology or the user.

4.1 Methodology

Three factors were retained for this study: the age effect, the elocution mode and the training duration of the chosen SRDS. For the first two factors, the same principle described in 3.2.1 will be applied. Concerning our efforts to reduce the training time, we plan to build a set of acoustic models for each speaker by changing the training text: we want to evaluate the re-use of the short texts required by the SRDS for smoothing the implicit model. This approach aims to reduce the effort, stress and fatigue caused by long training times. As in the case study, the three elocution modes will be used (EM1, EM2, EM3).

4.1.1 Participants

The study deals with two classes of speakers: students in computer science (20-30 years old) and elderly persons (more than 60 years old). For each population, fifteen speakers have been selected.

4.1.2 Apparatus

We decided to use a more powerful workstation, to avoid any hardware limitation for the SRDS. The computer is a PIII 933 with 396 MB of RAM and a Sound Blaster Live Player sound card. The operating system is Microsoft Windows 2000 Professional. All the corpora (training, tests) are recorded on a DAT at a sampling rate of 48 kHz using an Andrea Technology ANC-600 headset.
4.1.3 Procedure and study factors

Our aim is to study the effects of age, training-time reduction and elocution mode on the recognition results in terms of accuracy (word-recognition rate), but also to identify extra-linguistic phenomena which may occur during spoken human-computer dialogs. The same principles as in 3.2.3 are applied to compute the accuracy. A satisfaction questionnaire will be added to estimate the quality in use and the cognitive processes.

4.1.4 Preliminary results and discussion

Global results show that successful use of SRDS requires some care: the training phase consisted of two preliminary tests (SNR adjustment, benchmarks of the audio devices) and the reading of the training text; these two phases are essential. In this paper, we decided to focus our discussion on the speech-rate analysis. This analysis is done partially, on four of the thirty speakers. The speech rate is computed as the number of words in the texts divided by the duration of the recording of each corpus.

Results analysis of the word rate for all elocution modes: as shown in Table 2, the standard deviation varies between 0.304 and 0.512. The highest value can be explained by the speaker's behavior: it is due to the strategy taken by speaker 3 during EM3, whose word rate differs strongly from elocution mode EM1 (+67 words per minute in the spontaneous mode).

Results analysis for the reading modes: for all speakers, the standard deviation varies between 0.31 and 0.34. Across all speakers, the average is 2.57 words per second (around 154 words per minute). This rate is very close to the rate announced by [Price 1999], who reported that a user can dictate 160 words per minute with the Dragon system. For speakers 1 and 2 (20-30 years old), the word rate increases (+22 words per minute). For speakers 3 and 4 (more than 60 years old), the word rate decreases (-22 words per minute).
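The speech-rate computation described above is a simple ratio; a minimal sketch (the 76-second duration below is a hypothetical figure for illustration, not from the paper):

```python
def speech_rate(num_words: int, duration_s: float) -> tuple[float, float]:
    """Return (words per second, words per minute): word count over duration."""
    wps = num_words / duration_s
    return wps, wps * 60

# Hypothetical example: the 196-word calibrated test text read in 76 seconds.
wps, wpm = speech_rate(196, 76.0)
print(f"{wps:.2f} words/s = {wpm:.0f} words/min")  # → 2.58 words/s = 155 words/min
```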
These preliminary results seem to confirm an effect of age. They need to be confirmed on the whole set of speakers.

                           Speaker 1   Speaker 2   Speaker 3   Speaker 4
Training phase               2.86        3.05        2.16        2.23
Test text (EM1)              2.62        2.72        1.85        2.34
Test text (EM2)              2.93        3.27        2.36        2.39
Test text (EM3)              2.84        2.18        2.97        2.23
Average                      2.846       2.986       2.181       2.245
Standard deviation           0.304       0.390       0.512       0.302
Average - EM3                2.846       3.041       2.154       2.245
Standard deviation - EM3     0.340       0.310       0.328       0.332

Table 2: Speech rate given in words per second.
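The per-speaker figures in Table 2 can be cross-checked in a few lines (rates transcribed from the table; speaker 3's +67 words-per-minute jump is recovered from the EM1 and EM3 entries):

```python
from statistics import mean

# Words-per-second rates from Table 2, rows: [training, EM1, EM2, EM3].
rates = {
    "speaker 1": [2.86, 2.62, 2.93, 2.84],
    "speaker 2": [3.05, 2.72, 3.27, 2.18],
    "speaker 3": [2.16, 1.85, 2.36, 2.97],
    "speaker 4": [2.23, 2.34, 2.39, 2.23],
}

for spk, r in rates.items():
    print(f"{spk}: {mean(r) * 60:.0f} words per minute on average")

# Speaker 3's reading-to-spontaneous jump (EM3 minus EM1):
jump_wpm = (rates["speaker 3"][3] - rates["speaker 3"][1]) * 60
print(round(jump_wpm))  # → 67, as noted in the discussion
```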

These first results point out the larger deviation observed for the spontaneous mode (EM3). This means that the guidelines for pronouncing the text in the EM3 condition need to be refined for the next stage. At present, two directions are being explored: 1) one is based on explanations and examples; 2) the other is to define dialogue scripts to try to obtain more natural corpora. Correlation analyses are being conducted between SRDS accuracy and word rate per minute. These first results will provide knowledge about the behavior of speakers according to their age and to the interaction context (reading or spontaneous speaking).

5. Conclusion

The preliminary results of both the case study for selecting an SRDS for the French language and the evaluation of the chosen system for standard (versus elderly) people show that it is necessary to carry out the experimental protocol for standard and elderly speakers. It seems that age has an effect on speech rate. A number of open questions have not yet been addressed, including the following: a. User characteristics need to be taken into account, such as general cognitive state (fatigue, stress), speech disorders (difficulties in pronouncing syllables, unknown words), syntactic structures, and the ability to self-adapt to the reading and talking exercise. The first diagnostic analyses seem to show that the word error rate depends on qualities of the speaker's voice, such as fluency, prosodic stress and whether punctuation is respected in a reading task. b. It is necessary to design the lexicon and language model for both faltering pronunciation and extra-linguistic words. A substantial effort of experimentation and analysis must be pursued to improve universal access by means of speech interaction for elderly persons. The final goal is to give guidelines for implementing/running these technologies for elderly persons.
Acknowledgments

We wish to thank Région Midi-Pyrénées (France) for supporting this work.

6. References

[Anderson 1999a] Anderson S., Liberman N., Gillick L., Foster S., Hama S., The Effects of Speaker Training on ASR Accuracy, in Proceedings of EUROSPEECH'99, Budapest, Hungary, CD-ROM.
[Anderson 1999b] Anderson S., Liberman N., Bernstein E., Foster S., Cate E., Levin B., Recognition of Elderly Speech and Speech Driven Document Retrieval, IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, AZ, March 1999.
[Baggia 1999] Baggia P., Kellner A., Pérennou G., Popovici C., Sturm J., Wessel F., "Language Modelling and Spoken Dialogue Systems - the ARISE experience", in Eurospeech'99, Budapest, Hungary, 5-9 September 1999, Vol. 4, pp. 1767-1770.
[Cadilhac 1997] Cadilhac C., Des structures textuelles à leur traitement : compréhension et mémorisation d'un récit par déments de type Alzheimer et sujets normaux âgés, doctoral thesis, Université Toulouse II, December 1997.
[Canelli 2000] Canelli M., Grasso D., King M., Methods and Metrics for the Evaluation of Dictation Systems: A Case Study, in Second International Conference on Language Resources and Evaluation, Proceedings Volume III, 31 May-2 June 2000, pp. 1325-1331.
[Emiliani 2000] Emiliani P.L., Stephanidis C., From Adaptations to User Interfaces for All, 6th ERCIM Workshop "User Interfaces for All", Florence, Italy, 25-26 October 2000, pp. 313-323.
[Gibbon 1997] Gibbon D., Moore R., Winski R. (eds.), Handbook of Standards and Resources for Spoken Language Systems, Berlin, New York, 1997.
[Kambeyanda 1996] Kambeyanda D., Cronk S., Singera L., Potential Problems Associated with Use of Speech Recognition Products, RESNA'96 Proceedings, pp. 119-122.
[Lau 1997] Lau R., Flammia G., Pao C., Zue V., "WebGALAXY: Integrating Spoken Language and Hypertext Navigation", in Proceedings of Eurospeech'97, Rhodes, Greece, pp. 883-886, September 1997.
From the World Wide Web: http://www.sls.lcs.mit.edu/raylau/publications.html#WebGalEuro
[Lippmann 1997] Lippmann R., Speech Recognition by Machines and Humans, Speech Communication, Vol. 22, No. 1, 1997.
[Padmanabhan 2000] Padmanabhan M., Picheny M., "Towards super-human speech recognition", in Automatic Speech Recognition: Challenges for the New Millennium, 18-20 September 2000, Paris, France, pp. 189-194.
[Wilpon 1996] Wilpon J.G., Jacobsen C.N., "A Study of Speech Recognition for Children and the Elderly", Proc. of ICASSP'96, pp. 349-352.
[Yato 1999] Yato F., Inoue N., Hashimoto K., A Study of Speech Recognition for the Elderly, in Proceedings of EUROSPEECH'99, Budapest, Hungary, CD-ROM.
[WWW 2001] http://www-4.ibm.com/software/speech/, http://www.dragonsys.com/, http://www.lhsl.com/, http://www.speech.philips.com/ud/get/Pages/psp_home.htm, http://www.nuance.com/
