Speech Recognition Issues for Dutch Spoken Document Retrieval

Roeland Ordelman, Arjan van Hessen, Franciska de Jong
University of Twente, Department of Computer Science, The Netherlands
{ordelman, hessen, fdejong}@cs.utwente.nl
Abstract. In this paper, ongoing work on the development of the speech recognition modules of an MMIR environment for Dutch is described. The work on the generation of acoustic models and language models is presented, along with their current performance. Some characteristics of the Dutch language and of the target video archives that require special treatment are discussed.
1 Introduction
Using speech recognition transcripts of spoken audio to enhance the accessibility of multimedia streams via text indexing and/or retrieval techniques has proven to be very useful. Automatic speech recognition is applied in various Multimedia Information Retrieval (MMIR) systems (see e.g. [1]). Transcribing speech offers the opportunity not only to make audio content accessible via standard full-text retrieval and other advanced retrieval tools, but also, via the time-codes of the audio, to index video fragments on the basis of content features. The topic is on the international research agenda in several ways: there has been a TREC task for Spoken Document Retrieval (SDR) [9], there will be a video retrieval task at this year's TREC, and speech transcripts are also a primary source in the Topic Detection and Tracking evaluation [16]. A large number of issues remain to be solved in this domain, of which the following two inspired the work described here. For an overview of other research themes, cf. [11].

1. One of the biggest challenges in building such a system is undoubtedly the development and implementation of a large vocabulary, speaker independent, continuous speech recogniser (LVCSR) for the specific language. This functionality is crucial for disclosing e.g. video archives with programmes on a broad domain and with a non-fixed set of speakers, such as news shows.
2. Most existing MMIR systems or prototypes that use speech recognition focus on the English language, and many of the IR paradigms can readily be applied to speech data from almost any language. However, for non-English speech, tailored recognition techniques are often needed to make the generated transcripts accessible or otherwise useful within a retrieval environment.
This paper describes the development of an MMIR environment for Dutch video archives, taken up in a series of related collaborative projects in which, among others, the University of Twente and the research organisation TNO participate. The focus will be on the applied approach to speech recognition and on the requirements following from the envisaged applications for the projects DRUID (Document Retrieval Using Intelligent Disclosure) [12] and ECHO (European CHronicles Online). Whereas the DRUID project concentrates on contemporary data from Dutch broadcasts, the ECHO project aims mainly at the disclosure of historical national video archives. At the outset of DRUID in 1998, some experience with SDR was available from the OLIVE project, in which the speech recognition for English, French and German was provided by LIMSI [12, 13]. However, no speech recognition system for Dutch suitable for SDR research was available. We were given the opportunity to use the ABBOT speech recognition system [14], originally developed for English at the Universities of Cambridge and Sheffield. One of the major goals within DRUID is to port the ABBOT system to Dutch by developing both language specific speech models and language models for Dutch. This implied, firstly, that sufficient amounts of speech and text data along with an extensive pronunciation lexicon had to be collected. Secondly, language specific characteristics had to be considered thoroughly when determining system parameters, such as the phone set to be used and the main vocabulary features. Finally, the system had to be tailored to the envisaged video retrieval task. Within ECHO the focus is on the additional requirements following from the historical nature of the target archives. In the next sections we give an overview of the work done on acoustic and language modelling for a Dutch speech recognition system for SDR, along with some preliminary evaluation statistics. We will specifically go into some language characteristics of Dutch that are important in the language modelling process.
2 Acoustic Modelling
The speech recognition system ABBOT is a hybrid connectionist/HMM system [7]. The acoustic modelling is done with a recurrent neural net which estimates the posterior probability of each phone given the acoustic data. An indication of the performance of the acoustic modelling is obtained by unlinking the neural net from the system and looking at its phone classification performance, typically expressed as a phone error rate.

2.1 Training Data and Performance
The training data consists of 50 hours of acoustic training data with textual transcripts and a phonetic dictionary. The former is a combination of some 35 hours of mainly read speech that was publicly available (Groningen Corpus and SpeechStyles Corpus) and an additional corpus of about 15 hours of read speech from newspapers, which was added in view of the target data for DRUID: news programmes. On test data consisting of read speech only, the baseline performance was a 32% phone error rate. On broadcast news test data we achieved a phone error rate of 55%, which is not surprising given the discrepancy between training and test data: broadcast news material contains both read and spontaneous speech, in studio but also in noisy environments, with broad-band as well as narrow-band recordings. Clearly, to achieve better results on broadcast news data, the models have to be adapted to the acoustic conditions and speech types in this domain. For this purpose, the manual transcription of a substantial collection of broadcast news data was started last year at TNO. Recently, a first release of the Dutch national speech resource collection project 'Corpus Gesproken Nederlands'1 (CGN) also became available. This corpus contains a variety of speech types in different contexts, and a reasonable amount of its data is being used to improve the acoustic models for the broadcast news domain.

2.2 Acoustic Modelling Issues
Historical audio data. Within the ECHO project, the speech recogniser is used for the transcription of audio data derived from historical video archives. The quality of this audio ranges from very low (before 1950) to medium (until 1965) and reasonable (until 1975). We therefore expected our speech recognition performance to be considerably lower for this kind of data than for contemporary data. Indeed, a preliminary test run showed a word error rate of 68%. Since the collection became available only recently, we have not been able to do a thorough study, but a first glance at the data shows that we could improve acoustic modelling by adapting our models to one particular speaker (Philip Bloemendaal). This speaker is famous in the Netherlands because his voice is a characteristic element of a collection covering three decades of news items (the 'Polygoon Journaals', shown in cinemas), and his voice is present in a substantial part of the ECHO collection.

Phonetic Dictionary. An indispensable tool in the acoustic model training process is a reliable phonetic dictionary to convert the words in the audio transcripts to their phonetic representations. The Dutch dictionary publisher Van Dale Lexicography provided us with a phonetic dictionary of 223K words. These very detailed (and manually checked) phonetic transcriptions, generated using a phone set of about 200 different phones, served as a starting point for all our lexicon development steps. First, a grapheme-to-phoneme (G2P) converter was trained using a machine learning algorithm [6]. Given a word list of unseen words, without names or acronyms, the G2P transcribed 72% of the words to the correct transcriptions in the Van Dale phone set. Of the 28% of the words that were wrongly transcribed, roughly 10% consisted of foreign words and words with spelling errors. Since the Van Dale phone set is much too detailed and therefore not suitable for speech recognition purposes, conversion tables were built to map the Van Dale phone set to the DRUID phone set, which is SAMPA with a few modifications. Furthermore, the phonetic dictionary was augmented by adding derivations of existing words and compounds for which the phonetic transcription could be predicted with maximum certainty. Finally, words from transcription tasks and language model vocabularies that had been processed by the G2P routine and manually checked were added. The Van Dale format dictionary now contains 230K transcriptions; the automatically generated compound dictionary in DRUID format contains an additional 213K entries.

1 http://www.elis.rug.ac.be/cgn/
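The phone-set mapping described above can be sketched as a simple table lookup. This is an illustrative sketch only: the phone symbols and mappings below are invented for the example and are not the actual Van Dale or DRUID inventories.

```python
# Sketch of mapping a detailed dictionary phone set to a coarser
# recognition phone set via a conversion table. The symbols and
# mappings here are hypothetical, not the real Van Dale/DRUID tables.

CONVERSION = {
    "a:": "a:",   # long vowel kept as-is
    "a~": "A",    # nasalised variant collapsed to the plain phone
    "E:": "E",    # long E collapsed to short E
    "@": "@",     # schwa unchanged
}

def convert(phones):
    """Map a detailed phonetic transcription to the coarser phone set;
    phones without an entry pass through unchanged."""
    return [CONVERSION.get(p, p) for p in phones]

print(convert(["a~", "E:", "@"]))  # -> ['A', 'E', '@']
```

In practice such a table is many-to-one, which is exactly why the mapping loses the fine phonetic detail that is not useful for recognition.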
3 Language Modelling
For the language modelling of the speech recogniser we collected 152M words from various sources. Dutch newspaper data (146M words) was provided by the 'Persdatabank', an organisation that administers the exploitation rights of four major Dutch newspapers. In [2], LM perplexity on broadcast news test data is reduced considerably by adding transcripts of broadcast news shows (the BNA & BNC corpora) to the LM training data. Since similar corpora are not available for Dutch, we started recording teletext subtitles from broadcast news and current affairs shows in 1998. In addition, the Dutch National Broadcast Foundation (NOS) provides the auto cues of broadcast news shows. Although the teletext material, and to a lesser degree the auto cues material, does not match the domain as well as manual transcripts would, it is a welcome addition to our data set. All data was first converted to XML and stored in a database to allow content selection (foreign affairs, politics, business, sports, etc.). A pre-processing module was built on top of the database to convert the raw newspaper text to a version more suitable for language modelling purposes. Basically, the module reduces the number of spelling variants: it removes punctuation (or maps certain punctuation to a special symbol), expands numbers and abbreviations, and does case processing based on the uppercase/lowercase statistics of the complete corpus. Finally, the module tries to correct frequent spelling errors based on a spelling suggestion list that was provided by Van Dale Lexicography. A baseline backed-off trigram language model was created using an initial version of the pre-processing module, without spelling correction and with only a small portion of the available data. A 40K vocabulary was used that included those words from the top-100K word frequency list for which a manually checked phonetic transcription was also available.
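The kind of normalisation the pre-processing module performs can be sketched as follows. This is a toy illustration, not the actual DRUID module: the abbreviation table, number list and punctuation rules are invented for the example.

```python
import re

# Illustrative LM text normalisation in the spirit of the pre-processing
# module described above. All rules here are toy assumptions.

ABBREVIATIONS = {"dhr.": "de heer"}          # hypothetical expansion table
NUMBERS = {"1": "een", "2": "twee", "3": "drie"}  # toy number expansion

def normalise(line: str) -> str:
    # Expand abbreviations first, so their periods are not treated
    # as sentence boundaries below.
    for abbr, full in ABBREVIATIONS.items():
        line = line.replace(abbr, full)
    # Map sentence-final punctuation to a special symbol, drop the rest.
    line = re.sub(r"[.!?]", " </s>", line)
    line = re.sub(r"[,;:\"()]", " ", line)
    # Expand (a few) numbers to words.
    line = re.sub(r"\b([123])\b", lambda m: NUMBERS[m.group(1)], line)
    return re.sub(r"\s+", " ", line).strip()

print(normalise("Hij kocht 2 kranten, zei dhr. Jansen."))
# -> Hij kocht twee kranten zei de heer Jansen </s>
```

A real module would of course also handle case statistics and spelling correction, as described above.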
The model was trained with version 2 of the CMU-Cambridge Statistical Language Model Toolkit [8], using Witten-Bell discounting. With this language model and our acoustic model based on read speech, a 34% word error rate (WER) was achieved on read speech and a 58% WER in the broadcast news domain.

3.1 Language Modelling Issues
Recognition performance is expected to improve when all available data is used. However, improvement on some typical language modelling issues in the broadcast news domain is necessary for a further decrease of the word error rate. Training on text data from domains that match the broadcast news domain, such as text transcripts of broadcast news, seems to be of vital importance [2]. Although manually generated broadcast news transcripts are being collected at the moment (for acoustic modelling purposes), we expect that the size of this collection will not be sufficient to yield significantly better language models. Therefore, for the time being we rely on the teletext and auto cues data, which at least fairly match the broadcast news domain. Additional performance improvement is expected from the special handling of certain linguistic phenomena, as discussed in the next sections.

Compounds. In automatic speech recognition, the goal of lexicon optimisation is to construct a lexicon with exactly those words that are most likely to appear in the test data. Lexical coverage of a lexicon should be as high as possible to minimise out-of-vocabulary (OOV) words, which are an important source of error in a speech recognition system. Experiments as in [15, 10] show that every OOV word results in between 1.2 and 2.2 word recognition errors. Therefore, a thorough examination of lexical coverage in Dutch is essential to optimise the performance of the speech recogniser. In [5], lexical variety and lexical coverage are compared across languages with the ratio

  (#words in the language) / (#distinct words in the language)

which provides an indication of how difficult it is to obtain a high lexical coverage for a given language. In general, the more distinct words there are in a language, the harder it is to achieve a high lexical coverage. When the ratios of two languages are compared, the language with the higher ratio has less difficulty in obtaining an optimal lexical coverage than the other language.
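The ratio from [5] is straightforward to compute over a corpus. The toy word lists below are invented to illustrate the idea that heavy compounding inflates the number of distinct word forms and lowers the ratio.

```python
from collections import Counter

def lexical_ratio(words):
    """Ratio of running words to distinct word forms, as used in [5]:
    a higher ratio means high lexical coverage is easier to reach."""
    distinct = Counter(words)
    return len(words) / len(distinct)

# Toy corpora of equal size; the second contains compound forms
# (e.g. 'kattenbak'), so it has more distinct types and a lower ratio.
english_like = ["the", "cat", "sat", "the", "cat", "ran"]
dutch_like = ["de", "kat", "zat", "de", "kattenbak", "katten"]

print(lexical_ratio(english_like))  # 6 tokens / 4 types = 1.5
print(lexical_ratio(dutch_like))    # 6 tokens / 5 types = 1.2
```

On real data the same computation produces the ratio column of Table 1.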
In Table 1 the statistics found in [5] are given, with those for Dutch added (coverage based on the normalised training text). It shows that Dutch is comparable with German, although the lexical coverage of German is even poorer than that of Dutch. The reason is that German has case declension for articles, adjectives and nouns, which dramatically increases the number of distinct words, while Dutch does not. The major reason for the poor lexical coverage of German and Dutch compared to the other languages is word compounding [3, 4]: words can (almost) freely be joined together to form new words. Because of compounding in German and Dutch, a larger lexicon is needed for these languages to achieve the same lexical coverage as for English. To investigate whether lexical coverage for Dutch could be improved by de-compounding compound words into their separate constituents, a de-compounding procedure was created. Since we do not have tools for a thorough morphological analysis, we used a partial de-compounding procedure: every word is checked against a 'dictionary' list of 217K frequent compound words that was provided by Van Dale Lexicography. Every compound is translated into two separate constituents. After a first run, all words are checked again in a second run, to split compound
Table 1. Comparison of languages in terms of number of distinct words, lexical coverage and OOV rates for different lexicon sizes. PDB stands for Persdatabank, FR for Frankfurter Rundschau.

Language  Corpus    Total words  #distinct  ratio  5K cov.  20K cov.  65K cov.  20K OOV  65K OOV
English   WSJ       37.2M        165K       225    90.6%    97.5%     99.6%     2.5%     0.4%
Italian   Sole 24   25.7M        200K       128    88.3%    96.3%     99.0%     3.7%     1.0%
French    Le Monde  37.7M        280K       135    85.2%    94.7%     98.3%     5.3%     1.7%
Dutch     PDB       22M          320K       69     84.6%    93.0%     97.5%     7.0%     2.5%
German    FR        36M          650K       55     82.9%    90.0%     95.1%     10.0%    4.9%
words that remained because they originally consisted of more than two constituents. Table 2 shows the number of words, the number of distinct words, the ratio and the lexical coverage of the top-20K, 40K and 60K vocabularies based on the complete newspaper data set, before and after applying the de-compounding procedure. As expected, de-compounding significantly improves the lexical coverage of all vocabularies. Note, however, that de-compounding typically produces more short words. Since shorter words tend to be recognised with more difficulty than longer words because of larger acoustic confusability, and de-compounding shortens exactly the longer compound words, the possible improvement of recognition performance from decreasing the number of OOVs could be neutralised to some extent by growing acoustic confusability.

Table 2. Number of words, distinct words, ratio and lexical coverage of 20K, 40K and 60K vocabularies based on the original data set and after applying the de-compounding procedure.

               #words       #distinct  ratio   20K     40K     60K
Original       146,564,949  933,297    157.04  92.90%  95.74%  96.99%
De-compounded  149,121,805  739,304    201.71  93.92%  96.59%  97.69%
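The two-pass partial de-compounding procedure can be sketched as follows. The compound list here is a three-entry toy stand-in for the 217K-entry Van Dale list; the Dutch words are illustrative examples.

```python
# Sketch of the two-pass partial de-compounding described above.
# COMPOUNDS is a toy stand-in for the 217K-entry Van Dale list:
# each known compound maps to exactly two constituents.
COMPOUNDS = {
    "fietspadennet": ("fietspaden", "net"),
    "fietspaden": ("fiets", "paden"),
    "fietspad": ("fiets", "pad"),
}

def split_once(words):
    """Replace each known compound by its two constituents."""
    out = []
    for w in words:
        out.extend(COMPOUNDS.get(w, (w,)))
    return out

def decompound(words):
    # First run splits each compound into two constituents; the second
    # run splits constituents that are themselves compounds, handling
    # words that originally had more than two constituents.
    return split_once(split_once(words))

print(decompound(["fietspadennet"]))  # -> ['fiets', 'paden', 'net']
```

For language modelling the split constituents then replace the original compound in the training text, which is what raises the coverage figures in Table 2.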
Proper Names and Acronyms. Proper names and acronyms deserve special attention in speech recognition development for Spoken Document Retrieval. They are important, information-carrying words but, especially in the broadcast news domain, they are also often out-of-vocabulary and therefore a major source of error. In general, proper names and acronyms are selected for the vocabulary like any other word, according to their frequencies in the development data. Following this procedure, almost 28% of our 65K lexicon consists of proper names and acronyms. We did a few experiments to see how well frequency statistics can model the occurrence of proper names and acronyms. Given a 65K lexicon based on the Persdatabank2000 data set (22M words), we removed different amounts of proper names and acronyms according to a decision criterion and replaced them by words from the overall word frequency list, thus creating new 65K lexicons. To measure lexical coverage and OOV rates, we took the training data itself and, since we do not have accurate transcriptions of broadcast news, a test set of 35,000 words of teletext subtitling information from January 2001 broadcast news, as a rough estimate of the actual transcriptions of broadcast news shows. Table 3 lists the lexical coverage and OOV rates of these lexicons. It shows that selecting proper names and acronyms like any other word according to their frequencies in the development data works very well, with a lexical coverage of 97.5%. The next step was removing a part of the proper names and acronyms and replacing them by normal words from the word frequency list. Neither selecting only 15% instead of 27.9% proper names and acronyms, nor selecting only those with a frequency of occurrence of at least N (where N was 100, 500 or 1000), nor removing all proper names and acronyms, improved lexical coverage.

Table 3. Lexical coverage and OOV rates of 65K lexicons created with different amounts of proper names from different sources. PDB stands for the Persdatabank2000 subset, N means frequency of occurrence in the training data.

training data  proper names & acronyms  lexical coverage  OOV
PDB            27.9%                    97.5%             2.5%
PDB            0%                       91.4%             8.7%
PDB            15%                      97.2%             2.8%
PDB            3.8% (N > 100)           96.0%             4.1%
PDB            0.7% (N > 500)           94.2%             5.9%
PDB            0.3% (N > 1000)          93.3%             6.7%
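The coverage and OOV figures in these experiments follow from a simple count over the test text given a candidate lexicon. A minimal sketch, with an invented four-word lexicon and five-word test text:

```python
def coverage_and_oov(lexicon, test_words):
    """Lexical coverage and OOV rate of a lexicon on a test text:
    coverage is the fraction of running test words found in the
    lexicon, and the OOV rate is its complement."""
    vocab = set(lexicon)
    in_vocab = sum(1 for w in test_words if w in vocab)
    coverage = in_vocab / len(test_words)
    return coverage, 1.0 - coverage

# Toy example: 'kabinetsformatie' is the only out-of-vocabulary word.
lexicon = ["de", "minister", "zei", "dat"]
test = ["de", "minister", "zei", "dat", "kabinetsformatie"]
cov, oov = coverage_and_oov(lexicon, test)
print(f"coverage {cov:.0%}, OOV {oov:.0%}")  # coverage 80%, OOV 20%
```

Running the same computation with each candidate 65K lexicon against the 35,000-word teletext test set yields the rows of Table 3.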
Historical data. For the processing of the historical data targeted in the ECHO project, specific language models have to be created. The old-fashioned manner of speaking, with bombastic language and typical grammatical constructions, places specific demands on the language model. Also, the frequent occurrence of names and ordinary words that are rarely used any more requires special measures. Domain- or time-specific text data is needed to create both a lexicon and a language model that adequately fit the data. However, only a very small amount of text data related to the test data in the collection is available, and where it exists, it is available on paper only. A few attempts to convert this paper data into computer-readable text using OCR (FineReader 5.0) failed: the pages are generally copies of carbon copies, so the characters are too blurred for OCR to be successful.
4 Summary
In summary, we have described advances in the development of a speech recognition system to be used in an MMIR environment for Dutch video archives, an effort which started in 1998. The work carried out on acoustic modelling was presented, and some language characteristics of Dutch, along with the related work
on language modelling, were addressed. Additional development tasks have been identified to improve the current performance level. The historical data of the ECHO project in particular turn out to impose many additional requirements. A first SDR demonstrator that uses the current speech recognition configuration can be viewed at http://dis.tpd.tno.nl/druid/public/demos.html.
References

1. D. Abberley, S. Renals, D. Ellis, and T. Robinson. The THISL SDR system at TREC-8. In Eighth Text Retrieval Conference, pages 699-706, Washington, 2000.
2. G. Adda, M. Jardino, and J. Gauvain. Language Modelling for Broadcast News Transcription. In Eurospeech'99, pages 1759-1762, Budapest, 1999.
3. M. Adda-Decker, G. Adda, and L. Lamel. Investigating text normalization and pronunciation variants for German broadcast transcription. In ICSLP'2000, pages 266-269, Beijing, 2000.
4. M. Adda-Decker, G. Adda, L. Lamel, and J.L. Gauvain. Developments in Large Vocabulary, Continuous Speech Recognition of German. In IEEE-ICASSP, pages 266-269, Atlanta, 1996.
5. M. Adda-Decker and L. Lamel. The Use of Lexica in Automatic Speech Recognition. In F. van Eynde and D. Gibbon, editors, Lexicon Development for Speech and Language Processing. Kluwer Academic, 2000.
6. A. P. J. van den Bosch. Learning to pronounce written words: A study in inductive language learning. Master's thesis, University of Maastricht, The Netherlands, 1997.
7. H. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic, 1994.
8. P. Clarkson and R. Rosenfeld. Statistical language modelling using the CMU-Cambridge toolkit. In Eurospeech-97, pages 2707-2710, 1997.
9. J.S. Garofolo, C.G.P. Auzanne, and E.M. Voorhees. The TREC SDR Track: A Success Story. In Eighth Text Retrieval Conference, pages 107-129, Washington, 2000.
10. J.L. Gauvain, L. Lamel, and M. Adda-Decker. Developments in Continuous Speech Dictation using the ARPA WSJ Task. In IEEE-ICASSP, pages 65-68, Detroit, 1995.
11. D. Hiemstra, F. de Jong, and K. Netter. Twente Workshop on Language Technology, TWLT 14: Language Technology in Multimedia Information Retrieval, 1998.
12. F. de Jong, J. Gauvain, D. Hiemstra, and K. Netter. Language-Based Multimedia Information Retrieval. In 6th RIAO Conference, Paris, 2000.
13. F. de Jong, J.L. Gauvain, J. den Hartog, and K. Netter. Olive: Speech based video retrieval. In CBMI'99, Toulouse, 1999.
14. T. Robinson, M. Hochberg, and S. Renals. The use of recurrent networks in continuous speech recognition. In C. H. Lee, K. K. Paliwal, and F. K. Soong, editors, Automatic Speech and Speaker Recognition - Advanced Topics, pages 233-258. Kluwer Academic Publishers, 1996.
15. Ronald Rosenfeld. Optimizing Lexical and N-gram Coverage Via Judicious Use of Linguistic Data. In Eurospeech-95, pages 1763-1766, 1995.
16. C. Wayne. Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation. In Language Resources and Evaluation Conference (LREC), pages 1487-1494, 2000.