Using Syllable-based Indexing Features and Language Models to Improve German Spoken Document Retrieval

Martha Larson, Stefan Eickeler
Fraunhofer Institute for Media Communication (IMK), Sankt Augustin, Germany
{larson,eickeler}@imk.fraunhofer.de
Abstract

Spoken document collections with high word-type/word-token ratios and heterogeneous audio continue to constitute a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used to generate syllable-based indexing features. Recognition is carried out with a 5k syllable language model and a 10k mixed-unit language model whose vocabulary consists of a mixture of words and syllables. Both language models make retrieval performance possible that is comparable to that attained when a large vocabulary word-based language model is used. Experiments are performed on a spoken document collection consisting of short German-language radio documentaries. First, the vector space model is applied to a known item retrieval task and a similar-document search. Then, the known item retrieval task is further explored with a Levenshtein-distance-based fuzzy word match.
1. Introduction

Research and development in spoken document retrieval is driven by the increasing demand for speech technology that makes spoken audio as comfortably searchable and browsable as text. Comprehensive access to spoken document collections requires the automatic generation of indexing features that are both semantically discriminating and reliable. The potential of syllables as indexing features for spoken document retrieval is intuitively evident for a syllable-based language like Chinese, and the discriminating capabilities of syllables have been established for Cantonese [1] and Mandarin [2]. Although German is not a syllable-based language, it does make extensive use of inflectional paradigms and of compounding. Semantically motivated word classes tend to be large in German, containing many distinct word forms, but these word forms share many syllables in common. Breaking a German word into syllables thus effects an implicit stemming and compound decomposition. In previous work we have demonstrated that syllables are useful indexing features in German and, on a binary classification task, are better indexing features than words [3]. In this paper we investigate the performance of syllable-based indexing features on two more challenging information retrieval tasks. First, we experiment with a 'known item' retrieval task, for which there is exactly one relevant document in the collection corresponding to each query. Second, we perform tests on a 'similar document search' task, for which each document is used itself as a query and is associated with a group of topically related documents in the
collection. We also address the question of whether syllable features are best generated by decomposing the output of a word-based speech recognizer or whether they can be generated directly using a syllable-based language model.

Although syllables are useful features for indexing, they are a mixed blessing when used as the base units for a speech recognition language model. Syllable models can generate sub-orthographic combinations that were not part of the training data. This ability helps to handle out-of-vocabulary words in the audio input, but it can also increase the size of the hypothesis space, leading to recognizer error. In general, it is not possible to attain the same recognition performance with a syllable-based language model as with a word-based model of the same order. Therefore we propose a mixed-unit language model, motivated by the observation that in running German text about 45% of the word tokens are already monosyllables. Raising this proportion slightly should introduce the benefits of syllable-based language models without incurring their disadvantages.

As a further measure, designed to introduce robustness against syllable recognition error into the system, we investigate the performance of a fuzzy word match on our known item task. Like approaches which introduce approximate match indexing terms into spoken documents [4], the fuzzy word match we propose here aims to exploit a priori knowledge about the nature of recognizer error to identify spoken documents relevant to the query. Our fuzzy word match rates each document by summing the scores of the best syllable sequence match for each query term and uses the document rating to generate a top-10 hit list which contains the relevant document.

In Section 2 we describe the Kalenderblatt collection of radio documentaries on which the experiments are performed. In Section 3 we introduce the speech recognition system and the word and syllable language models used in the experiments.
We present the mixed-unit language model: a model trained on a combination of syllables and words. In Section 4 we describe the information retrieval models used to perform the experiments: the vector space model and the fuzzy word match. In Section 5 we report spoken document retrieval results on the known item task and on the similar document search. Section 6 summarizes the conclusions.
2. The Kalenderblatt Collection of Radio Documentaries

The spoken document collection used for our experiments contains 263 short radio documentary programs from the Internet series Kalenderblatt (http://www.kalenderblatt.de) of the German radio broadcaster Deutsche Welle. Each program is about 5 minutes long, contains approximately 650 running words, and is produced both as a radio broadcast and as a text for Internet publication.

The spoken audio in the Kalenderblatt collection is heterogeneous. Music, sound effects and original sound footage are used to enliven the topics. Foreign-language interviews are overlaid with the voice of a German-language interpreter. The series treats topics of cultural, historical, political and social interest. This wide range of topics means the spoken audio has a highly varied vocabulary, and out-of-vocabulary (OOV) words are thus a significant issue for this domain. To provide an impression of the vocabulary richness of the Kalenderblatt collection, we calculated Herdan's constant [5], the ratio of the log of the number of types in a text to the log of the number of tokens, for several German corpora.
Corpus                          Herdan's constant
German Parliament Proceedings   0.75
German dpa newswire             0.74
Text from German newspapers     0.77
Kalenderblatt Documentaries     0.83

Table 1: Comparison of German language corpora

Preparation of the Kalenderblatt collection is described in detail in [6] and is sketched only briefly here. The original audio documents were in mono Real format (31.1 kb/s, 22.05 kHz) and were re-sampled to 16 kHz. The original texts were normalized and annotated with topics. The top level categories of the International Press Telecommunications Council (http://www.iptc.org) subject reference system, a total of 17 topic categories, were used to annotate the collection. Because the Kalenderblatt series aims to draw connections between ideas, eras and events, the documentaries do not fall neatly into a single topic class. For this reason, each of the radio programs was assigned up to three topic categories.
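As a quick illustration, Herdan's constant as defined above (log of the type count over log of the token count) can be computed in a few lines of Python; the toy sentence below is only a stand-in for a real corpus:

```python
import math

def herdan_c(tokens):
    """Herdan's constant: log(#types) / log(#tokens)."""
    types = set(tokens)
    return math.log(len(types)) / math.log(len(tokens))

# toy example; a real measurement would run over a full corpus
text = "der hund und die katze und der vogel".split()
c = herdan_c(text)  # 6 types, 8 tokens
```

Larger values indicate a richer vocabulary relative to text length, which is why the Kalenderblatt documentaries (0.83) stand out against the newswire and parliamentary corpora.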
3. Extracting features from spoken audio
3.1. Language models

To train our language models we chose a text corpus consisting of 64 million words from the German dpa newswire. Our baseline word-based language model was trained on the 91k most frequently occurring word types in the text corpus. To train the syllable language model, the transcription module of the Bonn Open Source Synthesis System (BOSSII) [9] was used to decompose the text corpus into syllables. Our baseline syllable model contained the 5k most frequent syllables in the resulting syllable corpus. Previous work showed that for syllable language models, recognition-rate improvement levels off at a syllable vocabulary of 5k syllables [10].

The mixed-unit language model was trained on a normalization of the training corpus containing both words and syllables. This normalization is generated by maintaining frequently occurring words in the corpus as words and decomposing infrequently occurring words into syllables. The procedure was developed in the work described in [10]. In the original training text, 46% of the word tokens consisted of a single syllable; in the mixed-unit normalization, 52% of the tokens were single syllables.

3.2. Decoding

The HMM-based speech recognition system we use for these experiments is built with the ISIP public domain speech recognition toolkit [7]. Cross-word triphones were trained on a set of 83 documentaries (ca. 7 hours) from the portion of the Kalenderblatt database not used for the retrieval experiments. We used the Bayesian Information Criterion [8] to segment the audio at the junctures between speakers (narrator and interviewees). These speaker segments were further divided into sub-speaker segments of maximum length 20 seconds by setting boundaries at detected points of silence.

To estimate the relative recognition performance of our language models, we prepared a small test set consisting of audio of known and consistent quality. The test set contained 13.4 minutes of music-free narrator speech drawn from the Kalenderblatt documentaries and thus represents an upper limit for recognition performance on the Kalenderblatt collection as a whole. In Table 2 we report the results of our recognition tests in terms of syllables, the largest unit common to all three language models.

Language model          Recognition accuracy
91k word 2-gram         75% syllable / 68% word
5k syllable 2-gram      69% syllable
5k syllable 3-gram      75% syllable
10k mixed-unit 2-gram   75% syllable
10k mixed-unit 3-gram   77% syllable

Table 2: Recognition rates for word, syllable and mixed-unit (syllables + words) language models

From the tests we concluded that the syllable recognition rate of a 91k word 2-gram language model can also be attained with a 5k syllable 3-gram model. The mixed-unit 3-gram model achieves a small improvement on this rate. These three models were chosen for use in the retrieval experiments. A 91k word 3-gram model was also trained, but was excluded from consideration because its decoding time exceeded 100x real time.
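The mixed-unit corpus normalization (keep frequent word types as words, syllabify the rest) can be sketched in a few lines. The `syllabify` argument below is a hypothetical stand-in for a real grapheme-to-syllable transcriber such as the BOSS module, and the vocabulary cutoff is a free parameter:

```python
from collections import Counter

def mixed_unit_normalize(corpus_tokens, syllabify, vocab_size=10000):
    """Rewrite a corpus so that frequent words stay whole and rare
    words are replaced by their syllable sequences."""
    freq = Counter(corpus_tokens)
    keep = {w for w, _ in freq.most_common(vocab_size)}
    out = []
    for w in corpus_tokens:
        if w in keep:
            out.append(w)
        else:
            out.extend(syllabify(w))  # rare word -> syllable tokens
    return out

# toy stand-in for a real syllabifier: split each word in half
toy_syllabify = lambda w: [w[:len(w) // 2], w[len(w) // 2:]]
```

A language model trained on the resulting token stream mixes word and syllable units; decomposing rare words is the mechanism behind the rise from 46% to 52% single-syllable tokens reported above.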
4. Information Retrieval Models

4.1. Vector Space Model

The vector space model [11] was used for the experiments on the known item retrieval task and on the similar document search. Each document and each query is represented as a vector whose components are weights for the indexing terms. We used the standard combination of term frequency and inverse document frequency (tf*idf) as term weights and the standard cosine distance as the distance metric. For each query, the vector space model ranks the entire collection with respect to relevance, according to distance from the query. For the known item task, the results are presented in terms of Average Inverse Rank (AIR):
AIR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}    (1)
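Concretely, AIR as defined in (1) averages the reciprocal rank of the single relevant document over all N queries; a minimal sketch:

```python
def average_inverse_rank(ranks):
    """AIR: mean of 1/rank_i over the N queries, where rank_i is the
    position of the (single) relevant document for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. relevant document ranked 1st, 2nd and 4th for three queries
air = average_inverse_rank([1, 2, 4])
```

An AIR of 1.0 means every query placed its known item at rank 1; lower values indicate the relevant document was ranked further down.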
For the similar document search task the returned ranking list reflects relevance to the topic of the query document. Topical relevance was assessed by whether at least one of the topic classes assigned to the query document matched one of the topic classes of the document that the system returned. The performance of the system is represented by a plot of the average precision at 11 standard recall levels [11].
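The retrieval setup described here (tf*idf term weights, cosine distance, with n-grams over words or syllables as indexing terms) can be sketched as follows. This is a minimal illustration of the standard formulation, not the exact weighting variant used in the experiments:

```python
import math
from collections import Counter

def ngrams(units, n):
    """Indexing terms as n-grams over a token sequence (e.g. syllable 2-grams)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

def tfidf_vectors(docs):
    """One sparse tf*idf weight dict per document; docs are lists of terms."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Ranking the collection for a query then amounts to sorting the documents by cosine similarity to the query vector.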
4.2. Fuzzy Word Match

Our fuzzy word match is a variant of the familiar Boolean model [11] using the AND-operator, which returns a document as a hit when that document contains all terms in the query. The fuzzy word match approach identifies the syllable sequence in each document that is the best match with each word in the query. The word match score is the Levenshtein distance, weighted with syllable match scores, between syllable sequences in the query word and in the document. The syllable match score is the Levenshtein distance between syllables, weighted with basic phonological information: substitutions between homorganic phonemes and between phonemes of similar sonority carry reduced penalties. The relevance score of a document is calculated by multiplying the scores for the best word matches of all query terms. The total document score thus reflects both query term occurrence and recognition quality.

Since the quality of the spoken realization of query terms in the documents does not affect document relevance in our known item task, ranking the documents with respect to document score is not directly meaningful. For this reason we report the results of the fuzzy word match in terms of the percentage of the queries for which the relevant document occurred in the top ten hits returned. The Boolean retrieval model, which also does not return a ranked list, provides an interesting exact term match to which we compare our fuzzy word match.

5. Experiments and Results

Experiments were first performed on the text transcripts of the documentaries in order to establish a baseline, and then on the output of the recognizer for each of the language models. For each task a variety of overlapping indexing features was tested.

5.1. Known Item Retrieval with Vector Space Model

Queries for the known item task are the titles of the 263 documentaries in the Kalenderblatt collection, which were removed from the text and the audio.

Indexing features         AIR (vector space model)
text words, 1-grams       0.87
text words, 2-grams       0.47
text syllables, 1-grams   0.85
text syllables, 2-grams   0.90
text syllables, 3-grams   0.76

Table 3: Baseline text retrieval performance on the known item task

The results of the baseline experiments on text are reported in Table 3. It is clear that syllable 2-grams are the best text indexing features. The results of the experiments on recognizer output are reported in Table 4. Syllable-level indexing features are built by first decomposing recognized words and then recombining the resulting syllables.

Language model          Indexing terms      AIR
91k word 2-gram         word 1-grams        0.60
                        word 2-grams        0.27
                        syllable 1-grams    0.60
                        syllable 2-grams    0.67
                        syllable 3-grams    0.51
5k syllable 3-gram      syllable 1-grams    0.59
                        syllable 2-grams    0.65
                        syllable 3-grams    0.46
10k mixed-unit 3-gram   syllable 1-grams    0.60
(words + syllables)     syllable 2-grams    0.66
                        syllable 3-grams    0.45

Table 4: Retrieval performance on spoken audio for the known item task

Syllable 2-grams demonstrate themselves to be the most effective indexing features for spoken audio, just as they were for text. The retrieval rate on the output of the large vocabulary word recognizer is slightly better than the retrieval rate on the recognizer output generated with the language models that include syllables. This difference suggests that although the word-based language model causes more syllable errors than the mixed-unit language model (see Table 2), the word-based model produces syllable bigrams that discriminate better for the known item task. These results support the conclusion that a 5k syllable model can be used in place of a 91k word model without a drastic reduction in retrieval performance.

5.2. Similar Document Search with Vector Space Model

In the similar document search, the system is presented with each document in the collection in turn and required to rank all other documents with respect to topic similarity. Baseline performance of the similar document search on text is depicted in Figure 1. The syllable 2-grams outperform words as indexing features.

[Figure 1 plots average precision against recall level for syllable 2-gram, word 1-gram and syllable 1-gram indexing features.]

Figure 1: Baseline text retrieval performance for similar document search

Similar document search experiments were carried out for spoken documents using syllable 2-gram indexing features generated from word-, syllable- and mixed-unit recognizer output. The performance was nearly identical for all three
outputs, with the mixed-unit output slightly ahead of the other two. For reasons of readability, only the mixed-unit case is depicted in Figure 2, which summarizes the performance of the similar document search task on spoken audio documents.

[Figure 2 plots average precision against recall level for syllable 2-gram, word 1-gram and syllable 1-gram indexing features on recognizer output.]
Figure 2: Retrieval performance on spoken audio for similar document search

Once again we see that syllable 2-grams as indexing features deliver the best retrieval performance. The performance of syllable 2-gram indexing features on spoken audio is comparable to that of word indexing features on text.

5.3. Known Item Retrieval with Fuzzy Word Match

The fuzzy word match is performed on the known item task; again there are 263 test queries. To have a point of reference with which to compare the results of the fuzzy word match, we performed an exact Boolean match between query and documents using the AND-operator. Table 5 shows baseline results on text, reported in terms of the percentage of queries for which the relevant document was among those returned in the exact match hit list. Recall, but not precision, is better on text that has been decomposed into syllables.
                 Exact match:           Average length of
                 % queries successful   exact match list
text words       36%                    0.6 documents
text syllables   56%                    3.6 documents

Table 5: Text baseline Boolean match

The fuzzy word match returns a document rating and not a relevant/non-relevant decision, and for this reason we restrict our assessment of the output to a hit list of reasonable length. Table 6 summarizes the results of the fuzzy word match in terms of the portion of the queries for which the relevant document was returned among the top-10 hits.
                     Exact match      Fuzzy match:
                     on syllables     % matched in top-10
speech words         32%              79%
speech syllables     26%              79%
speech mixed-units   28%              82%

Table 6: Fuzzy word match compared with percent success for syllable Boolean match for spoken audio

The output of the mixed-unit recognizer is best for the fuzzy match, mirroring its better syllable accuracy (Table 2).
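The exact Boolean AND baseline used for comparison reduces to a set-membership test; a minimal sketch of this unranked retrieval model:

```python
def boolean_and_match(query_terms, documents):
    """Boolean AND retrieval: a document is a hit iff it contains
    every query term. Returns indices of hits, unranked."""
    q = set(query_terms)
    return [i for i, doc in enumerate(documents) if q <= set(doc)]
```

A single recognition error in any query term eliminates a document from the hit list, which is why the exact match succeeds on so few queries over recognizer output.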
The fuzzy word match is clearly superior to the exact match and achieves a good retrieval rate on recognizer output.
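A stripped-down sketch of the fuzzy word match idea, using a plain (unweighted) Levenshtein distance in place of the phonologically weighted distances described in Section 4.2; the function names are illustrative:

```python
def levenshtein(a, b):
    """Plain edit distance between two sequences (characters or syllables)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def best_match_score(query_syllables, doc_syllables):
    """Best (lowest) distance between the query's syllable sequence and
    any same-length window of the document's syllable stream."""
    n = len(query_syllables)
    stop = max(1, len(doc_syllables) - n + 1)
    return min(levenshtein(query_syllables, doc_syllables[i:i + n])
               for i in range(stop))
```

A document score would then combine the best match scores over all query terms; the real system additionally reduces penalties for substitutions between homorganic phonemes and phonemes of similar sonority, so that typical recognizer confusions are punished less than arbitrary mismatches.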
6. Conclusion The results of this paper demonstrate that on rich-vocabulary domains such as German radio documentaries, syllable bigrams are better indexing features for spoken document retrieval than words. Moreover, retrieval performance with syllable bigram indexing features is nearly independent of whether the underlying language model is syllable-, word- or syllable-and-word-based. Both word- and syllable-based transcripts produced by the speech recognizer can be searched far more effectively with a fuzzy word match than with a standard Boolean match.
7. Acknowledgements

This work was made possible by the German Ministry of Education and Research (BMBF), which funded the project PiAVIda, within which this research was carried out. We would like to thank the Deutsche Welle for granting permission to use the audio and text material.
8. References

[1] Meng, H.M., Lo, W.K., Li, Y.C. and Ching, P.C., "Multiscale Audio Indexing for Chinese Spoken Document Retrieval," Proceedings of ICSLP, 2000.
[2] Chen, B., Wang, H. and Lee, L., "Discriminating Capabilities of Syllable-based Features and Approaches of Utilizing Them for Voice Retrieval of Speech Information in Mandarin Chinese," IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, 2002.
[3] Larson, M., Eickeler, S., Paaß, G., Leopold, E. and Kindermann, J., "Exploring Sub-word Features and Linear Support Vector Machines for German Spoken Document Classification," Proceedings of ICSLP, 2002.
[4] Ng, K., "Towards Robust Methods for Spoken Document Retrieval," Proceedings of ICSLP, 1998.
[5] Herdan, G., Quantitative Linguistics, Butterworths, 1964.
[6] Eickeler, S., Larson, M., Rüter, W. and Köhler, J., "Creation of an Annotated German Broadcast Speech Database for Spoken Document Retrieval," Proceedings of LREC, 2002.
[7] Ganapathiraju, A., Deshmukh, N., Zhao, J., Zhang, X., Wu, Y., Hamaker, J. and Picone, J., "The ISIP Public Domain Decoder for Large Vocabulary Conversational Speech Recognition," http://www.isip.msstate.edu, 1999.
[8] Tritschler, A. and Gopinath, R., "Improved Speaker Segmentation and Segments Clustering Using the Bayesian Information Criterion," Proceedings of Eurospeech, 1999.
[9] Stöber, K., Wagner, P., Helbit, J., Köster, S., Stall, D., Thomae, M., Blauert, J., Hess, W., Hoffmann, R. and Mangold, H., "Speech Synthesis by Multilevel Selection and Concatenation of Units from Large Speech Corpora," in W. Wahlster, ed., Verbmobil, Springer, 2000.
[10] Larson, M., Eickeler, S., Biatov, K. and Koehler, J., "Mixed-unit Language Models for German Language Automatic Speech Recognition," Proceedings of the 13th Konferenz Elektronische Sprachsignalverarbeitung (ESSV), 2002.
[11] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, Addison-Wesley, 1999.