The Influence of Audio Compression on Speech Recognition Systems

Paulo Sirum Ng, Ivandro Sanches
Genius Institute of Technology, Manaus, Brazil
Computer Science Department, Federal University of Amazonas, Manaus, Brazil
{psirum,isanches}@genius.org.br
Abstract

A large amount of disk space is needed to store the increasing volume of speech data that is becoming available for most languages, whether through data logging on the application side or through speech data acquisition for large databases. One way to reduce the disk space required is to compress the speech data using modern perceptual audio coding techniques such as MPEG Layer-3 (also known as MP3) or Dolby AC-3. In this article, the performance of a speech recognizer for Brazilian Portuguese is evaluated using acoustic models trained with audio data coded with an MP3 codec at four different bitrates (16 kbps, 24 kbps, 32 kbps and 64 kbps), in order to assess the influence of this audio compression technique when applied to speech data. The word accuracy rates are compared with those obtained when training acoustic models on the original PCM recordings at 16 kHz and 16 bits per sample. Our experiments indicate that at audio bitrates of 32 kbps or higher the word accuracy rates show very little degradation in certain contexts, which means up to 8 times less disk space is needed.
1. Introduction

Most statistical speech recognition approaches use speech data to build statistical models that represent the acoustics and/or the phonetics of a specific language. The amount of speech data needed depends on the constraints of the application, such as vocabulary size, speaker (in)dependency, speaking style, grammatical constraints and operating conditions. It is commonly accepted in the speech recognition community that Automatic Speech Recognition (ASR) systems become more robust and accurate when more speech data, especially field data, is available for acoustic/phonetic modeling. Companies and institutes can purchase or build speech databases, or acquire them by data logging in the field. A direct consequence is the need for a large amount of disk space to store all this speech data. One way to address this issue would be to compress the data losslessly, but lossless coding only achieves low compression rates. Another way is to use high-quality perceptual audio compression techniques such as MP3. Although
the reconstructed signal is not exactly identical to the original audio data, these techniques achieve much higher compression rates. Furthermore, a large amount of speech data is available on the Internet in MP3 format; for example, several news web sites make their material available for download. In this context, automatic transcription of broadcast news and audio indexing are among the applications using speech recognition technology [1][2]. This paper investigates the influence of high-quality audio compression of the training speech data on speech recognition performance, by evaluating acoustic models trained on compressed speech data against uncompressed test speech data. Audio compression in general, and MP3 in particular, are briefly described in Section 2. The experiment that was carried out is described in Section 3 and its results are presented in Section 4. Finally, these results and some possible future experiments are discussed in Section 5.
2. Audio compression

Audio compression is one of the key technologies for many different applications such as Internet audio streaming, Digital Audio Broadcasting, Digital Television and portable audio devices. One of its purposes is to reduce storage space requirements for audio data [3] by producing a minimal digital representation that, when decoded, is perceptually very similar to the original audio. To that end, many high-quality low-bitrate audio coding schemes such as MP3, MPEG-4 AAC, Ogg Vorbis and Dolby AC-3 were developed. They exploit the characteristics of human sound perception to achieve high compression rates. More specifically, their efficiency comes from exploiting signal redundancies and irrelevancies in the frequency domain based on a model of the human auditory system [3]. These coding schemes are known as perceptual audio coders because they use psychoacoustics to reach the targeted compression efficiency. A basic perceptual audio coding system consists of a filter bank block, a perceptual model block and a quantization/coding block. The audio data is first decomposed into spectral components by an analysis filter bank. The perceptual model, following rules from psychoacoustics, gives an
estimate of the just-noticeable noise level for each subband of the filter bank. Then, in the quantization/coding stage, those spectral components are quantized and coded with the aim of keeping the quantization noise below the level given by the perceptual model while meeting the bitrate requirements [4][5]. MPEG Layer-3 was the audio compression scheme chosen for this study because it is an open standard (the specification is available to everyone for a fee) [5], it is a widely used and well-known format, and several hardware and software encoders and decoders are available. Besides, a large amount of Brazilian Portuguese speech data is available on the Internet in this format. MP3 supports many different sampling frequencies and a wide range of bitrates. More specifically, in this work MPEG-2 Layer 3 is applied to the original audio data, whose sampling frequency is 16 kHz. More technical details about MP3 compression can be found in [4][5].
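To make the three-block structure concrete, the following Python sketch shows how quantizer step sizes can be tied to a per-subband masking threshold. It is illustrative only: the masking model is a crude placeholder rather than the MPEG psychoacoustic model, and real codecs use a polyphase filter bank and/or MDCT rather than a plain FFT.

    import numpy as np

    def encode_frame(frame, n_subbands=32):
        # 1. Analysis filter bank, approximated here by an FFT split
        #    into equal-width subbands.
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        bands = np.array_split(spectrum, n_subbands)
        coded = []
        for band in bands:
            energy = np.mean(np.abs(band) ** 2) + 1e-12
            # 2. Perceptual model, reduced to a placeholder: assume the
            #    just-noticeable noise level sits a fixed 13 dB below the
            #    band energy (the real model accounts for tonality and
            #    masking spread between bands).
            threshold = energy / 10 ** (13 / 10.0)
            # 3. Quantization/coding: choose the step size so that the
            #    quantization noise power (about step**2 / 12) stays just
            #    below the masking threshold; entropy coding would follow.
            step = np.sqrt(12.0 * threshold)
            coded.append((step, np.round(band / step)))
        return coded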
3. Experiment description

The speech recognition system used in this experiment is based on the Hidden Markov Model Toolkit (HTK) version 3.2, distributed by the Cambridge University Engineering Department [6]. The acoustic models were generated with its training module and the evaluation was performed with its recognition module. The LAME MP3 tool (available at http://lame.sourceforge.net/) was used to compress the original audio data. One of the reasons for choosing this program is that it is an educational tool for learning about MP3 encoding [7].
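For reference, a batch compression step of this kind can be sketched in Python as follows. The file layout and names are illustrative; LAME's -b option selects a constant bitrate in kbps, and LAME produces an MPEG-2 Layer 3 stream automatically for 16 kHz input.

    import pathlib
    import subprocess

    BITRATES = [16, 24, 32, 64]  # kbps, as listed in Table 1

    for wav in pathlib.Path("corpus").glob("*.wav"):
        for kbps in BITRATES:
            mp3 = wav.parent / f"{wav.stem}_{kbps}.mp3"
            # -b selects a constant bitrate in kbps.
            subprocess.run(["lame", "-b", str(kbps), str(wav), str(mp3)],
                           check=True)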
Table 1: Coding schemes tested and their bitrates and bandwidths

Coding Scheme   Bit Rate (kbps)   Bandwidth (kHz)   Abbr.
PCM-Original    256               8                 Orig
MP3             64                8                 MP3-64
MP3             32                8                 MP3-32
MP3             24                6.2               MP3-24
MP3             16                4.4               MP3-16
The influence of MP3 compression at different bitrates in the spectral domain can be seen in Figure 1, which shows the spectrograms of an utterance (an 8-digit sequence: “9 5 2 0 4 7 3 6”) under the different coding schemes. The first spectrogram corresponds to the original PCM recording; the following spectrograms correspond to the MP3-64, MP3-32, MP3-24 and MP3-16 coding schemes, in this order. The impact of the bandwidth limitation at MP3-24 and MP3-16 is noticeable in the spectrograms.
3.1. Training speech data

The speech database used for training the acoustic models consists of 114 adult speaker sessions totaling about 55 hours of Brazilian Portuguese single-channel close-talk PCM recordings at 16 kHz and 16 bits per sample. The utterances consist mostly of prompted phonetically balanced sentences, lists of commands, numbers and names. The lexical pronunciations were initially generated using a grapheme-to-phoneme engine developed at the Genius Institute of Technology and then manually verified. In total, the lexicon contained approximately 28,000 words. The audio data was then compressed using the LAME MP3 encoder at four different bitrates: 16 kbps, 24 kbps, 32 kbps and 64 kbps. For the first two bitrates, the MP3 encoder limited the bandwidth to 4.4 kHz and 6.2 kHz respectively, as shown in the third column of Table 1, in order to prevent ringing and twinkling effects [7]. For each of those bitrates, and for the original coding scheme, an abbreviation code is given in the fourth column of Table 1; these abbreviations are used throughout the article.
Figure 1. Spectrograms of an utterance at different coding schemes
Also, the influence of MP3 compression at different bitrates on storage space requirements can be seen in Table 2. The second column shows the disk space needed to store the training speech data under each coding scheme, and the third column shows the disk space ratio between the original PCM coding scheme and each MP3 coding scheme.

Table 2: Training speech data storage space requirements for the different coding schemes

Coding Scheme   Disk Space (GBytes)   Ratio
Orig            5.75                  1
MP3-64          1.55                  3.71
MP3-32          0.76                  7.60
MP3-24          0.59                  9.75
MP3-16          0.41                  14.02
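The measured ratios track the nominal ones: the original stream occupies 16,000 samples/s × 16 bits = 256 kbps, so the nominal ratio is simply 256 divided by the MP3 bitrate. The measured ratios are slightly lower, presumably due to rounding of the reported sizes and per-file overhead. A quick check in Python:

    PCM_KBPS = 16000 * 16 / 1000.0  # = 256 kbps for 16 kHz, 16-bit mono PCM
    ORIG_GB = 5.75                  # measured size of the PCM corpus (Table 2)
    measured_gb = {64: 1.55, 32: 0.76, 24: 0.59, 16: 0.41}

    for kbps, size_gb in measured_gb.items():
        print(f"MP3-{kbps}: nominal {PCM_KBPS / kbps:.2f}x, "
              f"measured {ORIG_GB / size_gb:.2f}x")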
3.2. Feature extraction

The feature vectors consisted of 13 Perceptual Linear Prediction (PLP) coefficients plus their delta and acceleration coefficients. As the HTK front-end cannot handle MP3 audio files directly, each file was decoded into a temporary file from which the feature vectors were then extracted.
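A minimal sketch of this decode-then-extract step is shown below, assuming LAME's --decode mode and HTK's HCopy tool. The HTK configuration values are plausible placeholders, not necessarily the exact settings used in this work; a 39-dimensional PLP_0_D_A parameterization would match 13 static coefficients plus deltas and accelerations.

    import os
    import subprocess
    import tempfile

    # Plausible HTK front-end configuration (placeholder values):
    # PLP_0_D_A = 12 cepstra + c0, with deltas and accelerations (39 dims).
    HTK_CONFIG = """\
    SOURCEFORMAT = WAV
    TARGETKIND   = PLP_0_D_A
    NUMCEPS      = 12
    """

    def extract_features(mp3_path, plp_path, config_path="plp.cfg"):
        with open(config_path, "w") as f:
            f.write(HTK_CONFIG)
        fd, wav_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        try:
            # Decode the MP3 back to a temporary PCM WAV file.
            subprocess.run(["lame", "--decode", mp3_path, wav_path],
                           check=True)
            # Apply the HTK front-end and write the feature file.
            subprocess.run(["HCopy", "-C", config_path, wav_path, plp_path],
                           check=True)
        finally:
            os.remove(wav_path)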
3.3. Acoustic models

Speaker-independent acoustic models were trained on the training speech data described in Section 3.1. The phones were modeled with the following features (a prototype definition is sketched after this list):

• 3 emitting states per model
• left-to-right HMMs without skips
• context-dependent phones
• state tying based on decision tree clustering
• mixture of 12 Gaussians per state
• diagonal covariance matrices
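As an illustration of this topology, the Python sketch below writes an HTK-style prototype definition for a 3-emitting-state, left-to-right HMM without skips; states 1 and 5 are the non-emitting entry and exit states. The 39-dimensional PLP parameterization and the initial transition probabilities are assumptions, and the 12-Gaussian mixtures would be obtained later by mixture splitting during training.

    N_DIM = 39  # 13 PLP coefficients + deltas + accelerations (assumed)

    def proto_hmm(name="proto"):
        zeros = " ".join(["0.0"] * N_DIM)   # initial means
        ones = " ".join(["1.0"] * N_DIM)    # initial diagonal variances
        state = (f"<State> {{i}} <Mean> {N_DIM}\n {zeros}\n"
                 f"<Variance> {N_DIM}\n {ones}\n")
        # 5 states in total: 1 and 5 are non-emitting; each emitting
        # state either loops or moves one state to the right (no skips).
        trans = ("<TransP> 5\n"
                 " 0.0 1.0 0.0 0.0 0.0\n"
                 " 0.0 0.6 0.4 0.0 0.0\n"
                 " 0.0 0.0 0.6 0.4 0.0\n"
                 " 0.0 0.0 0.0 0.6 0.4\n"
                 " 0.0 0.0 0.0 0.0 0.0\n")
        body = "".join(state.format(i=i) for i in (2, 3, 4))
        return (f"~o <VecSize> {N_DIM} <PLP_0_D_A>\n"
                f"~h \"{name}\"\n<BeginHMM>\n<NumStates> 5\n"
                f"{body}{trans}<EndHMM>\n")

    print(proto_hmm())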
A training procedure was established and five sets of acoustic models were produced. The first set was trained on the original PCM recordings and was taken as the baseline for comparison purposes. The other four sets were trained on the data encoded with each of the MP3 compression schemes listed in Table 1.
3.4. Test set

As the purpose of this experiment is to investigate the influence of audio compression on ASR systems, the test set is based on task grammars in extended Backus-Naur Form (EBNF). The test set consists of six different tasks divided into four types of test: Command&Control, Connected Digits, Spelling and Names. The speech recognizer performance was measured with three different Command&Control tasks, a free-length digit sequence task, a free-length letter sequence task and a name-list task. They are listed in Table 3, whose third column gives the number of recordings in each test and whose fourth column gives the total number of words in those recordings.
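As an illustration, a free-length digit task like CDIG can be written in HTK's EBNF-style grammar notation roughly as follows, where the digit word names are hypothetical lexicon entries and angle brackets denote one or more repetitions; HTK's HParse tool expands such a grammar into a recognition network:

    $digit = UM | DOIS | TRES | QUATRO | CINCO |
             SEIS | SETE | OITO | NOVE | ZERO;
    ( SENT-START <$digit> SENT-END )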
Table 3: Tasks used to evaluate the speech recognizer performance and their total number of utterances and words

Type                Tasks   # utterances   # words
Command & Control   CC1     902            2384
                    CC2     2496           7128
                    CC3     514            890
Connected Digits    CDIG    240            1245
Spelling            SPEL    120            837
Names               NAM     344            793
All the audio files in the test set are single-channel close-talk PCM recordings at 16 kHz and 16 bits per sample.
4. Experiment results

Table 4 gives the word accuracy rates for all tasks listed in Table 3 and for each acoustic model set described in Section 3.3.

Table 4: Word accuracy (%) of the speech recognizer using different acoustic model sets

        Acc(%) with acoustic models trained at
Tasks   MP3-16   MP3-24   MP3-32   MP3-64   Orig
CC1     95.3     95.5     95.6     95.3     95.8
CC2     80.7     82.2     82.9     83.9     84.5
CC3     93.4     96.6     97.8     98.3     98.8
CDIG    93.9     97.3     98.7     98.5     98.7
SPEL    59.7     73.7     76.5     77.5     80.5
NAM     95.3     97.9     98.7     99.2     99.4

The CC1, CC2 and CC3 tasks represent different command sets: the first contains commands for room automation, the second for car automation, and CC3 for controlling consumer electronic devices. The first two, CC1 and CC2, were recorded under environmental conditions different from those of the training speech data. The CC3, CDIG, SPEL and NAM test sets were recorded under environmental conditions similar to those of the training speech data.
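The word accuracy figures in Table 4 follow the usual HTK convention, Acc = (N − D − S − I) / N, where N is the number of words in the reference transcriptions and D, S and I are the deletion, substitution and insertion counts of the aligned recognizer output; in Python terms:

    def word_accuracy(n_ref, n_del, n_sub, n_ins):
        # HTK HResults convention: Acc = (N - D - S - I) / N, in percent.
        return 100.0 * (n_ref - n_del - n_sub - n_ins) / n_ref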
The performance of the baseline acoustic models is given in the last column of Table 4. The word accuracy rates are 95.8%, 84.5%, 98.8%, 98.7%, 80.5% and 99.4% for CC1, CC2, CC3, CDIG, SPEL and NAM tasks, respectively.
The second and third columns of Table 4 show the word accuracy rates obtained with the acoustic model sets trained on MP3-16 and MP3-24 encoded speech data, respectively. For both MP3-16 and MP3-24 there is a significant degradation of recognizer performance on almost all tasks. This is explained by the fact that, at those two bitrates, the MP3 encoder further limits the bandwidth, leading to a loss of useful spectral information (see Figure 1). For MP3 compression at 64 kbps, no significant differences in word accuracy were observed compared to the baseline results; likewise, on listening to some MP3-64 compressed audio files and examining the spectrogram (see Figure 1), no differences from the originals could be noticed. For MP3 compression at 32 kbps, unlike MP3-64, some spectral differences are noticeable in Figure 1, and differences in word accuracy were observed, but they were limited. The difference in word accuracy for the SPEL task was 4% absolute and 20% relative, but it should be taken into account that the SPEL task is the most difficult of all the tasks. Overall, in most cases the results correspond to the perceptual audio quality of MP3 compression at the different bitrates; in other words, MP3 coding at lower bitrates has more influence on recognizer performance. One possible reason is that MP3 compression of the training speech data creates a mismatch between training and testing conditions, making the recognition task more difficult: the lower the bitrate, the higher the mismatch. In addition, reducing the bitrate implies a loss of potentially useful information.
5. Conclusion and future work

The experimental results indicate that MP3 compression of the training speech data at bitrates of 32 kbps or higher has very little influence on speech recognizer performance, which is acceptable for most speech applications. This suggests that the training speech data could be compressed with the MP3 coding scheme for storage, and that acoustic models built from this compressed audio data would show very little degradation when the recognizer runs on uncompressed audio data, leading to a storage space saving of up to approximately 8 times. An interesting question is whether the performance of the 32 kbps models could even surpass the baseline if 8 times more data were used to train them than was used to train the baseline models. Future work will consist in evaluating the performance of the same speech recognizer on tasks using language models, and in testing other high-quality low-bitrate coding schemes such as MPEG-2 Advanced Audio Coding, which can provide higher compression rates while maintaining the same perceptual audio quality as MP3.
6. Acknowledgement

This research was supported by the Fund for Telecommunications Technology Development (FUNTTEL), project 01.02.0066.00, ref. 0421/02, from the Brazilian Government. The authors wish to thank Ana R. Bittencourt from the Genius Institute of Technology for verifying the lexical pronunciations.
7. References

[1] Barras, C., Lamel, L., and Gauvain, J.-L., “Automatic Transcription of Compressed Broadcast Audio”, Proceedings of ICASSP, Vol. I, Salt Lake City, United States, 2001.
[2] Besacier, L., et al., “The Effect of Speech and Audio Compression on Speech Recognition Performance”, IEEE Multimedia Signal Processing Workshop, Cannes, France, 2001.
[3] Fraunhofer Institut Integrierte Schaltungen, “Fraunhofer IIS - Audio & Multimedia - Basics about MPEG Perceptual Audio Coding”, available at: http://www.iis.fraunhofer.de/amm/techinf/basics.html, last accessed: 30 July 2004.
[4] Fraunhofer Institut Integrierte Schaltungen, “Fraunhofer IIS - Audio & Multimedia - MPEG Audio Layer-3”, available at: http://www.iis.fraunhofer.de/amm/techinf/layer3/index.html, last accessed: 30 July 2004.
[5] Brandenburg, K., “MP3 and AAC Explained”, Proceedings of the AES 17th International Conference on High Quality Audio Coding, Florence, Italy, 1999.
[6] Young, S., Woodland, P., et al., “The HTK Book (for HTK Version 3.2)”, Cambridge University Engineering Department, Cambridge, UK, 2002.
[7] “LAME Ain’t an MP3 Encoder”, http://lame.sourceforge.net/, last accessed: 30 July 2004.