Audio Coding for Mobile Multimedia Communications M. B. Sandler, P. E. Kudumakis and A. J. Magrath Signals Circuits and Systems Research Group Department of Electronic and Electrical Engineering King's College London, Strand, London WC2R 2LS, UK E-mail:
[email protected]
1 Introduction

Audio is perhaps the forgotten element of multimedia. If not true now, it was almost certainly true until recently, and audio is still often seen as the minor partner. This belies its real importance as the front-line means of communication between humans. Of course, speech has always been an important part of telecommunications, but considerations of quality have been secondary. This paper presents some early results from two projects at King's College which examine high quality, low bit rate coding of audio, both music and speech. One approach uses wavelets, which offer advantages over the polyphase filter banks in MPEG at low bit rates. The other approach uses the long established linear predictive coding technique, but in a modified guise, with orders significantly higher than 10. It also plans to use pattern recognition techniques in time-frequency space for added compression, and a novel implementation style for recursive filters which enables high orders to be used in re-synthesis. But audio is about much more than coding. Obviously it includes the capture and replay of audio signals (microphone and loudspeaker technology, amplification, ADC and DAC technology) and modifications to spectral distribution (filtering), but it also now includes aspects such as spatial reality and sound field synthesis. Thus we can hope to see high quality replay of audio from a mobile multimedia station to make the most of the coding effort. We can also expect enhanced teleconferencing facilities incorporating spatial separation of the various conferencees in the listener's environment. Also, of course, mobile multimedia does not necessarily involve mobile communications, though it will always involve data compression. Perhaps the mobile office of the near future will not only seek to keep the business-person in touch with business while travelling, but also to amuse him or her with replay of CDs, games and even internet radio.
One could envisage audio on demand services, similar to video on demand, and falling somewhere between digital radio and stored audio (CD, minidisk, etc.). This paper commences with an outline of the newer of the two projects, using high order LPC and then discusses the use of Wavelets within an MPEG coding framework.
2 High Quality Speech Coding

The aim of speech coding is to transmit and store speech in order to conserve bandwidth (or bitrate) while achieving the highest possible voice quality. Applications include mobile and network communications, audio/video teleconferencing, secure voice transmission, digital storage systems, voice mail and voice response systems.
© IEE 1996. Colloquium on "The Future of Mobile Multimedia Communications", Digest No: 1996/248, pp. 11.1-11.9, Savoy Place, London WC2R 0BL, UK, 6 Dec. 1996.
The majority of speech coding systems use variants on linear predictive coding (LPC) to derive parameters of a vocal tract filter and excitation source, which model the human vocal system [1]. These parameters are transmitted and used to resynthesise the speech at the receiver. The vocal tract filter is generally an all-pole (autoregressive) filter, which models the resonances, or formants, of the vocal tract. The formant frequencies vary with time as the shape of the vocal tract changes; therefore the filter is required to be adaptive. Most of the current standards use a speech bandwidth of 4 kHz, and the formants are represented by a filter of order 10 or 12.
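The standard route from a windowed speech frame to the all-pole filter coefficients is the Levinson-Durbin recursion on the frame autocorrelation. A minimal pure-Python sketch follows; the frame is synthetic (a decaying resonance plus a little noise) and the order and frame length are illustrative, not taken from any particular coder.

```python
import math
import random

def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one windowed speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations for a[0..order] (with a[0] = 1).
    The predictor is x[n] ~ -sum_{k>=1} a[k] * x[n-k]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err                      # reflection coefficient
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + k_m * a[m - k]
        new_a[m] = k_m
        a = new_a
        err *= (1.0 - k_m * k_m)              # residual prediction error
    return a, err

# Synthetic 'speech-like' frame: a decaying resonance plus low-level noise
# (frame length 160 matches the analysis parameters quoted in Figure 1).
rng = random.Random(0)
frame = [math.exp(-0.01 * n) * math.sin(0.3 * math.pi * n)
         + 0.05 * (rng.random() - 0.5) for n in range(160)]
order = 10
r = autocorrelation(frame, order)
a, err = levinson_durbin(r, order)
```

The denominator polynomial A(z) with these coefficients defines the all-pole vocal tract filter 1/A(z); the residual error err shrinks as the order rises, which is the quantitative face of the order-versus-accuracy trade-off discussed next.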
2.1 Higher Order Speech Models
For many applications, such as teleconferencing, there is an increasing demand for higher quality, natural speech, and this translates to coding the speech waveform with greater accuracy and extending the bandwidth to over 8 kHz. A crucial factor in the accuracy of the speech model is the order of the vocal tract filter. Figures 1 and 2 show the time-varying frequency responses of the vocal tract filters derived from 10th and 20th order LPC analysis of a short section of speech, using the autoregressive model [2] and a 4 kHz bandwidth. The parameters of the analysis are shown in the figures. Note that the higher order model extracts a greater number of formants from the speech waveform and distinguishes between adjacent formants with nearby frequencies, which in the low order model are represented as (inaccurate) single formants. For 8 kHz bandwidth, it is estimated that filtering in excess of 40th order will be required to represent the formants accurately. High order filters have other advantages. Firstly, the immunity of the coding algorithm to background noise (e.g. traffic noise) is increased [3, 4]. High order filters are also used in low-delay CELP coders [5] to extract the fundamental (pitch) periodicity of the excitation source by LPC analysis, so that no separate pitch coding is required. The generality of high-order analysis makes the technique amenable to non-speech signals, allowing the coding of music and data (e.g. modem) signals [5]. Another application of higher order LPC analysis is in the synthesis of musical instruments, for example drum and string sounds [6]. Before high order speech models can be successfully applied, however, there are two problem areas which the project aims to address.
2.2 Implementation of High Order Filters
The implementation of resonant high order recursive filters with fixed point arithmetic has significant problems, namely internal overflow and sensitivity to coefficient and wordlength quantization. A number of low sensitivity filter structures have been developed which are suitable for high order implementations, including parallel, cascade and lattice forms. Particularly good properties are obtained using a parallel interconnection of lower order subfilters (PIS) [7]. Future work will investigate the suitability of these structures for the implementation of the vocal tract filter. Since the filter is adaptive, it is necessary to update the filter coefficients regularly. An important area of research is therefore the development of efficient algorithms to obtain coefficients suitable for a particular filter structure from the original direct-form LPC coefficients.
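The motivation for such structures can be illustrated by realising one transfer function in two ways. The sketch below (Python, with illustrative resonant coefficients rather than a fitted vocal tract) compares a 4th order direct form against a cascade of two second-order sections: in infinite precision the outputs coincide exactly, and the structural difference only matters once coefficients and internal states are quantized.

```python
def direct_form_filter(b, a, x):
    """Direct Form I IIR filter; a[0] is assumed to be 1."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

def cascade_sections(sections, x):
    """Run the signal through second-order sections in series."""
    for b, a in sections:
        x = direct_form_filter(b, a, x)
    return x

def polymul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

# Two resonant biquads (poles well inside the unit circle, hence stable).
s1 = ([1.0, 0.0, 0.0], [1.0, -1.2, 0.72])
s2 = ([1.0, 0.0, 0.0], [1.0, -0.4, 0.85])
b4 = polymul(s1[0], s2[0])   # combined 4th-order numerator
a4 = polymul(s1[1], s2[1])   # combined 4th-order denominator

x = [1.0] + [0.0] * 99       # unit impulse
y_direct = direct_form_filter(b4, a4, x)
y_cascade = cascade_sections([s1, s2], x)
max_diff = max(abs(u - v) for u, v in zip(y_direct, y_cascade))
```

In double-precision floating point max_diff is negligible; the point of PIS and related structures is that with short fixed-point words the low order sections keep coefficient sensitivity and overflow growth under control where the direct form does not.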
2.3 Coding of Filter Parameters
The second problem of using a high order model is that a greater number of parameters needs to be coded for each adaptation step. A solution currently being investigated uses feature extraction techniques from image processing to identify patterns in the time-frequency representation of the formants (i.e. the curves in figures 1 and 2). By coding several consecutive adaptation steps as feature parameterisations, a significant reduction in data rate can be achieved.
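The flavour of this idea can be conveyed with a deliberately simplified, hypothetical stand-in: replace K per-frame values of one formant track with a two-parameter straight-line fit over the block. Real formant tracks are curves in time-frequency and the project's feature extraction is far more general, but the bookkeeping is the same.

```python
def line_fit(ys):
    """Least-squares line y = slope*t + intercept over t = 0..len(ys)-1."""
    n = len(ys)
    ts = list(range(n))
    tm = sum(ts) / n
    ym = sum(ys) / n
    slope = (sum((t - tm) * (y - ym) for t, y in zip(ts, ys))
             / sum((t - tm) ** 2 for t in ts))
    return slope, ym - slope * tm

# Hypothetical formant-frequency track (Hz) over 8 adaptation steps.
track = [820.0, 833.0, 847.0, 861.0, 874.0, 888.0, 902.0, 915.0]
slope, intercept = line_fit(track)
recon = [slope * t + intercept for t in range(len(track))]
max_err = max(abs(u - v) for u, v in zip(track, recon))
# 8 values -> 2 parameters: a 4:1 reduction for this block, at the cost
# of at most max_err Hz of frequency error.
```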
3 High Quality Music Coding

Compression of digital audio signals marks the second phase of the revolutionary changes to the audio world brought about by digital signal processing. About ten years ago, when the CD had just been introduced, the first proposals for digital data reduction systems were greeted with suspicion and disbelief. Almost nobody thought that digital compression was necessary, given the vast storage capacity of CD. Everybody agreed that it would be impossible to fulfill the audio quality requirements of the "golden ears" while removing 75% or more of the digital data. However, since 1988 collaborative work by an international committee of high-fidelity audio compression experts within the Moving Picture Experts Group (MPEG) has resulted in the first international standard for the digital compression of high-fidelity audio, the so-called MPEG-1 audio (coding of mono and stereo signals at sampling rates of 32, 44.1 and 48 kHz). The International Standards Organization and the International Electrotechnical Commission (ISO/IEC) adopted this standard at the end of 1992 [8]. Thus, the situation of digital compression for high-fidelity audio is now completely different. Owing to the widespread interest in transmitting, storing and retrieving CD-quality sound, high quality audio coding is an area attracting considerable interest within the DSP community. Current applications for high quality audio coding include Digital Audio Broadcasting (radio and TV), portable audio units (Digital Compact Cassette and MiniDisc), and transmission of coded sound over high speed networks. Forthcoming applications where audio compression is involved include interactive mobile multimedia communication, videophone, mobile audio-visual communication, multimedia electronic mail, electronic newspapers, interactive multimedia databases, and multimedia videotex. Thus, audio compression methods are included in almost every digital audio system.
In fact, in many cases digital audio is used at bitrates which are too low even by the standards of the inventors of audio coding systems. Most of this change is due to the work of the companies involved in MPEG audio: wide acceptance of the MPEG audio standard will permit manufacturers to produce and sell, at reasonable cost, large numbers of MPEG audio codecs and multimedia related products. Owing to this strong interest in high-quality low bit-rate audio coding, at the end of 1994 the MPEG committee also developed the MPEG-2 audio standard (backwards compatible coding of 5.1 multichannel audio, and low bit-rate coding of mono and stereo signals at half the sampling rates of MPEG-1 audio) [9]. Backwards compatibility (BC) means that an MPEG-1 audio decoder can decode two channels of an MPEG-2 stream, and that an MPEG-2 audio decoder can decode an MPEG-1 audio stream as if it were an MPEG-1 audio decoder. Today, however, the MPEG committee is focused on the next generation audio coding standard, the so-called MPEG-4. MPEG-4 will be the coding system for the multimedia applications of the future. It is designed to facilitate the growing interaction and overlap between the hitherto separate worlds of computing, electronic mass media (TV and radio) and telecommunications. International standard status for MPEG-4 is expected in November 1998 [10].
3.1 MPEG: Layer I, II and III
The MPEG committee chose to recommend three compression methods, named audio Layer I, II and III, which provide increasing quality/compression ratios with increasing complexity and demands on processing power. A wide range of trade-offs between codec complexity and compressed audio quality is thus offered by the three Layers. The reason for recommending three Layers was partly that the testers felt that none of the coders was 100% transparent for all material, and partly that the best coder (Layer III) was so compute intensive that it would seriously impact the acceptance of the standard.
3.2 MPEG: Non-Backwards Compatible (NBC) Audio Coding
The BC built into MPEG-2 audio is an important service feature for many applications, such as television broadcasting. This compatibility, however, entails a degree of quality penalty that other applications need not pay. Work in this area of MPEG audio has produced a non-backwards compatible (NBC) extension to MPEG-2; international standard status for this extension will be reached in April 1997. The NBC standard (Part 7 of MPEG-2) brings virtual transparency for single channel music down to 64 kbps, from the 128 kbps required by MPEG-1 audio. Interesting performance is expected even at bitrates below 64 kbps. Since MPEG-4 will not introduce new tools for coding of audio signals at 64 kbps and above, the NBC extension of MPEG-2 is already providing part of the MPEG-4 audio standard. More work, however, needs to be done in the bitrate range much lower than 64 kbps. This is an area where there is a need for a generic technology serving such different applications as satellite and cellular communications, mobile multimedia communications, the Internet, etc. Thus in this paper we present an alternative NBC approach to low bitrate audio coding, based on the wavelet packet algorithm.
3.3 Wavelet Packet Codec: An alternative approach to low bitrate audio coding
A wavelet packet (WP) scheme, in other words a 5-stage, 32-band uniform frequency subdivision subband coding scheme based on a tree-structured filterbank, has been designed and implemented [11]. The frame length was set equal to 1024 samples. Efficient signal compression results when subband signals are quantized with subband-specific bit allocation, based on the input power spectrum (via the fast Fourier transform, FFT) and a model of auditory perception. Thus, this filterbank is combined with dynamic bit allocation (DBA), based on psychoacoustic model 1 as adapted for use with MPEG Layer II [8]. In our experiments we have not included tonality, in order to keep the overall complexity of the codec low. This also ensures that the segmental signal to noise ratio (SSNR) is more valid, since masking phenomena have not been utilised; however, tonality would improve the compression further if included. Finally, while the dynamic bit allocation strategy exploits some of the characteristics of human hearing, further reduction in the bit rate requires removing the statistical redundancies of the signal. This is the ideal case for noiseless entropy (Huffman) coding, which has been embedded in the WP coder as in MPEG Layer III [8]. The advantages of the WP codec are twofold. First, the increasing quality/compression ratios that MPEG achieves with its three Layers of increasing complexity may be achieved with the WP codec simply by switching to different wavelet filters; the cost savings in software and hardware from using one scheme instead of three Layers, while keeping their features, are obviously tremendous. Second, the high coding gain of the WP algorithm ensures that the WP codec outperforms MPEG in the bitrate range of interest, namely well below 64 kbps. This can be observed from Figs. 4, 5 and 6 for three different music signals [12].
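The 5-stage tree-structured filterbank can be sketched with the 2-tap Haar pair, the shortest orthonormal wavelet filter; the actual codec uses longer wavelet filters, and the psychoacoustic bit allocation, quantization and Huffman stages are omitted here. Five binary splits of a 1024-sample frame give the 32 uniform bands, and the synthesis tree reconstructs the frame exactly.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)

def haar_split(x):
    """One 2-band analysis step (orthonormal Haar pair)."""
    lo = [(x[2 * i] + x[2 * i + 1]) * INV_SQRT2 for i in range(len(x) // 2)]
    hi = [(x[2 * i] - x[2 * i + 1]) * INV_SQRT2 for i in range(len(x) // 2)]
    return lo, hi

def haar_merge(lo, hi):
    """Inverse of haar_split (perfect reconstruction)."""
    x = []
    for l, h in zip(lo, hi):
        x.append((l + h) * INV_SQRT2)
        x.append((l - h) * INV_SQRT2)
    return x

def wp_analysis(x, stages):
    """Split every band at every stage: a uniform 2**stages-band subdivision."""
    bands = [x]
    for _ in range(stages):
        bands = [half for band in bands for half in haar_split(band)]
    return bands

def wp_synthesis(bands, stages):
    """Merge adjacent band pairs stage by stage back to one signal."""
    for _ in range(stages):
        bands = [haar_merge(bands[i], bands[i + 1])
                 for i in range(0, len(bands), 2)]
    return bands[0]

# 1024-sample frame, as in the codec described above.
frame = [math.sin(0.05 * n) + 0.3 * math.sin(0.4 * n) for n in range(1024)]
bands = wp_analysis(frame, 5)     # 32 bands of 32 samples each
rec = wp_synthesis(bands, 5)
err = max(abs(u - v) for u, v in zip(frame, rec))
```

In the real codec, compression comes from quantizing each of the 32 bands with a DBA-determined word length before entropy coding; the tree itself is lossless.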
3.4 Scalability and MPEG-4 functionalities
Perhaps the most significant feature which has not yet been fully addressed and embedded in audio coding systems, although well known in video coding technology, is the concept of scalability. Scalability is the property of a coded signal that part of the coded bitstream can be decoded in isolation. The fact that a subset of the coded bits is sufficient for generating a meaningful audio or video signal is important in at least two contexts. The first is where both cheap, simple and expensive, complex decoders are envisaged as receivers of the signal: it is as if both AM and FM radio quality are available in the same transmitted signal, dependent on an appropriate decoder. The second is when the transmission channel cannot guarantee the full bandwidth necessary to handle the complete bitstream, for example internet radio. There are several types of scalability: in terms of bandwidth; of number of channels; and, most important, the so-called SNR scalability [13]. Scalable systems currently require a higher bit rate to achieve the same quality as a single stage perceptual audio codec. Since scalability offers many advantages, a small performance penalty may be acceptable [14]. However, we have seen in [15] that using the wavelet packet approach this performance penalty becomes a performance advantage, particularly at low bit-rates. For example, we have compared MPEG and wavelet-based two-stage scalable coders and found that if a 64 kbps stream allocates 32 kbps to each of two stages, the overall SSNR is 22.10 dB for wavelets and 16.04 dB for MPEG. Even allocating all 64 kbps to MPEG (i.e. single stage standard MPEG) only achieves 22.32 dB SSNR. At 32 kbps an SSNR of 18.72 dB is achieved with wavelets, performance superior to that of a scalable MPEG coder with double the number of bits [15]. Progress on scalability can be expected if a true integration of speech and sound coding can be achieved. In a scalable codec a reasonable specification requires the inner layer to provide 3.5 kHz of audio bandwidth at bit-rates ranging from 3 to 16 kbps. An additional scalability layer may result in an intermediate audio bandwidth of 7 to 11 kHz at 16 to 40 kbps. Finally, an additional high quality layer operating at bit-rates in excess of 100 kbps per channel would be useful for studio applications [14]. Such a range of performance was one of the justifications for examining in [16] the performance of various wavelet filter families at both low (Fs = 8 kHz) and high (Fs = 48 kHz) sampling frequencies.
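The SSNR figure used throughout these comparisons is straightforward to state in code: the SNR is computed over short frames and averaged in the dB domain. A minimal sketch follows; the frame length and the test signals (a sinusoid with a small deterministic perturbation standing in for coding noise) are illustrative.

```python
import math

def ssnr_db(original, decoded, frame_len=256):
    """Segmental SNR: mean per-frame SNR in dB over frames with
    non-zero signal energy and non-zero error."""
    snrs = []
    for start in range(0, len(original) - frame_len + 1, frame_len):
        sig = sum(original[start + i] ** 2 for i in range(frame_len))
        noise = sum((original[start + i] - decoded[start + i]) ** 2
                    for i in range(frame_len))
        if sig > 0.0 and noise > 0.0:
            snrs.append(10.0 * math.log10(sig / noise))
    return sum(snrs) / len(snrs)

x = [math.sin(0.07 * n) for n in range(2048)]
# Simulate low-level coding noise as a small deterministic perturbation.
y = [v + 0.001 * math.sin(1.3 * n) for n, v in enumerate(x)]
ssnr = ssnr_db(x, y)
```

Averaging in dB, rather than pooling energies globally, prevents loud passages from masking poor performance in quiet ones, which is why SSNR is preferred to plain SNR for coded audio.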
This performance evaluation has also shown the limits of a bandwidth scalable wavelet packet based codec. Transparency (CD sound quality) can be obtained at 24 kbps with Fs = 8 kHz, and at 64 kbps with Fs = 48 kHz. All perceptual scalable audio codecs today are based on a 1024-band MDCT filterbank, as used for MPEG-2 NBC coding [14]. This also justifies our decision, at an early stage of this research, on the wavelet packet transform length used, instead of the 384 samples used by Layer I and the 1152 samples used by Layer II of MPEG-1 and MPEG-2 BC. Since MPEG-4 audio will contain provisions to transmit synthetic speech and audio at very low bit-rates, scalability should also be considered in such systems. At very low rates in speech coding, synthetic reproduction of speech is used; so far nothing has been presented for the transmission of music. MIDI operates at 32 kbps, a rate at which true audio codecs can already operate quite well. Therefore synthetic audio must be kept in mind for later designs [14]. However, we have shown that almost transparent (AM sound quality) coding can be achieved with wavelet packet based audio compression systems even at bit rates as low as 32 kbps (Fs = 48 kHz), and that in comparison to the MPEG audio standards, wavelet packet based systems give better sound quality than Layer I and are competitive with Layer II [17]. Some ways to generate synthetic audio using the wavelet packet model have also been presented in [18], e.g. reconstructing the signal from its scalefactors, or random number based algorithmic composition. A simple software version of this interactive random number wavelet packet based compositional algorithm, able to work with several different audio formats, can be found on our world wide web site [19]. We conclude this discussion of scalability with some thoughts motivated by our detailed study of 4-tap wavelet filters and their regularity (see Fig. 3) [20, 21].
There are three types of 4-tap wavelet filter in terms of scalable SSNR performance and complexity, as clarified in Table 1. Note that the SSNR values depend on the bit-rate and signal characteristics, while in terms of complexity we could use 2+2, 2+4 or 4+4 tap filters for the two scalability stages, corresponding to various SSNR performances depending on the selection of the angle which determines the wavelet filter coefficients. Of course, this technique could be extended using wavelet filters longer than 4 taps, with a larger number of filter selections [22]. Current research on scalability is focused on encoding of mono signals only. Scalable multi-channel, stereo and other MPEG-4 functionalities, for example support for pitch/time-scale change and editability/mixing on the compressed bitstream, are areas which require further research. Wavelets are particularly suited to scalable coding because they inherently embody bandwidth scalability. At low bit rates the results of this paper have already demonstrated that wavelets compete strongly with current techniques ([11], [15], [17], [20], [21] and [22]). Wavelets are also very useful in multimedia applications where pitch and/or time changes are necessary (e.g. alongside slowed-down or speeded-up video). Also of interest is the plan that MPEG-4 will include synthetic music as one of its functionalities: here again wavelets are ideally suited, as [18] and [23] demonstrate. What remains is to demonstrate the maturity of wavelet strategies and their readiness for exploitation in digital broadcasting, multimedia, digital hi-fi audio and speech coding/telephony, and in particular to cover speech, `AM', `FM' and `CD' quality music under one unified coding scheme.
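For reference, a single-angle family of 4-tap orthonormal lowpass filters can be sketched as below. This is the Pollen-type parametrisation, which may differ by convention from the one used in [20]-[22], but it illustrates how one angle selects the filter: θ = 0 degenerates to the 2-tap Haar pair (the outer taps vanish), while θ = 60° gives the Daubechies 4-tap filter.

```python
import math

def fourtap_lowpass(theta):
    """4-tap orthonormal lowpass filter for angle theta (radians),
    Pollen-type parametrisation; one vanishing moment for every theta."""
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 / (2.0 * math.sqrt(2.0))
    return [k * (1 - c + s), k * (1 + c + s),
            k * (1 + c - s), k * (1 - c - s)]

def is_orthonormal(h, tol=1e-12):
    """Check unit energy and orthogonality to the double shift."""
    unit = abs(sum(v * v for v in h) - 1.0) < tol
    shift2 = abs(h[0] * h[2] + h[1] * h[3]) < tol
    return unit and shift2

h_d4 = fourtap_lowpass(math.radians(60))  # Daubechies 4-tap coefficients
h_haar = fourtap_lowpass(0.0)             # outer taps vanish: Haar pair
```

That the taps sum to √2 and the double-shift inner product vanishes for every θ can be verified algebraically, which is what makes a codec that switches filters by transmitting a single angle attractive for scalability.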
4 Conclusion

This paper seeks to highlight the growing importance of audio research to the fast-moving field of mobile multimedia. Much of the paper has dealt with two specific projects aimed at high quality coding of audio. It is already demonstrably true that wavelets offer significant advantages over the DCT, MDCT and polyphase filter-banks in MPEG, and that the quality at low bit rates is good. The work seeks to adapt scalable coding to audio. It is expected, though not yet proven, that the other technique, using high order LPC and high order recursive filters, will offer at least similar quality-compression performance, and may even exceed the performance of wavelets, since the resonant model on which it relies is physically motivated. Later stages of the work will seek to combine the two approaches into a scalable coder, whose different levels use different coding strategies.
Acknowledgement

The authors would like to acknowledge the support of EPSRC grant nos. GR/L 15272 and GR/L 21914.
References

[1] J.D. Markel and A.H. Gray, Jr. Linear Prediction of Speech. Springer-Verlag, New York, 1976.
[2] D. O'Shaughnessy. Speech Communication: Human and Machine. Addison-Wesley Series in Electrical Engineering: Digital Signal Processing, 1987.
[3] A.A. Wrench and C.F.N. Cowan. A new approach to noise-robust LPC. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 305-307, 1987.
[4] Y.J. Liu. A robust 400-bps speech coder against background noise. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 601-604, 1991.
[5] J-H. Chen, R.V. Cox, Y-C. Lin, N. Jayant, and M.J. Melchner. A low-delay CELP coder for the CCITT 16 kb/s speech coding standard. IEEE Journal on Selected Areas in Communications, 10(5), June 1992.
[6] M.B. Sandler. Analysis and synthesis of atonal percussion using high order linear predictive coding. Applied Acoustics, pages 247-264, 1990.
[7] M. Price. Hybrid Structures for High Order Recursive Filters. PhD thesis, King's College, University of London, March 1996.
[8] ISO/IEC 11172-3, "Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 3: Audio", August 1993.
[9] ISO/IEC 13818-3, "Information technology - Generic coding of moving pictures and associated audio - Part 3: Audio", November 1994.
[10] Leonardo Chiariglione, "MPEG and multimedia communications", http://www.cselt.stet.it/ufv/leonardo/paper/isce96.htm
[11] Panos Kudumakis and Mark Sandler, "On the Performance of Wavelets for Low Bit Rate Coding of Audio Signals", in Proc. IEEE ICASSP'95, Vol. 5, pp. 3087-3090, Detroit (MI), U.S.A., May 8-12, 1995.
[12] Eveleigh and Wilkinson, "GUD Toons (Good Tunes)", in Kunta Kinti, The Future Music CD Vol. 4, demo no. 6, in Future Music Magazine No. 14, Dec. 1993.
[13] Karlheinz Brandenburg and Bernhard Grill, "First Ideas on Scalable Audio Coding", 97th AES Convention, San Francisco, Nov. 10-13, 1994.
[14] Bernhard Grill and Karlheinz Brandenburg, "A Two- or Three-Stage Bit Rate Scalable Audio Coding System", 99th AES Convention, New York, Oct. 6-9, 1995.
[15] Panos Kudumakis and Mark Sandler, "Wavelet Packet Based Scalable Audio Coding", in Proc. IEEE ISCAS'96, Vol. 2, pp. 41-44, Atlanta, May 1996.
[16] Panos Kudumakis, "Synthesis and Coding of Audio Signals using Wavelet Transforms for Multimedia Applications", PhD thesis, King's College, University of London, 1997.
[17] Panos Kudumakis and Mark Sandler, "Wavelets, Regularity, Complexity and MPEG-Audio", 99th AES Convention, New York, Oct. 6-9, 1995. Preprint 4048; J. Audio Eng. Soc. (Abstracts), Vol. 43, No. 12, p. 1072, Dec. 1995.
[18] Panos Kudumakis and Mark Sandler, "Synthesis/Coding of Audio Signals based on the Inverse Wavelet Packet Algorithm", in Proc. UK Symposium on Applications of Time-Frequency and Time-Scale Methods (TFTS'95), pp. 92-99, Univ. of Warwick, Coventry, U.K., Aug. 30-31, 1995.
[19] Panos Kudumakis: World Wide Web, http://www.eee.kcl.ac.uk/member/pgrads/p kudumakis.html
[20] Panos Kudumakis and Mark Sandler, "On The Compression Obtainable With 4-Tap Wavelets", IEEE Signal Processing Letters, Vol. 3, No. 8, pp. 231-233, August 1996.
[21] Panos Kudumakis, Tryphon Lambrou, Alfred Linney and Mark Sandler, "On the prediction of 4-tap wavelets coding gain", TFTS'97, Univ. of Warwick, Coventry, UK, Aug. 27-29, 1997.
[22] Panos Kudumakis and Mark Sandler, "On the usage of short wavelets for scalable audio coding", SPIE'97, San Diego, CA, USA, 27 July - 1 Aug. 1997.
[23] Panos Kudumakis and Mark Sandler, "Synthesis of Audio Signals Using the Wavelet Transform", IEE Colloquium on "Audio DSP - Circuits and Systems", 16 Nov. 1993, Ref. No: 1993/219.
Figure 1: Variation in Frequency Response of 10th order Vocal Tract Filter Model (Order = 10, Hamming window, Frame length = 160, Overlap = 140; frequency axis 0-4000 Hz against time frame).
Figure 2: Variation in Frequency Response of 20th order Vocal Tract Filter Model (Order = 20, Hamming window, Frame length = 160, Overlap = 140; frequency axis 0-4000 Hz against time frame).
Figure 3: The performance of 4-tap wavelet filters (32-band uniform subdivision, DBA, Huffman coding, Fs = 48 kHz), in terms of normalised SSNR and in comparison to their regularity, for low (32 kbps) and high (192 kbps) bit-rates and for three different music signals: ROCK (8 sec), POP (6 sec), CLASSIC (28 sec).

Figure 4: Bit-rate vs SSNR for the ROCK music signal.

Figure 5: Bit-rate vs SSNR for the POP music signal.

Figure 6: Bit-rate vs SSNR for the CLASSIC music signal.

Taps  SSNR (dB)  Reg.   Angle (degrees)
2     9          0      (135, 225, 315)
2     16         0      (165, 285)
4     18         0.55   (45)

Table 1: Complexity and SSNR scalability using wavelet filters with N = 2 vanishing moments. The SSNR values are given approximately, at 32 kbps with Fs = 48 kHz, for the ROCK (8 sec) music signal.