SYNTHESIS AND CODING OF AUDIO SIGNALS USING WAVELET TRANSFORMS FOR MULTIMEDIA APPLICATIONS: AN OVERVIEW

Panos Kudumakis and Mark Sandler

Signals Circuits and Systems Research Group, Department of Electronic and Electrical Engineering, King's College London, Strand, London WC2R 2LS, UK
E-mail:
[email protected]

ABSTRACT

This paper examines the use of wavelet filters and wavelet-based algorithms for synthesis and coding of audio signals. In subband coding of non-stationary signals, such as sharp attacks, it is useful for the filterbank to use a logarithmic frequency subdivision scheme that approaches the critical bands of the human auditory system. However, for coding of stationary signals, a full-depth-tree decomposition maximizes coding gain, even if the decomposition does not mimic the human auditory filters. The wavelet transform (WT), or octave band decomposition, and the wavelet packet (WP), or uniform frequency subdivision scheme, fit well with these requirements. The performance of several different wavelet filter families is investigated in terms of vanishing moments, regularity and complexity, including for comparison a well known family of quadrature mirror filters (QMF). For the assessment of the coding gain of these wavelets, both wavelet transform and wavelet packet subband coding schemes have been evaluated, using both constant and dynamic bit allocation, with and without entropy-noiseless Huffman coding. The experimental work presented includes psychoacoustic modeling, bit allocation, quantisation and entropy coding, so the results are quite realistic. The performance of the wavelet-based codec schemes is compared to MPEG international industry standards in terms of compression versus quality, with particular attention paid to the compression-speed trade-off. The latter is achieved by a detailed examination of the shortest wavelet filters, namely those with 4 taps. In addition, an algorithm is proposed for the synthesis of musical instrument sounds and extended to algorithmic composition by using the inverse wavelet packet algorithm governed by MIDI-like parameters.

DSP UK'96 Tech. Conf., London, UK, 3-4 Dec. 1996.

1. OVERVIEW

Compared to fixed filterbanks, adaptive filterbanks may give better sound quality for low bit-rate audio coding, but at a price: a higher computational load and the long delay imposed by the criteria for adapting the filterbank to the input signal. As a compromise, we have focused in this paper on fixed filterbanks. Following a comparison of the octave wavelet transform and uniform wavelet packet algorithms [1], we chose the latter for the time-frequency mapping since, when short wavelet filters are used, it has a shorter delay than the octave and critical band based decompositions and than the polyphase filterbank of the MPEG-audio standard [2], [3].

Using the combination of a 32-band uniform wavelet packet decomposition with dynamic bit allocation (DBA) and Huffman coding, we evaluated the performance of various wavelet filter families at both high and low compression ratios. Although the range of segmental signal-to-noise ratio (SSNR) observed at low compression ratios is smaller than at high compression ratios, the relative performance of the various wavelet filter families remains unchanged. This can be seen in Fig. 1 and Fig. 2 for the signal in [4]. The orthogonal FIR "minimum phase wavelet filters family" (DAUB-A) and "less asymmetric family" (DAUB-B) have in general been shown to possess superior SSNR performance compared to the orthogonal IIR (SYMM-A, SYMM-B and SYMM-C) wavelet filters and the orthogonal (COIF-A) and biorthogonal (COIF-B and COIF-C) coiflets. There are some minor deviations from this general rule, e.g. one of the IIR families (SYMM-A) and Johnston's QMFs (JN-QMF) are marginally superior in terms of SSNR for medium filter lengths (12 to 16 taps). Fig. 1 and Fig. 2 clarify these results.

Using the music signal in [4] and a 20-tap "minimum phase wavelet", the combination of 32-band uniform wavelet packet decomposition with DBA and Huffman coding resulted in perceptually transparent quality at 24 kbps (Fs = 8 kHz, 3 bits/sample) for speech applications, with an SSNR of 22.87 dB, and at 64 kbps (Fs = 48 kHz, 1.33 bits/sample) for music related applications, with an SSNR of 25.48 dB.

The performance of 4-tap wavelet filters for low bit-rate audio coding was described and explained in terms of their regularity [5]. It is clear that regularity plays an important role in the performance of the 4-tap wavelet filters, since the coding gain depends significantly on it, as can be seen in Fig. 3. However, the improvement in coding gain diminishes when longer (and therefore more regular) wavelets are used, especially when the complexity trade-off is taken into account [2]. In addition, the performance evaluation of 4-tap wavelet filters suggested that almost perceptually transparent coding of monophonic signals can be achieved even at bit rates as low as 32 kbps when using the 4-tap Daubechies wavelets with the proposed codec, which results in better sound quality than the MPEG-audio standard [2]. In particular, the performance of the proposed codec is shown to be superior to MPEG-audio Layer I and competitive with Layer II at low bit rates. Using "minimum phase wavelet filters", an improvement of up to 8 dB SSNR over MPEG-audio Layer I, and of 4 dB over Layer II, is achieved for a variety of music signals at an output bit rate of 32 kbps. However, most of the improvement is due to the efficiency of the entropy-noiseless Huffman coding as the bit rate decreases, combined with the high coding gain of the wavelet packet algorithm, since both approaches use comparable psychoacoustic models [2].

Finally, we have compared wavelet filters (DAUB-A and DAUB-B) with conventional filters (JN-QMF and S&B [6]). The results indicate that no "best" or "worst" filter exists for low bit-rate subband audio coding, despite the fact that the filter families considered possess different properties.
However, statistically we can say that the "less asymmetric family" (DAUB-B) has some advantage. All these filter families are recommended for subband coding of audio signals. Their main difference lies in complexity [3].

2. SCALABILITY AND MPEG-4 FUNCTIONALITIES

Perhaps the most significant feature which has not yet been fully addressed and embedded in audio coding systems, although well known in video coding technology, is the concept of scalability. Scalability is the property of a coded signal that part of the coded bitstream can be decoded in isolation. The fact that a subset of the coded bits is sufficient for generating a meaningful audio or video signal is important in at least two contexts. The first is where both cheap, simple decoders and expensive, complex decoders are envisaged as receivers of the signal: it is as if both AM and FM radio quality were available in the same transmitted signal, depending on the decoder used. The second is where the transmission channel cannot guarantee the full bandwidth necessary to handle the complete bitstream, for example internet radio. There are several types of scalability: in terms of bandwidth, number of channels and, most importantly, the so-called SNR scalability [7].

Scalable systems currently require a higher bit rate to achieve the same quality as a single stage perceptual audio codec. Since scalability offers many advantages, a small performance penalty may be acceptable [8]. However, we have seen in [9] that with the wavelet packet approach this performance penalty becomes a performance advantage, particularly at low bit rates. For example, we have compared MPEG and wavelet-based two-stage scalable coders and found that, if a 64 kbps stream allocates 32 kbps to each of two stages, the overall SSNR is 22.10 dB for wavelets and 16.04 dB for MPEG. Even allocating all 64 kbps to MPEG (i.e. single stage standard MPEG) only achieves 22.32 dB SSNR. At 32 kbps an SSNR of 18.72 dB is achieved with wavelets, a performance which is superior to that of a scalable MPEG coder with double the number of bits [9].

Progress on scalability can be expected if a true integration of speech and sound coding can be achieved. In a scalable codec a reasonable specification requires the inner layer to provide 3.5 kHz of audio bandwidth at bit rates ranging from 3 to 16 kbps. An additional scalability layer may result in an intermediate audio bandwidth of 7 to 11 kHz at 16 to 40 kbps. Finally, an additional high quality layer operating at bit rates in excess of 100 kbps per channel would be useful for studio applications [8].
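To make the notion of SNR scalability concrete, the following is a minimal sketch of a two-stage coder in which the second stage quantises the error left by the first, so that the base layer can be decoded in isolation and the enhancement layer only refines it. It is a generic illustration, not the wavelet packet codec of [9] nor the MPEG scheme of [8]: the function names, the test tone, the quantiser step sizes and the 256-sample segment length used for the segmental SNR are all assumptions made for the example.

```python
import math

def ssnr_db(reference, decoded, segment=256):
    """Segmental SNR: mean of the per-segment SNRs in dB.

    The segment length is an assumption; the paper does not state one.
    """
    values = []
    for start in range(0, len(reference) - segment + 1, segment):
        ref = reference[start:start + segment]
        dec = decoded[start:start + segment]
        signal_energy = sum(r * r for r in ref)
        noise_energy = sum((r - d) ** 2 for r, d in zip(ref, dec))
        # Skip silent or perfectly reconstructed segments (log undefined).
        if signal_energy > 0.0 and noise_energy > 0.0:
            values.append(10.0 * math.log10(signal_energy / noise_energy))
    return sum(values) / len(values) if values else float("inf")

def quantise(samples, step):
    """Uniform quantiser returning integer indices."""
    return [round(s / step) for s in samples]

def dequantise(indices, step):
    return [i * step for i in indices]

def encode_two_stage(samples, base_step, enh_step):
    """Stage 1 codes the signal coarsely; stage 2 codes the stage-1 error."""
    base = quantise(samples, base_step)
    residual = [s - b for s, b in zip(samples, dequantise(base, base_step))]
    enhancement = quantise(residual, enh_step)
    return base, enhancement

def decode(base, base_step, enhancement=None, enh_step=None):
    """Decode the base layer alone, or refine it with the enhancement layer."""
    out = dequantise(base, base_step)
    if enhancement is not None:
        out = [o + e for o, e in zip(out, dequantise(enhancement, enh_step))]
    return out

# Hypothetical test signal: a 440 Hz tone sampled at 8 kHz.
fs = 8000
signal = [0.8 * math.sin(2 * math.pi * 440 * n / fs) for n in range(2048)]

base, enh = encode_two_stage(signal, base_step=0.25, enh_step=0.03125)
ssnr_base = ssnr_db(signal, decode(base, 0.25))
ssnr_full = ssnr_db(signal, decode(base, 0.25, enh, 0.03125))
print(f"base layer only: {ssnr_base:.2f} dB, both layers: {ssnr_full:.2f} dB")
```

A simple decoder would call `decode` with the base layer only, while a more capable one would pass both layers. In a real codec each layer would also be entropy coded and given its own share of the bit budget.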
The range of performance required of such a layered codec was one of the justifications for examining in [3] the performance of various wavelet filter families at both low (Fs = 8 kHz) and high (Fs = 48 kHz) sampling frequencies. This performance evaluation has also shown the limits of a bandwidth scalable wavelet packet based codec. Transparency (CD sound quality) can be obtained at 24 kbps with Fs = 8 kHz, and at 64 kbps with Fs = 48 kHz. All perceptual scalable audio codecs today are based on a 1024 band MDCT filterbank, as used for MPEG-2 NBC coding [8]. This also justifies the decision, taken at an early stage of this research, on the wavelet packet transform length used, instead of the 384 samples used by Layer I and the 1152 samples used by Layer II of MPEG-1 and MPEG-2 BC.

Since MPEG-4 audio will contain provisions to transmit synthetic speech and audio at very low bit rates, scalability should also be considered in such systems. For very low rates in speech coding, synthetic reproduction of speech is used. So far nothing has been presented for the transmission of music. MIDI operates at 32 kbps, which is a rate where true audio codecs can already operate quite well. Therefore synthetic audio must be kept in mind for later designs [8]. However, we have shown that almost transparent (AM sound quality) coding can be achieved with wavelet packet based audio compression systems even at bit rates as low as 32 kbps (Fs = 48 kHz), and that, in comparison to the MPEG-audio standards, wavelet packet based systems give better sound quality than Layer I and are competitive with Layer II. Some ways to generate synthetic audio using the wavelet packet model have also been presented in [10], e.g. reconstructing the signal from its scalefactors, or random number based algorithmic composition. A simple software version of this interactive random number wavelet packet based compositional algorithm, able to work in several different audio formats, can be found on our World Wide Web site [11].

We conclude this discussion of scalability with some thoughts motivated by our detailed study of 4-tap wavelet filters and their regularity (see Fig. 3). There are three types of 4-tap wavelet filters in terms of scalable SSNR performance and complexity, as summarised in Table 1. The SSNR values depend on the bit rate and the signal characteristics, while in terms of complexity we could use for the two scalability stages 2+2, 2+4 or 4+4 tap filters, corresponding to various SSNR performances depending on the selection of the angle which determines the wavelet filter coefficients. Of course, we could extend this technique to wavelet filters longer than 4 taps, with an infinite number of filter selections.

Angle (°)          Taps   SSNR (dB)   Reg.
45                 2      9           0
(135, 225, 315)    2      16          0
(165, 285)         4      18          0.55

Table 1. Complexity and SSNR scalability using wavelet filters with N=2 vanishing moments. The SSNR values are approximate and are given at 32 kbps with Fs = 48 kHz.

Current research on scalability is focused on the encoding of mono signals only. Scalable multichannel and stereo coding, and other MPEG-4 functionalities such as support for pitch/time-scale change and editability/mixing on the compressed bitstream, are areas which require further research.

Wavelets are particularly suited to scalable coding because they inherently embody bandwidth scalability. At low bit rates the results reported here have already demonstrated that wavelets compete strongly with current techniques ([1], [2], [5] and [9]). Wavelets are also very useful in multimedia applications where pitch and/or time changes are necessary (e.g. alongside slowed-down or speeded-up video). Also of interest is the plan that MPEG-4 will include synthetic music as one of its functionalities: here again wavelets are ideally suited, as [10] and [12] demonstrate. What is still missing, therefore, is a demonstration of the maturity of wavelet strategies and of their readiness for exploitation in digital broadcasting, multimedia, digital hi-fi audio and speech coding/telephony, and in particular of their ability to cover speech and "AM", "FM" and "CD" quality music under one unified coding scheme.

3. CONCLUSIONS

In this paper the performance of a variety of different wavelet filter families was investigated in terms of vanishing moments, regularity and complexity, and compared with a well known family of quadrature mirror filters (QMF). For the assessment of wavelet coding gain, both wavelet transform and wavelet packet subband coding schemes were evaluated, using both constant and dynamic bit allocation, with and without entropy-noiseless Huffman coding. The experimental work has included psychoacoustic modelling, bit allocation, quantisation schemes and entropy coding [1]. In common with work on wavelets for video compression, we have found that regularity plays little role in compression performance, as long as it sufficiently exceeds unity. Below unity, lower values of regularity imply lower coding gain [2].

The performance of wavelet based codec schemes has been compared to MPEG international coding industry standards in terms of compression versus quality, with particular attention paid to the compression-speed trade-off, since today's buzzword is multimedia. The latter has been achieved by a detailed examination of the shortest wavelet filters, namely those with 4 taps [5]. It has also been shown that the wavelet-based codec is superior to MPEG Layer I and competitive with Layer II in terms of objective and subjective quality, whilst having lower complexity/implementation cost and a shorter processing delay [2].

Most recently attention has turned to the topic of scalable coding using wavelets. The approach is outlined in [9], indicating that good performance can be obtained using wavelets. Whereas the coding gain of scalable coders based on more conventional transforms, e.g. polyphase filterbanks, is inferior to that of a single layer codec, wavelet-based systems can exceed this performance. Results similar to our own ([1], [2], [5] and [9]) have been published independently [13], [14], also indicating that wavelets are competitive with the MPEG-audio standards. It is thus an appropriate time to initiate a detailed study of the use of wavelets in the context of the forthcoming MPEG-4 standard, the next important stage in the development of world-wide standards for the representation of audio signals.

ACKNOWLEDGMENT

The authors wish to thank the AES Educational Foundation (J. Audio Eng. Soc., Vol. 44, No. 5, pp. 426, 1996 May) and EPSRC (Grant GR/L15272) for supporting this work.

[Figure 1: SSNR (dB) versus number of filter coefficients for the DAUB-A, DAUB-B, COIF-A, COIF-B, SYMM-A, SYMM-B, SYMM-C and JN-QMF families; DBA, 32-band uniform decomposition and Huffman coding (Fs = 8 kHz, 3 bits/sample).]
Figure 1. On the performance of different wavelet families, in terms of SSNR, at 24 kbps (Fs = 8 kHz).

[Figure 2: SSNR (dB) versus number of filter coefficients for the same families; DBA, 32-band uniform decomposition and Huffman coding (Fs = 48 kHz, 0.6667 bits/sample).]
Figure 2. On the performance of different wavelet families, in terms of SSNR, at 32 kbps (Fs = 48 kHz).

[Figure 3: 4-tap wavelets, 32-band uniform decomposition with DBA and Huffman coding (Fs = 48 kHz); axes c0 x 10^-3 and c3 x 10^-3; curves: Original 768 kbps, Comp 32 kbps and Comp 192 kbps for ROCK, POP and CLASSIC, and Regularity.]
Figure 3. The performance of 4-tap wavelet filters, in terms of normalised SSNR and in comparison to their regularity, for various bit rates and for three different music signals: ROCK, POP, CLASSIC.
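Section 2 notes that the coefficients of a 4-tap wavelet filter are determined by the selection of an angle, and Fig. 3 is drawn over the coefficients c0 and c3. As a minimal sketch of that idea, the code below uses one common single-angle parameterisation of 4-tap orthonormal lowpass filters. The function names and the angle convention are assumptions made for illustration and need not coincide with the convention behind Table 1. In this convention 60 degrees reproduces the 4-tap Daubechies coefficients, while some angles make two of the four taps vanish and so give 2-tap filters.

```python
import math

def four_tap_lowpass(theta):
    """4-tap orthonormal lowpass filter for angle theta (radians).

    One common parameterisation; the angle convention is an assumption
    and may differ from the one used for Table 1.
    """
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 / (2.0 * math.sqrt(2.0))
    return [k * (1 - c + s), k * (1 + c + s), k * (1 + c - s), k * (1 - c - s)]

def is_orthonormal(h, tol=1e-12):
    """Check unit energy and orthogonality to the filter shifted by two samples."""
    unit_energy = abs(sum(x * x for x in h) - 1.0) < tol
    shift_orthogonal = abs(h[0] * h[2] + h[1] * h[3]) < tol
    return unit_energy and shift_orthogonal

def nonzero_taps(h, tol=1e-12):
    """Number of taps that are not numerically zero (a proxy for complexity)."""
    return sum(1 for x in h if abs(x) > tol)

# 4-tap Daubechies coefficients, for comparison with theta = 60 degrees.
r3 = math.sqrt(3.0)
daub4 = [(1 + r3) / (4 * math.sqrt(2.0)), (3 + r3) / (4 * math.sqrt(2.0)),
         (3 - r3) / (4 * math.sqrt(2.0)), (1 - r3) / (4 * math.sqrt(2.0))]

for degrees in (0, 60, 90):
    h = four_tap_lowpass(math.radians(degrees))
    print(degrees, nonzero_taps(h), is_orthonormal(h), [round(x, 4) for x in h])
```

Every angle yields a valid orthonormal filter, which is why the choice can be left open and traded between complexity (number of nonzero taps) and coding performance.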
REFERENCES

[1] Panos Kudumakis and Mark Sandler, "On the Performance of Wavelets for Low Bit Rate Coding of Audio Signals", in Proc. of the IEEE ICASSP'95, Vol. 5, pp. 3087-3090, Detroit (MI), U.S.A., May 8-12, 1995.
[2] Panos Kudumakis and Mark Sandler, "Wavelets, Regularity, Complexity and MPEG-Audio", 99th AES Convention, New York, Oct. 6-9, 1995. Preprint 4048, J. Audio Eng. Soc. (Abstracts), Vol. 43, No. 12, p. 1072, Dec. 1995.
[3] Panos Kudumakis, "Synthesis and Coding of Audio Signals using Wavelet Transforms for Multimedia Applications", PhD Thesis, King's College, University of London, submitted April 1996.
[4] Eveleigh and Wilkinson, "GUD Toons (Good Tunes)", in Kunta Kinti, The Future Music CD Vol. 4, demo no. 6, in Future Music Magazine No. 14, Dec. 1993.
[5] Panos Kudumakis and Mark Sandler, "On The Compression Obtainable With 4-Tap Wavelets", IEEE Signal Processing Letters, Vol. 3, No. 8, pp. 231-233, August 1996.
[6] M.J.T. Smith and T.P. Barnwell, "Exact reconstruction techniques for tree-structured subband coders", IEEE Trans. on Acoustics, Speech and Signal Processing, 34, pp. 434-441, 1986.
[7] Karlheinz Brandenburg and Bernhard Grill, "First Ideas on Scalable Audio Coding", 97th AES Convention, San Francisco, Nov. 10-13, 1994.
[8] Bernhard Grill and Karlheinz Brandenburg, "A Two- or Three-Stage Bit Rate Scalable Audio Coding System", 99th AES Convention, New York, Oct. 6-9, 1995.
[9] Panos Kudumakis and Mark Sandler, "Wavelet Packet Based Scalable Audio Coding", in Proc. IEEE ISCAS'96, Vol. 2, pp. 41-44, Atlanta, May 1996.
[10] Panos Kudumakis and Mark Sandler, "Synthesis/Coding of Audio Signals based on Inverse Wavelet Packet Algorithm", in Proc. of the UK Symposium on Applications of Time-Frequency and Time-Scale Methods (TFTS'95), pp. 92-99, Univ. of Warwick, Coventry, U.K., Aug. 30-31, 1995.
[11] World Wide Web, Panos Kudumakis: "http://www.eee.kcl.ac.uk/member/pgrads/p kudumakis.html"
[12] Panos Kudumakis and Mark Sandler, "Synthesis of Audio Signals Using the Wavelet Transform", IEE Colloquium on "Audio DSP - Circuits and Systems", 16 Nov. 1993. Ref. No: 1993/219.
[13] Mark Black and Mehmet Zeytinoglu, "Computationally efficient wavelet packet coding of wide-band stereo audio signals", in Proc. ICASSP'95, Vol. 5, pp. 3075-3078, Detroit (MI), U.S.A., May 8-12, 1995.
[14] P. Philippe, F. Moreau de Saint-Martin, L. Mainard, "On the choice of wavelet filters for audio compression", in Proc. ICASSP'95, Vol. 2, pp. 1045-1048, Detroit (MI), U.S.A., May 8-12, 1995.