99TH AES CONVENTION, NEW YORK, USA, OCT. 6-9, 1995.
Wavelets, Regularity, Complexity and MPEG-Audio
Panos E. Kudumakis and Mark B. Sandler
Signals Circuits and Systems Research Group, Department of Electronic and Electrical Engineering, King's College London, Strand, London WC2R 2LS, U.K.
tel: ++44 (0)171-873 2041, fax: ++44 (0)171-836 4781, e-mail: [email protected]
Abstract
The performance of wavelets in terms of vanishing moments, regularity and complexity is investigated for low bit-rate coding of audio signals. To assess the coding gain of wavelets a codec model has been designed and implemented, based on a wavelet packet algorithm and an auditory perception model combined with entropy noiseless coding. The wavelet packet based coding approach is compared to the MPEG-audio international standard in terms of objective and subjective measurements and is shown to be superior to MPEG-audio Layer I and competitive with Layer II. The trade-offs between the two approaches are discussed.
1. Introduction
Due to the widespread interest in transmitting, storing and retrieving CD-quality sound, high quality audio coding is an area attracting considerable interest within the DSP community. Current applications for high quality audio coding include Digital Audio Broadcasting (radio and TV), portable audio units (Digital Compact Cassette and Minidisc), and also transmission of coded sound over high speed networks. Forthcoming applications where audio compression is involved include interactive mobile multimedia communication, videophone, mobile audio-visual communication, multimedia electronic mail, electronic newspapers, interactive multimedia databases, and multimedia videotex. One of the most important areas of investigation is the improvement of quality at bit rates lower than 64 kbps per channel, relative to the popular MPEG-audio 1 standard [1]. Block diagrams of the MPEG-audio 1 encoder and decoder are given in Figs. 1 and 2. When the audio application allows a limited audio bandwidth, a cost-effective solution is offered by the recent MPEG-audio 2 draft international standard involving a reduced sampling frequency [2]. However, the interest in solutions preserving the full bandwidth
of the encoded audio signal still remains high [3], [4]. This paper describes the wavelet packet (WP) based codec, and discusses its similarities to and differences from the MPEG-audio standard. This is followed by a review of the wavelet filters [5] applied in the WP-based codec. Finally, the performance of the two schemes is discussed in terms of quality, bit-rate, complexity and delay.
2. Filterbanks, Complexity and Delay.
WP decomposition is a generalized orthogonal wavelet analysis [6]. This technique is also known as subband coding based on a tree-structured perfect-reconstruction filterbank [7]. WP decomposition is represented by a binary tree in which one has the freedom to stop or continue the decomposition at any node, and several choices for an irregular tree are thus possible. Some possible decomposition tree structures are shown in Fig. 3. Subband decomposition influences coder performance, computational load and delay. WP analysis based on the critical bands of the human auditory system [8] has been used extensively [9]-[10] for coding of audio signals, suggesting near transparency at 64 kbps per channel. In [9] an 8-stage, 28-band WP decomposition is used without exactly achieving the resolution required by the critical bands at center frequencies above 10 kHz. The flexibility of WPs allows easy modification of the tree structure, limited only by the uncertainty principle. Additional stages may therefore be introduced to achieve a better match to the critical bands, as in [10], where a 9-stage, 29-band WP decomposition is used. However, for a given filter length this will increase the total delay. The coder delay is a function of the number of decomposition stages and the order of the FIR filters. In particular, the overall coder delay is given by:

$$\text{delay} = N\,(2^{S} - 1)\ \text{samples} \qquad (1)$$

where N is the filter order, and S is the number of WP decomposition stages. It is well known that in coding of non-stationary signals, such as transients with sharp attacks, it is useful for the filterbank to use frequency subdivision schemes that approach the critical bands of the human auditory mechanism (with high temporal resolution). However, for coding of stationary signals it is better to use a full-depth tree decomposition (with high frequency resolution) to maximize the coding gain, even if the decomposition does not mimic the human auditory system [11]-[12]. Thus in [6] a simple cost-counting function was used as an entropy criterion for the selection of the "best basis". This method adaptively matches the decomposition tree to a given signal. Similarly, a time-varying filterbank was used in [12]. The result is that in its highest frequency resolution mode this filterbank is approximately uniform with 512 bands, while in its highest temporal resolution mode it is close to a critical band system. The selection of the best match of the filterbank resolution to a given signal is based on a perceptual entropy criterion. Although the results using adaptive filterbanks [6], [12] are promising compared with non-adaptive (octave, critical-band and uniform) methods, the high computational load imposed by the "best basis" selection algorithm and the relatively increased side information force a trade-off. As a compromise, the authors are currently
investigating a computationally efficient method which switches between octave and uniform filterbank structures, driven by the input signal. However, in the MPEG-audio 1 standard [1], the filterbank decomposes the audio signal into 32 equal-bandwidth subbands. This is efficiently achieved by a polyphase filterbank. Accordingly, in this paper, for the sake of comparison, the uniform WP approach is utilized and the number of scalefactors is maintained at 32, as in [1].

TABLE I
Comparison of the Filterbanks in Terms of Complexity. L Stands for the Transform Length and N for the Filter Length.

| Filterbank Structure | Complexity, Multiplications (C) | Max. No. of Subbands (S) | Subbands Used for L = 1024 Samples |
|---|---|---|---|
| Wavelet Transform (WT) / Octave Bands | $2NL$ | $\log_2(L)$ | 10-log |
| Wavelet Packet (WP) with Critical Bands | $2NL < C < NL\log_2(L)$ | $\log_2(L) < S < L/2$ | 28 or 29-CB |
| Wavelet Packet (WP) with Uniform Bands | $NL\log_2(L)$ | $L/2$ | 32-uniform |
| Polyphase Filterbank (PF) / Uniform Bands | $64L$ | $L$ | 32-uniform |
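As an illustration of Eq. (1) and of the multiplication counts listed in Table I, the following minimal Python sketch evaluates both for the frame length used in this paper. The function names are hypothetical, and passing the number of filter taps as the filter order N is an assumption made only for this example.

```python
import math


def coder_delay(filter_order, stages):
    """Overall coder delay in samples, Eq. (1): N * (2**S - 1)."""
    return filter_order * (2 ** stages - 1)


def table_complexity(transform_length, filter_length):
    """Multiplication counts from Table I for the fixed filterbank structures.

    The critical-band WP is not listed here because Table I only bounds it
    between the octave-band and uniform-band figures.
    """
    L, N = transform_length, filter_length
    return {
        "WT / octave bands": 2 * N * L,
        "WP / uniform bands": N * L * math.log2(L),
        "PF / uniform bands": 64 * L,
    }


if __name__ == "__main__":
    # Example values (assumptions): 4-tap and 20-tap filters, 5 and 9 stages.
    for taps in (4, 20):
        for stages in (5, 9):
            print(f"N={taps:2d}, S={stages}: delay = {coder_delay(taps, stages)} samples")
    for name, count in table_complexity(1024, 20).items():
        print(f"{name}: {count:.0f} multiplications")
```

The sketch makes the growth of the delay with the number of stages explicit: the factor in Eq. (1) goes from 31 for S = 5 to 511 for S = 9.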
3. Compression Model
For the assessment of the coding gain of wavelet filters a codec model has been designed and implemented, based upon the combination of a wavelet packet algorithm, a model of auditory perception and entropy Huffman noiseless coding. A wavelet packet, or in other words a 5-stage, 32-band uniform frequency subdivision subband coder scheme based on a tree-structured filterbank, was used as the basis for a performance comparison of wavelet filters; see also [13]. The transform frame length was set to L = 1024 samples. It should be noted that the least complex filterbank structure is offered by the octave or logarithmic decomposition, where the complexity is independent of the number of stages (octaves). The 5-stage uniform filterbank structure used here results in a filterbank with similar complexity but shorter delay than critical-band based solutions. Table I gives a comparison of the filterbanks in terms of computational load. It should also be noted, for the sake of comparison, that the polyphase filterbank employed in the MPEG model results in 16-tap FIR filters [9]. However, more efficient implementations have already appeared in the literature, e.g. [14].

Efficient signal compression results when subband signals are quantized with subband-specific bit allocation, based on the input power spectrum (FFT) and a model of auditory perception. Thus, this filterbank is combined with Dynamic Bit Allocation (DBA), based on Psychoacoustic Model-1 as adapted for use with MPEG Layer-2 [1]. Quantisation is the same for all subbands, based on block companding as in MPEG Layer-1. Finally, while the Dynamic Bit Allocation strategy exploits some of the human hearing characteristics, further reduction in the bit rate requires removing the statistical redundancies of the signal. That is the ideal case for entropy noiseless Huffman coding, which has been embedded in our codec as in MPEG Layer-3 [1].
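The 5-stage, 32-band uniform tree described in this section can be sketched as follows. This is a minimal illustration rather than the codec implementation: it assumes periodic extension at the frame boundaries, uses the 4-tap Daubechies filter as an example, and leaves the subbands in tree order rather than sorted by frequency. The names `analysis_step` and `uniform_wp` are hypothetical.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# 4-tap Daubechies low-pass filter, normalised so that the taps sum to sqrt(2).
D4 = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))


def analysis_step(x, c):
    """Split x into low-pass and high-pass halves (two-channel, decimated by 2).

    Assumption: the frame is extended periodically, so no extra samples are
    produced at the boundaries.
    """
    n = len(c)
    # High-pass taps derived from the low-pass taps: b_k = (-1)^k c_{n-1-k}.
    b = np.array([(-1) ** k * c[n - 1 - k] for k in range(n)])
    # Row i of idx holds the sample positions 2i, 2i+1, ..., 2i+n-1 (mod len(x)).
    idx = (np.arange(0, len(x), 2)[:, None] + np.arange(n)[None, :]) % len(x)
    windows = x[idx]
    return windows @ c, windows @ b


def uniform_wp(x, c, stages):
    """Full-depth wavelet packet tree: every band is split again at each stage."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(stages):
        bands = [half for band in bands for half in analysis_step(band, c)]
    return bands


if __name__ == "__main__":
    frame = np.random.default_rng(0).standard_normal(1024)  # one frame, L = 1024
    subbands = uniform_wp(frame, D4, stages=5)
    print(len(subbands), "subbands of", len(subbands[0]), "samples each")
    print("energy in :", np.sum(frame ** 2))
    print("energy out:", sum(np.sum(b ** 2) for b in subbands))
```

Because the filter pair is orthonormal and the extension is periodic, the total energy of the subband samples matches that of the input frame, which is a convenient sanity check before quantisation is added.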
4. Wavelets, Vanishing Moments and Regularity.
The convergence and differentiability of the continuous wavelet function are related to the number of zeros of the discrete wavelet filter at $\omega = \pi$. The property known as regularity is defined by Daubechies in [5] and provides a measure of smoothness for wavelet and scaling functions. However, it is not clear whether regular wavelet filters are more suited to coding schemes [15], [16]. The regularity order necessary for good coding performance of WT schemes remains a topic of investigation. While regular basis functions are desired in image compression and in numerical applications, their influence is not clear for coding of audio signals. Thus the approach taken here is to investigate the performance of wavelets in terms of vanishing moments, regularity and complexity for low bit rate coding of audio signals. The most popular and frequently used orthogonal wavelets are the original Daubechies wavelets [5]. They are a family of orthogonal wavelets indexed by $N \in \mathbb{N}$, where $N$ is the number of vanishing wavelet moments. They are supported on an interval of length $2N - 1$, or in other words their number of coefficients is equal to $2N$. A disadvantage is that, except for the Haar wavelet ($N = 1$), they cannot be symmetric or antisymmetric. Also, the number of vanishing moments of the wavelet is connected to the regularity or smoothness of the wavelet and vice versa. Their regularity increases linearly with $N$ and is approximately equal to $0.2075N$ for large $N$. This construction does not lead to a unique solution if $N$ and the support length are fixed. In fact, this family corresponds to choosing the extremal phase. These compactly supported wavelets, with extremal phase and the highest number of vanishing moments compatible with their support width, are the most asymmetric. They are also known as "minimum phase wavelet filters". These wavelet filters for 4 and 20 coefficients are shown in Figs. 4 and 5, respectively.
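The link between vanishing moments and zeros at $\omega = \pi$ (that is, at z = -1) can be checked numerically. The sketch below is only an illustration under stated assumptions: the 4-tap Daubechies and Haar coefficients are written out explicitly, the helper name `zeros_at_pi` is hypothetical, and the tolerance is an arbitrary choice that allows for the limited numerical accuracy of repeated roots.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Low-pass coefficients, normalised so that the taps sum to sqrt(2).
HAAR = np.array([1.0, 1.0]) / np.sqrt(2.0)
D4 = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))


def zeros_at_pi(c, tol=1e-3):
    """Count the zeros of the filter polynomial that lie at z = -1 (omega = pi)."""
    roots = np.roots(c)  # roots of c[0]*z^(n-1) + ... + c[n-1]
    return int(np.sum(np.abs(roots + 1.0) < tol))


if __name__ == "__main__":
    for name, taps in (("Haar", HAAR), ("4-tap Daubechies", D4)):
        print(f"{name}: {zeros_at_pi(taps)} zero(s) at z = -1")
    # The remaining zero of the 4-tap filter lies inside the unit circle,
    # which is what the extremal (minimum) phase choice refers to.
    print("zero moduli of the 4-tap filter:", np.abs(np.roots(D4)))
```

For these two filters the count equals the number of vanishing moments N (one for Haar, two for the 4-tap filter).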
This family has been chosen after an extensive evaluation of the performance of several different wavelet families, including Johnston-QMFs, less asymmetric wavelets, coiflets, and some biorthogonal wavelet filters [13]. These "minimum phase wavelet filters" have in general been shown to perform better than the others in terms of Segmental Signal to Noise Ratio (SSNR), and also in informal subjective listening tests.

The wavelet filter coefficients $\{c_k\}_{k=0}^{2N-1}$ satisfy certain constraints [5]:
N vanishing moments:

$$\sum_{k=0}^{2N-1} (-1)^k\, k^m\, c_k = 0 \qquad (2)$$

for $m = 0, 1, \ldots, N-1$, and $N \geq 1$.

Orthogonality:

$$\sum_{k=0}^{2N-1} c_k\, c_{k+2m} = \delta_{0,m} \qquad (3)$$

for $m = 0, 1, \ldots, N-1$, and $N \geq 1$.

Normalisation:

$$\sum_{k=0}^{2N-1} c_k = \sqrt{2} \qquad (4)$$

Under these constraints, the values $c_k$ are the coefficients of a low-pass filter and $b_k$ are the coefficients of a high-pass filter, related by:

$$b_k = (-1)^k\, c_{2N-1-k} \qquad (5)$$
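The constraints above can be verified numerically for a given set of coefficients. The following sketch is a hedged illustration, not part of the codec: it assumes the 4-tap Daubechies coefficients as test data, the function names are hypothetical, and only the shifted products with $m \geq 1$ are checked for the orthogonality condition.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Example data (assumption): 4-tap Daubechies low-pass coefficients, N = 2.
D4 = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))


def check_constraints(c):
    """Evaluate the left-hand sides of Eqs. (2)-(4) for a filter with 2N taps."""
    c = np.asarray(c, dtype=float)
    taps = len(c)
    n_moments = taps // 2
    k = np.arange(taps)
    # Eq. (2): alternating moments sum_k (-1)^k k^m c_k for m = 0..N-1.
    moments = [float(np.sum((-1.0) ** k * k ** m * c)) for m in range(n_moments)]
    # Eq. (3) for m = 1..N-1: products of the filter with its even shifts.
    shifted = [float(np.sum(c[: taps - 2 * m] * c[2 * m:])) for m in range(1, n_moments)]
    # Eq. (4): sum of the coefficients.
    return {"moments": moments, "shifted_products": shifted, "sum": float(np.sum(c))}


def highpass(c):
    """High-pass coefficients from Eq. (5): b_k = (-1)^k c_{2N-1-k}."""
    c = np.asarray(c, dtype=float)
    return (-1.0) ** np.arange(len(c)) * c[::-1]


if __name__ == "__main__":
    print(check_constraints(D4))
    print("high-pass taps:", highpass(D4))
```

For a valid filter the moments and shifted products should be zero to within rounding error, and the sum should equal $\sqrt{2}$.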
Figure 3 indicates the application of these low-pass (L) and high-pass (H) filters in the tree-structured subband coding decomposition schemes. If we expand Eqs. (2), (3) and (4) for N = 2 vanishing moments, i.e. for the 4-tap wavelet filters, then it is possible to derive the relationship:
$$\left(c_0 - \frac{1}{2\sqrt{2}}\right)^2 + \left(c_3 - \frac{1}{2\sqrt{2}}\right)^2 = \left(\frac{1}{2}\right)^2 \qquad (6)$$
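This is the equation of a circle with centre $\left(1/(2\sqrt{2}),\, 1/(2\sqrt{2})\right)$ and radius $R = 1/2$. Then $c_1$ and $c_2$ are defined from Eqs. (2), (3) and (4) as:

$$c_1 = \frac{1}{\sqrt{2}} - c_3 \quad \text{and} \quad c_2 = \frac{1}{\sqrt{2}} - c_0 \qquad (7)$$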
Thus the coefficients $\{c_0, c_3\}$ are independent parameters and therefore identify each wavelet on the $\{c_0, c_3\}$-plane, as in Figure 6. We can also derive the wavelet filter coefficients in polar co-ordinates as:
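$$\left.\begin{aligned} c_0 &= \frac{1}{2\sqrt{2}} + R\cos\theta \\ c_3 &= \frac{1}{2\sqrt{2}} + R\sin\theta \end{aligned}\right\} \quad \theta \in [0, 2\pi] \qquad (8)$$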
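The polar form lends itself to a short numerical sketch that generates a 4-tap filter from a single angle, using Eqs. (7) and (8). The function name `filter_from_angle` and the symbol theta for the angle are assumptions made for this illustration, and the two angles quoted in the usage lines were worked out for this example rather than taken from the paper.

```python
import numpy as np

RADIUS = 0.5                          # R in Eq. (8)
CENTRE = 1.0 / (2.0 * np.sqrt(2.0))   # both centre co-ordinates of the circle, Eq. (6)


def filter_from_angle(theta):
    """Return [c0, c1, c2, c3] for a point at angle theta on the circle of Eq. (6)."""
    c0 = CENTRE + RADIUS * np.cos(theta)   # Eq. (8)
    c3 = CENTRE + RADIUS * np.sin(theta)   # Eq. (8)
    c1 = 1.0 / np.sqrt(2.0) - c3           # Eq. (7)
    c2 = 1.0 / np.sqrt(2.0) - c0           # Eq. (7)
    return np.array([c0, c1, c2, c3])


def first_moment(c):
    """Alternating first moment, Eq. (2) with m = 1."""
    k = np.arange(len(c))
    return float(np.sum((-1.0) ** k * k * c))


if __name__ == "__main__":
    for theta in (19 * np.pi / 12, 7 * np.pi / 4, np.pi / 3):
        c = filter_from_angle(theta)
        print(f"theta = {theta:.4f}: c = {np.round(c, 4)}, "
              f"sum = {c.sum():.4f}, c0*c2 + c1*c3 = {c[0] * c[2] + c[1] * c[3]:.1e}, "
              f"first moment = {first_moment(c):.4f}")
```

Every angle yields coefficients that sum to $\sqrt{2}$ and satisfy $c_0 c_2 + c_1 c_3 = 0$. In this sketch the angle $19\pi/12$ reproduces the 4-tap Daubechies coefficients, for which the first moment also vanishes, while $7\pi/4$ gives the Haar filter padded with two zeros.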
However, since the radius $R$ is constant, each 4-tap wavelet filter $\{c_0(\theta), c_1(\theta), c_2(\theta), c_3(\theta)\}$ is now identified by only one parameter, the angle $\theta$. In fact, every point on this circle determines a different multiresolution analysis and hence a 4-tap wavelet orthonormal basis for $L^2$(