IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 4, JULY 2006
A Fine Granular Scalable to Lossless Audio Coder
Rongshan Yu, Member, IEEE, Susanto Rahardja, Lin Xiao, and Chi Chung Ko, Senior Member, IEEE
Abstract—This paper presents Advanced Audio Zip (AAZ), a fine grained scalable to lossless (SLS) audio coder that has recently been adopted as the reference model for the MPEG-4 audio SLS work. AAZ integrates the functionalities of high-compression perceptual audio coding, fine granular scalable audio coding, and lossless audio coding in a single framework, and simultaneously provides backward compatibility to MPEG-4 Advanced Audio Coding (AAC). AAZ provides fine granular bit-rate scalability from lossy to lossless coding, and this scalability is achieved in a perceptually meaningful way, i.e., better perceptual quality at higher bit-rates. Despite its abundant functionalities, AAZ introduces only negligible overhead in lossless compression performance compared with a nonscalable, lossless-only audio coder. As a result, AAZ provides a universal yet efficient solution for digital audio applications such as audio archiving, network audio streaming, portable audio playback, and music downloading, which were previously catered for by several different audio coding technologies, and it eliminates the need for any transcoding system to facilitate sharing of digital audio content across these application domains.

Index Terms—Audio compression, lossless audio coding, perceptual audio coding, scalable audio coding.
I. INTRODUCTION
DURING the last two decades, digital technology has superseded its analog counterpart for audio applications due to its unprecedented advantages in quality and flexibility. However, the large data rate of digital audio (a CD-quality signal, sampled at 44.1 kHz with 16-bit quantization, has a data rate of 705.6 kbps per channel, or 1.41 Mbps for a stereo pair) can be a heavy burden for digital audio applications with limited bandwidth or storage capacity. In response to this need, considerable research has been devoted to the development of audio compression technology, which has led to fruitful results. Nowadays, audio coders can deliver CD-quality stereo audio at bit-rates from 96 to 192 kbps, i.e., about 14 to 7 times compression, and in most cases the quality degradation introduced in the compressed audio is perceptually negligible. Most of these high-compression audio coders make use of the so-called perceptual coding technique [1], which tries to optimize the perceptual quality of the reconstructed audio at a given data rate by exploiting the masking property of the human auditory system. In principle, a perceptual audio coder shapes the distortion, or noise, introduced during the coding process so that it is masked by the decoded audio during playback and becomes inaudible.

Manuscript received May 7, 2004; revised June 29, 2005. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ravi P. Ramchandran. R. Yu, S. Rahardja, and L. Xiao are with the Institute for Infocomm Research, Singapore 119613 (e-mail: [email protected]; [email protected]; [email protected]). C. C. Ko is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119620 (e-mail: [email protected]). Digital Object Identifier 10.1109/TSA.2005.860841
Since the masking property is most naturally represented as a time/frequency phenomenon, most perceptual audio coders adopt the transform coding approach, where coding is performed on transform coefficients obtained by applying a filterbank or a lapped orthogonal transform to the time-domain audio waveform [1].

Clearly, the lossy nature of the perceptual audio coding technology described above makes it inappropriate for high-quality applications where every bit of the original audio waveform is expected to be preserved, for example, audio archiving systems. In such cases, lossless audio coding technology remains the only solution. Unlike the perceptual audio coding approach, most lossless audio coding systems adopt a time-domain predictive coding approach [2] due to its simplicity, and they can only achieve very limited compression ratios because of the lossless coding constraint. According to [2], lossless audio coders achieve only 1.5–3.0 times compression for most audio signals, which is very limited compared with that of perceptual audio coders.

Recently, with advances in computing, broadband networking, and storage technologies, the capacities of more and more digital audio applications are quickly approaching those required for delivery of high sampling-rate, high-resolution (e.g., 96 kHz, 24 bit/sample) digital audio at lossless quality. However, there are still many applications that require high-compression digital audio due to bandwidth/storage constraints. Clearly, a scalable to lossless (SLS) audio coding technology that supports both lossy and lossless audio compression is in high demand for implementing the Universal Multimedia Access (UMA) paradigm [3], where multimedia contents are delivered to various devices with different capabilities through a heterogeneous network and consumed by users with different preferences. Desirable features of an SLS audio coder include the following.

1) Lossy to lossless scalability: As a basic requirement, the scalable audio coding technology should provide lossy to lossless scalability at the bit-stream level to mitigate the burden of transcoding.
2) Fine granular scalability (FGS): Although for a particular application a "layered" scalable solution with predefined, application-specific quality layers would be sufficient, FGS is still desirable to cater to the large diversity of network bandwidths, device capabilities, and user preferences.
3) Perceptual scalability: It is a natural requirement for a scalable coder that better quality should always result when more resources are consumed. Due to the high nonlinearity of the human auditory system, care should be taken in designing a scalable audio coding system to ensure that this requirement is complied with.
4) Coding efficiency: It is inevitable that the coding efficiency of a scalable audio coder will be inferior to that of a nonscalable one due to the additional scalability constraint. However, in order not to "damp" applications with limited bandwidth/storage, too much overhead should be avoided, especially at low data rates.
5) Low complexity: This is a useful feature for devices with limited computation power such as portable audio players or personal digital assistants (PDAs).

The technical feasibility of an SLS audio coder has been shown in recent works on scalable audio coding [4]–[6]. In [3], [4], a "two-layer" approach is described where a perceptual audio coder, specifically an MPEG-4 TwinVQ audio coder [7], is used as the core layer encoder, and the difference between its lossy reconstruction and the original audio is then bit-plane coded in the time domain to generate the fine granular scalable to lossless bit-stream. Perceptual quality at intermediate bit-rates is maintained by an iterative post-processing that reshapes the spectral envelope of the reconstructed audio to resemble that of the original. In [5] and [6], a frequency-domain FGS coding approach is adopted. In [5], the perceptual set partitioning in hierarchical trees (PSPIHT) coding procedure, which is modified from the well-known SPIHT algorithm in image coding [8], is performed on the Modified Discrete Cosine Transform (MDCT) coefficients of the audio signal to generate the FGS bit-stream, and perceptual scalability is ensured by coding the MDCT coefficients in the order of their perceptual significance. Since the MDCT coefficients are rounded to their nearest integers in PSPIHT, an additional residual coding layer is added to code, in the time domain, the error generated by this rounding operation. In [6], a lossless integer-to-integer implementation of the MDCT, namely, the Integer MDCT (IntMDCT) [9], is used so that there is no need for an additional residual layer. The FGS bit-stream is then generated by performing bit-plane coding on the IntMDCT coefficients in the order of their perceptual significance.

In this paper, we present Advanced Audio Zip (AAZ), an SLS audio coder that has recently been adopted as the reference model for the MPEG-4 audio SLS work [10] at the Trondheim meeting of ISO/IEC JTC1/SC29/WG11 (MPEG) in July 2003. The performance of this Reference Model has been further improved during the Core Experiment process of MPEG since then, and the specification of this work will be published by the end of 2005. AAZ adopts the IntMDCT based transform coding approach. In an effort to optimize its coding performance at the low end of the bit-rate range, AAZ embeds an MPEG-4 AAC coder [11], [12] as its core coder, which not only delivers state-of-the-art perceptual audio coding performance, but also makes the AAZ system backward compatible with existing MPEG-4 audiovisual systems. The scalable to lossless coding is then performed directly in the IntMDCT domain by a novel bit-plane coding approach, namely, the Bit-Plane Golomb Code (BPGC) [13], which provides scalability with a step size down to 0.4 kbps for audio inputs sampled at 48 kHz. Perceptual scalability of AAZ is achieved by preserving, during the scalable coding process, the spectral shape of the noise spectrum of the AAC core, which has been perceptually optimized in the AAC encoding process [14].
Despite its abundant features, such as the availability of fine grained scalability, AAZ still achieves excellent compression ratio performance. It is found that AAZ introduces only marginal overhead when compared with state-of-the-art nonscalable, lossless-only audio coders. Several key components help to achieve this high coding efficiency. Firstly, AAZ adopts the IntMDCT based transform lossless coding approach, which provides a simple and efficient framework for a scalable to lossless audio coder, and helps to embed the MDCT based MPEG-4 AAC system in a rather straightforward manner. This also reduces the overhead of an additional residual coding layer [5] in the time domain, which is generally inevitable if a noninteger transform is used. Secondly, to efficiently embed the AAC core, a novel error-mapping process is used so that the FGS coding is performed on the residual signal after removing from the IntMDCT spectral data the information that is already coded in the AAC core. In addition, instead of generating a residual signal with a flat probability distribution, as in the case of a conventional Minimum Mean Square Error (MMSE) error-mapping approach, the error mapping used in AAZ preserves in the residual signal the probability skew of the original IntMDCT spectral data, which is thus more efficiently coded with an entropy coder. Finally, the BPGC adopted in AAZ provides low-complexity, high-efficiency embedded coding for sources with a Laplacian distribution, which, as will be shown in this paper, closely approximates the probability distribution of the IntMDCT coefficients of audio.

The remainder of the paper is organized as follows. Section II provides a statistical study of the IntMDCT coefficients of audio signals, which not only justifies the entropy coder selection of AAZ, but also provides insights for future work on IntMDCT based transform audio coders. The structure of AAZ is outlined in Section III. Subsequently, the lossless compression performance and the perceptual scalability of AAZ are evaluated in Section IV. Finally, Section V concludes this paper.
II. INTEGER MDCT FOR AUDIO

A. Integer MDCT

The MDCT [16] has been widely used in transform audio coders such as MPEG-1/2 Layer III (mp3) [18], MPEG-4 AAC, Dolby AC-2/AC-3 [19], and numerous experimental audio coding algorithms [1]. The MDCT can be taken as a critically sampled filterbank with 50% overlapped windows in the analysis filterbank. This overlapping analysis windowing is very useful for audio coding since it helps to mitigate the blocking artifacts that deteriorate the reconstruction of transform audio coders with nonoverlapped transforms. To achieve critical sampling, a subsampling operation is performed in the frequency domain, and the aliasing resulting from this subsampling operation is subsequently cancelled in the time domain by an "overlap and add" operation of two succeeding blocks in the synthesis filterbank [17]. The MDCT is performed on a block of time-domain samples of length 2N to calculate N transform coefficients, and it advances by only N new time-domain samples after each block transform operation since two succeeding analysis blocks are 50% overlapped.
Fig. 1. Decomposition of the MDCT into the windowing and time domain aliasing operation and the DCT-IV.
Given an input block $\mathbf{x} = [x(0), x(1), \ldots, x(2N-1)]^T$, the MDCT coefficients $c(m)$ are calculated as

$$c(m) = \sum_{k=0}^{2N-1} w(k)\,x(k)\cos\left[\frac{\pi}{N}\left(k + \frac{1}{2} + \frac{N}{2}\right)\left(m + \frac{1}{2}\right)\right], \quad m = 0, \ldots, N-1 \tag{1}$$

where $w(k)$, $k = 0, \ldots, 2N-1$, is a window function for a smooth overlapping of blocks. To achieve Perfect Reconstruction (PR), $w(k)$ should be further subjected to the following constraints:

$$w(2N-1-k) = w(k) \tag{2}$$

and

$$w^2(k) + w^2(k+N) = 1 \tag{3}$$

for $k = 0, \ldots, N-1$. A typical example of a window function that fulfills these PR conditions is the "sine" window defined as

$$w(k) = \sin\left[\frac{\pi}{2N}\left(k + \frac{1}{2}\right)\right], \quad k = 0, \ldots, 2N-1. \tag{4}$$

The MDCT produces real spectral data for integer inputs, which will in fact result in data expansion instead of compression if the data are coded to a precision that ensures lossless reconstruction. To solve this problem, a "two-layer" approach was previously adopted in [5], [20] for transform based lossless audio coding, where the real spectral data are truncated to their nearest integers before coding and an additional residual coding layer is then used to code the resulting rounding error in the time domain. The introduction of an integer transform such as the IntMDCT greatly simplifies the design of a lossless audio coder: the integer transform provides a reversible integer-to-integer approximation to its theoretical counterpart, and thus a lossless audio coder can be implemented simply by cascading the integer transform with an entropy coder.

The IntMDCT is obtained by completely decomposing the MDCT into a cascade of a Windowing and Time Domain Aliasing (TDA) operation and a Discrete Cosine Transform of type IV (DCT-IV) [16], as shown in Fig. 1. The TDA operation combines each pair of input samples from the overlapping region of two succeeding blocks through a $2 \times 2$ matrix built from the window coefficients,

$$\begin{bmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{bmatrix} = \begin{bmatrix} w_a & w_b \\ -w_b & w_a \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{5}$$

where $w_a$ and $w_b$ denote the two window values applied to the sample pair $(x_1, x_2)$. Since the PR constraints (2) and (3) imply $w_a^2 + w_b^2 = 1$, the matrix in (5) is in fact a Givens rotation

$$\begin{bmatrix} \cos\alpha_k & \sin\alpha_k \\ -\sin\alpha_k & \cos\alpha_k \end{bmatrix}$$

and for different $k$ the angle $\alpha_k$ is given by

$$\alpha_k = \arctan\frac{w_b}{w_a}. \tag{6}$$

The Givens rotations in the TDA operation are then implemented with the lifting scheme by, firstly, decomposing each rotation into three lifting steps as follows:

$$\begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \dfrac{\cos\alpha - 1}{\sin\alpha} & 1 \end{bmatrix} \begin{bmatrix} 1 & \sin\alpha \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \dfrac{\cos\alpha - 1}{\sin\alpha} & 1 \end{bmatrix}. \tag{7}$$

In each of the lifting steps, a rounding operation $[\cdot]$ is then used to obtain a reversible integer-to-integer mapping. For example, the second lifting step can be implemented as

$$y_1 = x_1 + \big[\sin\alpha \cdot x_2\big], \qquad y_2 = x_2 \tag{8}$$

where the input integer pair $(x_1, x_2)$ is mapped into the integer pair $(y_1, y_2)$ after the lifting step. Clearly, the above lifting step can be losslessly inverted by

$$x_1 = y_1 - \big[\sin\alpha \cdot y_2\big], \qquad x_2 = y_2. \tag{9}$$
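To make the lifting construction concrete, the following sketch (the function names are ours, and it assumes sin α ≠ 0) applies one Givens rotation to an integer pair through the three rounded lifting steps of (7)–(8) and then inverts it exactly as in (9):

```python
from math import cos, sin

def lifting_rotate(x1, x2, alpha):
    """Integer approximation of [[cos a, sin a], [-sin a, cos a]] applied to (x1, x2)
    via the three lifting steps of (7), each rounded as in (8)."""
    c = (cos(alpha) - 1.0) / sin(alpha)   # lifting coefficient of steps 1 and 3
    s = sin(alpha)                        # lifting coefficient of step 2
    x2 = x2 + int(round(c * x1))          # step 1
    x1 = x1 + int(round(s * x2))          # step 2, cf. (8)
    x2 = x2 + int(round(c * x1))          # step 3
    return x1, x2

def lifting_rotate_inv(y1, y2, alpha):
    """Exact inverse, cf. (9): undo the lifting steps in reverse order."""
    c = (cos(alpha) - 1.0) / sin(alpha)
    s = sin(alpha)
    y2 = y2 - int(round(c * y1))          # undo step 3
    y1 = y1 - int(round(s * y2))          # undo step 2
    y2 = y2 - int(round(c * y1))          # undo step 1
    return y1, y2

# The rounding makes the forward mapping only an approximation of the true rotation,
# but the integer pair is always recovered exactly:
assert lifting_rotate_inv(*lifting_rotate(1234, -567, 0.3), 0.3) == (1234, -567)
```

Applying the same pattern to every Givens rotation of the TDA stage and of the DCT-IV factorization yields the reversible IntMDCT discussed next.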
A reversible integer-to-integer implementation of the Givens rotation is thus obtained. A decomposition of the DCT-IV into Givens rotations has been previously given in [21]; this can also be done via a fast algorithm of the DCT-IV through complex FFT [16]. The implementation of the IntMDCT is now straightforward: all the resulting Givens rotations are implemented with the lifting scheme described above. Clearly, due to the rounding operations used in the lifting steps, the outputs of the IntMDCT only approximate the MDCT outputs. As an example, Fig. 2 compares the MDCT and IntMDCT spectra for a 1 kHz integer sine wave sampled at 48 kHz. The length of the transforms is selected as 1024. It can be seen that the IntMDCT only introduces a noise component with amplitude ranging from 0 to 5, which is generally negligible in practical applications. We further note that refinements that improve the accuracy of the IntMDCT algorithm have recently been reported in [22], [23]. These include, notably, a multidimensional lifting scheme that significantly reduces the noise level of the integer DCT-IV algorithm, and correspondingly of the IntMDCT algorithm.

Fig. 2. Comparison of MDCT and IntMDCT spectrums (1024, sine window). The input signal is a 1 kHz integer sine wave signal sampled at 48 kHz. Top: absolute value of MDCT (solid) and IntMDCT (dotted) spectrums; middle: zoomed version around 1 kHz; bottom: absolute value of their difference.

B. Statistical Modeling of IntMDCT Coefficients for Audio

It is interesting to observe that many digital signals sampled from "natural" multimedia signals, such as speech, audio, image, and video, exhibit a peaky probability distribution: the larger the sample value, the smaller the chance that it occurs. In many situations, this peaky distribution can be effectively modeled with probability density functions (pdf) that exhibit similar properties, including the well-known Gaussian and Laplacian pdfs. It has been previously reported in [24] that the Laplacian pdf provides a good approximation for wavelet transformed audio. A more accurate model is also proposed in [25], based on the generalized Gaussian (GG) pdf defined as

$$p(x) = A(r, \beta)\, e^{-[\beta\,|x - \mu|]^{\,r}} \tag{10}$$

where $\beta = \frac{1}{\sigma}\sqrt{\Gamma(3/r)/\Gamma(1/r)}$, $A(r, \beta) = \frac{\beta r}{2\,\Gamma(1/r)}$, and the gamma function is defined as $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt$. Here, $\mu$ and $\sigma$ are the mean and the standard deviation of the random variable (RV), respectively. Compared with the Gaussian or Laplacian pdf, the GG pdf provides additional flexibility in controlling the shape and tail decay of the pdf by changing the value of the distribution parameter $r$, and it includes important pdfs, such as the Gaussian ($r = 2$), Laplacian ($r = 1$), and uniform ($r \to \infty$) pdfs, as special cases. In Fig. 3, we compare the histogram of IntMDCT coefficients taken from different frequency regions of a symphony audio sequence with the GG pdf. It can be observed that the GG pdf with parameter $r = 0.5$ or $r = 1$ provides a good approximation for the distribution of the IntMDCT coefficients.

To identify the best GG model for the distribution of IntMDCT coefficients, which possibly varies across frequency regions and audio signals, we further conduct the Kolmogorov-Smirnov (KS) goodness-of-fit test [26] to evaluate the deviation of the GG model from the distribution of the IntMDCT coefficients for a variety of audio signals. In our test, a total of 16 audio sequences that cover a wide range of music sources, such as instrumental, pop, rock, classical, and speech, are used; the sampling rate for those sequences is 48 kHz, and the duration of each sequence is 10 240 samples. The length of the IntMDCT transform is 1024. For each audio sequence, the KS test was performed on the IntMDCT coefficients from three different groups of the scale factor bands (sfb) [10] of the AAZ encoder, namely, the Low Band from sfb 1–17 (0–2.3 kHz), the Mid Band from sfb 31–34 (8.3–11.0 kHz), and the High Band from sfb 42–45 (16.5–19.3 kHz). The KS test statistics are summarized in Table I, which shows that in most cases the GG pdf with $r = 0.5$ provides the most accurate model for the Low Band, while for the Mid Band and High Band the best model is usually either the GG pdf with $r = 0.5$ or the Laplacian pdf ($r = 1$).
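As a quick reference for (10), the following sketch evaluates the GG pdf so that it can be overlaid on coefficient histograms as in Fig. 3. The helper name and the SciPy dependency are our own choices; the check that r = 2 reduces to the Gaussian case is a simple sanity test of the normalization.

```python
import numpy as np
from scipy.special import gamma

def gg_pdf(x, mu, sigma, r):
    """Generalized Gaussian pdf of (10) with mean mu, standard deviation sigma, shape r."""
    beta = np.sqrt(gamma(3.0 / r) / gamma(1.0 / r)) / sigma
    A = beta * r / (2.0 * gamma(1.0 / r))
    return A * np.exp(-(beta * np.abs(x - mu)) ** r)

# Sanity check: r = 2 coincides with the standard Gaussian density
x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(gg_pdf(x, 0.0, 1.0, 2.0), np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
```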
Although from the results of the KS test the Laplacian distribution is not always the most accurate model for IntMDCT transformed audio, it still remains very attractive in many situations due to its simplicity and relative accuracy. In the context of lossless audio coding, assuming the IntMDCT coefficient $c$ is rounded from a Laplacian source with pdf given by

$$p(c) = \frac{\lambda}{2}\, e^{-\lambda |c|} \tag{11}$$

it is straightforward to show that the amplitudes of the IntMDCT coefficients will be exponentially distributed with probability mass function (pmf) given by

$$P\big(|c| = n\big) = (1 - \theta)\,\theta^{\,n}, \qquad n = 0, 1, 2, \ldots \tag{12}$$

where $\theta = e^{-\lambda}$. For such a source, there exist many efficient yet low-complexity entropy codes, such as the Golomb code [14] or BPGC.

We further evaluate the redundancy due to the above simplification. Specifically, we note that the redundancy $R$, in bits per sample, as a result of coding a source with empirical distribution $\tilde{p}$ using a code that is optimized for a distribution $p$, in this case the exponential distribution, is bounded by the Kullback-Leibler divergence [27] between these two distributions

$$D(\tilde{p}\,\|\,p) = E_{\tilde{p}}\left[\log_2\frac{\tilde{p}(x)}{p(x)}\right] \tag{13}$$

where $E_{\tilde{p}}$ stands for expectation with respect to the distribution $\tilde{p}$. Consequently, the relative redundancy $\bar{R}$, which is obtained by normalizing $R$ with the empirical entropy of the source, $H(\tilde{p})$, cannot be less than

$$\frac{D(\tilde{p}\,\|\,p)}{H(\tilde{p})}. \tag{14}$$

Clearly, the bound above provides a good indication of the performance gap between the expected code length of the code and that of an optimal code for the given source. We measure the values of $\bar{R}$ for the IntMDCT coefficients at different frequency locations for the same audio sequences used in our previous test. The results are summarized in Fig. 4. These results show that the redundancy due to the suboptimal Laplacian assumption for the distribution of IntMDCT coefficients of audio is in fact trivial, smaller than 4% in most cases, and the average redundancy over all testing items is only around 2%. With this finding, we may conclude that replacing the Laplacian model with a more sophisticated one would lead to only insignificant improvement in coding efficiency for IntMDCT transformed audio, and hence it is difficult to justify such an effort, as it would lead to a very complicated code design.

Fig. 3. Comparison of the histogram of IntMDCT coefficients and the generalized Gaussian pdfs with r = 0.5 (solid), 1.0 (dashed), 1.5 (dotted), and 2.0 (dash-dot) for a symphony audio sequence. (Top: coefficients from 0–2.3 kHz; middle: 8.3–11.0 kHz; bottom: 16.5–19.3 kHz.)
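The relative redundancy bound of (13)–(14) can be estimated from data along the following lines. This is a sketch under our own choices: the function name is illustrative, and θ is fitted by matching the mean of the geometric pmf (12) to the empirical mean, a detail the text does not specify.

```python
import numpy as np

def relative_redundancy(amplitudes):
    """Estimate the bound (14): KL divergence (13) between the empirical pmf of the
    integer amplitudes and the geometric model (12), normalized by the empirical entropy."""
    a = np.asarray(amplitudes, dtype=np.int64)
    p_emp = np.bincount(a) / a.size                  # empirical pmf of |c|
    theta = a.mean() / (1.0 + a.mean())              # geometric pmf with the same mean
    n = np.arange(p_emp.size)
    p_mod = (1.0 - theta) * theta ** n               # model pmf (12)
    nz = p_emp > 0
    D = np.sum(p_emp[nz] * np.log2(p_emp[nz] / p_mod[nz]))   # divergence (13)
    H = -np.sum(p_emp[nz] * np.log2(p_emp[nz]))               # empirical entropy
    return D / H
```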
III. STRUCTURE OF AAZ

A. Overview

AAZ is designed to cover as much as possible the desirable features outlined in the Introduction for SLS audio coding, while at the same time maintaining a simple structure and low algorithmic complexity.
TABLE I. KS GOODNESS-OF-FIT TEST RESULTS FOR MODELING INTMDCT COEFFICIENTS OF AUDIO USING THE GENERALIZED GAUSSIAN PDF WITH DIFFERENT VALUES OF r. HERE, LOWER VALUES INDICATE A BETTER FIT
Fig. 4. Relative redundancy as a result of encoding IntMDCT transformed audio with the Laplacian distribution model for a variety of audio sequences.
To this end, AAZ adopts the IntMDCT based lossless coding approach, where the IntMDCT spectral data are coded with two complementary layers, namely, a core MPEG-4 AAC layer, which generates an AAC compliant bit-stream at a pre-defined bit-rate and constitutes the minimum rate/quality unit of the lossless bit-stream, and a lossless enhanced (LLE) layer, which makes use of a bit-plane coding method to produce the fine granular scalable to lossless portion of the lossless bit-stream. The structures of the AAZ encoder and decoder are shown in Fig. 5.

The core-layer AAC encoder in AAZ follows the informative AAC encoding specification described in [12], [15]. To simplify the encoder design, instead of using a separate floating-point MDCT as defined in [12] for the core-layer AAC encoder, we simply reuse the IntMDCT by properly normalizing its output to compensate for the difference between the gains of the two filterbanks. In addition, for coding of stereo channels, the IntMDCT coefficient pairs from the left (L) and right (R) channels are further rotated with a $\pi/4$ Givens rotation

$$\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$$

to generate the sum and difference signals if they are coded with joint stereo coding. Since the difference between the IntMDCT and the floating-point MDCT is very small compared with the quantization step size of an AAC encoder under normal working conditions, such a simplification does not significantly affect the quality of the core-layer AAC encoder.
Fig. 5. Structures of the AAZ encoder and decoder.
In order to efficiently embed the AAC core, AAZ adopts an error mapping process to remove from the IntMDCT spectral data the information that has already been coded in the core AAC bit-stream, so that only the resulting IntMDCT residuals are coded by the LLE encoder. Moreover, this error mapping process preserves in the IntMDCT residuals the probability distribution skew of the original IntMDCT coefficients, which are approximately Laplacian distributed, so that the residuals can be very efficiently coded by the bit-plane Golomb code (BPGC) used in the LLE layer. In Sections III-B–E we discuss the principles underlying the design of AAZ, and we refer interested readers to [10] for a comprehensive description of its implementation.
B. Error Mapping Process

In order to minimize the overhead of embedding the AAC core, AAZ uses an error mapping process to ensure that the information contained in the AAC core bit-stream and that contained in the LLE bit-stream are, optimally, highly complementary. To this end, we note that the AAC core encoder can be simplified as a quantization and coding process [12], where the IntMDCT spectral data are quantized, possibly with different quantization step sizes for different sfbs, and coded to produce the AAC bit-stream. Consequently, if the IntMDCT coefficients $c(k)$ from a particular sfb have been quantized at the AAC core encoder, we have the following property:

$$thr\big(|i(k)|\big) \le |c(k)| < thr\big(|i(k)| + 1\big). \tag{15}$$

In the above inequality, $i(k)$ is the quantized value of $c(k)$ from the AAC quantizer and $thr(\cdot)$ is the quantization threshold closer to zero [12], given by

$$thr(m) = \begin{cases} 0, & m = 0 \\ (m - 0.4054)^{4/3}\,\Delta_{sfb}, & m \ge 1 \end{cases} \tag{16}$$

with

$$\Delta_{sfb} = 2^{\frac{1}{4}\,sf_{sfb}}. \tag{17}$$

Here, $\Delta_{sfb}$ is the quantization step size of the sfb, $sf_{sfb}$ is the scale factor that determines the quantization step size for that sfb, and $0.4054$ is the "magic" number of the AAC quantizer.
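For illustration, the following sketch computes the threshold of (16)–(17) for an AAC-style power-law quantizer and checks the property (15) on one coefficient. The helper names are ours, and the plain 2^{sf/4} normalization follows the form assumed in (17); an actual AAC implementation also involves a fixed scale-factor offset.

```python
import math

MAGIC = 0.4054  # rounding offset of the AAC quantizer

def quantize(c, sf):
    """AAC-style quantizer index for coefficient c with scale factor sf."""
    delta = 2.0 ** (sf / 4.0)
    m = int((abs(c) / delta) ** 0.75 + MAGIC)
    return int(math.copysign(m, c))

def thr(m, sf):
    """Quantization threshold closer to zero, per (16)-(17); m is a nonnegative index."""
    if m == 0:
        return 0.0
    delta = 2.0 ** (sf / 4.0)
    return (m - MAGIC) ** (4.0 / 3.0) * delta

# Property (15): thr(|i|) <= |c| < thr(|i| + 1)
c, sf = 1000, 8
i = quantize(c, sf)
assert thr(abs(i), sf) <= abs(c) < thr(abs(i) + 1, sf)
```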
Fig. 6. Illustration of the error mapping process. Top: The IntMDCT spectrum and the corresponding quantization thresholds. Bottom: The resulting IntMDCT residual spectrum. For simplicity only the amplitude is shown here.
Clearly, instead of directly coding the IntMDCT coefficients $c(k)$ in the LLE layer encoder, an error mapping process can be employed here to remove the information that has already been coded in the AAC core bit-stream, and the LLE encoding process is then performed on the resulting residual signal $e(k)$. Intuitively, the error mapping process should produce a residual signal with the smallest entropy, which in most cases is simply the Minimum Mean Square Error (MMSE) one, obtained by subtracting from the IntMDCT coefficient $c(k)$ its MMSE reconstruction given $i(k)$, i.e., $e(k) = c(k) - E[c(k)\,|\,i(k)]$, where $E[\cdot]$ denotes the expectation operation. However, to fully utilize the property implied in (15) and the statistical properties of $c(k)$, AAZ adopts a somewhat counterintuitive approach, where the residual signal $e(k)$ is given by

$$e(k) = \begin{cases} c(k) - \operatorname{sgn}\big(i(k)\big)\,\big\lfloor thr\big(|i(k)|\big)\big\rfloor, & i(k) \ne 0 \\ c(k), & i(k) = 0 \end{cases} \qquad k = 0, \ldots, N-1. \tag{18}$$

Here, $\lfloor\cdot\rfloor$ is the flooring operation that rounds a floating-point value to its nearest integer with smaller amplitude, and $N$ is the length of the IntMDCT spectrum.

The error mapping process is illustrated in Fig. 6, where two different cases are given. In the first case, the IntMDCT coefficients are from an sfb that has been quantized and coded at the AAC core encoder (significant band). For these coefficients, the residual coefficients are obtained by subtracting the quantization thresholds from their corresponding $c(k)$, resulting in a residual spectrum with reduced amplitude. In the other case, the coefficients are from an sfb that is not coded in the AAC core encoder due to a lack of quantization bits (insignificant band). In this case, the residual spectrum is in fact the IntMDCT spectrum itself.

The advantages of the above error mapping process are twofold. First, for $c(k)$ that is already significant in the core layer, i.e., $i(k) \ne 0$, the sign of the IntMDCT residual $e(k)$ is identical to that of $c(k)$ and hence does not need to be coded. Secondly, as illustrated in Fig. 7, we notice that if the amplitude of $c(k)$ is exponentially distributed, then from the "memoryless" property of the exponential pmf the amplitude of $e(k)$ will also be approximately exponentially distributed (in fact strictly so¹ if the quantizer in the core encoder is uniform), and thus it can be very efficiently coded in the LLE layer by using BPGC.
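A minimal sketch of the mapping in (18) and its inverse at the decoder follows; the function names and the array-based interface are ours, and the per-coefficient thresholds thr(|i(k)|) are assumed to be supplied by the core-layer quantizer as in (16)–(17).

```python
import numpy as np

def error_map(c, i, thr_abs):
    """Residual mapping (18). c: integer IntMDCT coefficients; i: AAC core quantization
    indices; thr_abs: thr(|i[k]|) for each k (nonnegative floats)."""
    base = np.sign(i) * np.floor(thr_abs).astype(np.int64)   # floored threshold, with the sign of i
    return np.where(i != 0, c - base, c)                      # insignificant coefficients pass through

def error_unmap(e, i, thr_abs):
    """Decoder side: rebuild the IntMDCT coefficients from the residuals."""
    base = np.sign(i) * np.floor(thr_abs).astype(np.int64)
    return np.where(i != 0, e + base, e)

c = np.array([1000, -37, 5]); i = np.array([63, -4, 0]); t = np.array([994.1, 30.7, 0.0])
e = error_map(c, i, t)        # -> [6, -7, 5]: reduced amplitudes, signs kept where i != 0
assert np.array_equal(error_unmap(e, i, t), c)
```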
Fig. 7. Distribution of the residual signal from the error mapping process. Top: The pdf of the Laplacian distributed IntMDCT coefficient and the conditional pdf of the IntMDCT residual output by the error mapping process. Bottom: The unconditioned pdf of the amplitude of the IntMDCT residual. Uniform quantization in the core layer encoder is assumed here.
C. Bit-Plane Golomb Code

Bit-plane coding technology has been widely adopted in constructing FGS coding systems, including those for image coding [8], [28], video coding [29], and audio coding [30]. Consider an input data vector $\mathbf{v} = [v(0), v(1), \ldots, v(K-1)]^T$, where $K$ is the dimension of $\mathbf{v}$, extracted from a source over some alphabet, e.g., the integers. In a bit-plane coding scheme, each element $v(k)$ in $\mathbf{v}$ is first represented in bits, comprising a sign symbol

$$s(k) = \begin{cases} 0, & v(k) \ge 0 \\ 1, & v(k) < 0 \end{cases} \tag{19}$$

and bit-plane symbols $b_j(k) \in \{0, 1\}$, $j = M-1, \ldots, 0$. The value of $v(k)$ is then

$$|v(k)| = \sum_{j=0}^{M-1} b_j(k)\, 2^{\,j}. \tag{20}$$
¹Assume $|c[k]|$ is exponentially distributed with pmf $p(|c[k]|) = (1-\theta)\theta^{|c[k]|}$. If it is quantized with a midthread uniform quantizer with uniform step size $\Delta$, the conditional pmf of the amplitude of $e[k]$ given the quantizer output $i[k]$ is $p(|e[k]|\,|\,i[k]) = p\big(|c[k]| - |thr(i[k])| \,\big|\, 0 \le |c[k]| - |thr(i[k])| < \Delta\big)$. From the "memoryless" property of $p(|c[k]|)$, it is straightforward to have $p(|e[k]|\,|\,i[k]) = (1-\theta)\theta^{|e[k]|}\big/\big(1-\theta^{\Delta}\big)$ for $|e[k]| < \Delta$.
Here, $M$ is the maximum bit-plane number of $\mathbf{v}$; that is, $2^{M} > |v(k)|$ for all $k$. The bit-plane symbols are then scanned and coded from the Most Significant Bit (MSB) to the Least Significant Bit (LSB) for all the elements in $\mathbf{v}$, in a certain order, to produce the compressed bit-stream. Clearly, the bit-plane code can be taken as an Entropy Constrained Scalar Quantizer (ECSQ) [31] with successively refined quantization step sizes $2^{M-1}, 2^{M-2}, \ldots$, and it is at the same time a lossless entropy coder for an integer source if the finest quantization step size is 1. It can be seen that the bit-plane code provides a very natural approach to generating an FGS to lossless bit-stream, and a lossy bit-stream at any given intermediate rate can be easily obtained by performing a simple truncation operation on the lossless bit-stream.

Unlike most bit-plane coding technologies, which rely on adaptive arithmetic coding or a fixed frequency table to determine the frequency assignment used in coding the bit-plane symbols, BPGC uses a probability assignment rule that is derived from the statistical properties of a geometrically distributed source, where the probability that the bit-plane symbol at bit-plane $j$ equals one is assigned as

$$Q_j = \begin{cases} \dfrac{1}{1 + 2^{\,2^{\,j-L}}}, & j \ge L \\ \dfrac{1}{2}, & j < L. \end{cases} \tag{21}$$
Here, $L$ is the code parameter that indexes the family of BPGC codes. Each bit-plane symbol is then coded with an arithmetic coder using the probability assignment given by (21), except the sign symbols, which are simply coded with probability assignment 1/2. It is shown in [13] that, despite its simplicity, BPGC achieves excellent lossless compression performance for geometrically distributed integer sources, and its rate-distortion (R-D) performance is very close to that of an optimal ECSQ for such a source.

In the BPGC decoder, the bit-plane coding process described above is reversed to reconstruct the bit-planes of $\mathbf{v}$, and the lossless reconstruction of $\mathbf{v}$ is obtained if the full BPGC bit-stream is received so that all the bit-plane symbols of $\mathbf{v}$ are decoded. When the BPGC bit-stream is truncated to a lower rate, the MMSE reconstruction of $v(k)$ is obtained from the partially decoded bit-planes together with a fill element [13]; the fill element is given by a simple iteration, which is implemented in AAZ in a low-complexity manner by using a table lookup.

In AAZ, an optimal BPGC code parameter is selected for each sfb, using the algorithm of [13], from $N_s$ and $A_s$, the length and the absolute sum of the IntMDCT residual of that sfb, respectively. The IntMDCT residual signal is then coded with BPGC using the selected code parameter. For correct decoding in the AAZ decoder, the difference between the selected code parameter and the MSB of each sfb is Huffman coded and sent to the decoder as side information.
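The bit-plane representation of (19)–(20) and the BPGC frequency assignment of (21) can be illustrated with the following sketch (helper names are ours); a real encoder would feed these probabilities to an arithmetic coder, with sign symbols coded at probability 1/2 as described above.

```python
def bitplanes(v, M):
    """Sign symbol and bit-plane symbols of an integer v, per (19)-(20), MSB first."""
    sign = 0 if v >= 0 else 1
    bits = [(abs(v) >> j) & 1 for j in range(M - 1, -1, -1)]
    return sign, bits

def bpgc_p_one(j, L):
    """BPGC probability assignment (21): probability that the symbol at bit-plane j is 1,
    given the code parameter (lazy plane) L."""
    return 0.5 if j < L else 1.0 / (1.0 + 2.0 ** (2.0 ** (j - L)))

# Example: v = -13 with M = 5 planes -> sign 1, bit-planes [0, 1, 1, 0, 1]
sign, bits = bitplanes(-13, 5)
probs = [bpgc_p_one(j, L=2) for j in range(4, -1, -1)]   # per-plane P(bit = 1), MSB first
```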
D. Perceptual Embedded Coding Principle

As has been elaborated in Section I, one key feature of a scalable audio coding scheme is that the scalability should be perceptually meaningful; that is, better perceptual quality of the reconstructed audio should be obtained as the bit-rate of the scalable bit-stream increases. This feature is particularly relevant when the AAC core is working at rates lower than those required for perceptually transparent quality, where the scalability in perceptual quality is easily appreciated by the end users. It is, however, also relevant when the AAC core is working at higher rates, for which the scalability acts more on the coding margin, which is important for high-end applications or applications where the reconstructed audio is to be post-processed.

In AAZ, the scalability in perceptual quality is achieved by adopting a very straightforward perceptual embedded coding principle, which is illustrated in Fig. 8. BPGC coding starts from the most significant bit-planes (i.e., the first nonzero bit-planes) of all the sfbs, and progressively moves to lower bit-planes after coding the current bit-plane of every sfb. Consequently, during this process, the energy of the quantization noise of each sfb is gradually reduced by the same amount. As a result, only the level of the quantization noise is reduced, and its spectral shape, which has been perceptually optimized at the core AAC encoder, is preserved during bit-plane coding. As an example, Fig. 9 plots the maximum Noise to Mask Ratio (NMR) [32] as a function of the audio frame for a piece of symphony music decoded from AAZ bit-streams with different LLE bit-rates. Since the NMR represents the distance between the coding noise and the JND, a smaller NMR value generally indicates better perceptual quality in the decoded audio. Evidently, as can be seen from Fig. 9, the LLE layer improves the perceptual quality of the AAC core bit-stream, and higher bit-rates in the LLE layer always result in a smaller MaxNMR over all the audio frames.
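The scan order described above and sketched in Fig. 8 can be expressed as follows (our reading of the figure): every sfb starts at its own most significant nonzero bit-plane, and one full pass over all sfbs is completed before moving one plane lower, so each pass lowers the quantization noise of every sfb by the same amount.

```python
def perceptual_scan_order(msb_per_sfb):
    """Return the (sfb, bit-plane) coding order of the LLE layer.
    msb_per_sfb[s] is the most significant (first nonzero) bit-plane of sfb s."""
    order, depth = [], 0
    while any(m - depth >= 0 for m in msb_per_sfb):
        for s, m in enumerate(msb_per_sfb):
            plane = m - depth
            if plane >= 0:
                order.append((s, plane))
        depth += 1
    return order

# e.g. three sfbs whose residuals peak at planes 4, 2 and 3:
# [(0,4), (1,2), (2,3), (0,3), (1,1), (2,2), (0,2), (1,0), (2,1), (0,1), (2,0), (0,0)]
print(perceptual_scan_order([4, 2, 3]))
```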
E. Noncore Mode

AAZ provides a noncore mode for applications that require only lossless or near-lossless quality, which is achieved by simply disabling the AAC core in AAZ. It is found in our tests that the lossless compression performance of AAZ is improved by 2%–4% if the AAC core is not present.
TABLE II COMPRESSION RATIO OF AAZ (TOP: 48 kHz/16 BIT TESTING SET; MIDDLE: 48 kHz/24 BIT TESTING SET; BOTTOM: 96 kHz/24 BIT TESTING SET)
Fig. 8. Bit-plane coding order adopted in AAZ. The bit-plane coding starts from the MSB for all the sfbs, and progresses to lower bit-planes subsequently.
Fig. 9. MaxNMR as a function of audio frames (Symphony, 48 kHz/16 bits/sample).
However, the scalability of the noncore mode bit-stream is now in terms of MSE instead of perceptual quality, due to the absence of a core bit-stream from AAC, and it is no longer suitable for low bit-rate applications since the perceptual quality of the reconstructed audio would not be acceptable.

IV. EXPERIMENTAL RESULTS

We first evaluate the lossless compression performance of AAZ using the audio testing sets from the MPEG lossless audio coding task group [33]. These testing sets comprise audio sequences of three different sampling rate/word length combinations, namely, 48/16 (kHz/bit), 48/24, and 96/24, which also represent the most popular sampling rate/word length combinations used in digital audio formats for today's home theatre systems and professional audio systems. Each testing set includes audio sequences of a variety of music styles, such as vocal, instrumental, jazz, and classical music. In our tests, the core AAC encoder is working at the bit-rates for "almost perceptually transparent quality," that is, 64 kb/s
per channel for the 48 kHz testing sets and 80 kb/s per channel for the 96 kHz testing set. In addition, the AAZ noncore mode encoder is also included in our tests. For comparison, the latest version of Monkey's Audio, 3.97 (MA397) [34], a popular open-source lossless audio codec that delivers state-of-the-art compression performance [35], is used as the benchmark in our tests. It should be noted that Monkey's Audio is a lossless-only audio coder and does not provide any scalability function. The comparison results are listed in Table II.
TABLE III. ODG RESULTS (−4 IS ANNOYING, 0 IS IMPERCEPTIBLE) FOR AAZ AND THE REFERENCE MPEG-4 AAC ENCODER. THE BIT-RATES SHOWN FOR AAZ INCLUDE THE BIT-RATES FOR THE AAC CORE BIT-STREAM + THOSE FOR THE LLE BIT-STREAM
It can be seen that for most items the compression ratios of the AAZ noncore mode encoder and of MA397 are quite close, and the difference between them is smaller than 4% for the 48/16 testing set and 1% for the 48/24 and 96/24 testing sets. The results listed in Table II also show that, despite the abundant functionalities of AAZ, if the AAC core encoder is present, the overhead of embedding such a core in terms of compression ratio is only around 4.5%, 2.5%, and 3.0% for the 48/16, 48/24, and 96/24 testing sets, respectively, which is quite trivial considering the availability of large storage capacity in today's multimedia applications.

The perceptual quality of the lossy audio reconstruction when AAZ is working at intermediate rates is evaluated with the ITU-R BS.1387 PEAQ (Perceptual Evaluation of Audio Quality) test, basic model [36]. The PEAQ test makes use of a computational model of the human auditory system to compare the perceptual difference between the processed (coded) audio signal and the original one, and outputs an Objective Difference Grade (ODG) value that mimics the listening test rating. The grading scale of the ODG ranges from −4 ("very annoying") to 0 ("imperceptible difference"). Although it is still unable to replace listening tests, which are believed to be the only reliable method to evaluate the perceptual quality of audio coders at this moment, PEAQ serves as a very useful tool for such a task, as it is much less expensive, and the results from a PEAQ test are in fact highly correlated with those from a listening test [37].

We evaluate the perceptual performance of AAZ at bit-rates of 64, 96, and 128 kb/s/channel on the 48/16 testing set. The AAC core encoder is working at 64 kb/s/channel. For comparison, a standalone MPEG-4 AAC codec is also included in our test, working at 96 and 128 kb/s/channel. Note that this standalone AAC codec, as well as the core AAC encoder embedded in AAZ, is adopted from the publicly available ISO/IEC MPEG-4 audio reference software [38], which may not deliver the best perceptual quality achievable for MPEG-4 AAC. For this reason, the evaluation results presented here serve only to demonstrate the perceptual scalability of AAZ. The PEAQ results from our tests are listed in Table III. Evidently, the LLE layer bit-stream successfully improves the perceptual quality of the core-layer AAC bit-stream in a gradual manner: the higher the bit-rate of the LLE bit-stream, the better the perceptual quality of the final reconstructed audio.
V. CONCLUSION

The transform domain scalable to lossless audio coding technology is a new member of the family of audio compression techniques, recently introduced along with the standardization process of the MPEG-4 Audio SLS work. This newly introduced technology makes it possible to integrate the functionalities of lossy audio coding, lossless audio coding, and scalable audio coding in a single coding framework. In this paper, we have presented the design and the underlying principles of Advanced Audio Zip (AAZ), an IntMDCT based scalable to lossless audio coder that was selected as the Reference Model for MPEG-4 audio SLS. AAZ maintains all the desirable features that are inherent in its transform coding based design, while achieving reasonable lossless compression performance. The abundant features of AAZ, which include fine granular scalability, support for both lossy and lossless audio coding, and backward compatibility to existing MPEG-4 AAC, distinguish it from other lossless audio codecs. It is envisioned that AAZ can be used as a universal digital audio format for the UMA paradigm, where multimedia contents can be delivered to anyone, anywhere, at any desired quality through a heterogeneous access network.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments, which helped to improve the clarity and quality of the paper.

REFERENCES

[1] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 88, no. 4, pp. 451–513, Apr. 2000. [2] M. Hans and R. W. Schafer, "Lossless compression of digital audio," IEEE Signal Processing Mag., no. 4, pp. 21–32, Jul. 2001. [3] R. Mohan, J. R. Smith, and C.-S. Li, "Adapting multimedia internet content for universal access," IEEE Trans. Multimedia, vol. 1, no. 1, pp. 104–114, Mar. 1999.
[4] T. Moriya, N. Iwakami, A. Jin, and T. Mori, “A design of lossy and lossless scalable audio coding,” in Proc. ICASSP, 2000, pp. 889–892. [5] M. Raad, A. Mertins, and I. Burnet, “Scalable to lossless audio compression based on perceptual set partitioning in hierarchical trees (PSPIHT),” in Proc. ICASSP, 2003, pp. 624–627. [6] R. Geiger, J. Herre, G. Schuller, and T. Sporer, “Fine grain scalable perceptual and lossless audio coding based on INTMDCT,” in Proc. ICASSP, 2003, pp. 445–448. [7] N. Iwakami, T. Moriya, and S. Miki, “High-quality audio coding at less than 64 kbit/s by using twinVQ,” in Proc. ICASSP, 1995, pp. 937–940. [8] A. Said and W. A. Pearlman, “A new, fast, and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, Jun. 1996. [9] R. Geiger, T. Sporer, and J. Koller, “Audio coding based on integer transforms,” in 111th AES Convention preprint 5471, New York, Sep. 2001. [10] “Workplan for audio scalable lossless coding (SLS),”, Trondheim, Norway, ISO/IEC JTC1/SC29/WG11, MPEG2003/N5720. [11] R. Yu, X. Lin, S. Rahardja, and H. Huang, “Technical Description of I2R’s Proposal for MPEG-4 Audio Scalable Lossless Coding (SLS): Advanced Audio Zip (AAZ),”, Brisbane, Australia, ISO/IEC JTC1/SC29/WG11, MPEG2003/M10035, 2003. [12] Coding of Audiovisual Objects, Part 3. Audio, Subpart 4 Time/Frequency Coding, ISO/IEC JTC1/SC29/WG11, Int. Std. 14496-3, 1999. [13] R. Yu, C. C. Ko, S. Rahardja, and X. Lin, “Bit-plane golomb code for sources with laplacian distributions,” in Proc. ICASSP, 2003, pp. 277–280. [14] S. Golomb, “Run-length encodings,” IEEE Trans. Inform. Theory , vol. IT-12, pp. 399–401, 1966. [15] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, and M. Dietz, “ISO/IEC MPEG-2 advanced audio coding,” J. Audio Eng. Soc., pp. 789–813, Oct. 1997. [16] H. S. Malvar, Signal Processing With Lapped Transforms. Norwood, MA: Artech House, 1992. [17] J. Princen, A. Johnson, and A. Bradley, “Subband/transform coding using filter bank designs based on time domain aliasing cancellation,” in Proc. ICASSP, 1987, pp. 2161–2164. [18] “Information technology—Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s—Part3: Audio,”, ISO/IEC JTC1/SC29/WG11, IS 11172-3, 1992. [19] M. Davis, “The AC-3 multichannel coder,” in Proc. 95th Conv. Aud. Eng. Soc., Oct. 1993, preprint 3774. [20] M. Purat, T. Liebchen, and P. Noll, “Lossless transform coding of audio signals,” in 102nd AES Convention Preprint 4414, Mar. 1997. [21] Z. Wang, “Fast algorithms for the discrete W transform and for the discrete fourier transform,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, no. 4, pp. 803–816, Aug. 1984. [22] R. Geiger, Y. Yokotani, and G. Schuller, “Improved integer transforms for lossless audio coding,” in Proc. 37th Asilomar Conf. Signals, Systems and Computers, 2003. [23] H. Huang, S. Rahardja, X. Lin, and R. Yu, “A fast algorithm of integer MDCT for lossless audio coding,” in Proc. ICASSP 2004. [24] P. Philippe, F. M. de Saint-Martin, and M. Lever, “Wavelet packet filterbanks for low time delay audio coding,” IEEE Trans. Speech Audio Process., vol. 7, no. 3, pp. 310–322, May 1999. [25] N. R. Reyes, “On the coding gain of dynamic Huffman coding applied to a wavelet-based perceptual audio coder,” in Proc. IEEE Nordic Signal Processing Symp. Sweden, Jun. 2000. [26] K. Brownlee, Statistics Theory and Methodology in Science & Engineering. 
New York: Wiley, 1961. [27] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991. [28] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Tran. Image Process., vol. 9, no. 7, pp. 1158–1170, Jul. 2000. [29] W. Li, “Streaming video profile in MPEG-4,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 3, pp. 301–317, 2001. [30] S. H. Park, Y. B. Kim, and Y. S. Seo, “Multi-layer bit-sliced bit-rate scalable audio coder,” 103rd AES Convention Preprint 4520, 1997. [31] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Application to Speech and Video. Englewood Cliffs, NJ: Prentice-Hall, 1990. [32] B. Beaton et al., “Objective perceptual measurement of audio quality,” in Collected Papers on Digital Audio Bit-rate Reduction. New York: AES, 1996, pp. 126–152. [33] “Call for Proposals on MPEG-4 Lossless Audio Coding,”, Shanghai, China, ISO/IEC JTC1/SC29/WG11 (MPEG), N5040, 2002. [34] Monkey’s Audio, Open Source Lossless Audio Codec. [Online]. Available: http://www.monkeysaudio.com [35] SourceForge. Comparison of Lossless Audio Coders. [Online]. Available: http://flac.sourceforge.net/comparison.html
[36] "Methods for Objective Measurements of Perceptual Audio Quality," International Telecommunications Union, Geneva, Switzerland, ITU-R Rec. BS-1387, 1999. [37] W. C. Treurniet and G. A. Soulodre, "Evaluation of the ITU-R objective audio quality measurement method," J. Audio Eng. Soc., vol. 48, no. 3, pp. 164–173, Mar. 2000. [38] ISO/IEC. MPEG-4 Reference Software (ISO/IEC 14 496-5:2001). [Online]. Available: http://www.iso.ch/iso/en/ittf/PubliclyAvailableStandards/

Rongshan Yu (M'01) received the B.Eng. degree from Shanghai Jiaotong University, Shanghai, China, and the M.Eng. degree from the National University of Singapore, Singapore, in 1995 and 2000, respectively. He was with the Centre for Signal Processing, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, from 1999 to 2001. He is presently Head of the Audio Processing Lab in the Institute for Infocomm Research (I2R), A*STAR, Singapore. His recent research interests include audio coding, data compression, and digital signal processing. Mr. Yu received the Tan Kah Kee Young Inventors' GOLD Award (Open Category) in 2003. He is the editor of the MPEG standard ISO/IEC 14496-3/AMD 5, Audio Scalable to Lossless Coding.

Susanto Rahardja received the B.Eng. degree in electrical engineering from the National University of Singapore (NUS) and the M.Eng. and Ph.D. degrees from Nanyang Technological University (NTU), Singapore. His research interests include audio signal processing, video processing, binary and multiple-valued logic synthesis, and digital communication systems, in which areas he has published more than 140 internationally refereed journal and conference papers. He is currently the Director of the Media Division in the Institute for Infocomm Research (I2R), where he leads 100 full-time researchers in media related research. He is a faculty member at the School of Electrical and Electronic Engineering, Nanyang Technological University, and cofounder of AMIK and STMIK Raharja Informatika, an institute of higher learning in Tangerang, Indonesia. Dr. Rahardja was the recipient of the IEE Hartree Premium Award for the best journal paper published in IEE Proceedings in 2001. In 2003, he received the Tan Kah Kee Young Inventors' Gold Award (Open Category), which is the second Gold award ever given in that category. He is a Technical Committee member of Visual Signal Processing and Communications of IEEE ISCAS and a Technical Program Committee member of the International Workshop on Multimedia Signal Processing. He has served as a session chair at many international conferences, as well as a reviewer for many IEEE and IEE transactions and journals.

Lin Xiao received the Ph.D. degree from the Department of Electronics and Computer Science, University of Southampton, Southampton, U.K., in 1993. He worked at the Centre for Signal Processing for five years as a Researcher and Manager on multimedia programs. He then worked for DeSOC Technology as a Technical Director, where he contributed to a VoIP solution, speech packet loss concealment for Bluetooth, and WCDMA baseband SOC development. He joined the Institute for Infocomm Research in July 2002, where he is now a Research Manager in charge of multimedia signal processing areas. He actively participates in the MPEG, JPEG2000, JVT, and AVS international standards.

Chi Chung Ko (M'82–SM'93) received the B.Sc. (1st Class Honors) and Ph.D. degrees in electrical engineering from Loughborough University of Technology, Loughborough, U.K.
In 1982, he joined the Department of Electrical and Computer Engineering, National University of Singapore, where he is presently an Associate Professor and Deputy Head (Research) of the Department. His current research interests include digital signal processing, speech processing, adaptive arrays, and computer networks. He has written over 100 technical publications in these areas. Dr. Ko is an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING.