
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 1, JANUARY 2008

Frequency Region-Based Prioritized Bit-Plane Coding for Scalable Audio

Te Li, Student Member, IEEE, Susanto Rahardja, Senior Member, IEEE, and Soo Ngee Koh, Senior Member, IEEE

Abstract—A perceptually enhanced prioritized bit-plane audio coding algorithm is presented in this paper. According to the energy distribution in different frequency regions, the bit-planes are prioritized with optimized parameters. Based on the statistical modeling of the frequency spectrum, a much simplified implementation of prioritized bit-plane coding is integrated with the recently released MPEG-4 scalable lossless (SLS) audio coding structure by replacing the sequential bit-plane coding in the enhancement layer. With zero extra side information and only trivial added complexity and modification to the original SLS structure, extensive experimental results show that the perceptual quality of SLS in noncore mode and at very low core bit-rates is improved significantly in a wide range of bit-rate combinations. Fully scalable audio coding up to lossless with much enhanced perceptual quality is thus achieved.

Index Terms—Bit-plane coding, scalable audio coding (SAC).

I. INTRODUCTION

SCALABLE AUDIO CODING (SAC) allows an encoder to compress data at a high/lossless bit-rate and a receiver to decode the compressed signal at different fidelity levels according to the specific applications and user context requirements as well as available device capabilities. This scalability is very important for the co- and interoperability of today's myriad of multimedia applications spanning various media platforms, including cell phone networks, local and wide area computer networks, and television and radio broadcasting networks, each one often requiring a different range of coding bit-rate. Moreover, for heterogeneous networks such as the Internet, audio coding with fine granular scalability (FGS) maintains uninterrupted service in the often highly congested and unstable channel traffic while making efficient use of the channel bandwidth. Scalability is also very useful in archiving, where all the data are stored in a single, uniform format with high/lossless bit-rate, rather than with varying encoding bit-rates. The essential idea of the initial approaches [1]–[7] in SAC is basically a structure with a perceptual or speech core layer and several enhancement layers in terms of higher sampling rate or

Manuscript received March 30, 2007; revised August 8, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sylvain Marchand. T. Li and S. Rahardja are with the Institute for Infocomm Research (I²R), Singapore 119613, and also with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]). S. N. Koh is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: esnkoh@ntu.edu.sg). Digital Object Identifier 10.1109/TASL.2007.910777

finer quantization. The first approach to SAC was introduced in [1] and later implemented in [2], with a combined structure of three perceptual codecs operating at different sampling frequencies and bit-rates. Specifically, the inputs of the second and third stages are the residual signals between the input and output of the previous stage. This was further improved in [3] by adopting a speech core coder and up-sampled perceptual enhancement layers with a frequency-selective switch. This algorithm was later standardized in [8] as the scalable advanced audio coding (AAC) tool. Besides signal-to-noise ratio (SNR) and bandwidth, scalability in the number of channels is implemented in [4]. With the development of transform domain weighted interleave vector quantization (TWIN-VQ) in [5], a TWIN-VQ coder is used as the core coder in [6] and [7]. All these scalable coders are characterized by coarse granular scalability.

FGS, where increases in coding quality can be achieved with small increments in bit-rate, is desirable in numerous applications. The most common way of implementing FGS in audio coding is through bit-plane coding. There are other approaches, for example, those that use a two-dimensional transform [9], where the data is progressively ordered in the bitstream according to perceptual relevance. The state-of-the-art FGS scalable audio codec is bit-sliced arithmetic coding (BSAC) [10], which adopts bit-plane coding on quantized spectral data. Scalability of 1 kbps per channel can be achieved with sufficient coverage of the statistics of bit-slices through arithmetic coding models. As a result of the increased computational complexity arising from an increasing number of layers, BSAC is mainly used for scalability in a relatively limited bit-rate range such as 16 to 64 kbps.
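To make the idea of fine granular scalability through bit-plane coding concrete, the following minimal Python sketch splits integer magnitudes into bit-planes, most significant first, and rebuilds them from only the planes received so far. It is an illustration under simplifying assumptions: the function names `to_bit_planes` and `from_bit_planes` are hypothetical, signs and entropy coding are ignored, and it is not the BSAC or SLS algorithm.

```python
def to_bit_planes(magnitudes, num_planes):
    """Return bit-planes, MSB first; each plane holds one bit per coefficient."""
    planes = []
    for shift in range(num_planes - 1, -1, -1):
        planes.append([(m >> shift) & 1 for m in magnitudes])
    return planes


def from_bit_planes(planes, num_planes):
    """Rebuild magnitudes from the planes received so far (MSB first)."""
    count = len(planes[0]) if planes else 0
    values = [0] * count
    for index, plane in enumerate(planes):
        shift = num_planes - 1 - index
        for k, bit in enumerate(plane):
            values[k] |= bit << shift
    return values


magnitudes = [13, 6, 1, 0]                            # hypothetical coefficient magnitudes
planes = to_bit_planes(magnitudes, num_planes=4)
coarse = from_bit_planes(planes[:2], num_planes=4)    # bitstream truncated after 2 planes
exact = from_bit_planes(planes, num_planes=4)         # all planes received
```

Truncating the list of planes at any point still yields a valid, coarser reconstruction, which is the property that makes the bitstream scalable in small steps.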
With the belief that more efficient bit-plane coding can be achieved by assigning different priorities to the bit-planes, numerous approaches have been proposed to implement the idea of prioritized bit-plane coding. These approaches can be grouped into three basic categories. The first category comprises the approaches inspired by tree-based significance mapping techniques, including the embedded zerotree wavelet (EZW) [11] and set partitioning in hierarchical trees (SPIHT) [12] in wavelet image compression. In combination with wavelet packet transform-based audio coding [13]–[15], EZW and SPIHT are applied to achieve the scalable audio coding format in [16]–[18]. Besides the tree-based structure, the psychoacoustic information of a signal is another attractive characteristic that can be used to enhance the performance of bit-plane coding. Such information is utilized in prioritized bit-plane coding approaches of the second category. Here, the priorities of bit-plane coefficients are assigned according to the psychoacoustic information [19], as done in [20] where the scalefactors from the AAC quantization are used as a guide to

1558-7916/$25.00 © 2007 IEEE

LI et al.: FREQUENCY REGION-BASED PRIORITIZED BIT-PLANE CODING FOR SCALABLE AUDIO

coding orders. The third group is composed of approaches that combine both psychoacoustic information and the significance map structure, as have been adopted in [21]–[23].

The scalable audio coding schemes mentioned above mainly focus on the scalability of low bit-rate perceptual audio. With the proliferation of broadband access and the continual decline of storage costs, there is a trend toward providing consumers with a rich media experience of extremely high fidelity. In the realm of audio, this is achieved by employing lossless formats with high resolutions and/or high sampling rates. It is in this context that the ISO/MPEG Audio Standardization Group began exploring technologies for lossless and near lossless coding of audio signals by issuing a call for proposals (CFP) for relevant technology in 2002 [24]. A key output of this CFP, the MPEG-4 scalable lossless (SLS) coding [25], was recently released as a standard audio coding tool. It allows the scaling up of a perceptually coded representation such as MPEG-4 AAC [8] to a fully lossless representation with a wide range of intermediate bit-rate representations.

The structure of SLS combines the scalable coding approaches of the "perceptual core + enhancement layers" scheme and bit-plane coding to achieve backward compatibility with the MPEG perceptual audio coder and FGS. Basically, the core layer in SLS can be the MPEG AAC codec or another perceptual scalable codec such as scalable AAC and BSAC. The scalable enhancements are achieved through sequential bit-plane coding of the residual signals between the original and the AAC encoded spectrum. This scalable coding scheme works efficiently when the bit-rate of the core layer is adequate. In this scenario, the quantization noise is well below the psychoacoustic mask, and the perceptual information is mostly preserved in the AAC coded bitstream.
For some cases where full scalability or near-full scalability are desired, the bit-rate occupied by the perceptual core may be zero (noncore mode of SLS) or very low. Due to the lack of a perceptual coder, the spectral shape of the residual signal for bit-plane coding roughly follows the shape of the original signal spectrum. This is far from the optimum for sequential bit-plane coding and will directly result in nonoptimal perceptual quality of output audio at intermediate bit-rates. To improve on this, it was mentioned in [26] that psychoacoustic information like just noticeable distortion (JND) can be applied in the bit-plane coding process for a perceptually more efficient enhancement coding. This approach, along with a certain degree of quality improvement, comes with a considerable amount of side information which may cause the degradation of coding efficiency at low bit-rates. As the perceptual quality of fully scalable codec such as noncore SLS is still far behind the quality that can be achieved by perceptual audio coder at the same bit-rate, the prioritized bit-plane coding is proposed in this paper. The main purpose is to efficiently improve the quality of fully scalable audio with the least possible extra side information and modification to the standardized codec. Therefore, the proposed method is different from existing approaches in that it involves a combination of two features. First, the priorities are assigned according to the energy distribution and thus eliminating any psychoacoustic side information attachment. Second, these priorities are assigned to


the bit-planes of several frequency regions or groups of scalefactor bands (sfbs) only. These considerations are regarded as important. When the priorities are assigned to the frequency regions instead of a single bit/coefficient/sfb, without computation or attachment of psychoacoustic information, a system can indeed be very simple in both its structure and computation. In fact, in its simplest version, as little as 1 bit per frame of side information is present. Yet, this priority assignment scheme based on pure energy distribution is shown to be perceptually efficient. The test results indicate that even with the simplest version of the proposed system, a considerable amount of perceptual improvement can be achieved for the noncore mode of SLS or SLS with low core bit-rate in a wide range of lossy bit-rates. It should be noted that this low-complexity enhancement algorithm is not meant to replace the perceptual coder; rather, it is meant to improve the performance of fully or near-fully scalable coders where the bit-rate for the perceptual core is limited.

The rest of this paper is organized as follows. Section II gives an overview of the MPEG-4 SLS, including the perceptual performance evaluation at different combinations of intermediate bit-rates and related discussions. The proposed frequency region-based prioritized bit-plane coding is formulated and solved with a simplified solution in Section III. These are followed in Section IV by extensive test results from the implementation of the enhanced SLS structure with the proposed system.

II. MPEG-4 SCALABLE LOSSLESS CODING

A review of the structure of the MPEG-4 SLS reference model (RM) is provided in the first part of this section. It is followed by a discussion on the performance of SLS in terms of perceptual quality at intermediate bit-rates, especially in cases with low core bit-rates.

A. General Structure of SLS

Like many common scalable codecs, SLS comprises two separate layers: the core layer and the lossless enhancement (LLE) layer. As depicted in Fig. 1, the core layer is simply an MPEG-4 AAC codec. This AAC core coder can be turned off in the noncore mode of SLS, where full scalability can be achieved through sequential bit-plane coding.

In the SLS encoder, the input audio in integer pulse code modulation (PCM) format is losslessly transformed into the frequency domain by using the integer modified discrete cosine transform (IntMDCT) [27], [28], which is a lossless integer-to-integer transform that approximates the normal MDCT transform. If the Mono IntMDCT is used for the left and the right channel, the integer Mid/Side (M/S) [29] processing has to be applied to the sfbs where the M/S flag is set to 1. The Stereo IntMDCT delivers by default an M/S spectrum. The resulting coefficients are then passed on to the AAC encoder to generate the core layer AAC bitstream. In the AAC encoder, transformed coefficients are first grouped into sfbs, the number of which is generally fixed at 49 for a sampling rate of 48 or 44.1 kHz. The coefficients are then quantized with a nonuniform quantizer, usually with different quantization steps in different sfbs to shape the quantization noise so that it can be best masked.


Fig. 1. Structure of SLS encoder and decoder.

In order to efficiently utilize the information of the spectral data that has been carried in the core layer bitstream, an error-mapping procedure is employed to generate the residual spectrum coded in the LLE layer. This is done by subtracting the AAC quantized spectrum from the original spectrum. For $k = 0, \ldots, N-1$, where $N$ is the dimension of IntMDCT, the residual signal $e[k]$ is computed by

$$e[k] = \begin{cases} c[k] - \lfloor \mathrm{thr}(i[k]) \rfloor, & i[k] \neq 0 \\ c[k], & i[k] = 0 \end{cases} \tag{1}$$

Here, $c[k]$ is the IntMDCT coefficient, $i[k]$ is the quantized data vector produced by the AAC quantizer, $\lfloor \cdot \rfloor$ is the flooring operation that rounds off a floating-point value to its nearest integer with smaller amplitude, and $\mathrm{thr}(i[k])$ is the low boundary (towards-zero side) of the quantization interval corresponding to $i[k]$.

Fig. 2. Bit-plane coding order (for bit-plane Golomb coding) adopted in SLS RM.

The residual spectrum is then coded using bit-plane Golomb coding (BPGC) [30] combined with context-based arithmetic coding (CBAC) and low-energy mode coding [31] to generate the scalable LLE layer bitstream. As illustrated in Fig. 2, the bit-plane coding is performed in a sequential order in which the plane of the most significant bit (MSB) for spectral data from the lowest sfb to the highest sfb is coded first. It is followed by the subsequent bit-planes until it reaches the plane of the least significant bit (LSB) for all sfbs. The following structural frequency assignment is used for arithmetic coding:

$$Q^{L}[j] = \begin{cases} \dfrac{1}{1 + 2^{2^{\,j-L}}}, & j \geq L \\[2ex] \dfrac{1}{2}, & j < L \end{cases} \tag{2}$$

where $j$ is the current bit-plane to be coded, and the lazy plane parameter $L$ can be selected using the adaptation rule [32]. Each bit-plane symbol is then coded with an arithmetic coder using the probability assignment given by $Q^{L}[j]$, except the sign symbols, which are simply coded with probability assignment 1/2. In SLS RM, BPGC enters the lazy mode coding from the fifth MSB bit-plane.

As the frequency assignment rule of BPGC is derived from the Laplacian probability density function, BPGC only delivers excellent compression performance when the sources are near-Laplacian distributed. However, for some music items, there exist some "silence" T/F regions where the spectral data are in fact dominated by the rounding errors of IntMDCT. In order to improve the coding efficiency, the low-energy mode coding is adopted for coding signals from low-energy regions. It is also possible to improve the coding efficiency of BPGC by further incorporating more sophisticated probability assignment rules that take into account the dependencies of the distribution of IntMDCT spectral data on several contexts, such as their frequency locations or the amplitudes of adjacent spectral lines, which can be effectively captured by using CBAC.

In the final step of the encoding process, the output of the LLE bitstream is multiplexed with the core AAC bitstream to produce the final SLS bitstream. More details of the SLS can be found in [32] and [33], and some basic knowledge of noiseless coding can be found in [34]–[37].
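The BPGC frequency assignment rule can be illustrated with a few lines of Python. This is a minimal sketch that assumes the commonly published form of the rule: the probability assigned to a '1' bit is 1/(1 + 2^(2^(j−L))) on bit-planes at or above the lazy plane L, and 1/2 below it. The function name `bpgc_probability`, the convention that a larger `j` means a more significant plane, and the example value of the lazy plane are assumptions of this sketch and are not taken from the SLS reference software.

```python
def bpgc_probability(j, lazy_plane):
    """Probability assigned to a '1' bit on bit-plane j (larger j = more significant)."""
    if j < lazy_plane:
        return 0.5                      # lazy planes: bits treated as equiprobable
    exponent = 2 ** (j - lazy_plane)
    if exponent > 1000:
        return 0.0                      # far below double precision; avoids float overflow
    return 1.0 / (1.0 + 2.0 ** exponent)


lazy_plane = 3                          # hypothetical lazy plane parameter
table = {j: bpgc_probability(j, lazy_plane) for j in range(7)}
```

Under this rule a '1' becomes rapidly less likely on the more significant planes, which matches the Laplacian assumption, while the planes below the lazy plane carry essentially incompressible bits.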


TABLE I TEST ITEMS FOR MPEG-4 SLS (48-KHZ/16-BIT STEREO)

Fig. 4. Residual energy spectrum of SLS for AAC core bit-rates at 0, 64, and 96 kbps.

Fig. 3. Perceptual quality performance of SLS at variable bit-rate combinations.

B. Perceptual Quality at Intermediate Bit-rates

With the structure described in the previous subsection, the evaluation in [32] shows that SLS can achieve lossless compression performance comparable to lossless-only codecs. In this subsection, the perceptual quality of SLS at intermediate bit-rates is evaluated and discussed. The configuration of the evaluation is as follows.

• The overall test bit-rate range is from 64 to 384 kbps (in total for stereo channels), with a step size of 32 kbps in the range of 64–256 kbps and 64 kbps in the range of 256–384 kbps.
• The total bit-rate comprises the core bit-rate and the enhancement (bit-plane coding) bit-rate.
• The perceptual quality of the decoded audio is measured as objective difference grade (ODG) scores using the OPERA voice/audio quality analyzer [38], which performs the ITU-R BS.1387 perceptual evaluation of audio quality (PEAQ) [39] test (the basic version). The PEAQ test makes use of a computational model of the human auditory system to compare the perceptual difference between the processed (coded) audio signal and the original one, and outputs the ODG score that mimics the listening test rating. The grading scale of ODG ranges from −4 ("very annoying") to 0 ("imperceptible difference").
• For each bit-rate combination, the ODG result is the average value of the scores for the 15 test sequences listed in Table I.

The evaluation result is shown in Fig. 3. It can be observed that with the same total bit-rate, SLS with an adequate core bit-rate can achieve relatively more efficient perceptual improvements compared to the performance of SLS

Fig. 5. Signal energy versus the noise-to-mask ratio for five frames of avemaria.wav by using SLS noncore coding.

at a core bit-rate of 64 kbps and in noncore mode. Thus, the remaining problem with the current structure is that the perceptual quality of SLS at low core bit-rates is still far from optimum for most of the audio sequences. The reason for this is clear and simple. In noncore mode or low core bit-rates, the residual spectrum only roughly tracks the spectrum of the original signal which is generally not as flat or as evenly distributed as the residual spectrum in high core bit-rates scenarios. Therefore, the sequential bit-plane coding as shown in Fig. 2 may not be the ideal solution for the residual spectrum generated from the low core bit-rates. This can be further illustrated by Figs. 4 and 5. Fig. 4 shows an example distribution of the residual energy between the original signal and AAC quantized signal in the frequency domain. The data are created from the 600th frame of avemaria.wav with a total of 49 sfbs per frame. The plot shows the residual energy at the AAC core bit-rates of 0, 64, and 96 kbps, respectively. In particular, the residual energy with 0 core bit-rate is actually the energy of the original signal itself. Compared with the case at 96 kbps, the residual energy at


the other two core bit-rates is apparently more concentrated in the low-frequency region of the spectrum.

Fig. 5 further plots the signal energy against the noise-to-mask ratio (NMR) of the frames numbered 100, 200, 300, 400, and 500 of avemaria.wav with a bit-plane coding bit-rate of 64 kbps in the noncore mode. The noise for NMR is computed as

$$N[s] = \sum_{k=\mathrm{offset}[s]}^{\mathrm{offset}[s+1]-1} \left( c[k] - \hat{c}[k] \right)^2 \tag{3}$$

where $\mathrm{offset}[s]$ denotes the offset (starting coefficient) of sfb $s$, $c[k]$ is the IntMDCT coefficient, and $\hat{c}[k]$ is the coefficient reconstructed by CBAC bit-plane coding. In addition, in the SLS RM encoder, the psychoacoustic mask in NMR is computed from $\mathrm{SMR}[s]$ by a piecewise rule whose branches are selected by SMR thresholds in dB (4), where $\mathrm{SMR}[s]$ denotes the signal-to-mask ratio computed from the psychoacoustic model of the AAC for sfb $s$, and the signal energy is computed as

$$E[s] = \sum_{k=\mathrm{offset}[s]}^{\mathrm{offset}[s+1]-1} c[k]^2 \tag{5}$$

Note that (4) is the default mask implementation in the MPEG-4 AAC. It is observed that by using only the sequential bit-plane coding, higher NMR values dominate when the signal energy is above a certain threshold.

Together, Figs. 4 and 5 clearly explain why the sequential bit-plane coding works inefficiently in low core bit-rate scenarios of SLS. In order to achieve the optimized rate-distortion, higher priorities should be assigned to the bit-planes of the frequency regions with a higher level of residual energy. The prioritized bit-plane coding algorithm proposed in the current work is based on the above observations.

Fig. 6. Region division of one frame.
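As a rough illustration of the quantities plotted in Fig. 5, the sketch below computes a per-sfb signal energy, noise energy, and NMR in dB. It assumes that the noise and signal energies are plain sums of squares over the coefficients of each sfb and that the mask energy is supplied externally by a psychoacoustic model; the mask rule of (4) is not reproduced. The helper names and all numbers are hypothetical.

```python
import math


def band_energy(values, offsets, band):
    """Sum of squares over one sfb; offsets[band] is the index of its first coefficient."""
    return sum(v * v for v in values[offsets[band]:offsets[band + 1]])


def nmr_db(original, reconstructed, mask_energy, offsets):
    """Noise-to-mask ratio in dB for every sfb (assumed definition: noise / mask)."""
    error = [c - r for c, r in zip(original, reconstructed)]
    ratios = []
    for band in range(len(offsets) - 1):
        noise = band_energy(error, offsets, band)
        # Guard against log(0) when a band is reconstructed exactly.
        ratios.append(10.0 * math.log10(max(noise, 1e-12) / mask_energy[band]))
    return ratios


original = [40, -32, 9, 5, -3, 2]        # toy IntMDCT coefficients
reconstructed = [32, -32, 8, 4, 0, 0]    # toy output of a truncated bit-plane decoder
offsets = [0, 2, 4, 6]                   # three toy sfbs of two coefficients each
mask_energy = [6.4, 2.0, 13.0]           # hypothetical mask energies per sfb
signal = [band_energy(original, offsets, b) for b in range(3)]
nmr = nmr_db(original, reconstructed, mask_energy, offsets)
```

In this toy frame the high-energy first band ends up with the largest NMR, which mirrors the observation that sequential coding leaves the most audible noise where the signal energy is highest.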

III. FREQUENCY REGION-BASED PRIORITIZED BIT-PLANE CODING

The idea of frequency region-based prioritized bit-plane coding is, as the name suggests, to divide the entire frequency spectrum into several regions and to assign these regions priorities according to their respective energy levels. In this section, the basic algorithm definitions are given first, followed by the formulation of the optimization problem. Finally, a series of simplification procedures and solutions are discussed.

A. Basic Algorithm

Fig. 6 depicts the spectrum of a particular frame $i$, where $I$ is the total number of frames. The whole spectrum of frame $i$, with a total of $S$ sfbs, is divided into $N$ regions, where $1 \le N \le S$. Each region is denoted by $r_n$, and $r_n$ includes a set of sfbs with

$$r_n = \{ s_n,\; s_n + 1,\; \ldots,\; s_{n+1} - 1 \} \tag{6}$$

Here, $s_n$ is the starting sfb of region $r_n$, and the last region extends to the highest sfb. The set of starting sfbs for each region is thus defined as

$$\mathbf{s} = \{ s_1, s_2, \ldots, s_N \} \tag{7}$$

The corresponding coding priority of $r_n$ is denoted by $p_n$, with a fixed value reserved for the lowest priority. We further define $\bar{E}_n$ as the average energy of $r_n$, and it is computed as

$$\bar{E}_n = \frac{1}{s_{n+1} - s_n} \sum_{s = s_n}^{s_{n+1} - 1} E[s] \tag{8}$$

where $E[s]$ is the residual energy of sfb $s$. Let the energy difference ratio between two regions $r_n$ and $r_m$ be defined by (9). The priorities are assigned according to the average energy as in (10), which involves the minimum energy difference threshold for frame $i$. It is assumed that the priority of the lower frequency region is always at least equal to that of the higher frequency region. With the above definitions and rules, the regions of frame $i$ can be sorted as a list in descending order of priority. The coding order of the bit-planes for each region can thus be determined accordingly.

Note that the coding priorities are only assigned to the top $B$ bit-planes, while the coding order of the subsequent bit-planes remains the same. For SLS, the top $B$ bit-planes are the nonlazy bit-planes. All the lazy bit-planes of each region follow the same coding order (as in Fig. 2) after all the nonlazy bit-planes are coded. The reason behind this is twofold. First, having more layers of prioritization involves higher complexity along with unwanted side information, both of which can be limited with this method. Second, it is redundant to assign priorities to lazy bit-planes because for


Fig. 7. Percentage of frames entering lazy-mode coding at noncore bit-rate of 384 kbps (see Table I for the full list of test sequences).

most audio sequences, near-transparent quality can already be achieved with a small percentage of lazy coding. This is evident in Fig. 7, which shows the percentage of frames that enter lazy-mode coding for the 15 standard test sequences at 384 kbps (noncore). It is observed that 13 out of 15 items have less than 30% of frames in lazy-mode coding at 384 kbps, while at this coding bit-rate the quality is almost transparent (refer to Fig. 3).

The specific coding order of the bit-planes for each region is dependent on the level of the energy difference ratio. Suppose that the number of nonlazy bit-planes in SLS is fixed at $B$ for a particular coding configuration. The MSB bit-plane of each sfb is followed by the subsequent bit-planes down to the last nonlazy one. Consider a set of energy difference threshold values, one for each level of priority. In particular, the smallest of these is the minimum threshold value which is used in (10), and the thresholds are further related by (11).

Consider the sfbs in two regions $r_n$ and $r_m$, as specified in (12). If condition (13) holds, the priority level difference $\Delta(n; m)$ between $r_n$ and $r_m$ is defined as 0. The corresponding coding sequence follows Rule 1, which is actually the sequential order coding as depicted in Fig. 2.

Rule 1: a bit-plane of $r_m$ will only be coded when the coding of the bit-planes of $r_n$ down to the same level is completed. The next bit-planes of $r_n$ will only be coded when the coding of that bit-plane of $r_m$ is completed.

Otherwise, if condition (14) holds, it is defined that $\Delta(n; m)$ is nonzero, and the coding will proceed as follows.

Rule 2: a run of bit-planes in region $r_m$ will only be coded when the coding of the corresponding run of bit-planes in $r_n$ is completed. The following run of bit-planes in region $r_n$ will only be coded when the coding of that run of bit-planes in $r_m$ is completed.

Rule 2 is demonstrated by an example as shown in Fig. 8, with $\Delta(n; m) = 2$ and $\Delta(n; m) = 3$.

Fig. 8. Bit-plane coding order for regions $r_n$ and $r_m$ with (a) $\Delta(n; m) = 2$ and (b) $\Delta(n; m) = 3$.

B. Parameter Optimization

With the basic algorithm described in the previous subsection, the relevant parameters should now be chosen to optimize the system's performance. For a particular frame, the multiple objective optimization problem (15) can be formulated in terms of a distortion function. For simplification, a few assumptions are made here. First, the available bit-rate is not considered as an influencing parameter. Second, the number of regions and the priority thresholds are fixed for all the frames with a particular profile of the proposed system.

The principal parameter of the proposed system is the total number of regions. The number of side information bits for a frame can be computed as (16)


which includes information on the starting sfb for each region, the priority list, and the coding sequence but ignores the number of regions and the values of the priority thresholds. Suppose that by using prioritized bit-plane coding, the distortion for a frame is improved by a certain amount compared to the sequential coding. This improvement can be achieved by using pure sequential coding if a number of extra bits are assigned to the frame. With fixed values of the region boundaries and the priority thresholds, the number of regions is optimized by (17). Next, with the optimized number of regions and fixed values of the priority thresholds, the set of region boundaries is in turn optimized by (18), where the mean energy difference ratio is computed as (19).

The third parameter, the set of priority thresholds, includes two subparameters: the number of threshold values and the values of the thresholds. Note that the optimized set does not necessarily include all the threshold values, as some are actually redundant. In this scenario, the set may include only several threshold values. It is assumed that all the thresholds satisfying (20) are redundant, as their values are very similar to that of a neighboring threshold, and these can be easily merged with it. Given the distortion improvement obtained when a bit-plane is coded, the values of the thresholds can then be determined when (21) is satisfied. By combining (21) and (19), the optimized values of the thresholds can be obtained.

The above optimization calculations, including (17) and (21), are not directly computable; however, their conditions do provide at least qualitative insights for the parameter selection. The near-optimal parameters can thus be chosen based on statistical results obtained from a large volume of test sequences.

Fig. 9. Typical spectrum plots for (a) Model I and (b) Model II.

C. Simplified SLS Implementation
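The region-averaging and priority-ordering steps of the basic algorithm in Section III-A can be sketched as follows. This is a hedged illustration rather than the exact rules of the paper: it assumes energies in dB, a single threshold on the drop in average energy between neighboring regions, priority 0 as the lowest level, and a head start of one bit-plane per priority level as one plausible reading of the coding rules. All function names and numbers are hypothetical.

```python
def region_averages(band_energy_db, starts):
    """Average energy of each region; starts[n] is the first sfb of region n."""
    bounds = list(starts) + [len(band_energy_db)]
    return [
        sum(band_energy_db[bounds[n]:bounds[n + 1]]) / (bounds[n + 1] - bounds[n])
        for n in range(len(starts))
    ]


def assign_priorities(averages, threshold_db):
    """Lower-frequency regions never get a lower priority than higher ones."""
    priorities = [0]
    for n in range(1, len(averages)):
        drop = averages[n - 1] - averages[n]
        step = 1 if drop > threshold_db else 0   # assumed single-threshold rule
        priorities.append(priorities[-1] - step)
    lowest = min(priorities)
    return [p - lowest for p in priorities]      # 0 marks the lowest priority


def coding_order(priorities, num_planes):
    """Schedule (region, plane) pairs for the nonlazy planes; plane 0 is the MSB plane."""
    top = max(priorities)
    steps = []
    for region, priority in enumerate(priorities):
        delay = top - priority                   # assumed head start of one plane per level
        for plane in range(num_planes):
            steps.append((plane + delay, region, plane))
    steps.sort()
    return [(region, plane) for _, region, plane in steps]


energies = [90, 88, 86, 70, 68, 66, 64, 40]      # toy per-sfb energies in dB
starts = [0, 3, 7]                               # three toy regions
averages = region_averages(energies, starts)
priorities = assign_priorities(averages, threshold_db=10.0)
order = coding_order(priorities, num_planes=2)
```

When all regions receive the same priority, `coding_order` falls back to the plane-by-plane sequential order of Fig. 2, so the sketch degrades gracefully to the original scheme.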

The basic algorithm and parameter optimization have been described in the last two subsections. A further generalized modeling of the audio spectrum is proposed and discussed in this subsection. Based on this, a simplified structure of prioritized bit-plane coding is implemented in SLS, with the goal of optimizing the perceptual performance with the least added complexity, side information, and syntax change.

1) Modeling of the Audio Spectrum: By using (17), statistical data show that the number of regions is optimized at 2 to 4 for most of the audio sequences. The audio spectrum for each frame is divided into four regions, which are denoted as the low-frequency, middle-low-frequency, middle-high-frequency, and high-frequency regions, with fixed region boundaries. It is observed that most of the spectra can be categorized into two basic models based on the energy distribution in these regions. The typical plots of these two models are shown in Fig. 9. The main criterion that distinguishes these two models is the energy difference ratio between two of these regions. Specifically, one frame is identified as (22)


Fig. 10. Integrated structure of SLS with frequency region-based prioritized bit-plane coding.

where is a constant threshold value [refer to (11)]. The coding priorities for Model I are assigned as


Fig. 11. SFB numbers from the case where the energy level is equal to or less than 70 dB (collected from 150 frames).

TABLE II PARAMETERS SETTING FOR BPGC/CBAC CODING

(23) where two of the regions are merged as one region in the coding process, and the resulting regions are equivalent to the regions of the basic algorithm for Model I, with their priority level differences defined accordingly. The coding priorities for Model II are (24), again with the merging of two regions. Similarly, the resulting regions are equivalent to the regions of the basic algorithm for Model II, with their own priority level differences.

2) Integrated Structure: The integrated structure of SLS with the prioritized bit-plane coding based on the above-mentioned models is depicted in Fig. 10. In this structure, a conditional switch [(22)] and two alternative prioritized bit-plane coding sequences are adopted to replace the original sequential bit-plane coding block of SLS. As all the parameters are fixed as constants in this integrated structure, the side information is thus 1 bit per frame. There is also a reserved bit in SLS bit-plane coding which can be used for the switch function. This means that there is no additional side information in the bitstream at all. In addition, the complexity of this profile remains low as well, since the only added computation is the energy difference calculation for the switching criterion.

3) Parameters Setting: It is observed from Fig. 5 that for sfbs with energy equal to or lower than 70 dB, the corresponding NMRs are relatively low. This can be explained by the mask computation in (4). Therefore, for each frame, the sfbs with energy equal to or lower than 70 dB are identified by (25).

A statistical study is performed (as shown in Fig. 11) on the sfb numbers from the case where the energy level is equal to or less than 70 dB. The data are extracted from 150 frames out of the 15 standard test sequences. It is noted that for most sequences, the energy level declines to 70 dB from a particular range of sfb numbers. In addition, the last four sfbs correspond to the highest frequency region, and the noise there is relatively less evident to human perception. Therefore, the last four sfbs are always grouped as one and assigned the lowest priority. As such, these low-energy sfbs should always be assigned low coding priorities.

The other parameters, including the energy difference ratio threshold, the region boundaries, and the coding priority differences, can be derived according to (18) and (21). The optimized parameters do vary for different items. However, to keep the side information and complexity to the minimum, the parameters are fixed for a particular profile. The parameters for one typical profile are summarized in Table II, and the corresponding coding sequences are depicted in Figs. 12 and 13 for BPGC and CBAC coding, respectively.

IV. EXPERIMENTAL RESULTS

By replacing the sequential BPGC/CBAC bit-plane coding block in SLS with the simplified implementation of prioritized bit-plane coding as depicted in Fig. 10, the combined structure is denoted by PSLS (for prioritized SLS). The perceptual quality

102

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 1, JANUARY 2008

Fig. 13. Bit-plane coding order for CBAC coding with (a) Model I and (b) Model II.

Fig. 12. Bit-plane coding order for BPGC coding with (a) Model I and (b) Model II.

of lossy audio reconstruction when SLS and PSLS are working at zero and low core bit-rates is evaluated based on both the objective and subjective tests. Specifically, the objective test is conducted by using the OPERA voice/audio quality analyzer [38] which applies the ITU-R BS.1387 PEAQ test method and the subjective test is conducted by using ITU-R BS.1116 [40]. In the PEAQ test, the PSLS is evaluated against two main sets of test data, where one set uses CBAC and the other one uses BPGC coding. The full list of items can be found in Table I, with the parameters shown in Table II. For each test set, there are three subsets with core bit-rates at 0, 32, and 64 kbps (in total for stereo channels). The enhancement layer bit-rates for each subset start from 64 kbps up to the bit-rate where the corre. The persponding ODG values fall in the range of formance of PSLS is compared with that of the original SLS and state-of-the-art perceptual audio codec AAC. The AAC codec selected for comparison is the MPEG-4 AAC verification model (VM) low-complexity (LC) profile which is an informative implementation from MPEG. The AAC core in the SLS and PSLS are the MPEG-4 VM LC profile too. The performance of PSLS noncore which is the main enhancement target of the proposed method is further evaluated using the subjective test with comparison of the AAC-LC and SLS noncore at common lossy bit-rates of 64, 96, 128, 192, and 256 kbps (in total for stereo channels) for all the test items.

The summarized results are plotted in Figs. 14–16, with numerical ODG results of SLS noncore with enhancement bit-rates of 128, 192, and 256 kbps (in total for stereo channels) shown in Table III. For the PEAQ test in Figs. 14 and 15, the grading scale of ODG ranges from −4 (“very annoying”) to 0 (“imperceptible difference”). For the BS.1116 test in Fig. 16, the grading scale ranges from 1 (“Very Annoying”) to 5 (“Imperceptible”). Here are a few key observations.

• Compared with the perceptual quality of SLS in terms of ODG scores, PSLS displays obvious improvements for both the CBAC and BPGC coding test data sets. The scalable quality curves with increasing enhancement layer bit-rates for the original SLS are rather linear when core bit-rates are low, whereas for PSLS a nonlinear, or more “perceptually representative,” scalability is achieved.

• In both the objective and subjective tests, it is observed that PSLS has significantly narrowed the quality gap between the perceptual audio codec and the fully scalable audio codec, though AAC LC still outperforms PSLS at all test bit-rates.

• The more effective improvement range of PSLS begins when the enhancement layer bit-rate is higher than 64 kbps for the first two subsets, and when it is equal to 64 kbps for the third subset. This can be explained by the fact that, since only four frequency regions are considered, the priorities assigned by PSLS are rather coarse. As a result, at very low bit-plane coding bit-rates, there are no obvious improvements. The improvements are more significant in the middle range of bit-rates, between 96 and 192 kbps.

• The improvements are more significant for the noncore subsets, as the proposed prioritized bit-plane coding works more efficiently when the signal spectrum is more unbalanced.

• From Table III, it is observed that PSLS produces perceptual quality results which are more stable and balanced. The improvements are relatively small for one test item, “cymbal.wav,” as this item contains a rather extended state of silence and its spectrum is more balanced than those of the rest of the test data.

Fig. 14. Objective results for CBAC coding with core bit-rate at (a) 0 kbps, (b) 32 kbps, and (c) 64 kbps. It should be noted that the total bit-rate is equal to the sum of the core bit-rate and the enhancement bit-rate.

Fig. 15. Objective results for BPGC coding with core bit-rate at (a) 0 kbps, (b) 32 kbps, and (c) 64 kbps. It should be noted that the total bit-rate is equal to the sum of the core bit-rate and the enhancement bit-rate.
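The two grading scales quoted in this section share the same five anchors, offset by 5: an ODG of 0 corresponds to grade 5 (“Imperceptible”) and an ODG of −4 to grade 1 (“Very Annoying”). The short sketch below makes that alignment explicit as a reading aid for Figs. 14–16. The helper names are hypothetical, and since ODG is a model output while BS.1116 grades come from listeners, the mapping relates the scales only, not the scores of any particular item.

```python
import math

# Five-grade impairment scale, as listed in the caption of Fig. 16.
IMPAIRMENT_SCALE = {
    5: "Imperceptible",
    4: "Perceptible but not annoying",
    3: "Slightly annoying",
    2: "Annoying",
    1: "Very annoying",
}


def odg_to_grade(odg: float) -> float:
    """Shift an ODG in [-4, 0] onto the 1-to-5 grade axis."""
    if not -4.0 <= odg <= 0.0:
        raise ValueError("ODG is expected to lie in [-4, 0]")
    return 5.0 + odg


def nearest_label(odg: float) -> str:
    """Return the label of the scale anchor closest to the given ODG."""
    grade = math.floor(odg_to_grade(odg) + 0.5)  # round half up
    return IMPAIRMENT_SCALE[grade]


if __name__ == "__main__":
    for value in (0.0, -1.2, -2.6, -4.0):
        print(value, odg_to_grade(value), nearest_label(value))
```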


TABLE III ODG PERFORMANCE OF ENHANCED SLS NONCORE COMPARING WITH THAT OF ORIGINAL SLS NONCORE

Fig. 16. Subjective test results of PSLS noncore with comparison of SLS noncore and AAC LC at variable lossy bit-rates (1: Very Annoying. 2: Annoying. 3: Slightly Annoying. 4: Perceptible but not Annoying. 5: Imperceptible).

The quality evaluation shows that the frequency region-based prioritized bit-plane coding does improve the perceptual quality of SLS over a wide range of bit-rates. In addition, as mentioned in the last section, this comes with no extra side information bits, as the only additional bit per frame can be carried by the reserved bit of SLS. It is expected that greater improvements can be achieved over a wider bit-rate range when the spectrum is divided into more frequency regions. However, this entails a more complex system and more side information. Since SLS is already standardized, the proposed system is in fact a simple yet efficient modification to it. It does not require a complete overhaul and fits this purpose well.

V. CONCLUSION

The lossy quality performance of the current SLS structure at low or noncore bit-rates is shown to be inefficient compared to the performance at high core bit-rates. Inspired by observations on the energy distribution of the residual spectrum, a frequency-region-based prioritized bit-plane coding has been proposed along with an analysis on parameter optimization. Based on the statistical modeling of the spectrum, a much more simplified implementation is designed for SLS. The experimental results show that, with zero extra bits involved and merely trivial added complexity, SLS with low core bit-rates is significantly improved by the proposed system over a range of intermediate bit-rate combinations. This is especially important for the noncore mode of SLS, as a perceptually more efficient fully scalable coding is achieved. The generalized frequency-region-based prioritized bit-plane coding can also be applied in other bit-plane coding scenarios besides SLS.

REFERENCES

[1] K. Brandenburg and B. Grill, “First ideas on scalable audio coding,” in Proc. 97th AES Conv., San Francisco, CA, 1994, preprint 3924.
[2] B. Grill and K. Brandenburg, “A two or three-stage bit rate scalable audio coding system,” in Proc. 99th AES Conv., New York, 1995, preprint 4132.
[3] B. Grill, “A bit rate scalable perceptual coder for MPEG-4 audio,” in Proc. 103rd AES Conv., New York, 1997, preprint 4620.
[4] B. Grill and B. Teichmann, “Scalable joint stereo coding,” in Proc. 105th AES Conv., San Francisco, CA, 1998, preprint 4851.
[5] N. Iwakami, T. Moriya, and S. Miki, “High quality audio coding at less than 64 kbps by using transform domain weighted interleave vector quantization (TWIN-VQ),” in Proc. Int. Conf. Acoust., Speech, Signal Process., May 1995, vol. 2, pp. 3095–3098.
[6] J. Herre, E. Allamanche, K. Brandenburg, M. Dietz, B. Teichmann, and B. Grill, “The integrated filterbank based scalable MPEG-4 audio coder,” in Proc. 105th AES Conv., San Francisco, CA, 1998, preprint 4810.
[7] A. Jin, T. Moriya, T. Norimatsu, M. Tsushima, and T. Ishikawa, “Scalable audio coder based on quantizer units of MDCT coefficients,” in Proc. Int. Conf. Acoust., Speech, Signal Process., Mar. 1999, vol. 2, pp. 897–900.
[8] Information Technology—Coding of Audiovisual Objects, Part 3. Audio, ISO/IEC 14496-3, 1998.
[9] M. S. Vinton and L. E. Atlas, “Scalable and progressive audio codec,” in Proc. Int. Conf. Acoust., Speech, Signal Process., May 2001, pp. 3277–3280.
[10] S. H. Park, Y. B. Kim, and Y. S. Seo, “Multi-layer bit-sliced bit-rate scalable audio coding,” in Proc. 103rd AES Conv., New York, 1997, preprint 4520.
[11] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3445–3462, Dec. 1993.
[12] A. Said and W. A. Pearlman, “A new, fast, and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, Jun. 1996.
[13] M. Purat and P. Noll, “Audio coding with a dynamic wavelet packet decomposition based on frequency-varying modulated lapped transforms,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 1996, vol. 2, pp. 1021–1024.
[14] S. Boland and M. Deriche, “Audio coding using the wavelet packet transform and a combined scalar-vector quantization,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 1996, vol. 2, pp. 1041–1044.
[15] D. Sinha and A. Tewfik, “Low bit rate transparent audio compression using adapted wavelets,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3463–3479, Dec. 1993.
[16] A. Aggarwal, V. Cuperman, K. Rose, and A. Gersho, “Perceptual zerotrees for scalable wavelet coding of wideband audio,” in IEEE Workshop Speech Coding Proc., Jun. 1999, pp. 16–18.
[17] Z. Lu and W. A. Pearlman, “An efficient, low complexity audio coder delivering multiple-levels of quality for interactive applications,” in Proc. IEEE Signal Process. Soc. Workshop Multimedia Signal Process., Dec. 1998, pp. 529–534.
[18] C. Dunn, “Efficient audio coding with fine-grain scalability,” in Proc. 111th AES Conv., New York, 2001, preprint 5492.
[19] N. S. Jayant, J. D. Johnson, and R. Safranek, “Signal compression based on model of human perception,” Proc. IEEE, vol. 81, no. 10, pp. 1385–1422, Oct. 1993.
[20] F. C. Chen and T. M. Chiu, “Scalefactor based bit shift FGS audio coding,” in Proc. 19th Int. Conf. Adv. Inf. Netw. Applicat., Mar. 2005, vol. 2, pp. 235–238.
[21] J. Li, “Embedded audio coding (EAC) with implicit auditory masking,” in Proc. ACM Multimedia, Nice, France, Dec. 2002, pp. 592–601.
[22] M. Raad, A. Mertins, and R. Burnett, “Scalable to lossless audio compression based on perceptual set partitioning in hierarchical trees (PSPIHT),” in Proc. Int. Conf. Acoust., Speech, Signal Process., Apr. 2003, pp. 624–627.
[23] S. Strahl, H. Zhou, and A. Mertins, “An adaptive tree-based progressive audio compression scheme,” in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., Oct. 2005, pp. 219–222.
[24] Call for Proposals on MPEG-4 Lossless Audio Coding, ISO/IEC JTC1/SC29/WG11, 2002, N5040, (MPEG).
[25] Scalable Lossless Coding (SLS), ISO/IEC 14496-3:2005/Amd 3, 2006.
[26] R. Yu, T. Li, and S. Rahardja, “Perceptually enhanced bit-plane coding for scalable audio,” in Proc. Int. Conf. Multimedia Expo., Jul. 2006, pp. 1153–1156.
[27] R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, “Audio coding based on integer transform,” in Proc. 111th AES Conv., New York, Sep. 2001.
[28] R. Geiger, Y. Yokotani, and G. Schuller, “Improved integer transforms for lossless audio coding,” in Proc. Asilomar Conf. Signals, Syst. Comput., Nov. 2003, pp. 2119–2123.
[29] J. Johnston and A. Ferreira, “Sum-difference stereo transform coding,” in Proc. Int. Conf. Acoust., Speech, Signal Process., Mar. 1992, pp. 569–571.
[30] R. Yu, C. C. Koh, S. Rahardja, and X. Lin, “Bit-plane Golomb code for sources with Laplacian distributions,” in Proc. Int. Conf. Acoust., Speech, Signal Process., Apr. 2003, pp. 277–280.
[31] R. Yu, X. Lin, S. Rahardja, C. C. Koh, and H. Huang, “Improving coding efficiency for MPEG-4 audio scalable lossless coding,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2005, vol. 3, pp. 169–172.
[32] R. Yu, S. Rahardja, X. Lin, and C. C. Koh, “A fine granular scalable to lossless audio coder,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1352–1363, Jul. 2006.
[33] R. Geiger, R. Yu, J. Herre, S. Rahardja, S. W. Kim, X. Lin, and M. Schmidt, “ISO/IEC MPEG-4 high-definition scalable advanced audio coding,” in Proc. 120th AES Conv., no. 6791, May 2006.
[34] S. W. Golomb, “Run-length encodings,” IEEE Trans. Info. Theory, vol. 12, no. 3, pp. 399–401, Jul. 1966.
[35] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Comm. ACM, vol. 30, no. 6, pp. 520–540, Jun. 1987.
[36] M. Nelson, “Arithmetic coding + statistical modeling = data compression,” Dr. Dobb’s J., vol. 16, no. 2, pp. 16–22, Feb. 1991.
[37] M. Hans and R. W. Schafer, “Lossless compression of digital audio,” IEEE Signal Process. Mag., vol. 18, no. 4, pp. 21–32, Jul. 2001.
[38] “OPERA Voice/Audio Quality Analyzer,” [Online]. Available: http://www.peaq.org/
[39] Method for Objective Measurements of Perceived Audio Quality, ITU-R BS.1387.
[40] Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multi-Channel Sound System, ITU-R BS.1116.

Te Li (S’07) received the B.Eng. degree in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2004. She is currently pursuing the Ph.D. degree in electrical and electronic engineering at NTU. She joined the Institute for Infocomm Research, Agency for Science, Technology, and Research, Singapore, as a Research Engineer in 2004. Her research interests include speech, audio, and image processing.

Susanto Rahardja (M’97–SM’03) received the B.Eng. degree from the National University of Singapore and the M.Eng. and Ph.D. degrees from the Nanyang Technological University (NTU), Singapore, all in electrical and electronic engineering. He was the Director of the Media Division at the Institute for Infocomm Research, Singapore, from 2002 to 2007. His research interests are in audio/video signal processing, spread spectrum and multiuser detection techniques for CDMA applications, digital signal processing algorithms and implementations, and logic synthesis, on which he has more than 180 publications in internationally refereed journals and conferences. He is also an Associate Professor at the School of Electrical and Electronic Engineering, NTU. Dr. Rahardja was the recipient of the Institution of Electronic Engineers (IEE) Hartree Premium Award for the best journal paper published in IEE Proceedings in 2002. In 2003, he received the prestigious Tan Kah Kee Young Inventors’ Gold Award in the Open Category for his contributions on scalable to lossless audio compression technology. Since 2002, he has been actively participating in the international ISO/IEC JTC1/SC29/WG11 (Moving Picture Expert Group, or MPEG), where he contributed to the MPEG-4 scalable to lossless system (SLS); the technology is incorporated in ISO/IEC 14496-3:2005/Amd.3:2006. He also contributed technology to the MPEG-4 Audio Lossless System (ALS), which is now incorporated in ISO/IEC 14496-3:2005/Amd.2:2006. In recognition of his contributions to the national standardization program, he was awarded the Standards Council Merit Award by SPRING Singapore in 2006. He has served on several boards and advisory and technical committees in various IEEE- and SPIE-related professional activities in the areas of multimedia. He is currently serving as an Associate Editor for the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING and the Journal of Visual Communication and Image Representation.

Soo Ngee Koh (SM’98) received the B.Eng. degree from the University of Singapore, Singapore, and the B.Sc. degree from the University of London, London, U.K., in 1979, and the M.Sc. and Ph.D. degrees from Loughborough University, Loughborough, U.K., in 1981 and 1984, respectively. Upon graduation, he worked as an Engineer at the Telecommunication Authority of Singapore. Prior to his return to Singapore, he worked as a Consultant in wide-band speech and audio coding at the British Telecom Research Laboratories, Suffolk, U.K. He joined Nanyang Technological University (NTU), Singapore, in 1985. He is currently a Professor and Associate Chair (Academic Matters) of the School of Electrical and Electronic Engineering, NTU. He has published more than 130 papers in international journals and conference proceedings, and holds two international patents on speech coder design. His research interests include speech processing, coding, enhancement, recognition and synthesis, joint source-channel coding, audio and video coding, blind source separation, and communication signal processing. Dr. Koh was a corecipient of the IREE (Australia) Norman Hayes Best Paper Award in 1990.