Sound Source Location Cue Coding System for ... - ACM Digital Library

Sound Source Location Cue Coding System for Compact Representation of Multi-channel Audio

Inseon Jang, Jeongil Seo, Seungkwon Beack, Kyeongok Kang

Han-gil Moon

Broadcasting Media Research Group, Electronics and Telecommunications Research Institute (ETRI), 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-350 KOREA, +82-42-860-5791

Digital Media R&D Center, Samsung Electronics Co., Ltd., 416 Maetan-dong, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-742 KOREA, +82-31-277-8076

{jinsn, seoji,skbeack,kokang}@etri.re.kr

[email protected]

compact representation method for multi-channel audio with an encoding process, while preserving backward compatibility with legacy mono/stereo audio coding schemes such as MPEG Audio and MPEG-2/4 AAC. MPEG Surround has two coded representation parts: one is a down-mixed signal that retains backward compatibility, and the other is low-bitrate side information for recovering the spatial image of the multi-channel audio.

ABSTRACT

Binaural cue coding (BCC) has been introduced for compact representation of multi-channel audio. It exploits binaural cue parameters to capture the spatial image of multi-channel audio. Recently, it has been standardized within MPEG under the name “MPEG Surround.” In this paper, we propose a sound source location cue coding (SSLCC) system for compressing multi-channel audio for narrow-bandwidth transmission environments. To improve on the compression ability of the conventional BCC, the SSLCC system uses virtual source location information (VSLI) as a spatial cue parameter instead of the inter-channel level difference (ICLD) of the BCC system. The SSLCC system also adopts enhanced pre/post-processing algorithms to improve perceptual sound quality. Objective and subjective assessment results show that the proposed SSLCC system performs better than the conventional BCC system.

Binaural Cue Coding (BCC) [1] is known to be the basis of the MPEG Surround system. BCC represents multi-channel signals as a down-mixed audio signal plus BCC side information such as the inter-channel level difference (ICLD), inter-channel time difference (ICTD), and inter-channel correlation (ICC).

General Terms: Design

In this paper, we propose a sound source location cue coding (SSLCC) system, which is an extension of the conventional BCC for extremely low bitrate representation of side information. The SSLCC estimates virtual source location information (VSLI) [2] as the side information. The VSLI represents the geometric spatial information using an angle, which has a finite dynamic range. The VSLI can therefore be quantized with a consistent finite maximum level while preserving the spatial sound image. Thanks to this property, the SSLCC system can represent the original multi-channel audio with robustness to quantization distortion.

Keywords

2. Virtual Source Location Information

Categories and Subject Descriptors E.4 [Coding and Information Theory]: Data compaction and compression.

Spatial Audio Coding, BCC, VSLI, Sound source location cue coding

VSLI is an angle that represents the geometric spatial information between inter-channel power vectors, rather than a power ratio such as the conventional ICLD. It is extracted under the assumption that the playback layout of the multi-channel loudspeakers is fixed as illustrated in Fig. 1. There are five possible spatial sound images, denoted S1, S2, S3, S4, and S5, because adjacent inter-channel power vectors form a spatial audio image between adjacent loudspeakers. Hence a spatial image can be represented by one angle between the adjacent loudspeakers.

1. INTRODUCTION

MPEG Surround technology has emerged in MPEG as a new coding scheme for multi-channel audio. It provides extremely

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’05, November 6–11, 2005, Singapore. Copyright 2005 ACM 1-59593-044-2/05/0011…$5.00.

To evaluate the angle information of a spatial image, the power panning law [2] and the notion of time-frequency plane are


The advantage of the VSLI cue is that the angle has a finite dynamic range. As a spatial cue, the angle can therefore be quantized with a consistent, finite maximum level while preserving the spatial sound image. In contrast, the dynamic range of the conventional ICLD depends on the unpredictably varying channel power. An inappropriately chosen maximum quantization level can then cause spectral distortion of the original audio signals and, as a result, degrade the perceptual quality of the reconstructed signals.
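As a rough illustration of why a bounded angle is convenient to quantize, the sketch below applies a uniform quantizer to an angle lying between two adjacent loudspeakers. The function names, the uniform step, and the 0 to 30 degree span are assumptions made for this example; the paper does not specify its quantizer design.

```python
def quantize_angle(theta, theta_lo, theta_hi, levels):
    """Map an angle in [theta_lo, theta_hi] to an index in 0..levels-1.

    Uniform quantization is an assumption of this sketch.
    """
    if not theta_lo <= theta <= theta_hi:
        raise ValueError("angle lies outside the loudspeaker pair")
    step = (theta_hi - theta_lo) / (levels - 1)
    return round((theta - theta_lo) / step)


def dequantize_angle(index, theta_lo, theta_hi, levels):
    """Recover the reconstruction angle for a quantization index."""
    step = (theta_hi - theta_lo) / (levels - 1)
    return theta_lo + index * step


def max_abs_error(theta_lo, theta_hi, levels):
    """Worst-case reconstruction error: half of one quantization step."""
    return (theta_hi - theta_lo) / (levels - 1) / 2


if __name__ == "__main__":
    # Hypothetical pair of loudspeakers 30 degrees apart.
    for levels in (7, 15, 31):
        idx = quantize_angle(12.4, 0.0, 30.0, levels)
        rec = dequantize_angle(idx, 0.0, 30.0, levels)
        print(levels, idx, rec, max_abs_error(0.0, 30.0, levels))
```

Because the range is fixed by the loudspeaker layout, the worst-case error depends only on the number of levels and not on the signal, which is the property the paragraph above contrasts with ICLD.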

[Figure 1 labels: loudspeakers R (i=1), C (i=2), L, with channel index (i=3); virtual source positions S1, S2, S3, S5; angles 30º, 30º, 80º, 80º]

3. Sound Source Location Cue Coding System

An SSLCC encoder scheme is shown in Fig. 2. The SSLCC encoder receives an N-channel audio signal and performs an FFT, as used in the conventional BCC-based SAC (spatial audio coding) encoder. The spectra derived from the FFT for each of the channels are partitioned into 20 partitions approximating the ERB (Equivalent Rectangular Bandwidth) scale. Energy vectors in each band are then derived from the energy information of the partitioned spectra, and the angles of the sound images for all bands are estimated by means of the inverse power panning law in a VSLI analyzer. After quantization, the differences between the quantized VSLI indices are Huffman coded to form the side information stream. Power equalization is applied to the down-mixed signal to preserve the power of each channel between the original multi-channel audio signal and the synthesized multi-channel audio signal. Then conventional AAC encoding is performed. Finally, the down-mixed audio stream and the side information stream are combined in a MUX, and the resulting SSLCC stream is transmitted.
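The band-partitioning step of the encoder can be sketched as follows. This is only an illustration: the ERB-rate formula (the common Glasberg and Moore form), the rounding of band edges to FFT bins, and all function names are assumptions, since the paper states only that 20 partitions approximate the ERB scale.

```python
import math


def hz_to_erb_rate(f_hz):
    """Frequency in Hz to ERB-rate units (Glasberg-Moore form, assumed here)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437


def erb_band_edges(n_bins, sample_rate, n_bands=20):
    """Return n_bands + 1 bin indices, equally spaced on the ERB-rate scale."""
    if n_bins < n_bands:
        raise ValueError("need at least one FFT bin per band")
    nyquist = sample_rate / 2.0
    top = hz_to_erb_rate(nyquist)
    edges = [round(erb_rate_to_hz(top * b / n_bands) / nyquist * n_bins)
             for b in range(n_bands + 1)]
    edges[0], edges[-1] = 0, n_bins
    # Nudge edges so that every band holds at least one bin.
    for b in range(1, n_bands + 1):
        lowest = edges[b - 1] + 1
        highest = n_bins - (n_bands - b)
        edges[b] = min(max(edges[b], lowest), highest)
    return edges


def band_energies(spectrum, edges):
    """Sum |X|^2 over the bins of each band of one channel's spectrum."""
    return [sum(abs(x) ** 2 for x in spectrum[lo:hi])
            for lo, hi in zip(edges[:-1], edges[1:])]


if __name__ == "__main__":
    edges = erb_band_edges(1024, 44100)
    print(edges)
    print(band_energies([1 + 0j] * 1024, edges))
```

The per-band energies of adjacent channels are the inputs from which a VSLI analyzer would estimate one angle per band.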

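[Figure 1 labels, continued: loudspeakers Ls and Rs, with channel indices (i=4) and (i=5); virtual source position S4; angle 140º]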

Figure 1. Playback loudspeakers layout for 5 channel audio.

employed. Let $L_{b,k}^{i}$ be the level information of the b-th partition at the k-th time in the i-th channel. To estimate the power vector of a spatial sound image between channels, both the level information and the angle between adjacent channels are indispensable. The angle estimation with level information $\{L_{b,k}^{i}, L_{b,k}^{i+1}\}$ between two adjacent channels is performed with the help of the power panning law. To apply the power panning law inversely, the level information of the i-th channel and the (i+1)-th channel needs to be normalized, as given below.

$$g_{b,k}^{i} = \frac{L_{b,k}^{i}}{\sqrt{\left(L_{b,k}^{i}\right)^{2} + \left(L_{b,k}^{i+1}\right)^{2}}}, \qquad (1)$$

$$g_{b,k}^{i+1} = \frac{L_{b,k}^{i+1}}{\sqrt{\left(L_{b,k}^{i}\right)^{2} + \left(L_{b,k}^{i+1}\right)^{2}}}. \qquad (2)$$

Then, the power panning law is applied inversely to the normalized adjacent channel gains $\{g_{b,k}^{i}, g_{b,k}^{i+1}\}$ of partition b at time k for position estimation as follows:

$$\theta_{b,k}^{i} = \cos^{-1} g_{b,k}^{i}, \qquad (3)$$

$$\theta_{b,k}^{i+1} = \cos^{-1} g_{b,k}^{i+1}. \qquad (4)$$

Finally, the angle as the location information of the virtual source between the i-th channel and the (i+1)-th channel is given as

$$\theta_{vs} = \left(\theta_{b,k}^{i,i+1} \times \frac{2}{\pi}\right) \times \left(\theta_{i+1} - \theta_{i}\right) + \theta_{i}. \qquad (5)$$

Figure 2. The schematic diagram of SSLCC encoder. [Block labels: Multi-channel Audio, Windowing, FFT, ERB Filter Bank, VSLI Analyzer, Quantization, Lossless Encoding, Side Information Stream, Downmix, Power Normalization, MPEG-4 AAC Encoder, Downmix Audio Stream, MUX, SSLCC Stream]

An SSLCC decoder scheme is shown in Fig. 3. The SSLCC stream is received by a DEMUX and split into the side information stream and the down-mixed audio stream. VSLI and ICC values per band are extracted from the side information stream after lossless decoding and de-quantization. The down-mixed audio stream, on the other hand, is decoded into a mono/stereo audio signal by a conventional AAC decoder.

The estimated angle is valid only for the b-th partition of the k-th frame between the i-th channel and the (i+1)-th channel. Therefore, for each frame, one angle per partitioned band is estimated for each pair of adjacent channels, and these angles are transmitted as the spatial cues.
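The angle estimation of Eqs. (1) to (5) for one band and one pair of adjacent channels can be sketched as below. Two points are assumptions of this sketch rather than statements from the paper: the levels are normalized by the root of their summed squares (so the squared gains sum to one, as the power panning law requires), and the panning angle used in Eq. (5) is taken to be the arccosine of channel i's normalized gain, so that a source entirely in channel i maps to the loudspeaker angle of channel i. The function name and the 0 and 30 degree loudspeaker angles are hypothetical.

```python
import math


def vsli_angle(level_i, level_next, theta_i, theta_next):
    """Estimate the virtual source angle between two adjacent loudspeakers.

    level_i, level_next: non-negative band levels of channels i and i+1.
    theta_i, theta_next: loudspeaker angles of channels i and i+1 (degrees).
    """
    norm = math.hypot(level_i, level_next)
    if norm == 0.0:
        # A silent band has no defined source position in this sketch.
        raise ValueError("both levels are zero")
    # Eqs. (1)-(2): normalized gain of channel i (clamped against rounding).
    g_i = min(1.0, level_i / norm)
    # Eq. (3): inverse power panning law, giving an angle in [0, pi/2].
    panning = math.acos(g_i)
    # Eq. (5): rescale [0, pi/2] onto the aperture between the loudspeakers.
    return panning * 2.0 / math.pi * (theta_next - theta_i) + theta_i


if __name__ == "__main__":
    print(vsli_angle(1.0, 0.0, 0.0, 30.0))  # all power in channel i
    print(vsli_angle(1.0, 1.0, 0.0, 30.0))  # equal power in both channels
    print(vsli_angle(0.0, 1.0, 0.0, 30.0))  # all power in channel i+1
```

Running one such estimate per band and per adjacent channel pair yields the set of angles that is transmitted for a frame.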


entropy measurement (Kullback-Leibler distance) [4]. The KL distance is a well-known distance measure for estimating perceptual distortion. It is defined as

$$D_{KL} = \int \left(P(\omega) - Q(\omega)\right) \log \frac{P(\omega)}{Q(\omega)} \, d\omega, \qquad (6)$$

where $P(\omega)$ and $Q(\omega)$ are the power spectra of the reference and the decoded signal, respectively. Several KL distances were calculated from the reconstructed signals decoded by using the ICLD (with ± 24dB) and the VSLI (with various quantization levels). Figure 4 shows the total accumulated KL distance for several quantization levels. It indicates that the spectral distortion caused by VSLI is consistently smaller than that caused by ICLD, a sign of robustness to quantization.

[Figure 3 block labels: SSLCC Stream, DEMUX, Side Information, Lossless Decoding, De-Quantization, VSLI, ICC, Downmix Audio, MPEG-4 AAC Decoder, Mono/Stereo Audio, Windowing, FFT, ERB Filter Bank, SSLCC Synthesizer, Level Modification, Correlation Modification]
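A discrete approximation of the distance in Eq. (6) can be sketched as follows, assuming the integrand is (P − Q) log(P / Q), which is the symmetric form of the KL distance. The function name, the uniform frequency step, and the small floor that avoids taking the logarithm of zero are assumptions of this example; whether the paper normalizes the spectra beforehand is not stated, so no normalization is applied here.

```python
import math


def symmetric_kl_distance(p, q, d_omega=1.0, floor=1e-12):
    """Riemann-sum approximation of the integral of (P - Q) * log(P / Q).

    p, q: sampled power spectra of the reference and the decoded signal.
    d_omega: spacing of the frequency samples.
    floor: lower bound applied to both spectra (an assumption of this sketch).
    """
    if len(p) != len(q):
        raise ValueError("spectra must have the same length")
    total = 0.0
    for p_w, q_w in zip(p, q):
        p_w = max(p_w, floor)
        q_w = max(q_w, floor)
        # Each term is non-negative: the difference and the log share a sign.
        total += (p_w - q_w) * math.log(p_w / q_w)
    return total * d_omega


if __name__ == "__main__":
    reference = [4.0, 2.0, 1.0, 0.5]
    decoded = [3.5, 2.2, 1.0, 0.4]
    print(symmetric_kl_distance(reference, decoded))
```

Accumulating this value over frames and test items would give a single figure per quantization setting, in the spirit of the comparison described above.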

Table 1. Test material

Index  Material        Description
A      Applause        Ambience
B      ARL_Applause    Ambience
C      Chostakovitch   Music (back: direct)
D      Fountain_music  Pathological
E      Glock           Pathological
F      Indie2          Movie sound
G      Jackson1        Music (back: ambience)
H      Pops
I      Poulenc
J      Rock_concert    Music (back: ambience)
K      Stomp           Movie sound

Figure 3. The schematic diagram of SSLCC decoder. [Additional block labels: IFFT, Multi-channel Audio]

The gains per band are used as key information to recover the N multi-channel outputs from the downmix spectral values. An SSLCC synthesizer reconstructs the N channel gains per band by means of the power panning law, using the VSLI and the decoded down-mixed audio signal. It then regulates correlation by adding random noise to the audio spectrum when the ICC value is more than a threshold value. The final N-channel audio outputs are derived from the N channel spectra after the IFFT.

To reduce the sound quality degradation during compression and synthesis of multi-channel audio signals, the SSLCC uses the following novel methods. First, it removes the spectral degradation of the power-equalized down-mixed audio signal at frame boundaries with a new power equalization method, which groups the low-frequency bands into one. Second, it prevents abrupt variation of the spectrum at subband boundaries with a parameter smoothing method while synthesizing the spectrum of the multi-channel audio signal from the spatial cue parameters. Third, it reduces the correlation between synthesized channel signals with an all-pass filter controlled by the ICC, so that it can maintain the sound image of the original multi-channel audio signal.
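The gain reconstruction performed by the SSLCC synthesizer can be sketched for one band and one pair of adjacent channels as follows. This is a hedged illustration, not the authors' implementation: it assumes the decoder simply inverts the analysis mapping, so that the transmitted angle is rescaled to a panning angle between 0 and π/2 whose cosine and sine are the two channel gains, and that each channel's band spectrum is the downmix band scaled by its gain. The function names are hypothetical.

```python
import math


def panning_gains(theta_vs, theta_i, theta_next):
    """Recover the gains of channels i and i+1 from a virtual source angle.

    Assumes the power panning law with g_i^2 + g_next^2 = 1.
    """
    if theta_next == theta_i:
        raise ValueError("loudspeaker angles must differ")
    fraction = (theta_vs - theta_i) / (theta_next - theta_i)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("virtual source lies outside the loudspeaker pair")
    panning = fraction * math.pi / 2.0
    return math.cos(panning), math.sin(panning)


def synthesize_band(downmix_bins, gain):
    """Scale the downmix spectral values of one band by a channel gain."""
    return [gain * x for x in downmix_bins]


if __name__ == "__main__":
    g_i, g_next = panning_gains(15.0, 0.0, 30.0)
    print(g_i, g_next, g_i ** 2 + g_next ** 2)
    print(synthesize_band([2.0, 4.0], g_i))
```

Correlation regulation, parameter smoothing across subbands, and the IFFT would follow this step in a full decoder and are not covered by the sketch.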

Figure 4. Comparison of DKL between quantized VSLI and quantized ICLD. [Plot: y-axis labeled (D SKL), ticks from 1040 to 1240; x-axis (Q) with values 7, 15, 31; series: Quantized ICLD, Quantized VSLI]

4. Experiments

The robustness of the proposed VSLI to quantization distortion is verified by objective and perceptual preference measures. For these experiments we used the eleven multi-channel items listed in Table 1, which were used in the MPEG-4 SAC standardization work [3].

For the subjective assessment, the four systems listed in Table 2 were used. The subjective listening test was performed according to the MUSHRA test methodology [5]. To obtain a fair assessment of the quality of the test items, the listening panel was composed of 10 experienced listeners. They were asked to judge the ‘Basic Audio Quality’ of the 11 test items in each trial.

In the first experiment, the performance of VSLI is compared objectively with that of the conventional ICLD. The objective assessment was performed with the method of relative


Table 2. Multi-channel audio systems under test

System             Description
Hidden Reference   Original
Anchor             3.5kHz lowpass filtered version
Additional anchor  Dolby Prologic II
Proposed system    5-2-5 SSLCC

5. Conclusion

This paper presented an SSLCC system, which is an extension of the conventional BCC for compact representation of multi-channel audio. The SSLCC system estimates virtual source location information (VSLI) as side information. We performed objective and subjective assessments to verify the robustness of the proposed system to quantization distortion. The objective assessment used the KL distance as a relative entropy measure. The overall distance for VSLI was found to be shorter than that for ICLD, which implies that the spectral distortion of the signal reconstructed by the VSLI-based SSLCC system is smaller than that of the ICLD-based SAC system. The subjective assessment was performed according to the MUSHRA test methodology. It verified that the proposed SSLCC system is psychoacoustically effective.

6. Acknowledgements

This work is being supported by the Ministry of Information and Communication, Republic of Korea, under the title of “SmarTV Technology Development.”

7. REFERENCES

[1] Faller, C., and Baumgarte, F., Binaural cue coding-part II: schemes and application, IEEE Trans. on Speech and Audio Proc., 11, 6, Nov. 2003.

[2] Sang Bae Jeon, In Yong Choi, Han-gil Moon, Jeongil Seo, and Keong-mo Sung, “Virtual Source Location Information for Binaural Cue Coding,” 119th AES Convention, Oct. 2005.

Figure 5. Listening test results on 5-2-5 SSLCC system.

[3] Pulkki, V., Virtual sound source positioning using vector base amplitude panning, J. Audio Eng. Soc., 45(June 1997), 456-466.

Fig. 5 shows the average scores and the 95% confidence intervals. It indicates that the SSLCC 5-2-5 system (stereo downmix) has significantly better quality than Dolby ProLogic II. Compared to the original hidden reference, the SSLCC system shows little degradation of sound quality on average, although some degradation is visible for two or three test items.

[4] ISO-IEC JTC1/SC29/WG11 (MPEG) Document N6691, Procedures for the evaluation of spatial audio coding systems, Redmond, July 2004.

[5] Klabbers, E., and Veldhuis, R., Reducing audible spectral discontinuities, IEEE Trans. on Speech and Audio Proc., 9, 1(Jan. 2001), 39-51.

[6] ITU-R Recommendation BS.1534-1, Method for the subjective assessment of intermediate sound quality (MUSHRA), International Telecommunication Union, Geneva, Switzerland, 2001.
