Sound Source Location Cue Coding System for Compact Representation of Multi-channel Audio
Inseon Jang, Jeongil Seo, Seungkwon Beack, Kyeongok Kang
Han-gil Moon
Broadcasting Media Research Group, Electronics and Telecommunications Research Institute (ETRI) 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-350 KOREA +82-42-860-5791
Digital Media R&D Center, Samsung Electronics Co., Ltd. 416 Maetan-dong, Yeongtong-gu, Suwon-si, Gyeonggido 443-742 KOREA +82-31-277-8076
{jinsn, seoji,skbeack,kokang}@etri.re.kr
[email protected]
compact representation method for multi-channel audio with an encoding process, while preserving backward compatibility with legacy mono/stereo audio coding schemes such as MPEG Audio and MPEG-2/4 AAC. MPEG Surround has two coded parts: a down-mixed signal that retains backward compatibility, and low-bitrate side information for recovering the spatial image of the multi-channel audio.
ABSTRACT
Binaural cue coding (BCC) has been introduced for compact representation of multi-channel audio. It exploits binaural cue parameters to capture the spatial image of multi-channel audio. Recently, it has been standardized within MPEG under the name “MPEG Surround.” In this paper, we propose a sound source location cue coding (SSLCC) system for compressing multi-channel audio for narrow-bandwidth transmission environments. To improve the compression ability of the conventional BCC, the SSLCC system uses the virtual source location information (VSLI) as a spatial cue parameter instead of the inter-channel level difference (ICLD) of the BCC system. The SSLCC system also adopts enhanced pre/post-processing algorithms to improve perceptual sound quality. Objective and subjective assessment results show that the proposed SSLCC system performs better than the conventional BCC system.
Binaural Cue Coding (BCC) [1] is known to be the basis of the MPEG Surround system. BCC represents multi-channel signals as a down-mixed audio signal and BCC side information such as the inter-channel level difference (ICLD), inter-channel time difference (ICTD), and inter-channel correlation (ICC).
General Terms: Design
In this paper, we propose a sound source location cue coding (SSLCC) system, which is an extension of the conventional BCC for extremely low bitrate representation of side information. The SSLCC estimates virtual source location information (VSLI) [2] as the side information. The VSLI represents the geometric spatial information using an angle, which has a finite dynamic range. The VSLI can therefore be quantized with a consistent finite maximum level while preserving the spatial sound image. Thanks to this property, the SSLCC system can represent the original multi-channel audio with robustness to quantization distortion.
Categories and Subject Descriptors
E.4 [Coding and Information Theory]: Data compaction and compression.
Keywords
Spatial Audio Coding, BCC, VSLI, Sound source location cue coding
2. Virtual Source Location Information
VSLI is an angle that represents the geometric spatial information between inter-channel power vectors, rather than a power ratio such as the conventional ICLD. It is extracted under the assumption that the playback layout of the multi-channel loudspeakers is fixed, as illustrated in Fig. 1. Because adjacent inter-channel power vectors form a spatial audio image between adjacent loudspeakers, there are five possible spatial sound images, denoted S1, S2, S3, S4, and S5. Hence a spatial image can be represented by one angle between the adjacent loudspeakers.
1. INTRODUCTION
MPEG Surround technology has emerged in MPEG as a new coding scheme for multi-channel audio. It provides extremely
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’05, November 6–11, 2005, Singapore. Copyright 2005 ACM 1-59593-044-2/05/0011…$5.00.
To evaluate the angle information of a spatial image, the power panning law [2] and the notion of time-frequency plane are
The advantage of the VSLI cue is that the angle has a finite dynamic range. The angle as a spatial cue can therefore be quantized with a consistent, finite maximum level while preserving the spatial sound image. In contrast, the dynamic range of the conventional ICLD depends on the unpredictably varying channel power. An inappropriately chosen maximum quantization level can then distort the spectrum of the original audio signals and, as a result, degrade the perceptual quality of the reconstructed signals.
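The contrast between the two cues can be illustrated with a minimal sketch, assuming a uniform scalar quantizer (the paper does not specify the quantizer design). The function name `quantize_uniform`, the angle range [0, π/2], and the example values are illustrative assumptions, not part of the proposed system. A bounded angle always falls inside the quantizer range, whereas a level difference beyond the chosen maximum is clipped.

```python
import math

def quantize_uniform(value, lo, hi, levels):
    """Uniformly quantize `value` onto `levels` points spanning [lo, hi].

    Values outside [lo, hi] are clipped first. Returns (index, reconstruction).
    """
    clipped = min(max(value, lo), hi)
    step = (hi - lo) / (levels - 1)
    index = round((clipped - lo) / step)
    return index, lo + index * step

# A panning angle is bounded, so it never leaves the assumed range [0, pi/2].
angle = math.acos(0.6)
_, angle_hat = quantize_uniform(angle, 0.0, math.pi / 2, levels=31)
angle_error = abs(angle - angle_hat)   # at most half a quantizer step

# A level difference is unbounded; a hypothetical 40 dB value is clipped to 24 dB.
_, icld_hat = quantize_uniform(40.0, -24.0, 24.0, levels=31)
icld_error = abs(40.0 - icld_hat)      # about 16 dB, caused by clipping, not step size
```

In this sketch the angle error is bounded by the step size alone, while the level-difference error grows with how far the value lies outside the chosen range.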
[Figure 1 labels: loudspeakers R (i=1), C (i=2), L (i=3); sound images S1, S2, S3, S5; inter-loudspeaker angles 30°, 30°, 80°, 80°]
3. Sound Source Location Cue Coding System
An SSLCC encoder scheme is shown in Fig. 2. The SSLCC encoder receives an N-channel audio signal and performs an FFT, as in the conventional BCC-based spatial audio coding (SAC) encoder. The spectra derived from the FFT for each channel are partitioned into 20 bands approximating the ERB (Equivalent Rectangular Bandwidth) scale. Energy vectors in each band are then derived from the energy information of the partitioned spectra, and the angles of the sound images for all bands are estimated by the VSLI analyzer by means of the inverse power panning law. After quantization, the side information stream is obtained by Huffman coding the differences of the quantized VSLI indices. Power equalization is applied to the down-mixed signal so that the power of each channel is preserved between the original and the synthesized multi-channel audio signals. Conventional AAC encoding is then performed. Finally, the down-mixed audio stream and the side information stream are combined in a MUX, and the resulting SSLCC stream is transmitted.
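The differencing step that precedes the Huffman coder can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not say whether differences are taken across bands or across frames, so the sketch assumes differences between adjacent bands within one frame, and the Huffman stage itself is omitted. The function names and the index values are hypothetical.

```python
def delta_encode(indices):
    """Replace each quantizer index by its difference from the previous one."""
    previous = 0
    deltas = []
    for index in indices:
        deltas.append(index - previous)
        previous = index
    return deltas

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    total = 0
    indices = []
    for delta in deltas:
        total += delta
        indices.append(total)
    return indices

# Hypothetical quantized VSLI indices for eight adjacent bands of one frame.
band_indices = [14, 15, 15, 16, 15, 15, 14, 14]
deltas = delta_encode(band_indices)   # [14, 1, 0, 1, -1, 0, -1, 0]
restored = delta_decode(deltas)       # identical to band_indices
```

When neighboring indices are similar, the differences cluster around zero, which is the kind of skewed distribution an entropy coder such as a Huffman coder can exploit.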
[Figure 1 labels, continued: loudspeakers Ls (i=4), Rs (i=5); sound image S4; inter-loudspeaker angle 140°]
Figure 1. Playback loudspeaker layout for 5-channel audio.
employed. Let $L_{b,k}^{i}$ be the level information of the b-th partition at the k-th time frame in the i-th channel. To estimate the power vector of a spatial sound image between channels, both the level information and the angle between adjacent channels are indispensable. The angle is estimated from the level information $\{L_{b,k}^{i}, L_{b,k}^{i+1}\}$ of two adjacent channels with the help of the power panning law. To apply the power panning law inversely, the level information of the i-th channel and the (i+1)-th channel needs to be normalized, as given below.
$$g_{b,k}^{i} = \frac{L_{b,k}^{i}}{\sqrt{\left(L_{b,k}^{i}\right)^{2} + \left(L_{b,k}^{i+1}\right)^{2}}}, \tag{1}$$
$$g_{b,k}^{i+1} = \frac{L_{b,k}^{i+1}}{\sqrt{\left(L_{b,k}^{i}\right)^{2} + \left(L_{b,k}^{i+1}\right)^{2}}}. \tag{2}$$
Then, the power panning law is applied inversely to the normalized adjacent channel gains $\{g_{b,k}^{i}, g_{b,k}^{i+1}\}$ of partition b at time k for position estimation as follows:
$$\theta_{b,k}^{i} = \cos^{-1} g_{b,k}^{i}, \tag{3}$$
$$\theta_{b,k}^{i+1} = \cos^{-1} g_{b,k}^{i+1}. \tag{4}$$
Finally, the angle giving the location of the virtual source between the i-th channel and the (i+1)-th channel is given as
$$\theta_{vs} = \theta_{b,k}^{i,i+1} \times \frac{2}{\pi} \times \left(\theta_{i+1} - \theta_{i}\right) + \theta_{i}. \tag{5}$$
[Figure 2 block labels: Multi-channel Audio, Windowing, FFT, ERB Filter Bank, Power Normalization, VSLI Analyzer, Quantization, Lossless Encoding, Downmix, MPEG-4 AAC Encoder, Downmix Audio Stream, Side Information Stream, MUX, SSLCC Stream]
Figure 2. The schematic diagram of SSLCC encoder.
An SSLCC decoder scheme is shown in Fig. 3. The SSLCC stream is received by a DEMUX and split into a side information stream and a down-mixed audio stream. VSLI and ICC values per band are extracted from the side information stream after lossless decoding and de-quantization. Meanwhile, the down-mixed audio stream is decoded into a mono/stereo audio signal by a conventional AAC decoder.
The estimated angle value is valid only for the b-th partition of the k-th frame between the i-th channel and the (i+1)-th channel. Therefore, one angle value per partition band is estimated for each pair of adjacent channels, and these values are transmitted as the spatial cues for one frame.
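The estimation in Eqs. (1), (3), and (5) can be sketched for a single band and a single pair of adjacent channels. This is a minimal sketch under stated assumptions: the normalization divides by the root of the summed squared levels so that the squared gains sum to one, the panning angle in Eq. (5) is taken to be the angle of Eq. (3), a silent band is mapped to the midpoint, and the loudspeaker azimuths are passed in degrees. The function name `estimate_vsli` is hypothetical.

```python
import math

def estimate_vsli(level_i, level_next, theta_i, theta_next):
    """Estimate the virtual source angle between two adjacent loudspeakers.

    level_i, level_next: non-negative band levels of channels i and i+1
    theta_i, theta_next: loudspeaker azimuths in degrees
    """
    norm = math.hypot(level_i, level_next)
    if norm == 0.0:
        # Assumption: a silent band is placed midway between the loudspeakers.
        return 0.5 * (theta_i + theta_next)
    gain_i = level_i / norm                        # Eq. (1)
    panning_angle = math.acos(min(1.0, gain_i))    # Eq. (3), lies in [0, pi/2]
    # Eq. (5): map [0, pi/2] onto the arc between the two loudspeakers.
    return panning_angle * (2.0 / math.pi) * (theta_next - theta_i) + theta_i

# Equal levels in two loudspeakers assumed at 0 and 30 degrees: about 15 degrees.
midpoint = estimate_vsli(1.0, 1.0, 0.0, 30.0)
```

With all the energy in channel i the estimate coincides with that loudspeaker's azimuth, and with all the energy in channel i+1 it coincides with the other one, so the result always stays within the arc spanned by the pair.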
[Figure 3 block labels: SSLCC Stream, DEMUX, Side Information, Lossless Decoding, De-Quantization, VSLI, ICC, Downmix Audio, MPEG-4 AAC Decoder, Mono/Stereo Audio, Windowing, FFT, ERB Filter Bank, SSLCC Synthesizer, Level Modification, Correlation Modification]
entropy measurement (Kullback-Leibler distance) [5]. The KL distance is a well-known distance measure for estimating perceptual distortion. It is defined as
$$D_{KL} = \int \left( P(\omega) - Q(\omega) \right) \log \frac{P(\omega)}{Q(\omega)} \, d\omega, \tag{6}$$
where $P(\omega)$ and $Q(\omega)$ are the power spectra of the reference and the decoded signal, respectively. Several KL distances were calculated from the signals reconstructed using ICLD (with ±24 dB) and VSLI (with various quantization levels). Figure 4 shows the total accumulated KL distance for several quantization levels. It indicates that the spectral distortion caused by VSLI is consistently smaller than that caused by ICLD, a sign of robustness to quantization.
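A discrete approximation of Eq. (6) can be sketched as below. The paper does not describe how the integral is evaluated, so this sketch assumes power spectra sampled on a common uniform frequency grid and replaces the integral with a sum. The function name `kl_distance`, the grid spacing `d_omega`, and the small `floor` guarding against a logarithm of zero are assumptions.

```python
import math

def kl_distance(p, q, d_omega=1.0, floor=1e-12):
    """Approximate D_KL = integral of (P - Q) * log(P / Q) over frequency.

    p, q: power spectra sampled on the same uniform frequency grid
    d_omega: grid spacing used to turn the sum into an integral estimate
    floor: small positive value that keeps the logarithm finite
    """
    if len(p) != len(q):
        raise ValueError("spectra must have the same length")
    total = 0.0
    for p_w, q_w in zip(p, q):
        p_w = max(p_w, floor)
        q_w = max(q_w, floor)
        # Each term is non-negative because (P - Q) and log(P / Q) share a sign.
        total += (p_w - q_w) * math.log(p_w / q_w)
    return total * d_omega

reference = [1.0, 4.0]   # hypothetical two-bin power spectra
decoded = [2.0, 2.0]
distance = kl_distance(reference, decoded)
```

This form is symmetric in P and Q and is zero only when the two spectra agree at every bin, which makes it usable as a distance between a reference and a decoded spectrum.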
Table 1. Test material
Index  Material        Description
A      Applause        Ambience
B      ARL_Applause    Ambience
C      Chostakovitch   Music (back: direct)
D      Fountain_music  Pathological
E      Glock           Pathological
F      Indie2          Movie sound
G      Jackson1        Music (back: ambience)
H      Pops            Music (back: direct)
I      Poulenc         Music (back: direct)
J      Rock_concert    Music (back: ambience)
K      Stomp           Movie sound
[Figure 3 block labels (partial): IFFT, Multi-channel Audio]
Figure 3. The schematic diagram of SSLCC decoder.
The gains per band are used as key information to recover the N multi-channel outputs from the downmix spectral values. An SSLCC synthesizer reconstructs the N channel gains per band by means of the power panning law, using the VSLI and the decoded down-mixed audio signal. It then regulates the correlation by adding random noise to the audio spectrum when the ICC value exceeds a threshold. The final N-channel audio outputs are derived from the N channel spectra after the IFFT. To reduce the sound quality degradation incurred when compressing and synthesizing multi-channel audio signals, the SSLCC uses the following novel methods. First, it removes the spectral degradation of the power-equalized down-mixed audio signal at frame boundaries with a new power equalization method, which groups the low-frequency bands into one. Second, it prevents abrupt variations of the spectrum at subband boundaries with a parameter smoothing method applied while synthesizing the spectrum of the multi-channel audio signal from the spatial cue parameters. Third, it reduces the correlation between synthesized channel signals with an all-pass filter controlled by the ICC, so that the sound image of the original multi-channel audio signal is maintained.
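The gain recovery performed by the SSLCC synthesizer is described only in words, so the following is a rough sketch rather than the authors' procedure. It assumes that the decoder inverts Eq. (5) to obtain a panning angle and then applies the power panning law in the forward direction (cosine and sine gains whose squares sum to one) to one pair of adjacent channels. Power equalization, parameter smoothing, and ICC-controlled decorrelation are left out. The function names are hypothetical.

```python
import math

def gains_from_vsli(theta_vs, theta_i, theta_next):
    """Recover the normalized gains of channels i and i+1 from a decoded angle.

    Inverts Eq. (5) to get the panning angle, then uses g_i = cos and
    g_{i+1} = sin, so that the squared gains sum to one.
    """
    fraction = (theta_vs - theta_i) / (theta_next - theta_i)
    fraction = min(1.0, max(0.0, fraction))   # keep the angle inside the pair
    panning_angle = fraction * math.pi / 2.0
    return math.cos(panning_angle), math.sin(panning_angle)

def synthesize_band(downmix_bins, gain):
    """Scale the downmix spectral values of one band by a channel gain."""
    return [gain * x for x in downmix_bins]

# A decoded angle midway between loudspeakers assumed at 0 and 30 degrees.
gain_i, gain_next = gains_from_vsli(15.0, 0.0, 30.0)
band_i = synthesize_band([1.0, -2.0], gain_i)
```

Because the gain pair lies on the unit circle, the summed power of the two synthesized channels in a band equals the power of the downmix band in this simplified setting.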
[Figure 4 plot: vertical axis labeled (D SKL), ticks from 1040 to 1240; horizontal axis (Q) with values 7, 15, 31; curves: Quantized ICLD, Quantized VSLI]
Figure 4. Comparison of $D_{KL}$ between quantized VSLI and quantized ICLD.
4. Experiments
The robustness of the proposed VSLI to quantization distortion is verified by objective and perceptual preference measures. For these experiments we used the eleven multi-channel items listed in Table 1, which were used in the MPEG-4 SAC standardization work [4].
For the subjective assessment, the four systems listed in Table 2 were used. The subjective listening test was performed according to the MUSHRA test methodology [6]. To obtain a fair assessment of the quality of the test items, the listening panel was composed of 10 experienced listeners. They were asked to judge the ‘Basic Audio Quality’ of the 11 test items in each trial.
In the first experiment, the performance of VSLI is compared objectively with that of the conventional ICLD. The objective assessment was performed with the method of relative
Table 2. Multi-channel audio systems under test
System             Description
Hidden Reference   Original
Anchor             3.5 kHz lowpass filtered version
Additional anchor  Dolby Prologic II
Proposed system    5-2-5 SSLCC
5. Conclusion
This paper presented an SSLCC system, which is an extension of the conventional BCC for compact representation of multi-channel audio. The SSLCC system estimates virtual source location information (VSLI) as side information. We performed objective and subjective assessments to verify the robustness of the proposed system to quantization distortion. The objective assessment used the KL distance as a relative entropy measure. The overall distance for VSLI was found to be smaller than that for ICLD, which implies that the spectral distortion of the signal reconstructed by the VSLI-based SSLCC system is smaller than that of the ICLD-based SAC system. The subjective assessment was performed according to the MUSHRA test methodology. It verified that the proposed SSLCC system is psychoacoustically effective.
6. Acknowledgements
This work is being supported by the Ministry of Information and Communication, Republic of Korea, under the title of “SmarTV Technology Development.”
Figure 5. Listening test results on 5-2-5 SSLCC system.
Fig. 5 shows the average scores and the 95% confidence intervals. The SSLCC 5-2-5 system (stereo downmix) has significantly better quality than Dolby ProLogic II. Compared to the hidden reference, the SSLCC system shows little degradation of sound quality on average, although it shows some degradation for two or three test items.
7. REFERENCES
[1] Faller, C., and Baumgarte, F., Binaural cue coding-part II: schemes and applications, IEEE Trans. on Speech and Audio Proc., 11, 6, Nov. 2003.
[2] Sang Bae Jeon, In Yong Choi, Han-gil Moon, Jeongil Seo, and Keong-mo Sung, Virtual Source Location Information for Binaural Cue Coding, 119th AES Convention, Oct. 2005.
[3] Pulkki, V., Virtual sound source positioning using vector base amplitude panning, J. Audio Eng. Soc., 45 (June 1997), 456-466.
[4] ISO-IEC JTC1/SC29/WG11 (MPEG) Document N6691, Procedures for the evaluation of spatial audio coding systems, Redmond, July 2004.
[5] Klabbers, E., and Veldhuis, R., Reducing audible spectral discontinuities, IEEE Trans. on Speech and Audio Proc., 9, 1 (Jan. 2001), 39-51.
[6] ITU-R Recommendation BS.1534-1, Method for the subjective assessment of intermediate sound quality (MUSHRA), International Telecommunication Union, Geneva, Switzerland, 2001.