
MDCT SPECTRUM SEPARATION: CATCHING THE FINE SPECTRAL STRUCTURES FOR STEREO CODING

Shuhua Zhang, Weibei Dou, Ping Chi, and Huazhong Yang

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

ABSTRACT

The spectrum of a sinusoid under the Modified Discrete Cosine Transform (MDCT), when separated by bin parity into an even subspectrum and an odd subspectrum, exhibits a distinctive property: the subspectral shapes are independent of the sinusoid phase, which contributes only a scaling. Based on this finding, we propose an Even-Odd (EO) scheme for stereo coding, in which the even and odd subspectra are partitioned separately into subbands to capture the fine spectral structures of sinusoidal and rich-tone signals. The scheme reduces the coding noise by 0–20 dB for music signals. When integrated into an MDCT-domain KLT-based stereo coder, the scheme raises subjective listening test (MUSHRA) scores. This coder, called KLT-EO, competes with Parametric Stereo (PS) in quality at a slightly higher bitrate, but without the 20 ms algorithmic delay that results from the PS stereo processing.

Index Terms: Modified discrete cosine transform, stereo coding, sinusoidal analysis, spectrum separation

1. INTRODUCTION

The Modified Discrete Cosine Transform (MDCT [1, 2]) is widely used in audio coding, for example in MP3, AAC, and Ogg Vorbis. Despite the 50% overlap between adjacent transform blocks, the MDCT is still critically sampled, so it smooths out blocking artifacts without a bitrate penalty. Apart from this elegance in signal representation, the MDCT poses great difficulty for spectral processing because of frequency-component aliasing; for example, gain control [3] that works well in the DFT domain produces significant spectral distortion. For this reason, a complex-modulated Quadrature Mirror Filterbank (QMF) substitutes for the MDCT in Spectral Band Replication (SBR [4]) and Parametric Stereo (PS [5]), and the Modified Discrete Sine Transform (MDST) is used in addition to the MDCT in the modified distortion metric for AAC [6] and in MDCT-domain spatial audio coding [7]. Neither strategy is free: they come with higher complexity, which may be mitigated by high-performance hardware, or with additional algorithmic delay, which persists regardless of hardware advances. Pure MDCT-domain processing is therefore attractive, provided that the coding quality is satisfactory.

The root of this difficulty is that the MDCT basis is not shift-invariant [8]. We intend to tackle this problem for pure MDCT-domain stereo coding. A critical observation is that although the complete spectrum of a sinusoid changes shape as the phase varies (Fig. 1, left), the even and odd subspectra have fixed shapes and vary only in scaling (Fig. 1, middle and right). In other words, the subspectral shapes are shift-invariant.

Fig. 1. MDCT spectra of tones with the same frequency but different phases, plotted against $k - \kappa$ for a range of phases, where $k$ is the frequency index and $\kappa$ the tone frequency. Left: the complete spectra around $\kappa$. Middle and right: the even and odd subspectra corresponding to the left panel. The subspectra vary only in scaling, but after being merged they lose this regularity.

A stereo sinusoid may have different phases in the left and right channels, so the spectral shapes of the two channels may not be identical. A single inter-channel intensity ratio is then not sufficient to register the fine spectral structures. This is the pitfall of Intensity Stereo (IS [9]) coding, which is useless below 1.6 kHz because our ears are sensitive to the fine spectral structures in this range. Instead, we may use one intensity ratio for each subspectrum to capture the fine structures. In this way, IS could possibly be extended to the full frequency range while retaining satisfactory quality.

This work was supported by the National Science Foundation of China (NSFC 60832002). The authors thank Dr. Chen for her arrangement of the subjective test.

978-1-4244-4296-6/10/$25.00 ©2010 IEEE. ICASSP 2010.

2. MDCT SPECTRA OF SINUSOIDS

We study the MDCT spectral structures of sinusoidal signals with general symmetric window functions. Two techniques are used:

• expanding the window function in the basis of the type-IV Discrete Sine Transform (DST-IV), so that a windowed sinusoid equals a sum of component sinusoids, which removes the sine-window restriction of [8];

• for the spectra of all these component sinusoids, separating out a common phase-dependent factor and leaving the remaining part independent of phase, i.e., shift-invariant.

In analogy to the DFT, the shift-invariant factor corresponds to amplitude and the phase-dependent factor corresponds to phase.

In the following, Latin characters are reserved for integer variables and Greek characters for real variables.

The MDCT maps $x(n) \in \mathbb{R}^{2M}$ to $X(k) \in \mathbb{R}^{M}$ by

$$X(k) = \sum_{n=0}^{2M-1} w(n)\,x(n)\cos\left[\frac{\pi}{M}\left(n+\frac{1}{2}+\frac{M}{2}\right)\left(k+\frac{1}{2}\right)\right], \tag{1}$$

where $w(n) \in \mathbb{R}^{2M}$ is the window function (prototype filter) satisfying the Princen-Bradley condition [1]:

$$\begin{cases} w(n) = w(2M-1-n) & \text{(Symmetry)} \\ w(n)^2 + w(n+M)^2 = 1 & \text{(Perfect Reconstruction)} \end{cases} \tag{2}$$

Common choices of $w(n)$ include the sine window and the Kaiser-Bessel Derived (KBD) window.

The basis vectors of the DST-IV are $s_l(n) = \sin\left[\frac{\pi}{M}\left(n+\frac{1}{2}\right)\left(l+\frac{1}{2}\right)\right]$ for $l = 0, 1, \cdots, M-1$, which are complete and orthogonal in $\mathbb{R}^{M}$. Note that $s_0(n)$ coincides with the sine window. By extending $n$ to $0, 1, \cdots, 2M-1$, $s_l(n)$ becomes a symmetric vector of length $2M$, and all these extended vectors form a complete orthogonal basis for the symmetric subspace of $\mathbb{R}^{2M}$. Therefore, $w(n)$ can be uniquely expanded as

$$w(n) = \alpha_0 s_0(n) + \alpha_1 s_1(n) + \cdots + \alpha_{M-1} s_{M-1}(n), \tag{3}$$

where $\alpha_l = \frac{1}{M}\langle w(n), s_l(n)\rangle$, with the inner product taken over $n = 0, 1, \cdots, 2M-1$, and $\sum_{l=0}^{M-1}\alpha_l^2 = 1$ due to the perfect reconstruction condition in (2). For the sine window, $\alpha_0 = 1$ and $\alpha_l = 0$ if $l > 0$; for the KBD and other windows, only the first several $\alpha_l$ are significant, e.g., $\sum_{l>4}\alpha_l^2 = 8.78 \times 10^{-7}$ for the long KBD window ($M = 1024$) used in AAC.

Let $x(n) = \sin\left[\frac{\pi}{M}\kappa n + \phi\right]$ be a sinusoid, assuming without loss of generality that $\kappa \in [0, M)$ and $\phi \in [0, 2\pi)$. By (3), $x(n)$ windowed by $w(n)$ can be expanded to

$$w(n)x(n) = \alpha_0 s_0(n)x(n) + \cdots + \alpha_{M-1}s_{M-1}(n)x(n) = \frac{1}{2}\sum_{l=0}^{M-1}\alpha_l\left\{\sin\left[\frac{\pi}{M}\left(\kappa-l-\frac{1}{2}\right)n+\phi-\varphi_l\right] + \sin\left[\frac{\pi}{M}\left(\kappa+l+\frac{1}{2}\right)n+\phi+\varphi_l\right]\right\}, \tag{4}$$

where $\varphi_l \equiv \frac{\pi}{2M}\left(l+\frac{1}{2}\right) - \frac{\pi}{2}$. For each of the component sinusoids in (4), the MDCT coefficients can be derived by summing a $2M$-term sine series. The sine summation formula says

$$\sum_{n=0}^{2M-1}\sin\left[\frac{\pi}{M}\xi n + \psi\right] = D_{2M}(\xi)\sin\left[\left(1-\frac{1}{2M}\right)\pi\xi + \psi\right], \tag{5}$$

where $D_{2M}(\xi) \equiv \sin(\pi\xi)/\sin(\pi\xi/(2M))$ is known as the Dirichlet function or periodic sinc function. The right side of (5) is arranged such that the first factor depends only on $\xi$ and the second factor depends also on $\psi$. They correspond to the shift-invariant factor and the phase-dependent factor mentioned above, respectively, and lead to a representation of the overall spectrum in a similar form.

Denote $V_l(\xi) \equiv D_{2M}(\xi+l) + D_{2M}(\xi-l-1)$. By (1), (4), and (5), the overall MDCT spectrum of the windowed $x(n)$ is

$$X(k) = A(\kappa-k)\sin\left[\theta-\frac{3\pi}{2}k\right] - A(\kappa+k+1)\cos\left[\theta+\frac{3\pi}{2}k\right], \tag{6}$$

where

$$\begin{cases} A(\xi) \equiv \dfrac{1}{4}\displaystyle\sum_{l=0}^{M-1}(-1)^l\alpha_l V_l(\xi) \\ \theta \equiv \left(1-\dfrac{1}{2M}\right)\pi\kappa + \phi - \dfrac{3\pi}{4} \end{cases} \tag{7}$$
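As a sanity check on (6) and (7), the closed form can be compared numerically with a direct evaluation of (1). The sketch below is illustrative and not taken from the paper: the function names, the frame size M = 64, the tone parameters, and the use of the Vorbis power-complementary window as a second test window are all assumptions. It takes V_l(ξ) as the sum D_2M(ξ + l) + D_2M(ξ − l − 1) and obtains α_l by projecting w(n) onto the extended DST-IV basis, normalised so that the sine window gives α_0 = 1.

```python
import numpy as np


def mdct(frame, window):
    """Direct O(M^2) evaluation of Eq. (1) for one frame of 2M samples."""
    m = len(frame) // 2
    n = np.arange(2 * m)
    k = np.arange(m)
    basis = np.cos(np.pi / m * np.outer(k + 0.5, n + 0.5 + m / 2))
    return basis @ (window * frame)


def dst4_coefficients(window):
    """Expansion coefficients alpha_l of a symmetric window, as in Eq. (3)."""
    m = len(window) // 2
    n = np.arange(2 * m)
    l = np.arange(m)
    s = np.sin(np.pi / m * np.outer(l + 0.5, n + 0.5))  # rows are s_l(n), n = 0..2M-1
    return s @ window / m  # each extended s_l has squared norm M


def dirichlet(xi, m):
    """D_2M(xi) = sin(pi xi) / sin(pi xi / 2M), using the limit value at the poles."""
    xi = np.asarray(xi, dtype=float)
    den = np.sin(np.pi * xi / (2 * m))
    at_pole = np.abs(den) < 1e-12
    ratio = np.sin(np.pi * xi) / np.where(at_pole, 1.0, den)
    # At a pole cos(pi xi / 2M) is +1 or -1, so multiplying equals dividing by it.
    limit = 2 * m * np.cos(np.pi * xi) * np.cos(np.pi * xi / (2 * m))
    return np.where(at_pole, limit, ratio)


def envelope(xi, alpha, m):
    """Shift-invariant factor A(xi) of Eq. (7)."""
    xi = np.asarray(xi, dtype=float)[:, None]
    l = np.arange(len(alpha))[None, :]
    v = dirichlet(xi + l, m) + dirichlet(xi - l - 1, m)
    return 0.25 * np.sum((-1.0) ** l * alpha * v, axis=1)


def mdct_closed_form(kappa, phi, alpha, m):
    """Eq. (6) for x(n) = sin(pi / M * kappa * n + phi)."""
    k = np.arange(m)
    theta = (1 - 1 / (2 * m)) * np.pi * kappa + phi - 0.75 * np.pi
    return (envelope(kappa - k, alpha, m) * np.sin(theta - 1.5 * np.pi * k)
            - envelope(kappa + k + 1, alpha, m) * np.cos(theta + 1.5 * np.pi * k))


# Illustrative comparison (all values below are assumptions).
m, kappa, phi = 64, 10.3, 0.7
n = np.arange(2 * m)
x = np.sin(np.pi / m * kappa * n + phi)
windows = {
    "sine": np.sin(np.pi * (n + 0.5) / (2 * m)),
    "vorbis": np.sin(np.pi / 2 * np.sin(np.pi * (n + 0.5) / (2 * m)) ** 2),
}
for name, w in windows.items():
    closed = mdct_closed_form(kappa, phi, dst4_coefficients(w), m)
    print(name, "max |direct - closed form| =", np.max(np.abs(mdct(x, w) - closed)))
```

The direct transform is quadratic in M and is meant only for checking; a production coder would use a fast MDCT.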

Using the first-order Taylor expansion of the sine function, we see that $A(\xi)$ decays to 0 on the order of $1/\xi^2$. Since $\kappa, k \in [0, M)$, except at the boundaries of the spectrum we have $|\kappa-k| < |\kappa+k+1|$ and $|A(\kappa-k)| \gg |A(\kappa+k+1)|$. Therefore (6) can be approximated by

$$X(k) \approx A(\kappa-k)\sin\left[\theta-\frac{3\pi}{2}k\right]. \tag{8}$$

Here $A(\kappa-k)$ is the shift-invariant factor, controlling the envelope of $X(k)$, and $\sin\left[\theta-\frac{3\pi}{2}k\right]$ is the phase-dependent factor, controlling the fluctuation of $X(k)$. Similar results hold for DCT types I–IV (the proof is similar and omitted for brevity).

3. EVEN-ODD MDCT SPECTRUM SEPARATION

3.1. Linearity of Even and Odd Subspectra

Let $x_0(n) = \sin\left[\frac{\pi}{M}\kappa n + \phi_0\right]$ and $x_1(n) = \sin\left[\frac{\pi}{M}\kappa n + \phi_1\right]$ be two sinusoids with the same frequency but different phases. By (8), their MDCT spectra are

$$\begin{cases} X_0(k) \approx A(\kappa-k)\sin\left[\theta_0-\frac{3\pi}{2}k\right] \\ X_1(k) \approx A(\kappa-k)\sin\left[\theta_1-\frac{3\pi}{2}k\right] \end{cases} \tag{9}$$

where $\theta_0$ and $\theta_1$ are defined as $\theta$ in (7). Due to the term $\frac{3\pi}{2}k$, the phase-dependent factors have a period of 4 along $k$:

$$\begin{array}{c|cccccc} k & \cdots & 4\lfloor\kappa/4\rfloor-2 & 4\lfloor\kappa/4\rfloor-1 & 4\lfloor\kappa/4\rfloor & 4\lfloor\kappa/4\rfloor+1 & \cdots \\ \hline x_0 & \cdots & -\sin\theta_0 & -\cos\theta_0 & \sin\theta_0 & \cos\theta_0 & \cdots \\ x_1 & \cdots & -\sin\theta_1 & -\cos\theta_1 & \sin\theta_1 & \cos\theta_1 & \cdots \end{array}$$

and $X_0(k)/X_1(k)$ shows a simple but non-trivial pattern: it is almost constant on alternating bins $k$,

$$\frac{X_0(k)}{X_1(k)} \approx \begin{cases} \dfrac{\sin\theta_0}{\sin\theta_1}, & \text{for } k = 2q \\[2ex] \dfrac{\cos\theta_0}{\cos\theta_1}, & \text{for } k = 2q+1 \end{cases} \tag{10}$$

where $q \in \mathbb{Z}$. Thus the even subspectra, and likewise the odd subspectra, of sinusoids with the same frequency but different phases are linearly related, in other words, fully correlated. This explains the property shown in Fig. 1.

3.2. Musical Signals and Subband Partition

For music signals, a frequent situation is that tones and overtones of the same frequency but different phases are present in different channels. Locally in the frequency domain, these tonal components are approximately sinusoids, and their even and odd subspectra are approximately linearly related. To exploit this linearity for stereo and multichannel coding, we partition the even and odd subspectra separately into subbands (Fig. 2(b)); we call this the even-odd scheme, or simply the EO scheme. The subbands approximate the Bark frequency scale [10], becoming broader toward higher frequencies, to accommodate psychoacoustics. Traditionally, subbands are composed of consecutive coefficients (Fig. 2(a)). Under the EO scheme, one scaling parameter per subband is sufficient to register the difference between the channels of a tonal stereo signal.
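The linear relation in (10) can be checked numerically. The following sketch is an illustration under assumed parameters (M = 256, the sine window, κ = 100.3, phases 1.0 and 2.5, and a hypothetical 16-bin band around the tone), none of which come from the paper. It compares the normalised correlation of the two channels over the whole band with the correlation over the even and odd bins of the same band, and compares the bin ratios with the values predicted by (10).

```python
import numpy as np


def mdct(frame, window):
    """Direct O(M^2) evaluation of the MDCT in Eq. (1)."""
    m = len(frame) // 2
    n = np.arange(2 * m)
    k = np.arange(m)
    basis = np.cos(np.pi / m * np.outer(k + 0.5, n + 0.5 + m / 2))
    return basis @ (window * frame)


def correlation(a, b):
    """Normalised inner product of two real vectors."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def theta(kappa, phi, m):
    """Phase term of Eq. (7)."""
    return (1 - 1 / (2 * m)) * np.pi * kappa + phi - 0.75 * np.pi


# Assumed test setup: same frequency, different phases in the two channels.
m, kappa = 256, 100.3
phi0, phi1 = 1.0, 2.5
n = np.arange(2 * m)
window = np.sin(np.pi * (n + 0.5) / (2 * m))
spec0 = mdct(np.sin(np.pi / m * kappa * n + phi0), window)
spec1 = mdct(np.sin(np.pi / m * kappa * n + phi1), window)

# A hypothetical band of consecutive bins around the tone, split by bin parity.
band = np.arange(92, 108)
even = band[band % 2 == 0]
odd = band[band % 2 == 1]

corr_full = correlation(spec0[band], spec1[band])
corr_even = correlation(spec0[even], spec1[even])
corr_odd = correlation(spec0[odd], spec1[odd])

# Ratios on one even and one odd bin next to the tone, against Eq. (10).
ratio_even = spec0[100] / spec1[100]
ratio_odd = spec0[101] / spec1[101]
pred_even = np.sin(theta(kappa, phi0, m)) / np.sin(theta(kappa, phi1, m))
pred_odd = np.cos(theta(kappa, phi0, m)) / np.cos(theta(kappa, phi1, m))

print("correlation full / even / odd:", corr_full, corr_even, corr_odd)
print("even ratio, measured vs predicted:", ratio_even, pred_even)
print("odd ratio, measured vs predicted:", ratio_odd, pred_odd)
```

The phases are chosen so that neither sin θ nor cos θ is close to zero; when one of them vanishes, the corresponding subspectrum is dominated by the small term dropped in (8) and the ratio in (10) is no longer informative.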

Fig. 2. Partition of an MDCT spectrum into subbands. (a) The traditional scheme: subbands 2b and 2b+1 consist of consecutive bins (2k−4, …, 2k−1 and 2k, …, 2k+3). (b) The even-odd scheme: the even bins (2k−4, 2k−2, 2k, 2k+2) and the odd bins (2k−3, 2k−1, 2k+1, 2k+3) of the same interval form subbands 2b and 2b+1.

3.3. Coding Gain of the EO Scheme

If incoherent non-tonal components arise in different channels, the cross-channel correlation will decrease. In this case, the Karhunen-Loève Transform (KLT [9]) optimally exploits the remaining correlation: two spectral vectors $X_0$ and $X_1$ from the same subband but different channels are orthogonally transformed to a main vector $Y_0$ and a minor vector $Y_1$ as

$$\begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} = \begin{pmatrix} \cos\beta & \sin\beta \\ -\sin\beta & \cos\beta \end{pmatrix}\begin{pmatrix} X_0 \\ X_1 \end{pmatrix}, \tag{11}$$

where $\tan 2\beta = 2\langle X_0, X_1\rangle/(\|X_0\|^2 - \|X_1\|^2)$ to minimize $\|Y_1\|$. If only $Y_0$ and $\beta$ are available to a decoder, the power of the coding error, equal to $\|Y_1\|^2$, will be

$$\varepsilon(B) = \frac{1}{2}\left(\|X_0\|^2 + \|X_1\|^2\right) - \frac{1}{2}\sqrt{\left(\|X_0\|^2 - \|X_1\|^2\right)^2 + 4\langle X_0, X_1\rangle^2}. \tag{12}$$

Generally, the larger $\langle X_0, X_1\rangle^2$ is, the smaller $\varepsilon(B)$ is. Let $B_0, B_1, \cdots, B_{p-1}$ be the subbands partitioned by the EO scheme, and $\bar{B}_0, \bar{B}_1, \cdots, \bar{B}_{p-1}$ be the subbands partitioned by the traditional scheme. We define the coding gain as the ratio of the overall coding error power between the two schemes:

$$G = 10\log_{10}\frac{\sum_{b=0}^{p-1}\varepsilon(\bar{B}_b)}{\sum_{b=0}^{p-1}\varepsilon(B_b)}. \tag{13}$$

Due to (10), the subspectra of sinusoids from different channels are almost fully correlated, but the complete spectra are not. We therefore expect a high coding gain for pure sinusoids and rich-tone signals.

Numerical simulation supports this expectation. We compute the distribution of the gain with a constant frame length of $M = 1024$ (AAC long window). Each MDCT spectrum is first partitioned into 24 intervals following approximately the Bark scale; each interval then forms an even subband and an odd subband (Fig. 2(b)), or a first-half subband and a second-half subband (Fig. 2(a)). In both cases we have $p = 48$ subbands, with which the gain is computed for each frame. For stereo sinusoids with uniform random phases in $[0, 2\pi)$, the gain is mostly around 90 dB (Fig. 3(a)); for speech (Fig. 3(b)) and music (Fig. 3(c)(d)), it is mostly within 0–20 dB. Since lower coding noise generally leads to higher audio quality, the EO scheme will probably boost stereo coding performance for rich-tone signals.

Fig. 3. Probability distribution of the coding gain for stereo signals (horizontal axis: coding gain in dB; vertical axis: probability density). (a) Sinusoids with uniformly distributed initial phase, (b) female speech moving, (c) pitch pipe, (d) trumpet solo and orchestra.
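To make (12) and (13) concrete, the sketch below computes the coding gain for one frame of a stereo sinusoid. It is illustrative only: the frame size M = 256, the sine window, the tone parameters, and in particular the 24 interval widths (a crude stand-in for a Bark-like partition, not the table used in the simulation above) are assumptions, as are all function names.

```python
import numpy as np


def mdct(frame, window):
    """Direct O(M^2) evaluation of the MDCT in Eq. (1)."""
    m = len(frame) // 2
    n = np.arange(2 * m)
    k = np.arange(m)
    basis = np.cos(np.pi / m * np.outer(k + 0.5, n + 0.5 + m / 2))
    return basis @ (window * frame)


def klt_error(x0, x1):
    """Coding error power of Eq. (12) when only the main KLT vector is kept."""
    a, b, c = x0 @ x0, x1 @ x1, x0 @ x1
    err = 0.5 * (a + b) - 0.5 * np.sqrt((a - b) ** 2 + 4.0 * c * c)
    return max(float(err), 0.0)  # clamp tiny negative values caused by rounding


def eo_subbands(edges):
    """EO scheme: every interval yields an even-bin and an odd-bin subband."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.arange(lo, hi)
        bands += [idx[idx % 2 == 0], idx[idx % 2 == 1]]
    return bands


def half_subbands(edges):
    """Traditional scheme: every interval yields two runs of consecutive bins."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mid = (lo + hi) // 2
        bands += [np.arange(lo, mid), np.arange(mid, hi)]
    return bands


def total_error(spec0, spec1, bands):
    return sum(klt_error(spec0[b], spec1[b]) for b in bands)


def coding_gain_db(spec0, spec1, edges):
    """Coding gain of Eq. (13) for one frame."""
    err_trad = total_error(spec0, spec1, half_subbands(edges))
    err_eo = total_error(spec0, spec1, eo_subbands(edges))
    return 10.0 * np.log10(err_trad / max(err_eo, np.finfo(float).tiny))


# Hypothetical 24-interval partition of M = 256 bins (widths 4, 8 and 20).
widths = [4] * 8 + [8] * 8 + [20] * 8
edges = np.concatenate(([0], np.cumsum(widths)))

m, kappa = 256, 100.3
n = np.arange(2 * m)
window = np.sin(np.pi * (n + 0.5) / (2 * m))
left = mdct(np.sin(np.pi / m * kappa * n + 1.0), window)
right = mdct(np.sin(np.pi / m * kappa * n + 2.5), window)

gain = coding_gain_db(left, right, edges)
print("coding gain of the EO scheme for this frame (dB):", gain)
```

With 24 intervals both partitions produce p = 48 subbands, mirroring the setup described above; the value printed for this toy frame should not be read as a reproduction of Fig. 3.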


Fig. 4. The KLT-EO structure: MDCT domain stereo coding based on the KLT and the EO scheme. Top encoder, bottom decoder.

4. EXPERIMENTS

We construct an MDCT-domain stereo coder from the KLT and the EO scheme, called KLT-EO, for two reasons: the KLT is the optimal orthogonal transform for compression of correlated signals [9], and the EO scheme increases the subband cross-channel spectral correlation for sinusoidal and rich-tone signals.

On the encoder side of KLT-EO (Fig. 4, top), signal blocks from the left (L) and right (R) channels first go through time-to-MDCT-domain mapping (T/M). The spectra are then separated (S) into even subbands (e) and odd subbands (o). On each subband, the two spectral vectors from both channels are downmixed to one vector by the KLT (K/D), i.e., only the main vector $Y_0$ in (11) is kept, and the rotation angle ($\beta$) is quantized and Huffman coded (P/C) to form the parameter bitstream (P/B). The downmixed subspectra are merged (M) into a complete MDCT spectrum, which is sent to a core coder (C/C) such as AAC to generate the core bitstream (C/B).

On the decoder side (Fig. 4, bottom), this process is inverted: C/C changes to the core decoder (C/D), P/C to the parameter decoder (P/D), K/D to KLT upmixing (K/U), which is the inverse of (11) with $Y_1$ set to 0, and T/M to MDCT-to-time-domain mapping (M/T).
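The per-subband KLT downmix and upmix can be sketched as follows. This is a minimal illustration, not the coder's implementation: quantization and Huffman coding of β, the core coder, and the actual subband table are left out, and the function names and the representation of subbands as a list of index arrays are assumptions. The angle is computed with arctan2 so that, of the two solutions of the tan 2β condition, the one that minimises the minor vector is selected.

```python
import numpy as np


def eo_subbands(edges):
    """Split each interval [lo, hi) into an even-bin and an odd-bin subband."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.arange(lo, hi)
        bands += [idx[idx % 2 == 0], idx[idx % 2 == 1]]
    return bands


def rotation_angle(x0, x1):
    """KLT rotation angle beta of Eq. (11) for one pair of subband vectors."""
    a, b, c = x0 @ x0, x1 @ x1, x0 @ x1
    return 0.5 * np.arctan2(2.0 * c, a - b)


def klt_eo_downmix(spec0, spec1, subbands):
    """Encoder side: keep only the main vector Y0 on every subband, then merge."""
    merged = np.zeros(len(spec0))
    betas = np.empty(len(subbands))
    for i, idx in enumerate(subbands):
        beta = rotation_angle(spec0[idx], spec1[idx])
        merged[idx] = np.cos(beta) * spec0[idx] + np.sin(beta) * spec1[idx]
        betas[i] = beta
    return merged, betas


def klt_eo_upmix(merged, betas, subbands):
    """Decoder side: invert the rotation of Eq. (11) with Y1 set to zero."""
    out0 = np.zeros(len(merged))
    out1 = np.zeros(len(merged))
    for beta, idx in zip(betas, subbands):
        out0[idx] = np.cos(beta) * merged[idx]
        out1[idx] = np.sin(beta) * merged[idx]
    return out0, out1


# Toy example: the right channel is a scaled copy of the left channel, with a
# different scale on even and odd bins, as Eq. (10) predicts for a pure tone.
rng = np.random.default_rng(0)
bands = eo_subbands([0, 8, 16])
left = rng.standard_normal(16)
right = left.copy()
right[0::2] *= 0.5
right[1::2] *= -2.0

mono, angles = klt_eo_downmix(left, right, bands)
left_hat, right_hat = klt_eo_upmix(mono, angles, bands)
print("max reconstruction error:",
      max(np.max(np.abs(left - left_hat)), np.max(np.abs(right - right_hat))))
```

When the two channels are exactly proportional on every subband, as in this toy input, the upmix recovers both channels from the single downmixed spectrum; for general input the residual power on each subband is the quantity given in (12).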

To evaluate the performance of the coder, we arranged a MUSHRA [11] subjective listening test. Using headphones, twelve young subjects graded nine groups of hidden references, anchors, and processed sequences from 0 (worst) to 100 (transparent), based on their perceived distortion relative to the given references. The reference set consisted of three speech and vocal, three single-instrument, and three multi-instrument stereo test sequences from 3GPP and MPEG, all sampled at 48 kHz. Two types of anchors were used: 3.5 kHz low-pass filtered and mono-downmixed references.

To single out stereo processing performance, core coding was bypassed. For each 1024-point frame, KLT-EO produces 48 rotation angles, one from each subband, corresponding to a parameter bitrate of 4.1–5.2 kb/s after quantization and Huffman coding. For comparison, the test sequences were also processed by a plain MDCT-domain KLT stereo coder without the EO scheme (called KLT-FS) but otherwise the same as our coder, and by the 20-subband mode PS coder in the 3GPP EAAC+ [12], stripped of core coding and SBR, with a parameter bitrate of 1.9–2.9 kb/s.

Fig. 5 presents the MUSHRA scores in three groups. For speech and vocal signals, KLT-EO and KLT-FS have similar performance; for multi-instrument signals, KLT-EO performs slightly better than KLT-FS; for the single-instrument signals (pitch pipe, glockenspiel, and plucked strings), KLT-EO outperforms KLT-FS by a large margin of 29 MUSHRA points. The single-instrument signals are rich in tonal components, to which our ears are very sensitive. The spectrum separation in KLT-EO is critical to catching the fine spectral structure of this kind of signal and leads to the performance boost. Even with more subbands and parameters, we experienced severe harmonic distortions in KLT-FS, making the sounds tremble; this is most pronounced for the pitch pipe signal.

Fig. 5. MUSHRA scores (0–100), mean and 95% confidence interval, grouped into Speech and Vocal, Single Instrument, and Multi Instruments. ‘ref’ is for reference; ‘ps20’ for PS with 20 subbands; ‘fs48’ for KLT-FS with 48 subbands; ‘eo48’ for KLT-EO with 48 subbands; ‘mono’ for mono downmixed; and ‘3.5k’ for 3.5 kHz low-pass filtered.

Generally, KLT-EO performs as well as PS, but at a bitrate about 2 kb/s higher, due to the larger number of subbands used to compensate for the lack of coherence processing. A distinct advantage of our coder is that it adds no algorithmic delay. PS has to budget 20 ms to bridge the QMF and the MDCT (this delay is currently shared with SBR in EAAC+). In KLT-EO, the stereo processing and the core coding (AAC) both work in the MDCT domain, so this delay is avoided. This amounts to a significant delay reduction for real-time two-way communications, while the bitrate increase is relatively small.

5. CONCLUSIONS

We have proved a distinctive property of the MDCT spectra of sinusoids: the even and odd subspectra have phase-independent shapes. This translates into a performance boost for stereo coding by exploiting the cross-channel correlation of the MDCT subspectra, which is most pronounced in rich-tone stereo signals. Apart from low-delay MDCT-domain stereo coding, the property may be used to enhance traditional coding schemes, such as Intensity Stereo for lower coding noise and Mid/Side Stereo for lower side-channel power. This property is also shared by DCTs of types I–IV, so the spectrum separation will work in these domains for compression of correlated signals with rich tonal components.

6. REFERENCES

[1] J. Princen and A. Bradley, “Analysis/synthesis filter bank design based on time domain aliasing cancellation,” IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 5, pp. 1153–1161, Oct. 1986.

[2] H. Malvar, “Lapped transforms for efficient transform/subband coding,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 6, pp. 969–978, June 1990.

[3] F. Kuech and B. Edler, “Aliasing reduction for modified discrete cosine transform domain filtering and its application to speech enhancement,” in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., New York, Oct. 21–24, 2007, pp. 131–134.

[4] P. Ekstrand, “Bandwidth extension of audio signals by spectral band replication,” in Proc. 1st IEEE Benelux Workshop on Model Based Process. and Coding of Audio (MPCA-2002), Leuven, Nov. 2002, pp. 53–58.

[5] J. Breebaart, S. van de Par, A. Kohlrausch, and E. Schuijers, “Parametric coding of stereo audio,” EURASIP J. Appl. Signal Process., pp. 1305–1322, Sept. 2005.

[6] V. Melkote and K. Rose, “A modified distortion metric for audio coding,” in Proc. IEEE Int. Conf. Audio Speech Signal Process., Taipei, Apr. 2009, pp. 17–21.

[7] S. Chen, R. Hu, and S. Zhang, “Estimating spatial cues for audio coding in MDCT domain,” in Proc. IEEE Int. Conf. Multimedia Expo, July 2009, pp. 53–56.

[8] L. Daudet and M. Sandler, “MDCT analysis of sinusoids: exact results and applications to coding artifacts reduction,” IEEE Trans. Speech Audio Process., vol. 12, no. 3, pp. 302–312, May 2004.

[9] R.G. van der Waal and R.N.J. Veldhuis, “Subband coding of stereophonic digital audio signals,” in Proc. IEEE Int. Conf. Audio Speech Signal Process., Toronto, Apr. 1991, vol. 5, pp. 3601–3604.

[10] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. Berlin Heidelberg, Germany: Springer-Verlag, 1990.

[11] ITU-R BS.1534-1: Method for the subjective assessment of intermediate quality levels coding systems, ITU, 2003.

[12] 3GPP TS 26.410: General audio codec audio processing functions; enhanced aacplus general audio codec; floating-point ANSI-C code, 2008 [online]. Available: http://www.3gpp.org/ftp/Specs/html-info/26410.htm