Natural Sound Rendering for Headphones

1 downloads 0 Views 3MB Size Report
Feb 16, 2016 - NMF: [UWH07]. Neural network: [UhP08] ... Model: One source, amplitude panned. Time-shifting. Chapter 5. [HTG13 .... platform for natural 3D sound rendering,” Show & Tell at ICASSP 2016, Shanghai, China, Mar. 2016.
Spatial Audio Reproduction Using Primary Ambient Extraction PhD candidate: Jianjun HE Supervisor: Assoc Prof Gan Woon-Seng Digital Signal Processing Laboratory School of Electrical and Electronic Engineering Nanyang Technological University, Singapore 16th Feb, 2016 J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 1 / 165of 53

Outline I. Background of Spatial Audio Reproduction II. Overview of Primary Ambient Extraction (PAE) III. My Contributions 1. 2. 3. 4. 5.

Linear Estimation based PAE Ambient Spectrum Estimation based PAE Time Shifting based PAE Multi-Source and Multichannel based PAE Natural Sound Rendering for Headphones

IV. Conclusions and Future work

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 2 / 265of 53

Hearing is Believing

1. What/where is the sound? 2. How is the sound played? 3. How is the sound heard?

Immersive visual experience J He (NTU)

Immersive listening experience

Sound content processing

Spatial Audio Reproduction Using Primary Ambient Extraction

Personal listening cues 16th Feb 2016

Slide 3 / 365of 53

Spatial hearing of sound source Azimuth, Elevation, Distance

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 4 / 465of 53

Head Related Transfer Functions (HRTF) ITD: Interaural Time Difference ILD: Interaural Level Difference SC : Spectral Cues

Head Related Transfer Functions (HRTF) 30 Contralateral ear response Ipsilateral ear response

20

Magnitude in dB

10 0 -10 -20

30o

-30 -40

L

R

-50 2 10

3

10

4

10

Frequency in Hz HRTF of KEMAR dummy head for an angle of 30 degree azimuth*

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 5 / 565of 53

Spatial hearing of sound scene

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 6 / 665of 53

Source-medium-receiver view of spatial audio reproduction

Immersive Audio

“being there”

Source

Medium

Audio content Channelbased

Playback system

Loudspeakers

Object-based

Receiver Human ear Ear A

Ear B Headphones

Transformdomainbased J He (NTU)

Etc

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 7 / 765of 53

Achieving consistency in source and medium in spatial audio

Immersive Audio

“being there”

Source

Medium

Channel-based

Playback system

Receiver Human ear

Optimal rendering Primary Ambient Extraction

Spatial Audio Processing

Individualized cues

Essentially, PAE serves as a front-end to facilitate flexible, efficient, and immersive spatial audio reproduction. J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 8 / 865of 53

What are primary and ambient components? Main characteristics  Primary components are directional  Ambient components are diffuse

Primary component — Where the sound comes from?

Ambient component — Where are you? J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide 9 / 965of 53

What does PAE do? Sound scenes in movies and games often consist of: directional sound objects and background ambience, e.g., • Circling aircraft in the rain; • Flying bee at the waterfall; • Soaring tiger on the sea; • Attacking wolves in the forest.

Primary Ambient Extraction J He (NTU)

Directional sound

Background ambience

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1010 / 65

A brief literature of PAE No. of channels

Complexity of audio scenes Basic (amplitude panned single source)

Medium (single source)

Complex (multiple sources)

Time-frequency masking: [AvJ02], [AvJ04], [MGJ07], [Pul07]

Stereo (2)

PCA:[IrA02], [BVM06], [MGJ07], [GoJ07b], [BaS07], [God08],

LMS:

[JHS10], [BJP12], [TaG12], [TGC12], [LBP14]

[UsB07]

Least-squares:[Fal06], [Fal07], [JPL10], [FaB11], [UhH15]

Shifted PCA: [HTG13]

PCA: [DHT12], [HGT14], [HeG15]

Linear estimation: [HTG14] Ambient spectrum estimation: [HGT15a], [HGT15b]

Time-shifting: [HGT15c]

Others: [BrS08], [MeF10], [Har11] ICA & time-frequency masking: [SAM06] PCA:

Multichannel

[GoJ07b]

Pairwise correlations: [TSW12]

[HKO04]

Others: [GoJ07a], [WaF11], [TGC12], [CCK14]

ICA:

Pairing: [HeG15b] Others: [StM15]

Single J He (NTU)

NMF: [UWH07]

Neural network: [UhP08]

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1111 / 65

My contributions Basic Model: one source, amplitude panned

Medium Model: one source, amplitude panned, and/or time shifted Time shifting based PAE

Basic approaches

Linear estimation based PAE

Exploiting additional characteristic

Ambient spectrum estimation based PAE

Complex Model: multiple source, amplitude panned, and/or time shifted

More robust

Multiple source based PAE

Better performance : Techniques can be combined directly J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1212 / 65

My contributions Immersive Audio Source (Channel-based)

Medium (Playback system)

Primary ambient extraction

Spatial audio rendering

Receiver (Human ear)

Natural sound rendering for headphones Chapter 7

[SHT15, HGT15d] Spatial audio reproduction

Model: One source, amplitude panned

Basic approaches

Exploiting additional characteristics

Linear estimation Chapter 3 [HTG14]

Ambient spectrum estimation Chapter 4 [HGT15a, HGT15b]

Better performance

J He (NTU)

Model: One source, amplitude panned, and/or time shifted

Model: Multiple sources, amplitude panned, and/or time shifted

Time-shifting

Multi-shift/subband

Chapter 5 [HTG13, HGT15c]

Chapter 6 [HGT14, HeG15]

More robust in handling complex cases

I: Investigation of linear estimation based PAE under the basic signal model. II: Improving PAE performance for strong ambient power cases using ambient spectrum estimation techniques. III: Employing time-shifting techniques for PAE with partially correlated primary components IV: Adaptation of conventional PAE approaches to deal with multiple sources. V: Applying PAE in natural sound rendering headphone systems.

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1313 / 65

My contributions in publications I: Investigation of linear estimation based PAE under the basic signal model. II: Improving PAE performance for strong ambient power cases using ambient spectrum estimation techniques. III: Employing time-shifting techniques for PAE with partially correlated primary components IV: Adaptation of conventional PAE approaches to deal with multiple sources. V: Applying PAE in natural sound rendering headphone systems. J He (NTU)

[J1] J. He, E. L. Tan, and W. S. Gan, “Linear estimation based primaryambient extraction for stereo audio signals,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 505-517, Feb. 2014. [J2] J. He, W. S. Gan, and E. L. Tan, “Primary-ambient extraction using ambient phase estimation with a sparsity constraint,” IEEE Signal Process. Letters, vol. 22, no. 8, pp. 1127-1131, Aug. 2015. [J3] J. He, E. L. Tan, and W. S. Gan, “Primary-ambient extraction using ambient spectrum estimation for immersive spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1430-1443, Sept. 2015. [J4] J. He, W. S. Gan, and E. L. Tan, “Time-shifting based primary-ambient extraction for spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1576-1588, Oct. 2015. [C1] J. He, E. L. Tan, and W. S. Gan, “Time-shifted principal component analysis based cue extraction for stereo audio signals,” in Proc. ICASSP, Vancouver, Canada, 2013, pp. 266-270. [C2] J. He, W. S. Gan, and E. L. Tan, “A study on the frequency-domain primaryambient extraction for stereo audio signals,” in Proc. ICASSP, Florence, Italy, 2014, pp. 2892-2896. [C3] J. He, and W. S. Gan, “Multi-shift principal component analysis based primary component extraction for spatial audio reproduction,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 350-354.

[J5] K. Sunder, J. He, E. L. Tan, and W. S. Gan, “Natural sound rendering for headphones: Integration of signal processing techniques,” IEEE Signal Process. Magazine, vol. 32, no. 2, Mar 2015, pp. 100-113.

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1414 / 65

Other related publications [C4] J. He, W. S. Gan, and E. L. Tan, “On the preprocessing and postprocessing of HRTF individualization based on sparse representation of anthropometry features,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 639-643. [C5] J. He, and W. S. Gan, “Applying primary ambient extraction for immersive spatial audio reproduction,” 2015 Asia Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (invited), Hong Kong, Dec. 2015. [C6] J. He, R. Ranjan, and W. S. Gan, “Fast continuous HRTF acquisition with unconstrained movements of human subjects,” in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. [Tutorial] W. S. Gan, and J. He, “Assisted listening for headphones and hearing aids: signal processing techniques,” Tutorial at APSIPA ASC 2015, Hong Kong, Dec. 2015.

[Show & Tell] D. H. Nguyen, J. He, K. K. Phyo, and W. S. Gan, “Real-time audio signal processing platform for natural 3D sound rendering,” Show & Tell at ICASSP 2016, Shanghai, China, Mar. 2016. [J6] ] J. He, R. Ranjan, W. S. Gan, and K. Sunder, “Scalable data reusing normalized LMS for acoustic system identification with short duration signals,” IEEE Sig. Process. Letters, under review.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1515 / 65

Objective of PAE

to extract the primary and ambient components from M mixtures Sum of primary and ambient components

Mixtures = primary component + ambient component

xm  n   pm  n   am  n  J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1616 / 65

Definitions with Stereo Signal Model Assumptions

Signal = Primary + Ambient

x0  p0  a0 x1  p1  a1 Stereo signal model True components

Input signal

Extracted components

p0 a0 a1 p1

J He (NTU)

+ +

x0

pˆ 0 , pˆ 1 PAE

x1

aˆ 0 , aˆ 1

Primary components highly correlated

p1  kp 0

Ambient components uncorrelated

a 0  a1

pi  a j

Primary ambient components uncorrelated

i, j  0,1

Ambient power balanced

Pa0  Pa1

Primary panning factor PPF: k  Primary power ratio PPR:  

Spatial Audio Reproduction Using Primary Ambient Extraction

p1 p0

Total primary power Total signal power

16th Feb 2016

Slide of 53 1717 / 65

PAE: Time-Frequency Masking f

f

f Element-wise multiplication

t

Primary / ambient components

t

t

Mixture

Mask

Mask can be constructed using • Inter-channel coherence [Avendano and Jot, 2004] • Pairwise correlation [Thompson et al., 2012] • Equal level of ambience [Merimaa et al., 2007] • Diffuseness [Pulkki, 2007] J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1818 / 65

Linear Estimation framework in PAE

pˆ 0  T pˆ 1 aˆ T  0 aˆ 1T

T

J He (NTU)

  wP0,0     wP1,0   w A0,0     wA1,0

wP0,1   T T wP1,1   x 0   x0   W     T T wA0,1   x1  x1     wA1,1 

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 1919 / 65

Performance measures eP   w

P0,0

 kwP0,1  1 p 0



0

  wP0,0 a0  wP0,1a1 

e A   wA0,0  1 a0  wA0,1a1   wA0,0 p 0  wA0,1p1  Extraction Accuracy

PError PTrue signal



PLeakage PDistortion PInterference   PTrue signal PTrue signal PTrue signal

ESR  DSR  ISR  LSR Spatial Accuracy

ICC, ICLD, ICTD(only for primary)

ESR: Error-to-signal ratio DSR: Distortion-to-signal ratio ISR: Interference-to-signal ratio LSR: Leakage-to-signal ratio J He (NTU)

ICC : Inter-channel cross-correlation coefficient ICLD: Inter-channel level difference ICTD: Inter-channel time difference

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2020 / 65

PAE: Linear Estimation  pˆ 0  n    wP0,0     pˆ1  n    wP1,0 ˆ   w  a0  n    A0,0  aˆ  n    wA1,0  1  

wP0,1   wP1,1   x0  n     wA0,1   x1  n    wA1,1 

Objectives and relationships of four linear estimation based PAE approaches. • • • •

J He (NTU)

Blue solid lines represent the relationships in the primary component; Green dotted lines represent the relationships in the ambient component. MLLS: minimum leakage LS MDLS: minimum distortion LS

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2121 / 65

Performance of the four PAE approaches Performance

Primary component

k : PPF

Ambient component

Measure

MDLS/PCA

MLLS/LS

MLLS/PCA

ESR

1  2

1  1 

1 1 k 2

1 2 2 1 k 1 

2 1  k   k 2  1 

0

1 2 1    1  k 2 1   2

1  k  1    2 1  k  1     2   

LSR

DSR

ISR

ICC(ICTD)

ICLD

1  2

1  2

 2    1  

 1      1  

0

0 1(0)

k2

2

2

 1   2   1 k 

2

 k   2   1 k 

1 1 k2

2

LS

 : PPR

MDLS 2

 1 2    2  1 k 1    k 2    2  1 k 1  

2

2

2

2

0   2k    2  1  k  1     2 

2

2

2k 

1  k   1  k  2 2

2 1 1    k 1    k 2 1   1 1    k2

2 2

2

2 1 1    k 1    k 2 1   1 1    k2

Theoretical results are verified in our experiments! J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2222 / 65

Experimental results– Primary extraction (b)

(a) 5

3

P

P

3

0.6

2

2

1

1

0

0 0

0.5 True PPR 

1

DSR

LSR

P

0.8

4

MDLS MLLS

4

ESR

(c)

5

0.4

0.2

0 0

0.5 True PPR 

1

0

0.5 True PPR 

1

Performance of MDLS (or PCA) and MLLS (or LS) in primary extraction (a) ESR, (b) LSR, (c) DSR.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2323 / 65

PAE using Adjustable Least-squares (ALS) MDLS /PCA

Min distortion

Adjustable factor    0,1

MLLS /LS

β =0

ALS (β) Min leakage

β=1

WALS   1  1   k  1    1   1        2 2 1   1   1  k 1  k       k  1   k2  1     1   1    2  1   1 k2  1    1  k    1 1 1   1    1     1 k2 k  1 k2     2 k2  k 1   k   1    2 2   1 k  1 k   

Min error

MLLS /PCA

1    , DSR   2  1    , LeSR  1   , 1  ESR P       2 P P   2 2 1    1  1   2

2

2

1 1  1  ESR A  2      2  2 2 , DSR A   2  , LeSR A  0. 2  k 1  k k  k  1  

Blue solid lines: primary component; Green dotted lines: ambient component. J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2424 / 65

Recommendations of Linear Estimation based PAE Strengths

Approach 

PCA

 



Minimum leakage in the extracted primary and  ambient components; Primary and ambient components are uncorrelated;  No distortion in the extracted primary and ambient components;



MDLS

J He (NTU)



Minimum MSE in the extracted primary and ambient components;



ALS

No distortion in the extracted primary component; No primary leakage in the  extracted ambient component; Primary and ambient components are uncorrelated;



LS

MLLS

Weaknesses



 Performance adjustable;

Recommendations

Ambient component severely panned;

Spatial audio coding and interactive audio in gaming, where the primary component is more important than the ambient component.

Severe primary leakage in the extracted ambient component;

Applications in which both the primary and ambient components are extracted, processed, and finally mixed together.

Ambient component severely panned;

Spatial audio enhancement systems, and applications in which different rendering or playback techniques are employed on the extracted primary and ambient components.

Severe interference and primary leakage in the extracted ambient component; Need to adjust the value of the adjustable factor;

High-fidelity applications in which timbre is of high importance.

For applications without explicit requirements.

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2525 / 65

Contributions on linear estimation based PAE 1. Formulated the framework for PAE.

linear

estimation

2. Introduced two groups of performance measures. – Extraction accuracy: ESR, DSR, ISR, LSR – Spatial accuracy

: ICC, ICTD, ICLD

3. Proposed MLLS, MDLS, ALS and compared them with PCA and LS in PAE. 4. Different approaches are recommended in different spatial audio applications.

[J1] J. He, E. L. Tan and W. S. Gan, “Linear estimation based primary-ambient extraction for stereo audio signals,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 505-517, Feb. 2014.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2626 / 65

Observation Ambient components are decorrelated using: – A small delay – All-pass filtering with random phase – Artificial reverberation, and binaural reverberation

Magnitude of the ambient components are usually kept the same. Element-wise multiplication

Ambient components

j 0  n ,l 

A0  A0

W0

W0  n, l   e

A1  A1

W1

j n ,l W1  n, l   e 1 

Ambient magnitude

A0  A1  A

J He (NTU)

Complex with unit magnitude

Ambient phase

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2727 / 65

From PAE to APE: ambient phase estimation

PAE X0  P0  A 0 X1  P1  A1

with P1  kP0 , A 0  A1

Pi  A j , i, j  0,1 PP1  k PP0 , PA1  PA0 2

Find W0 , W1

APE

X: Mixed signal P: Primary components k: Primary panning factor

A   X1  kX0 .  W1  kW0 

and A c   X1  kX 0  .  W1  kW0 

Wc ,

Pc  X c   X1  kX 0  .  W1  kW0 

Wc

channel index c  0,1.

Find θ1

Find θ0 , θ1

θ0  θ  arcsin  k 1 sin  θ  θ1     where θ    X1  kX0 

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2828 / 65

From PAE to APE: ambient phase estimation

APE

PAE X0  P0  A 0

A   X1  kX0 .  W1  kW0 

X1  P1  A1

Is this the only way? For each time-frequency bin

J He (NTU)

kX 0  P1  kA0 X 1  P1  A1

Complex spectrum shown in complex plane?

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 2929 / 65

From PAE to AME: ambient magnitude estimation Put the complex spectrum in 2 dimensional vector form k X 0  P1  k A0 X 1  P1  A1

Im B (BRe, BIm)

k A0

k X 0  OB   BRe , BIm  X 1  OC   CRe , C Im 

P1  OP   PRe , PIm  k A0  PB

C (CRe, CIm)

?

A1 P (PRe, PIm) X1

k X0

P1

Re

O (0, 0)

A1  PC J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 3030 / 65

From PAE to AME: ambient magnitude estimation r  A0  A1

PC  r , PB  kr.

 PRe  BRe    PIm  BIm   k 2 rˆ 2 , 2

 PRe  CRe 

2

2

  PIm  CIm   rˆ 2 . 2

CRe  BRe   k 2  1 rˆ 2   BIm  CIm    B  C Re Re Pˆ Re   , 2 2 2 BC CIm  BIm   k 2  1 rˆ 2  B  C Im Im Pˆ Im   2 2 2 BC



 BRe  CRe  

Im B (BRe, BIm)

,

k A0



Pˆ1  Pˆ Re  jPˆ Im , Pˆ0  k 1 Pˆ Re  jPˆ Im ,





C (CRe, CIm)

Aˆ1  X 1  Pˆ Re  jPˆ Im ,

  Find r

Aˆ0  X 0  k 1 Pˆ Re  jPˆ Im .

 CRe  BRe 

BC 

2

A1 P (PRe, PIm) X1

k X0

P1

Re

O (0, 0)

  C Im  BIm  , 2

2 2 2 2    k  1 rˆ 2  BC   k  1 rˆ 2  BC  .



J He (NTU)

 



Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 3131 / 65

APE

 AME

AME: Find

r  PC Im

APE: Find

1  PC

B (BRe, BIm)

k A0 C (CRe, CIm)

Ambient Spectrum Estimation J He (NTU)

A1 P (PRe, PIm) X1

k X0

P1

Re

O (0, 0)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 3232 / 65

Ambient Spectrum Estimation (ASE)

Extracted primary and ambient components

Input signal

x 0 , x1

pˆ 0 , pˆ 1 , aˆ 0 , aˆ 1

APE Find θ1 T-F Transform

Compute

Select ASE

ˆ ,A ˆ Pˆ 0 , Pˆ1 , A 0 1 Find

Inverse T-F Transform

r

AME

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 3333 / 65

ASE: how to estimate? Using a Sparsity Constraint Find θ1

Objective

How

ˆ θˆ 1*  arg min P 1 ˆ

Method Analytic?

Solution Numerical?

J He (NTU)

Find r

θ1

rˆ*  arg min Pˆ1 rˆ

1

APES

1

AMES

No Convex?

No

Heuristic?

Inefficient

Discrete Searching

Independent Bounded

Spatial Audio Reproduction Using Primary Ambient Extraction

Pˆ1  ni , li   A1  n j , l j  ni  n j or li  l j

θˆ 1    ,   rˆ   rlb , rub  16th Feb 2016

Slide of 53 3434 / 65

Approximate solution APEX and computational cost Approximate close-form solution APEX : X1 , k  1  θˆ      X 1  X 0  , k  1 * 1

Computation cost per time-frequency bin APES

Square root D

AMES APEX

Operations

15D+18

15D+13

4D+6

D-1

Trigonometric operation 7D+6

2D+2

25D+35

24D+24

9D+13

D-1

0

0

13

7

4

1

7

Addition Multiplication Division

Comparison

D: number of phase or magnitude estimates in discrete searching

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 3535 / 65

Evaluation of PAE- Extraction accuracy Performance measure: Error-to-Signal Ratio (ESR)   1 1 pˆ c  pc ESR P  10log10   2 2 p c  0  0 2 

   1 1 aˆ c  ac 2  , ESR A  10log10   2 2 a c  0   c 2  

2

2 2

  .  

Error  Distortion  Interference  Leakage ESR  DSR

 ISR

 LSR

When there is no analytic solution, how to compute these measures? We propose an optimization technique to compute these measures. J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 3636 / 65

Optimization method for PAE Extraction accuracy pˆ c  pc  Leakpc  Distpc

aˆ c  ac  Leakac  Intfac  Distac

 pc   wPc ,0a0  wPc,1a1   Distpc ,

w

* Pc ,0

 ac  wAc ,cpc  wAc ,1ca1c  Distac ,

w

* , wPc ,1  

arg min

 wPc ,0 , wPc ,1 

* Ac , c

pˆ c  p c   wPc ,0a0  wPc ,1a1  , 2

 1 w* a  w* a Pc ,0 0 Pc ,1 1 1 LSR P  10 log10   2 pc 2  2 c 0

2 2

min

 wAc ,c , wAc ,1c 

aˆ c  ac   wAc ,cp c  wAc ,1ca1c  , 2

2

 1 w* p Ac , c c 1 LSR A  10 log10   2  2 c 0 ac 2

  , 

 1 pˆ  p  w* a  w* a  Pc,0 0 Pc,1 1  c c 1 DSR P  10 log10   2 pc 2  2 c 0 

J He (NTU)

arg

2

, w*Ac ,1c  

 2 .  

2

2 2

 1 w* a Ac ,1 c 1 c 1 ISR A  10 log10   2 ac 2  2 c 0

  ,  2 2

  , 

 1 aˆ  a   w p  w c c Ac ,c c Ac ,1 c a1 c  1 DSR A  10 log10   2 ac 2  2 c 0 

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

2 2

  .  

Slide of 53 3737 / 65

Objective evaluation Stimuli • Primary component: – Speech, k = 2

• Ambient component: – Wave lapping sound

• Primary power ratio (PPR): – (0, 1) at an interval of 0.1

Approaches compared • Masking • PCA • LS • ASE (APEX)

• FFT size: 4096

Performance evaluated 1. Extraction accuracy: ESR 2. Spatial accuracy: ICC, ICLD

J He (NTU)

 1 1 pˆ c  p c ESR P  10 log10   2 2 p c  0  c 2

2

 1 1 aˆ c  ac ESR A  10 log10   2 2 a c  0  c 2

2

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

2

2

 ,   . 

Slide of 53 3838 / 65

Extraction accuracy (a)

(b)

20

20

Masking PCA LS ASE

15

10

15

10

5

ESRA(dB)

ESRP(dB)

5

0

-5

0

-5

-10

-10

-15

-15

-20

-20

-25

0

0.2

0.4

0.6

Primary power ratio 

J He (NTU)

0.8

1

-25

0

0.2

0.4

0.6

0.8

1

Primary power ratio 

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 3939 / 65

Spatial accuracy (a)

(b)

1.2

20

ICLDP (dB)

Best 1 ICCP

0.8 0.6 0.4

15

Best True Masking PCA LS ASE

10

5

0.2 0

0

0.2

0.4

0.6

0.8

0

1

0

Primary power ratio 

0.2

0.4

0.6

0.8

1

Primary power ratio 

(c)

(d) 2

ICLDA (dB)

ICCA

0.8 0.6 0.4 0.2

Best 00

-2 -4 -6 -8 -10 -12

0.2

0.4

0.6

Primary power ratio  J He (NTU)

Best

0

1

0.8

1

-14

0

0.2

0.4

0.6

0.8

1

Primary power ratio 

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4040 / 65

Subjective evaluation Stimuli

Approaches compared

• Primary component:

• • • • • •

– speech, music, and bee sound, k = 2

• Ambient component: – forest, canteen, and waterfall sound

• Primary power ratio (PPR):

Masking PCA LS ASE (APEX) Reference Mixture

– (0.3, 0.7)

• Duration: 2-4 seconds

Performance evaluated 1. Extraction accuracy 2. Ambient diffuseness J He (NTU)

Listening tests • 17 subjects • Headphone listening • Procedure similar to MUSHRA

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4141 / 65

(a) Extraction accuracy of primary components 100

(b) Extraction accuracy of ambient components 100

(c) Diffuseness accuary 100

Excellent 80

60

40

20

80

Good Subjective score

Subjective score

Subjective score

80

60

60

Fair 40

40

Poor 20

20

Bad 0

Ref

Mix Masking PCA

(b) Extraction accuracy of ambient components 100

LS

ASE

0

Ref

Mix Masking PCA

LS

ASE

0

Ref

Mix Maskin

(c) Diffuseness accuary of ambient components 100

Excellent 80

80

Subjective score

Subjective score

Subjective scores

60

40

20

Good 60

Fair 40

Poor 20

Bad 0

Ref

Mix Masking PCA

J He (NTU)

LS

ASE

0

Ref

Mix Masking PCA

LS

ASE

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4242 / 65

Contributions on ASE based PAE 1. Reformulated PAE as ASE by exploiting the equal ambient magnitude. 2. Solved ASE using the criterion of sparse primary component, resulting in APES, AMES, and APEX.

3. Proposed a technique to compute error measures for PAE approaches without analytic solutions. 4. Validated the improved performance (ESR reduction: 3-6 dB average) in the objective and subjective experiments.

Im B (BRe, BIm)

k A0 C (CRe, CIm)

A1 P (PRe, PIm) X1

k X0

P1

Re

O (0, 0)

Approximate efficient solution APEX : X , k  1 ˆθ*   1 1   X1  X0  , k  1

[J2] J. He, W. S. Gan, and E. L. Tan, “Primary-ambient extraction using ambient phase estimation with a sparsity constraint,” IEEE Signal Process. Letters, vol. 22, no. 8, pp. 1127-1131, Aug. 2015. [J3] J. He, E. L. Tan, and W. S. Gan, “Primary-ambient extraction using ambient spectrum estimation for immersive spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1431-1444, Sept. 2015.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4343 / 65

PAE in ideal case → complex case

p0

p1

kˆic k 2  1 k    2 k 2P k

p  0

2

 k 2 1  1  ,  2  2   2P k  k

ˆic k 2  1  2P    . 2  k 1

p  1 (a)

(b)

10

1 k =3 k =1 k =1/3

0.8 0.6

5

0.2



 k (dB)

0.4

0

0 -0.2 -5

k =3 k =1 k =1/3

-0.4 -0.6

-10

J He (NTU)

0

0.2 0.4 0.6 0.8 Primary correlation  p

1

-0.8

0

0.2 0.4 0.6 0.8 Primary correlation  p

Spatial Audio Reproduction Using Primary Ambient Extraction

1

16th Feb 2016

Slide of 53 4444 / 65

PAE in ideal case → complex case pˆ PCA, 0  p 0  v 

p0

(a)

(b) 0.6

1.8

0.59

1.6

0.58

=0.2 =0.5 =0.8 ESR

A

P

ESR

kˆic v 1  kˆ

0.55 0.54

0.6

0.53

0.4

0.52

0.2

0.51

0.2 0.4 0.6 0.8 Primary correlation  p



0.56

0.8

J He (NTU)



 kˆic a1 ,

0.57

1

0

0



p  0

2

0

a

kˆic 1 pˆ PCA, 1  p1  v  a 0  kˆica1 , 2 kˆic 1  kˆic kˆic 2 kˆic aˆ PCA, 0  a v a, 2 0 2 1 ˆ ˆ 1  k ic 1  kic kˆic 1 1 aˆ PCA, 1  a  v a , 2 1 2 0 ˆ ˆ ˆ 1  kic kic 1  kic

p  1

1.2

2

ic

p1

1.4

1 1  kˆ

1

0.5

2

 kˆ p ic

0

 p1



ic

0

0.2 0.4 0.6 0.8 Primary correlation  p

1

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4545 / 65

Time-shifting for PAE To account for the partial primary correlation (0-lag) caused by the inter-channel time difference (ICTD).

x0

x'0

x1

ICTD estimation

o

Time shifting

1 pˆ SPCA, 0  n   1  kˆ

2

ic

kˆic aˆSPCA, 0  n   1  kˆ

ic

J He (NTU)

x1

Shifted signal ( P is maximum)

Stereo input Signal ( P is small)

2

pˆ '0

pˆ 0

pˆ 1

Extracted components (at shifted positions)

pˆ 1

Primary and ambient components (at original positions)

Output mapping

PCA

ˆ  x0  n   kˆic x1  n   o   , pˆ SPCA, 1  n   kic   1  kˆ

2

 x0  n   o   kˆic x1  n   ,  

ic

 kˆic x0  n   x1  n   o   , aˆSPCA, 1  n    1   1  kˆ

2

 kˆic x0  n   o   x1  n   .  

ic

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4646 / 65

Output mapping Frame Frame ICTD (ms) ICTD (ms)

channel 0 channel 0

i-1 i-1 1 1

frame i-1 frame i-1

i+1 i+1 1 1 frame i+1 frame i+1

frame i frame i an overlap of 2 ms an overlap of 2 ms

channel 1 channel 1

i i -1 -1

frame i-1 frame i-1

a gap of 2 ms a gap of 2 ms frame i+1 frame i+1

frame i frame i (a) Conventional output mapping (a) Conventional output mapping overlaps of Q = 2 ms overlaps of Q = 2 ms

channel 0 channel 0

frame i-1 frame i-1

frame i frame i

an overlap of Q +2 = 4 ms an overlap of Q +2 = 4 ms

channel 1 channel 1

frame i-1 frame i-1

x x

frame i+1 frame i+1

an overlap of Q -2 = 0 ms an overlap of Q -2 = 0 ms

frame i frame i

frame i+1 frame i+1

(b) Overlapped output mapping (b) Overlapped output mapping

Overlapping duration: Q ≥ 2 ms J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4747 / 65

Experiments

Exp

Input signal

Primary component

Ambient component

1

Synthesized

Speech

Lapping wave

Fixed direction; different values of

2

Synthesized

Shaking matchbox

Lapping wave

Panning directions with close 

3

Synthesized

4

Recorded

J He (NTU)

Settings



Direct path of Reverberation of Varying directions with speech speech different  Speech

Canteen recording

Three directions with close 

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 4848 / 65

Experiment 1: Extraction accuracy (a)

(b)

8

12

7 10 PCA SPCA

6

8 A

ESR

ESR

P

5 4

6

3 4 2 2 1 0

0

J He (NTU)

0.2 0.4 0.6 0.8 Input primary power ratio 

1

0

0

0.2 0.4 0.6 0.8 Input primary power ratio 

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

1

Slide of 53 4949 / 65

Experiment 1: Spatial accuracy (a)

(b)

(c)

26

5

40 24

True PCA SPCA

20

P

0 -10

-5

18 16

A

10

ICLD (dB)

P

ICTD (lag)

20

0

22

ICLD (dB)

30

14

-15

12

-20

-10

10

-20

-30 8 -40 0.2 0.4 0.6 0.8 Input primary power ratio 

J He (NTU)

1

6

0

0.2 0.4 0.6 0.8 Input primary power ratio 

1

Spatial Audio Reproduction Using Primary Ambient Extraction

-25

0

0.2 0.4 0.6 0.8 Input primary power ratio 

16th Feb 2016

1

Slide of 53 5050 / 65

Experiment 2: Tracking of moving source (a) True primary

(b) Stereo signal

J He (NTU)

(c) PCA

(d) SPCA

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5151 / 65

Experiment 3: Reverberation experiment (a)

(b)

1.4

3.5 PCA SPCA

1.2

3 2.5 A

0.8

ESR

ESR

P

1

0.6

1

0.2

0

2

4 6 Source position

8

20

2

4 6 Source position 1

6

0

0 -10

4 3

-20

2

-30

1

10

-1

ICLDA (dB)

10

8

(c)

7

5

ICLDP (dB)

ICTDP (lag)

0

(b)

True PCA SPCA

30

0.5

10

(a) 40

2 1.5

0.4

0

PCA SPCA NLMS

-2 -3 -4 -5 -6

-40 2

J He (NTU)

4 6 8 Source position

10

0

0

2

4 6 Source position

8

10

-7

0

Spatial Audio Reproduction Using Primary Ambient Extraction

5 Source position

16th Feb 2016

10

Slide of 53 5252 / 65

Experiment 4: Recording experiment ESR

Spatial

Primary component

Ambient component

θ



45°

90°



45°

90°

PCA

0.27

0.64

0.88

1.08

1.89

2.49

SPCA

0.21

0.31

0.34

0.81

1.02

1.39

ICTDP

ICLDP (dB)

ICLDA(dB)

θ



45°

90°



45°

90°



45°

90°

True

1

-17

-31

-1.02

7.74

11.90

1.03

1.18

1.03

PCA

0

0

0

-1.46

36.03

23.11

1.46

-36.03

-23.11

SPCA

1

-17

-31

-1.26

8.65

15.60

1.26

-8.65

-15.60

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5353 / 65

Contributions on Time Shifting PAE Stereo input Signal ( P is small)

ICTD estimation

o

Shifted signal ( P is maximum)

Time shifting

Extracted components (at shifted positions)

PCA

Primary and ambient components (at original positions)

Output mapping

For mixture signals with partially correlated primary components, • Analyzed the performance of conventional PAE; • Proposed time shifting technique to improve the PAE performance; • Achieved lower extraction error and more accurate spatial cues. [J4] J. He, W. S. Gan, and E. L. Tan, “Time-shifting based primary-ambient extraction for spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1576-1588, Oct. 2015.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5454 / 65

Extensions for Multiple Sources Stereo input signal

Primary components

X

Multi-shift PAE with ICC based output weighting

Multiple ICTD estimation



1

Timeshifting

X1

i

Timeshifting

Xi

PCA

Timeshifting

XT

PCA

T

τ  1 ,

Subband PAE with frequency bin partitioning

PCA

, i ,

Pˆ1 Pˆ i

Output mapping

PˆT

, T 

Stereo input signal

Extracted primary and ambient components

DFT

Inverse DFT

Frequency bin partitioning

ITD estimation

Shifted PCA (frequency domain)

Frequency bin grouping

[C3] J. He, and W. S. Gan, “Multi-shift principal component analysis based primary component extraction for spatial audio reproduction,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 350-354. [C2] J. He, E. L. Tan, and W. S. Gan, "A study on the frequency-domain primary-ambient extraction for stereo audio signals," in Proc. ICASSP, Florence, Italy, 2014, pp. 2892-2896.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5555 / 65

PAE: from stereo to multichannel signals 1. Using down-mix

2. Using pairing

3. Direct [C5] J. He, and W. S. Gan, “Applying primary ambient extraction for immersive spatial audio reproduction,” in Proc. APSIPA Annual Summit and Conference, Hong Kong, Dec. 2015.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5656 / 65

Natural Sound Rendering for Headphones

K. Sunder, J. He, E. L. Tan, and W. S. Gan, “Natural sound rendering for headphones,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 100-113, Mar. 2015.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5757 / 65

Challenges and solutions Listening Headphone listening

Signal processing

Natural listening

Virtualization Source

Sound scene decomposition Medium

Equalization Individualization

Receiver

Head tracking D. R. Begault, 3-D sound for virtual reality and multimedia: AP Professional, 2000.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5858 / 65

Integration

Sound sources Sound environment

Sound Mixing

Head movement

Head tracking

Individual parameters

Individualization

Sound mixture

Decomposition using PAE

Source rendering

Equalization

Headphone playback

Environment rendering Virtualization Rendering of natural sound

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 5959 / 65

3D Audio Headphone: an example Key features

Conchae

Sources

EQ Front

 Patented structure with strategicpositioned emitters;  Individualization via frontal projection; no measurements or training required;  Recreate an immersive perception of sound objects with surrounding ambience;  Compatible with all existing sound formats.

Correct sound localization

Front emitters

PAE

“Immersive” Ambience

EQ Diffuse

All emitters

Surrounding sound environment

W. S. Gan and E. L. Tan, "Listening device and accompanying signal processing method," US Patent 2014/0153765 A1, 2014.

J He (NTU)

Spatial Audio Reproduction Using Primary Ambient Extraction

16th Feb 2016

Slide of 53 6060 / 65

Subjective evaluation



J He (NTU)

(a) MOS for 4 measures 100

100 Natural sound rendering Conventional stereo

90

80

Natural sound rendering



18 subjects, score of 0-100; Stimuli: binaural, movie and gaming tracks; Four measures: 1. Sense of direction, 2. Externalization, 3. Ambience, 4. Timbral quality.

Mean opinion score



p-value