Spatial Audio Reproduction Using Primary Ambient Extraction
PhD candidate: Jianjun HE
Supervisor: Assoc Prof Gan Woon-Seng
Digital Signal Processing Laboratory, School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore
16th Feb, 2016
Outline
I. Background of Spatial Audio Reproduction
II. Overview of Primary Ambient Extraction (PAE)
III. My Contributions
1. Linear Estimation based PAE
2. Ambient Spectrum Estimation based PAE
3. Time Shifting based PAE
4. Multi-Source and Multichannel based PAE
5. Natural Sound Rendering for Headphones
IV. Conclusions and Future Work
Hearing is Believing
1. What/where is the sound?
2. How is the sound played?
3. How is the sound heard?
Immersive visual experience
Immersive listening experience
Sound content processing
Personal listening cues
Spatial hearing of sound source Azimuth, Elevation, Distance
Head Related Transfer Functions (HRTF)
ITD: Interaural Time Difference
ILD: Interaural Level Difference
SC: Spectral Cues
[Figure: HRTF of a KEMAR dummy head for a source at 30 degrees azimuth. Magnitude in dB versus frequency in Hz (10^2 to 10^4) for the ipsilateral and contralateral ear responses.]
Spatial hearing of sound scene
Source-medium-receiver view of spatial audio reproduction
Immersive audio: "being there"
Source (audio content): channel-based, object-based, transform-domain-based
Medium (playback system): loudspeakers, headphones, etc.
Receiver (human ear): Ear A, Ear B
Achieving consistency in source and medium in spatial audio
Immersive audio: "being there"
Source: channel-based
Medium: playback system
Receiver: human ear
Spatial audio processing: primary ambient extraction, optimal rendering, individualized cues
Essentially, PAE serves as a front-end to facilitate flexible, efficient, and immersive spatial audio reproduction.
What are primary and ambient components?
Main characteristics:
• Primary components are directional
• Ambient components are diffuse
Primary component: where does the sound come from?
Ambient component: where are you?
What does PAE do?
Sound scenes in movies and games often consist of directional sound objects and background ambience, e.g.,
• Circling aircraft in the rain;
• Flying bee at the waterfall;
• Soaring tiger on the sea;
• Attacking wolves in the forest.
Primary Ambient Extraction separates such a mixture into directional sound and background ambience.
A brief literature of PAE
Approaches are organized by number of channels and by complexity of the audio scene: basic (amplitude panned single source), medium (single source), and complex (multiple sources).
Stereo (2 channels):
• Time-frequency masking: [AvJ02], [AvJ04], [MGJ07], [Pul07]
• PCA: [IrA02], [BVM06], [MGJ07], [GoJ07b], [BaS07], [God08], [JHS10], [BJP12], [TaG12], [TGC12], [LBP14]
• Least-squares: [Fal06], [Fal07], [JPL10], [FaB11], [UhH15]
• Linear estimation: [HTG14]
• Ambient spectrum estimation: [HGT15a], [HGT15b]
• Others: [BrS08], [MeF10], [Har11]
• LMS: [UsB07]
• Shifted PCA: [HTG13]
• Time-shifting: [HGT15c]
• PCA: [DHT12], [HGT14], [HeG15]
Multichannel:
• PCA: [GoJ07b]
• Others: [GoJ07a], [WaF11], [TGC12], [CCK14]
• ICA & time-frequency masking: [SAM06]
• Pairwise correlations: [TSW12]
• ICA: [HKO04]
• Pairing: [HeG15b]
• Others: [StM15]
Single channel:
• NMF: [UWH07]
• Neural network: [UhP08]
My contributions
Basic model: one source, amplitude panned
• Basic approaches: linear estimation based PAE
• Exploiting additional characteristic: ambient spectrum estimation based PAE (better performance)
Medium model: one source, amplitude panned and/or time shifted
• Time shifting based PAE (more robust)
Complex model: multiple sources, amplitude panned and/or time shifted
• Multiple source based PAE (more robust)
Techniques can be combined directly.
My contributions
Spatial audio reproduction for immersive audio: Source (channel-based), Medium (playback system), Receiver (human ear), connected by primary ambient extraction and spatial audio rendering.
Natural sound rendering for headphones: Chapter 7 [SHT15, HGT15d]
Model: one source, amplitude panned
• Basic approaches: linear estimation, Chapter 3 [HTG14]
• Exploiting additional characteristics: ambient spectrum estimation, Chapter 4 [HGT15a, HGT15b] (better performance)
Model: one source, amplitude panned and/or time shifted
• Time-shifting, Chapter 5 [HTG13, HGT15c]
Model: multiple sources, amplitude panned and/or time shifted
• Multi-shift/subband, Chapter 6 [HGT14, HeG15]
The time-shifting and multi-shift/subband techniques are more robust in handling complex cases.
I: Investigation of linear estimation based PAE under the basic signal model.
II: Improving PAE performance for strong ambient power cases using ambient spectrum estimation techniques.
III: Employing time-shifting techniques for PAE with partially correlated primary components.
IV: Adaptation of conventional PAE approaches to deal with multiple sources.
V: Applying PAE in natural sound rendering headphone systems.
My contributions in publications
I: Investigation of linear estimation based PAE under the basic signal model.
[J1] J. He, E. L. Tan, and W. S. Gan, “Linear estimation based primary-ambient extraction for stereo audio signals,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 505-517, Feb. 2014.
II: Improving PAE performance for strong ambient power cases using ambient spectrum estimation techniques.
[J2] J. He, W. S. Gan, and E. L. Tan, “Primary-ambient extraction using ambient phase estimation with a sparsity constraint,” IEEE Signal Process. Letters, vol. 22, no. 8, pp. 1127-1131, Aug. 2015.
[J3] J. He, E. L. Tan, and W. S. Gan, “Primary-ambient extraction using ambient spectrum estimation for immersive spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1431-1444, Sept. 2015.
III: Employing time-shifting techniques for PAE with partially correlated primary components.
[J4] J. He, W. S. Gan, and E. L. Tan, “Time-shifting based primary-ambient extraction for spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1576-1588, Oct. 2015.
[C1] J. He, E. L. Tan, and W. S. Gan, “Time-shifted principal component analysis based cue extraction for stereo audio signals,” in Proc. ICASSP, Vancouver, Canada, 2013, pp. 266-270.
IV: Adaptation of conventional PAE approaches to deal with multiple sources.
[C2] J. He, W. S. Gan, and E. L. Tan, “A study on the frequency-domain primary-ambient extraction for stereo audio signals,” in Proc. ICASSP, Florence, Italy, 2014, pp. 2892-2896.
[C3] J. He, and W. S. Gan, “Multi-shift principal component analysis based primary component extraction for spatial audio reproduction,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 350-354.
V: Applying PAE in natural sound rendering headphone systems.
[J5] K. Sunder, J. He, E. L. Tan, and W. S. Gan, “Natural sound rendering for headphones: Integration of signal processing techniques,” IEEE Signal Process. Magazine, vol. 32, no. 2, pp. 100-113, Mar. 2015.
Other related publications
[C4] J. He, W. S. Gan, and E. L. Tan, “On the preprocessing and postprocessing of HRTF individualization based on sparse representation of anthropometry features,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 639-643.
[C5] J. He, and W. S. Gan, “Applying primary ambient extraction for immersive spatial audio reproduction,” 2015 Asia Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference (invited), Hong Kong, Dec. 2015.
[C6] J. He, R. Ranjan, and W. S. Gan, “Fast continuous HRTF acquisition with unconstrained movements of human subjects,” in Proc. ICASSP, Shanghai, China, Mar. 2016.
[Tutorial] W. S. Gan, and J. He, “Assisted listening for headphones and hearing aids: signal processing techniques,” Tutorial at APSIPA ASC 2015, Hong Kong, Dec. 2015.
[Show & Tell] D. H. Nguyen, J. He, K. K. Phyo, and W. S. Gan, “Real-time audio signal processing platform for natural 3D sound rendering,” Show & Tell at ICASSP 2016, Shanghai, China, Mar. 2016.
[J6] J. He, R. Ranjan, W. S. Gan, and K. Sunder, “Scalable data reusing normalized LMS for acoustic system identification with short duration signals,” IEEE Sig. Process. Letters, under review.
Objective of PAE
To extract the primary and ambient components from M mixtures, where each mixture is the sum of a primary component and an ambient component:
$$x_m[n] = p_m[n] + a_m[n]$$
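Definitions with Stereo Signal Model
Stereo signal model (signal = primary + ambient):
$$\mathbf{x}_0 = \mathbf{p}_0 + \mathbf{a}_0, \qquad \mathbf{x}_1 = \mathbf{p}_1 + \mathbf{a}_1$$
PAE takes the input signal $\mathbf{x}_0, \mathbf{x}_1$ (the sum of the true components $\mathbf{p}_0, \mathbf{a}_0, \mathbf{p}_1, \mathbf{a}_1$) and outputs the extracted components $\hat{\mathbf{p}}_0, \hat{\mathbf{p}}_1, \hat{\mathbf{a}}_0, \hat{\mathbf{a}}_1$.
Assumptions:
• Primary components highly correlated: $\mathbf{p}_1 = k\,\mathbf{p}_0$
• Ambient components uncorrelated: $\mathbf{a}_0 \perp \mathbf{a}_1$
• Primary and ambient components uncorrelated: $\mathbf{p}_i \perp \mathbf{a}_j, \; \forall i, j \in \{0, 1\}$
• Ambient power balanced: $P_{\mathbf{a}_0} = P_{\mathbf{a}_1}$
Definitions:
• Primary panning factor (PPF): $k$, the ratio of $\mathbf{p}_1$ to $\mathbf{p}_0$
• Primary power ratio (PPR): $\gamma = \dfrac{\text{total primary power}}{\text{total signal power}}$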
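The stereo signal model can be illustrated with a small synthetic example. The sketch below is only an illustration under stated assumptions: the primary and ambient components are drawn as independent Gaussian noise (not real audio), the helper names `make_stereo` and `primary_power_ratio` are hypothetical, and the total signal power is taken as primary plus ambient power, which holds when the components are uncorrelated.

```python
import math
import random

def power(x):
    """Mean power of a real-valued signal."""
    return sum(v * v for v in x) / len(x)

def make_stereo(n, k, ambient_gain, seed=0):
    """Synthesize x0 = p0 + a0 and x1 = k*p0 + a1 from independent noise sources."""
    rng = random.Random(seed)
    p0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    a0 = [ambient_gain * rng.gauss(0.0, 1.0) for _ in range(n)]
    a1 = [ambient_gain * rng.gauss(0.0, 1.0) for _ in range(n)]
    p1 = [k * v for v in p0]                      # amplitude-panned primary
    x0 = [p + a for p, a in zip(p0, a0)]
    x1 = [p + a for p, a in zip(p1, a1)]
    return (p0, p1), (a0, a1), (x0, x1)

def primary_power_ratio(primary, ambient):
    """PPR = total primary power / (total primary + total ambient power)."""
    pp = power(primary[0]) + power(primary[1])
    pa = power(ambient[0]) + power(ambient[1])
    return pp / (pp + pa)

primary, ambient, mixture = make_stereo(n=20000, k=2.0, ambient_gain=1.0)
ppr = primary_power_ratio(primary, ambient)
k_true = math.sqrt(power(primary[1]) / power(primary[0]))
```

With unit-variance sources and k = 2, the PPR should come out close to 5/7, since the primary power is (1 + k²) times the channel-0 primary power and the ambient power is balanced across channels.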
PAE: Time-Frequency Masking
Primary / ambient components = Mixture ⊙ Mask, where ⊙ denotes element-wise multiplication over the time-frequency (t, f) plane.
Mask can be constructed using
• Inter-channel coherence [Avendano and Jot, 2004]
• Pairwise correlation [Thompson et al., 2012]
• Equal level of ambience [Merimaa et al., 2007]
• Diffuseness [Pulkki, 2007]
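The masking operation itself is just an element-wise product. The following minimal sketch assumes a made-up 2x3 complex spectrogram and made-up mask values, and it assumes the ambient mask is the complement of the primary mask. None of these values or the complement rule come from the cited masking methods, which each construct the mask differently.

```python
def apply_mask(mixture, mask):
    """Element-wise multiply a time-frequency mixture by a real mask in [0, 1]."""
    return [[m * x for m, x in zip(mrow, xrow)]
            for mrow, xrow in zip(mask, mixture)]

# Toy "spectrogram": rows are frequency bins, columns are frames (values are invented).
X = [[1 + 1j, 0.5j, 2.0],
     [0.2, -1j, 1 - 1j]]
mask_primary = [[0.9, 0.1, 0.8],
                [0.2, 0.5, 1.0]]
# Assumption for illustration: the ambient mask is the complement of the primary mask.
mask_ambient = [[1.0 - m for m in row] for row in mask_primary]

P_hat = apply_mask(X, mask_primary)
A_hat = apply_mask(X, mask_ambient)
```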
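Linear Estimation framework in PAE
$$\begin{bmatrix} \hat{\mathbf{p}}_0^T \\ \hat{\mathbf{p}}_1^T \\ \hat{\mathbf{a}}_0^T \\ \hat{\mathbf{a}}_1^T \end{bmatrix} = \begin{bmatrix} w_{P0,0} & w_{P0,1} \\ w_{P1,0} & w_{P1,1} \\ w_{A0,0} & w_{A0,1} \\ w_{A1,0} & w_{A1,1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_0^T \\ \mathbf{x}_1^T \end{bmatrix} = \mathbf{W} \begin{bmatrix} \mathbf{x}_0^T \\ \mathbf{x}_1^T \end{bmatrix}$$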
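As a sketch of how the linear estimation framework is applied, the code below builds a 4x2 weight matrix and applies it sample by sample. The weights used are the PCA weights for a known panning factor k, written out from the PCA extraction formulas (primary along the direction [1, k], ambient as the residual); the function names are hypothetical and the panning factor is assumed to be known rather than estimated.

```python
def pca_weights(k):
    """4x2 weight matrix W for PCA-based PAE; rows are p0, p1, a0, a1."""
    g = 1.0 / (1.0 + k * k)
    return [
        [g, g * k],            # p0_hat = (x0 + k*x1) / (1 + k^2)
        [g * k, g * k * k],    # p1_hat = k * p0_hat
        [g * k * k, -g * k],   # a0_hat = x0 - p0_hat
        [-g * k, g],           # a1_hat = x1 - p1_hat
    ]

def linear_pae(W, x0, x1):
    """Apply [p0; p1; a0; a1] = W [x0; x1] to every sample."""
    return [[w0 * u + w1 * v for u, v in zip(x0, x1)] for w0, w1 in W]

# Pure primary input with k = 2: the ambient estimates should vanish.
x0 = [1.0, 0.5, -0.25]
x1 = [2.0, 1.0, -0.5]
p0_hat, p1_hat, a0_hat, a1_hat = linear_pae(pca_weights(2.0), x0, x1)
```

Other approaches in this framework differ only in how the eight weights are chosen.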
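Performance measures
Extraction error of the primary component (channel 0), split into distortion and leakage:
$$\mathbf{e}_P = \hat{\mathbf{p}}_0 - \mathbf{p}_0 = \left(w_{P0,0} + k\,w_{P0,1} - 1\right)\mathbf{p}_0 + \left(w_{P0,0}\,\mathbf{a}_0 + w_{P0,1}\,\mathbf{a}_1\right)$$
Extraction error of the ambient component (channel 0), split into distortion, interference, and leakage:
$$\mathbf{e}_A = \hat{\mathbf{a}}_0 - \mathbf{a}_0 = \left(w_{A0,0} - 1\right)\mathbf{a}_0 + w_{A0,1}\,\mathbf{a}_1 + \left(w_{A0,0}\,\mathbf{p}_0 + w_{A0,1}\,\mathbf{p}_1\right)$$
Extraction accuracy:
$$\mathrm{ESR} = \frac{P_{\text{Error}}}{P_{\text{True signal}}}, \quad \mathrm{LSR} = \frac{P_{\text{Leakage}}}{P_{\text{True signal}}}, \quad \mathrm{DSR} = \frac{P_{\text{Distortion}}}{P_{\text{True signal}}}, \quad \mathrm{ISR} = \frac{P_{\text{Interference}}}{P_{\text{True signal}}}$$
• ESR: Error-to-signal ratio
• DSR: Distortion-to-signal ratio
• ISR: Interference-to-signal ratio
• LSR: Leakage-to-signal ratio
Spatial accuracy: ICC, ICLD, ICTD (only for primary)
• ICC: Inter-channel cross-correlation coefficient
• ICLD: Inter-channel level difference
• ICTD: Inter-channel time difference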
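The error decomposition for the primary component can be checked numerically. This sketch assumes the stereo model p1 = k·p0 with Gaussian-noise test signals and uses the weights w_P0,0 = 1/(1+k²) and w_P0,1 = k/(1+k²) (the PCA weights for k = 2) as an example; the function names are hypothetical.

```python
import random

def decompose_primary_error(w00, w01, k, p0, a0, a1):
    """Split e_P = p0_hat - p0 into distortion and leakage for p1 = k*p0."""
    x0 = [p + a for p, a in zip(p0, a0)]
    x1 = [k * p + a for p, a in zip(p0, a1)]
    p0_hat = [w00 * u + w01 * v for u, v in zip(x0, x1)]
    error = [ph - p for ph, p in zip(p0_hat, p0)]
    distortion = [(w00 + k * w01 - 1.0) * p for p in p0]   # scaled copy of p0
    leakage = [w00 * u + w01 * v for u, v in zip(a0, a1)]  # ambience leaking in
    return error, distortion, leakage

def power_ratio(num, den):
    """Ratio of signal energies (linear scale, not dB)."""
    return sum(v * v for v in num) / sum(v * v for v in den)

rng = random.Random(0)
p0 = [rng.gauss(0.0, 1.0) for _ in range(100)]
a0 = [rng.gauss(0.0, 1.0) for _ in range(100)]
a1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
error, distortion, leakage = decompose_primary_error(0.2, 0.4, 2.0, p0, a0, a1)
dsr = power_ratio(distortion, p0)   # distortion-to-signal ratio
lsr = power_ratio(leakage, p0)      # leakage-to-signal ratio
esr = power_ratio(error, p0)        # error-to-signal ratio
```

For these weights the distortion term vanishes, so the whole primary extraction error is ambient leakage.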
PAE: Linear Estimation
$$\begin{bmatrix} \hat{p}_0(n) \\ \hat{p}_1(n) \\ \hat{a}_0(n) \\ \hat{a}_1(n) \end{bmatrix} = \begin{bmatrix} w_{P0,0} & w_{P0,1} \\ w_{P1,0} & w_{P1,1} \\ w_{A0,0} & w_{A0,1} \\ w_{A1,0} & w_{A1,1} \end{bmatrix} \begin{bmatrix} x_0(n) \\ x_1(n) \end{bmatrix}$$
[Figure: Objectives and relationships of four linear estimation based PAE approaches. Blue solid lines represent the relationships in the primary component; green dotted lines represent the relationships in the ambient component.]
MLLS: minimum leakage LS
MDLS: minimum distortion LS
Performance of the four PAE approaches
[Table: closed-form ESR, LSR, DSR, ISR, ICC (ICTD), and ICLD of the extracted primary and ambient components for PCA, LS, MLLS, and MDLS, expressed in terms of the PPF k and the PPR γ.]
Theoretical results are verified in our experiments!
Experimental results: Primary extraction
[Figure: Performance of MDLS (or PCA) and MLLS (or LS) in primary extraction versus true PPR: (a) ESR_P, (b) LSR_P, (c) DSR_P.]
PAE using Adjustable Least-squares (ALS)
Adjustable factor: β ∈ [0, 1]
[Figure: ALS(β) links the minimum-distortion (MDLS), minimum-leakage (MLLS), and minimum-error (LS) solutions and their relationships to PCA, with the endpoints β = 0 and β = 1 marked. Blue solid lines: primary component; green dotted lines: ambient component.]
[Equations: the ALS weight matrix W_ALS and the resulting ESR, DSR, and LeSR of the primary and ambient components as functions of k, γ, and β; LeSR_A = 0.]
Recommendations of Linear Estimation based PAE
PCA
• Strengths: No distortion in the extracted primary component; no primary leakage in the extracted ambient component; primary and ambient components are uncorrelated.
• Weaknesses: Ambient component severely panned.
• Recommendations: Spatial audio coding and interactive audio in gaming, where the primary component is more important than the ambient component.
LS
• Strengths: Minimum MSE in the extracted primary and ambient components.
• Weaknesses: Severe primary leakage in the extracted ambient component.
• Recommendations: Applications in which both the primary and ambient components are extracted, processed, and finally mixed together.
MLLS
• Strengths: Minimum leakage in the extracted primary and ambient components; primary and ambient components are uncorrelated.
• Weaknesses: Ambient component severely panned.
• Recommendations: Spatial audio enhancement systems, and applications in which different rendering or playback techniques are employed on the extracted primary and ambient components.
MDLS
• Strengths: No distortion in the extracted primary and ambient components.
• Weaknesses: Severe interference and primary leakage in the extracted ambient component.
• Recommendations: High-fidelity applications in which timbre is of high importance.
ALS
• Strengths: Performance adjustable.
• Weaknesses: Need to adjust the value of the adjustable factor.
• Recommendations: For applications without explicit requirements.
Contributions on linear estimation based PAE
1. Formulated the linear estimation framework for PAE.
2. Introduced two groups of performance measures.
– Extraction accuracy: ESR, DSR, ISR, LSR
– Spatial accuracy: ICC, ICTD, ICLD
3. Proposed MLLS, MDLS, ALS and compared them with PCA and LS in PAE.
4. Different approaches are recommended in different spatial audio applications.
[J1] J. He, E. L. Tan and W. S. Gan, “Linear estimation based primary-ambient extraction for stereo audio signals,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 505-517, Feb. 2014.
Observation
Ambient components are decorrelated using:
– A small delay
– All-pass filtering with random phase
– Artificial reverberation, and binaural reverberation
The magnitudes of the ambient components are usually kept the same.
Ambient components (⊙: element-wise multiplication):
$$\mathbf{A}_0 = |\mathbf{A}_0| \odot \mathbf{W}_0, \qquad W_0(n,l) = e^{j\theta_0(n,l)}$$
$$\mathbf{A}_1 = |\mathbf{A}_1| \odot \mathbf{W}_1, \qquad W_1(n,l) = e^{j\theta_1(n,l)}$$
Ambient magnitude: $|\mathbf{A}_0| = |\mathbf{A}_1| = |\mathbf{A}|$
$\mathbf{W}_0, \mathbf{W}_1$: complex with unit magnitude; $\theta_0, \theta_1$: ambient phase.
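From PAE to APE: ambient phase estimation
PAE:
$$\mathbf{X}_0 = \mathbf{P}_0 + \mathbf{A}_0, \qquad \mathbf{X}_1 = \mathbf{P}_1 + \mathbf{A}_1,$$
with $\mathbf{P}_1 = k\mathbf{P}_0$, $\mathbf{A}_0 \perp \mathbf{A}_1$, $\mathbf{P}_i \perp \mathbf{A}_j \; \forall i, j \in \{0, 1\}$, $P_{\mathbf{P}_1} = k^2 P_{\mathbf{P}_0}$, $P_{\mathbf{A}_1} = P_{\mathbf{A}_0}$.
X: mixed signal; P: primary components; k: primary panning factor.
APE (./ denotes element-wise division, c ∈ {0, 1} is the channel index):
$$|\mathbf{A}| = (\mathbf{X}_1 - k\mathbf{X}_0) \,./\, (\mathbf{W}_1 - k\mathbf{W}_0)$$
$$\mathbf{A}_c = (\mathbf{X}_1 - k\mathbf{X}_0) \,./\, (\mathbf{W}_1 - k\mathbf{W}_0) \odot \mathbf{W}_c, \qquad \mathbf{P}_c = \mathbf{X}_c - (\mathbf{X}_1 - k\mathbf{X}_0) \,./\, (\mathbf{W}_1 - k\mathbf{W}_0) \odot \mathbf{W}_c$$
Find $\mathbf{W}_0, \mathbf{W}_1$ → find $\boldsymbol{\theta}_0, \boldsymbol{\theta}_1$ → find $\boldsymbol{\theta}_1$, since
$$\boldsymbol{\theta}_0 = \boldsymbol{\theta} + \arcsin\!\left[k^{-1}\sin(\boldsymbol{\theta} - \boldsymbol{\theta}_1)\right] + \pi, \qquad \text{where } \boldsymbol{\theta} = \angle(\mathbf{X}_1 - k\mathbf{X}_0).$$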
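To make the APE relations concrete, the sketch below processes a single time-frequency bin: given X0, X1, k, and a value of the channel-1 ambient phase θ1, it derives θ0, the common ambient magnitude, and the primary and ambient spectra. It assumes k > 1 and uses the relation θ0 = θ + arcsin(k⁻¹ sin(θ − θ1)) + π with θ = ∠(X1 − kX0), which follows from requiring the ambient magnitude to be real and non-negative. The function name and test values are hypothetical.

```python
import cmath
import math

def ape_extract(X0, X1, k, theta1):
    """Return (P0, P1, A0, A1) for one bin, given the ambient phase theta1 (k > 1)."""
    D = X1 - k * X0                       # equals A1 - k*A0 because P1 = k*P0
    theta = cmath.phase(D)
    theta0 = theta + math.asin(math.sin(theta - theta1) / k) + math.pi
    W0 = cmath.exp(1j * theta0)
    W1 = cmath.exp(1j * theta1)
    r = (D / (W1 - k * W0)).real          # common ambient magnitude |A|
    A0, A1 = r * W0, r * W1
    return X0 - A0, X1 - A1, A0, A1

# Synthetic bin built from known components (values are arbitrary).
k = 2.0
P0_true = 0.8 + 0.3j
r_true = 0.5
X0 = P0_true + r_true * cmath.exp(1.0j)
X1 = k * P0_true + r_true * cmath.exp(-0.7j)
P0, P1, A0, A1 = ape_extract(X0, X1, k, theta1=-0.7)
```

When the true θ1 is supplied, the true components are recovered; in practice θ1 is unknown and has to be estimated.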
From PAE to APE: ambient phase estimation
PAE: $\mathbf{X}_0 = \mathbf{P}_0 + \mathbf{A}_0$, $\mathbf{X}_1 = \mathbf{P}_1 + \mathbf{A}_1$
APE: $|\mathbf{A}| = (\mathbf{X}_1 - k\mathbf{X}_0) \,./\, (\mathbf{W}_1 - k\mathbf{W}_0)$
Is this the only way?
For each time-frequency bin:
$$kX_0 = P_1 + kA_0, \qquad X_1 = P_1 + A_1$$
Can the complex spectrum be shown in the complex plane?
From PAE to AME: ambient magnitude estimation
Put the complex spectrum in two-dimensional vector form:
$$kX_0 = P_1 + kA_0, \qquad X_1 = P_1 + A_1$$
$$kX_0 = \overrightarrow{OB} = (B_{\mathrm{Re}}, B_{\mathrm{Im}}), \quad X_1 = \overrightarrow{OC} = (C_{\mathrm{Re}}, C_{\mathrm{Im}}), \quad P_1 = \overrightarrow{OP} = (P_{\mathrm{Re}}, P_{\mathrm{Im}}), \quad kA_0 = \overrightarrow{PB}, \quad A_1 = \overrightarrow{PC}$$
[Figure: complex plane (Re, Im) with origin O (0, 0), known points B and C, and the unknown point P.]
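From PAE to AME: ambient magnitude estimation
With $r = |A_0| = |A_1|$, we have $|PC| = r$ and $|PB| = kr$. Find r:
$$(P_{\mathrm{Re}} - B_{\mathrm{Re}})^2 + (P_{\mathrm{Im}} - B_{\mathrm{Im}})^2 = k^2\hat{r}^2, \qquad (P_{\mathrm{Re}} - C_{\mathrm{Re}})^2 + (P_{\mathrm{Im}} - C_{\mathrm{Im}})^2 = \hat{r}^2.$$
Solving for the intersection of the two circles:
$$\hat{P}_{\mathrm{Re}} = \frac{B_{\mathrm{Re}} + C_{\mathrm{Re}}}{2} + \frac{(C_{\mathrm{Re}} - B_{\mathrm{Re}})(k^2 - 1)\hat{r}^2}{2|BC|^2} \pm \frac{B_{\mathrm{Im}} - C_{\mathrm{Im}}}{2|BC|^2}\sqrt{\Delta},$$
$$\hat{P}_{\mathrm{Im}} = \frac{B_{\mathrm{Im}} + C_{\mathrm{Im}}}{2} + \frac{(C_{\mathrm{Im}} - B_{\mathrm{Im}})(k^2 - 1)\hat{r}^2}{2|BC|^2} \mp \frac{B_{\mathrm{Re}} - C_{\mathrm{Re}}}{2|BC|^2}\sqrt{\Delta},$$
where
$$|BC| = \sqrt{(C_{\mathrm{Re}} - B_{\mathrm{Re}})^2 + (C_{\mathrm{Im}} - B_{\mathrm{Im}})^2}, \qquad \Delta = \left[(k+1)^2\hat{r}^2 - |BC|^2\right]\left[|BC|^2 - (k-1)^2\hat{r}^2\right].$$
The extracted components follow as
$$\hat{P}_1 = \hat{P}_{\mathrm{Re}} + j\hat{P}_{\mathrm{Im}}, \quad \hat{P}_0 = k^{-1}(\hat{P}_{\mathrm{Re}} + j\hat{P}_{\mathrm{Im}}), \quad \hat{A}_1 = X_1 - (\hat{P}_{\mathrm{Re}} + j\hat{P}_{\mathrm{Im}}), \quad \hat{A}_0 = X_0 - k^{-1}(\hat{P}_{\mathrm{Re}} + j\hat{P}_{\mathrm{Im}}).$$
[Figure: complex plane (Re, Im) with O (0, 0), B, C, and P, showing $kX_0$, $X_1$, $P_1$, $kA_0$, and $A_1$.]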
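The AME geometry amounts to intersecting two circles in the complex plane: one centered at B = kX0 with radius kr and one centered at C = X1 with radius r. The sketch below, with a hypothetical function name and arbitrary test values, returns both intersection points as candidates for P1. Choosing between the two candidates, and choosing r itself, is outside this snippet.

```python
import cmath
import math

def ame_candidates(X0, X1, k, r):
    """Return the candidate P1 values (circle intersections) for ambient magnitude r."""
    B, C = k * X0, X1
    d2 = abs(C - B) ** 2                                   # |BC|^2
    delta = ((k + 1) ** 2 * r * r - d2) * (d2 - (k - 1) ** 2 * r * r)
    if d2 == 0.0 or delta < 0.0:
        return []                                          # circles do not intersect
    mid = (B + C) / 2 + (C - B) * (k * k - 1) * r * r / (2 * d2)
    # 1j * (C - B) is the vector BC rotated by 90 degrees.
    offset = 1j * (C - B) * math.sqrt(delta) / (2 * d2)
    return [mid + offset, mid - offset]

# Synthetic bin built from known components (values are arbitrary).
k = 2.0
P0_true = 0.8 + 0.3j
r_true = 0.5
X0 = P0_true + r_true * cmath.exp(1.0j)
X1 = k * P0_true + r_true * cmath.exp(-0.7j)
candidates = ame_candidates(X0, X1, k, r_true)
```

Every returned candidate satisfies |P − C| = r and |P − B| = kr, and one of them coincides with the true P1 = kP0 when the true r is supplied.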
Ambient Spectrum Estimation: APE and AME
APE: find $\theta_1 = \angle \overrightarrow{PC}$
AME: find $r = |PC|$
[Figure: complex plane (Re, Im) with O (0, 0), B, C, and P, showing $kX_0$, $X_1$, $P_1$, $kA_0$, and $A_1$.]
Ambient Spectrum Estimation (ASE)
Input signal $\mathbf{x}_0, \mathbf{x}_1$ → T-F transform → select ASE (APE: find $\boldsymbol{\theta}_1$; AME: find $\mathbf{r}$) → compute $\hat{\mathbf{P}}_0, \hat{\mathbf{P}}_1, \hat{\mathbf{A}}_0, \hat{\mathbf{A}}_1$ → inverse T-F transform → extracted primary and ambient components $\hat{\mathbf{p}}_0, \hat{\mathbf{p}}_1, \hat{\mathbf{a}}_0, \hat{\mathbf{a}}_1$
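ASE: how to estimate? Using a Sparsity Constraint
Objective: find $\boldsymbol{\theta}_1$ (APE) or find $\mathbf{r}$ (AME).
How:
$$\hat{\boldsymbol{\theta}}_1^* = \arg\min_{\hat{\boldsymbol{\theta}}_1} \|\hat{\mathbf{P}}_1\|_1 \;\; \text{(APES)}, \qquad \hat{\mathbf{r}}^* = \arg\min_{\hat{\mathbf{r}}} \|\hat{\mathbf{P}}_1\|_1 \;\; \text{(AMES)}$$
Method:
• Analytic? No.
• Convex? No.
• Heuristic? Inefficient.
• Numerical? Discrete searching.
Solution by discrete searching is feasible because the estimates are:
• Independent across time-frequency bins ($n_i \neq n_j$ or $l_i \neq l_j$);
• Bounded: $\hat{\theta}_1 \in (-\pi, \pi]$, $\hat{r} \in [r_{lb}, r_{ub}]$.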
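A discrete search of this kind can be sketched for a single time-frequency bin. Because the bins are independent, minimizing the 1-norm of the primary spectrum reduces to minimizing |P1| in each bin. The code below is an illustrative sketch only: it assumes k > 1, a uniform grid of `num_candidates` phases in (−π, π], and the APE relation θ0 = θ + arcsin(k⁻¹ sin(θ − θ1)) + π. The function names and test values are hypothetical.

```python
import cmath
import math

def p1_given_theta1(X0, X1, k, theta1):
    """Primary spectrum P1 of one bin implied by an ambient phase theta1 (k > 1)."""
    D = X1 - k * X0
    theta = cmath.phase(D)
    theta0 = theta + math.asin(math.sin(theta - theta1) / k) + math.pi
    W0, W1 = cmath.exp(1j * theta0), cmath.exp(1j * theta1)
    r = (D / (W1 - k * W0)).real          # ambient magnitude for this theta1
    return X1 - r * W1

def apes_bin(X0, X1, k, num_candidates=100):
    """Grid search over theta1 in (-pi, pi] that minimizes |P1| for one bin."""
    best_theta, best_p1 = None, None
    for d in range(1, num_candidates + 1):
        theta1 = -math.pi + 2.0 * math.pi * d / num_candidates
        p1 = p1_given_theta1(X0, X1, k, theta1)
        if best_p1 is None or abs(p1) < abs(best_p1):
            best_theta, best_p1 = theta1, p1
    return best_theta, best_p1

X0 = 0.9 + 0.1j      # arbitrary test bin
X1 = 1.5 - 0.4j
theta_hat, P1_hat = apes_bin(X0, X1, k=2.0)
```

The cost grows linearly with the number of candidates D, which is why a closed-form approximation is attractive.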
Approximate solution APEX and computational cost
Approximate closed-form solution APEX:
$$\hat{\boldsymbol{\theta}}_1^* = \begin{cases} \angle \mathbf{X}_1, & k \neq 1 \\ \angle(\mathbf{X}_1 - \mathbf{X}_0), & k = 1 \end{cases}$$
Computation cost per time-frequency bin (D: number of phase or magnitude estimates in discrete searching):
Operation | APES | AMES | APEX
Square root | 2D+2 | D | 0
Addition | 25D+35 | 15D+18 | 13
Multiplication | 24D+24 | 15D+13 | 7
Division | 9D+13 | 4D+6 | 4
Comparison | D-1 | D-1 | 1
Trigonometric operation | 7D+6 | 0 | 7
Evaluation of PAE: Extraction accuracy
Performance measure: Error-to-Signal Ratio (ESR)
$$\mathrm{ESR}_P = 10\log_{10}\left\{\frac{1}{2}\sum_{c=0}^{1}\frac{\|\hat{\mathbf{p}}_c - \mathbf{p}_c\|_2^2}{\|\mathbf{p}_c\|_2^2}\right\}, \qquad \mathrm{ESR}_A = 10\log_{10}\left\{\frac{1}{2}\sum_{c=0}^{1}\frac{\|\hat{\mathbf{a}}_c - \mathbf{a}_c\|_2^2}{\|\mathbf{a}_c\|_2^2}\right\}.$$
The error consists of distortion, interference, and leakage, measured by DSR, ISR, and LSR, respectively.
When there is no analytic solution, how can these measures be computed? We propose an optimization technique to compute them.
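The ESR definition translates directly into code. The following sketch assumes real-valued time-domain signals stored as Python lists, one list per channel; the function name `esr_db` and the toy signals are hypothetical.

```python
import math

def esr_db(estimates, references):
    """ESR in dB: 10*log10 of the channel-averaged error-to-signal energy ratio."""
    total = 0.0
    for est, ref in zip(estimates, references):
        err = sum((e - r) ** 2 for e, r in zip(est, ref))
        total += err / sum(r * r for r in ref)
    return 10.0 * math.log10(total / len(references))

# Toy check: every sample is over-estimated by 10 percent in both channels,
# so the error energy is 1 percent of the signal energy.
refs = ([1.0, -1.0, 2.0], [2.0, -2.0, 4.0])
ests = tuple([1.1 * v for v in ch] for ch in refs)
esr = esr_db(ests, refs)
```

Lower (more negative) values indicate more accurate extraction.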
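Optimization method for PAE extraction accuracy
Decomposition of the extracted components:
$$\hat{\mathbf{p}}_c = \mathbf{p}_c + \mathrm{Leak}_{\mathbf{p}_c} + \mathrm{Dist}_{\mathbf{p}_c} = \mathbf{p}_c + w_{Pc,0}\mathbf{a}_0 + w_{Pc,1}\mathbf{a}_1 + \mathrm{Dist}_{\mathbf{p}_c},$$
$$\hat{\mathbf{a}}_c = \mathbf{a}_c + \mathrm{Leak}_{\mathbf{a}_c} + \mathrm{Intf}_{\mathbf{a}_c} + \mathrm{Dist}_{\mathbf{a}_c} = \mathbf{a}_c + w_{Ac,c}\mathbf{p}_c + w_{Ac,1-c}\mathbf{a}_{1-c} + \mathrm{Dist}_{\mathbf{a}_c}.$$
The weights are obtained by least squares:
$$\left(w^*_{Pc,0}, w^*_{Pc,1}\right) = \arg\min_{w_{Pc,0},\, w_{Pc,1}} \left\|\hat{\mathbf{p}}_c - \mathbf{p}_c - \left(w_{Pc,0}\mathbf{a}_0 + w_{Pc,1}\mathbf{a}_1\right)\right\|_2^2,$$
$$\left(w^*_{Ac,c}, w^*_{Ac,1-c}\right) = \arg\min_{w_{Ac,c},\, w_{Ac,1-c}} \left\|\hat{\mathbf{a}}_c - \mathbf{a}_c - \left(w_{Ac,c}\mathbf{p}_c + w_{Ac,1-c}\mathbf{a}_{1-c}\right)\right\|_2^2.$$
Measures for the primary component:
$$\mathrm{LSR}_P = 10\log_{10}\left\{\frac{1}{2}\sum_{c=0}^{1}\frac{\|w^*_{Pc,0}\mathbf{a}_0 + w^*_{Pc,1}\mathbf{a}_1\|_2^2}{\|\mathbf{p}_c\|_2^2}\right\}, \qquad \mathrm{DSR}_P = 10\log_{10}\left\{\frac{1}{2}\sum_{c=0}^{1}\frac{\|\hat{\mathbf{p}}_c - \mathbf{p}_c - w^*_{Pc,0}\mathbf{a}_0 - w^*_{Pc,1}\mathbf{a}_1\|_2^2}{\|\mathbf{p}_c\|_2^2}\right\}.$$
Measures for the ambient component:
$$\mathrm{LSR}_A = 10\log_{10}\left\{\frac{1}{2}\sum_{c=0}^{1}\frac{\|w^*_{Ac,c}\mathbf{p}_c\|_2^2}{\|\mathbf{a}_c\|_2^2}\right\}, \qquad \mathrm{ISR}_A = 10\log_{10}\left\{\frac{1}{2}\sum_{c=0}^{1}\frac{\|w^*_{Ac,1-c}\mathbf{a}_{1-c}\|_2^2}{\|\mathbf{a}_c\|_2^2}\right\},$$
$$\mathrm{DSR}_A = 10\log_{10}\left\{\frac{1}{2}\sum_{c=0}^{1}\frac{\|\hat{\mathbf{a}}_c - \mathbf{a}_c - w^*_{Ac,c}\mathbf{p}_c - w^*_{Ac,1-c}\mathbf{a}_{1-c}\|_2^2}{\|\mathbf{a}_c\|_2^2}\right\}.$$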
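For the primary component, the least-squares step can be sketched by solving the 2x2 normal equations directly. The code below handles one channel and returns linear (not dB) ratios, so the slide's measures would average these ratios over both channels and then apply 10·log10. The function names and toy signals are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length real sequences."""
    return sum(a * b for a, b in zip(u, v))

def leakage_weights(p_hat, p, a0, a1):
    """Least-squares fit of the error (p_hat - p) onto a0 and a1."""
    e = [x - y for x, y in zip(p_hat, p)]
    g00, g01, g11 = dot(a0, a0), dot(a0, a1), dot(a1, a1)
    b0, b1 = dot(a0, e), dot(a1, e)
    det = g00 * g11 - g01 * g01           # assumes a0 and a1 are not collinear
    return (g11 * b0 - g01 * b1) / det, (g00 * b1 - g01 * b0) / det

def leakage_and_distortion_ratios(p_hat, p, a0, a1):
    """Per-channel leakage-to-signal and distortion-to-signal energy ratios."""
    w0, w1 = leakage_weights(p_hat, p, a0, a1)
    leak = [w0 * u + w1 * v for u, v in zip(a0, a1)]
    dist = [ph - x - l for ph, x, l in zip(p_hat, p, leak)]
    pp = dot(p, p)
    return dot(leak, leak) / pp, dot(dist, dist) / pp

# Toy signals: the estimate is the true primary plus a known ambient leakage.
p = [1.0, 2.0, 3.0, 4.0]
a0 = [1.0, -1.0, 1.0, -1.0]
a1 = [1.0, 1.0, -1.0, -1.0]
p_hat = [x + 0.3 * u - 0.2 * v for x, u, v in zip(p, a0, a1)]
w0, w1 = leakage_weights(p_hat, p, a0, a1)
lsr, dsr = leakage_and_distortion_ratios(p_hat, p, a0, a1)
```

In this toy case the fit recovers the leakage weights 0.3 and −0.2, and the residual distortion is numerically zero.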
Objective evaluation
Stimuli
• Primary component: speech, k = 2
• Ambient component: wave lapping sound
• Primary power ratio (PPR): (0, 1) at an interval of 0.1
• FFT size: 4096
Approaches compared
• Masking
• PCA
• LS
• ASE (APEX)
Performance evaluated
1. Extraction accuracy: ESR_P and ESR_A
2. Spatial accuracy: ICC, ICLD
Extraction accuracy
[Figure: (a) ESR_P (dB) and (b) ESR_A (dB) versus primary power ratio for Masking, PCA, LS, and ASE.]
Spatial accuracy
[Figure: (a) ICC_P, (b) ICLD_P (dB), (c) ICC_A, and (d) ICLD_A (dB) versus primary power ratio for True, Masking, PCA, LS, and ASE; "Best" marks the ideal value in each panel.]
Subjective evaluation
Stimuli
• Primary component: speech, music, and bee sound, k = 2
• Ambient component: forest, canteen, and waterfall sound
• Primary power ratio (PPR): (0.3, 0.7)
• Duration: 2-4 seconds
Approaches compared
• Masking
• PCA
• LS
• ASE (APEX)
• Reference
• Mixture
Performance evaluated
1. Extraction accuracy
2. Ambient diffuseness
Listening tests
• 17 subjects
• Headphone listening
• Procedure similar to MUSHRA
[Figure: subjective scores (0 to 100; Bad, Poor, Fair, Good, Excellent) for Ref, Mix, Masking, PCA, LS, and ASE: (a) extraction accuracy of primary components, (b) extraction accuracy of ambient components, (c) diffuseness accuracy of ambient components.]
Contributions on ASE based PAE
1. Reformulated PAE as ASE by exploiting the equal ambient magnitude.
2. Solved ASE using the criterion of sparse primary component, resulting in APES, AMES, and APEX.
3. Proposed a technique to compute error measures for PAE approaches without analytic solutions.
4. Validated the improved performance (ESR reduction: 3-6 dB on average) in the objective and subjective experiments.
[Figure: complex plane (Re, Im) with O (0, 0), B, C, and P, showing $kX_0$, $X_1$, $P_1$, $kA_0$, and $A_1$.]
Approximate efficient solution APEX:
$$\hat{\boldsymbol{\theta}}_1^* = \begin{cases} \angle \mathbf{X}_1, & k \neq 1 \\ \angle(\mathbf{X}_1 - \mathbf{X}_0), & k = 1 \end{cases}$$
[J2] J. He, W. S. Gan, and E. L. Tan, “Primary-ambient extraction using ambient phase estimation with a sparsity constraint,” IEEE Signal Process. Letters, vol. 22, no. 8, pp. 1127-1131, Aug. 2015.
[J3] J. He, E. L. Tan, and W. S. Gan, “Primary-ambient extraction using ambient spectrum estimation for immersive spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1431-1444, Sept. 2015.
PAE in ideal case → complex case
Ideal case: $\mathbf{p}_1 = k\mathbf{p}_0$. Complex case: $\mathbf{p}_0$ and $\mathbf{p}_1$ are only partially correlated, with primary correlation $\phi_P$.
PPF and PPR estimated by PCA in the complex case:
$$\hat{k}_{ic} = \frac{k^2 - 1}{2\phi_P k} + \sqrt{\left(\frac{k^2 - 1}{2\phi_P k}\right)^2 + 1}, \qquad \hat{\gamma}_{ic} = \frac{\sqrt{(k^2 - 1)^2 + 4\phi_P^2 k^2}}{k^2 + 1}\,\gamma.$$
[Figure: two panels plotted against primary correlation $\phi_P$ for k = 3, k = 1, and k = 1/3; panel (a) shows k (dB).]
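The effect of partial primary correlation on the PCA estimates can be explored numerically. The sketch below assumes the estimates take the form obtained from the eigen-decomposition of the 2x2 input covariance matrix, namely k̂ = t + sqrt(t² + 1) with t = (k² − 1)/(2φk) and γ̂ = γ·sqrt((k² − 1)² + 4φ²k²)/(k² + 1), where φ is the zero-lag primary correlation. The function names are hypothetical.

```python
import math

def estimated_ppf(k, phi):
    """PCA-based panning-factor estimate for primary correlation phi (0 < phi <= 1)."""
    t = (k * k - 1.0) / (2.0 * phi * k)
    return t + math.sqrt(t * t + 1.0)

def estimated_ppr(k, phi, gamma):
    """PCA-based PPR estimate for true PPR gamma and primary correlation phi."""
    return gamma * math.sqrt((k * k - 1.0) ** 2 + 4.0 * phi * phi * k * k) / (k * k + 1.0)

k_ideal = estimated_ppf(3.0, 1.0)          # fully correlated primary
k_partial = estimated_ppf(3.0, 0.5)        # partially correlated primary
gamma_ideal = estimated_ppr(3.0, 1.0, 0.5)
gamma_partial = estimated_ppr(1.0, 0.5, 0.8)
```

With full correlation the estimates equal the true values; as the correlation drops, the panning factor is over-estimated (for k > 1) and the PPR is under-estimated.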
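PAE in ideal case → complex case
With $\mathbf{v} = \mathbf{p}_1 - \hat{k}_{ic}\mathbf{p}_0$, the components extracted by PCA in the complex case are
$$\hat{\mathbf{p}}_{\mathrm{PCA},0} = \mathbf{p}_0 + \frac{\hat{k}_{ic}}{1 + \hat{k}_{ic}^2}\mathbf{v} + \frac{1}{1 + \hat{k}_{ic}^2}\left(\mathbf{a}_0 + \hat{k}_{ic}\mathbf{a}_1\right),$$
$$\hat{\mathbf{p}}_{\mathrm{PCA},1} = \mathbf{p}_1 - \frac{1}{1 + \hat{k}_{ic}^2}\mathbf{v} + \frac{\hat{k}_{ic}}{1 + \hat{k}_{ic}^2}\left(\mathbf{a}_0 + \hat{k}_{ic}\mathbf{a}_1\right),$$
$$\hat{\mathbf{a}}_{\mathrm{PCA},0} = -\frac{\hat{k}_{ic}}{1 + \hat{k}_{ic}^2}\mathbf{v} + \frac{\hat{k}_{ic}^2}{1 + \hat{k}_{ic}^2}\mathbf{a}_0 - \frac{\hat{k}_{ic}}{1 + \hat{k}_{ic}^2}\mathbf{a}_1,$$
$$\hat{\mathbf{a}}_{\mathrm{PCA},1} = \frac{1}{1 + \hat{k}_{ic}^2}\mathbf{v} + \frac{1}{1 + \hat{k}_{ic}^2}\mathbf{a}_1 - \frac{\hat{k}_{ic}}{1 + \hat{k}_{ic}^2}\mathbf{a}_0.$$
[Figure: (a) ESR_P and (b) ESR_A versus primary correlation $\phi_P$ for three settings (0.2, 0.5, and 0.8).]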
Time-shifting for PAE
To account for the partial primary correlation (at zero lag) caused by the inter-channel time difference (ICTD).
Stereo input signal $x_0, x_1$ ($\phi_P$ is small) → ICTD estimation ($\tau_o$) → time shifting → shifted signal ($\phi_P$ is maximum) → PCA → extracted components (at shifted positions) → output mapping → primary and ambient components (at original positions)
$$\hat{p}_{\mathrm{SPCA},0}(n) = \frac{1}{1 + \hat{k}_{ic}^2}\left[x_0(n) + \hat{k}_{ic}\,x_1(n + \tau_o)\right], \qquad \hat{p}_{\mathrm{SPCA},1}(n) = \frac{\hat{k}_{ic}}{1 + \hat{k}_{ic}^2}\left[x_0(n - \tau_o) + \hat{k}_{ic}\,x_1(n)\right],$$
$$\hat{a}_{\mathrm{SPCA},0}(n) = \frac{\hat{k}_{ic}}{1 + \hat{k}_{ic}^2}\left[\hat{k}_{ic}\,x_0(n) - x_1(n + \tau_o)\right], \qquad \hat{a}_{\mathrm{SPCA},1}(n) = -\frac{1}{1 + \hat{k}_{ic}^2}\left[\hat{k}_{ic}\,x_0(n - \tau_o) - x_1(n)\right].$$
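The time-shifting pipeline (ICTD estimation, shifting, PCA, output mapping) can be sketched on a single frame. This is a simplified illustration, not the frame-based implementation: it assumes circular shifts so that no samples are lost at the frame edges, estimates the ICTD as the lag maximizing the cross-correlation, and estimates the panning factor from the covariance of the shifted signals. All function names are hypothetical.

```python
import math
import random

def circ_shift(x, s):
    """Return y with y[n] = x[(n + s) mod N]."""
    n = len(x)
    return [x[(i + s) % n] for i in range(n)]

def estimate_ictd(x0, x1, max_lag):
    """Lag (in samples) that maximizes the circular cross-correlation of x0 and x1."""
    def corr(lag):
        return sum(a * b for a, b in zip(x0, circ_shift(x1, lag)))
    return max(range(-max_lag, max_lag + 1), key=corr)

def shifted_pca(x0, x1, max_lag):
    """Shift x1 by the estimated ICTD, apply PCA, and map channel 1 back."""
    tau = estimate_ictd(x0, x1, max_lag)
    x1s = circ_shift(x1, tau)                        # x1 aligned with x0
    r00 = sum(v * v for v in x0)
    r11 = sum(v * v for v in x1s)
    r01 = sum(a * b for a, b in zip(x0, x1s))
    t = (r11 - r00) / (2.0 * r01)
    k = t + math.sqrt(t * t + 1.0)                   # estimated panning factor
    g = 1.0 / (1.0 + k * k)
    p0 = [g * (a + k * b) for a, b in zip(x0, x1s)]
    p1 = circ_shift([k * v for v in p0], -tau)       # back to the original position
    a0 = [a - p for a, p in zip(x0, p0)]
    a1 = [b - p for b, p in zip(x1, p1)]
    return tau, k, (p0, p1), (a0, a1)

# Ambience-free test signal: channel 1 is channel 0 scaled by 2 and delayed by 5 samples.
rng = random.Random(1)
p = [rng.gauss(0.0, 1.0) for _ in range(256)]
x0 = list(p)
x1 = circ_shift([2.0 * v for v in p], -5)
tau, k_hat, (p0_hat, p1_hat), (a0_hat, a1_hat) = shifted_pca(x0, x1, max_lag=10)
```

Without the shift, the zero-lag correlation between the channels of this test signal is close to zero, so plain PCA would not recover the primary component.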
Output mapping
[Figure: three consecutive frames (i-1, i, i+1) with ICTDs of 1 ms, -1 ms, and 1 ms. (a) Conventional output mapping: mapping the frames back leaves an overlap of 2 ms and a gap of 2 ms between consecutive frames. (b) Overlapped output mapping: frames in channel 0 overlap by Q = 2 ms, and frames in channel 1 overlap by Q + 2 = 4 ms and Q - 2 = 0 ms.]
Overlapping duration: Q ≥ 2 ms
Experiments
Exp | Input signal | Primary component | Ambient component | Settings
1 | Synthesized | Speech | Lapping wave | Fixed direction; different values of PPR
2 | Synthesized | Shaking matchbox | Lapping wave | Panning directions with close PPR
3 | Synthesized | Direct path of speech | Reverberation of speech | Varying directions with different PPR
4 | Recorded | Speech | Canteen recording | Three directions with close PPR
Experiment 1: Extraction accuracy
[Figure: (a) ESR_P and (b) ESR_A versus input primary power ratio for PCA and SPCA.]
Experiment 1: Spatial accuracy
[Figure: (a) ICTD_P (lag), (b) ICLD_P (dB), and (c) ICLD_A (dB) versus input primary power ratio for True, PCA, and SPCA.]
Experiment 2: Tracking of moving source
[Figure: (a) True primary, (b) Stereo signal, (c) PCA, (d) SPCA.]
Experiment 3: Reverberation experiment
[Figure: extraction accuracy versus source position for PCA and SPCA: (a) ESR_P, (b) ESR_A.]
[Figure: spatial accuracy versus source position: (a) ICTD_P (lag), (b) ICLD_P (dB), (c) ICLD_A (dB), with legends True, PCA, SPCA and PCA, SPCA, NLMS.]
Experiment 4: Recording experiment
ESR:
Approach | Primary 0° | Primary 45° | Primary 90° | Ambient 0° | Ambient 45° | Ambient 90°
PCA | 0.27 | 0.64 | 0.88 | 1.08 | 1.89 | 2.49
SPCA | 0.21 | 0.31 | 0.34 | 0.81 | 1.02 | 1.39
Spatial cues:
Approach | ICTD_P 0° | ICTD_P 45° | ICTD_P 90° | ICLD_P (dB) 0° | ICLD_P (dB) 45° | ICLD_P (dB) 90° | ICLD_A (dB) 0° | ICLD_A (dB) 45° | ICLD_A (dB) 90°
True | 1 | -17 | -31 | -1.02 | 7.74 | 11.90 | 1.03 | 1.18 | 1.03
PCA | 0 | 0 | 0 | -1.46 | 36.03 | 23.11 | 1.46 | -36.03 | -23.11
SPCA | 1 | -17 | -31 | -1.26 | 8.65 | 15.60 | 1.26 | -8.65 | -15.60
Contributions on Time Shifting PAE
Stereo input signal ($\phi_P$ is small) → ICTD estimation ($\tau_o$) → time shifting → shifted signal ($\phi_P$ is maximum) → PCA → extracted components (at shifted positions) → output mapping → primary and ambient components (at original positions)
For mixture signals with partially correlated primary components,
• Analyzed the performance of conventional PAE;
• Proposed time shifting technique to improve the PAE performance;
• Achieved lower extraction error and more accurate spatial cues.
[J4] J. He, W. S. Gan, and E. L. Tan, “Time-shifting based primary-ambient extraction for spatial audio reproduction,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1576-1588, Oct. 2015.
Extensions for Multiple Sources
Multi-shift PAE with ICC based output weighting:
Stereo input signal $\mathbf{X}$ → multiple ICTD estimation ($\tau_1, \ldots, \tau_i, \ldots, \tau_T$) → time-shifting by each $\tau_i$ (giving $\mathbf{X}_1, \ldots, \mathbf{X}_i, \ldots, \mathbf{X}_T$) → PCA on each shifted signal (giving $\hat{\mathbf{P}}_1, \ldots, \hat{\mathbf{P}}_i, \ldots, \hat{\mathbf{P}}_T$) → output mapping → primary components $\hat{\mathbf{P}}$
Subband PAE with frequency bin partitioning:
Stereo input signal → DFT → frequency bin partitioning → ITD estimation → shifted PCA (frequency domain) → frequency bin grouping → inverse DFT → extracted primary and ambient components
[C3] J. He, and W. S. Gan, “Multi-shift principal component analysis based primary component extraction for spatial audio reproduction,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 350-354.
[C2] J. He, E. L. Tan, and W. S. Gan, “A study on the frequency-domain primary-ambient extraction for stereo audio signals,” in Proc. ICASSP, Florence, Italy, 2014, pp. 2892-2896.
PAE: from stereo to multichannel signals
1. Using down-mix
2. Using pairing
3. Direct
[C5] J. He, and W. S. Gan, “Applying primary ambient extraction for immersive spatial audio reproduction,” in Proc. APSIPA Annual Summit and Conference, Hong Kong, Dec. 2015.
Natural Sound Rendering for Headphones
K. Sunder, J. He, E. L. Tan, and W. S. Gan, “Natural sound rendering for headphones: Integration of signal processing techniques,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 100-113, Mar. 2015.
Challenges and solutions
Headphone listening → signal processing → natural listening
• Source: virtualization, sound scene decomposition
• Medium: equalization
• Receiver: individualization, head tracking
D. R. Begault, 3-D sound for virtual reality and multimedia: AP Professional, 2000.
Integration
Sound sources and sound environment → sound mixing → sound mixture → decomposition using PAE → source rendering and environment rendering (virtualization) → equalization → headphone playback → rendering of natural sound
Head movement → head tracking
Individual parameters → individualization
3D Audio Headphone: an example
Key features
• Patented structure with strategically positioned emitters;
• Individualization via frontal projection; no measurements or training required;
• Recreates an immersive perception of sound objects with surrounding ambience;
• Compatible with all existing sound formats.
Signal flow
• Sources: PAE → EQ Front → front emitters → conchae → correct sound localization
• “Immersive” ambience: PAE → EQ Diffuse → all emitters → surrounding sound environment
W. S. Gan and E. L. Tan, "Listening device and accompanying signal processing method," US Patent 2014/0153765 A1, 2014.
Subjective evaluation
• 18 subjects, score of 0-100;
• Stimuli: binaural, movie and gaming tracks;
• Four measures: 1. Sense of direction, 2. Externalization, 3. Ambience, 4. Timbral quality.
[Figure: (a) MOS for 4 measures: mean opinion score of natural sound rendering versus conventional stereo, with p-values.]