Broadband DOA Estimation using Convolutional. Neural Networks Trained with Noise Signals. Soumitro Chakrabarty and Emanuël Habets. WASPAA 2017 ...
Broadband DOA Estimation using Convolutional Neural Networks Trained with Noise Signals Soumitro Chakrabarty and Emanuël Habets WASPAA 2017
Motivation Signal processing methods § Cross-correlation-based methods • GCC-PHAT • SRP-PHAT • MCCC…
Challenges -
Performance degradation in presence of noise and reverberation
-
High computational cost
§ Subspace-based methods • MUSIC... § Model-based methods • Maximum-likelihood estimation... § ...
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 2
Motivation Supervised learning methods § Advantage: Supervised learning methods can be adapted to
different acoustic environments § Recently, deep neural network (DNN) based supervised learning
methods have been successful across a range of applications: § § §
Automatic speech recognition Object recognition in images Machine translation….
§ Few DNN based methods that estimate DOA of a sound source from
the observed signals
[1] Xiao et al. "A learning-based approach to direction of arrival estimation in noisy and reverberant environments,“ 2015 [2] Takeda et al. "Sound source localization based on deep neural networks with directional activate function exploiting phase information“ 2016 © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 3
Motivation DNN based DOA estimation
Aim: A DNN based supervised learning method for DOA estimation that § Estimates DOA per time frame given the STFT representation of the
observed signals § Simple input representation to learn relevant features during training
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 4
Problem Formulation DOA estimation as classification Θ = {θ1 , . . . , θI } ... I classes ...
i ≈ θi ∈ Θ = {θ1 , . . . , θI }
§ DOA estimation is formulated as an I class classification problem § Discretize the whole DOA range into I discrete values to obtain a set
of possible DOA values: ⇥ = {✓1 , . . . , ✓I } § Each class corresponds to a possible DOA value in the set © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 5
Problem Formulation DOA estimation as classification Θ = {θ1 , . . . , θI } ... I classes ...
i ≈ θi ∈ Θ = {θ1 , . . . , θI }
§ For each frame, compute the posterior probability for each class § DOA estimate is the DOA of the class with the highest posterior
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 6
System Overview Supervised learning framework True DOA Labels
Training data
STFT
Input feature
Train DOA classifier
Training Inference/Test
Trained parameters
Test data
STFT
Input feature
DOA classifier
Posterior probabilities
DOA estimate © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 7
System Overview Supervised learning framework True DOA Labels
Training data
STFT
Input feature
Train DOA classifier
Training Inference/Test
Trained parameters
Test data
STFT
Input feature
DOA classifier
Posterior probabilities
DOA estimate © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 8
Input feature representation STFT magnitude and phase component
icr
op
ho
ne
s
K − Frequency Bins
m (n,k)
Magnitude
© AudioLabs, 2017 Soumitro Chakrabarty
Phase
Am (n, k)
Broadband DOA Estimation with CNN 9
m (n, k)
− M
−
N − Time frames
M
N − Time frames
M
M
icr
op
ho
ne
s
K − Frequency Bins
Ym (n, k) = Am (n, k)ej
Input feature representation Phase map
K − Frequency Bins
for each time frame
n
Soumitro Chakrabarty
s ne ho op −
M
icr
Phase
M
M
n
N − Time frames
© AudioLabs, 2017
K
m (n, k)
Phase Map
Broadband DOA Estimation with CNN 10
n
System Overview Supervised learning framework True DOA Labels
Training data
STFT
Input feature
Φn
Train DOA classifier
Training Inference/Test
Trained parameters
Test data
STFT
Input feature
Φn
DOA classifier
Posterior probabilities
DOA estimate © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 11
Total Feature Maps: 64 Size: (M − 2) × (K − 2) Total Feature Maps: 64 Size: (M − 3) × (K − 3)
Conv1: 2 × 2, 64
Conv2: 2 × 2, 64
Conv3: 2 × 2, 64
§ Pooling is not employed § Experiments showed decreased performance
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 12
FC2: 512
Total Feature Maps: 64 Size: (M − 1) × (K − 1)
FC1: 512
Input: M × K
Output: I × 1
CNN Architecture
Training with synthesized noise signals
§ CNN learns the required information for DOA estimation from the
phase map § CNN can be trained using synthesized noise signals (!?)
§ Advantages: 1.
No speech/audio database required
2.
Easier to create training data
§ Spectrally white noise was used
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 13
System overview True DOA Labels
Training data (Noise signals)
STFT
Input feature
Φn
Train CNN based DOA classifier
Training Inference/Test Test data (Speech signals)
Trained parameters
STFT
Input feature
Φn
CNN based DOA classifier
p(θi |Φn )
DOA estimate θˆn = arg max p(θi |Φn ) θi
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 14
Evaluation Experiments 1.
Generalization to speech and robustness to noise
2.
Performance in acoustic conditions different from training
3.
Robustness to small perturbations in microphone positions
4.
Real acoustic environments
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 15
Evaluation Experiments 1.
Generalization to speech and robustness to noise
2.
Performance in acoustic conditions different from training
3.
Robustness to small perturbations in microphone positions
4.
Real acoustic environments
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 16
Evaluation Performance measure §
Performance of CNN compared to SRP-PHAT
§
Evaluation measure §
Frame-level accuracy ˆ N A(%) = c ⇥ 100, Ns Ns ˆc N
© AudioLabs, 2017 Soumitro Chakrabarty
Total time frames with speech active Time frames with correct estimate
Broadband DOA Estimation with CNN 17
Evaluation Experimental parameters §
Uniform linear array (ULA) § Number of microphones = 4 § Inter-microphone distance = 3 cm
§
STFT length 256, 50% overlap
§
Resolution for classes: 5 degrees, I = 37 classes
§
Training data is simulated using RIR generator [3]
[3] https://github.com/ehabets/RIR-Generator © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 18
Evaluation CNN training conditions
Room with training positions Simulated training data Signal Synthesized noise signals Room size R1: (6 ⇥ 6) m , R2: (5 ⇥ 5) m Array positions in room 7 di↵erent positions in each room Source-array distance 1 m and 2 m for position RT60 R1: 0.3 s, R2: 0.2 s SNR Uniformly sampled from 0 to 20 dB © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 19
Evaluation CNN training parameters § Training data: 5.6 million time frames § Validation data: 20% split from the training data § Loss: Cross-entropy § Activation: ReLU, Softmax (final layer) § Optimizer: Adam § Batchsize: 512 § Nb Epochs: 10 § Regularization: Dropout rate 0.5 (After Conv.3 layer, and each FC layer)
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 20
Evaluation Test conditions § Database: TIMIT test § Speech samples: 500, 4 s each § Test data size: 100000 active time frames
Simulated test data Signal Speech signals from TIMIT Room size Room 1: (7 ⇥ 6) m , Room 2: (8 ⇥ 8) m Array positions in room 1 random position for each room Source-array distance 1.5 m for both rooms RT60 Room 1: 0.45 s, Room 2: 0.53 s SNR 2 categories: 5 dB, and 15 dB © AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 21
Evaluation Test conditions Frame-level accuracies (%)
CNN SRP-PHAT
Room 1 5 dB 15 dB 56.2 69.8 22.6 33.6
Room 2 5 dB 15 dB 54.1 68.2 21.8 38.4
§ Better performance compared to SRP-PHAT § For all cases, CNN is accurate for the majority of the frames
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 22
Evaluation Qualitative results - Room 2, 15 dB D OA ( ◦ )
1 150 100
0.5
50 10
20
30
40
50
60
Tim e fr am e s
70
80
90
100
CNN
0
D OA ( ◦ )
1 150 100
0.5
50 10
20
30
40
50
60
Tim e fr am e s
70
80
90
100
SRP-PHAT
0
D OA ( ◦ )
1 150 100
0.5
50 10
Fr e que nc y bins
True DOA
20
30
40
50
60
Tim e fr am e s
70
80
90
100
0
120 100 80 60 40 20
−10 −20 −30 10
© AudioLabs, 2017 Soumitro Chakrabarty
0
20
30
40
50
60
Tim e fr am e s
70
80
90
Broadband DOA Estimation with CNN 23
100
Speech segment
Evaluation Qualitative results – Average over 0.8 s
CNN
Pr obability
1
SRP−PHAT
True DOA
0.8 0.6 0.4 0.2 0
0
20
40
60
80
100
120
D OA
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 24
140
160
180
Evaluation Real environment – Measured RIRs § Database: Multichannel Impulse Response Database from Bar-Ilan § Source setup: [0 , 180 ] , steps of 15 degrees § Array configuration: M = 8 microphones, d = 8 cm § Source-array distances: 1 m and 2 m § Test sample: 15 s long speech sample
15 resolution {1, 2} m
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 25
Evaluation Real environment – Measured RIRs § Database: Multichannel Impulse Response Database from Bar-Ilan § Source setup: [0 , 180 ] , steps of 15 degrees § Array configuration: [8,8,8,8,8,8,8], M = 8 microphones § Source-array distances: 1 m and 2 m § Test sample: 15 s long speech sample § CNN was retrained for the new array geometry (with simulated data)
6m
6m
© AudioLabs, 2017 Soumitro Chakrabarty
{1,2} m
Broadband DOA Estimation with CNN 26
Evaluation Real environment – Measured RIRs Frame-level accuracies (%)
CNN SRP-PHAT
RT60 = 0.160 s 1m 2m 91.8 88.7 94.4 69.0
RT60 = 0.360 s 1m 2m 86.8 79.4 87.1 68.3
RT60 = 0.610 s 1m 2m 72.3 67.3 71.7 62.4
§
For 2 m distance, CNN outperforms SRP-PHAT
§
SRP-PHAT performs better at lower reverberation times for 1 m source-array distance
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 27
Conclusions §
Proposed a CNN based supervised learning method for DOA estimation with a simple input representation
§
CNN trained with synthesized noise signals is able to localize a speech source
§
Proposed system performs better than SRP-PHAT in unmatched simulated acoustic conditions
§
Adaptability to unseen real acoustic environments was also demonstrated
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 28
Current Work Multi-speaker localization §
Aim: Localize simultaneously active speakers
§
Formulate multi-speaker localization as a multi-class multi-label classification problem
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 29
Current Work Multi-speaker localization 1
D OA
150
0.8 0.6
100
0.4
CNN
50 0.2 0 10
20
30
40
50
60
70
80
90
100
110
0
Ti m e Fr am e s 1
D OA
150
0.8 0.6
100
0.4
True DOA
50 0.2 0 10
20
30
40
50
60
70
80
90
100
110
0
Ti m e Fr am e s 1
D OA
150
0.8 0.6
100
0.4 50
SRP-PHAT
0.2 0 10
20
30
40
50
60
70
80
90
100
110
0
Fr e que nc y bins
Ti m e Fr am e s 250
15
200
10
150
5
100
0 −5
50 10
20
30
40
50
60
70
80
90
100
110
Ti m e fr am e s
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 30
−10
Speech segment
Current Work Multi-speaker localization
Pr obability
CNN
SRP−PHAT
True DOAs
1 0.8 0.6 0.4 0.2 0
0
20
40
60
80
100
120
140
160
D OA
Still trained with synthesized noise signals!!
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 31
180
Thank you for your attention!
Trained model and weights available: Soumitro-Chakrabarty/Single-speaker-localization
© AudioLabs, 2017 Soumitro Chakrabarty
Broadband DOA Estimation with CNN 32