Applications of Machine Hearing
Axel Plinge
Pattern Recognition, Computer Science XII, TU Dortmund University
November 25, 2014, Oldenburg University
Axel Plinge
Applications of Machine Hearing
1/53
Motivation
Technical systems are tackling a large number of tasks that in fact mimic and/or augment human perception:
- sound classification
- speech enhancement
- speaker tracking
- ...
with limited success in real-world applications!
So how can we achieve more robustness?
By realizing that we know a lot about how humans manage it [1] [2] [3] ... and using this knowledge to (functionally) model the properties of human perception that make it so successful [4] [5]!
[1] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990
[2] S. Handel. Listening. MIT Press, 1989
[3] Barry E. Stein and Benjamin A. Rowland. Organization and plasticity in multisensory integration: early and late experience affects its governing principles. Progress in Brain Research, pages 145–163, 2011
[4] DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006
[5] Richard F. Lyon. Machine Hearing – An Emerging Field. IEEE Signal Process. Magazine, September 2010
Contents
- Speech Enhancement
  - Phone specific processing
  - Phone replacement
  - Detection
- Speaker localization
  - Neurobiological model
  - Spike generation
- Multimodal Tracking
  - Demo
- CASA tracking
  - Simultaneous Grouping
  - Sequential Integration
  - Evaluation
  - Demo: Multi-Speaker Tracking
- Multi-array tracking
  - Association
  - Triangulation
  - Evaluation
  - Demo: Euclidean Tracking
- Geometry calibration
  - Audio segment estimation
  - Visual localization
  - Geometry calibration
  - Method comparison
  - Demo video
- Acoustic event detection
  - Features
  - Classifiers
  - Demo Video
Phone specific speech processing [1]
Problem: Amplification-based hearing aids provide a gapped speech stream [2]
Typical severe sensory hearing loss [3]:
- high consonants (/s/, /z/, /C/, /t/) cannot be made audible by amplification
- plosives can require more amplification than is available (acoustic feedback)
Solution: Detect and replace them with synthetic, speech-like stimuli
[1] L. M. Arslan and J. H. L. Hansen. Minimum cost based Phoneme Class Detection for Improved Iterative Speech Enhancement. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume ii, pages II/45–II/48, Apr 1994
[2] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990
[3] Dieter Bauer and Axel Plinge. Selective Phoneme Spotting for Realization of an /s, z, C, t/ Transposer. In 8th Int. Conf. on Computers Helping People with Special Needs (ICCHP2002), volume 2398 of LNCS, Linz, Austria, 2002. Springer
Replacement
Frequency shifting can be counter-productive [1]
- constant transposition can mask audible sounds
- fixed transposition factor can lead to confusion
  - target can be other sounds' learned location
  - results may be non-discriminable
Specific replacement sounds
- speech-like to avoid stream segregation [2]
  - mixture of spectrally shaped noises
  - modulated by the original sound's characteristics
- require online recognition; false positives will lead to confusion
[1] Dieter Bauer, Axel Plinge, and Walter H. Ehrenstein. Compensation of Severe Sensory Hearing Deficits. Re-Sampling Versus Re-Synthesis. In Ger M. Craddock, Lisa P. McCormack, Richard B. Reilly, and Harry T. P. Knops, editors, Assistive Technology - Shaping the Future; AAATE Conference Proceedings, volume 11, pages 522–526. IOS Press, 2003
[2] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990
Detect /f/, /S/, /C/, /s/, /z/, /t/ with almost no false positives
Features
- spectral energy ratios
- normalized cross correlation [1]
- rate of rise [2]
Classification
- ML: GMM with bounds cutting the tails [3]
[1] David Talkin. A Robust Algorithm for Pitch Tracking (RAPT). In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis. Elsevier Science, 1995
[2] B. Plannerer et al. A continuous speech recognition system integrating additional acoustic knowledge sources. Technical report, TU München, 1996
[3] Axel Plinge and Dieter Bauer. Providing Speech Enhancement and Replacement for Persons with Severely Impaired Hearing. In Conference and Workshop on Assistive Technologies for People with Vision and Hearing Impairments, Granada, Spain, 2004
Results and Implementation
Confusion matrix [%]:
        /f/    /S/    /C/    /s/    /z/    /k/    /t/    /ts/
/f/   100.0    1.7    1.7    2.0    0.0    3.3   10.0   11.0
/S/     0.0   95.0    6.7    0.0    0.0    0.0    0.0    0.0
/C/     0.0    3.3   91.3    0.0    0.0    0.0    0.0    0.0
/s/     0.0    0.0    0.0   98.0    3.3    0.0    3.3   55.6
/z/     0.0    0.0    0.0    0.0   96.7    0.0    0.0    0.0
/k/     0.0    0.0    0.0    0.0    0.0   76.7   15.0    5.6
/t/     0.0    0.0    0.0    0.0    0.0    5.0   61.7   16.7
Real-time version implemented in assembly on an 80 MIPS DSP [1]
[1] Axel Plinge and Dieter Bauer. Genesis of Wearable DSP Structures for Selective Speech Enhancement and Replacement to Compensate Severe Hearing Deficits. In A. Pruski and H. Knops, editors, Assistive Technology - From Virtuality to Reality; AAATE Conference Proceedings. IOS Press, Amsterdam, 2005
Getting it out there?
- patent [1]
- startup project
- ...
Sonderpreis Neue Technologien (special prize for new technologies), start2grow 2010
[1] Dieter Bauer and Axel Plinge. Method and Device for Processing Acoustic Voice Signals. PCT/EP2009/009129, issued December 18, 2009
Speaker localization
Applications
- speaker diarization
- camera control
Challenges
- reverberation
- unknown number of concurrent speakers
Method
- microphone array
- neurobiologically inspired speaker localization model [1]
[1] Axel Plinge. Neurobiologisch inspirierte Lokalisierung von Sprechern in realen Umgebungen. Master's thesis, TU Dortmund; Fakultät für Informatik in Zusammenarbeit mit dem Institut für Roboterforschung, Dortmund, Germany, May 2010
Neuro-inspired speaker localization
Processing chain, labeled with its biological counterparts (outer and middle ear, cochlea, auditory nerve, midbrain, auditory cortex):
- circular microphone array
- Patterson-Holdsworth model of the basilar membrane: gammatone filterbank using FFT overlap-add
- cochlea model: Peak-over-Average-Position impulses, phase locked to maxima with strong modulation [1]
- Jeffress-Colburn midbrain model: TDoA (ITD) through correlation (⊗)
- backprojection to polar coordinates, τ ↦ (θ, φ)
- spatial likelihood: multiplicative combination (⊙) of all microphone pairs with the Hamacher fuzzy t-norm
- sequential integration, speech model: peak localization of concurrent speakers
[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
Neuro-inspired speaker localization: Model (1/3)
Cochlea's basilar membrane frequency analysis [1]
- Patterson-Holdsworth model [2]: gammatone filters [3], Glasberg and Moore parameters, ERB scale 0.3 - 3.6 kHz [4]
  $G^{(b)}[f] = \left(1 + j\,(f - f_b)/w_b\right)^{-4}$
- FFT overlap-add implementation [5]: online capable, linear phase
[Figure: filterbank magnitude response H(f) in dB (ticks at −12, −24, −36 dB) over f = 0 to 4.5 kHz]
[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice. An efficient auditory filterbank based on the gammatone functions. Technical Report APU Report 2341, MRC, Applied Psychology Unit, Cambridge, U.K., 1988
[3] Masashi Unoki and Masato Akagi. A method of signal extraction from noisy signal based on auditory scene analysis. Speech Commun., 27(3):261–279, 1999
[4] DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006
[5] S. W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, 2nd edition, 1999
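The frequency response above can be evaluated numerically. The following is a minimal sketch, not the implementation from [1]: it assumes the commonly used Glasberg and Moore ERB formula, a bandwidth w_b = 1.019 · ERB(f_b) (the slide does not define w_b), and 16 bands; the function names and the frequency grid are hypothetical.

```python
import numpy as np

def erb_hz(fc_hz):
    """Glasberg and Moore equivalent rectangular bandwidth (Hz) at fc_hz."""
    return 24.7 * (4.37e-3 * fc_hz + 1.0)

def erb_spaced_centers(f_lo_hz, f_hi_hz, n_bands):
    """Center frequencies f_b spaced uniformly on the ERB-rate scale."""
    def to_rate(f):
        return 21.4 * np.log10(4.37e-3 * f + 1.0)

    def from_rate(e):
        return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

    return from_rate(np.linspace(to_rate(f_lo_hz), to_rate(f_hi_hz), n_bands))

def gammatone_response(f_hz, fb_hz, bw_factor=1.019, order=4):
    """G^(b)[f] = (1 + j (f - f_b) / w_b)^(-order).

    Assumption: w_b = bw_factor * ERB(f_b); the slide leaves w_b undefined.
    """
    wb = bw_factor * erb_hz(fb_hz)
    f = np.asarray(f_hz, dtype=float)
    return (1.0 + 1j * (f - fb_hz) / wb) ** (-order)

# 0.3 - 3.6 kHz as on the slide; the band count of 16 is an assumption.
centers = erb_spaced_centers(300.0, 3600.0, n_bands=16)
freqs = np.linspace(0.0, 4500.0, 901)
# Magnitude response in dB, one row per band.
H_db = np.array([20.0 * np.log10(np.abs(gammatone_response(freqs, fb)))
                 for fb in centers])
```

With this form each filter has unit gain at its center frequency and is down by a factor of four (about 12 dB) at a distance of w_b from it.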
Neuro-inspired speaker localization: Model (2/3)
Cochlea's spike generation: Peak-over-Average-Position (PoAP) [1]
[Figure: illustration with axis marks −D and D]
- glimpsing model [2]: only high PoA peaks > 6 dB
- rectangular spikes $(p_n, h_n)$ phase locked on maxima [3]
- precedence effect [4]: shift average relative to signal
[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] M. P. Cooke. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am., 119:1562–1573, 2006
[3] B. Grothe. New roles for synaptic inhibition in sound localisation. Nature, 4(7):540–550, 2003
[4] K. J. Palomäki, G. J. Brown, and D. Wang. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Commun., 43(4):361–378, 2004
Neuro-inspired speaker localization: Model (3/3)
Midbrain model [1]
- Jeffress-Colburn model [2]: ITD (TDoA) by correlation
- avoid aliasing & harmonic errors
- rectangular spikes yield sharp correlation figure
- fast sparse spike matching $(p_n, h_n) \otimes (p_{n'}, h_{n'})$
Localization [1]
- backprojection to discrete half-sphere $s(\theta, \phi)$
- product-like combination of mic. pairs: Hamacher fuzzy t-norm $h_\gamma$ [3]
- extraction of angular peaks with speech-like spread over the bands
[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, October 1996
[3] P. Pertilä, T. Korhonen, and A. Visa. Measurement combination for acoustic source localization in a room environment. EURASIP J. Audio Speech & Music Process., 2008:1–14, 2008
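The product-like combination of microphone pairs can be illustrated with the standard parametric Hamacher t-norm, h_γ(a, b) = ab / (γ + (1 − γ)(a + b − ab)). This is a sketch under assumptions: the slide does not state the value of γ or how the per-pair likelihoods are scaled, so `gamma=0.3`, the function names, and the example values are hypothetical.

```python
from functools import reduce

import numpy as np

def hamacher(a, b, gamma=0.3):
    """Hamacher t-norm h_gamma(a, b) = ab / (gamma + (1 - gamma)(a + b - ab)).

    Inputs are expected in [0, 1]; gamma = 1 gives the plain product.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = gamma + (1.0 - gamma) * (a + b - a * b)
    positive = denom > 0.0
    # The denominator vanishes only for gamma = 0 and a = b = 0; return 0 there.
    return np.where(positive, a * b / np.where(positive, denom, 1.0), 0.0)

def combine_pairs(pair_likelihoods, gamma=0.3):
    """Fold the t-norm over the spatial likelihoods of all microphone pairs."""
    return reduce(lambda acc, s: hamacher(acc, s, gamma), pair_likelihoods)

# Three hypothetical pair likelihoods over two candidate directions.
pairs = [np.array([0.9, 0.2]), np.array([0.8, 0.9]), np.array([0.9, 0.1])]
combined = combine_pairs(pairs)
```

A direction only keeps a high combined value if all pairs support it, which is the product-like behavior named on the slide.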
Neuro-inspired speaker localization: Spike generation
[Figure: four panels of spatial likelihood over frequency band f (0.20 to 3.60 kHz) and angle a (−180° to 180°); panel labels: sum Σ, Hamacher h_γ, hw-rectified, rectangular]
Neuro-inspired speaker localization: Comparison
[Figure: three panels (half-wave rectified, zero crossing, PoAP), each showing the error over SNR (0, 6, 12, 18, ∞ dB) and T60 (0 to 2 s)]
RMS localization error for windows of 12 ms. Black = 0°, white = 15° or more [1]
[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
Neuro-inspired speaker localization: Examples
[Figure: speaker azimuth θ (−180° to 180°) and localizations over time t for one moving speaker (0 to 280 s) [1]]
[Figure: azimuth θ and localizations over time t (0 to 550 s) for two speakers, Speaker 1 and Speaker 2, in a natural conversation [2]]
Reverberant conference room (T60 ≈ 0.65 s)
[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] Axel Plinge. Neurobiologisch inspirierte Lokalisierung von Sprechern in realen Umgebungen. Master's thesis, TU Dortmund; Fakultät für Informatik in Zusammenarbeit mit dem Institut für Roboterforschung, Dortmund, Germany, May 2010
Reactive online speaker tracking
Applications
- video conferences [1]
- online lectures [2]
- ...
Problems
- acoustic: detection errors due to reverberation, noise, non-speech events
- visual: limited field of view, detection errors
Method
- bio-inspired multimodal approach [3]
[Figure: conference room (T60 > 0.6 s)]
[1] Shankar T. Shivappa, Mohan Manubhai Trivedi, and Bhaskar D. Rao. Audiovisual Information Fusion in Human–Computer Interfaces and Intelligent Environments: A Survey. Proceedings of the IEEE, 98(10):1692–1715, October 2010
[2] Damien Kelly, Anil Kokaram, and Frank Boland. Voxel-Based Viterbi Active Speaker Tracking (V-VAST) with Best View Selection for Video Lecture Post-Production. In IEEE Int. Conf. Acoustics Speech & Signal Process., pages 2296–2299, Prague, Czech Republic, 2011
[3] Barry E. Stein and Benjamin A. Rowland. Organization and plasticity in multisensory integration: early and late experience affects its governing principles. Progress in Brain Research, pages 145–163, 2011
Reactive online speaker tracking: Method
Inspiration - human perception
1. glimpsing model [1]
2. attention [2]
3. head turn
4. face detection
5. reactive head movement
[Figure: brain areas involved: ears, midbrain, auditory cortex, eyes, visual cortex, inferior temporal cortex, posterior parietal cortex, motor cortex]
Method
1. acoustic localization [3]
2. salient direction [4]
3. camera movement
4. face detection
5. multimodal tracking
[Figure: system block diagram. Layers: Sensors & Motors, Unimodal Processing, Multimodal Processing, Movement Processing. Blocks: Microphone Array, Audio Preprocessing, Acoustic Localization, Speech Model, Camera Image, Image Preprocessing, Face Localization, Face Model, Multimodal Integration, Multimodal Model, Speaker Tracking, Attention, Movement Model, Camera Movement. Attention paths: Overt Auditory Attention, Covert Multimodal Attention, Overt Visual Camera Control. Legend: top-down / efferent, bottom-up / afferent, memory / model use]
[1] M. P. Cooke. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am., 119:1562–1573, 2006
[2] A. Treisman and G. Gelade. A feature–integration theory of attention. Cognitive Psychology, 12:97–136, 1980
[3] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, 2010
[4] S. N. Wrigley and G. J. Brown. A computational model of auditory selective attention. IEEE Transactions on Neural Networks: Special Issue on Temporal Coding for Neural Information Processing, 15(5):1151–1163, 2004
Reactive online speaker tracking: Sensors and camera control
Camera and microphone array at the same position in the middle of the room → identical pan = azimuth coordinate system (θ)
Reactive camera control
1. acoustic localization (salient)
2. camera move (θ, 15°); face detection (distance estimate by head height)
3. camera movement (θ, φ); face and lip movement detection
Tracking
1. acoustic and visual localization
2. integration: Δθ < 15°, Δt < 1 s
[Figure: camera field of view with acoustic localization, face detection, and lip movement detections]
Reactive online speaker tracking: Demo Video
CASA tracking
[Figure: processing chain from band signals (b) over time t and microphone pairs i, j through correlation (⊗), backprojection τ ↦ (θ, φ) and combination (⊙) to angles θ over time t]
Tracking speakers in adverse conditions
- Technical approaches moderately robust
- Human perception holds insights [1]
Auditory Scene Analysis [2] [3]
- clustering: simultaneous grouping by multiple cues
- tracking: sequential integration over time
[1] DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006
[2] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990
[3] S. Handel. Listening. MIT Press, 1989
Spatial Likelihood
Example: Two concurrent speakers in AV16.3 [1]
[Figure: spatial likelihood over azimuth θ (−180° to 180°), band index b (1 to 16) and time t (0 to 10 s)]
[1] Guillaume Lathoud, Jean-Marc Odobez, and Daniel Gatica-Perez. AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. In International Conference on Machine Learning for Multimodal Interaction, volume 3361 of LNCS, pages 182–195, Martigny, Switzerland, 2005
CASA tracking: Simultaneous grouping by clustering
Simultaneous grouping by EM [1]
- Localizations with angle and spectrum: $x = (\theta, s)$
- Grouping $\Psi_i = (\Theta_i, \sigma_i, t_i)$ according to spectrum and angle
  - spectral similarity → scalar product: $p_s(x|\Psi_i) = \left\langle \frac{s}{\|s\|}, \frac{t_i}{\|t_i\|} \right\rangle$
  - spatial similarity → Gaussian: $p_a(x|\Psi_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-0.5\, \frac{d(\theta, \Theta_i)^2}{\sigma_i^2}\right)$ [2]
  - joint similarity: $p(x|\Psi_i) = p_s(x|\Psi_i)\, p_a(x|\Psi_i)$
- Expectation Maximization using likelihood $l(x) = s s^T$
- Dynamic number of sources by split (left) and join (right)
[Figure: two panels showing p (0.0 to 0.6) over angle (−150° to 150°) and EM iteration (0 to 5), for split (left) and join (right)]
[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
[2] $d(\alpha, \beta) = \min\{360 - |\alpha - \beta|, |\alpha - \beta|\}$
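The three similarity terms can be written down directly. The sketch below only covers the similarity computation and a hard assignment to the best group, not the EM iterations or the split and join steps from [1]. Function names, the example spectra, and σ = 10° are hypothetical; the modulo in `circ_dist` is an added safeguard for angles that differ by more than 360°.

```python
import math

import numpy as np

def circ_dist(alpha, beta):
    """Angular distance in degrees: d = min{360 - |a - b|, |a - b|}."""
    diff = abs(alpha - beta) % 360.0
    return min(360.0 - diff, diff)

def p_s(s, t):
    """Spectral similarity: scalar product of the normalized spectra."""
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.dot(s, t) / (np.linalg.norm(s) * np.linalg.norm(t)))

def p_a(theta, Theta, sigma):
    """Spatial similarity: Gaussian of the circular angle distance."""
    d = circ_dist(theta, Theta)
    return math.exp(-0.5 * d ** 2 / sigma ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def joint(x, psi):
    """Joint similarity p(x | Psi_i) = p_s * p_a for x = (theta, s)."""
    theta, s = x
    Theta, sigma, t = psi
    return p_s(s, t) * p_a(theta, Theta, sigma)

def assign(x, groups):
    """Index of the group with the highest joint similarity, plus all scores."""
    scores = [joint(x, psi) for psi in groups]
    return int(np.argmax(scores)), scores

# Hypothetical groups (Theta_i, sigma_i, t_i) and one localization x.
groups = [(-178.0, 10.0, [1.0, 0.1, 0.0]), (90.0, 10.0, [1.0, 0.2, 0.0])]
best, scores = assign((175.0, [1.0, 0.2, 0.0]), groups)
```

The example localization at 175° is assigned to the group at −178°, since the circular distance between them is only 7°.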
CASA tracking: Tracking by sequential integration
Rule-based tracking [1]
- form tracks of speakers by association over consecutive time frames k → k + 1
- use the Gaussian of the angles $p_a$ to decide whether a new detection belongs to a given track
- spectral similarity $p_s$ can additionally be used to disambiguate in hard cases
- if no track with $p > p_{\min}$ exists, create a new one
- allow gaps smaller than $t_{TTL}$ before tracks die [2]
[Figure: association of detections to tracks in θ over t between frames k and k + 1]
[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
[2] N. Madhu and R. Martin. A Scalable Framework for Multiple Speaker Localization and Tracking. In International Workshop on Acoustic Echo and Noise Control, Seattle, WA, USA, September 2008
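The rules above can be sketched as a small greedy tracker. This is an illustration under simplifying assumptions, not the tracker from [1]: it uses only the angle term (as an unnormalized Gaussian, so scores lie in (0, 1]), ignores the spectral similarity, and assigns each detection independently. The class and parameter names and the values σ = 10°, p_min = 0.1 and t_TTL = 1 s are hypothetical.

```python
import math
from dataclasses import dataclass, field

def circ_dist(alpha, beta):
    """Angular distance in degrees on the circle."""
    diff = abs(alpha - beta) % 360.0
    return min(360.0 - diff, diff)

def angle_score(theta, track_theta, sigma):
    """Unnormalized Gaussian of the angle distance (assumption: no 1/(sqrt(2 pi) sigma))."""
    return math.exp(-0.5 * (circ_dist(theta, track_theta) / sigma) ** 2)

@dataclass
class Track:
    theta: float
    last_seen: float
    history: list = field(default_factory=list)

class RuleTracker:
    def __init__(self, sigma=10.0, p_min=0.1, t_ttl=1.0):
        self.sigma, self.p_min, self.t_ttl = sigma, p_min, t_ttl
        self.tracks = []

    def update(self, t, detections):
        """Process the detections (angles in degrees) of the frame at time t."""
        # Tracks die once the gap since their last detection reaches t_ttl.
        self.tracks = [tr for tr in self.tracks if t - tr.last_seen < self.t_ttl]
        for theta in detections:
            scored = [(angle_score(theta, tr.theta, self.sigma), tr)
                      for tr in self.tracks]
            best_p, best = max(scored, key=lambda pt: pt[0], default=(0.0, None))
            if best is None or best_p <= self.p_min:
                # No track with p > p_min exists: create a new one.
                best = Track(theta=theta, last_seen=t)
                self.tracks.append(best)
            best.theta, best.last_seen = theta, t
            best.history.append((t, theta))
        return self.tracks

tracker = RuleTracker()
tracker.update(0.0, [30.0, -90.0])   # two new tracks
tracker.update(0.1, [32.0])          # continues the track near 30 degrees
tracker.update(0.2, [120.0])         # far from both tracks: third track
remaining = tracker.update(1.15, []) # the first two tracks exceed t_ttl
```

In hard cases, such as crossing speakers, the score would be multiplied by the spectral similarity, as the third bullet suggests.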
CASA tracking: Evaluation (1)
Method comparison: SRP PHAT [1], PCA BF [2], PoAP EM [3]
[Figure: angular error ε_a (0° to 15°) over reverberation time T60 (0 to 2 s) for the three methods]
[1] Guillaume Lathoud and Jean-Marc Odobez. Short-Term Spatio-Temporal Clustering applied to Multiple Moving Speakers. IEEE Trans. Audio Speech & Language Process., 2007
[2] E. Warsitz and R. Haeb-Umbach. Acoustic Filter-and-Sum Beamforming by adaptive Principal Component Analysis. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume 4, pages 797–800. IEEE, 2005
[3] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
CASA tracking: Evaluation (2)
AV16.3 angular tracks
[Figure: azimuth (−180° to 180°) over t (0 to 200 s), shown in turn for SRP PHAT, SRP PHAT with a good threshold, PoAP EM, and PoAP EM tracking]
Method           RMS      F1       precision  recall
SRP PHAT th=0    42.44°   83.52%    72.91%    97.75%
SRP PHAT th=*     2.86°   94.12%    98.77%    89.89%
PoAP EM           2.69°   96.82%    96.64%    97.00%
PoAP EM track     2.33°   99.63%   100.00%    99.26%
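The F1 column is consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short check below assumes that definition; the slide itself does not define the metric, and the names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)

# (precision %, recall %, reported F1 %) as listed in the evaluation.
reported = {
    "SRP PHAT th=0": (72.91, 97.75, 83.52),
    "SRP PHAT th=*": (98.77, 89.89, 94.12),
    "PoAP EM": (96.64, 97.00, 96.82),
    "PoAP EM track": (100.00, 99.26, 99.63),
}

for method, (precision, recall, f1) in reported.items():
    print(f"{method:14s} computed F1 = {f1_score(precision, recall):.2f}%, reported {f1:.2f}%")
```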
CASA tracking: Example
Two speakers in the FINCA (T60 ≈ 0.65 s)
[Figure: azimuth θ (−180° to 180°) of tracks T1 and T2, with the spectra of speakers S1 and S2 over bands b (0 to 15)]
RMS 5°, precision 100%, recall 81.2%
CASA tracking: Demo video
Multi-array tracking
[Figure: two arrays, each with its own processing chain from band signals (b) over t and microphone pairs i, j through correlation (⊗), backprojection τ ↦ (θ, φ) and combination (⊙) to angle tracks θ over t, merged into Euclidean tracks x, y over t]
Advantages
- Euclidean tracking possible with multiple small arrays
- distributed microphone arrays can capture speech at all positions of interest
- can be deployed ad hoc and coupled wirelessly
Challenges
- synchronization may not be good: drift, jitter, omissions
- multiple concurrent speakers produce ambiguous localizations
- geometry has to be calibrated
- triangulation of multiple DoAs is not straightforward
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Multi-array tracking (1/3)
System design
[Figure: room layout in x, y with distributed nodes]
- distributed nodes localize concurrent speech events [1]
- network or radio transmission (< 32 kbps)
- integration node
  - association [1]
  - CASA tracking (as before, but in parallel on all available arrays)
  - triangulation [2]
[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
[2] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Multi-array tracking (2/3)
Association
- Multiple concurrent estimates by multiple microphone arrays
- Location alone remains ambiguous, other modalities needed
- Resolve using spectral similarity (again) [1]
- Assign the scalar product as spectral similarity $p_s$ for being the same source:
  $p_s\left(s^{(m)}, s^{(n)}\right) = \left\langle \frac{s^{(m)}}{\|s^{(m)}\|}, \frac{s^{(n)}}{\|s^{(n)}\|} \right\rangle$
- Find groups of localizations by high $\prod_{m,n} p_s\left(s^{(m)}, s^{(n)}\right)$
[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
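The grouping criterion can be illustrated with a brute-force search that picks one localization per array and scores the group by the product of all pairwise spectral similarities. This is a sketch for small numbers of arrays and sources; the search strategy, function names, and example spectra are assumptions, not the association procedure from [1].

```python
from itertools import combinations, product

import numpy as np

def p_s(a, b):
    """Spectral similarity: scalar product of the normalized spectra."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_score(spectra):
    """Product of p_s over all array pairs (m, n) in the group."""
    return float(np.prod([p_s(a, b) for a, b in combinations(spectra, 2)]))

def best_group(per_array):
    """Choose one localization index per array with the highest group score.

    per_array[m] is the list of spectra localized by array m.
    """
    candidates = product(*[range(len(specs)) for specs in per_array])
    return max(candidates,
               key=lambda idx: group_score([per_array[m][i] for m, i in enumerate(idx)]))

# Hypothetical spectra: source A is seen identically by all arrays,
# source B with small variations; the arrays list them in different order.
A = [1.0, 0.0, 0.5]
per_array = [
    [A, [0.0, 1.0, 0.2]],
    [[0.0, 1.0, 0.4], A],
    [A, [0.1, 1.0, 0.2]],
]
group = best_group(per_array)
```

The search grows exponentially with the number of arrays, so it only serves to make the criterion concrete.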
Multi-array tracking (3/3)
Triangulation
- Calculate the intersection of lines from array positions $o^{(m)}, o^{(n)}$ with angles $\Theta^{(m)}, \Theta^{(n)}$:
  $z^{(m,n)} = o^{(m)} + r^{(m)} \begin{pmatrix} \cos\Theta^{(m)} \\ \sin\Theta^{(m)} \end{pmatrix} = o^{(n)} + r^{(n)} \begin{pmatrix} \cos\Theta^{(n)} \\ \sin\Theta^{(n)} \end{pmatrix}$
- The accuracy of an angular intersection depends on the intersection angle: use a weighted sum [1]
  $z = \frac{\sum_{m,n} q\left(\Theta^{(m)}, \Theta^{(n)}\right) z^{(m,n)}}{\sum_{m,n} q\left(\Theta^{(m)}, \Theta^{(n)}\right)}, \qquad q(\alpha, \beta) = |\sin(\alpha - \beta)|$
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
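The two formulas translate into a few lines of linear algebra. The sketch below assumes that all angles are given in radians in a common room coordinate system, that is, with the array orientations already applied. The function names, the threshold `eps`, and the example geometry are hypothetical, and intersections that lie behind an array (negative r) are not rejected.

```python
from itertools import combinations

import numpy as np

def intersect(o_m, theta_m, o_n, theta_n):
    """Intersection z^(m,n) of the two bearing lines."""
    o_m = np.asarray(o_m, dtype=float)
    o_n = np.asarray(o_n, dtype=float)
    u_m = np.array([np.cos(theta_m), np.sin(theta_m)])
    u_n = np.array([np.cos(theta_n), np.sin(theta_n)])
    # o_m + r_m u_m = o_n + r_n u_n  <=>  [u_m, -u_n] [r_m, r_n]^T = o_n - o_m
    r_m, _ = np.linalg.solve(np.column_stack((u_m, -u_n)), o_n - o_m)
    return o_m + r_m * u_m

def triangulate(origins, thetas, eps=1e-6):
    """Weighted mean of all pairwise intersections with q = |sin(a - b)|."""
    num, den = np.zeros(2), 0.0
    for m, n in combinations(range(len(origins)), 2):
        q = abs(np.sin(thetas[m] - thetas[n]))
        if q < eps:
            continue  # (nearly) parallel bearings carry no position information
        num += q * intersect(origins[m], thetas[m], origins[n], thetas[n])
        den += q
    return num / den if den > 0.0 else None

# Hypothetical geometry: three arrays and a source at (2, 3) in meters.
source = np.array([2.0, 3.0])
origins = [(0.0, 0.0), (4.0, 0.0), (6.0, 6.0)]
thetas = [np.arctan2(source[1] - oy, source[0] - ox) for ox, oy in origins]
estimate = triangulate(origins, thetas)
```

With noise-free bearings every pairwise intersection coincides with the source, so the weighting only matters once the angles are noisy; pairs with nearly parallel bearings then contribute least.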
Multi-array tracking: Simulation
[Figure: simulated room layout with labels A to E and numbered positions 2 to 18]
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Multi-array tracking evaluation: Triangulation
[Figure: localization error ε_l (0 to 2 m) and recall (0 to 100%) over T60 (0.25 to 1.75 s) for three variants: five unweighted, five q-weighted, two strongest]
Localization error in a simulation of a single speaker tracked by five sensor nodes at different reverberation times.
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Multi-array tracking evaluation: Jitter
[Figure: localization error ε_l (0 to 0.8 m) over inter-array jitter (0 to 1.8 s)]
Localization error in a simulation of a single speaker tracked by five sensor nodes at T60 = 0.5 s for different inter-array jitter.
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Multi-array tracking evaluation: Geometry
[Figure: localization error ε_l (0 to 1.5 m) for T60 = 0.5 s and T60 = 1.0 s over the calibration error in meters (0 to 0.5 m, top) and in degrees (0° to 12°, bottom)]
Localization error as a function of Euclidean (top) and angular (bottom) geometry calibration error.
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Multi-array tracking evaluation: Real data

#1 one static speaker
        precision  recall  RMS
θ(0)    77%        58%     5.6°
θ(1)    91%        70%     4.6°
θ(2)    86%        70%     4.9°
x,y     99%        79%     0.17 m

#2 one moving speaker
        precision  recall  RMS
θ(0)    89%        85%     5.4°
θ(1)    80%        72%     6.7°
θ(2)    72%        70%     5.8°
x,y     90%        82%     0.39 m

#3 two static speakers
        precision  recall  RMS
θ(0)    94%        88%     5.9°
θ(1)    95%        95%     8.4°
θ(2)    88%        92%     5.9°
x,y     100%       95%     0.10 m

#4 three speakers
        precision  recall  RMS
θ(0)    73%        73%     6.7°
θ(1)    93%        86%     4.9°
θ(2)    96%        97%     3.2°
x,y     99%        98%     0.21 m
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Multi-array tracking: Example
[Figure: angles θ(0), θ(1), θ(2) [°] and positions x [m], y [m] over time t [s], showing speakers 1–3 and tracks 1–3]
Multi-array tracking: Demo video
[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Microphone array geometry calibration
Application: speaker tracking with unknown placement of microphone arrays
- manual measurement is tedious
- ad hoc placement
Estimating by diffuse noise? [1]
+ OK for small arrays [2]
− a diffuse noise field is not always given, and the method does not work for large distances
[Figure: s [cm] over array size [cm]]
Solutions
- speech/noise based measurements of distance and orientation [2] [3]
- multimodal anchoring using video cameras [4]
[1] Iain McCowan and Mike Lincoln. Microphone Array Shape Calibration in Diffuse Noise Fields. IEEE Transactions on Audio, Speech and Language Processing, 16(3):666–670, 2008
[2] Marius H. Hennecke, Thomas Plötz, Gernot A. Fink, and Reinhold Haeb-Umbach. A Hierarchical Approach to Unsupervised Shape Calibration of Microphone Array Networks. In IEEE Workshop on Stat. Signal Proc., pages 257–260, Cardiff, Wales, UK, 2009
[3] Axel Plinge and Gernot A. Fink. Geometry Calibration of Multiple Microphone Arrays in Highly Reverberant Environments. In Int. Workshop on Acoustic Signal Enhancement, Antibes – Juan les Pins, France, September 2014
[4] Axel Plinge and Gernot A. Fink. Geometry Calibration of Distributed Microphone Arrays Exploiting Audio-Visual Correspondences. In European Signal Process. Conf., Lisbon, Portugal, September 2014
Multimodal Geometry Calibration
- Calibrated video cameras
- Record a human speaker at fixed positions
- Estimate time segments and DoA from audio
- Compute visual x,y localization for segments
- Estimate microphone array geometry from audio-visual correspondence
[Figure: room layout with positions labeled 1–10]
Audio Segment Estimation
- Localize source using [1] for each microphone array
- Compute segments with low angular deviation and TTL = 1 s
[Figure: θ(0), θ(1), θ(2) [°] over time, showing ground truth (gt) and segments 0–2]
[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
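One way to read the segmentation step is sketched below. The slide only states "low angular deviation" and "TTL = 1 s"; the 10° deviation threshold, the use of a circular mean, and the names `angular_diff`, `circular_mean` and `segment_detections` are assumptions for illustration, not details taken from [1].

```python
import math

def angular_diff(a, b):
    """Smallest signed difference a - b in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def circular_mean(angles):
    """Mean direction in degrees, robust to wrap-around at +/-180."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))

def segment_detections(detections, max_dev=10.0, ttl=1.0):
    """Group time-sorted (time_s, angle_deg) detections into segments.

    A segment is closed when no detection arrives within `ttl` seconds
    or when a detection deviates more than `max_dev` degrees from the
    segment's mean angle. Returns (start_s, end_s, mean_angle_deg) tuples.
    """
    segments, current = [], []
    for t, ang in detections:
        if current:
            mean = circular_mean([a for _, a in current])
            expired = t - current[-1][0] > ttl
            deviates = abs(angular_diff(ang, mean)) > max_dev
            if expired or deviates:
                segments.append(current)
                current = []
        current.append((t, ang))
    if current:
        segments.append(current)
    return [(seg[0][0], seg[-1][0], circular_mean([a for _, a in seg]))
            for seg in segments]

# Hypothetical detections: a pause longer than the TTL, then a new direction
detections = [(0.0, 30.0), (0.2, 32.0), (0.4, 29.0),
              (2.0, 31.0), (2.2, -100.0), (2.4, -98.0)]
segments = segment_detections(detections)
```

In this sketch each array is segmented independently; combining the per-array segments into common time segments is left out.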
Visual Localization
- Localize speaker using background subtraction [1] and an upper-body HoG detector [2] for each camera
- Compute x,y for each time segment by weighted triangulation [3]
[Figure: ground truth (gt) and localization in the x [m], y [m] plane]
[1] P. KaewTraKulPong and R. Bowden. An Improved Adaptive Background Mixture Model for Real-Time Tracking with Shadow Detection. In European Workshop on Advanced Video-Based Surveillance Systems, 2001
[2] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893, 2005
[3] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Axel Plinge
Applications of Machine Hearing
43/53
Geometry calibration: Method (1/2) For each circular array known speaker position si localized at angle Θi,m unknown position rm , orientation om , distance ki,m
si
si − rm = ki,m
cos (om + Θi,m ) sin (om + Θi,m )
Θi,m om rm
[1] Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16:1190–1208, 1995
Axel Plinge
Applications of Machine Hearing
43/53
Geometry calibration: Method (1/2) For each circular array known speaker position si localized at angle Θi,m unknown position rm , orientation om , distance ki,m error ei of the estimate
si
ei = si − rm − ki,m
cos (om + Θi,m ) sin (om + Θi,m )
Θi,m om rm
[1] Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16:1190–1208, 1995
Axel Plinge
Applications of Machine Hearing
43/53
Geometry calibration: Method (1/2) For each circular array known speaker position si localized at angle Θi,m unknown position rm , orientation om , distance ki,m error ei of the estimate stack equations for multiple positions, overdetermined for I > 3
si
e1 =s1 − rm − k1,m
cos (om + Θ1,m ) sin (om + Θ1,m )
cos (om + ΘI ,m ) sin (om + ΘI ,m )
.. . Θi,m
eI =sI − rm − kI ,m
om rm
[1] Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16:1190–1208, 1995
Axel Plinge
Applications of Machine Hearing
43/53
Geometry calibration: Method (1/2)
For each circular array $m$:
- known speaker position $s_i$, localized at angle $\Theta_{i,m}$
- unknown position $r_m$, orientation $o_m$, distance $k_{i,m}$
- error $e_i$ of the estimate:
$$e_i = s_i - r_m - k_{i,m} \begin{pmatrix} \cos(o_m + \Theta_{i,m}) \\ \sin(o_m + \Theta_{i,m}) \end{pmatrix}$$
- stack the equations for multiple positions $i = 1, \dots, I$; the system is overdetermined for $I > 3$
- solve by minimizing the error with bounded gradient descent [1]:
$$e^2 = \sum_{i=1}^{I} e_i^2 \overset{!}{\to} \min$$
[Figure: geometry sketch showing $s_i$, $r_m$, $o_m$ and $\Theta_{i,m}$]
[1] Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16:1190–1208, 1995
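The stacked error can be written down directly as a cost function. The sketch below only evaluates $e^2$ for one array; it does not include the optimizer. The names `calibration_error` and `simulate_measurements` and the test geometry are hypothetical. Each speaker position contributes two equations and one unknown distance, so with three shared unknowns (position and orientation) there are $2I$ equations for $3 + I$ unknowns, which is why the system is overdetermined for $I > 3$.

```python
import math

def calibration_error(r, o, k, speakers, thetas):
    """Sum of squared residuals e^2 for one array.

    r: candidate array position (x, y), o: candidate orientation [rad],
    k: candidate distances k_i, speakers: known positions s_i,
    thetas: angles Theta_i measured by the array [rad].
    """
    total = 0.0
    for s, theta, k_i in zip(speakers, thetas, k):
        ex = s[0] - r[0] - k_i * math.cos(o + theta)
        ey = s[1] - r[1] - k_i * math.sin(o + theta)
        total += ex * ex + ey * ey
    return total

def simulate_measurements(r, o, speakers):
    """Noise-free angles and distances an array at (r, o) would observe."""
    thetas, dists = [], []
    for s in speakers:
        dx, dy = s[0] - r[0], s[1] - r[1]
        thetas.append(math.atan2(dy, dx) - o)
        dists.append(math.hypot(dx, dy))
    return thetas, dists

# Hypothetical setup: five known speaker positions, one array
speakers = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
r_true, o_true = (1.0, 2.0), 0.3
thetas, dists = simulate_measurements(r_true, o_true, speakers)
err_true = calibration_error(r_true, o_true, dists, speakers, thetas)
err_rotated = calibration_error(r_true, o_true + 0.1, dists, speakers, thetas)
```

A bound-constrained optimizer such as the L-BFGS-B method of [1] could then minimize this function over $r_m$, $o_m$ and the $k_{i,m}$; keeping the distances non-negative rules out the mirrored solution with the orientation flipped by 180°.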
Geometry calibration: Method (2/2)
- compute all N = 35 sets for I = 5 and their median
- consensus estimate: mean of the N/4 closest to the median
Errors er (position) and eo (orientation), three values each, as shown at each step:
                       er [mm]        eo [°]
all N = 35 sets        230, 98, 88    5.78, 1.82, 4.02
median                 103, 79, 70    1.92, 1.00, 4.04
consensus estimate     110, 74, 72    0.93, 0.78, 3.47
[Figure: estimates in the x [m], y [m] plane]
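The consensus step can be sketched as follows for position estimates. The slide does not say how the median of 2-D estimates is formed, so the component-wise median used here is an assumption, as are the function name `consensus_estimate` and the sample values.

```python
import math
import statistics

def consensus_estimate(estimates):
    """Mean of the N/4 estimates closest to the component-wise median.

    estimates: list of (x, y) position estimates, one per subset.
    """
    med = (statistics.median(x for x, _ in estimates),
           statistics.median(y for _, y in estimates))
    n_keep = max(1, len(estimates) // 4)
    closest = sorted(estimates, key=lambda p: math.dist(p, med))[:n_keep]
    return (sum(x for x, _ in closest) / n_keep,
            sum(y for _, y in closest) / n_keep)

# Hypothetical estimates from N = 8 subsets, two of them outliers
estimates = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (1.0, 2.0),
             (5.0, 5.0), (1.05, 2.0), (0.95, 2.0), (-3.0, 0.0)]
position = consensus_estimate(estimates)
```

Averaging only the estimates near the median keeps outlier subsets from shifting the result, while still smoothing over several consistent solutions. Orientations would need the same treatment with circular statistics.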
Calibration results & method comparison

                  scaling  rotation  translation  sync      online
RANSAC [1]        ✓        –         –            required  ?
unimodal [2]      ✓        –         –            required  ✓
multimodal [3]    ✓        ✓         ✓            online    ✓

[Figure: errors eγ [°] and es [cm] of methods [1], [2], [3] for noise and speech signals]
[1] F. Jacob, J. Schmalenstroeer, and R. Haeb-Umbach. DoA-based microphone array position self-calibration using circular statistics. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013
[2] Axel Plinge and Gernot A. Fink. Geometry Calibration of Multiple Microphone Arrays in Highly Reverberant Environments. In Int. Workshop on Acoustic Signal Enhancement, Antibes – Juan les Pins, France, September 2014
[3] Axel Plinge and Gernot A. Fink. Geometry Calibration of Distributed Microphone Arrays Exploiting Audio-Visual Correspondences. In European Signal Process. Conf., Lisbon, Portugal, September 2014
Geometry calibration: Demo video
Acoustic event detection
Applications
- meeting analysis, context awareness, security
- prefilter for tracking / calibration
Problems
- very heterogeneous classes
- open set
Method
- neuro-inspired features
- supervised learning
- Bag-of-Features classification
BoF acoustic event detection
Bag-of-Features?
- approach from text retrieval [1]: histograms of quantized features
Our approach [2]
- MFCCs and GFCCs [3], successful in speaker identification
- “super codebook” from class-wise training
- ML classification of histograms
[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999
[2] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
[3] Xiaojia Zhao, Yang Shao, and DeLiang Wang. CASA-Based Robust Speaker Identification. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1608–1616, 2012
BoF acoustic event detection
- MFCCs and GFCCs for a frame $y_k$ are quantized by the super-codebook $v$:
$$q_{k,l}(y_k, v_l) = \mathcal{N}(y_k \mid \mu_l, \sigma_l) \quad \text{where} \quad v_{l=(I \cdot c + i)} = (\mu_{i,c}, \sigma_{i,c})$$
- pyramid histogram for temporal structure:
$$b_l^{(1)}(Y_n, v_l) = \frac{2}{K} \sum_{k=1}^{K/2} q_{k,l}(y_k, v_l), \qquad b_l^{(2)}(Y_n, v_l) = \frac{2}{K} \sum_{k=K/2+1}^{K} q_{k,l}(y_k, v_l)$$
$$b_l^{(3)}(Y_n, v_l) = \max\left\{ b_l^{(1)}(Y_n, v_l),\; b_l^{(2)}(Y_n, v_l) \right\}$$
- multinomial maximum likelihood classification:
$$P(Y_n \mid \Omega_c) = \prod_{v_l \in v} P(v_l \mid \Omega_c)^{b_l(Y_n, v_l)}, \qquad P(v_l \mid \Omega_c) = \frac{1 + \sum_{Y_n \in \Omega_c} b_l(Y_n, v_l)}{L + \sum_{m=1}^{L} \sum_{Y_n \in \Omega_c} b_m(Y_n, v_m)}$$
[1] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
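The three steps on this slide (soft quantization, pyramid histogram, smoothed multinomial classification) can be sketched in plain Python. This is an illustration under assumptions, not the implementation from [1]: diagonal Gaussians, an even window length K, concatenation of the three histograms into one vector, and all function names and toy values are choices made for the example.

```python
import math

def gaussian(y, mu, sigma):
    """Diagonal-covariance Gaussian density N(y | mu, sigma)."""
    p = 1.0
    for yi, m, s in zip(y, mu, sigma):
        p *= math.exp(-0.5 * ((yi - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return p

def pyramid_histogram(frames, codebook):
    """Soft histograms of both window halves and their maximum, concatenated."""
    K = len(frames)
    half = K // 2
    b1 = [2.0 / K * sum(gaussian(y, mu, sg) for y in frames[:half])
          for mu, sg in codebook]
    b2 = [2.0 / K * sum(gaussian(y, mu, sg) for y in frames[half:])
          for mu, sg in codebook]
    b3 = [max(a, b) for a, b in zip(b1, b2)]
    return b1 + b2 + b3

def train(histograms_by_class):
    """Laplace-smoothed term probabilities P(v_l | class) per class."""
    model = {}
    for c, hists in histograms_by_class.items():
        L = len(hists[0])
        totals = [sum(h[l] for h in hists) for l in range(L)]
        denom = L + sum(totals)
        model[c] = [(1.0 + t) / denom for t in totals]
    return model

def classify(hist, model):
    """Class with the highest multinomial log-likelihood."""
    def loglik(c):
        return sum(b * math.log(p) for b, p in zip(hist, model[c]))
    return max(model, key=loglik)

# Toy super-codebook with one 1-D Gaussian per class (hypothetical values)
codebook = [([0.0], [1.0]), ([5.0], [1.0])]
model = train({
    "a": [pyramid_histogram([[0.1], [-0.2], [0.0], [0.3]], codebook)],
    "b": [pyramid_histogram([[5.1], [4.8], [5.0], [5.2]], codebook)],
})
label = classify(pyramid_histogram([[0.2], [-0.1], [0.1], [0.0]], codebook), model)
```

Working with log-likelihoods avoids numerical underflow of the product when the codebook is large.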
AED: Features
[Figure: per-class F [%] for MFCC, GFCC* and MFCC + GFCC* features on FINCA [1] (classes: chairs, cups, door, keyboard, laptop keys, paper, pouring, rolling, silence, speech, steps) and on the D-CASE Development Set [2] (classes: alert, clearthroat, cough, doorslam, drawer, keyboard, keys, knock, laughter, mouse, pageturn, pendrop, phone, printer, silence, speech, switch)]
[1] http://patrec.cs.tu-dortmund.de/cms/en/home/Resources/index.html
[2] Dimitrios Giannoulis, Dan Stowell, Emmanouil Benetos, Mathias Rossignol, and Mathieu Lagrange. A Database and Challenge for Acoustic Scene Classification and Event Detection. In European Signal Processing Conference, Marrakech, Morocco, 2013
[3] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
Acoustic event detection: Classifiers
[Figure: F1 (frames) [%] on the D-CASE Development Set [1] and on FINCA [2] for the classifiers HMM GBFB (SCS) [3], NMF HMM (GVV), HMM MFCC GFCC*, BoSF-P MFCC GFCC* [4], GMM MFCC∆∆ (VVK), HMM RF (NVM), GMM MFCC, BoAW MFCC [5], SVM MFCC∆∆ (NR2), NMF (Baseline)]
[1] Dimitrios Giannoulis, Dan Stowell, Emmanouil Benetos, Mathias Rossignol, and Mathieu Lagrange. A Database and Challenge for Acoustic Scene Classification and Event Detection. In European Signal Processing Conference, Marrakech, Morocco, 2013
[2] http://patrec.cs.tu-dortmund.de/cms/en/home/Resources/index.html
[3] Jens Schroeder, Niko Moritz, Marc Rene Schaedler, Benjamin Cauchi, Kamil Adiloglu, Joern Anemueller, Simon Doclo, Birger Kollmeier, and Stefan Goetze. On the use of spectro-temporal features for the IEEE AASP challenge ’detection and classification of acoustic scenes and events’. In IEEE Workshop on Appl. of Signal Proc. to Audio and Acoustics, 2013
[4] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
[5] Stephanie Pancoast and Murat Akbacak. Bag-of-Audio-Words Approach for Multimedia Event Classification. In Interspeech, Portland, OR, USA, 2012
Acoustic event detection: Example
[Figure: detected events over time t [s] for ground truth, BoSF-P MFCC+GFCC* (F = 60.17) and HMM MFCC+GFCC* (F = 66.55); classes: alert, clearthroat, cough, doorslam, drawer, keyboard, knock, laughter, mouse, pageturn, pendrop, phone, printer, speech, switch, keys, silence]
Acoustic Event Detection: Demo Video
References
L.M. Arslan and J.H.L. Hansen. Minimum cost based Phoneme Class Detection for Improved Iterative Speech Enhancement. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume ii, pages II/45–II/48, Apr 1994.
Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, October 1996.
Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16:1190–1208, 1995.
Dieter Bauer and Axel Plinge. Selective Phoneme Spotting for Realization of an /s, z, C, t/ Transposer. In 8th Int. Conf. on Computers Helping People with Special Needs (ICCHP2002), volume 2398 of LNCS, Linz, Austria, 2002. Springer.
Dieter Bauer, Axel Plinge, and Walter H Ehrenstein. Compensation of Severe Sensory Hearing Deficits. Re-Sampling Versus Re-Synthesis. In Ger M Craddock, Lisa P McCormack, Richard B Reilly, and Harry T P Knops, editors, Assistive Technology - Shaping the Future; AAATE Conference Proceedings, volume 11, pages 522–526. IOS Press, 2003.
A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
M. P. Cooke. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am., 119:1562–1573, 2006.
Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893, 2005.
B. Plannerer et al. A continuous speech recognition system integrating additional acoustic knowledge sources. Technical report, TU München, 1996.
B. Grothe. New roles for synaptic inhibition in sound localisation. Nature Reviews Neuroscience, 4(7):540–550, 2003.
Dimitrios Giannoulis, Dan Stowell, Emmanouil Benetos, Mathias Rossignol, and Mathieu Lagrange. A Database and Challenge for Acoustic Scene Classification and Event Detection. In European Signal Processing Conference, Marrakech, Morocco, 2013.
S. Handel. Listening. MIT Press, 1989.
Marius H. Hennecke, Thomas Plötz, Gernot A. Fink, and Reinhold Haeb-Umbach. A Hierarchical Approach to Unsupervised Shape Calibration of Microphone Array Networks. In IEEE Workshop on Stat. Signal Proc., pages 257–260, Cardiff, Wales, UK, 2009.
F. Jacob, J. Schmalenstroeer, and R. Haeb-Umbach. DoA-based microphone array position self-calibration using circular statistics. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.
P. KaewTraKulPong and R. Bowden. An Improved Adaptive Background Mixture Model for Real-Time Tracking with Shadow Detection. In European Workshop on Advanced Video-Based Surveillance Systems, 2001.
Damien Kelly, Anil Kokaram, and Frank Boland. Voxel-Based Viterbi Active Speaker Tracking (V-VAST) with Best View Selection for Video Lecture Post-Production. In IEEE Int. Conf. Acoustics Speech & Signal Process., pages 2296–2299, Prague, Czech Republic, 2011.
Guillaume Lathoud and Jean-Marc Odobez. Short-Term Spatio-Temporal Clustering applied to Multiple Moving Speakers. IEEE Trans. Audio Speech & Language Process., 2007.
Guillaume Lathoud, Jean-Marc Odobez, and Daniel Gatica-Perez. AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. In International conference on Machine Learning for Multimodal Interaction, volume 3361 of LNCS, pages 182–195, Martigny, Switzerland, 2005.
Richard F. Lyon. Machine Hearing – An Emerging Field. IEEE Signal Process. Magazine, September 2010.
Iain McCowan and Mike Lincoln. Microphone Array Shape Calibration in Diffuse Noise Fields. IEEE Transactions on Audio, Speech and Language Processing, 16(3):666–670, 2008.
N. Madhu and R. Martin. A Scalable Framework for Multiple Speaker Localization and Tracking. In International Workshop on Acoustic Echo and Noise Control, Seattle, WA, USA, September 2008.
Stephanie Pancoast and Murat Akbacak. Bag-of-Audio-Words Approach for Multimedia Event Classification. In Interspeech, Portland, OR, USA, 2012.
Axel Plinge and Dieter Bauer. Providing Speech Enhancement and Replacement for Persons with Severely Impaired Hearing. In Conference and Workshop on Assistive Technologies for People with Vision and Hearing Impairments, Granada, Spain, 2004.
Axel Plinge and Dieter Bauer. Genesis of Wearable DSP Structures for Selective Speech Enhancement and Replacement to Compensate Severe Hearing Deficits. In A. Pruski and H. Knops, editors, Assistive Technology - From Virtuality to Reality; AAATE Conference Proceedings. IOS Press, Amsterdam, 2005.
Axel Plinge, Dieter Bauer, and Martin Finke. Intelligibility Enhancement of Human Speech for Severely Hearing Impaired Persons by Dedicated Digital Processing. In Crt. Marincek and Christian Bühler, editors, Assistive Technology - Added Value to the Quality of Life; AAATE Conference Proceedings, Amsterdam, 2001. IOS Press.
K. J. Palomäki, G. J. Brown, and D. Wang. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Commun., 43(4):361–378, 2004.
Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013.
Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014.
Axel Plinge and Gernot A. Fink. Geometry Calibration of Distributed Microphone Arrays Exploiting Audio-Visual Correspondences. In European Signal Process. Conf., Lisbon, Portugal, September 2014.
Axel Plinge and Gernot A. Fink. Geometry Calibration of Multiple Microphone Arrays in Highly Reverberant Environments. In Int. Workshop on Acoustic Signal Enhancement, Antibes – Juan les Pins, France, September 2014.
Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014.
Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010.
P. Pertilä, T. Korhonen, and A. Visa. Measurement combination for acoustic source localization in a room environment. EURASIP J. Audio Speech & Music Process., 2008:1–14, 2008.
Axel Plinge. Neurobiologisch inspirierte Lokalisierung von Sprechern in realen Umgebungen. Master’s thesis, TU Dortmund; Fakultät für Informatik in Zusammenarbeit mit dem Institut für Roboterforschung, Dortmund, Germany, May 2010.
R.D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice. An efficient auditory filterbank based on the gammatone functions. Technical Report APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K, 1988.
S. W. Smith. The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing, 2nd edition, 1999.
Jens Schroeder, Niko Moritz, Marc Rene Schaedler, Benjamin Cauchi, Kamil Adiloglu, Joern Anemueller, Simon Doclo, Birger Kollmeier, and Stefan Goetze. On the use of spectro-temporal features for the IEEE AASP challenge ’detection and classification of acoustic scenes and events’. In IEEE Workshop on Appl. of Signal Proc. to Audio and Acoustics, 2013.
Barry E Stein and Benjamin A Rowland. Organization and plasticity in multisensory integration: early and late experience affects its governing principles. Progress in Brain Research, pages 145–163, 2011.
Shankar T. Shivappa, Mohan Manubhai Trivedi, and Bhaskar D. Rao. Audiovisual Information Fusion in Human–Computer Interfaces and Intelligent Environments: A Survey. Proceedings of the IEEE, 98(10):1692–1715, October 2010.
David Talkin. A Robust Algorithm for Pitch Tracking (RAPT). In W.B. Kleijn and K.K. Paliwal, editors, Speech Coding and Synthesis. Elsevier Science, 1995.
A. Treisman and G. Gelade. A feature–integration theory of attention. Cognitive Psychology, 12:97–136, 1980.
Masashi Unoki and Masato Akagi. A method of signal extraction from noisy signal based on auditory scene analysis. Speech Commun., 27(3):261–279, 1999.
Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.
S. N. Wrigley and G. J. Brown. A computational model of auditory selective attention. IEEE Transactions on Neural Networks: Special Issue on Temporal Coding for Neural Information Processing, 15(5):1151–1163, 2004.
DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006.
E. Warsitz and R. Haeb-Umbach. Acoustic Filter-and-Sum Beamforming by adaptive Principal Component Analysis. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume 4, pages 797–800. IEEE, 2005.
Xiaojia Zhao, Yang Shao, and DeLiang Wang. CASA-Based Robust Speaker Identification. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1608–1616, 2012.