Applications of Machine Hearing

Axel Plinge
Pattern Recognition, Computer Science XII, TU Dortmund University
November 25, 2014, Oldenburg University



Motivation

Technical systems are tackling a large number of tasks that in fact mimic and/or augment human perception:
- sound classification
- speech enhancement
- speaker tracking
- ...

with limited success in real-world applications!

So how can we achieve more robustness?

By realizing that we know a lot about how humans manage it [1] [2] [3] ... and by using this knowledge to (functionally) model the properties of human perception that make it so successful [4] [5]!

[1] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990
[2] S. Handel. Listening. MIT Press, 1989
[3] Barry E. Stein and Benjamin A. Rowland. Organization and plasticity in multisensory integration: early and late experience affects its governing principles. Progress in Brain Research, pages 145–163, 2011
[4] DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006
[5] Richard F. Lyon. Machine Hearing – An Emerging Field. IEEE Signal Process. Magazine, September 2010


Contents

- Speech Enhancement: phone specific processing, phone replacement, detection
- Speaker localization: neurobiological model, spike generation, multimodal tracking, demo
- CASA tracking: simultaneous grouping, sequential integration, evaluation, demo: multi-speaker tracking
- Multi-array tracking: association, triangulation, evaluation, demo: Euclidean tracking
- Geometry calibration: audio segment estimation, visual localization, geometry calibration, method comparison, demo video
- Acoustic event detection: features, classifiers, demo video


Phone specific speech processing [1]

Problem: amplification-based hearing aids provide a gapped speech stream [2]

Typical severe sensory hearing loss [3]:
- high consonants (/s/, /z/, /C/, /t/) cannot be made audible by amplification
- plosives can require more amplification than available (acoustic feedback)

Solution: detect and replace them with synthetic, speech-like stimuli

[1] L.M. Arslan and J.H.L. Hansen. Minimum cost based Phoneme Class Detection for Improved Iterative Speech Enhancement. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume ii, pages II/45–II/48, Apr 1994
[2] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990
[3] Dieter Bauer and Axel Plinge. Selective Phoneme Spotting for Realization of an /s, z, C, t/ Transposer. In 8th Int. Conf. on Computers Helping People with Special Needs (ICCHP2002), volume 2398 of LNCS, Linz, Austria, 2002. Springer


Replacement

Frequency shifting can be counter-productive [1]:
- constant transposition can mask audible sounds
- a fixed transposition factor can lead to confusion:
  - the target region can be the learned location of other sounds
  - results may be non-discriminable

Specific replacement sounds:
- speech-like, to avoid stream segregation [2]
- a mixture of spectrally shaped noises modulated by the original sounds' characteristics
- require online recognition; false positives will lead to confusion

[1] Dieter Bauer, Axel Plinge, and Walter H Ehrenstein. Compensation of Severe Sensory Hearing Deficits. Re-Sampling Versus Re-Synthesis. In Ger M Craddock, Lisa P McCormack, Richard B Reilly, and Harry T P Knops, editors, Assistive Technology - Shaping the Future; AAATE Conference Proceedings, volume 11, pages 522–526. IOS Press, 2003
[2] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990


Detect /f/, /S/, /C/, /s/, /z/, /t/ with almost no false positives

Features:
- spectral energy ratios
- normalized cross correlation [1]
- rate of rise [2]

Classification:
- ML: GMM with bounds cutting the tails [3]

A minimal feature and classifier sketch follows the references below.

[1] David Talkin. A Robust Algorithm for Pitch Tracking (RAPT). In W.B. Kleijn and K.K. Paliwal, editors, Speech Coding and Synthesis. Elsevier Science, 1995
[2] B. Plannerer et al. A continuous speech recognition system integrating additional acoustic knowledge sources. Technical report, TU München, 1996
[3] Axel Plinge and Dieter Bauer. Providing Speech Enhancement and Replacement for Persons with Severely Impaired Hearing. In Conference and Workshop on Assistive Technologies for People with Vision and Hearing Impairments, Granada, Spain, 2004
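A minimal sketch of the idea, not the original implementation: a spectral energy ratio feature and a diagonal-Gaussian mixture score whose likelihood is only accepted inside trained bounds, so that the distribution tails cannot trigger false positives. The band split, bounds, and all parameter values below are assumptions.

```python
import numpy as np

def spectral_energy_ratio(frame, fs, split_hz=4000.0):
    """Ratio of high-band to low-band energy for one analysis frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    high = spec[freqs >= split_hz].sum()
    low = spec[freqs < split_hz].sum() + 1e-12
    return high / low

def bounded_gmm_score(x, weights, means, variances, lower, upper):
    """Diagonal-GMM log-likelihood, accepted only inside trained bounds
    so that the tails of the model do not produce false positives."""
    x = np.atleast_1d(x)
    comps = [w * np.prod(np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v))
             for w, m, v in zip(weights, means, variances)]
    ll = np.log(sum(comps) + 1e-300)
    return ll if lower <= ll <= upper else -np.inf

# toy usage: one-dimensional two-component model with hypothetical parameters
score = bounded_gmm_score(0.8, [0.5, 0.5],
                          [np.array([0.2]), np.array([1.0])],
                          [np.array([0.05]), np.array([0.1])],
                          lower=-5.0, upper=5.0)
print(score)
```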


Results and Implementation

Confusion matrix [%]:

        /f/    /S/    /C/    /s/    /z/    /k/    /t/
/f/    100.0   0.0    0.0    0.0    0.0    0.0    0.0
/S/      1.7  95.0    3.3    0.0    0.0    0.0    0.0
/C/      1.7   6.7   91.3    0.0    0.0    0.0    0.0
/s/      2.0   0.0    0.0   98.0    0.0    0.0    0.0
/z/      0.0   0.0    0.0    3.3   96.7    0.0    0.0
/k/      3.3   0.0    0.0    0.0    0.0   76.7    5.0
/t/     10.0   0.0    0.0    3.3    0.0   15.0   61.7
/ts/    11.0   0.0    0.0   55.6    0.0    5.6   16.7

A realtime version was implemented in Assembly on an 80 MIPS DSP [1].

[1] Axel Plinge and Dieter Bauer. Genesis of Wearable DSP Structures for Selective Speech Enhancement and Replacement to Compensate Severe Hearing Deficits. In A. Pruski and H. Knops, editors, Assistive Technology - From Virtuality to Reality; AAATE Conference Proceedings. IOS Press, Amsterdam, 2005


Getting it out there?

- patent [1]
- startup project
- ...

Sonderpreis Neue Technologien, start2grow 2010

[1] Dieter Bauer and Axel Plinge. Method and Device for Processing Acoustic Voice Signals. PCT/EP2009/009129, issued December 18, 2009


Speaker localization

Applications:
- speaker diarization
- camera control

Challenges:
- reverberation
- unknown number of concurrent speakers

Method:
- microphone array
- neurobiologically inspired speaker localization model [1]

[1] Axel Plinge. Neurobiologisch inspirierte Lokalisierung von Sprechern in realen Umgebungen. Master's thesis, TU Dortmund, Fakultät für Informatik in Zusammenarbeit mit dem Institut für Roboterforschung, Dortmund, Germany, May 2010


Neuro-inspired speaker localization

Processing pipeline (block diagram; stages correspond to the outer and mid ear, cochlea, auditory nerve, midbrain, and auditory cortex):
- circular microphone array
- cochlea model: Patterson-Holdsworth model of the basilar membrane, gammatone filterbank using FFT overlap-add
- spike generation: Peak-over-Average-Position impulses, phase-locked to maxima with strong modulation [1]
- Jeffress-Colburn midbrain model: TDoA (ITD) through correlation, backprojection to polar coordinates
- spatial likelihood: multiplicative combination of all microphone pairs via the Hamacher fuzzy t-norm
- sequential integration and speech model: peak localization of concurrent speakers

[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010


Neuro-inspired speaker localization: Model (1/3)

Cochlea's basilar membrane frequency analysis [1]
- Patterson-Holdsworth model [2]: gammatone filters [3], Glasberg and Moore parameters, ERB scale 0.3–3.6 kHz [4]
  G^(b)[f] = (1 + j (f − f_b) / w_b)^−4
- FFT overlap-add implementation [5]: online capable, linear phase

(Figure: magnitude responses H(f) [dB] of the gammatone filterbank over 0–4.5 kHz.)

A minimal sketch of the filter response follows the references below.

[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] R.D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice. An efficient auditory filterbank based on the gammatone functions. Technical Report APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988
[3] Masashi Unoki and Masato Akagi. A method of signal extraction from noisy signal based on auditory scene analysis. Speech Commun., 27(3):261–279, 1999
[4] DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006
[5] S. W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, 2nd edition, 1999
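A minimal sketch of the fourth-order frequency-domain gammatone response given above, evaluated on a hypothetical set of center frequencies; the bandwidth factor 1.019 and the geometric spacing of the centers are assumptions, not the original parameterization.

```python
import numpy as np

def erb_glasberg_moore(fc):
    """Equivalent rectangular bandwidth after Glasberg & Moore, in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_response(f, fc):
    """G(f) = (1 + j (f - fc) / w)^-4 with w proportional to the ERB."""
    w = 1.019 * erb_glasberg_moore(fc)   # common choice, treated as an assumption
    return (1.0 + 1j * (f - fc) / w) ** -4

f = np.linspace(0, 4500, 2048)                 # frequency axis in Hz
centers = np.geomspace(300, 3600, 12)          # 0.3-3.6 kHz band, ERB-like spacing
bank = np.array([gammatone_response(f, fc) for fc in centers])
magnitude_db = 20 * np.log10(np.abs(bank) + 1e-12)
print(magnitude_db.shape)                      # (12, 2048): one response per band
```

In the full system such responses would be applied block-wise via FFT overlap-add, which is what makes the analysis online capable with linear phase.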


Neuro-inspired speaker localization: Model (2/3)

Cochlea's spike generation: Peak-over-Average-Position [1]
- glimpsing model [2]: only high peaks over the average (PoA > 6 dB)
- rectangular spikes (p_n, h_n), phase-locked on the maxima [3]
- precedence effect [4]: shift the average relative to the signal

A minimal sketch of the spike generation follows the references below.

[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] M. P. Cooke. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am., 119:1562–1573, 2006
[3] B. Grothe. New roles for synaptic inhibition in sound localisation. Nature, 4(7):540–550, 2003
[4] K. J. Palomäki, G. J. Brown, and D. Wang. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Commun., 43(4):361–378, 2004
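A minimal sketch of the Peak-over-Average idea in one band: emit a spike (position, height) at each local maximum that exceeds a running average by 6 dB. The averaging window and the omission of the precedence-effect shift are simplifying assumptions.

```python
import numpy as np

def poap_spikes(x, fs, avg_ms=20.0, threshold_db=6.0):
    """Simplified Peak-over-Average-Position spike generation for one band."""
    n_avg = max(1, int(avg_ms * 1e-3 * fs))
    env = np.abs(x)
    avg = np.convolve(env, np.ones(n_avg) / n_avg, mode="same") + 1e-12
    gain = 10.0 ** (threshold_db / 20.0)
    spikes = []
    for n in range(1, len(x) - 1):
        is_peak = env[n] > env[n - 1] and env[n] >= env[n + 1]
        if is_peak and env[n] > gain * avg[n]:
            spikes.append((n, env[n]))        # rectangular spike (p_n, h_n)
    return spikes

# toy usage: amplitude-modulated tone, only strongly modulated peaks survive
fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 30 * t))
print(len(poap_spikes(x, fs)))
```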


Neuro-inspired speaker localization: Model (3/3)

Midbrain model [1]
- Jeffress-Colburn model [2]: ITD (TDoA) by correlation
- avoid aliasing and harmonic errors
- rectangular spikes yield a sharp correlation figure
- fast sparse spike matching (p_n, h_n) ⊗ (p_n', h_n')

Localization [1]
- backprojection to a discrete half-sphere s(θ, φ)
- product-like combination of microphone pairs: Hamacher fuzzy t-norm h_γ [3]
- extraction of angular peaks with a speech-like spread over the bands

A minimal sketch of the t-norm combination follows the references below.

[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, October 1996
[3] P. Pertilä, T. Korhonen, and A. Visa. Measurement combination for acoustic source localization in a room environment. EURASIP J. Audio Speech & Music Process., 2008:1–14, 2008
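A minimal sketch of the product-like combination across microphone pairs with the Hamacher t-norm; the γ value and the toy grid size are assumptions.

```python
import numpy as np

def hamacher_tnorm(a, b, gamma=0.5):
    """Hamacher fuzzy t-norm h_gamma(a, b) for values in [0, 1]."""
    denom = gamma + (1.0 - gamma) * (a + b - a * b)
    return np.where(denom > 0, a * b / denom, 0.0)

def combine_pairs(likelihoods, gamma=0.5):
    """Product-like combination of per-pair spatial likelihood maps,
    each defined on the discrete half-sphere grid s(theta, phi)."""
    combined = likelihoods[0]
    for l in likelihoods[1:]:
        combined = hamacher_tnorm(combined, l, gamma)
    return combined

# toy usage: three pairs on a coarse 36 x 9 (azimuth x elevation) grid
maps = [np.random.rand(36, 9) for _ in range(3)]
print(combine_pairs(maps).shape)   # (36, 9)
```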


Neuro-inspired speaker localization: Spike generation

(Figure: azimuth-frequency maps, a [°] versus f [kHz] over 0.20–3.60 kHz, comparing four processing variants: sum Σ, half-wave rectified, rectangular spikes, and Hamacher h_γ combination.)


Neuro-inspired speaker localization: Comparison

(Figure: RMS localization error for windows of 12 ms as a function of reverberation time T60 (0–2 s) and SNR (0, 6, 12, 18, ∞ dB) for half-wave rectified, zero-crossing, and PoAP processing; black = 0°, white = 15° or more [1].)

[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010


Neuro-inspired speaker localization: Examples

(Figure: azimuth estimates θ [°] over time for one moving speaker [1] and for two speakers in a natural conversation [2], recorded in a reverberant conference room (T60 ≈ 0.65 s).)

[1] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010
[2] Axel Plinge. Neurobiologisch inspirierte Lokalisierung von Sprechern in realen Umgebungen. Master's thesis, TU Dortmund, Fakultät für Informatik in Zusammenarbeit mit dem Institut für Roboterforschung, Dortmund, Germany, May 2010


Reactive online speaker tracking

Applications:
- video conferences [1]
- online lectures [2]
- ...

Problems:
- acoustic: detection errors due to reverberation, noise, and non-speech events
- visual: limited field of view, detection errors

Method: a bio-inspired multimodal approach [3], evaluated in a conference room (T60 > 0.6 s)

[1] Shankar T. Shivappa, Mohan Manubhai Trivedi, and Bhaskar D. Rao. Audiovisual Information Fusion in Human–Computer Interfaces and Intelligent Environments: A Survey. Proceedings of the IEEE, 98(10):1692–1715, October 2010
[2] Damien Kelly, Anil Kokaram, and Frank Boland. Voxel-Based Viterbi Active Speaker Tracking (V-VAST) with Best View Selection for Video Lecture Post-Production. In IEEE Int. Conf. Acoustics Speech & Signal Process., pages 2296–2299, Prague, Czech Republic, 2011
[3] Barry E. Stein and Benjamin A. Rowland. Organization and plasticity in multisensory integration: early and late experience affects its governing principles. Progress in Brain Research, pages 145–163, 2011



Reactive online speaker tracking: Method

Inspiration - human perception (eyes, ears, midbrain, auditory cortex, visual cortex, inferior temporal cortex, posterior parietal cortex, motor cortex):
1. glimpsing model [1]
2. attention [2]
3. head turn
4. face detection
5. reactive head movement

Method:
1. acoustic localization [3]
2. salient direction [4]
3. camera movement
4. face detection
5. multimodal tracking

(Block diagram: sensors and motors (microphone array, camera image, camera movement), unimodal processing (audio preprocessing and acoustic localization with a speech model; image preprocessing and face localization with a face model), multimodal processing (multimodal integration, speaker tracking, covert multimodal attention, overt auditory attention, overt visual camera control, movement model); arrows distinguish top-down/efferent, bottom-up/afferent, and memory/model use.)

[1] M. P. Cooke. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am., 119:1562–1573, 2006
[2] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12:97–136, 1980
[3] Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, 2010
[4] S. N. Wrigley and G. J. Brown. A computational model of auditory selective attention. IEEE Transactions on Neural Networks: Special Issue on Temporal Coding for Neural Information Processing, 15(5):1151–1163, 2004


Reactive online speaker tracking: Sensors and camera control

Camera and microphone array are at the same position in the middle of the room → identical pan = azimuth coordinate system (θ).

Reactive camera control:
1. acoustic localization (salient direction)
2. camera move to (θ, 15°); face detection (distance estimated from head height)
3. camera movement to (θ, φ); face and lip movement detection

Tracking:
1. acoustic and visual localization
2. integration if Δθ < 15° and Δt < 1 s


Reactive online speaker tracking: Demo Video


CASA tracking

Tracking speakers in adverse conditions:
- technical approaches are only moderately robust
- human perception holds insights [1]

Auditory Scene Analysis [2] [3]:
- clustering: simultaneous grouping by multiple cues
- tracking: sequential integration over time

[1] DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006
[2] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990
[3] S. Handel. Listening. MIT Press, 1989


Spatial Likelihood

Example: two concurrent speakers in AV16.3 [1]

(Figure: spatial likelihood over azimuth θ [°] and frequency band b for a 10 s excerpt.)

[1] Guillaume Lathoud, Jean-Marc Odobez, and Daniel Gatica-Perez. AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. In International conference on Machine Learning for Multimodal Interaction, volume 3361 of LNCS, pages 182–195, Martigny, Switzerland, 2005



CASA tracking: Simultaneous grouping by clustering

Simultaneous grouping by EM [1]:
- localizations with angle and spectrum x = (θ, s)
- grouping Ψ_i = (Θ_i, σ_i, t_i) according to spectrum and angle:
  - spatial similarity → Gaussian p_a(x|Ψ_i) = (2πσ_i²)^(−1/2) exp(−d(θ, Θ_i)² / (2σ_i²)), with the circular distance d(α, β) = min{360 − |α − β|, |α − β|} [2]
  - spectral similarity → scalar product p_s(x|Ψ_i) = ⟨ s/‖s‖, t_i/‖t_i‖ ⟩
  - joint similarity p(x|Ψ_i) = p_s(x|Ψ_i) p_a(x|Ψ_i)
- Expectation Maximization using the likelihood l(x) = s sᵀ
- dynamic number of sources by split and join operations

(Figure: evolution of the component weights over the EM iterations for a split (left) and a join (right) of source hypotheses.)

A minimal sketch of the similarity terms follows the references below.

[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
[2] d(α, β) = min{360 − |α − β|, |α − β|}
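A minimal sketch of the two similarity terms and their combination; the full EM update and the split/join logic are omitted, and all numeric values in the toy usage are assumptions.

```python
import numpy as np

def circular_distance(alpha, beta):
    """d(alpha, beta) = min{360 - |alpha - beta|, |alpha - beta|}, in degrees."""
    diff = np.abs(alpha - beta)
    return np.minimum(360.0 - diff, diff)

def spatial_similarity(theta, theta_i, sigma_i):
    """Gaussian over the circular angle distance."""
    d = circular_distance(theta, theta_i)
    return np.exp(-0.5 * (d / sigma_i) ** 2) / (np.sqrt(2 * np.pi) * sigma_i)

def spectral_similarity(s, t_i):
    """Scalar product of the normalized spectra."""
    return float(np.dot(s / np.linalg.norm(s), t_i / np.linalg.norm(t_i)))

def joint_similarity(theta, s, group):
    """p(x|Psi_i) = p_s(x|Psi_i) * p_a(x|Psi_i) for a group (Theta_i, sigma_i, t_i)."""
    theta_i, sigma_i, t_i = group
    return spatial_similarity(theta, theta_i, sigma_i) * spectral_similarity(s, t_i)

# toy usage: one localization scored against one group hypothesis
group = (42.0, 5.0, np.ones(16))
print(joint_similarity(45.0, np.random.rand(16), group))
```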



CASA tracking: Tracking by sequential integration

Rule-based tracking [1]:
- form tracks of speakers by association over consecutive time frames k → k + 1
- use the Gaussian of the angles p_a to decide whether a new detection belongs to a given track
- spectral similarity p_s can additionally be used to disambiguate hard cases
- if no track with p > p_min exists, create a new one
- allow gaps shorter than t_TTL before tracks die [2]

A minimal sketch of the association rule follows the references below.

[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
[2] N. Madhu and R. Martin. A Scalable Framework for Multiple Speaker Localization and Tracking. In International Workshop on Acoustic Echo and Noise Control, Seattle, WA, USA, September 2008
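A minimal sketch of the rule set; the Gaussian width, p_min, and the TTL value are assumptions, and the spectral disambiguation term is omitted.

```python
import numpy as np

def gaussian_angle_score(theta, track_theta, sigma=8.0):
    """p_a-like score for a detection (degrees) belonging to a track."""
    diff = abs(theta - track_theta)
    d = min(360.0 - diff, diff)
    return np.exp(-0.5 * (d / sigma) ** 2)

def associate(detections, tracks, frame, p_min=0.1, ttl_frames=25):
    """One frame of rule-based sequential integration: assign each detection
    to the best-scoring live track or start a new one; tracks not updated
    for ttl_frames frames die."""
    for theta in detections:
        scores = [gaussian_angle_score(theta, t["theta"]) for t in tracks]
        if scores and max(scores) > p_min:
            best = tracks[int(np.argmax(scores))]
            best["theta"], best["last_seen"] = theta, frame
        else:
            tracks.append({"theta": theta, "last_seen": frame})
    return [t for t in tracks if frame - t["last_seen"] <= ttl_frames]

# toy usage: one slowly moving speaker plus a second, short-lived detection
tracks = []
for k, dets in enumerate([[40.0], [42.5], [41.0, 130.0], []]):
    tracks = associate(dets, tracks, k)
print(tracks)
```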


CASA tracking: Evaluation (1)

Method comparison: (Figure: angular error ε_a [°] over reverberation time T60 (0–2 s) for SRP-PHAT [1], PCA beamforming [2], and PoAP EM [3].)

[1] Guillaume Lathoud and Jean-Marc Odobez. Short-Term Spatio-Temporal Clustering applied to Multiple Moving Speakers. IEEE Trans. Audio Speech & Language Process., 2007
[2] E. Warsitz and R. Haeb-Umbach. Acoustic Filter-and-Sum Beamforming by adaptive Principal Component Analysis. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume 4, pages 797–800. IEEE, 2005
[3] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013



CASA tracking: Evaluation (2)

AV16.3 angular tracks (figure: azimuth [°] over time t [s] showing the PoAP EM localizations and the resulting tracks):

Method           RMS      F1       precision  recall
SRP PHAT th=0    42.44°   83.52%   72.91%     97.75%
SRP PHAT th=*    2.86°    94.12%   98.77%     89.89%
PoAP EM          2.69°    96.82%   96.64%     97.00%
PoAP EM track    2.33°    99.63%   100.00%    99.26%


CASA tracking: Example

Two speakers in the FINCA (T60 ≈ 0.65 s)

(Figure: azimuth θ [°] of the reference R and the tracks T1, T2 over time, plus frequency-band activity of speakers S1 and S2.)

RMS 5°, precision 100%, recall 81.2%


CASA tracking: Demo video


Multi-array tracking [1]

(Block diagram: each array computes per-pair correlations and a spatial likelihood over θ; the per-array angular tracks are combined into Euclidean x, y tracks over time.)

Advantages:
- Euclidean tracking is possible with multiple small arrays
- distributed microphone arrays can capture speech at all positions of interest
- can be deployed ad hoc and coupled wirelessly

Challenges:
- synchronization may not be good: drift, jitter, omissions
- multiple concurrent speakers produce ambiguous localizations
- geometry has to be calibrated
- triangulation of multiple DoAs is not straightforward

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014


Multi-array tracking (1/3): System design

- distributed nodes localize concurrent speech events [1]
- network or radio transmission (< 32 kbps) to an integration node
- the integration node performs:
  - association [1]
  - CASA tracking (as before, but in parallel on all available arrays)
  - triangulation [2]

[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013
[2] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014



Multi-array tracking (2/3): Association

- multiple concurrent estimates by multiple microphone arrays
- location alone remains ambiguous; other modalities are needed
- resolve using spectral similarity (again) [1]
- assign the scalar product as the spectral similarity p_s for being the same source:
  p_s(s^(m), s^(n)) = ⟨ s^(m)/‖s^(m)‖, s^(n)/‖s^(n)‖ ⟩
- find groups of localizations with a high ∏_{m,n} p_s(s^(m), s^(n))

[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013


Multi-array tracking (3/3): Triangulation

- calculate the intersection of the lines from the array positions o^(m), o^(n) with angles Θ^(m), Θ^(n):
  z^(m,n) = o^(m) + r^(m) (cos Θ^(m), sin Θ^(m))ᵀ = o^(n) + r^(n) (cos Θ^(n), sin Θ^(n))ᵀ
- the accuracy of an angular intersection depends on the intersection angle, so use a weighted sum [1]:
  z = Σ_{m,n} q(Θ^(m), Θ^(n)) z^(m,n) / Σ_{m,n} q(Θ^(m), Θ^(n)),  with  q(α, β) = |sin(α − β)|

A minimal sketch of this weighted triangulation follows the reference below.

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
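A minimal sketch of the pairwise intersections and their |sin|-weighted average; the toy geometry is hypothetical.

```python
import numpy as np
from itertools import combinations

def intersect(o_m, theta_m, o_n, theta_n):
    """Intersection of two DoA rays o + r * (cos theta, sin theta), angles in radians."""
    d_m = np.array([np.cos(theta_m), np.sin(theta_m)])
    d_n = np.array([np.cos(theta_n), np.sin(theta_n)])
    A = np.column_stack([d_m, -d_n])
    r = np.linalg.solve(A, np.asarray(o_n, float) - np.asarray(o_m, float))
    return np.asarray(o_m, float) + r[0] * d_m

def weighted_triangulation(origins, thetas):
    """Combine all pairwise intersections with weights q = |sin(angle difference)|."""
    num, den = np.zeros(2), 0.0
    for m, n in combinations(range(len(origins)), 2):
        q = abs(np.sin(thetas[m] - thetas[n]))
        if q < 1e-6:                 # near-parallel rays carry no information
            continue
        num += q * intersect(origins[m], thetas[m], origins[n], thetas[n])
        den += q
    return num / den

# toy usage: three arrays observing a source near (2, 1)
origins = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
thetas = [np.arctan2(1 - o[1], 2 - o[0]) for o in origins]
print(weighted_triangulation(origins, thetas))   # approx. [2. 1.]
```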


Multi-array tracking: Simulation

(Figure: simulated room layout with speaker positions 1–18 and five microphone arrays A–E [1].)

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014


Multi-array tracking evaluation: Triangulation

(Figure: localization error ε_l [m] and recall [%] over reverberation time T60 (0.25–1.75 s) for three triangulation variants: five arrays unweighted, five arrays q-weighted, and the two strongest arrays only. Simulation of a single speaker tracked by five sensor nodes at different reverberation times [1].)

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014


Multi-array tracking evaluation: Jitter

(Figure: localization error ε_l [m] over inter-array jitter (0–1.8 s). Simulation of a single speaker tracked by five sensor nodes at T60 = 0.5 s [1].)

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014


Multi-array tracking evaluation: Geometry

(Figure: localization error ε_l [m] as a function of Euclidean calibration error (0–0.5 m, top) and angular calibration error (0–12°, bottom), for T60 = 0.5 s and T60 = 1.0 s [1].)

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014


Multi-array tracking evaluation: Real data

       #1 one static speaker          #2 one moving speaker
       precision  recall  RMS         precision  recall  RMS
θ(0)   77%        58%     5.6°        89%        85%     5.4°
θ(1)   91%        70%     4.6°        80%        72%     6.7°
θ(2)   86%        70%     4.9°        72%        70%     5.8°
x,y    99%        79%     0.17 m      90%        82%     0.39 m

       #3 two static speakers         #4 three speakers
       precision  recall  RMS         precision  recall  RMS
θ(0)   94%        88%     5.9°        73%        73%     6.7°
θ(1)   95%        95%     8.4°        93%        86%     4.9°
θ(2)   88%        92%     5.9°        96%        97%     3.2°
x,y    100%       95%     0.10 m      99%        98%     0.21 m

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014


Multi-array tracking: Example

(Figure: per-array azimuth estimates θ(0), θ(1), θ(2) [°] and the resulting Euclidean tracks x, y [m] over time for three speakers, with tracks 1–3 following speakers 1–3.)


Multi-array tracking: Demo video

[1] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014



Microphone array geometry calibration

Application: speaker tracking with unknown placement of the microphone arrays
- manual measurement is tedious
- ad hoc placement

Estimating by diffuse noise? [1]
- ok for small arrays [2]
- diffuse noise is not given, does not work for large distances

(Figure: calibration error s [cm] over array size (0–180 cm).)

Solutions:
- speech/noise based measurements of distance and orientation [2] [3]
- multimodal anchoring using video cameras [4]

[1] Iain McCowan and Mike Lincoln. Microphone Array Shape Calibration in Diffuse Noise Fields. IEEE Transactions on Audio, Speech and Language Processing, 16(3):666–670, 2008
[2] Marius H. Hennecke, Thomas Plötz, Gernot A. Fink, and Reinhold Haeb-Umbach. A Hierarchical Approach to Unsupervised Shape Calibration of Microphone Array Networks. In IEEE Workshop on Stat. Signal Proc., pages 257–260, Cardiff, Wales, UK, 2009
[3] Axel Plinge and Gernot A. Fink. Geometry Calibration of Multiple Microphone Arrays in Highly Reverberant Environments. In Int. Workshop on Acoustic Signal Enhancement, Antibes – Juan les Pins, France, September 2014
[4] Axel Plinge and Gernot A. Fink. Geometry Calibration of Distributed Microphone Arrays Exploiting Audio-Visual Correspondences. In European Signal Process. Conf., Lisbon, Portugal, September 2014


Multimodal Geometry Calibration

- calibrated video cameras
- record a human speaker at fixed positions
- estimate time segments and DoA from the audio
- compute the visual x, y localization for the segments
- estimate the microphone array geometry from the audio-visual correspondences

(Figure: room layout with the ten speaker positions used for calibration.)



Audio Segment Estimation

- localize the source with [1] for each microphone array
- compute segments with low angular deviation and TTL = 1 s

(Figure: per-array azimuth estimates θ(0), θ(1), θ(2) [°] over time with the ground truth and the extracted segments 0–2.)

[1] Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013



Visual Localization

- localize the speaker using background subtraction [1] and an upper-body HoG detector [2] for each camera
- compute x, y for each time segment by weighted triangulation [3]

(Figure: estimated x, y positions [m] against the ground truth.)

A minimal detection sketch follows the references below.

[1] P. KadewTraKuPong and R. Bowden. An improved Adaptive Background Mixture Model for Real-Time Tracking with Shadow Detection. In European Workshop on Advanced Video-Based Surveillance Systems, 2001
[2] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893, 2005
[3] Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
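A minimal sketch of how the two cues could be combined per camera, using OpenCV's MOG2 background subtractor and its default full-body HOG people detector as a stand-in for the upper-body detector; the overlap threshold and the bottom-center localization are assumptions.

```python
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2()
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def localize_in_frame(frame):
    """Return image points of detected persons that overlap the moving foreground."""
    mask = subtractor.apply(frame)                 # foreground mask
    boxes, _ = hog.detectMultiScale(frame)         # HOG person detections
    keep = []
    for (x, y, w, h) in boxes:
        roi = mask[y:y + h, x:x + w]
        if roi.size and np.count_nonzero(roi) / roi.size > 0.2:
            keep.append((x + w // 2, y + h))       # bottom-center pixel of the box
    return keep
```

The resulting image points per camera would then be triangulated across cameras with the same |sin|-weighted scheme used for the acoustic DoAs.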



Geometry calibration: Method (1/2)

For each circular array:
- known: speaker positions s_i, localized at angles Θ_{i,m}
- unknown: position r_m, orientation o_m, distances k_{i,m}
- error e_i of the estimate:
  e_i = s_i − r_m − k_{i,m} (cos(o_m + Θ_{i,m}), sin(o_m + Θ_{i,m}))ᵀ
- stack the equations for multiple positions i = 1, …, I; overdetermined for I > 3
- solve by minimizing e² = Σ_{i=1}^{I} e_i² with bounded gradient descent (L-BFGS-B) [1]

A minimal sketch of this optimization follows the reference below.

[1] Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16:1190–1208, 1995
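A minimal sketch of the per-array fit using scipy's L-BFGS-B as the bounded optimizer; the bounds, the initial guess, and the treatment of the distances k_{i,m} as free variables are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def calibrate_array(speaker_xy, thetas, room=(6.0, 4.0)):
    """Estimate array position r_m and orientation o_m from known speaker
    positions s_i and the DoA angles Theta_{i,m} (radians) the array measured.
    The unknowns are (r_x, r_y, o_m, k_1, ..., k_I)."""
    s = np.asarray(speaker_xy, float)
    I = len(s)

    def squared_error(p):
        r, o, k = p[:2], p[2], p[3:]
        dirs = np.stack([np.cos(o + thetas), np.sin(o + thetas)], axis=1)
        e = s - r - k[:, None] * dirs            # residual vectors e_i
        return np.sum(e ** 2)

    x0 = np.concatenate([[room[0] / 2, room[1] / 2, 0.0], np.ones(I)])
    bounds = [(0, room[0]), (0, room[1]), (-np.pi, np.pi)] + [(0.1, 10.0)] * I
    res = minimize(squared_error, x0, method="L-BFGS-B", bounds=bounds)
    return res.x[:2], res.x[2]                    # estimated r_m and o_m

# toy usage: true array at (1, 1) with orientation 0.3 rad
true_r, true_o = np.array([1.0, 1.0]), 0.3
speakers = np.array([[3.0, 1.5], [2.0, 3.0], [4.0, 2.5], [1.5, 3.5], [5.0, 1.0]])
thetas = np.arctan2(speakers[:, 1] - true_r[1], speakers[:, 0] - true_r[0]) - true_o
print(calibrate_array(speakers, thetas))
```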


Geometry calibration: Method (2/2)

- compute all N = 35 sets for I = 5: e_r = 230, 98, 88 mm; e_o = 5.78°, 1.82°, 4.02°
- take their median: e_r = 103, 79, 70 mm; e_o = 1.92°, 1.00°, 4.04°
- consensus estimate, the mean of the N/4 sets closest to the median: e_r = 110, 74, 72 mm; e_o = 0.93°, 0.78°, 3.47°

(Figure: estimated array positions in the room, x and y in m, for each step.)

A minimal sketch of the consensus step follows below.
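A minimal sketch of the consensus estimate: component-wise median over all per-subset estimates, then the mean of the N/4 estimates closest to it. The toy data, including the amount of noise and the number of outliers, are assumptions.

```python
import numpy as np

def consensus(estimates):
    """Robust consensus over N per-subset estimates (rows = [x, y, o])."""
    estimates = np.asarray(estimates, float)
    med = np.median(estimates, axis=0)
    dist = np.linalg.norm(estimates - med, axis=1)
    keep = np.argsort(dist)[: max(1, len(estimates) // 4)]
    return estimates[keep].mean(axis=0)

# toy usage: 35 noisy estimates with a few outliers
rng = np.random.default_rng(0)
est = rng.normal([2.0, 3.0, 0.5], 0.05, size=(35, 3))
est[:3] += 1.0    # outliers
print(consensus(est))
```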


Calibration results & method comparison

Method          scaling  rotation  translation  sync
RANSAC [1]      ✓        –         –            required
unimodal [2]    ✓        –         –            required
multimodal [3]  ✓        ✓         ✓            online

(Figure: calibration errors e_s [cm] and e_γ [°] of the three methods for noise and speech excitation.)

[1] F. Jacob, J. Schmalenstroeer, and R. Haeb-Umbach. DoA-based microphone array position self-calibration using circular statistics. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013
[2] Axel Plinge and Gernot A. Fink. Geometry Calibration of Multiple Microphone Arrays in Highly Reverberant Environments. In Int. Workshop on Acoustic Signal Enhancement, Antibes – Juan les Pins, France, September 2014
[3] Axel Plinge and Gernot A. Fink. Geometry Calibration of Distributed Microphone Arrays Exploiting Audio-Visual Correspondences. In European Signal Process. Conf., Lisbon, Portugal, September 2014


Geometry calibration: Demo video


Acoustic event detection

Applications:
- meeting analysis, context awareness, security
- prefilter for tracking / calibration

Problems:
- very heterogeneous classes
- open set

Method:
- neuro-inspired features
- supervised learning
- Bag-of-Features classification


BoF acoustic event detection

Bag-of-Features?
- approach from text retrieval [1]: histograms of quantized features

Our approach [2]:
- MFCCs and GFCCs [3], successful in speaker identification
- "super codebook" from class-wise training
- ML classification of histograms

[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999
[2] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
[3] Xiaojia Zhao, Yang Shao, and DeLiang Wang. CASA-Based Robust Speaker Identification. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1608–1616, 2012


BoF acoustic event detection

- MFCCs and GFCCs of a frame y_k are quantized by the super-codebook v with entries v_{l=(I·c+i)} = (μ_{i,c}, σ_{i,c}), where
  q_{k,l}(y_k, v_l) = N(y_k | μ_l, σ_l)
- pyramid histogram for temporal structure:
  b_l^(1)(Y_n, v_l) = (2/K) Σ_{k=1}^{K/2} q_{k,l}(y_k, v_l),   b_l^(2)(Y_n, v_l) = (2/K) Σ_{k=K/2+1}^{K} q_{k,l}(y_k, v_l)
  b_l^(3)(Y_n, v_l) = max{ b_l^(1)(Y_n, v_l), b_l^(2)(Y_n, v_l) }
- multinomial maximum likelihood classification:
  P(Y_n | Ω_c) = ∏_{v_l ∈ v} P(v_l | Ω_c)^{b_l(Y_n, v_l)},   P(v_l | Ω_c) = (1 + Σ_{Y_n ∈ Ω_c} b_l(Y_n, v_l)) / (L + Σ_{m=1}^{L} Σ_{Y_n ∈ Ω_c} b_m(Y_n, v_m))

A minimal sketch of these steps follows the reference below.

[1] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
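A minimal sketch of the soft quantization, a simplified two-level pyramid histogram, and the multinomial classifier with the smoothed estimate above. The feature dimensions, codebook values, and the exact histogram layout are assumptions, not the original configuration.

```python
import numpy as np

def soft_quantize(Y, mu, sigma):
    """q_{k,l} proportional to N(y_k | mu_l, sigma_l) with diagonal Gaussians."""
    diff = Y[:, None, :] - mu[None, :, :]                    # (K, L, D)
    return np.exp(-0.5 * np.sum((diff / sigma[None]) ** 2, axis=2))

def pyramid_histogram(Y, mu, sigma):
    """Whole-segment histogram plus the max over the two halves."""
    q = soft_quantize(Y, mu, sigma)                          # (K, L)
    K = len(Y)
    b1 = 2.0 / K * q[: K // 2].sum(axis=0)
    b2 = 2.0 / K * q[K // 2:].sum(axis=0)
    return np.concatenate([q.mean(axis=0), np.maximum(b1, b2)])

def train_multinomial(histograms_per_class):
    """P(v_l | Omega_c) with add-one smoothing, per class."""
    probs = {}
    for c, hists in histograms_per_class.items():
        counts = np.sum(hists, axis=0)
        probs[c] = (1.0 + counts) / (len(counts) + counts.sum())
    return probs

def classify(hist, probs):
    """argmax_c sum_l b_l * log P(v_l | Omega_c)."""
    return max(probs, key=lambda c: np.dot(hist, np.log(probs[c])))

# toy usage: 8 codewords in 13 dimensions, two hypothetical classes
rng = np.random.default_rng(1)
mu, sigma = rng.normal(size=(8, 13)), np.ones((8, 13))
train = {c: [pyramid_histogram(rng.normal(size=(40, 13)), mu, sigma) for _ in range(5)]
         for c in ("speech", "keyboard")}
probs = train_multinomial(train)
print(classify(pyramid_histogram(rng.normal(size=(40, 13)), mu, sigma), probs))
```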


AED: Features

(Figure: per-class F-scores [%] on the FINCA dataset [1] and on the D-CASE development set [2] for MFCC, GFCC*, and combined MFCC + GFCC* features [3].)

[1] http://patrec.cs.tu-dortmund.de/cms/en/home/Resources/index.html
[2] Dimitrios Giannoulis, Dan Stowell, Emmanouil Benetos, Mathias Rossignol, and Mathieu Lagrange. A Database and Challenge for Acoustic Scene Classification and Event Detection. In European Signal Processing Conference, Marrakech, Morocco, 2013
[3] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014


Acoustic event detection: Classifiers

(Figure: frame-level F1 [%] on the D-CASE development set [1] and the FINCA dataset [2] for several classifiers: HMM GBFB (SCS) [3], NMF HMM (GVV), HMM MFCC GFCC*, BoSF-P MFCC GFCC* [4], GMM MFCC∆∆ (VVK), HMM RF (NVM), GMM MFCC, BoAW MFCC [5], SVM MFCC∆∆ (NR2), and the NMF baseline.)

[1] Dimitrios Giannoulis, Dan Stowell, Emmanouil Benetos, Mathias Rossignol, and Mathieu Lagrange. A Database and Challenge for Acoustic Scene Classification and Event Detection. In European Signal Processing Conference, Marrakech, Morocco, 2013
[2] http://patrec.cs.tu-dortmund.de/cms/en/home/Resources/index.html
[3] Jens Schroeder, Niko Moritz, Marc Rene Schaedler, Benjamin Cauchi, Kamil Adiloglu, Joern Anemueller, Simon Doclo, Birger Kollmeier, and Stefan Goetze. On the use of spectro-temporal features for the IEEE AASP challenge 'detection and classification of acoustic scenes and events'. In IEEE Workshop on Appl. of Signal Proc. to Audio and Acoustics, 2013
[4] Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014
[5] Stephanie Pancoast and Murat Akbacak. Bag-of-Audio-Words Approach for Multimedia Event Classification. In Interspeech, Portland, OR, USA, 2012


Acoustic event detection: Example

(Figure: event detections over 90 s compared to the ground truth for BoSF-P MFCC+GFCC* (F = 60.17) and HMM MFCC+GFCC* (F = 66.55); classes: alert, clearthroat, cough, doorslam, drawer, keyboard, knock, laughter, mouse, pageturn, pendrop, phone, printer, speech, switch, keys, silence.)


Acoustic Event Detection: Demo Video


References

L.M. Arslan and J.H.L. Hansen. Minimum cost based Phoneme Class Detection for Improved Iterative Speech Enhancement. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume ii, pages II/45–II/48, Apr 1994.

Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, October 1996.

Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16:1190–1208, 1995.

Dieter Bauer and Axel Plinge. Selective Phoneme Spotting for Realization of an /s, z, C, t/ Transposer. In 8th Int. Conf. on Computers Helping People with Special Needs (ICCHP2002), volume 2398 of LNCS, Linz, Austria, 2002. Springer.

Dieter Bauer, Axel Plinge, and Walter H Ehrenstein. Compensation of Severe Sensory Hearing Deficits. Re-Sampling Versus Re-Synthesis. In Ger M Craddock, Lisa P McCormack, Richard B Reilly, and Harry T P Knops, editors, Assistive Technology - Shaping the Future; AAATE Conference Proceedings, volume 11, pages 522–526. IOS Press, 2003.


A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990.

Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.

M. P. Cooke. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am., 119:1562–1573, 2006.

Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893, 2005.

B. Plannerer et al. A continuous speech recognition system integrating additional acoustic knowledge sources. Technical report, TU München, 1996.

B. Grothe. New roles for synaptic inhibition in sound localisation. Nature, 4(7):540–550, 2003.


Dimitrios Giannoulis, Dan Stowell, Emmanouil Benetos, Mathias Rossignol, and Mathieu Lagrange. A Database and Challenge for Acoustic Scene Classification and Event Detection. In European Signal Processing Conference, Marrakech, Morocco, 2013.

S. Handel. Listening. MIT Press, 1989.

Marius H. Hennecke, Thomas Plötz, Gernot A. Fink, and Reinhold Haeb-Umbach. A Hierarchical Approach to Unsupervised Shape Calibration of Microphone Array Networks. In IEEE Workshop on Stat. Signal Proc., pages 257–260, Cardiff, Wales, UK, 2009.

F. Jacob, J. Schmalenstroeer, and R. Haeb-Umbach. DoA-based microphone array position self-calibration using circular statistics. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.

P. KadewTraKuPong and R. Bowden. An improved Adaptive Background Mixture Model for Real-Time Tracking with Shadow Detection. In European Workshop on Advanced Video-Based Surveillance Systems, 2001.

Damien Kelly, Anil Kokaram, and Frank Boland. Voxel-Based Viterbi Active Speaker Tracking (V-VAST) with Best View Selection for Video Lecture Post-Production. In IEEE Int. Conf. Acoustics Speech & Signal Process., pages 2296–2299, Prague, Czech Republic, 2011.


Guillaume Lathoud and Jean-Marc Odobez. Short-Term Spatio-Temporal Clustering applied to Multiple Moving Speakers. IEEE Trans. Audio Speech & Language Process., 2007.

Guillaume Lathoud, Jean-Marc Odobez, and Daniel Gatica-Perez. AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. In International conference on Machine Learning for Multimodal Interaction, volume 3361 of LNCS, pages 182–195, Martigny, Switzerland, 2005.

Richard F. Lyon. Machine Hearing – An Emerging Field. IEEE Signal Process. Magazine, September 2010.

Iain McCowan and Mike Lincoln. Microphone Array Shape Calibration in Diffuse Noise Fields. IEEE Transactions on Audio, Speech and Language Processing, 16(3):666–670, 2008.

N. Madhu and R. Martin. A Scalable Framework for Multiple Speaker Localization and Tracking. In International Workshop on Acoustic Echo and Noise Control, Seattle, WA, USA, September 2008.

Stephanie Pancoast and Murat Akbacak. Bag-of-Audio-Words Approach for Multimedia Event Classification. In Interspeech, Portland, OR, USA, 2012.


Axel Plinge and Dieter Bauer. Providing Speech Enhancement and Replacement for Persons with Severely Impaired Hearing. In Conference and Workshop on Assistive Technologies for People with Vision and Hearing Impairments, Granada, Spain, 2004.

Axel Plinge and Dieter Bauer. Genesis of Wearable DSP Structures for Selective Speech Enhancement and Replacement to Compensate Severe Hearing Deficits. In A. Pruski and H. Knops, editors, Assistive Technology - From Virtuality to Reality; AAATE Conference Proceedings. IOS Press, Amsterdam, 2005.

Axel Plinge, Dieter Bauer, and Martin Finke. Intelligibility Enhancement of Human Speech for Severely Hearing Impaired Persons by Dedicated Digital Processing. In Crt. Marincek and Christian Bühler, editors, Assistive Technology - Added Value to the Quality of Life; AAATE Conference Proceedings, Amsterdam, 2001. IOS Press.

K. J. Palomäki, G. J. Brown, and D. Wang. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Commun., 43(4):361–378, 2004.

Axel Plinge and Gernot A. Fink. Online Multi-Speaker Tracking Using Multiple Microphone Arrays Informed by Auditory Scene Analysis. In European Signal Process. Conf., Marrakesh, Morocco, 2013.


Axel Plinge and Gernot Fink. Multi-Speaker tracking using multiple distributed microphone arrays. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014.

Axel Plinge and Gernot A. Fink. Geometry Calibration of Distributed Microphone Arrays Exploiting Audio-Visual Correspondences. In European Signal Process. Conf., Lisbon, Portugal, September 2014.

Axel Plinge and Gernot A. Fink. Geometry Calibration of Multiple Microphone Arrays in Highly Reverberant Environments. In Int. Workshop on Acoustic Signal Enhancement, Antibes – Juan les Pins, France, September 2014.

Axel Plinge, Rene Grzeszick, and Gernot Fink. A Bag-of-Features Approach to Acoustic Event Detection. In IEEE Int. Conf. Acoustics Speech & Signal Process., Florence, Italy, May 2014.

Axel Plinge, Marius H. Hennecke, and Gernot A. Fink. Robust Neuro-Fuzzy Speaker Localization Using a Circular Microphone Array. In Int. Workshop on Acoustic Echo and Noise Control, Tel Aviv, Israel, August 2010.



P. Pertilä, T. Korhonen, and A. Visa. Measurement combination for acoustic source localization in a room environment. EURASIP J. Audio Speech & Music Process., 2008:1–14, 2008.

Axel Plinge. Neurobiologisch inspirierte Lokalisierung von Sprechern in realen Umgebungen. Master's thesis, TU Dortmund, Fakultät für Informatik in Zusammenarbeit mit dem Institut für Roboterforschung, Dortmund, Germany, May 2010.

R.D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice. An efficient auditory filterbank based on the gammatone functions. Technical Report APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988.

S. W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, 2nd edition, 1999.

Jens Schroeder, Niko Moritz, Marc Rene Schaedler, Benjamin Cauchi, Kamil Adiloglu, Joern Anemueller, Simon Doclo, Birger Kollmeier, and Stefan Goetze. On the use of spectro-temporal features for the IEEE AASP challenge 'detection and classification of acoustic scenes and events'. In IEEE Workshop on Appl. of Signal Proc. to Audio and Acoustics, 2013.

Barry E. Stein and Benjamin A. Rowland. Organization and plasticity in multisensory integration: early and late experience affects its governing principles. Progress in Brain Research, pages 145–163, 2011.


Shankar T. Shivappa, Mohan Manubhai Trivedi, and Bhaskar D. Rao. Audiovisual Information Fusion in Human–Computer Interfaces and Intelligent Environments: A Survey. Proceedings of the IEEE, 98(10):1692–1715, October 2010.

David Talkin. A Robust Algorithm for Pitch Tracking (RAPT). In W.B. Kleijn and K.K. Paliwal, editors, Speech Coding and Synthesis. Elsevier Science, 1995.

A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12:97–136, 1980.

Masashi Unoki and Masato Akagi. A method of signal extraction from noisy signal based on auditory scene analysis. Speech Commun., 27(3):261–279, 1999.

Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.

S. N. Wrigley and G. J. Brown. A computational model of auditory selective attention. IEEE Transactions on Neural Networks: Special Issue on Temporal Coding for Neural Information Processing, 15(5):1151–1163, 2004.


DeLiang Wang and Guy J. Brown, editors. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley Interscience, 2006.

E. Warsitz and R. Haeb-Umbach. Acoustic Filter-and-Sum Beamforming by adaptive Principal Component Analysis. In IEEE Int. Conf. Acoustics Speech & Signal Process., volume 4, pages 797–800. IEEE, 2005.

Xiaojia Zhao, Yang Shao, and DeLiang Wang. CASA-Based Robust Speaker Identification. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1608–1616, 2012.
