Environmental Sound Recognition and Classification

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected]
http://labrosa.ee.columbia.edu/
June 1, 2011

1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Speech Separation
5. Open Issues


1. Machine Listening

• Extracting useful information from sound
  ... like animals do

[Figure: machine listening tasks vs. domains. Tasks (Detect / Classify / Describe) across domains (Speech / Music / Environmental Sound), with examples: VAD, Speech/Music detection; ASR, Music Transcription, "Sound Intelligence"; Automatic Narration, Emotion, Music Recommendation, Environment Awareness]

Environmental Sound Applications



• Audio Lifelog Diarization

[Figure: two days (2004-09-13/14, 09:00-18:00) of personal audio automatically segmented into episodes such as preschool, cafe, office, lecture, lab, outdoor, and meetings, with speakers including Ron, Manuel, Mike, and Lesser]

• Consumer Video Classification & Search

Prior Work

• Environment Classification
  speech/music/silent/machine (Zhang & Kuo '01)
• Nonspeech Sound Recognition
  meeting-room audio event classification
  sports events: cheers, bat/ball sounds, ... (Temko & Nadeu '06)

Consumer Video Dataset

• 25 "concepts" from Kodak user study
  boat, crowd, cheer, dance, ...

[Figure: counts of labeled and overlapped concepts across the 25 classes: museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+]

• Grab top 200 videos from YouTube search
  then filter for quality, unedited = 1873 videos
  manually relabel with concepts

Obtaining Labeled Data

• Amazon Mechanical Turk (Y.-G. Jiang et al. 2011)
  10 s clips
  9,641 videos in 4 weeks

2. Background Classification

• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance) = "bag of features"

[Figure: video soundtrack spectrogram (0-8 kHz, 10 s), the 20-dimensional per-frame MFCC features, and the resulting clip-level MFCC covariance matrix]

• Classify by e.g. KL distance + SVM (sketch below)
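A minimal Python sketch of this baseline, assuming librosa and scikit-learn are available; the frame sizes, covariance regularization, and the exponentiated symmetric-KL kernel (not guaranteed positive-definite, but common in practice) are illustrative choices rather than the exact system from the talk:

```python
# Sketch: per-frame MFCCs, clip = Gaussian (mean, covariance),
# clips compared by symmetric KL divergence inside an SVM kernel.
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_stats(path, n_mfcc=20):
    """Mean and covariance of MFCC frames for one soundtrack."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.030 * sr),      # ~30 ms frames
                                hop_length=int(0.015 * sr))
    mu = mfcc.mean(axis=1)
    cov = np.cov(mfcc) + 1e-6 * np.eye(n_mfcc)              # regularize
    return mu, cov

def kl_gauss(m0, c0, m1, c1):
    """KL divergence KL(N0 || N1) between full-covariance Gaussians."""
    ic1 = np.linalg.inv(c1)
    d = m1 - m0
    _, ld0 = np.linalg.slogdet(c0)
    _, ld1 = np.linalg.slogdet(c1)
    return 0.5 * (np.trace(ic1 @ c0) + d @ ic1 @ d - len(m0) + ld1 - ld0)

def kl_kernel(stats_a, stats_b, gamma=0.01):
    """Exponentiated symmetric KL; use with SVC(kernel='precomputed')."""
    K = np.zeros((len(stats_a), len(stats_b)))
    for i, (m0, c0) in enumerate(stats_a):
        for j, (m1, c1) in enumerate(stats_b):
            K[i, j] = np.exp(-gamma * (kl_gauss(m0, c0, m1, c1) +
                                       kl_gauss(m1, c1, m0, c0)))
    return K

# train: svm = SVC(kernel='precomputed').fit(kl_kernel(tr, tr), labels)
# test:  svm.predict(kl_kernel(te, tr))
```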

Codebook Histograms

• Convert high-dim. distributions to multinomial

[Figure: MFCC features (MFCC(0) vs. MFCC(1)) overlaid with a 15-component global Gaussian Mixture Model, and a per-category histogram of counts over GMM mixture components 1-15]

• Classify by distance on histograms
  KL, Chi-squared + SVM (sketch below)
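A sketch of the codebook step with scikit-learn; the component count and the hard frame assignment are illustrative stand-ins (the results slide uses up to 1024 components):

```python
# Sketch: one global GMM over pooled frames acts as a codebook;
# each clip becomes a normalized histogram of component assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_codebook(pooled_frames, n_components=16):
    """pooled_frames: (total_frames, n_mfcc), pooled across all clips."""
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag',
                           random_state=0).fit(pooled_frames)

def clip_histogram(gmm, frames):
    """Multinomial description of one clip's frames under the codebook."""
    comp = gmm.predict(frames)                   # hard assignment per frame
    hist = np.bincount(comp, minlength=gmm.n_components).astype(float)
    return hist / hist.sum()

def chi2_kernel_entry(ha, hb, gamma=1.0):
    """Exponentiated chi-squared distance between two histograms."""
    return np.exp(-gamma * 0.5 * np.sum((ha - hb) ** 2 / (ha + hb + 1e-12)))
```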

Latent Semantic Analysis (LSA)

• Probabilistic LSA (pLSA) models each histogram as a mixture of several 'topics'
  .. each clip may have several things going on
• Topic sets optimized through EM (sketch below)

  p(ftr | clip) = Σ_topics p(ftr | topic) p(topic | clip)

[Figure: the per-clip GMM histogram matrix p(ftr | clip) factored into "topic" matrices p(ftr | topic) × p(topic | clip)]

  use (normalized?) p(topic | clip) as per-clip features
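A compact EM sketch of this factorization over a feature-by-clip count matrix; purely illustrative numpy, not the talk's implementation:

```python
# Sketch: EM for p(ftr | clip) = sum_z p(ftr | z) p(z | clip),
# fit to a (n_ftrs x n_clips) matrix of codebook counts.
import numpy as np

def plsa(counts, n_topics=5, n_iter=100, eps=1e-12):
    """Returns p_f_z: (n_ftrs, n_topics), p_z_c: (n_topics, n_clips)."""
    rng = np.random.default_rng(0)
    n_ftrs, n_clips = counts.shape
    p_f_z = rng.random((n_ftrs, n_topics))
    p_z_c = rng.random((n_topics, n_clips))
    p_f_z /= p_f_z.sum(0)
    p_z_c /= p_z_c.sum(0)
    for _ in range(n_iter):
        recon = p_f_z @ p_z_c + eps          # current model of p(ftr | clip)
        ratio = counts / recon
        # M-step with responsibilities folded in (multiplicative form)
        new_f_z = p_f_z * (ratio @ p_z_c.T)
        new_z_c = p_z_c * (p_f_z.T @ ratio)
        p_f_z = new_f_z / (new_f_z.sum(0) + eps)
        p_z_c = new_z_c / (new_z_c.sum(0) + eps)
    return p_f_z, p_z_c

# per-clip features: columns of p_z_c (optionally log-scaled / normalized)
```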

Background Classification Results (K. Lee & Ellis '10)

[Figure: per-concept Average Precision for five systems: Guessing, MFCC + GMM Classifier, Single-Gaussian + KL2, 8-GMM + Bha, and 1024-GMM Histogram + 500-topic log(P(z|c)/P(z)) features, across all 25 concept classes]

• Wide range in performance
  audio-linked concepts (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes

Sound Texture Features

(McDermott & Simoncelli '09; Ellis, Zheng & McDermott '11)

• Characterize sounds by perceptually-sufficient statistics (sketch below)
  .. verified by matched resynthesis

[Block diagram: sound → mel filterbank (18 chans) → automatic gain control → subband envelopes; statistics collected: envelope histogram moments mean/var/skew/kurt (18 x 4), FFT modulation energy in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 x 6), cross-band envelope correlations (318 samples); feature vectors compared by Mahalanobis distance]

[Figure: texture features for two 10 s clips, "1159_10 urban cheer clap" vs. "1062_60 quiet dubbed speech music": mel-band spectrograms, subband distributions, moments (M V S K), and modulation frequencies 0.5-32 Hz]
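A rough Python approximation of these statistics, assuming librosa and scipy; the real chain uses an 18-channel mel filterbank with automatic gain control, while this sketch uses plain mel energies, so the numbers will differ:

```python
# Sketch: subband envelopes from a mel spectrogram, then the slide's
# statistics: moments, octave-band modulation energy, cross-band corrs.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def texture_features(y, sr, n_mels=18, hop_s=0.010):
    hop = int(hop_s * sr)
    env = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop) ** 0.5
    # envelope moments: mean, var, skew, kurtosis  -> 18 x 4
    moments = np.stack([env.mean(1), env.var(1),
                        skew(env, axis=1), kurtosis(env, axis=1)], axis=1)
    # modulation energy in octave bins around 0.5..16 Hz  -> 18 x 6
    spec = np.abs(np.fft.rfft(env - env.mean(1, keepdims=True), axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], d=hop_s)
    edges = 0.5 * 2.0 ** np.arange(-0.5, 5.6)        # ~0.35 .. 22.6 Hz
    modenergy = np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(1)
                          for lo, hi in zip(edges[:-1], edges[1:])], axis=1)
    # cross-band envelope correlations (upper triangle)
    cbc = np.corrcoef(env)[np.triu_indices(n_mels, k=1)]
    return np.concatenate([moments.ravel(), modenergy.ravel(), cbc.ravel()])
```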

Sound Texture Features

• Test on MED 2010 development data
  10 specially-collected manual labels: clap, cheer, music, speech, dubbed, other, noisy, quiet, urban, rural

[Figure: MED 2010 texture feature (normalized) means, stddevs, and means/stddevs per label over the moment (V, S, K), modulation-spectrum (ms), and cross-band correlation (cbc) feature groups; plus average precision per label for M, MVSK, ms, cbc, MVSK+ms, MVSK+cbc, and MVSK+ms+cbc feature sets]

• Contrasts in feature sets
  correlation of labels...
• Perform ~ same as MFCCs
  combine well

3. Foreground Event Recognition

• Global vs. local class models (K. Lee, Ellis & Loui '10)
  tell-tale acoustics may be 'washed out' in statistics
  try iterative realignment of HMMs:

[Figure: YT baby 002 ("voice baby laugh") spectrograms. Old Way: all frames contribute to the clip labels; New Way: limited temporal extents, with frames assigned to voice, baby, laugh, or bg states]

  "background" model shared by all clips

Foreground Event HMMs (Lee & Ellis '10)

[Figure: spectrogram of YT_beach_001.mpeg (manual clip-level labels: "beach", "group of 3+", "cheer"), its manual frame-level annotation (wind, cheer, voice, water wave, drum), and the Viterbi path through a 27-state Markov model covering 25 concepts + 2 global backgrounds]

• Training labels only at clip-level
• Refine models by EM realignment
• Use for classifying entire video...
  or seeking to relevant part

Transient Features (Cotton, Ellis & Loui '11)

• Transients = foreground events?
• Onset detector finds energy bursts = best SNR
• PCA basis to represent each 300 ms x auditory freq patch
• "bag of transients" (sketch below)
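A sketch of the pipeline in Python, assuming librosa and scikit-learn; the hop size, mel resolution, and onset picker are stand-ins for the auditory-model front end described above:

```python
# Sketch: onset-anchored spectrogram patches, PCA-reduced, pooled
# into a "bag of transients" per clip.
import numpy as np
import librosa
from sklearn.decomposition import PCA

def transient_patches(y, sr, patch_s=0.3, n_mels=40, hop=256):
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                       hop_length=hop))
    onsets = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hop,
                                        units='frames')
    width = int(patch_s * sr / hop)              # ~300 ms of frames
    return np.array([S[:, t:t + width].ravel()
                     for t in onsets if t + width <= S.shape[1]])

# pool patches over the training set, learn a compact basis:
# pca = PCA(n_components=20).fit(np.vstack(all_clip_patches))
# each clip is then summarized by statistics of pca.transform(patches)
```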

Nonnegative Matrix Factorization

• Decompose spectrograms into templates + activations (sketch below)
  (Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07)

  X = W · H

  fast, forgiving gradient descent algorithm
  2D patches, sparsity control...

[Figure: original mixture spectrogram (442-5478 Hz, 10 s) and three recovered components: Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]
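A minimal numpy sketch of the decomposition using the standard multiplicative updates for the KL-divergence objective (Lee & Seung style); the talk's variants add 2D patches and sparsity control, which this omits:

```python
# Sketch: X ~= W @ H via multiplicative updates for the KL objective.
import numpy as np

def nmf(X, n_templates=3, n_iter=200, eps=1e-9):
    """X: nonnegative (n_freq, n_time). Returns templates W, activations H."""
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], n_templates)) + eps
    H = rng.random((n_templates, X.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(1)[None, :] + eps)
        # keep templates unit-sum; move the scale into H
        scale = W.sum(0, keepdims=True) + eps
        W /= scale
        H *= scale.T
    return W, H
```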

NMF Transient Features

• Learn 20 patches from Meeting Room Acoustic Event data (Cotton & Ellis '11)

[Figure: 20 learned spectro-temporal patches, each ~0.5 s x 442-3474 Hz]

• Compare to MFCC-HMM detector

[Figure: Acoustic Event Error Rate (AEER%) in noise vs. SNR (10-30 dB) for MFCC, NMF, and combined systems]

• NMF more noise-robust
  combines well ...

4. Speech Separation

• Speech recognition is finding best-fit parameters
  argmax_W P(W | X)
• Recognize mixtures with Factorial HMM (sketch below)
  model + state sequence for each voice/source
  exploit sequence constraints, speaker differences

[Figure: two Markov models (model 1, model 2) jointly explaining the observations over time]

  separation relies on detailed speaker model
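To make the product-state search concrete, here is a toy numpy sketch of joint Viterbi decoding over two source HMMs, scoring mixture frames with the max-approximation (the louder source dominates each frequency bin); an illustration, not the talk's decoder:

```python
# Toy sketch: joint Viterbi over the product of two source HMMs.
import numpy as np

def factorial_viterbi(obs, means1, means2, logA1, logA2, var=1.0):
    """obs: (T, F) mixture log-spectra; meansN: (Sn, F) per-state
    log-spectra; logAN: (Sn, Sn) log transitions. Uniform initial
    probabilities assumed. Returns the two state sequences."""
    T = obs.shape[0]
    S1, S2 = len(means1), len(means2)
    # max-approximation: each bin explained by the louder source
    comb = np.maximum(means1[:, None, :], means2[None, :, :])  # (S1,S2,F)
    ll = (-0.5 / var) * ((obs[:, None, None, :] - comb) ** 2).sum(-1)
    ll = ll.reshape(T, S1 * S2)                # joint state j = s1*S2 + s2
    trans = (logA1[:, None, :, None] +
             logA2[None, :, None, :]).reshape(S1 * S2, S1 * S2)
    delta = ll[0].copy()
    back = np.zeros((T, S1 * S2), dtype=int)
    for t in range(1, T):                      # O((S1*S2)^2) per frame:
        scores = delta[:, None] + trans        # why practical factorial
        back[t] = scores.argmax(0)             # decoders prune/factor
        delta = scores.max(0) + ll[t]
    j = int(delta.argmax())
    path = [j]
    for t in range(T - 1, 0, -1):
        j = int(back[t, j])
        path.append(j)
    joint = np.array(path[::-1])
    return joint // S2, joint % S2             # per-source state sequences
```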

Eigenvoices (Kuhn et al. '98, '00; Weiss & Ellis '07, '08, '09)

• Idea: Find speaker model parameter space
  generalize without losing detail?
• Eigenvoice model (sketch below):

  µ = µ̄ + U w + B h

  adapted model = mean voice + eigenvoice bases · weights + channel bases · channel weights

[Figure: speaker models as points projected onto speaker subspace bases]
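A small numpy sketch of how the bases might be built and applied; real systems estimate the weights w via EM over adaptation data, whereas this stand-in uses a least-squares fit to a supervector of target statistics (and omits the channel term B h):

```python
# Sketch of eigenvoice adaptation: mu = mu_bar + U w.
# Bases U come from PCA over stacked speaker-dependent model means.
import numpy as np

def eigenvoice_bases(speaker_means, n_bases=3):
    """speaker_means: (n_speakers, D) stacked model mean supervectors
    (e.g. D = 280 states x 320 bins). Returns mean voice and bases U."""
    mu_bar = speaker_means.mean(0)
    # principal directions of speaker variation
    _, _, Vt = np.linalg.svd(speaker_means - mu_bar, full_matrices=False)
    return mu_bar, Vt[:n_bases].T              # U: (D, n_bases)

def adapt(mu_bar, U, target_stats):
    """Estimate low-dim weights w from a supervector of per-state mean
    estimates for the new speaker, then return the adapted model."""
    w, *_ = np.linalg.lstsq(U, target_stats - mu_bar, rcond=None)
    return mu_bar + U @ w
```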

Eigenvoice Bases

• Mean model: 280 states x 320 bins = 89,600 dimensions
• Eigencomponents shift formants/coloration
• additional components for acoustic channel

[Figure: Mean Voice and Eigenvoice dimensions 1-3, each shown as frequency (0-8 kHz) vs. phone state (b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax)]

Eigenvoice Speech Separation

• Factorial HMM analysis (Weiss & Ellis '10)
  with tuning of source model parameters
  = eigenvoice speaker adaptation


Eigenvoice Speech Separation

• Eigenvoices for Speech Separation task
  speaker-adapted (SA) performs midway between speaker-dependent (SD) & speaker-independent (SI)

[Figure: spectrograms of the Mix and of the SI, SA, and SD separations]

Binaural Cues

• Model interaural spectrum of each source as stationary level and time differences (sketch below)

[Figure: e.g. at 75°, in reverb: IPD, IPD residual, and ILD across frequency]
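A sketch of extracting these cues from a stereo recording with scipy; the per-bin IPD and ILD computed here are the observations that MESSL (next slide) then clusters:

```python
# Sketch: per-bin interaural phase and level differences from a
# stereo STFT.
import numpy as np
from scipy.signal import stft

def interaural_cues(left, right, sr, nfft=1024):
    """Returns IPD (radians) and ILD (dB) as (freq, time) arrays."""
    _, _, L = stft(left, fs=sr, nperseg=nfft)
    _, _, R = stft(right, fs=sr, nperseg=nfft)
    cross = L * np.conj(R)
    ipd = np.angle(cross)                      # interaural phase difference
    ild = 20 * np.log10((np.abs(L) + 1e-9) / (np.abs(R) + 1e-9))
    return ipd, ild
```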

Model-Based EM Source Separation and Localization (MESSL)

(Mandel & Ellis '09)

• EM loop: assign spectrogram points to sources ↔ re-estimate source parameters
• can model more sources than sensors
• flexible initialization

MESSL Results

• Modeling uncertainty improves results
  tradeoff between constraints & noisiness

[Figure: separated-source examples with output SNRs ranging from -2.72 dB to 12.35 dB]

MESSL with Source Priors

• Fixed or adaptive speech models (Weiss, Mandel & Ellis '11)

MESSL-EigenVoice Results

• Source models function as priors
• Interaural parameters give spatial separation
  source model prior improves spatial estimate

[Figure: time-frequency masks (0-8 kHz, 0-1.5 s) and output SNRs: Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL SP (10.01 dB), MESSL EV (10.37 dB)]

Noise-Robust Pitch Tracking

• Important for voice detection & separation
• Based on channel selection, after Wu & Wang (2003) (B.-S. Lee & Ellis '11)
  pitch from summary autocorrelation over "good" bands
  trained classifier decides which channels to include

[Figure: mean BER (%) of pitch tracking algorithms (Wu; CS-GT; PTHvarAdjC5) vs. SNR from -10 to 25 dB]

• Improves over simple Wu criterion
  especially for mid SNR

Noise-Robust Pitch Tracking

• Trained selection includes more off-harmonic channels (sketch below)

[Figure: 08_rbf_pinknoise5dB: spectrogram (0-4 kHz) and per-channel correlograms (period 0-25 ms) under Wu and CS-GT channel selection]
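A crude single-frame sketch of the channel-selection idea in Python with scipy; the trained classifier is replaced by a simple peak-clarity threshold, so this only illustrates the summary-autocorrelation mechanics:

```python
# Sketch: per-band autocorrelations, crude channel selection, summary
# autocorrelation peak -> pitch. Assumes a short voiced frame x
# (e.g. 40 ms) at sr >= 16 kHz; uses envelopes in every band for
# simplicity, unlike the real fine-structure/envelope split.
import numpy as np
from scipy.signal import butter, sosfilt

def pitch_from_frame(x, sr, n_bands=20, fmin=80.0, fmax=400.0):
    edges = np.geomspace(100.0, 4000.0, n_bands + 1)   # band edges / Hz
    lags = np.arange(int(sr / fmax), int(sr / fmin))   # candidate periods
    summary = np.zeros(len(lags))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(2, [lo, hi], btype='bandpass', fs=sr, output='sos')
        env = np.abs(sosfilt(sos, x))                  # subband envelope
        ac = np.correlate(env, env, mode='full')[len(env) - 1:]
        ac /= ac[0] + 1e-9                             # normalize
        if ac[lags].max() > 0.3:     # stand-in for the trained selector
            summary += ac[lags]      # keep only "good" channels
    return sr / lags[summary.argmax()]                 # pitch in Hz
```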

5. Outstanding Issues

• Better object/event separation
  parametric models
  spatial information?
  computational auditory scene analysis...

[Figure: auditory grouping cues: Frequency Proximity, Common Onset, Harmonicity (Barker et al. '05)]

• Large-scale analysis
• Integration with video

Audio-Visual Atoms

(Jiang et al. '09)

• Object-related features
  from both audio (transients) & video (patches)

[Figure: visual atoms (tracked region patches over time) and audio atoms (short spectral patches) combined by Joint Audio-Visual Atom discovery]

Audio-Visual Atoms

• Multi-instance learning of A-V co-occurrences

[Figure: M visual atoms x N audio atoms → M×N A-V atoms → codebook learning → Joint A-V Codebook]

Audio-Visual Atoms

[Figure: example discovered atoms and per-concept results for Audio only, Visual only, and Visual + Audio classifiers: Wedding (black suit + romantic music), Parade (marching people + parade sound), Beach (sand + beach sounds)]

Summary

• Machine Listening:
  Getting useful information from sound
• Background sound classification
  ... from whole-clip statistics?
• Foreground event recognition
  ... by focusing on peak energy patches
• Speech content is very important
  ... separate with pitch, models, ...

References

• Jon Barker, Martin Cooke & Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
• Courtenay Cotton, Dan Ellis & Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," submitted to IEEE WASPAA, 2011.
• Dan Ellis, Xiaohong Zheng & Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
• Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis & Alex Loui, "Short-Term Audio-Visual Atoms for Generic Video Concept Classification," ACM MultiMedia, 5-14, Beijing, Oct 2009.
• Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug 2010.
• Keansub Lee, Dan Ellis & Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
• Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to IEEE WASPAA, 2011.
• Andriy Temko & Climent Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition 39(4): 682-694, 2006.
• Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted Eigenvoice speech models," Computer Speech & Lang. 24(1): 16-29, 2010.
• Ron Weiss, Michael Mandel & Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
• Tong Zhang & C.-C. Jay Kuo, "Audio content analysis for on-line audiovisual data segmentation," IEEE TSAP 9(4): 441-457, May 2001.