Environmental Sound Recognition and Classification
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected]
1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Speech Separation
5. Open Issues

http://labrosa.ee.columbia.edu/
1. Machine Listening
• Extracting useful information from sound ... like animals do
[Figure: “Sound Intelligence” grid of Task (Detect / Classify / Describe) vs. Domain (Speech / Music / Environmental Sound), with examples: VAD, speech/music discrimination, environment awareness, ASR, music transcription, music recommendation, automatic narration, emotion.]
Environmental Sound Applications
• Audio Lifelog Diarization
[Figure: two days of personal audio (2004-09-13 and 2004-09-14), 09:00-18:00, automatically diarized into episodes such as preschool, cafe, office, lecture, lab, outdoor, and meetings, with speaker labels (Ron, Manuel, Mike, Lesser, Arroyo?, Sambarta?).]
• Consumer Video Classification & Search
Prior Work
• Environment Classification: speech / music / silent / machine (Zhang & Kuo ’01)
• Nonspeech Sound Recognition: meeting-room Audio Event Classification; sports events (cheers, bat/ball sounds, ...) (Temko & Nadeu ’06)
Consumer Video Dataset
• 25 “concepts” from Kodak user study: boat, crowd, cheer, dance, ...
[Figure: fraction of clips per labeled concept (museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+) and the overlaps between concepts.]
• Grab top 200 videos from YouTube search, then filter for quality / unedited = 1873 videos; manually relabel with concepts
Obtaining Labeled Data
• Amazon Mechanical Turk (Y-G Jiang et al. 2011): 10 s clips; 9,641 videos labeled in 4 weeks
2. Background Classification
• Baseline for soundtrack classification:
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance) = “bag of features”
[Figure: VTS_04_0001 video soundtrack spectrogram, per-frame MFCC features (20 dims), and the resulting per-clip MFCC covariance matrix.]
• Classify by e.g. KL distance + SVM (a minimal sketch follows below)
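A minimal sketch of this bag-of-features baseline, assuming librosa and scikit-learn are available; the 16 kHz sample rate, 20 MFCC dimensions, and the exp(-gamma·KL) kernel are illustrative choices, not the exact configuration from the talk.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_stats(path, n_mfcc=20, frame_ms=30):
    """Summarize a soundtrack by the mean and covariance of its MFCC frames."""
    y, sr = librosa.load(path, sr=16000)
    n_fft = int(sr * frame_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=n_fft // 2)
    return mfcc.mean(axis=1), np.cov(mfcc)

def sym_kl(stats_a, stats_b):
    """Symmetrized KL divergence between two Gaussian clip models."""
    (ma, Sa), (mb, Sb) = stats_a, stats_b
    d = len(ma)
    def kl(m1, S1, m2, S2):
        S2inv = np.linalg.inv(S2)
        dm = m2 - m1
        return 0.5 * (np.trace(S2inv @ S1) + dm @ S2inv @ dm - d
                      + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
    return kl(ma, Sa, mb, Sb) + kl(mb, Sb, ma, Sa)

def train_clip_svm(paths, labels, gamma=0.01):
    """Turn pairwise KL 'distances' into a kernel and train an SVM."""
    stats = [clip_stats(p) for p in paths]
    D = np.array([[sym_kl(a, b) for b in stats] for a in stats])
    K = np.exp(-gamma * D)
    return SVC(kernel="precomputed").fit(K, labels), stats
```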
Codebook Histograms
• Convert high-dim. distributions to multinomial:
  train a global Gaussian Mixture Model on MFCC frames pooled across clips
  describe each clip by its histogram of counts over the GMM mixture components
[Figure: MFCC(0) vs. MFCC(1) scatter with the 15-component global GMM (left); per-category mixture-component histogram over the 15 GMM mixtures (right).]
• Classify by distance on histograms: KL, Chi-squared + SVM (sketch below)
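A sketch of the codebook-histogram features, again assuming scikit-learn; the 64-component codebook here is a placeholder (the results slide uses up to 1024), and the chi-squared kernel is one of the histogram distances named above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def fit_codebook(all_mfcc_frames, n_components=64):
    """Global GMM over MFCC frames pooled from every training clip."""
    X = np.vstack(all_mfcc_frames)                 # (total_frames, n_mfcc)
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag", max_iter=100).fit(X)

def clip_histogram(gmm, mfcc_frames):
    """Per-clip multinomial: how often each mixture component 'fires'."""
    comp = gmm.predict(mfcc_frames)                # hard assignment per frame
    hist = np.bincount(comp, minlength=gmm.n_components).astype(float)
    return hist / hist.sum()

def train_histogram_svm(train_clips, labels, gmm):
    """Chi-squared kernel between clip histograms feeds a precomputed-kernel SVM."""
    H = np.array([clip_histogram(gmm, c) for c in train_clips])
    K = chi2_kernel(H, H)
    return SVC(kernel="precomputed").fit(K, labels), H
```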
Latent Semantic Analysis (LSA)
• Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’
.. each clip may have several things going on
• Topic sets optimized through EM
p(ftr | clip) = ∑topics p(ftr | topic) p(topic | clip)
[Figure: matrix view of pLSA: the GMM-histogram-feature × AV-clip matrix p(ftr | clip) factors as p(ftr | topic) × p(topic | clip).]
use (normalized?) p(topic | clip) as per-clip features (a small EM sketch follows)
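A minimal numpy sketch of the pLSA EM updates on a (feature × clip) count matrix; the topic count and iteration budget are placeholders, not the 500-topic setup in the results.

```python
import numpy as np

def plsa(counts, n_topics=16, n_iter=100, seed=0):
    """pLSA by EM.  counts: (n_features, n_clips) GMM-histogram counts.
    Returns p(ftr | topic) and p(topic | clip)."""
    rng = np.random.default_rng(seed)
    n_f, n_c = counts.shape
    p_f_z = rng.random((n_f, n_topics)); p_f_z /= p_f_z.sum(0, keepdims=True)
    p_z_c = rng.random((n_topics, n_c)); p_z_c /= p_z_c.sum(0, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(topic | ftr, clip) for every cell
        joint = p_f_z[:, :, None] * p_z_c[None, :, :]          # (f, z, c)
        joint /= joint.sum(1, keepdims=True) + 1e-12
        weighted = counts[:, None, :] * joint                  # n(f,c) p(z|f,c)
        # M-step: re-estimate both factors from expected counts
        p_f_z = weighted.sum(2); p_f_z /= p_f_z.sum(0, keepdims=True) + 1e-12
        p_z_c = weighted.sum(0); p_z_c /= p_z_c.sum(0, keepdims=True) + 1e-12
    return p_f_z, p_z_c

# The columns of p_z_c are the per-clip topic weights used as classifier features.
```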
Background Classification Results: Average Precision (K Lee & Ellis ’10)
[Figure: per-concept Average Precision for Guessing, Single-Gaussian + KL2, 8-GMM + Bhattacharyya, and 1024-GMM Histogram + 500-topic log(P(z|c)/P(z)) classifiers over the 25 concept classes.]
• Wide range in performance:
  audio concepts (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes
Sound Texture Features
McDermott & Simoncelli ’09; Ellis, Zheng & McDermott ’11
• Characterize sounds by perceptually-sufficient statistics:
  automatic gain control → mel filterbank (18 chans) → subband envelopes
  envelope moments: mean, var, skew, kurtosis (18 × 4)
  cross-band envelope correlations (318 samples)
  modulation energy via FFT into octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 × 6)
  subband histograms; compare clips by Mahalanobis distance
• Subband distributions & envelope cross-correlations ... verified by matched resynthesis
[Figure: texture features for two clips, 1159_10 “urban cheer clap” vs. 1062_60 “quiet dubbed speech music”: mel spectrograms, per-band moments (M V S K), and modulation energies (0.5-32 Hz).]
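A rough sketch of these texture statistics, assuming librosa and scipy; the 18 mel bands and octave modulation bins follow the slide, but the automatic gain control and histogram stages are omitted, so treat it only as an approximation of the McDermott & Simoncelli statistics.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def texture_features(y, sr, n_mels=18, hop=256):
    """Subband-envelope moments, cross-band correlations, and modulation energies."""
    # Mel-filterbank magnitudes stand in for the subband envelopes.
    env = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop) ** 0.5        # (18, T)
    # Per-band moments: mean, variance, skewness, kurtosis (18 x 4).
    moments = np.stack([env.mean(1), env.var(1),
                        skew(env, axis=1), kurtosis(env, axis=1)], axis=1)
    # Cross-band envelope correlations (upper triangle of the 18 x 18 matrix).
    xcorr = np.corrcoef(env)[np.triu_indices(n_mels, k=1)]
    # Modulation energy: FFT of each envelope pooled into octave bins (Hz).
    env_sr = sr / hop
    spec = np.abs(np.fft.rfft(env - env.mean(1, keepdims=True), axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / env_sr)
    edges = [0.5, 1, 2, 4, 8, 16, 32]
    mod = np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(1)
                    for lo, hi in zip(edges[:-1], edges[1:])], axis=1)  # 18 x 6
    return np.concatenate([moments.ravel(), xcorr, mod.ravel()])
```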
Sound Texture Features
• Test on MED 2010 development data:
  10 specially-collected manual labels: clap, cheer, music, speech, dubbed, other, noisy, quiet, urban, rural
[Figure: normalized texture-feature means and standard deviations per label (moments M/V/S/K, modulation spectra “ms”, cross-band correlations “cbc”), and results for MVSK, ms, cbc feature sets and their combinations.]
• Contrasts in feature sets reflect correlation of the labels...
• Perform ~ same as MFCCs; combine well
3. Foreground Event Recognition
• Global vs. local class models (K Lee, Ellis, Loui ’10):
  tell-tale acoustics may be ‘washed out’ in whole-clip statistics
  try iterative realignment of HMMs (a simplified sketch follows below):
[Figure: YT baby 002 spectrogram labeled “voice baby laugh”. Old way: all frames contribute to every labeled class. New way: realignment limits each class model to its temporal extent (voice / baby / laugh / bg segments).]
  “background” model shared by all clips
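A simplified, frame-level sketch of the realignment idea, assuming scikit-learn GMMs; the actual system realigns HMMs with temporal constraints, so this only illustrates the loop of assigning frames to a clip's labeled classes (plus the shared background) and re-fitting each class model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def realign_models(clips, clip_labels, all_classes, n_iter=5, n_mix=8):
    """clips: list of (T_i, D) feature arrays; clip_labels: list of label sets.
    Returns per-class GMMs plus a shared 'bg' model."""
    def fit(X):
        return GaussianMixture(n_mix, covariance_type="diag").fit(X)
    # Initialize every class (and the background) from all frames of its clips.
    models = {cls: fit(np.vstack([c for c, ls in zip(clips, clip_labels) if cls in ls]))
              for cls in all_classes}
    models["bg"] = fit(np.vstack(clips))
    for _ in range(n_iter):
        assigned = {k: [] for k in models}
        for c, labels in zip(clips, clip_labels):
            cands = list(labels) + ["bg"]
            scores = np.stack([models[k].score_samples(c) for k in cands])
            winners = np.array(cands)[scores.argmax(0)]   # best model per frame
            for k in cands:
                assigned[k].append(c[winners == k])
        for k, chunks in assigned.items():
            chunks = [x for x in chunks if len(x)]
            if chunks and sum(len(x) for x in chunks) > n_mix:
                models[k] = fit(np.vstack(chunks))         # re-fit on its own frames
    return models
```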
Foreground Event HMMs (Lee & Ellis ’10)
• Training labels only at clip-level: 25 concepts + 2 global BGs
• Refine models by EM realignment
• Use for classifying entire video... or seeking to relevant part
[Figure: spectrogram of YT_beach_001.mpeg with manual clip-level labels “beach”, “group of 3+”, “cheer”; manual frame-level annotation (wind, drum, cheer, voice, water wave); and the Viterbi path through the 27-state Markov model over 10 s.]
Transient Features
Cotton, Ellis, Loui ’11
• Transients = foreground events?
• Onset detector finds energy bursts at the best SNR
• PCA basis to represent each 300 ms × auditory-frequency patch
• “bag of transients” (sketch below)
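A sketch of building a “bag of transients”, assuming librosa onset detection and scikit-learn PCA; the 300 ms patch length comes from the slide, while the mel patch representation and component count are placeholders.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def transient_patches(y, sr, patch_ms=300, n_mels=40, hop=256):
    """Cut a fixed-length mel-spectrogram patch starting at every detected onset."""
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop))
    onsets = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hop)  # frame indices
    width = int(patch_ms / 1000 * sr / hop)
    patches = [mel[:, t:t + width].ravel() for t in onsets
               if t + width <= mel.shape[1]]
    return np.array(patches)                       # (n_onsets, n_mels * width)

def bag_of_transients(clip_patches, n_components=20):
    """PCA basis over all patches; each clip = mean/std of its patch coefficients."""
    pca = PCA(n_components=n_components).fit(np.vstack(clip_patches))
    feats = [np.concatenate([pca.transform(p).mean(0), pca.transform(p).std(0)])
             for p in clip_patches]
    return pca, np.array(feats)
```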
Nonnegative Matrix Factorization
• Decompose spectrograms into templates + activations: X = W · H
  fast “forgiving” gradient-descent algorithm
  2D patches, sparsity control...
  (Smaragdis & Brown ’03; Abdallah & Plumbley ’04; Virtanen ’07)
[Figure: original mixture spectrogram (442-5478 Hz) and the activations of three learned bases (Basis 1 L2, Basis 2 L1, Basis 3 L1) over 10 s.]
(a multiplicative-update sketch follows below)
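A small numpy sketch of NMF on a magnitude spectrogram using the standard multiplicative (Lee-Seung, KL-divergence) updates; this is the generic X = W·H decomposition named on the slide, not the sparse or 2D-patch variants it cites.

```python
import numpy as np

def nmf_kl(X, n_templates=20, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative spectrogram X (freq x time) as W (freq x k) @ H (k x time)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, n_templates)) + eps
    H = rng.random((n_templates, T)) + eps
    for _ in range(n_iter):
        V = W @ H + eps
        H *= (W.T @ (X / V)) / (W.sum(0, keepdims=True).T + eps)
        V = W @ H + eps
        W *= ((X / V) @ H.T) / (H.sum(1, keepdims=True).T + eps)
        # Normalize templates, compensating in H so W @ H is unchanged.
        scale = W.sum(0, keepdims=True) + eps
        W /= scale
        H *= scale.T
    return W, H

# Usage sketch: X = np.abs(librosa.stft(y)); W, H = nmf_kl(X)
# Columns of W are spectral templates; rows of H are their activations over time.
```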
NMF Transient Features
• Learn 20 patches from Meeting Room Acoustic Event data (Cotton & Ellis ’11)
[Figure: the 20 learned spectro-temporal patches (each roughly 0.5 s × 442-3474 Hz).]
• Compare to MFCC-HMM detector: Acoustic Event Error Rate (AEER%) in noise
[Figure: AEER vs. SNR (10-30 dB) for MFCC, NMF, and combined systems.]
• NMF more noise-robust; combines well ...
4. Speech Separation
• Speech recognition is finding best-fit parameters: argmax_W P(W | X)
• Recognize mixtures with a Factorial HMM:
  model + state sequence for each voice/source
  exploit sequence constraints, speaker differences
[Figure: two source models’ state sequences evolving against the shared observations over time.]
  separation relies on detailed speaker models
Eigenvoices (Kuhn et al. ’98, ’00; Weiss & Ellis ’07, ’08, ’09)
• Idea: find a speaker-model parameter space that generalizes without losing detail?
• Eigenvoice model: μ = μ̄ + U w + B h
  adapted model = mean voice + eigenvoice bases × weights + channel bases × channel weights
[Figure: speaker models as points in the subspace spanned by the speaker (eigenvoice) bases.]
(a toy adaptation sketch follows below)
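A toy sketch of building eigenvoice bases and adapting to a new speaker, assuming each speaker-dependent model has been flattened into a mean “supervector” and that scikit-learn PCA supplies the bases U; the channel term B·h on the slide is omitted, and `target_supervector` is a hypothetical stand-in for the statistics real eigenvoice adaptation would accumulate via EM.

```python
import numpy as np
from sklearn.decomposition import PCA

def eigenvoice_basis(speaker_supervectors, n_bases=10):
    """Stack speaker-dependent supervectors and PCA them: mu ~ mu_bar + U^T w."""
    S = np.vstack(speaker_supervectors)        # (n_speakers, dim)
    pca = PCA(n_components=n_bases).fit(S)
    return pca.mean_, pca.components_          # mu_bar (dim,), U (n_bases, dim)

def adapt_speaker(mu_bar, U, target_supervector):
    """Least-squares weights w so that mu_bar + U^T w best matches the target
    (a hypothetical per-utterance estimate standing in for EM sufficient stats)."""
    w, *_ = np.linalg.lstsq(U.T, target_supervector - mu_bar, rcond=None)
    return mu_bar + U.T @ w, w                 # adapted supervector and weights
```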
Eigenvoice Bases
• Mean model: 280 states × 320 bins = 89,600 dimensions
• Eigencomponents shift formants / coloration
• Additional components for acoustic channel
[Figure: the mean voice model across the phone set (b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax) and eigenvoice dimensions 1-3, 0-8 kHz.]
Eigenvoice Speech Separation
• Factorial HMM analysis with tuning of source model parameters = eigenvoice speaker adaptation (Weiss & Ellis ’10)
Eigenvoice Speech Separation
• Eigenvoices for Speech Separation task:
  speaker-adapted (SA) performs midway between speaker-dependent (SD) & speaker-independent (SI)
[Figure: separation results for Mix, SI, SA, and SD conditions.]
Binaural Cues
• Model interaural spectrum of each source as stationary level and time differences
• e.g. at 75°, in reverb: IPD, IPD residual, ILD (a cue-extraction sketch follows below)
[Figure: interaural phase difference (IPD), IPD residual after removing the best time delay, and interaural level difference (ILD) for a source at 75° in reverberation.]
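A short sketch of extracting per-bin interaural cues from a stereo recording, assuming librosa STFTs; the “residual” here is just the IPD after removing one best-fit integer-sample delay, a simplification of the model on the slide.

```python
import numpy as np
import librosa

def interaural_cues(left, right, sr, n_fft=1024, hop=256):
    """Per time-frequency-bin IPD and ILD, plus the IPD residual after the best delay."""
    L = librosa.stft(left, n_fft=n_fft, hop_length=hop)
    R = librosa.stft(right, n_fft=n_fft, hop_length=hop)
    cross = L * np.conj(R)
    ipd = np.angle(cross)                                         # radians per bin
    ild = 20 * np.log10((np.abs(L) + 1e-9) / (np.abs(R) + 1e-9))  # dB per bin
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)[:, None]  # (F, 1)
    # Pick the integer-sample delay whose phase ramp best explains the IPD.
    delays = np.arange(-16, 17) / sr
    scores = [np.abs(np.sum(cross * np.exp(-2j * np.pi * freqs * d))) for d in delays]
    best = delays[int(np.argmax(scores))]
    residual = np.angle(cross * np.exp(-2j * np.pi * freqs * best))
    return ipd, ild, residual, best
```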
Model-Based EM Source Separation and Localization (MESSL)
Mandel & Ellis ’09
• EM loop: re-estimate source parameters ↔ assign spectrogram points to sources
  can model more sources than sensors
  flexible initialization
(a much-simplified sketch follows below)
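A much-simplified sketch of the MESSL loop, reusing the interaural_cues() sketch above; a scikit-learn GaussianMixture over crude per-bin delay estimates stands in for MESSL’s full interaural model, and its EM produces the soft source masks. Phase wrapping above roughly 1 kHz is ignored, so this only illustrates the assign/re-estimate alternation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def simple_messl_masks(ipd, freqs, n_sources=2):
    """Soft per-source time-frequency masks from EM over per-bin delay estimates.
    ipd: (F, T) interaural phase differences; freqs: (F,) bin frequencies in Hz."""
    f = np.maximum(freqs, 1.0)[:, None]            # avoid divide-by-zero at DC
    delay = ipd / (2 * np.pi * f)                  # crude per-bin delay estimate
    X = delay.reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_sources, covariance_type="full").fit(X)
    post = gmm.predict_proba(X)                    # E-step posteriors per point
    return [post[:, k].reshape(ipd.shape) for k in range(n_sources)]
```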
MESSL Results
• Modeling uncertainty improves results: tradeoff between constraints & noisiness
[Figure: output SNRs for the different model configurations, ranging from -2.72 dB up to 12.35 dB.]
MESSL with Source Priors
• Fixed or adaptive speech models as source priors (Weiss, Mandel & Ellis ’11)
MESSL-EigenVoice Results
• Source models function as priors
• Interaural parameters give spatial separation
  source model prior improves spatial estimate
[Figure: spectrograms (0-8 kHz, 0-1.5 s) of the separated target: Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL SP (10.01 dB), MESSL EV (10.37 dB).]
Noise-Robust Pitch Tracking
• Important for voice detection & separation
• Based on channel selection (Wu & Wang 2003; BS Lee & Ellis ’11):
  pitch from summary autocorrelation over “good” bands
  trained classifier decides which channels to include
• Improves over simple Wu criterion, especially for mid SNR (a channel-selection sketch follows below)
[Figure: mean BER (%) of pitch tracking vs. SNR (-10 to 25 dB) for the Wu baseline and the trained channel-selection variants.]
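A simplified sketch of channel-selection pitch tracking, assuming scipy bandpass filters; a fixed periodicity threshold stands in for the trained per-channel classifier described above, and the filterbank is ordinary Butterworth bands rather than the gammatone channels of Wu & Wang.

```python
import numpy as np
from scipy.signal import butter, lfilter

def summary_autocorr_pitch(y, sr, n_channels=20, frame=0.064, hop=0.010,
                           fmin=60.0, fmax=400.0, sel_thresh=0.5):
    """Frame-wise pitch from a summary autocorrelation over 'good' subband channels."""
    edges = np.geomspace(80.0, min(4000.0, 0.45 * sr), n_channels + 1)
    bands = [lfilter(*butter(2, [lo, hi], btype="band", fs=sr), y)
             for lo, hi in zip(edges[:-1], edges[1:])]
    N, H = int(frame * sr), int(hop * sr)
    lags = np.arange(int(sr / fmax), int(sr / fmin))
    pitches = []
    for start in range(0, len(y) - N, H):
        summary = np.zeros(len(lags))
        for band in bands:
            seg = band[start:start + N]
            ac = np.correlate(seg, seg, mode="full")[N - 1:]
            if ac[0] <= 0:
                continue
            ac /= ac[0]                            # normalized autocorrelation
            # Channel selection: keep channels with a strong periodicity peak
            # (threshold in place of the trained classifier).
            if ac[lags].max() > sel_thresh:
                summary += ac[lags]
        pitches.append(sr / lags[summary.argmax()] if summary.any() else 0.0)
    return np.array(pitches)                       # Hz per frame (0 = unvoiced)
```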
Noise-Robust Pitch Tracking
• Trained selection includes more off-harmonic channels
[Figure: example 08_rbf_pinknoise5dB: spectrogram (0-4 kHz) and per-channel correlograms (period 0-25 ms) showing channels selected by the Wu criterion vs. the trained classifier (CS-GT), over 4 s.]
5. Outstanding Issues
• Better object/event separation:
  parametric models
  spatial information?
  computational auditory scene analysis... (Barker et al. ’05)
[Figure: CASA grouping cues: frequency proximity, common onset, harmonicity.]
• Large-scale analysis
• Integration with video
Audio-Visual Atoms
Jiang et al. ’09
• Object-related features from both audio (transients) & video (region patches)
[Figure: visual atoms tracked over time paired with audio atoms from the soundtrack spectrum, feeding joint Audio-Visual Atom discovery.]
Audio-Visual Atoms
• Multi-instance learning of A-V co-occurrences:
  M visual atoms × N audio atoms → M×N candidate A-V atoms
  codebook learning over co-occurring pairs → joint A-V codebook
Audio-Visual Atoms
• Example discovered atoms: black suit + romantic music (Wedding); marching people + parade sound (Parade); sand + beach sounds (Beach)
[Figure: per-concept results for Audio only, Visual only, and Visual + Audio features.]
Summary
• Machine Listening:
Getting useful information from sound
• Background sound classification ... from whole-clip statistics?
• Foreground event recognition
... by focusing on peak energy patches
• Speech content is very important ... separate with pitch, models, ...
References
• Jon Barker, Martin Cooke, & Dan Ellis, “Decoding Speech in the Presence of Other Sources,” Speech Communication 45(1): 5-25, 2005.
• Courtenay Cotton, Dan Ellis, & Alex Loui, “Soundtrack classification by transient events,” IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, “Spectral vs. Spectro-Temporal Features for Acoustic Event Classification,” submitted to IEEE WASPAA, 2011.
• Dan Ellis, Xiaohong Zheng, & Josh McDermott, “Classifying soundtracks with audio texture features,” IEEE ICASSP, Prague, May 2011.
• Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, & Alex Loui, “Short-Term Audio-Visual Atoms for Generic Video Concept Classification,” ACM MultiMedia, 5-14, Beijing, Oct 2009.
• Keansub Lee & Dan Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug. 2010.
• Keansub Lee, Dan Ellis, & Alex Loui, “Detecting local semantic concepts in environmental sounds using Markov model based clustering,” IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
• Byung-Suk Lee & Dan Ellis, “Noise-robust pitch tracking by trained channel selection,” submitted to IEEE WASPAA, 2011.
• Andriy Temko & Climent Nadeu, “Classification of acoustic events using SVM-based clustering schemes,” Pattern Recognition 39(4): 682-694, 2006.
• Ron Weiss & Dan Ellis, “Speech separation using speaker-adapted Eigenvoice speech models,” Computer Speech & Lang. 24(1): 16-29, 2010.
• Ron Weiss, Michael Mandel, & Dan Ellis, “Combining localization cues and source model constraints for binaural source separation,” Speech Communication 53(5): 606-621, May 2011.
• Tong Zhang & C.-C. Jay Kuo, “Audio content analysis for on-line audiovisual data segmentation,” IEEE TSAP 9(4): 441-457, May 2001.