Video Scene Determination using Audiovisual Data Analysis
Multimedia Network Systems & Applications 2004
SeungMin Rho and EenJun Hwang
Graduate School of Information and Communication, Ajou University
Motivation
• Use of shot/scene change detection
  - Shot boundary is useful for video editing and more detailed analysis of video content
  - Scene change detection is very important for video indexing and retrieval
• Existing algorithms
  - Mostly based on visual information such as color, motion, edge features, etc.
  - Do not clearly distinguish between shot boundary detection and scene change detection
Our Goal
• Various audio feature analysis
  - Audio feature extraction
  - Classification of those features into 6 categories: silence, speech, music, speech w/ music, environmental sound, and speech w/ environmental sound
• Video scene classification using audio and visual information
  - Shot boundary detection – simultaneous change of color, motion, and audio characteristics
  - Scene determination – analyzed audio and video shots are classified into semantic scenes
Outline
• Related works
• Audio feature analysis
  - Short time average energy function
  - Average zero crossing rate
  - Energy distribution
  - Bandwidth
  - Harmonicity
• Scene determination
  - Video shot detection
  - Audio shot segmentation
  - Scene determination
• Experiments
• Conclusion and future works
Related Works
Audio Segmentation and Classification
Content-Based Audio Retrieval
Audio Scene Analysis (ASA)
Audio Analysis for Video Indexing
Integration of Audio and Visual Information for Video Segmentation and Indexing
System Overview
[Figure: system architecture. Audiovisual data is split into a video signal and an audio signal and passed to the Shot Analyzer (shot boundary detection, keyframe extraction, audio feature extraction, audio shot boundary detection). The Classifier performs classification of each segment and characterization using the energy function, average zero crossing rate, energy distribution, bandwidth, and harmonicity. Other components shown: Annotation tool, XML Database, Audiovisual Database.]
Audio Feature Analysis
1. Short time average energy function
2. Average zero crossing rate
3. Energy distribution
4. Bandwidth
5. Harmonicity
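The slides name the features without giving formulas. As a minimal sketch of the first two, the block below uses the common textbook definitions (mean squared amplitude per frame, and the fraction of adjacent sample pairs that change sign). The function names, the framing scheme, and the frame and hop sizes are assumptions for illustration, not taken from the slides.

```python
from typing import List, Sequence


def split_frames(signal: Sequence[float], size: int, hop: int) -> List[Sequence[float]]:
    """Split a signal into fixed-size frames; a trailing partial frame is dropped."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def short_time_energy(frame: Sequence[float]) -> float:
    """Average energy of one frame: mean of the squared sample values."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


if __name__ == "__main__":
    # Hypothetical toy signal: a loud alternating burst followed by near silence.
    signal = [1.0, -1.0, 1.0, -1.0, 0.01, 0.01, 0.01, 0.01]
    for frame in split_frames(signal, size=4, hop=4):
        print(short_time_energy(frame), zero_crossing_rate(frame))
```

Computing both values per frame gives two curves over time, which is the kind of input the thresholding on the next slide operates on.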
Audio Features Computation
[Flowchart: Audio Input → Compute Average Energy (Em) → Is Em > Te? → Compute Fundamental Frequency (Ff) and Compute Average Zero Crossing Rate (ZCR) → Is Ff > Tfreq? / Is ZCR > Tzcr? → Compute the Harmonicity. Outcomes: Silence, Speech, Music or Environmental Sound.]
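The flowchart's decision order can be sketched as a small threshold cascade. The branch wiring below is one plausible reading and is an assumption: low energy gives silence, and a frame is labelled speech only when both the Ff and ZCR tests pass. The function name, parameter names, and example threshold values are hypothetical.

```python
def classify_segment(em: float, ff: float, zcr: float,
                     te: float, tfreq: float, tzcr: float) -> str:
    """Threshold cascade over average energy, fundamental frequency and ZCR.

    Assumed wiring (not confirmed by the slides):
      Em <= Te                   -> silence
      Ff > Tfreq and ZCR > Tzcr  -> speech
      otherwise                  -> music or environmental sound
    """
    if em <= te:
        return "silence"
    if ff > tfreq and zcr > tzcr:
        return "speech"
    # The slide computes harmonicity at this point; how it separates music
    # from environmental sound is not shown, so the combined label is kept.
    return "music or environmental sound"


if __name__ == "__main__":
    # Hypothetical thresholds chosen only to exercise each branch.
    print(classify_segment(em=0.001, ff=0.0, zcr=0.0, te=0.01, tfreq=80.0, tzcr=0.1))
    print(classify_segment(em=0.5, ff=120.0, zcr=0.3, te=0.01, tfreq=80.0, tzcr=0.1))
    print(classify_segment(em=0.5, ff=40.0, zcr=0.05, te=0.01, tfreq=80.0, tzcr=0.1))
```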
Audio Feature Analysis (1)
[Figure: audio waveform and spectrogram of (a) speech and (b) speech w/ music]
Audio Feature Analysis (2)
[Figure: audio waveform and spectrogram of (a) music and (b) environmental sound]
Scene Determination Process (1)
[Flowchart: Audiovisual Data → Demultiplexer → Audio Signal and Video Signal → Shot Analyzer → Audio Shots and Video Shots → Scene Determination Process]
Scene Determination Process:
1. Find the candidate scene boundary
2. Compare the candidate audio shots and adjust candidate video shots to the starting time of the closer audio shots
3. Merge the consecutive shots
4. Scene boundary is determined
Scene Determination Process (2)
Step 1:
  If ( t(CSv_i) = t(CSa_j) ) then
    Candidate scene boundary is detected; go to step 3
  else
    Go to step 2
Step 2:
  diff1 = Diff(t(CSv_i), t(CSa_j)), diff2 = Diff(t(CSv_i), t(CSa_{j+1}))
  If ( diff1 < diff2 ) then candidate shot boundary is adjusted to t(CSa_j)
  If ( diff1 ≥ diff2 ) then candidate shot boundary is adjusted to t(CSa_{j+1})
  Then go to step 3

CSv_i = candidate video shot boundaries (i = 1, …, n)
CSa_j = candidate audio shot boundaries (j = 1, …, m)
t(CS) = starting time of a candidate shot boundary
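Steps 1 and 2 amount to snapping each candidate video boundary to the closer of its two neighbouring audio boundaries. The sketch below assumes that CSa_j and CSa_{j+1} are the audio boundaries immediately before and after the video boundary, and that a video boundary outside the audio range snaps to the nearest end. Both are assumptions, as is the function name.

```python
from bisect import bisect_left
from typing import List


def align_video_to_audio(video_times: List[float], audio_times: List[float]) -> List[float]:
    """Adjust each candidate video boundary to the closer audio boundary."""
    audio = sorted(audio_times)
    if not audio:
        raise ValueError("at least one audio boundary is required")
    aligned = []
    for tv in video_times:
        k = bisect_left(audio, tv)  # audio[k-1] < tv <= audio[k]
        if k < len(audio) and audio[k] == tv:
            aligned.append(tv)          # Step 1: boundaries coincide
        elif k == 0:
            aligned.append(audio[0])    # before the first audio boundary
        elif k == len(audio):
            aligned.append(audio[-1])   # after the last audio boundary
        else:
            diff1 = tv - audio[k - 1]   # distance to the earlier audio boundary
            diff2 = audio[k] - tv       # distance to the later audio boundary
            # Step 2: earlier boundary if strictly closer, otherwise the later one
            aligned.append(audio[k - 1] if diff1 < diff2 else audio[k])
    return aligned


if __name__ == "__main__":
    print(align_video_to_audio([1.0, 2.4, 5.0], [1.0, 2.0, 3.0, 6.0]))
```

Ties go to the later audio boundary, matching the diff1 ≥ diff2 branch of step 2.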
Scene Determination Process (3)
Step 3:
  Max1 = Max(F(t(CSa))) between t(CSv_i) and t(CSv_{i+1})
  Max2 = Max(F(t(CSa))) between t(CSv_{i+1}) and t(CSv_{i+2})
  If ( dist(Max1, Max2) ≤ Tf ) then
    Merge the consecutive video shots (CSv_i and CSv_{i+1}) and adjust a candidate shot boundary to t(CSv_{i+1})
  else
    Scene boundary is determined

F(t(CSa)) = { Silence, Speech, Music, Speech w/ Music, Environmental Sound, Speech w/ Environmental Sound }
Max = maximum value of the percentages of candidate audio shots within a candidate video shot
dist(Max1, Max2) = Euclidean distance between the maximum values returned by the Max functions
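A literal reading of step 3 can be sketched as follows. The assumptions are: audio shots are given as (start, end, class) segments, Max is the largest per-class share of a video shot's duration in percent, dist is the absolute difference of two such values (Euclidean distance in one dimension), and each adjacent pair of the original video shots is compared independently. The names and the example threshold are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start time, end time, audio class)


def class_percentages(start: float, end: float, segments: List[Segment]) -> Dict[str, float]:
    """Percentage of the video shot [start, end] covered by each audio class."""
    totals: Dict[str, float] = {}
    for seg_start, seg_end, label in segments:
        overlap = min(end, seg_end) - max(start, seg_start)
        if overlap > 0:
            totals[label] = totals.get(label, 0.0) + overlap
    return {label: 100.0 * t / (end - start) for label, t in totals.items()}


def max_percentage(start: float, end: float, segments: List[Segment]) -> float:
    """Max: the largest class percentage within one video shot."""
    return max(class_percentages(start, end, segments).values(), default=0.0)


def merge_shots(boundaries: List[float], segments: List[Segment], tf: float) -> List[float]:
    """Drop the boundary between adjacent shots whose Max values differ by at most tf."""
    scenes = [boundaries[0]]
    for i in range(1, len(boundaries) - 1):
        max1 = max_percentage(boundaries[i - 1], boundaries[i], segments)
        max2 = max_percentage(boundaries[i], boundaries[i + 1], segments)
        if abs(max1 - max2) > tf:  # dissimilar shots: scene boundary is kept
            scenes.append(boundaries[i])
    scenes.append(boundaries[-1])
    return scenes


if __name__ == "__main__":
    audio = [(0, 18, "speech"), (18, 20, "silence"), (20, 23, "music"), (23, 30, "speech")]
    print(merge_shots([0, 10, 20, 30], audio, tf=15.0))
```

Because this reading compares only the maximum percentages, two shots dominated by different classes could still be merged; checking that the dominant classes match would be a natural refinement, but the slides do not state it.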
Scene Determination
Experiments
• Video data
  - 6 sample videos: movies, TV commercials, news
  - Captured at 320x240 pixels, 24 bits, 30 fps
• Audio data
  - 16-bit stereo
  - Sampled at 44.1 kHz
• Measurement
Scene Detection Rate by audiovisual features

Sample          Shot  Scene  Correct  Miss  Fault  Precision  Recall
TV Commercials & News
VSample1          85     22       20     2      1       0.95    0.91
VSample2          54     16       13     3      2       0.87    0.81
VSample3          18     10        8     2      1       0.89    0.80
Average                                                 0.90    0.84
Movies
VSample4          68     26       23     3      1       0.96    0.88
VSample5         159     41       34     7      1       0.97    0.83
VSample6         243     67       61     6      2       0.97    0.91
Average                                                 0.97    0.87
Total average                                           0.93    0.86
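The slides do not spell out the measurement formulas, but every row of the table is consistent with precision = Correct / (Correct + Fault) and recall = Correct / (Correct + Miss). The short sketch below recomputes two rows under that assumption; the function names are illustrative.

```python
def precision(correct: int, fault: int) -> float:
    """Share of detected scene boundaries that are correct."""
    return correct / (correct + fault)


def recall(correct: int, miss: int) -> float:
    """Share of actual scene boundaries that were detected."""
    return correct / (correct + miss)


if __name__ == "__main__":
    # (name, correct, miss, fault) taken from the table rows
    rows = [("VSample1", 20, 2, 1), ("VSample6", 61, 6, 2)]
    for name, correct, miss, fault in rows:
        print(name, round(precision(correct, fault), 2), round(recall(correct, miss), 2))
```

Under the same reading, the Scene column equals Correct + Miss for every sample (for example 20 + 2 = 22 for VSample1).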
Details of integrated evaluation
Each cell: Correct / Miss / Fault / Precision / Recall

Sample     ①                  ②                   ③                   ④ (①+②)             ⑤ (①+③)            ⑥ (①+②+③)
TV Commercial & News
VSample1   14/8/4/0.78/0.64   6/16/13/0.32/0.27   8/14/7/0.53/0.36    15/7/6/0.71/0.68    19/3/3/0.86/0.86   20/2/1/0.95/0.91
VSample2   10/6/3/0.77/0.63   4/12/7/0.36/0.25    5/11/6/0.46/0.31    11/5/5/0.69/0.69    12/4/3/0.8/0.75    13/3/2/0.87/0.81
VSample3   6/4/2/0.75/0.6     2/8/3/0.4/0.2       3/7/4/0.43/0.3      6/4/4/0.6/0.6       8/2/2/0.8/0.8      8/2/1/0.89/0.8
Movies
VSample4   17/9/4/0.81/0.65   8/18/3/0.73/0.31    10/16/13/0.43/0.38  19/7/5/0.79/0.73    21/5/3/0.88/0.81   23/3/1/0.96/0.88
VSample5   28/13/6/0.82/0.68  11/30/12/0.48/0.27  16/25/24/0.4/0.39   30/11/16/0.65/0.73  33/8/4/0.89/0.8    34/7/1/0.97/0.83
VSample6   47/20/9/0.84/0.7   24/43/19/0.56/0.36  21/46/33/0.39/0.31  51/16/24/0.68/0.76  58/9/4/0.94/0.87   61/6/2/0.97/0.91
Conclusion & Future Work
• A scheme for scene change determination based on the integration of audio and video information is proposed
• Various useful audio features are discussed, along with a method for classifying semantic scenes by analyzing audio and video data together
• Future work
  - Better audio features for scene classification
  - Better integration of audio/visual information for classification
Q&A
Thank You!!! Any Questions?
Visit our homepage if you need additional information: http://adtl.ajou.ac.kr