Goal Event Detection in Soccer Videos via Collaborative Multimodal Analysis

Alfian Abdul Halin1* and Mandava Rajeswari2

1 Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia
2 School of Computer Sciences, Universiti Sains Malaysia, 11800 Minden, Penang, Malaysia
ABSTRACT
Keywords: indexing, webcasting text
INTRODUCTION

Technological advances have greatly enhanced the broadcast, capture, transfer and storage of digital video (Tjondronegoro et al.). The resulting growth of digital video repositories has spurred interest in automatic indexing and retrieval techniques, especially those that cater for content-based semantics.
Restricting the domain being addressed is, to some extent, imperative in order to bridge the semantic gap between low-level features and high-level semantic concepts. The sports domain in particular has been used to extract important semantic concepts, such as serves and rallies in tennis (Huang et al.), and to support applications such as posterity logging (Assfalg et al.). The long running time of matches and the sparseness of event occurrences further complicate matters, since traditional manual browsing and annotation become impractical.
RELATED WORKS

A great body of literature has been dedicated to event or highlight detection in soccer, as well as in other sports (Jinjun et al.). In many of these works, detection is carried out using supervised learning algorithms, which discover the audiovisual patterns that accompany specific events. Some events, however, are not detected due to their feature patterns being less prominent during event occurrences. Unsupervised algorithms, on the other hand, face the challenge of recognizing event patterns from the majority of non-event segments. As for supervised learning algorithms, their robustness may be questionable since, for some events, representative training examples are scarce.
CONTRIBUTIONS

Instead of relying solely on features directly extracted from the video, an external textual resource was utilized to initiate event detection. Problems such as the huge and asymmetric search space are solved by utilizing the minute-by-minute webcasting text, which provides detailed and reliable annotations of a match's progression; from it, two crucial cues (namely, the event name and its minute time-stamp) are extracted, as by Changsheng et al. Moreover, all audiovisual considerations were made solely within each video itself, without relying on any pre-trained models; this study analyzed the visual and aural information only from the particular video under analysis. All the audiovisual considerations and assumptions are uncomplicated and therefore able to generalize across different broadcasters.
FRAMEWORK FOR GOAL EVENT DETECTION
Video Pre-processing
Shot Boundary Detection
Using an existing shot boundary detection algorithm, each video V is first segmented into m shots, represented as V = {S_1, S_2, ..., S_m}. Each shot is then classified as containing either far views or close-up views.
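To make the segmentation step concrete, here is a minimal sketch that uses a simple colour-histogram cut detector as a stand-in for the shot boundary detection algorithm the paper relies on; the threshold and histogram settings are illustrative assumptions.

```python
# Minimal sketch of the shot segmentation step V = {S_1, ..., S_m}, using a
# simple colour-histogram cut detector (a stand-in, not the paper's method).
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        hist = cv2.normalize(hist, None).flatten()
        # A large Bhattacharyya distance between consecutive frames
        # signals a hard cut, i.e. the start of a new shot.
        if prev_hist is not None:
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries  # frame indices where shots S_1, ..., S_m begin
```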
Each frame within a shot is labelled through the following steps:

1. Dominant Hue Detection. The peak index of the frame's hue histogram, idx_peak, is computed. If hues outside the green range of the playing field are detected as dominant, an immediate close-up view label can be assigned, since this highly suggests the frame is not dominated by the pitch.
2. If idx_peak falls within the green range, the pitch region is determined with the optimal value for the saturation channel, after which morphological image processing and connected components analysis are applied (Halin et al.).
3. Larger remaining objects indicate close-up views, whereas smaller objects indicate far views. This avoids relying on the grass-pixel ratio alone, which, as suggested in Halin et al., can be unreliable. Each shot is finally labelled as either a close-up or a far view based on the majority voting of all the frame labels within it. A code sketch of these steps is given below.
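The following is a minimal per-frame sketch of the three steps using OpenCV. The green hue range, saturation/value bounds, morphology kernel and area-ratio threshold are illustrative assumptions rather than the paper's tuned values; the shot label is the majority vote over its frame labels.

```python
# Sketch of per-frame far-view / close-up labelling via dominant hue and
# connected components analysis. All numeric parameters are assumptions.
from collections import Counter

import cv2
import numpy as np

GREEN_HUE = (35, 85)  # assumed OpenCV hue range of the pitch

def label_frame(frame_bgr, area_ratio_thresh=0.05):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Step 1: dominant hue = peak index of the hue histogram.
    hist = cv2.calcHist([hsv], [0], None, [180], [0, 180]).ravel()
    idx_peak = int(np.argmax(hist))
    if not (GREEN_HUE[0] <= idx_peak <= GREEN_HUE[1]):
        return "close-up"  # frame is not dominated by the pitch
    # Step 2: mask pitch pixels (assumed saturation/value bounds), then
    # clean the mask with morphological opening.
    mask = cv2.inRange(hsv, (GREEN_HUE[0], 40, 40), (GREEN_HUE[1], 255, 255))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    # Step 3: the size of the largest non-pitch object decides the label;
    # large objects (players shown up close) imply a close-up view.
    n, _, stats, _ = cv2.connectedComponentsWithStats(cv2.bitwise_not(mask))
    biggest = max((stats[i, cv2.CC_STAT_AREA] for i in range(1, n)), default=0)
    return "close-up" if biggest / float(mask.size) > area_ratio_thresh else "far"

def label_shot(frames):
    """Majority vote of the frame labels within one shot."""
    return Counter(label_frame(f) for f in frames).most_common(1)[0][0]
```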
Textual Cues Utilization
Two crucial cues are extracted from the webcasting text source, namely, the event name and its minute time-stamp.
Goal Event Keyword Matching and Time Stamp Extraction
Goal events are identified by matching the webcasting text against the keyword set

G = {goal!, goal by, scored, scores, own goal, convert}    (1)

Each match yields a goal event g_i together with the minute time-stamp t_i^g of each of the i detected events. These can be written as a set T^g = {t_i^g}, where i = 1, ..., N^g. Then, for each i, the goal event search is initiated within the one-minute segment corresponding to each t_i^g.
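A minimal sketch of this keyword matching follows, assuming webcasting-text lines of the form "56' Goal by ..." (the exact line format is an assumption about the text source).

```python
# Sketch of matching webcasting-text lines against the keyword set G and
# extracting the minute time-stamps T^g.
import re

G = ["goal!", "goal by", "scored", "scores", "own goal", "convert"]

def extract_goal_timestamps(webcast_lines):
    """Return T^g, the minute time-stamps t_i^g of lines matching G."""
    stamps = []
    for line in webcast_lines:
        m = re.match(r"\s*(\d+)'?\s+(.*)", line)
        if m and any(k in m.group(2).lower() for k in G):
            stamps.append(int(m.group(1)))
    return stamps

# extract_goal_timestamps(["56' Goal by A. Striker", "58' Yellow card"])
# -> [56]
```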
Text-Video Synchronization and Event Search Localization

Mapping the time-stamp t_i^g to the corresponding video frames can be erroneous due to the misalignment between t_i^g and its corresponding video time.
To synchronize, a reference time t_ref and its corresponding video frame f_ref are first identified. The values t_ref and f_ref can then be used to localize the event search to the one-minute eventful segment. With t_i^g being the minute time-stamp of the goal event, the beginning (f_{i,begin}^g) and ending (f_{i,end}^g) frames of the segment are computed as:

f_{i,begin}^g = f_ref + fr × ((t_i^g − 1) × 60 − t_ref)    (2)

f_{i,end}^g = f_{i,begin}^g + 60 × fr    (3)

where fr is the video frame rate. Note that for f_{i,begin}^g, the time-stamp t_i^g (after being converted to seconds) is subtracted by one minute, i.e. t_i^g − 1 is used, since an event reported at minute t_i^g occurs between minutes t_i^g − 1 and t_i^g; f_{i,end}^g is then set one minute after f_{i,begin}^g.
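Equations (2) and (3) translate directly into code. The sketch below assumes t_ref is expressed in seconds of match time and f_ref is its frame index:

```python
# Sketch of equations (2) and (3): the reference pair (t_ref in seconds,
# f_ref as a frame index) anchors text time to video time.
def localize_segment(t_g_minutes, t_ref, f_ref, fr):
    """Return (f_begin, f_end) of the one-minute search window."""
    seconds = (t_g_minutes - 1) * 60        # event lies within minute t_i^g
    f_begin = int(f_ref + fr * (seconds - t_ref))
    f_end = f_begin + int(60 * fr)          # one minute after f_begin
    return f_begin, f_end

# With kick-off at t_ref = 0 s mapped to f_ref = 1500 and fr = 25 fps:
# localize_segment(56, 0, 1500, 25) -> (84000, 85500)
```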
Candidate Shortlist Generation
The goal segment is searched for within [f_{i,begin}^g, f_{i,end}^g]. Across different broadcasters (including the footage used in this paper), three generic visual-related properties can be observed during goal events. These properties were exploited to decompose the one-minute segment into a shortlist of candidate segments; among them:

1. The camera transitions from a far view to a close-up view;
2. Close-up views during goals normally last at least 6 seconds.

Since the search has already been localized to the one-minute eventful segment, detecting other events is very unlikely. A code sketch of this decomposition follows equation (4) below.
Based on these properties, an n-number of candidate segments is generated:

C_i^g = {c_{ik}^g}, for k = 1, ..., n    (4)

where C_i^g is the set containing the shortlisted candidates, and c_{ik}^g is the k-th candidate segment within C_i^g.
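The decomposition referenced above can be sketched as follows, using the per-frame far/close-up labels from the pre-processing stage; the 6-second rule follows property 2, while the names and structure are illustrative:

```python
# Sketch of decomposing the localized window into candidate segments c_ik^g:
# keep every far-to-close-up transition whose close-up run lasts >= 6 s.
def shortlist_candidates(frame_labels, f_begin, fr, min_closeup_s=6):
    """frame_labels: 'far'/'close-up' per frame of the one-minute window."""
    candidates, i, n = [], 1, len(frame_labels)
    while i < n:
        if frame_labels[i - 1] == "far" and frame_labels[i] == "close-up":
            j = i
            while j < n and frame_labels[j] == "close-up":
                j += 1
            if (j - i) >= min_closeup_s * fr:          # property 2
                candidates.append((f_begin + i, f_begin + j - 1))
            i = j
        else:
            i += 1
    return candidates  # the set C_i^g as (start_frame, end_frame) pairs
```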
Candidate Ranking

At this stage, we have obtained the candidate segments c_{ik}^g, where one of them is the actual goal segment; an audio-based measurement is therefore needed from each c_{ik}^g. The pitch, or the fundamental frequency f_0, of an audio signal is reliable for detecting excited human speech. The subharmonic-to-harmonic ratio pitch determination algorithm, whose implementation is called shrp.m, was chosen as it managed to accurately capture the average f_0 measurement within each candidate. The rule being applied here is that the candidate containing the goal event will cause commentator excitement, raising the f_0 values across its audio frames and leading to a high average f_0. The candidate c* with the maximum average f_0 among all c_{ik}^g is therefore selected as the goal segment.
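A sketch of this ranking rule follows. The paper computes the average f_0 with the subharmonic-to-harmonic ratio tracker (shrp.m); librosa's YIN pitch tracker and the 80-400 Hz search range are substituted here purely for illustration.

```python
# Sketch of the ranking rule: pick the candidate with the maximum average
# f0. YIN is a stand-in for the paper's shrp.m pitch tracker.
import librosa
import numpy as np

def rank_candidates(audio_path, candidates, fr, sr=22050):
    """candidates: (start_frame, end_frame) pairs; returns index of c*."""
    y, sr = librosa.load(audio_path, sr=sr)
    mean_f0 = []
    for f_start, f_end in candidates:
        s0, s1 = int(f_start / fr * sr), int(f_end / fr * sr)
        f0 = librosa.yin(y[s0:s1], fmin=80, fmax=400, sr=sr)
        mean_f0.append(float(np.mean(f0)))  # average f0 over audio frames
    return int(np.argmax(mean_f0))          # top-ranked candidate c*
```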
EXPERIMENTAL RESULTS AND DISCUSSION
The experiments were conducted on a set of soccer matches from different broadcasters.

[Table 1: The matches (teams) used in the experiments.]
Note that for the following sub-sections of Candidate Shortlist Generation and Candidate Ranking, the evaluation criteria used are precision and recall. Their definitions are adapted (notably in Sub-section Candidate Shortlist Generation) to cater for each of these contexts, which will be further explained in detail within each of the respective sub-sections.
Frame subsets from different matches were used to demonstrate the robustness of the shot classification algorithm across broadcasts. For precision and recall, true positives, false positives and false negatives are explained supposing that the positive class being predicted is a far view:

True positive: a frame predicted as a far view, when the actual class is indeed a far view;
False positive: a frame predicted as a far view, when the actual class is a close-up view;
False negative: a frame predicted as a close-up view, when the actual class is a far view.
Precision = #true positives / (#true positives + #false positives)    (5)

Recall = #true positives / (#true positives + #false negatives)    (6)
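As a worked example of (5) and (6) with illustrative counts:

```python
# Worked example of equations (5) and (6); the counts are illustrative.
def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

# 950 far views labelled correctly, 20 close-ups mislabelled as far views,
# 35 far views mislabelled as close-ups:
# precision_recall(950, 20, 35) -> (0.9794..., 0.9644...)
```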
The results are encouraging, where very high precisions and recalls were obtained for both shot types.

[Table 2: Shot classification results per shot type (number of shots, precision and recall). Averages of 98.27% precision and 96.27% recall were obtained for one shot type, and 91.65% precision and 96.49% recall for the other.]
Candidate Shortlist Generation

For evaluating the generated candidates c_{ik}^g, precision and recall are adapted as follows. Relevant refers to the number of candidate segments generated that actually contain the goal event, whereas Retrieved refers to the total number of candidates generated based on the shortlisting properties:

Precision = #(relevant ∩ retrieved) / #retrieved    (7)

Recall = #(relevant ∩ retrieved) / #relevant    (8)

From the results, it can be observed that the Average Number of Candidates per Shortlist is small. Imperfect precision is tolerable because, during the subsequent ranking stage (Sub-section Candidate Ranking), the actual segment can still be retrieved without the need for a perfectly precise shortlist. There must, however, be no missed-recall cases, since it is mandatory that an actual goal event segment be present within each of the generated shortlists.
[Table 3: Candidate shortlist generation results: precision, recall, and the average number of candidates per shortlist.]
Candidate Ranking

Table 4 summarizes the ranking results. For each match, n represents the number of candidate segments c_{ik}^g generated for the i-th goal time-stamp. The average f_0 of each k-th candidate (k = 1, ..., n) is recorded in the sub-columns of column 5, where the boldface numeric values indicate the maximum average f_0 for that time-stamp, which is the top-ranked candidate.
[Table 4: Candidate ranking results per match: each goal time-stamp t_i^g, the number of candidates n, and the average f_0 of every candidate c_{ik}^g, with the maximum (top-ranked) value in boldface.]
COMPARISON
The proposed approach was compared against a previous method that requires detecting a replay shot, where the replay must directly follow the close-up shot. Note that the approach proposed in this paper only considers two shot labels, the far view and the close-up view, and does not depend on replay detection. Both approaches were able to obtain good detection results, although the definition of a highlight can be subjective and depend on viewers' preferences (Changsheng et al.).
The comparison was carried out over the 5 matches shown in Table 1.

[Table 5: Comparison over the 5 test matches: the ground-truth number of goal events against the number detected by the proposed method and by the compared technique, with the corresponding precision.]
CONCLUSION
In this work, the webcasting text was used to distinctly identify event occurrences and to localize the video search space to only relevant and eventful one-minute segments. Within these segments, straightforward visual cues generated a shortlist of candidate segments, and an aural cue based on commentator pitch ranked them to select the goal event.
REFERENCES

Online, simultaneous shot boundary detection and key frame extraction for sports videos using rank tracing. Computer Vision and Image Understanding, 92.
Sport news images […].
[…]. IEEE Transactions on Multimedia, 10.
[…]. IEEE Transactions on Multimedia, 10.
[…]. Journal of Information Science and Engineering, 24.
[…]. IEEE Transactions on Multimedia, 8.
Unsupervised soccer video abstraction based on pitch, dominant color and camera motion analysis. […].
[…]. IEEE Transactions on Image Processing, 12.
Soccer video summarization using enhanced logo detection. […].
[…] sports video. Expert Systems with Applications, 36.
Sports highlight detection from keyword sequences using HMM. […].
[…]. IEEE Transactions on Circuits and Systems for Video Technology, 14.
Hierarchical temporal association mining for video event detection in video databases. […].
[…]. IEEE Signal Processing Magazine, 23.
Audio-visual football video analysis, from structure detection to attention analysis. […].
[…]. IEEE Transactions on Circuits and Systems for Video Technology, 15.
A decision tree-based multimodal data mining framework for soccer goal detection. […].
The authoring metaphor to machine understanding of multimedia. […].
Time interval maximum entropy based event indexing in soccer video. […].
Live Match […].
Content-based video indexing for sports applications using multi-modal approach. ACM Transactions on Multimedia Computing, Communications, and Applications, 4.
UEFA Champions League, Match Season 2011.
Algorithms and system for segmentation and structure analysis in soccer video. […].
Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. […].
[…]. IEEE Signal Processing Magazine, 17.
Goal event detection in broadcast soccer videos by combining heuristic rules with unsupervised fuzzy c-means algorithm. […].
[…]. IEEE Transactions on Circuits and Systems for Video Technology, 17.