
Event Detection in Soccer Videos through Text-based Localization and Audiovisual Analysis

1Alfian Abdul Halin, 2Mozaherul Hoque Abul Hasanat, 3Mandava Rajeswari
1 Faculty of Computer Science & Information Technology, Universiti Putra Malaysia, [email protected]
2 Asia E-University, Malaysia, [email protected]
3 School of Computer Sciences, Universiti Sains Malaysia, [email protected]
1 First and Corresponding Author

Abstract
This paper presents a framework for soccer event detection through joint textual, aural and visual feature analysis. Firstly, textual cues from online sporting resources are used to significantly reduce and localize the event search space. Then, analysis is performed based on generic rule-sets imposed on specific audiovisual feature properties to isolate the most compressed view of the events. Experiments conducted on 30 hours of soccer videos from various broadcasters show encouraging results for the detection of goals, penalties, yellow cards, red cards and substitutions.

Keywords: Soccer Event Detection, Sports Video Analysis, Semantic Multimedia Analysis

1. Introduction
Recent advances in technology have caused a boom in digital video consumption. The need to effectively organize huge video repositories has spurred interest in automated processing and analysis techniques for indexing and retrieval. Sports video has attracted widespread attention, with works focusing on the annotation and extraction of semantics such as objects, events and highlights from various sporting domains. Domains that have been tackled include soccer [1], tennis [2], American football [3] and volleyball [5].
Sports can naturally be decomposed into events. Interesting events such as tries in rugby, jumping smashes in badminton, and goals in soccer are often the high points of a particular game. Such events are also more easily recalled, as they are memorable and meaningful compared to other semantic concepts such as objects and locations. Furthermore, viewers' interest levels are normally retained only for short periods during event occurrences. Therefore, detecting interesting events is useful as they can serve as semantic indices for organization and retrieval.
Sports event detection, however, is not clear-cut. It requires effective mapping of specific low-level features to the desired events. The semantic gap between machines and humans makes it impractical to develop generic detection approaches across sporting domains, as such a feat would require huge, if not infinite, amounts of sporting knowledge [18]. This problem can partially be alleviated by restricting the domain being addressed. Since different sports have different regulations, dynamics, playfield geometry and events, domain knowledge of a specific sport can be used to approximate the most accurate projections of features to semantics. Domain restriction also allows for more specific event sets to be identified.
In this paper, we propose a framework for soccer video event detection. Generally, textual cues are used together with audiovisual processing and analysis to locate sub-segments containing the desired events. The evolutions of the audiovisual feature properties before, during and after events are scrutinized to construct the most suitable model for event detection.

International Journal of Digital Content Technology and its Applications (JDCTA), Volume 6, Number 15, August 2012. doi:10.4156/jdcta.vol6.issue15.20



2. Related Work
Soccer event detection is commonly tackled through two main steps. Firstly, the relevant audiovisual feature content evolutions during events are identified. This is followed by constructing event models based on these feature evolutions, for the final task of event detection.

2.1. Semantic Feature Extraction
Mid or semantic-level features (SLF) have recently been preferred over their low-level counterparts. This is because of the simplicity offered by SLFs: instead of being represented as arrays of real-valued numbers, such features are represented using high-level domain concepts. In soccer, SLFs can mainly be divided into two groups: visual and aural. Popular examples of visual SLFs are the semantic shot classes pertaining to the current camera shooting style [4, 5], objects such as persons and the ball [6], playfield positions and player deployments [7, 8], and camera motion parameters [7]. Aural SLFs, on the other hand, relate to referee sounds such as long and short whistles, and commentator or crowd sounds such as excited/plain commentator speech and cheering [9, 10]. Utilizing SLFs greatly simplifies event model construction, since it avoids reliance on multidimensional arrays of numerical values.

2.2. Event Modeling
Event models are normally constructed using either predefined rule-sets or supervised learning algorithms. In [1, 18, 11], templates following IF-THEN structures were used. For example, video segments containing replay shots lasting between 4 to 15 seconds were deemed to contain a foul event. In [10], if the aural SLFs of excited commentator speech and crowd sounds were simultaneously present, then shots on goal and saves were considered detected. Alternatively, graphical models such as Finite State Machines (FSM) are also used, where event detection is treated as a logical rule fulfillment process. When a series of SLFs are successfully observed, an event is considered to have occurred [7]. From the supervised learning perspective, models are built based on supervised classification. Given the labeled training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the algorithm learns a function $f: X \rightarrow Y$ predicting future/unseen input/output mappings [12]. The inputs (i.e., $x_i$) are normally SLF observations, whereas the output is the desired event class. Supervised classification is effective when sufficient amounts of labeled training instances are available. Examples of popular algorithms are Hidden Markov Models [13], Dynamic Bayesian Networks [14] and Support Vector Machines [15].
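For illustration, a minimal sketch of the IF-THEN rule style is given below. The SLF records, field names and thresholds are hypothetical and do not reproduce the exact rules of the cited systems.

```python
# Minimal sketch of IF-THEN event rules over semantic-level features (SLFs).
# Field names and thresholds are illustrative only.

def is_foul_candidate(shot):
    """Rule: a replay shot lasting between 4 and 15 seconds suggests a foul."""
    return shot["is_replay"] and 4.0 <= shot["duration_s"] <= 15.0

def is_shot_on_goal_candidate(audio_window):
    """Rule: excited commentator speech together with crowd cheering."""
    return audio_window["excited_speech"] and audio_window["crowd_cheer"]

shots = [
    {"id": 7, "is_replay": True, "duration_s": 9.2},
    {"id": 8, "is_replay": False, "duration_s": 21.0},
]
print([s["id"] for s in shots if is_foul_candidate(s)])  # -> [7]
```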

3. Problem Statement
3.1. Issues Regarding SLF Identification
Having various SLFs can be very useful for elaborate event models. However, the inference process for various SLF classes can be cumbersome, involving multiple steps of audiovisual feature processing and analysis. In [4], 17 concepts were inferred, including team players, camera shooting styles and replay shots, using a hierarchical classification tree. In [16], six playfield-related concepts were generated through projection profile analysis, edge detection, and color, shape and texture analyses. Hierarchical SVMs were further used together with heuristic rules to obtain each position's class label. The works in [10, 13] inferred audio SLFs by training 3-level SVMs to obtain excited and plain speech classes. These examples demonstrate that when many SLF classes are considered, the inference processes can be complicated, requiring examination at various levels. This can result in misclassifications such as those reported in [10], where the error rates were between 20% and 26%. Wrongly generated SLFs can be detrimental, as the proper feature-to-semantic mapping might not be properly represented.



3.2. Issues Regarding Event Modeling
Existing rule-based approaches wade through the entirety of the video when searching for eventful segments. Conditions set on specific audiovisual feature properties, however, might not be exclusive to one event alone [17]. When similarities occur between different events, false positives can be recorded, resulting in low detection precision. Moreover, defining strict rule-sets across all events is impractical, especially when a wide range of events is to be detected. Therefore, in unconstrained search spaces, rule-sets imposed on audiovisual feature properties alone are unreliable.
In soccer, non-events such as throw-ins, goal-kicks and dribbling outnumber interesting events by a large margin [5, 18]. This asymmetry translates to a lack of positive training examples for proper classifier construction. Works in [7, 18, 29] stress the importance of sufficient labeled training data; however, due to the scarcity of events, these examples are difficult to obtain. Even if examples were available, each would need to be labeled, requiring cumbersome manual labor and risking labeling errors. Classifier training, moreover, can be time consuming, where tweaking of various parameters is necessary [19]. All these factors make supervised learning methods inconvenient.

4. The Proposed Framework
We present a framework to detect five soccer events: goals, penalties, yellow cards, red cards and substitutions. We opt for a rule-based approach because our dataset consists of unlabeled video sequences. Since standard labeled examples are unavailable, and manual labeling is virtually prohibitive, supervised classification is not considered. Moreover, rule-based approaches are still worth investigating since previous techniques mostly performed the search over the entirety of a match video. In our work, we believe that a generic rule-set, if applied within a reduced and localized search space, is sufficient for event detection. This is because the search environment becomes less cluttered with non-events, so events can exhibit more pronounced feature properties. We use audio and visual features that can be reliably generated/computed. Hence, the likelihood of obtaining erroneous features is greatly minimized.

Figure 1. Minute-by-minute reporting from ESPN

4.1. Feature Considerations
4.1.1. Textual Modality
We use online minute-by-minute (MBM) reports of soccer matches as the textual feature. MBMs can be obtained freely online from broadcasters such as ESPN and the BBC, or information providers such as Sportinglife and UEFA. MBMs are game-logs prepared using a syntax pertaining to specific soccer lingo.
1 http://soccernet.espn.go.com/
2 http://newsimg.bbc.co.uk/
3 http://www.sportinglife.com/football/live match/200111.html
4 http://www.uefa.com/



They record the progressive minute-by-minute happenings of a match. A tabular style is often used, where each row reports the time (i.e., minute) and the description of what happened during that minute, each in a separate column. An example is shown in Figure 1. We extract two cues from each match MBM, namely the event name and its minute time-stamp.

4.1.2. Aural Modality
Audio energy pertaining to audience sound is expected to rise dramatically during events. From [20], audience reactions (e.g., cheering and chanting) can be represented using the twelve Mel-frequency Cepstral Coefficients (MFCC) of an audio signal. Selecting the most appropriate coefficient is therefore crucial, since each coefficient, or a combination of them, can be used to represent different audio properties. We use the first MFCC coefficient, i.e. the log-energy (le) [21], as it can reliably represent crowd cheering. It is also insensitive towards noise and speech variations. The formula for calculating the MFCC coefficients is given in Eq. 1 [22]:

$$c_i = \sum_{j=1}^{J} \log(S_j) \cos\left[\frac{i\pi}{J}\left(j - \frac{1}{2}\right)\right], \quad i = 1, \ldots, L \qquad (1)$$

Here, $J$ is the number of sub-bands and $L$ is the length of the cepstrum. $S_j$, for $0 < j \leq J$, is the filter bank energy after the $j$-th triangular band-pass filtering, which is a spectrum envelope scaled according to the mel-frequency. In this work, $i = 1$ as only the first coefficient is considered.

4.1.3. Visual Modality
A shot is a sequence of frames taken by one continuous camera action [23, 24]. Shots are preferred for video processing since they are more tractable compared to individual frames. A shot's semantic class is its label coinciding with the current camera shooting style. For the purpose of this work, two shot classes are considered as the visual modality, namely the far and close up-view shots (Figures 2a and 2b, respectively).

Figure 2. a) A close up-view class and b) a far-view class.

5. Framework Components
The framework comprises four components: Video Preprocessing; Textual Cue Processing and Utilization; Audio-based Event Detection; and Visual-based Refinement. Basically, an event is detected only if its keywords or keyword combinations are found within a match MBM. Once found, the search space for the event is localized based on its minute time-stamp. Aural analysis is then performed to obtain a rough estimate of where the event took place in the video. Semantic shot classes are finally analyzed to obtain the frame range encapsulating the most compressed view of the event. Figure 3 shows the general flow of the proposed framework.




5.1. Video Preprocessing - Shot Boundary Detection and Semantic Shot Classification
Shot Boundary Detection (SBD) is firstly performed using the algorithm in [25] to segment the video into shots. Each shot is then assigned a semantic class label through Semantic Shot Classification (SSC) using the algorithm in [23]. Each shot is consequently labeled as either a far or close up-view shot. Table 1 shows an example output after SBD and SSC have been applied.

Figure 3. Block diagram of the proposed framework

Table 1. Example output after SBD and SSC
Shot Number | Start Frame | End Frame | Shot Class
1           | 1           | 29        | far
2           | 30          | 40        | close up
3           | 41          | 62        | close up
…           | …           | …         | …
495         | 41530       | 41601     | far
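The SBD and SSC output can be kept as a simple shot table like Table 1. A minimal sketch (hypothetical structure and example values) is shown below, together with the frame-to-shot lookup that the visual refinement step in Section 5.4 relies on.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    number: int
    start_frame: int
    end_frame: int
    label: str          # "far" or "close up"

shots = [Shot(1, 1, 29, "far"), Shot(2, 30, 40, "close up"), Shot(3, 41, 62, "close up")]

def shot_at(frame, shots):
    """Return the shot containing the given video frame, or None if outside all shots."""
    for s in shots:
        if s.start_frame <= frame <= s.end_frame:
            return s
    return None

print(shot_at(35, shots).label)   # -> "close up"
```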

5.2. Textual Processing and Utilization
Firstly, a match MBM is searched for the occurrence of any event keyword(s). Secondly, if a match is found, the time-stamp is extracted from the MBM and used for event search localization.

5.2.1. Event Keyword Matching and Time-stamp Extraction
Event occurrences are commonly indicated by the presence of specific keywords or keyword combinations. For example, the keyword(s) "goal by" or "scored" are used for the goal event. Identifying all the relevant keywords is therefore necessary to account for every event occurrence. After careful scrutiny of various MBM sources, we defined the keywords and keyword combinations in Table 2 for all the considered events. Notice that some consist of single keywords, whereas others contain expressions such as 'converts [a-z][a-z][a-z] penalty' for penalties, and '[(]in[)]' and '[(]out[)]' for substitutions.



Table 2. Keywords and keyword combinations for each event
Event        | Keyword(s)
Goal         | goal!, goal by, scored, scores, own goal, convert
Penalty      | penalty spot, power penalty, placed penalty, penalty kick, penalty-kick, converts [a-z][a-z][a-z] penalty
Yellow Card  | yellow, yellow card, booking, booked, bookable, caution, cautioned
Red Card     | dismissed, sent-off, sent off, sending off, red card, red-card, sees red, second booking, second bookable
Substitution | substitution, replaced by, comes on for, replaces, [(]in[)], [(]out[)], on for, in for

Each element in the set E = {g, p, y, r, s} is used to represent one of the events, with g, p, y, r and s being abbreviations for goal, penalty, yellow card, red card and substitution, respectively. Events are detected one at a time. To detect an event e ∈ E that occurs n times in a match video, its keyword(s) are searched for within the MBM. The time-stamp t_k of the k-th occurrence of e is recorded whenever a keyword match is found. Each time-stamp is stored in the set T_e = {t_1, ..., t_n}, where n > 0 if at least one instance of event e occurs. Then, for each k-th occurrence of e, the event search is initiated within the one-minute segment defined by t_k.
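A minimal sketch of this keyword matching step is given below. It assumes the MBM has already been parsed into (minute, description) rows, and it uses only a subset of the Table 2 keywords, rewritten as Python regular expressions.

```python
import re

# Subset of Table 2, written as Python regular expressions (matched case-insensitively).
EVENT_PATTERNS = {
    "goal":         [r"goal!", r"goal by", r"scored", r"scores", r"own goal"],
    "penalty":      [r"penalty kick", r"penalty-kick", r"converts [a-z]+ penalty"],
    "yellow_card":  [r"yellow card", r"booked", r"caution"],
    "red_card":     [r"sent off", r"red card", r"second bookable"],
    "substitution": [r"replaced by", r"comes on for", r"\(in\)", r"\(out\)"],
}

def event_timestamps(mbm_rows, event):
    """Return the minute time-stamps T_e of every MBM row matching event e."""
    patterns = [re.compile(p, re.IGNORECASE) for p in EVENT_PATTERNS[event]]
    return [minute for minute, text in mbm_rows
            if any(p.search(text) for p in patterns)]

mbm = [(23, "GOAL! Rooney scores from close range"), (57, "Giggs is replaced by Park (in)")]
print(event_timestamps(mbm, "goal"))          # -> [23]
print(event_timestamps(mbm, "substitution"))  # -> [57]
```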

Figure 4. Reference frame and elapsed game-time determination

5.2.2. Text-Video Synchronization and Search Space Localization
Directly mapping t_k to the corresponding video frame can be erroneous, since actual kickoff (i.e., the start of the soccer game, as initiated by the referee) might occur later than when the video starts playing. This delay can be caused by reasons such as team introductions and opening commentaries. Hence, synchronization between t_k and the corresponding video frame is necessary to allow the event search boundary to be accurately localized. Consequently, for each match video, we perform a manual step: a reference frame is identified and matched with its corresponding elapsed game time. This is shown in Figure 4, where the elapsed game time of 50 seconds is noted to correspond with the 1013-th frame of the match video. These values are denoted as t_ref (in seconds) and f_ref, respectively. Both are used to localize the search space to within the one-minute segment denoted by the time-stamp. With t_k being the event's minute time-stamp, the beginning (f_{b,k}) and ending (f_{e,k}) frames of the search space are determined through Eqs. 2 and 3.

$$f_{b,k} = f_{ref} + \left\lfloor f_r \left( (t_k - 1) \cdot 60 - t_{ref} \right) \right\rfloor \qquad (2)$$

$$f_{e,k} = f_{ref} + \left\lfloor f_r \left( t_k \cdot 60 - t_{ref} \right) \right\rfloor \qquad (3)$$



where f_r is the video frame rate and ⌊·⌋ rounds calculations down to the nearest integer. For f_{b,k}, the time t_k × 60 (in seconds) is reduced by 60 seconds, since the actual event occurs between minutes t_k − 1 and t_k. Also, for f_{e,k}, t_k is multiplied by 60 (seconds) so that the end boundary is one minute after f_{b,k}. Finally, the one-minute search space is defined as in Eq. 4. This represents the reduced and localized search space.

$$S_k = \left[ f_{b,k}, f_{e,k} \right] \qquad (4)$$
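The localization arithmetic can be sketched as follows, using Eqs. 2-4 as reconstructed above and the example values from Figure 4 (reference game time of 50 s at frame 1013).

```python
import math

def localize_search_space(t_k, t_ref, f_ref, f_r):
    """Return the one-minute search space [f_b, f_e] (video frames) for an event
    reported at minute t_k, given the reference game time t_ref (s) at frame f_ref."""
    f_b = f_ref + math.floor(f_r * ((t_k - 1) * 60 - t_ref))   # Eq. 2
    f_e = f_ref + math.floor(f_r * (t_k * 60 - t_ref))         # Eq. 3
    return f_b, f_e                                            # Eq. 4

# Example: an event reported at minute 23, reference time 50 s at frame 1013, 15 fps.
print(localize_search_space(23, 50, 1013, 15))   # -> (20063, 20963), i.e. 900 frames = 60 s
```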

5.3. Audio-based Event Detection
This component identifies a rough estimate (sub-segment) of where event e has occurred. The gist is to locate the sub-segment with the highest audio energy, implying the loudest crowd reactions. This is done by calculating and analyzing the log-energy (le) within sub-segments. The two steps involved are explained in the following sections.

5.3.1. Calculating the First MFCC Coefficient
For audio calculations to be done, the audio channel is firstly divided into smaller units called (audio) frames. This is necessary to obtain a pseudo-stationary measurement of the audio signal at a particular instance in time. The one-minute segment S_k is partitioned into non-overlapping 40-millisecond frames, with a time-step of the same duration (from our observations, this duration is able to capture the best snapshot of the audio signal). This partitioning results in 1,500 frames within the entire 60-second segment.

Figure 5. Before (a) and after (b) applying threshold

5.3.2. Peak Energy Determination
Events normally cause a dramatic rise in audio energy. Therefore, segments containing le peaks are identified. The local maxima (i.e., peaks) of the audio signal within S_k are determined by comparing each frame measurement to its neighboring values. If a particular measurement is larger than both of its neighbors, it is deemed a local peak.



We use the peak determination algorithm in [26]. The algorithm returns the peak values as a vector P. A threshold θ is used to determine which audio peaks are considered potentially eventful. We set the value of θ = 0.98 × max(P). Consequently, values of P ≥ θ are retained, and set to zero otherwise. An illustration of this process is shown in Fig. 5.

5.3.3. Audio Signal Partitioning and Maximum Energy Segment Determination
For further analysis, S_k is divided into overlapping 12-second partitions. It was observed that events can be fully viewed from conception to finish within a 12-second window. Hence, N = 9 partitions are generated, with an overlap of 6 seconds between partitions. An illustration is given in Figure 6. Consequently, each 12-second partition consists of 300 audio frames. Each partition p_i, for i = 1, ..., N, can be considered a member of the set {p_1, ..., p_N}. The frame ranges of the partitions, considering their overlap, can be represented using the matrix M as in Eq. 5, where all elements in column 1 are the beginning frame indices and elements in column 2 are the corresponding ending frame indices of each partition.

Figure 6. S_k divided into N = 9 overlapping partitions

$$M = \begin{bmatrix} 1 & 300 \\ 150 & 450 \\ 300 & 600 \\ 450 & 750 \\ 600 & 900 \\ 750 & 1050 \\ 900 & 1200 \\ 1050 & 1350 \\ 1200 & 1500 \end{bmatrix} \qquad (5)$$

Each partition is ultimately characterized by the sum of all the le values within the frame range defined by its indices in M, which can be written as Eq. 6:

$$E(p_i) = \sum_{j = M_{i,1}}^{M_{i,2}} le(j) \qquad (6)$$

where j is the current audio frame index from S_k and p_i is the current partition, for i = 1, ..., N. The eventful segment is assumed to be the partition exhibiting the strongest le sum.



The final eventful segment p* is therefore calculated as in Eq. 7. The corresponding beginning and ending audio frame indices of p* are duly noted as a*_b and a*_e.

$$p^{*} = \arg\max_{p_i,\; i = 1, \ldots, N} E(p_i) \qquad (7)$$

Consequently, the beginning and ending times (in seconds) within the actual video segment can be determined by Eqs. 8 and 9, respectively, with reference to the event time-stamp t_k and the reference time t_ref.

$$t^{*}_{b} = (t_k - 1) \cdot 60 - t_{ref} + \frac{a^{*}_{b}}{25} \qquad (8)$$

$$t^{*}_{e} = (t_k - 1) \cdot 60 - t_{ref} + \frac{a^{*}_{e}}{25} \qquad (9)$$

As a result, the range [t*_b, t*_e] is obtained, which corresponds to the sub-segment with the highest le sum. Note that a*_b and a*_e are both divided by 25 to convert them from audio frame counts (40 milliseconds each) into seconds.
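As a rough end-to-end illustration of this component, the sketch below strings together the peak retention of Section 5.3.2 and the partition selection of Eqs. 5-9 as reconstructed above. scipy's find_peaks is used here as a stand-in for the MATLAB findpeaks routine referenced by the paper, and the log-energy vector is synthetic.

```python
import numpy as np
from scipy.signal import find_peaks

def retain_strong_peaks(le, ratio=0.98):
    """Keep only local le peaks with value >= ratio * max(peak values); zero elsewhere."""
    peaks, _ = find_peaks(le)                    # indices of local maxima
    kept = np.zeros_like(le)
    if peaks.size:
        theta = ratio * le[peaks].max()          # threshold from Section 5.3.2
        strong = peaks[le[peaks] >= theta]
        kept[strong] = le[strong]
    return kept

def eventful_times(le, t_k, t_ref):
    """Select the 12-second partition with the largest le sum and map it to video time (s)."""
    le = retain_strong_peaks(le)
    # Eq. 5: beginning/ending audio-frame indices of the nine overlapping partitions.
    starts = np.concatenate(([1], np.arange(150, 1201, 150)))
    ends = np.arange(300, 1501, 150)
    # Eq. 6: le sum per partition (paper indices are 1-based, numpy slicing is 0-based).
    sums = np.array([le[b - 1:e].sum() for b, e in zip(starts, ends)])
    i_star = int(np.argmax(sums))                # Eq. 7
    a_b, a_e = starts[i_star], ends[i_star]
    t_b = (t_k - 1) * 60 - t_ref + a_b / 25.0    # Eq. 8 (25 audio frames per second)
    t_e = (t_k - 1) * 60 - t_ref + a_e / 25.0    # Eq. 9
    return t_b, t_e

le = np.random.rand(1500)                        # 1,500 audio frames in the one-minute segment
print(eventful_times(le, t_k=23, t_ref=50))
```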

Figure 7. The updated event video frame range

5.4. Visual-based Refinement
After identifying p*, the frame boundaries encapsulating the eventful sub-segment need to be identified. We observe that events are normally accompanied by specific semantic shot transitions (SST). Directly after an interesting event, the camera will shift from a far-view to a close up-view. Besides presumably being done to indicate a game break, this transition is able to show the players'/referee's/crowds' reactions towards the event. Therefore, to successfully capture the entire and most compressed visual evolution of the event, such SST patterns need to be identified.
The temporal information (in seconds) obtained from [t*_b, t*_e] is used to map p* to the corresponding video frame boundaries. However, there is no guarantee that this range will contain a transition from a far-view to a close up-view (although most likely it will). To ensure that an event can be viewed in its entirety, after mapping the time (in seconds) to the corresponding video frames, a further refinement step is performed to ensure that the starting point is a far-view. Mapping the temporal information into video frame numbers can simply be done by multiplying both t*_b and t*_e by the video frame rate f_r, where we use 15 fps. For notational purposes, this can be written as Eqs. 10 and 11.

$$v^{*}_{b} = t^{*}_{b} \times f_r \qquad (10)$$

$$v^{*}_{e} = t^{*}_{e} \times f_r \qquad (11)$$

where v*_b and v*_e are the corresponding beginning and ending video frames, respectively. We further apply a visual refinement step to ensure viewing begins with a far-view. If v*_b does not belong to a far-view shot, a backward search is performed until a far-view shot is found. Once found, the last six-second segment of that shot is taken as the head of the eventful sub-segment. The ensuing close up-view shot is then included as the tail of the sub-segment, where its first six-second segment is taken. In case v*_b already belongs to a far-view shot, the transition point with the following close up-view shot is searched for. Once found, similarly, the last six-second segment of the far-view shot is taken as the head, and the tail is the first six-second segment of the ensuing close up-view shot. An illustration is provided in Figure 7. Consequently, the beginning and ending frames are updated, yielding the final sub-segment expected to contain the most compressed view of the event e.
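A rough sketch of this refinement is given below, assuming the shot table of Section 5.1 is available as a list of labeled frame ranges and a frame rate of 15 fps; the head/tail construction follows the description above, and the shot values are illustrative.

```python
FPS = 15
SIX_SECONDS = 6 * FPS   # six seconds of video at 15 fps

def refine_boundaries(v_b, shots):
    """Snap the event sub-segment onto a far-view -> close up-view transition.

    shots: list of {"start": ..., "end": ..., "label": "far"|"close up"} in temporal order.
    Returns (head_start, tail_end): the last six seconds of the far-view shot and the
    first six seconds of the ensuing close up-view shot, in video frames.
    """
    idx = next(i for i, s in enumerate(shots) if s["start"] <= v_b <= s["end"])
    while shots[idx]["label"] != "far" and idx > 0:      # backward search for a far-view shot
        idx -= 1
    far = shots[idx]
    head_start = max(far["start"], far["end"] - SIX_SECONDS + 1)
    nxt = shots[idx + 1] if idx + 1 < len(shots) else far
    tail_end = min(nxt["end"], nxt["start"] + SIX_SECONDS - 1)
    return head_start, tail_end

shots = [{"start": 1, "end": 400, "label": "far"},
         {"start": 401, "end": 700, "label": "close up"}]
print(refine_boundaries(350, shots))   # -> (311, 490)
```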

6. Experimental Results
Firstly, the keyword matching component is evaluated to determine whether all events are accounted for. Note that this component is crucial since missed keywords mean missed events. Secondly, the effectiveness of the peak log-energy assumption and the visual-based refinement process are evaluated to ensure eventful sub-segments have been correctly identified. The framework was implemented using MATLAB R2007a and tested on a video dataset of 21 matches spanning approximately 30 hours from the Barclays Premier League, Champions League, Italian Serie-A and Spanish La-Liga. All videos were in AVI format. MP3-mono was used to encode the audio channel, with a sampling rate of 22.5 kHz and a bit-rate of 32 kbps. Note that for this work, each match half was separately processed and stored as a different video file. Non-game footage such as commercial breaks, half-time commentaries and match highlights was also omitted.

6.1. Keyword Matching
The measurements of precision and recall are used for evaluation purposes, calculated using Eqs. 12 and 13, respectively.

$$Precision = \frac{Detected}{Detected + False} \qquad (12)$$

$$Recall = \frac{Detected}{Detected + Missed} \qquad (13)$$

Detected refers to the number of correctly identified keywords. False refers to the number of keywords found when the event did not occur, and Missed refers to keywords that failed to be detected even when the event actually occurred.

Table 3. Keyword matching results
Event         | Ground Truth | Detected | False | Missed | Precision | Recall
Goals         | 60           | 60       | 0     | 0      | 100.00%   | 100.00%
Penalties     | 6            | 6        | 0     | 0      | 100.00%   | 100.00%
Yellow Cards  | 76           | 76       | 0     | 0      | 100.00%   | 100.00%
Red Cards     | 3            | 3        | 0     | 0      | 100.00%   | 100.00%
Substitutions | 100          | 106      | 6     | 0      | 94.64%    | 100.00%

The results for event detection via keyword matching are shown in Table 3. Overall precision for all events is 100%, except for substitutions at 94.64%. This means that for goals,



penalties, yellow cards and red cards, all of the actual event occurrences were accounted for within the set of detected keywords. Substitutions suffered slightly since 6 of the 106 detections were not actual instances of the event. Recall, however, is perfect for all events. This translates to zero misses, which is crucial since it is undesirable to miss any occurring events.

6.1.1. Discussion
The perfect recall for all events in Table 3 can be attributed to the manner in which the events are annotated in the MBMs, where a very distinctive soccer vocabulary is used. For example, 'goal!' and 'scored' are unique to the goal event, and it is impossible to relate them to other events. The case is similar for all the other events. The exclusiveness of the event annotation terms makes it possible for such a simple implementation of keyword matching to work extremely well. Another helpful property is the structure of the annotations. The MBMs considered in this work employ an annotation style in which the terminology used is very specific, and slang or informal language is avoided. This allows keywords to be easily found. It is believed, however, that if an unstructured annotation style were used, the simple keyword matching approach would not perform well, especially when random keyword combinations are used.

6.2. Peak Log-energy Assumption and Visual-based Refinement
Table 4 reports the accuracy (Eq. 14) after applying the threshold θ and performing the visual-based refinement steps, compared against the ground truth.

$$Accuracy = \frac{Correct}{Ground\ Truth} \qquad (14)$$

Correct refers to the total number of correctly identified event segments after the Visual-based Refinement step. From Table 4, all the red card and penalty segments were correctly identified. Substitution segments were also well detected at 85%. Goals and yellow cards, however, reported slightly lower accuracies of 78.33% and 78.95%, respectively.

Table 4. Accuracy score for all events
Event         | Ground Truth | Correct | Missed | Accuracy
Goals         | 60           | 47      | 13     | 78.33%
Penalties     | 6            | 6       | 0      | 100.00%
Yellow Cards  | 76           | 60      | 16     | 78.95%
Red Cards     | 3            | 3       | 0      | 100.00%
Substitutions | 100          | 85      | 15     | 85.00%

6.2.1. Discussion
We provide some discussion on the causes of the missed detections and lower accuracy measurements, based on close observation of the results.
1. Events occurring in the same minute: Occasionally, two events occur within the same minute. For example, when looking for a goal event, a serious foul committed before/after the goal may result in a yellow card, and the crowd reaction to the foul can be stronger than that to the goal. Similarly, goals occurring before/after a yellow card within the same minute can be missed for the same reason.
2. Exceptional circumstances: Usually, a yellow card is given immediately after a serious foul. In some cases, however, the card was given later. One yellow card was awarded almost one minute after the offense due to an injury. Other cards were awarded for off-the-ball incidents, such as players using inappropriate language. These normally cause less intense crowd reactions and were hence missed.
3. Noise: Although rare, unwanted noise occurred during some goal event minutes. The microphone was too close to the crowd and consequently recorded abnormally high le readings that were more pronounced than those recorded when the goals occurred. In other cases, goal and yellow card segments were not properly identified because other segments contained loud clanking noises.



Table 5. Confusion matrix for all events (rows: actual events; columns: detected segments)
Actual \ Detected | Goals | Penalties | Yellow Cards | Red Cards | Substitutions
Goals             | 47    | 3         | 2            | 0         | 0
Penalties         | 0     | 6         | 0            | 0         | 0
Yellow Cards      | 0     | 1         | 60           | 0         | 2
Red Cards         | 0     | 0         | 0            | 3         | 0
Substitutions     | 0     | 0         | 0            | 0         | 85

The confusion matrix in Table 5 provides insight into the mistaken detections. As shown, 3 penalty and 2 yellow card segments were identified when the actual objective was to detect a goal. This was due to them occurring within the same minute as the goal event while exhibiting more pronounced audio energy. The case was similar for 3 wrongly identified yellow cards, which in actuality were 1 penalty event and 2 substitution events, respectively. From the results in Table 4, some events, such as red cards and penalties, can be reliably identified using the aural modality. This is because these events are always accompanied by loud crowd cheer and often dominate a large part of the one-minute segment. The other events, however, report lower accuracy. It was observed that the missed detections occurred when two events existed within the same minute.

7. Conclusions and Future Directions
The proposed framework is able to detect various soccer events by imposing a generic set of rules on specific audiovisual properties. Despite the simplicity of the framework, many events were detected with relatively high accuracy. The textual cues were especially important, as the time-stamp in particular enabled audiovisual analyses to be performed within localized video segments, as opposed to wading through the entirety of the video. However, since simple keyword matching was used, we believe the approach could perform badly when unstructured textual sources are considered. Due to this, investigation into learning-based text analysis methods, such as fuzzy grammar analysis [27], might be useful. The reported missed detections were mainly due to unpredictable circumstances, such as two events occurring within the same minute, or the presence of noise. Future work might hence consider semi-supervised approaches, since labeled training examples are difficult to come by. This might lessen the burden of manually labeling all the training data, while being robust enough for effective event detection.

8. References
[1] M. Zameni, M. Fathy, and A. Sadri, "A Low Cost Algorithm for Expected Goal Events Detection in Broadcast Soccer Video", International Journal of Digital Content Technology and its Applications, vol. 4, no. 8, pp. 118-125, 2010.
[2] Y.-P. Huang, C.-L. Chiou, and F. E. Sandnes, "An intelligent strategy for the automatic detection of highlights in tennis video recordings", Expert Systems with Applications, vol. 36, no. 6, pp. 9907-9918, 2009.
[3] N. Babaguchi, Y. Kawai, T. Ogura, and T. Kitahashi, "Personalized abstraction of broadcasted American football video by highlight selection", IEEE Transactions on Multimedia, vol. 6, no. 4, pp. 575-586, 2004.
[4] M. Kolekar, K. Palaniappan, S. Sengupta, and G. Seetharaman, "Semantic concept mining based on hierarchical event detection for soccer video indexing", Journal of Multimedia, vol. 4, no. 5, pp. 298-312, 2009.
[5] C. Min, C. Shu-Ching, and S. Mei-Ling, "Hierarchical temporal association mining for video event detection in video databases," In Proceedings of the 23rd IEEE International Conference on Data Engineering Workshop, pp. 137-145, 2007.



[6] D. A. Sadlier and N. E. O'Connor, "Event detection in field sports video using audio-visual features and a support vector machine," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 10, pp. 1225-1233, 2005.
[7] M. Bertini, A. Del Bimbo, and W. Nunziati, "Highlights modeling and detection in sports videos," Pattern Analysis & Applications, vol. 7, no. 4, pp. 411-421, 2004.
[8] A. Ekin, A. M. Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization," IEEE Transactions on Image Processing, vol. 12, no. 7, pp. 796-807, 2003.
[9] X. Min, X. Changsheng, D. Lingyu, S. J. Jesse, and L. Suhuai, "Audio keywords generation for sports video analysis," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 4, no. 2, pp. 1-23, 2008.
[10] X. Min, N. C. Maddage, X. Changsheng, M. Kankanhalli, and T. Qi, "Creating audio keywords for event detection in soccer video," In Proceedings of the International Conference on Multimedia and Expo, vol. 2, pp. 281-284, 2003.
[11] M. Y. Eldib, B. Zaid, H. M. Zawbaa, M. El-Zahar, and M. El-Saban, "Soccer video summarization using enhanced logo detection," In Proceedings of the 16th IEEE International Conference on Image Processing, pp. 4345-4348, 2009.
[12] F. Camastra and A. Vinciarelli, Machine Learning for Audio, Image and Video Analysis: Theory and Applications - Advanced Information and Knowledge Processing, 1st ed., 2008.
[13] B. Min, J. Kim, C. Choe, H. Eom, and R. McKay, "A compound framework for sports results prediction: A football case study," Knowledge-Based Systems, vol. 21, no. 7, pp. 551-562, 2008.
[14] H. Chung-Lin, S. Huang-Chia, and C. Chung-Yuan, "Semantic analysis of soccer video using dynamic Bayesian network," IEEE Transactions on Multimedia, vol. 8, no. 4, pp. 749-760, 2006.
[15] L. Ballan, M. Bertini, A. Del Bimbo, and G. Serra, "Video event classification using string kernels," Multimedia Tools and Applications, vol. 48, pp. 69-87, 2010.
[16] K. Yu Lin, L. Joo-Hwee, T. Qi, M. S. Kankanhalli, and C. S. Xu, "Visual keywords labeling in soccer video," In Proceedings of the 17th International Conference on Pattern Recognition, vol. 3, pp. 850-853, 2004.
[17] W. Jinjun, "Content-based sports video analysis and composition," Ph.D. dissertation, Nanyang Technological University, School of Computer Engineering, 2006.
[19] R. Ren, "Audio-visual football video analysis, from structure detection to attention analysis," Ph.D. dissertation, University of Glasgow, 2008.
[20] F. Coldefy and P. Bouthemy, "Unsupervised soccer video abstraction based on pitch, dominant color and camera motion analysis," In Proceedings of the 12th Annual ACM International Conference on Multimedia, New York, NY, USA, pp. 268-271, 2004.
[21] C. Chih-Chieh and H. Chiou-Ting, "Fusion of audio and motion information on HMM-based highlight extraction for baseball games," IEEE Transactions on Multimedia, vol. 8, no. 3, pp. 585-599, 2006.
[22] L. Feng, A. Nielsen, and L. Hansen, "Vocal segment classification in popular music," In Proceedings of the 9th International Conference of Music Information Retrieval, pp. 121-126, 2008.
[23] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, 1993.
[24] A. A. Halin, M. Rajeswari, and D. Ramachandram, "Shot view classification for playfield-based sports video," In Proceedings of the IEEE International Conference on Signal and Image Processing Applications, pp. 410-414, 2009.
[25] J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang, "A formal study of shot boundary detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 2, pp. 168-186, 2007.
[26] W. Abd-Almageed, "Online, simultaneous shot boundary detection and key frame extraction for sports videos using rank tracing," In Proceedings of the 15th IEEE International Conference on Image Processing, pp. 3200-3203, 2008.
[27] M. Brookes, "Findpeaks - Voicebox," 2005. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/doc/voicebox/findpeaks.html [accessed 21 April 2011].
[28] N. Sharef, T. Martin, and Y. Shen, "Order independent incremental evolving fuzzy grammar fragment learner," In Proceedings of the 9th International Conference on Intelligent Systems Design and Applications, pp. 1221-1226, 2009.
[29] J. Li, Y. Ding, Y. Shi and J. Zhang, "Building a Large Annotation Ontology for Movie Video Retrieval", International Journal of Digital Content Technology and its Applications, vol. 4, no. 5, pp. 74-81, 2010.

170-6