Automatic Sports Video Genre Classification using Pseudo-2D-HMM

Jinjun Wang^{2,1}, Changsheng Xu^1, Engsiong Chng^2
^1 Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613 ([email protected])
^2 CeMNet, SCE, Nanyang Technological University, Singapore 639798 ([email protected], [email protected])

Abstract

Building a generic content-based sports video analysis system remains challenging because the diversity of sports rules and game features makes it difficult to discover generic low-level features or high-level modeling algorithms. One possible alternative is to first classify the sports genre and then apply sport-specific domain knowledge to perform the analysis. In this paper we describe a multi-level framework to automatically recognize the genre of a sports video. The system consists of a Pseudo-2D-HMM classifier that evaluates video clips using low-level visual and audio features. The experimental results are satisfactory, and an extension of the framework to a generic sports video analysis system is being implemented.

1. Introduction

The wide distribution of sports video and its appeal to large global audiences have led to increasing research attention in recent years. Researchers focus on identifying semantically interesting sports content using multimodal analysis techniques, to assist tasks such as sports video structuring, summarization, highlight generation, automatic annotation, tactics analysis and video adaptation. However, despite some promising reported results [1], existing approaches to analyzing generic sports video remain primitive. Most current work relies on sport-specific domain knowledge, e.g. detecting the appearance of the referee to identify a card event in soccer [2]. Hence a system that can analyze one type of sports video cannot easily be transferred to another sports domain.

To build a generic sports video analysis framework, researchers are trying to discover more generic low-level features or high-level modeling methods. However, due to the diversity of sports rules and game patterns, only a few low-level features, such as dominant color, and high-level algorithms, such as the Hidden Markov Model (HMM), have proven generic and robust so far, and these alone cannot accomplish the full generic sports video analysis task. Since sport-specific knowledge is unavoidable for high accuracy in semantic analysis, one possible alternative for a generic framework is to automatically identify the genre of the sports video first, and then let an individual game analysis module apply domain knowledge for evaluation.

Automatic identification of video genre has long been studied for Content-Based Video Retrieval. Some recent examples relating to sports video include the following. Takagi et al. [3] proposed an HMM-based video classification system to distinguish sports videos using camera motion parameter features. Kobla et al. [4] applied replay detection, text and motion features with Bayesian classifiers to identify sports video, and in [5] motion and color features were used to classify several sports genres, including ice hockey, basketball, football and soccer. These existing works mainly rely on low-level visual features and simple classifiers. To improve performance, we examine features from both the visual and audio domains, as multi-modality analysis techniques have proven robust for video indexing [6], and we also explore alternative classifiers. Our main contributions are: 1) we demonstrate the effectiveness of Global Camera Motion, Dominant Color and Mel-Frequency Cepstral Coefficient features, and 2) we propose a Pseudo-2D-HMM (P2DHMM) framework to model the temporal patterns in video for the sports-genre classification problem.

The rest of the paper is organized as follows. Sections 2 and 3 describe the low-level feature extraction and high-level modeling of our framework, respectively. Section 4 presents experimental results. Section 5 discusses conclusions and future work.

2. Feature selection

Since sports genre identification is a classification problem, the first step is to extract suitable low-level features. Sports video carries information in multiple modalities, such as visual, auditory and textual, that can be utilized for genre classification. For robustness and generality, our framework uses low-level visual and audio features, described in the following two subsections.

2.1. Visual feature selection

Visual features fall into two categories: cinematic and object-based [2]. Compared with object-based features, cinematic features are cheap to compute and common across multiple sports genres, so we focus on cinematic visual features for sports genre recognition.

The first cinematic visual feature is camera motion. Camera motion features are useful because different game recordings exhibit different camera motion patterns, e.g. quick left-right motion in basketball versus slow pan/tilt motion in soccer. Our framework uses the Motion Vector Field available from the compressed video to compute the Average Motion Magnitude (R^1), Motion Entropy (R^1), Dominant Motion Direction (R^1) and camera pan/tilt/zoom factors (R^3), using the method proposed in [7]. Individual object motion is not considered, as we use cinematic visual features only.

The second cinematic visual feature is color. Sports games usually take place in environments with uniform colors, e.g. a soccer pitch is normally green while a basketball court is normally beige, which makes Dominant Color (DC) features useful. In our framework, both local DC and global DC features are extracted in the HSV and rg color spaces. The local DC is computed from each individual frame. To compute the global DC, we first extract two masks (an HSV mask and an rg mask) for each frame, containing the pixels that fall into the local HSV and rg DC ranges respectively. Then, if

1. size_HSV > size_frame / 2,
2. size_rg > size_frame / 2, and
3. size_{HSV ∩ rg} > 0.6,

where size_HSV, size_rg and size_frame are the sizes of the HSV mask, the rg mask and the video frame respectively, we put the local DC into an accumulation buffer. The global DC is extracted from this buffer over the whole clip. The local DC and global DC are both R^5 features (R^3 for HSV plus R^2 for rg). A sketch of the mask test follows.
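As an illustration, the mask test above might be implemented as in the following minimal sketch, assuming OpenCV and NumPy. The color tolerance is illustrative, and condition 3 is interpreted here as mask overlap relative to the frame size; the text does not spell out the unit of the 0.6 threshold.

```python
import cv2
import numpy as np

def local_dc_masks(frame_bgr, hsv_dc, rg_dc, tol=0.08):
    """Build HSV and rg masks of pixels near the local dominant color.

    hsv_dc: dominant (h, s, v) triple in [0, 1]; rg_dc: dominant (r, g)
    chromaticity pair. `tol` is an illustrative tolerance, not from the paper.
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv /= np.array([179.0, 255.0, 255.0])          # normalize OpenCV ranges
    mask_hsv = np.all(np.abs(hsv - np.asarray(hsv_dc)) < tol, axis=2)

    bgr = frame_bgr.astype(np.float32) + 1e-6
    s = bgr.sum(axis=2)
    rg = np.stack([bgr[..., 2] / s, bgr[..., 1] / s], axis=2)  # r, g chromaticity
    mask_rg = np.all(np.abs(rg - np.asarray(rg_dc)) < tol, axis=2)
    return mask_hsv, mask_rg

def passes_dc_conditions(mask_hsv, mask_rg):
    """Conditions 1-3 from the text: each mask covers more than half of the
    frame, and the overlap exceeds 0.6 (read here as a fraction of the
    frame size -- an assumption)."""
    size_frame = mask_hsv.size
    size_overlap = np.logical_and(mask_hsv, mask_rg).sum()
    return (mask_hsv.sum() > size_frame / 2 and
            mask_rg.sum() > size_frame / 2 and
            size_overlap > 0.6 * size_frame)
```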

2.2. Audio feature selection

Audio information is also included in our framework because of certain domain-specific sports sounds, e.g. the ball-hitting sound in tennis or whistling in soccer. As sports audio mainly consists of speech and environmental sound or noise, we extract the Mel-Frequency Cepstral Coefficients (MFCC) commonly used in Automatic Speech Recognition (ASR). However, we do not directly perform ASR for the video genre classification problem, since defining a vocabulary is difficult and heuristic, which would make the system less generic. In our framework, 13 MFCC coefficients are extracted per audio frame (R^13).
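For illustration, a 13-order MFCC front end could look like the following sketch. The use of librosa, the 16 kHz sampling rate and the 25 ms / 10 ms windowing are our assumptions, not choices stated in the paper.

```python
import librosa

def extract_mfcc(audio_path, n_mfcc=13, frame_ms=25, hop_ms=10):
    """Extract a 13-dimensional MFCC sequence (R^13 per frame), with
    illustrative 25 ms windows / 10 ms hops as common in ASR front ends."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    return mfcc.T  # shape: (num_frames, 13)
```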

3. Genre recognition

To classify the extracted feature sequences, an HMM-based classifier is used because of its ability to model both within-state feature patterns and state-transition patterns. HMMs have proven robust and accurate for many problems [8], such as ASR, image processing, communications, signal processing, finance and traffic modeling. The HMM classifier works as follows [8]: given a set of states S = {s_1, s_2, ..., s_K} and an observation sequence X = {x_1, x_2, ..., x_N}, the likelihood of X with respect to an HMM with parameters Θ expands as

$$p(X|\Theta) = \sum_{\text{all } Q} p(X, Q|\Theta) \qquad (1)$$

where, by Bayes' rule,

$$p(X, Q|\Theta) = p(X|Q, \Theta)\, p(Q|\Theta) \qquad (2)$$

We have

$$p(X|Q, \Theta) = \prod_{n=1}^{N} p(x_n|q_n, \Theta) = b_{q_1 x_1} \cdot b_{q_2 x_2} \cdots b_{q_N x_N} \qquad (3)$$

and

$$p(Q|\Theta) = \pi_{q_1} \cdot \prod_{n=1}^{N-1} a_{q_n q_{n+1}} \qquad (4)$$

Here Q = {q_1, q_2, ..., q_N} is a (hidden) state sequence with each q_i ∈ S; π_i = p(q_1 = s_i) is the prior probability of s_i being the first state of a state sequence; a_ij denotes the transition probability from state i to state j; and b_{q_i x_i} is the emission probability. Typically b_{q_i x_i} is modeled by a Gaussian Mixture Model (GMM). However, this can cause errors when the actual distribution of the observations differs from a Gaussian. To overcome this limitation, in our earlier work [9] we used a Support Vector Machine (SVM) to first perform Vector Quantization (VQ) over the feature vectors, and the obtained VQ labels, rather than the raw feature vectors, were used as the HMM observations x_i. For a similar reason, in this paper we apply a P2DHMM [10] framework to the video genre classification task.
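For concreteness, the likelihood in Eqs. (1)-(4) is normally evaluated with the standard forward algorithm rather than by enumerating every state sequence Q. A minimal sketch follows (without the log-space scaling a practical implementation would need):

```python
import numpy as np

def hmm_likelihood(pi, A, B):
    """p(X|Theta) via the forward algorithm, i.e. Eqs. (1)-(4) without
    enumerating all Q.

    pi: (K,)   prior pi_i = p(q_1 = s_i)
    A:  (K, K) transition probabilities a_ij
    B:  (N, K) per-frame emission likelihoods, B[n, i] = p(x_n | q_n = s_i)
               (in the paper these would come from the per-state GMMs)
    """
    alpha = pi * B[0]                  # alpha_1(i) = pi_i * b_{s_i x_1}
    for n in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[n]     # sum over predecessor states, Eq. (4)
    return alpha.sum()                 # marginalize the final state, Eq. (1)
```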

In the P2DHMM, the emission probabilities of the high-level HMM (referred to as the "2-D HMM") are estimated through low-level HMMs (referred to as "1-D HMMs"), whose states are modeled by GMMs. Fig. 1 shows the P2DHMM structure.

Figure 1. Structure of the Pseudo-2D-HMM.

In our system the output of the 1-D HMMs, x_l, similar to a VQ label, is chosen to be the normalized likelihoods of the sub-clip X_l under a set of HMMs Θ = {Θ_1, Θ_2, ..., Θ_C}:

$$x_l = \left\{ \frac{p(X_l|\Theta_1)}{D}, \frac{p(X_l|\Theta_2)}{D}, \ldots, \frac{p(X_l|\Theta_C)}{D} \right\} \qquad (5)$$

Thus each x_l is a C-dimensional vector, and D is the normalizer:

$$D = \max\{p(X_l|\Theta_1), p(X_l|\Theta_2), \ldots, p(X_l|\Theta_C)\} \qquad (6)$$

The obtained sequence of sub-clip class indices X' = {x_1, x_2, ..., x_L} is then sent to the 2-D HMMs to obtain the final classification of the clip:

$$\text{class} = \arg\max_c \, p(X'|\Theta_c) \qquad (7)$$

Fig. 2 illustrates the flow chart of our framework.

Figure 2. Diagram of structure analysis: visual and audio features pass through separate banks of 1-D video HMMs and 1-D audio HMMs, whose normalized likelihoods Prob_1(1d), ..., Prob_n(1d) form the input to the 2-D HMM.

As Fig. 2 shows, the video and audio features go through different 1-D HMMs. This is because we observed in our data set that there is no obvious alignment between the video and audio feature transition paths in our video clips. Accordingly, Eq. (5) is rewritten as

$$x_l = \left\{ \frac{p(X_l|\Theta^v_1)}{D}, \ldots, \frac{p(X_l|\Theta^v_C)}{D}, \frac{p(X_l|\Theta^a_1)}{D}, \ldots, \frac{p(X_l|\Theta^a_C)}{D} \right\} \qquad (8)$$

where Θ^v and Θ^a represent the video and audio HMM parameters respectively, and x_l is now a 2 × C-dimensional vector. The low-level HMMs for both video and audio have 5 states, with each emitting state modeled by a GMM with 3 mixtures. The high-level HMM has 3 states, again with 3-mixture GMMs per emitting state. Both levels are left-right HMMs.
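To make the two-level decision of Eqs. (5)-(8) concrete, the following sketch wires hypothetical 1-D and 2-D HMM scoring functions together. Here `score` stands in for a trained model's likelihood evaluation and is not an API from the paper, and the single normalizer D in Eq. (8) is read as the max over all 2C likelihoods, which is an assumption.

```python
import numpy as np

def sub_clip_observation(X_l, video_hmms, audio_hmms, score):
    """Eqs. (6) and (8): turn one sub-clip X_l into a 2C-dim vector x_l.

    `score(hmm, X_l)` is a stand-in for p(X_l | Theta) under one trained
    1-D HMM; each HMM list holds one model per genre (C = len(video_hmms)).
    """
    p = np.array([score(h, X_l) for h in video_hmms] +
                 [score(h, X_l) for h in audio_hmms])
    return p / p.max()                        # D = max{...}, Eq. (6)

def classify_clip(sub_clips, video_hmms, audio_hmms, hmms_2d, score):
    """Eq. (7): send the x_l sequence X' through the 2-D HMMs, pick argmax."""
    X_prime = np.stack([sub_clip_observation(s, video_hmms, audio_hmms, score)
                        for s in sub_clips])  # shape (L, 2C)
    genre_scores = [score(h, X_prime) for h in hmms_2d]
    return int(np.argmax(genre_scores))       # class = argmax_c p(X'|Theta_c)
```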

4. Experimental Results

In the experiments we classify the genre into three classes: basketball, soccer and tennis. We also try to distinguish sports video from non-sports video such as news. Our data set includes 16 hours of video in total: basketball, tennis, soccer and news, 4 hours per type. To avoid confusion, the selected news video contains no sports-news segments. To identify the genre of a video clip, we classify a randomly extracted video segment (X) from the clip and use the result to label the genre of the original video.

To measure the accuracy of each block of the multi-level framework, we first investigate the accuracy of the low-level (1-D) HMM classification (quantization) over the visual and audio features. As elaborated in Section 3, measuring this 1-D HMM classification accuracy requires a genre label for each sub-clip X_l of X. Denoting this label c_l, we have

$$c_l = \arg\max_c \, p(X_l|\Theta_c) \qquad (9)$$

Each c_l is then compared with the ground truth; the results are listed in Tables 1 and 2 respectively.

genre \ classified | Soccer | Basketball | Tennis | News
Soccer             | 100%   | 0%         | 0%     | 0%
Basketball         | 0%     | 89.7%      | 0%     | 10.3%
Tennis             | 0%     | 0%         | 89.7%  | 10.3%
News               | 0%     | 3.4%       | 3.4%   | 93.2%

Table 1. Performance of the 1-D video HMM

genre \ classified | Soccer | Basketball | Tennis | News
Soccer             | 86.6%  | 13.4%      | 0%     | 0%
Basketball         | 50.0%  | 40.0%      | 0%     | 10.0%
Tennis             | 0%     | 0%         | 100%   | 0%
News               | 6.8%   | 17.3%      | 0%     | 75.9%

Table 2. Performance of the 1-D audio HMM

The 1-D visual and audio HMM classification results are then combined using Eq. (8) to form the input of the second-level HMM classification, and the final genre classification result is shown in Table 3.

genre \ classified | Soccer | Basketball | Tennis | News
Soccer             | 100%   | 0%         | 0%     | 0%
Basketball         | 0%     | 100%       | 0%     | 0%
Tennis             | 0%     | 0%         | 100%   | 0%
News               | 0%     | 0%         | 0%     | 100%

Table 3. Performance of the P2DHMM

Tables 1, 2 and 3 show that the classification errors remaining after the 1-D visual HMM are rectified by fusing visual and audio features in the 2-D HMM classification. Inspecting the clips misrecognized at the 1-D output, we found that their randomly chosen segments either contain long close-up views, which are hard to differentiate between one sport and another, or long periods of statistics or commentator views, which are confused with news clips. With the aid of audio information, they are correctly recognized at the 2-D HMM output.

We also evaluate the effect of the length N of the randomly selected video segment on accuracy. The performance in Table 3 is based on N = 2400 frames per clip, i.e. 96 seconds at 25 frames/s. Accuracy drops or rises as the segment gets shorter or longer, as illustrated in Fig. 3, where the horizontal axis denotes the segment length, ranging from 75 to 3000 frames, and the vertical axis shows the classification accuracy. When the segment length falls below 300 frames per clip, the accuracy drops significantly; on the other hand, beyond 1500 frames per clip there is no obvious further improvement.

Figure 3. Performance for different segment lengths (classification accuracy, 0.5-1.0, versus segment length, 75-3000 frames).

5. Conclusions and Future Work

In this paper we presented a P2DHMM framework that uses visual and audio features to classify sports video genre. The performance of the system is validated by good experimental results. Our future work on this framework includes: 1) improving the system into an on-line classification system. Currently, not all visual features can be extracted on the fly, because computing the global DC requires accumulation over the whole clip; we are examining other methods, e.g. computing the global DC from the up-to-date accumulation result with a leaky algorithm (see the sketch below), to turn the framework into a real-time classification system; 2) enlarging our test data set with longer game recordings from more sports domains, e.g. golf and table tennis; and 3) concatenating the sports video genre classification framework with our sports video analysis modules to build a generic sports video analysis system.
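As one possible reading of the leaky-accumulation idea in item 1, a decaying dominant-color histogram can be queried at any time; the decay factor and the binning below are our assumptions, not details from the paper.

```python
import numpy as np

class LeakyDCAccumulator:
    """Online stand-in for the global-DC buffer: old local-DC evidence
    leaks away, so the estimate tracks the portion of the clip seen so far."""

    def __init__(self, n_bins=32, leak=0.99):   # leak factor is illustrative
        self.hist = np.zeros(n_bins)
        self.leak = leak

    def update(self, local_dc_bin):
        self.hist *= self.leak                  # decay old evidence
        self.hist[local_dc_bin] += 1.0          # accumulate the current local DC

    def global_dc(self):
        return int(np.argmax(self.hist))        # current global-DC estimate
```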


References

[1] N. Adami, R. Leonardi, and P. Migliorati, "An overview of multi-modal techniques for the characterization of sport programmes," Proc. of SPIE-VCIP'03, pp. 1296–1306, July 2003.
[2] A. Ekin, A. M. Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization," IEEE Trans. on Image Processing, vol. 12, no. 7, pp. 796–807, 2003.
[3] S. Takagi, et al., "Sports video categorizing method using camera motion parameters," Proc. of ICME'03, 2003.
[4] V. Kobla, et al., "Identification of sports videos using replay, text, and camera motion features," Proc. of SPIE on Storage and Retrieval for Media Databases, vol. 3972, pp. 332–343, 2000.
[5] X. Gibert, et al., "Sports video classification using HMMs," Proc. of ICME'03, 2003.
[6] C. Snoek and M. Worring, "Multimodal video indexing: A review of the state-of-the-art," ISIS Technical Report 2001-20, Intelligent Sensory Information Systems Group, University of Amsterdam, 2001.
[7] Y. Tan, et al., "Rapid estimation of camera motion from compressed video with application to video annotation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 10, no. 1, pp. 133–146, February 2000.
[8] V. A. Petrushin, "Hidden Markov models: Fundamentals and applications," Online Symposium for Electronics Engineers, 2000.
[9] J. Wang, et al., "Sports highlight detection from keyword sequences using HMM," Proc. of IEEE ICME'04, July 2004.
[10] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," Proc. of the Second IEEE Workshop on Applications of Computer Vision, 1994.