Event Detection in Sports Video based on Generative-Discriminative Models Yi Ding and Guoliang Fan School of Electrical and Computer Engineering Oklahoma State University Stillwater, OK 74078, USA
{yi.ding,guoliang.fan}@okstate.edu
ABSTRACT We study event detection in the context of sports video mining that involves a three-layer semantic space, i.e., low-level visual features, mid-level semantic structures, and high-level semantics (or events). This space supports explicit semantic modeling and direct semantic computing. Specifically, the mid-level semantic structures are the basic recurrent temporal patterns that serve as the building blocks for event analysis. We also propose a unified video mining framework where event detection is formulated as two inter-related inference problems associated with two different machine learning tools. One is from low-level to mid-level via generative models, and the other is from mid-level to high-level via discriminative models. We use American football video analysis as a case study, and the experimental results demonstrate the promise of the proposed approach.
Categories and Subject Descriptors I.4.8 [Image Processing and Computer Vision]: Scene Analysis
General Terms Algorithms
Keywords video mining, event detection, generative models, discriminative models, sports video analysis
1. INTRODUCTION
The goal of video mining is to discover knowledge, such as patterns and events, in video data stored in databases, data warehouses, or other online repositories [4, 22]. In particular, event detection has become an active research field in recent years, driven by the ever-increasing needs of numerous multimedia and online database applications. Its benefits range from efficient browsing and summarization of video content to facilitating user-oriented video access and retrieval.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EiMM’09, October 23, 2009, Beijing, China. Copyright 2009 ACM 978-1-60558-754-7/09/10 ...$10.00.
According to their production and editing styles, videos can be classified into two major categories: scripted and non-scripted [37], which are usually associated with different video mining tasks. Scripted videos, e.g., news and movies, are produced or edited according to a pre-defined script or plan, for which we can build a Table-of-Content (TOC) [27] to facilitate video search. Non-scripted videos, such as meeting, sports, and surveillance videos, do not have a pre-defined script; instead, events happen spontaneously, usually in a relatively fixed setting. Therefore, detecting highlights or events of interest is of great importance for non-scripted videos. We are interested in sports video mining, where we use American football video as a case study. Sports video mining has been widely studied due to its great commercial value [40, 13, 9, 14]. Although sports videos are considered non-scripted, they usually have a relatively well-defined structure (such as camera views) or repetitive patterns (such as play types), which can enhance the scriptedness of sports videos and consequently improve the detectability of events. However, there are still two major challenges in this research. One is the notorious "semantic" gap between understandable semantics and computable features, and the other is the ambiguity and uncertainty of both features and semantics. Therefore, in this paper, we propose a generative-discriminative model based framework that offers explicit semantic modeling and direct semantic computing to attack these two challenges. Specifically, we address the following two main issues: • To represent sports video in a three-layer semantic space that is composed of low-level visual features, mid-level semantic structures, and high-level semantics. A key element introduced is the mid-level semantic structures that serve as the bridge across the semantic gap as well as the building blocks for high-level event analysis.
Specifically, two mid-level semantic structures are considered here, i.e., play types (what happened) and camera views (where did it happen), from which we can define high-level semantics (events). • To formulate event detection as two related statistical inference problems that are addressed by two different machine learning tools. Machine learning is considered one of the most plausible approaches for video mining, and here it is divided into two stages. The first stage uses generative models to explore the mid-level semantic structures from low-level visual features, and the second stage uses discriminative models to infer high-level semantics (event detection) from mid-level semantic structures.
2. RELATED WORK
Event detection in sports video can be accomplished at both the object level [15, 39, 11] and the scene level [24, 19, 9]. Object-level approaches usually associate the definition of events with the appearances or behaviors of specific objects, e.g., in [11], object-based features, such as ball and player trajectories, are used to detect goals in soccer games. However, semantic events are usually defined over a complex collection of objects. Also, in some cases, the objects are not always available and may be hard to extract. Event detection at the scene level utilizes a variety of cues in the video data, including color, shape, motion, and even caption information, to detect semantic events [9]. In this work, we are interested in scene-level event detection where scene-based visual features are involved. There are two popular families of scene-level event detection methods. One is the rule-based approaches [17, 41], and the other is the statistical model-based approaches [24, 28]. The major advantages of rule-based methods are that they are usually goal-oriented and can support specific video search tasks. In [17], camera motion related criteria are defined to classify each play shot into one of seven different plays. However, rule-based methods are limited in dealing with the uncertainty and ambiguity in visual features, and their task-specificity makes them less flexible. Statistical approaches use statistical models to discover underlying semantic events; they are usually more effective and flexible for discovering underlying structures in the video data, such as recurrent events or highlights [24]. Compared with rule-based approaches, statistical approaches are capable of handling a large volume of video data and support more complex video mining tasks. There are two types of statistical models, i.e., generative and discriminative ones.
Generative models, e.g., hidden Markov models (HMMs), are usually preferred for a large data set due to their better generality when prior knowledge is available. Various extended HMMs were proposed to enhance the capability of exploring complex recurrent temporal patterns in video data [35, 36, 33]. The Dynamic Bayesian Network (DBN) [23] provides a unified probabilistic framework to represent various graphical structures. Generative models tend to make assumptions about the underlying distributions as well as the dependency structure between random variables. In contrast, discriminative models, such as the conditional random field (CRF) [32, 5] and the support vector machine (SVM) [30, 20], are more flexible and expressive for statistical modeling, since fewer assumptions and less prior knowledge are used. However, discriminative model-based approaches can be sensitive to noise or limited when a large and complex data set is involved. Recently, researchers have suggested that hybrid generative/discriminative models have several advantages over purely generative or discriminative counterparts [26].
3. PROBLEM FORMULATION
We believe that explicit semantic modeling and direct semantic computing are effective for bridging the semantic gap and reducing the ambiguity and uncertainty in sports video mining. Therefore, we propose a three-layer semantic space that is composed of low-level visual features, mid-level semantic structures, and high-level semantics (or events). In this semantic space, event detection is converted into two inter-related statistical inference problems. One is from low-level to mid-level via generative models, and the other is from mid-level to high-level via discriminative models. This approach supports not only event detection but also customized events-of-interest, increasing the usability and interactivity of video data. In this work, we assume that a video sequence is composed of a set of pre-segmented consecutive shots with all breaks and commercials removed.

Figure 1: The three-layer semantic space. Low-level visual features (color, landmarks, motion) are mapped to mid-level semantic structures (play types, camera views, possessions) by generative models (Stage 1), which in turn are mapped to high-level semantics (touchdown, turnover, or user-specified events of interest) by discriminative models (Stage 2).
3.1 The semantic space

As shown in Fig. 1, we advocate the concept of a three-layer semantic space to facilitate event detection and semantic analysis, where the mid-level semantic structures are essential and serve as the bridge across the semantic gap. The three layers are discussed below.
• Low-level visual features are relevant visual cues extracted at the scene level that are used to infer mid-level semantic structures. In this work, three types of visual features are used, i.e., color distribution, landmarks, and camera motion. The first two are useful for indicating position information, and the last one reflects play information. For each shot, visual features are extracted from all frames in the shot.
• Mid-level semantic structures are rudimentary building blocks for semantic analysis. They are frequent, recurrent, and relatively well-defined, and their transitions can be governed by certain dynamics. More interestingly, they can be used together to specify high-level semantics or events. In particular, we choose several common structures in most field sports, i.e., play types (what happened) and camera views (where did it happen), both of which are defined at the shot level and can be specified by a set of frame-wise visual features in a shot.
• High-level semantics are immediately useful to the user. Usually, they are rare and diverse, and may not be defined uniquely or inferred from visual features directly. However, they can be better defined based on mid-level semantic structures. In the case of American football videos, high-level semantics could include the touchdown, turnover, and other events of interest.
3.2 Two inference problems

Similar video mining paradigms can be found in the recent literature [22, 32]. We aim to provide a unified machine learning paradigm in which semantic video analysis can be systematically formulated as two statistical inference problems. The first one is from low-level visual features to mid-level semantic structures, and the other is from mid-level semantic structures to high-level semantics.
• Inference problem 1: Due to the nature of mid-level semantic structures, generative models, such as HMMs or their variants, are more suitable for the first inference problem, where some prior knowledge is available as well as useful. Specifically, the dynamic model directly reflects the transitions of recurrent temporal patterns (i.e., mid-level semantic structures), and the observation model represents the different mid-level structures by distinct low-level features.
• Inference problem 2: High-level semantics or events of interest that are immediately useful for users are relatively less well-defined by low-level features, mainly due to their uncertain and ambiguous nature in the video data. On the other hand, they can be better specified by mid-level structures. However, a more flexible and general statistical association between states and observations is needed. Therefore, we invoke discriminative models, such as CRFs, for the second inference problem, which support versatile statistical modeling structures.
There are three major advantages in this framework. First, we can fully take advantage of the available semantics and prior knowledge about a sport game, which makes the video mining problem well structured and formulated. Second, it provides both structure analysis at the mid-level and event analysis at the high-level, which are complementary in nature. Third, it not only supports highlight detection but also can deliver customized events-of-interest and increase the usability and interactivity of video data.

3.3 Comparison with other approaches

Recent studies with multi-level video mining paradigms are focused on two aspects. One is to use a top-down concept-guided approach that relies on pre-defined rules with direct reasoning [9, 38], and the other is to develop bottom-up data-driven methods that depend on powerful machine learning tools [31, 32]. We believe that the top-down concept-guided semantic representation should be supported by the bottom-up data-driven methods. In other words, a balance is needed to combine both top-down and bottom-up flows in high-level semantic analysis. Therefore, we advocate the concept of a semantic space associated with two inter-related inference problems for event detection in sports video data. The key of our research is the mid-level semantic structures that make video mining tasks better defined and more flexible. They differ from the mid-level representation introduced in [9, 22], which is deterministically specified by low-level features according to pre-defined specific rules. We argue that the mid-level structures have the following properties: (1) they are frequent, recurrent, and relatively well-defined; (2) they provide a general video representation for semantic video modeling; (3) they are inferable from low-level features and expressive for high-level semantic analysis.

4. LOW-LEVEL VISUAL FEATURES

In this work, we are mainly interested in two types of mid-level semantic structures, play types and camera views. The play types are largely dependent on the pattern of camera motion, i.e., panning and tilting, as shown in Fig. 2(a). We choose the optical flow based method [29] to compute motion parameters between two adjacent frames. Also, the frame indices are included as an additional temporal feature to differentiate long plays from short plays. Therefore, a 3-D feature vector of camera motion is employed to identify the play type in each shot. In Fig. 2(b), we illustrate typical scenes for four different camera views that can be characterized by the spatial color distribution and visual landmarks (yard lines). Therefore, we use the dominant color to estimate the spatial color distribution and to extract the dominant color region, i.e., the play ground [10]. Moreover, we use Canny edge detection followed by the Hough transform to detect the yard lines in the play field. We divide the scene into three regions vertically and three regions horizontally. Then we extract a 6-D feature vector for each frame that includes the ratios of dominant color between different horizontal/vertical regions and the average angle of the yard lines.
Figure 2: Visual features in football video: (a) play type analysis (panning and tilting camera motion); (b) camera view analysis (left, right, central, and end-zone camera views; field, yard lines, banner & logo, and audience regions).
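As a rough illustration, the 6-D camera-view feature could be assembled as below. The exact ratio layout is our assumption (the paper does not spell it out); the field mask is assumed to come from the dominant-color detector [10], and the yard-line angle from the Canny/Hough step, both of which are omitted here:

```python
import numpy as np

def view_features(field_mask, yard_line_angle):
    """6-D camera-view feature for one frame (illustrative layout).

    field_mask: HxW boolean array marking the dominant-color (play field)
    region, assumed to be produced by an upstream dominant-color detector.
    yard_line_angle: average yard-line angle in degrees, assumed to come
    from a Canny + Hough line detector (not shown).
    """
    rows = np.array_split(field_mask, 3, axis=0)   # top / middle / bottom bands
    cols = np.array_split(field_mask, 3, axis=1)   # left / middle / right bands
    r = [band.mean() for band in rows]             # field ratio per horizontal band
    c = [band.mean() for band in cols]             # field ratio per vertical band
    eps = 1e-6                                     # guard against division by zero
    return np.array([
        r[0] / (r[1] + eps),   # top vs. middle (audience vs. field)
        r[2] / (r[1] + eps),   # bottom vs. middle
        c[0] / (c[1] + eps),   # left vs. center (end-zone asymmetry)
        c[2] / (c[1] + eps),   # right vs. center
        field_mask.mean(),     # overall field coverage
        yard_line_angle,       # average yard-line angle
    ])
```

For instance, a central camera view yields nearly symmetric left/right ratios, while an end-zone view is strongly asymmetric.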
5. FIRST INFERENCE PROBLEM

The objective of the first inference problem is to infer the mid-level semantic structures from low-level features extracted from a set of shots, where play types and camera views are considered, as shown in Fig. 3. Each shot has a variable number of frames. It was also observed that the mid-level structures are closely related to each other during a game, so it is interesting to see whether we can explore multiple semantic structures in parallel and simultaneously. Therefore, our work focuses on three issues concerning HMMs: (1) how to develop a powerful observation model that captures rich statistics from frame-wise visual features in a shot; (2) how to define an interactive dynamic relationship among multiple Markov chains; and (3) how to learn the model compactly and effectively. Specifically, we will discuss three HMMs, as shown in Fig. 4: the traditional HMM, the segmental HMM (SHMM), and the multi-channel SHMM (MCSHMM).
Figure 4: Generative models: (a) HMM; (b) SHMM; (c) MCSHMM.
Figure 3: Two mid-level semantic structures: camera views (left, right, central, and end-zone views) and play types (long plays, short plays, kicks, and field goals).
5.1 Hidden Markov models

A typical HMM is shown in Fig. 4(a), where the hidden state is defined at the shot level and is directly associated with the mid-level semantic structure in this work. The HMM assumes the underlying system is a Markov process with unknown parameters, including a state transition matrix A_{i,j} = P(S_t = j | S_{t-1} = i) and a probabilistic observation model, for which we choose a Gaussian or a Gaussian mixture model:

p(O_t | S_t = k) = N(O_t | µ_k, Σ_k),    (1)

p(O_t | S_t = k) = Σ_{n=1}^{N} α_n N(O_t | µ_{n,k}, Σ_{n,k}),    (2)
where N is the number of components. Ai,j captures the underlying state dynamics, and p(Ot |St = k) characterizes the distributions of visual features under different states. To fit HMM, we compute the feature vector for each shot by averaging over all frame-wise features in a shot. After Expectation Maximization (EM) training [3], we can obtain the optimized parameter set. Then the Viterbi decoding algorithm [25] finds the optimal state sequences, i.e., the mid-level semantic structure. This HMM-based approach is the baseline algorithm for our research.
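The Viterbi decoding step used after EM training can be sketched in log-domain numpy as follows; the transition and emission values in the usage example are toy placeholders, not parameters learned from football video:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence for an HMM.

    log_pi[k]   : log initial probability of state k
    log_A[i, j] : log transition probability i -> j
    log_B[t, k] : log emission likelihood log p(O_t | S_t = k)
    """
    T, K = log_B.shape
    delta = log_pi + log_B[0]            # best log-score ending in each state
    psi = np.zeros((T, K), dtype=int)    # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path ending i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    states = np.empty(T, dtype=int)
    states[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):        # trace back-pointers
        states[t - 1] = psi[t, states[t]]
    return states
```

With a diagonal-dominant transition matrix, the decoded sequence smooths noisy per-shot emission scores into coherent runs of the same mid-level structure.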
5.2 Segmental hidden Markov models

The major limitation of the HMM is that each shot is represented by an averaged feature vector, which is less representative and informative. This motivates us to enhance the observation model by exploring the temporal dependency across all frames in a shot. We invoke the segmental HMM (SHMM) proposed in [12] (Fig. 4(b)), which is able to handle variable-length observations [7]. Each state in an SHMM can emit a sequence of observations, called a segment. Observations in a segment are assumed to be independent of the observations in other segments. In addition, within each segment all observations are conditionally independent given the mean of that segment, leading to a closed-form likelihood calculation. Here, we regard a segment as a shot and the observations in a segment as the frames in that shot. An SHMM of M states can be characterized by a set of parameters Θ = {π_m, a_{m,n}, µ_{µ,m}, Σ_{µ,m}, Σ_m | m, n = 1, ..., M}, where π_m is the initial probability, a_{m,n} the transition probability, µ_{µ,m} and Σ_{µ,m} characterize the distribution of the segment mean for state m, and Σ_m the variance within a segment. Given a shot at time t with n_t observations, i.e., O_t = {O_{t,k} | k = 1, ..., n_t}, under the conditional independence assumption, the likelihood of O_t is defined as:

p(O_t | S_t = m, Θ) = ∫ p(µ | S_t = m, Θ) ∏_{k=1}^{n_t} p(O_{t,k} | µ, S_t = m, Θ) dµ,    (3)

where

p(O_{t,k} | µ, S_t = m, Θ) = N(O_{t,k} | µ, Σ_m),
p(µ | S_t = m, Θ) = N(µ | µ_{µ,m}, Σ_{µ,m}).

The likelihood function defined in (3) provides a closed-form probability density function to characterize variable-length observations conditioned on different states. As shown in [12], the EM learning of the SHMM requires an approximation to the integral in (3). Then, as with HMMs, the Viterbi algorithm estimates the optimal state sequence S*_{1:T}, i.e., the mid-level semantic structures.
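Because the observations in a segment are exchangeable Gaussians around a Gaussian segment mean, the integral in (3) has a closed form: the segment is jointly Gaussian with mean µ_{µ,m}·1 and covariance Σ_m·I + Σ_{µ,m}·11ᵀ. A minimal 1-D numpy sketch, checked against brute-force numerical integration over the segment mean (all parameter values are toy placeholders):

```python
import numpy as np

def segment_loglik(obs, mu0, var_mu, var_obs):
    """Closed-form SHMM segment log-likelihood (1-D features).

    Integrating out the segment mean leaves obs jointly Gaussian with
    mean mu0 * 1 and covariance var_obs * I + var_mu * 1 * 1^T.
    """
    n = len(obs)
    cov = var_obs * np.eye(n) + var_mu * np.ones((n, n))
    diff = np.asarray(obs) - mu0
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def segment_loglik_numeric(obs, mu0, var_mu, var_obs, grid=20001, span=10.0):
    """Brute-force check: numerically integrate Eq. (3) over the mean."""
    mu = np.linspace(mu0 - span, mu0 + span, grid)
    logp = -0.5 * (mu - mu0) ** 2 / var_mu - 0.5 * np.log(2 * np.pi * var_mu)
    for o in obs:
        logp += -0.5 * (o - mu) ** 2 / var_obs - 0.5 * np.log(2 * np.pi * var_obs)
    p = np.exp(logp)
    dx = mu[1] - mu[0]
    return np.log(np.sum((p[1:] + p[:-1]) * 0.5 * dx))  # trapezoidal rule
```

The closed form is what makes the SHMM's EM updates tractable despite the latent per-segment mean.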
5.3 Multi-channel SHMM

Both the SHMM and HMMs can only detect a single type of semantic structure and cannot capture the interaction between two mid-level semantic structures. We expect a new model that not only inherits the merits of the SHMM but also explores the interaction between different semantic structures. Therefore, we advance a new multi-channel SHMM (MCSHMM) that involves two parallel SHMMs and a two-layer hierarchical Markovian structure, as shown in Fig. 4(c). In this model, each channel is a single SHMM, and the channels are connected by a hybrid parallel-hierarchical structure. We also developed a new learning algorithm that learns a compact model structure and optimized parameters simultaneously.
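As an illustration of the hybrid parallel-hierarchical dynamics (cf. the transition components A_1, A_2, A_3 in Eq. (5)), the following numpy sketch propagates a joint belief over the two channel states and the second-layer pair state through one transition step. All parameters are random placeholders with the assumed factorized structure, not learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2 = 4, 4            # states per channel (e.g., play types, camera views)
NF = N1 * N2             # second-layer states: pairs (S1, S2)

def normalize(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

# Random illustrative parameters with the factorized transition structure:
A1 = normalize(rng.random((N1, NF, N1)), axis=2)  # P(S1_t | S1_{t-1}, F_{t-1})
A2 = normalize(rng.random((N2, NF, N2)), axis=2)  # P(S2_t | S2_{t-1}, F_{t-1})
A3 = normalize(rng.random((N1, N2, NF)), axis=2)  # P(F_t | S1_t, S2_t)

def step(belief):
    """One dynamic update.

    belief[a, b, l] = P(S1_{t-1}=a, S2_{t-1}=b, F_{t-1}=l). The channels
    evolve conditionally independently given the previous layer-2 state,
    then the new layer-2 state is attached via A3.
    """
    p12 = np.einsum('abl,alm,bln->mn', belief, A1, A2)  # P(S1_t=m, S2_t=n)
    return p12[:, :, None] * A3                          # joint with F_t
```

The second-layer state F_t fuses the two channels, so information learned jointly is fed back to each chain at the next step.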
In the MCSHMM, both the dynamic and observation models have a two-layer structure. Specifically, at the first layer of the dynamic model, S = {S_t^(j) | t = 1, ..., T; j = 1, 2} denotes the state sequences of the two channels, where S_t^(j) is the state of shot t in channel j; at the second layer, F = {F_t = (S_t^(1), S_t^(2)) | t = 1, ..., T} represents the second-layer state sequence, where each state consists of the two current first-layer states. At the observation layer, O_t^(j) = {O_{t,1:n_t}^(j) | t = 1, ..., T; j = 1, 2} denotes the observations of shot t with n_t frames in channel j. Therefore, the MCSHMM's parameter set Θ = {Π, A, Ω} includes the following components:
• Initial probabilities:

Π = {P(S_1^(1)), P(S_1^(2)), P(F_1 | S_1^(1), S_1^(2))};    (4)

• Transition probabilities:

A = {A_w, w = 1, 2, 3},    (5)

where

A_1 = {P(S_t^(1) = m | S_{t-1}^(1) = n, F_{t-1} = l)},
A_2 = {P(S_t^(2) = m | S_{t-1}^(2) = n, F_{t-1} = l)},
A_3 = {P(F_t = l | S_t^(1) = m, S_t^(2) = n)};

• Observation density functions:

p(O_t^(j) | S_t^(j) = m, Ω) = ∫ N(µ | µ_{µ,m}^j, Σ_{µ,m}^j) ∏_{k=1}^{n_t} N(O_{t,k}^(j) | µ, Σ_m^j) dµ,    (6)

where Ω = {µ_{µ,m}^j, Σ_{µ,m}^j, Σ_m^j | j = 1, 2; m = 1, ..., N_j} specifies the two segmental models, N_j is the number of states in channel j (N_1 = N_2 = 4), and O_{t,k}^(j) denotes the observation of frame k in shot t and channel j.

The training of the MCSHMM is more challenging than that of the HMM and SHMM due to the large state space involved in the two-layer dynamics. To take full advantage of the MCSHMM, we propose an unsupervised learning algorithm in which the two learning processes, i.e., structure learning and parameter learning, are unified into one framework and optimized simultaneously, like the ones in [34, 2]. Initially, we pre-define a coarse model structure that includes every possible state configuration in the MCSHMM; we then use the ideas of the entropic prior and parameter extinction proposed in [2], which lead to a maximum a posteriori (MAP) estimator. This MAP estimator can be incorporated into the M-step of the EM algorithm and encourages a maximally structured, minimally ambiguous model. This is accomplished by trimming the weakly supported parameters and states, leading to a compact model with good determinism. Thus the model structure and parameters are optimized simultaneously; see [6] for more details. One may wonder whether the coupled HMM (CHMM) [1] could achieve the same goal. However, the CHMM usually assumes a relatively strong interaction between the two Markov chains, which is not the case in our application. Also, the CHMM involves a large state space due to the two parallel channels in one layer, which makes EM learning very difficult, while the learning of the MCSHMM keeps only 30% (pre-set) of the state transitions as useful ones.

6. SECOND INFERENCE PROBLEM

6.1 High-level semantics

High-level semantics usually refers to the events or incidents that are of particular interest to the user, with the following three characteristics:
• They are rare and specific, and the training data could be insufficient;
• They carry more ambiguity and uncertainty in terms of low-level visual features than the mid-level semantic structures do; e.g., a touchdown in a football video can occur in many different ways;
• They can be better defined or specified from mid-level semantic structures. For example, as shown in Fig. 5, an event-of-interest, e.g., a touchdown, can be inferred from mid-level semantic structures across several shots, including camera views, play types, and possessions.
Event detection requires a flexible modeling structure that allows a more versatile dependency relationship between random variables (both observations and states) to accommodate the nature of high-level semantics. Generative models mainly focus on the temporal dependency across states (not observations). This motivates us to use discriminative models for the second inference problem, where the observation corresponds to the mid-level semantic structure and the state is associated with the high-level semantics or events-of-interest.
Figure 5: The second inference problem. A touchdown (high-level semantics) is inferred from the mid-level sequences across shots: camera views (right, middle, left, end zone), play types (long, long, short, field goal), and possessions (right, right, right, right).
6.2 Conditional Random Fields

Conditional random fields (CRFs) (Fig. 6(a)) were first introduced in [16]; they provide an undirected probabilistic model that directly computes the statistical mapping between the input (observations) and the output (states) for classification and regression purposes. Compared with HMMs, a CRF does not involve the conditional distributions of observations, and a state variable in a CRF can be related to multiple observations in different time slices, e.g., multiple shots. Compared with the SVM, the CRF is more appropriate for capturing the temporal dependencies among both observations and states in sequential data. Specifically, the Markov blanket in a CRF defines the set of neighboring nodes of a given node; it specifies the potential functions over both states and observations and simplifies the factorization of the joint distribution in the CRF.
Figure 6: (a) CRF; (b) CRF for event detection (M, R, E: central, right, end-zone views; L, S, G, K: long/short plays, field goal, kick; L, R: left, right).

A CRF model can be characterized by the parameter set

Θ = (λ_1, λ_2, ..., λ_n; µ_1, µ_2, ..., µ_n),    (7)

which specifies two kinds of feature functions between the observations O and the hidden states S:

p_Θ(S|O) = (1/Z(O)) exp{ Σ_t [ Σ_i λ_i f_i(S_{t-1}, S_t) + Σ_j µ_j g_j(S_t, O) ] },    (8)

where Z(O) is a normalization over the data sequence O, and f_i(·) and g_j(·) are the transition feature functions and state feature functions, respectively. Given a set of training data O, we can optimize the CRF by maximizing the conditional probability:

Θ* = arg max_Θ p_Θ(S|O).    (9)

Then, for a new data sequence O′, we predict the labeling S* that maximizes p(S|O′, Θ*), i.e.,

S* = arg max_S p(S|O′, Θ*).    (10)
Specifically, we regard O′ as the available mid-level semantic structure sequence and S as the desired high-level event sequence. Similar to [32], the Markov blanket of node S_t in this work (as shown in Fig. 6(b)) includes a first-order transition feature function and five consecutive observations as the state feature functions. This setting was found to be a good compromise between complexity and accuracy. We utilize the dynamic programming algorithms proposed in [18] to optimize the CRF parameters, and the classification task is then straightforward according to (10).
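To make Eqs. (8)-(10) concrete, the following toy sketch scores every labeling of a short sequence under randomly chosen transition and state feature weights, normalizes by a brute-force partition function Z(O), and picks the MAP labeling. It illustrates the linear-chain CRF form only; it is not the dynamic programming training of [18]:

```python
import itertools
import numpy as np

# Toy linear-chain CRF with K states over T time slices, illustrating Eq. (8):
# p(S|O) ∝ exp( sum_t [ trans[S_{t-1}, S_t] + emit[t, S_t] ] )
K, T = 2, 4
rng = np.random.default_rng(1)
trans = rng.normal(size=(K, K))   # λ_i weights on transition features f_i
emit = rng.normal(size=(T, K))    # µ_j g_j(S_t, O), already evaluated on O

def score(seq):
    """Unnormalized log-score of one labeling (the exponent of Eq. (8))."""
    s = emit[0, seq[0]]
    for t in range(1, T):
        s += trans[seq[t - 1], seq[t]] + emit[t, seq[t]]
    return s

# Brute-force partition function Z(O) over all K^T labelings.
seqs = list(itertools.product(range(K), repeat=T))
scores = np.array([score(s) for s in seqs])
probs = np.exp(scores - scores.max())
probs /= probs.sum()                  # p(S|O), normalized by Z(O)
best = seqs[int(np.argmax(probs))]    # MAP labeling, cf. Eq. (10)
```

Enumeration is exponential in T; in practice the same quantities are computed with dynamic programming, which is why the Markov-blanket restriction in Fig. 6(b) matters.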
7. EXPERIMENTAL RESULTS
7.1 Experiment settings

The proposed algorithms were tested on 9 NFL football games (352×240, 30 fps) that have been pre-segmented into a series of consecutive play shots by removing commercials and replays; each game has 150-170 shots, and each shot has 300-500 frames. Cut detection and shot segmentation are beyond the scope of this paper. We used Murphy's Bayes net and CRF toolkits (http://www.cs.ubc.ca/~murphyk/Software/index.html) in Matlab 7.0 to develop our algorithms. We manually annotated the ground-truth data of both the high-level semantics (events) and the mid-level semantic structures for algorithm evaluation.
7.2 Mid-level structure analysis

Seven methods are involved for comparison: a supervised GMM (order 3), the HMM with Gaussian emission (HMM(1)), the HMM with GMM emission (HMM(2)), the embedded GMM-HMM (HMM(3)) in [8], and the SHMM in [7]. We also implemented two CHMM-based methods [1], where CHMM(1) and CHMM(2) use a GMM and a segmental observation model, respectively. The first five explore the two semantic structures (plays and views) independently and separately, while the last three estimate both jointly. Because EM is the basic training algorithm in this case, initialization is important; we therefore adopted a coarse-to-fine learning strategy that uses the training result of a simpler model to initialize a more complex one. Specifically, we first use K-means (4-class) to obtain a coarse classification, and this result is used to initialize HMM(1), whose training result in turn initializes HMM(2), and so on. The training result of the SHMM was used to initialize the MCSHMM. Moreover, we insert some prior knowledge into the learning process; e.g., since the first shot is always in the central view, the initial probability of the central view is set to 1. We initialize the transition matrix by estimating the frequencies of different state transitions in a couple of real games. Table 1 shows the experimental results of mid-level semantic structure analysis. The MCSHMM clearly outperforms all other algorithms with significant improvements, mainly due to its two-layer dynamic model and segmental observation model.
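The coarse K-means stage of the coarse-to-fine initialization can be sketched as follows. This is a simplified stand-in with a deterministic spread-out initialization (the actual experiments used Murphy's Matlab toolkits); the returned per-cluster means and variances would seed the Gaussian emission model of HMM(1):

```python
import numpy as np

def kmeans_init(X, k=4, iters=50):
    """K-means (k-class) coarse classification of shot feature vectors.

    Returns per-cluster means, variances, and labels, which can seed the
    Gaussian emission model of HMM(1) in a coarse-to-fine strategy.
    """
    # Deterministic initialization: k points spread across the data by index.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        # Update each non-empty cluster's center.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    variances = np.array([
        X[labels == j].var(axis=0) + 1e-6 if np.any(labels == j)
        else np.ones(X.shape[1])
        for j in range(k)
    ])
    return centers, variances, labels
```

Seeding EM from these cluster statistics avoids poor local optima compared with random initialization.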
7.3 High-level event analysis

Given the mid-level semantic structures, we can use them to detect high-level semantic events. Specifically, we compare the following two event detection techniques:
• Rule-based approach: Rule-based approaches are built on top of some game rules; they are simple but effective in cases where the game is highly organized according to pre-defined rules. For example, most touchdown plays occur in either the left or right camera view with a short or long play that is followed by a field goal in the end zone. We use this rule to detect touchdown plays.
• CRF-based approach: We take the mid-level semantic structures as the observations and then infer the high-level semantics via a CRF model (Fig. 6(b)). In particular, we employed a cross-validation strategy in the training and testing process.
Each method was tested under two settings: one uses the ground-truth labels of the mid-level semantic structures (Case 1), and the other uses the results from the MCSHMM (Case 2). Specifically, we consider the touchdown play as the desired high-level semantic event. In each game there are usually 5-12 touchdowns out of 150-170 plays. Moreover, we add possessions (which team is controlling the ball), which can be simply detected from the camera motion at the beginning of a shot. We evaluate the performance of the two approaches in terms of three criteria, i.e., precision, recall, and F-measure [21], and the results are shown in Table 2. The CRF-based method clearly achieves a much higher precision rate with a slightly lower recall rate; the overall improvement in F-measure is nearly 20%.
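The three evaluation criteria can be computed as follows (a standard sketch; `prf` is a hypothetical helper name, with events represented as sets of shot indices):

```python
def prf(detected, ground_truth):
    """Precision, recall, and F-measure for shot-level event detection.

    detected, ground_truth: sets of shot indices labeled as the event
    (e.g., touchdown shots) by the detector and the annotator.
    """
    tp = len(detected & ground_truth)              # true positives
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f
```

A rule that fires on many shots raises recall at the cost of precision; the F-measure summarizes that trade-off in a single number.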
| Test Video (shots) | Semantic Structure | Supervised GMM | HMM(1) | HMM(2) | HMM(3) | CHMM(1) [1] | SHMM [7] | CHMM(2) [1] | MCSHMM |
|---|---|---|---|---|---|---|---|---|---|
| Video-1 (156) | Play Type | 26.92% | 43.59% | 60.90% | 77.56% | 41.03% | 80.13% | 78.85% | 82.05% |
| Video-1 (156) | Camera View | 35.90% | 55.77% | 72.44% | 76.28% | 58.33% | 79.49% | 81.41% | 84.62% |
| Video-2 (163) | Play Type | 30.67% | 49.07% | 61.34% | 70.56% | 53.37% | 77.91% | 76.69% | 81.59% |
| Video-2 (163) | Camera View | 41.10% | 61.35% | 67.48% | 79.14% | 65.03% | 84.05% | 80.98% | 85.28% |
| Video-3 (167) | Play Type | 23.35% | 44.91% | 49.10% | 58.08% | 52.10% | 68.86% | 70.06% | 75.45% |
| Video-3 (167) | Camera View | 37.12% | 53.29% | 58.68% | 61.08% | 58.08% | 74.85% | 77.25% | 81.44% |
| Video-4 (168) | Play Type | 36.90% | 58.33% | 64.67% | 70.66% | 67.86% | 77.98% | 69.62% | 82.74% |
| Video-4 (168) | Camera View | 47.61% | 64.88% | 70.24% | 73.21% | 69.05% | 79.17% | 75.60% | 84.52% |
| Video-5 (168) | Play Type | 29.17% | 50.59% | 61.90% | 72.02% | 70.24% | 74.40% | 70.83% | 80.95% |
| Video-5 (168) | Camera View | 40.47% | 56.55% | 65.48% | 73.81% | 60.71% | 73.21% | 64.88% | 77.38% |
| Video-6 (170) | Play Type | 38.25% | 54.11% | 60.59% | 70.59% | 76.47% | 80.59% | 70.59% | 84.71% |
| Video-6 (170) | Camera View | 41.17% | 52.35% | 64.12% | 70.59% | 74.70% | 80.00% | 71.18% | 84.11% |
| Video-7 (170) | Play Type | 37.05% | 64.70% | 71.17% | 70.59% | 72.35% | 75.88% | 76.47% | 83.53% |
| Video-7 (170) | Camera View | 52.35% | 68.24% | 72.35% | 75.29% | 67.06% | 81.18% | 71.76% | 87.06% |
| Video-8 (171) | Play Type | 40.93% | 56.14% | 65.50% | 72.51% | 68.82% | 78.95% | 82.94% | 84.80% |
| Video-8 (171) | Camera View | 50.88% | 62.57% | 75.44% | 77.78% | 76.02% | 80.11% | 81.18% | 88.30% |
| Video-9 (173) | Play Type | 39.88% | 63.01% | 67.05% | 69.36% | 67.63% | 75.14% | 69.36% | 82.08% |
| Video-9 (173) | Camera View | 43.93% | 66.07% | 68.21% | 71.68% | 69.94% | 77.46% | 73.41% | 84.97% |
| Average Accuracy | | 43.59% | 57.44% | 62.62% | 71.85% | 63.54% | 77.42% | 75.08% | 82.92% |
Table 1: Mid-level semantic structure analysis by eight algorithms for nine NFL videos.
7.4 Discussion
In mid-level semantic structure analysis, SHMM is better than the other HMMs because of its segmental observation model, and the improvement of MCSHMM over SHMM comes from the new hierarchical-parallel dynamics. By introducing two-layer dynamics and decision-level fusion, MCSHMM can effectively capture the mutual interaction between the two Markov chains while balancing the dependencies both within each channel and across the two channels. In contrast, CHMM does not produce reasonable results, for the reasons discussed in Section 5.3. The two-layer dynamic model in MCSHMM learns the interaction from the two individual Markov chains synergistically and feeds it back to each chain individually.
Regarding event detection, the rule-based approach reaches a high recall rate but a very low precision rate, while the CRF-based approach improves the overall accuracy significantly. We also observe several interesting findings: (1) the CRF can detect high-level semantics with high precision even when training data are limited; (2) however, because the same high-level semantic event may exhibit considerable variability in its mid-level semantic structures, the CRF may not generalize well to different instances of the same kind of event; (3) compared with the rule-based approach, the CRF-based one offers a better balance between precision and recall due to its statistical nature. Our research on the second inference problem is still preliminary, but the results are promising and inspiring, motivating us to develop new discriminative models with better generality and learning capability.
8. CONCLUSION AND FUTURE WORK
We have proposed a general approach for semantic modeling and event detection in sports video data. A three-layer semantic space is presented that supports explicit semantic modeling and direct semantic computing. Under this semantic space, event detection is formulated as two inter-related statistical inference problems that are different in nature and are addressed by generative and discriminative models, respectively. Compared with other similar approaches, ours attempts to provide a new perspective on sports video mining and event detection so that they can be better formulated and addressed by appropriate machine learning tools. The proposed framework can be extended to other sports video mining applications by incorporating new domain knowledge and relevant features, as well as novel modeling and learning techniques. Our future work is two-fold. First, we want to incorporate more relevant mid-level semantic structures (as well as low-level features) to enrich the expressiveness for high-level semantics. Second, we will enhance the capability and generality of the CRF for more effective event detection. It is also interesting to investigate a hybrid generative-discriminative model that seamlessly combines the two learning mechanisms into one framework.
Acknowledgements This work is supported in part by the National Science Foundation (NSF) under Grant IIS-0347613 and the 2009 Oklahoma NASA EPSCoR Research Initiation Grant (RIG). The authors also thank the anonymous reviewers for their valuable comments and suggestions that improved this paper.
9. REFERENCES
[1] M. Brand. Coupled hidden Markov models for modeling interactive processes. Tech. Report, MIT Media Lab, 1997.
[2] M. Brand. Structure learning in conditional probability models via an entropic prior and parameter extinction. Neural Computation, 11(5):1155–1182, 1999.
[3] G. Celeux and G. Govaert. Gaussian parsimonious models. Pattern Recognition, 28(5):781–783, 1995.
[4] S.-F. Chang. The holy grail of content-based media analysis. IEEE Multimedia Magazine, 9(2):6–10, 2002.
[5] M. Chen and A. Hauptmann. Discriminative fields for modeling semantic concepts in video. In Proc. Eighth Conference on Large-Scale Semantic Access to Content, 2007.
[6] Y. Ding and G. Fan. Sports video mining via multi-channel segmental hidden Markov models. IEEE Trans. on Multimedia, in press.
[7] Y. Ding and G. Fan. Segmental hidden Markov models for view-based sport video analysis. In Proc. International Workshop on Semantic Learning and Applications in Multimedia (SLAM), 2007.
[8] Y. Ding and G. Fan. Two-layer generative models for sport video mining. In Proc. IEEE International Conference on Multimedia and Expo, 2007.
[9] L. Duan, M. Xu, Q. Tian, C. Xu, and J. Jin. A unified framework for semantic shot classification in sports video. IEEE Trans. on Multimedia, 7, 2005.
[10] A. Ekin, A. Tekalp, and R. Mehrotra. Automatic soccer video analysis and summarization. IEEE Trans. on Image Processing, 12(7):796–807, 2003.
| Test Video | Case | Rule-based Precision | Rule-based Recall | Rule-based F-measure | CRF-based Precision | CRF-based Recall | CRF-based F-measure |
|---|---|---|---|---|---|---|---|
| Video-1 | 1 | 20.00% | 100.00% | 33.33% | 100.00% | 100.00% | 100.00% |
| Video-1 | 2 | 20.00% | 80.00% | 32.00% | 100.00% | 100.00% | 100.00% |
| Video-2 | 1 | 55.56% | 100.00% | 71.43% | 100.00% | 80.00% | 88.89% |
| Video-2 | 2 | 55.56% | 100.00% | 71.43% | 100.00% | 60.00% | 75.00% |
| Video-3 | 1 | 62.50% | 88.89% | 73.39% | 87.50% | 88.89% | 88.19% |
| Video-3 | 2 | 50.00% | 66.67% | 57.16% | 75.00% | 66.67% | 70.59% |
| Video-4 | 1 | 60.00% | 100.00% | 75.00% | 100.00% | 100.00% | 100.00% |
| Video-4 | 2 | 60.00% | 80.00% | 68.57% | 100.00% | 80.00% | 88.89% |
| Video-5 | 1 | 66.67% | 85.71% | 75.00% | 100.00% | 85.71% | 93.34% |
| Video-5 | 2 | 50.00% | 85.71% | 63.16% | 71.43% | 72.73% | 72.07% |
| Video-6 | 1 | 50.00% | 54.55% | 52.18% | 100.00% | 72.73% | 84.21% |
| Video-6 | 2 | 50.00% | 54.55% | 52.18% | 100.00% | 54.55% | 70.59% |
| Video-7 | 1 | 83.33% | 100.00% | 90.91% | 100.00% | 66.67% | 80.00% |
| Video-7 | 2 | 66.67% | 89.89% | 76.56% | 100.00% | 66.67% | 80.00% |
| Video-8 | 1 | 40.00% | 87.50% | 54.90% | 100.00% | 77.78% | 87.50% |
| Video-8 | 2 | 40.00% | 75.00% | 52.17% | 87.50% | 66.67% | 75.68% |
| Video-9 | 1 | 45.45% | 91.67% | 60.77% | 88.89% | 72.72% | 80.00% |
| Video-9 | 2 | 36.36% | 91.67% | 52.06% | 55.56% | 54.55% | 55.05% |
| Average | 1 | 53.72% | 89.81% | 65.21% | 97.38% | 82.72% | 89.13% |
| Average | 2 | 47.62% | 80.39% | 58.37% | 87.72% | 69.09% | 76.43% |
Table 2: High-level event detection by rule-based and CRF-based approaches.
[11] G. Zhu, C. Xu, Y. Zhang, Q. Huang, and H. Lu. Event tactic analysis based on player and ball trajectory in broadcast video. In Proc. International Conference on Content-Based Image and Video Retrieval, 2008.
[12] M. Gales and S. Young. The theory of segmental hidden Markov models. Technical Report CUED/F-INFENG/TR 133, Cambridge University, 1993.
[13] Y. Gong, L. T. Sin, C. H. Chuan, H. J. Zhang, and M. Sakauchi. Automatic parsing of TV soccer programs. In Proc. International Conference on Multimedia Computing and Systems, pages 167–174, 1995.
[14] A. Kokaram, N. Rea, R. Dahyot, M. Tekalp, P. Bouthemy, P. Gros, and I. Sezan. Browsing sports video: trends in sports-related indexing and retrieval work. IEEE Signal Processing Magazine, 23(2):47–58, March 2006.
[15] D. Koubaroulis, J. Matas, and J. Kittler. Colour-based object recognition for video annotation. In Proc. IEEE International Conference on Pattern Recognition, pages 1069–1072, Los Alamitos, CA, 2002.
[16] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. the Eighteenth International Conference on Machine Learning (ICML), 2001.
[17] M. Lazarescu and S. Venkatesh. Using camera motion to identify types of American football plays. In Proc. International Conference on Multimedia and Expo, July 2003.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[19] B. Li and I. Sezan. Event detection and summarization in American football broadcast video. In Proc. SPIE Storage and Retrieval for Media Databases, pages 202–213, 2002.
[20] Y. Ma and H. Zhang. Motion pattern based video classification using support vector machines. In Proc. IEEE International Symposium on Circuits and Systems, 2002.
[21] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel. Performance measures for information extraction. In Proc. DARPA Broadcast News Workshop, 1999.
[22] T. Mei, Y. Ma, H. Zhou, W. Ma, and H. Zhang. Sports video mining with mosaic. In Proc. the 11th IEEE International Multimedia Modeling Conference, pages 164–171, 2005.
[23] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, 2002.
[24] M. Naphade and T. Huang. Discovering recurrent events in video using unsupervised methods. In Proc. IEEE International Conference on Image Processing, 2002.
[25] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.
[26] R. Raina, Y. Shen, A. Y. Ng, and A. McCallum. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems 16, 2003.
[27] Y. Rui, T. Huang, and S. Mehrotra. Constructing table-of-content for video. In Proc. ACM Multimedia Conference, 1998.
[28] D. Sadlier and N. Connor. Event detection in field-sports video using audio-visual features and a support vector machine. IEEE Trans. on Circuits and Systems for Video Technology, 15(10):1225–1233, Oct. 2005.
[29] M. Srinivasan, S. Venkatesh, and R. Hosie. Qualitative estimation of camera motion parameters from video sequences. Pattern Recognition, 30:593–606, 1997.
[30] V. Suresh, C. Mohan, R. Swamy, and B. Yegnanarayana. Content-based video classification using support vector machines. In Neural Information Processing, Lecture Notes in Computer Science. Springer-Verlag, Oct. 2004.
[31] F. Wang, Y. Ma, H. Zhang, and J. Li. A generic framework for semantic sports video analysis using dynamic Bayesian networks. In Proc. the 11th International Multimedia Modelling Conference, 2005.
[32] T. Wang, J. Li, Q. Diao, W. Hu, Y. Zhang, and C. Dulong. Semantic event detection using conditional random fields. In Proc. International Workshop on Semantic Learning and Applications in Multimedia (SLAM), 2006.
[33] Y. Wang, Z. Liu, and J. Huang. Multimedia content analysis using both audio and visual clues. IEEE Signal Processing Magazine, 17(6):12–36, November 2000.
[34] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun. Unsupervised mining of statistical temporal structures in video. In A. Rosenfeld, D. Doermann, and D. DeMenthon, editors, Video Mining, chapter 10. Kluwer Academic Publishers, 2003.
[35] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun. Structure analysis of soccer video with hidden Markov models. In Proc. International Conference on Acoustics, Speech and Signal Processing, Orlando, FL, 2002.
[36] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun. Unsupervised discovery of multilevel statistical video structures using hierarchical hidden Markov models. In Proc. IEEE International Conference on Multimedia and Expo, Baltimore, MD, July 2003.
[37] Z. Xiong, X. Zhou, Q. Tian, Y. Rui, and T. Huang. Semantic retrieval of video: review of research on video retrieval in meetings, movies and broadcast news, and sports. IEEE Signal Processing Magazine, 23(2):18–27, March 2006.
[38] C. Xu, J. Cheng, Y. Zhang, Y. Zhang, and H. Lu. Sports video analysis: Semantics extraction, editorial content creation and adaptation. Journal of Multimedia, 2:69–79, 2009.
[39] X. Yu, C. Xu, H. W. Leong, Q. Tian, Q. Tang, and K. W. Wan. Trajectory-based ball detection and tracking with applications to semantic analysis of broadcast soccer video. In Proc. ACM Multimedia, pages 11–20, 2003.
[40] D. Zhong and S.-F. Chang. Structure analysis of sports video using domain models. In Proc. IEEE Conference on Multimedia and Expo, Tokyo, Japan, 2001.
[41] W. Zhou, A. Vellaikal, and C. Kuo. Rule-based video classification system for basketball video indexing. In Proc. ACM Multimedia, 2000.