
Semantic Event Detection using Conditional Random Fields

Tao Wang (1), Jianguo Li (1), Qian Diao (1), Wei Hu (1), Yimin Zhang (1), Carole Dulong (2)
(1) Intel China Research Center, Beijing, P.R. China, 100080
(2) Intel Corporation, Santa Clara, CA 95052, USA
{,, qian.diao,, yimin.zhang, carole.dulong}

Abstract

Semantic event detection has been an active research topic in video mining in recent years. One of the challenging problems is how to effectively model the temporal and multi-modality characteristics of video. In this paper, we employ Conditional Random Fields (CRFs) to fuse temporal multi-modality cues for event detection. CRFs are undirected probabilistic models designed for segmenting and labeling sequence data. Compared with traditional SVMs and Hidden Markov Models (HMMs), CRF-based event detection offers several particular advantages, including the ability to relax the strong independence assumptions on state transitions and to avoid a fundamental limitation of directed graphical models. To detect events, we use a three-level framework based on multi-modality fusion and mid-level keywords. The low level extracts audiovisual features, the mid level detects semantic keywords, and the high level infers semantic events from multiple keyword sequences. Experimental results on soccer highlight detection demonstrate that CRFs achieve better performance, particularly under the slice-level measure.

1. Introduction

With advances in storage capacity, computing power and multimedia technology, research on semantic event detection has become increasingly active in recent years, with applications such as video surveillance, sports highlight detection, TV/movie abstraction and home video retrieval. Through event detection, consumers can quickly retrieve specific segments from long videos and save much browsing time. There is a large body of literature on semantic event detection [1][3][5][11][16]. However, semantic event detection remains a challenging problem due to the large semantic gap and the difficulty of modeling the temporal and multi-modality characteristics of video.

In general, two kinds of methods have been adopted in previous work, i.e., segment classification and sequence learning.

The Segment Classification Approach (SCA) treats event detection as a classification problem. It first selects candidate event segments, e.g., with a sliding data window, and then applies a classification algorithm to predict the semantic label of each segment. Duan et al. [11] used game-specific rules to classify events. Although such a rule system is intuitive and yields adequate results, it lacks scalability and robustness. Wang et al. used SVMs to detect events [10]. The SVM is a good classifier, particularly for small training sets. However, it may not sufficiently characterize the relations among features and their temporal layout. Some researchers used a naive Bayes classifier to detect specific events [1]. Naive Bayes assumes that features are independent of each other and consequently neglects important relationships among them. SCA methods are simple and effective but have two limitations. First, they cannot characterize long-term dependencies within video streams and thus may be myopic about the impact of the current decision on later decisions [9]. Second, it is difficult for them to determine accurate event boundaries, i.e., the starting and ending times of the detected events.

In contrast to segment classification, the Sequence Learning Approach (SLA) uses probabilistic models to characterize the temporal video sequence. SLA treats event detection as a sequence labeling problem, i.e., decoding the most probable hidden state (semantic event label) sequence from the observed sequence (video). The most popular model is the Hidden Markov Model (HMM), which provides well-understood training and decoding algorithms for sequence labeling [4][5]. While HMMs have enjoyed much historical success, they suffer from one principal drawback: the structure of the HMM is often a poor model of the true process producing the data. Part of the problem stems from the Markov property. Any relationship between two separated y values (e.g., $y_0$ and $y_3$) must be communicated via the intervening y's (e.g., $y_1$ and $y_2$). A first-order Markov model, in which $y_t$ depends only on $y_{t-1}$, cannot capture these kinds of relationships.

This limitation is one of the main motivations for considering Conditional Random Fields (CRFs) as an alternative [9]. CRFs are undirected graphical models that specify the probability of an entire label sequence conditioned on an observation sequence. Although they encompass HMM-like models, CRFs are more expressive because they allow more dependencies on the observation sequence. In addition, the chosen features may represent attributes at different levels of granularity of the same observations, or aggregate properties of the observation sequence [9][6]. CRFs have been applied successfully in text processing, such as named entity extraction [2] and shallow parsing [6], but there is little work applying them to video processing.

In this paper, we apply CRFs to semantic event detection. In our approach, we take advantage of mid-level keywords [11] to reduce the semantic gap between low-level features and high-level events. The method first detects mid-level semantic keywords from low-level audio/visual features, and then CRFs infer semantic event labels from the multiple mid-level keyword sequences.

The rest of this paper is organized as follows. In Section 2, we briefly describe the concepts of CRFs and the relevant algorithms. In Section 3, we propose the CRF-based semantic event detection approach. To evaluate the effectiveness of this method, extensive experiments over 12.6 hours of soccer videos are reported in Section 4. Finally, concluding remarks are given in Section 5.

Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06) 0-7695-2646-2/06 $20.00 © 2006 IEEE

2. Conditional random fields

In this section, we describe the basic concepts of CRFs and the relevant algorithms for sequence labeling. More details on CRFs can be found in [6].

2.1 CRFs concepts

In contrast to directed HMMs, CRFs are undirected probabilistic graphical models, as shown in Fig. 1(b). Let $o = [o_1, o_2, \ldots, o_T]$ be the input observation sequence and $s = [s_1, s_2, \ldots, s_T]$ be the corresponding label sequence, e.g., the sequence of event labels ($s_t = 1$ represents an event and $s_t = 0$ a non-event, $t = 1, 2, \ldots, T$). Conditioned on the observation sequence $o$ and the surrounding labels $s_{\setminus t}$, the random variable $s_t$ obeys the Markov property

$$p(s_t \mid s_{\setminus t}, o) = p(s_t \mid s_w, o, w \sim t),$$

where $w \sim t$ means that the nodes $w$ are neighbors of the node $s_t$ in the graphical model, i.e., its Markov blanket.

Fig 1. (a) HMM models and (b) CRFs models.

The conditional probability $p_\theta(s \mid o)$ of CRFs is defined as follows:

$$p_\theta(s \mid o) = \frac{1}{Z(o)} \exp\Big(\sum_{t=1}^{T} F(s, o, t)\Big) \qquad (1)$$

where $Z(o) = \sum_{s} \exp\big(\sum_{t=1}^{T} F(s, o, t)\big)$ is the normalization constant and $F(s, o, t)$ is the feature function at position $t$. For first-order CRFs, the feature function $F(s, o, t)$ is given by:

$$F(s, o, t) = \sum_i \lambda_i f_i(s_{t-1}, s_t) + \sum_j \mu_j g_j(o, s_t) \qquad (2)$$

in which $f_i(\cdot)$ and $g_j(\cdot)$ are the transition and state feature functions, respectively, and $\theta = \{\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots\}$ are the learned weights associated with $f_i(\cdot)$ and $g_j(\cdot)$. Compared with the observation probability $p(o_t \mid s_t)$ of HMMs, the state feature functions $g_j(o, s_t)$ of CRFs depend not only on the current observation $o_t$ but also on past and future observations in $o$. Although an SVM can make decisions based on dependent local observation segments, its objective function does not consider relationships between states, e.g., $f_i(s_{t-1}, s_t)$, in the label sequence. In theory, therefore, CRFs are better suited to the sequence labeling problem than SVMs and HMMs.

2.2 Training and inference of CRFs

The parameters $\theta = \{\lambda_1, \lambda_2, \ldots\}$ of CRFs are trained by maximum likelihood estimation (MLE). Given N training sequences, the log-likelihood L is written as



$$L = \sum_{j=1}^{N} \log p_\theta(s^{j} \mid o^{j}) = \sum_{j=1}^{N} \Big( \sum_{t=1}^{T} F(s^{j}, o^{j}, t) - \log Z(o^{j}) \Big) \qquad (3)$$
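As an illustration of the log-likelihood term for a single sequence, the following minimal NumPy sketch evaluates $\sum_t F(s, o, t) - \log Z(o)$ for a first-order chain. It assumes that the weighted feature sums have already been collapsed into two score tables, `trans[a, b]` for $\sum_i \lambda_i f_i(a, b)$ and `state[t, b]` for $\sum_j \mu_j g_j(o, s_t = b)$, and that no transition score is applied at the first position. These names, the toy numbers and the boundary convention are assumptions made for illustration and do not come from the toolkits used in the paper.

```python
import numpy as np

def sequence_score(labels, trans, state):
    """Sum over t of F(s, o, t) for one label sequence (no transition at t = 0)."""
    score = state[0, labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1], labels[t]] + state[t, labels[t]]
    return score

def log_partition(trans, state):
    """log Z(o), computed with the forward recursion in log space."""
    alpha = state[0].copy()
    for t in range(1, state.shape[0]):
        # alpha_new[b] = logsumexp over a of (alpha[a] + trans[a, b]) + state[t, b]
        alpha = np.logaddexp.reduce(alpha[:, None] + trans, axis=0) + state[t]
    return np.logaddexp.reduce(alpha)

def log_likelihood(labels, trans, state):
    """log p(s | o) for one sequence: score of s minus log Z(o)."""
    return sequence_score(labels, trans, state) - log_partition(trans, state)

# Toy example (hypothetical scores): 2 labels (0 = non-event, 1 = event), 3 slices.
trans = np.array([[0.5, -0.5], [-0.5, 0.5]])
state = np.array([[1.0, 0.0], [0.0, 0.3], [0.0, 1.2]])
print(log_likelihood([0, 1, 1], trans, state))
```

The forward recursion costs time linear in the sequence length, whereas summing over all label sequences directly grows exponentially. Summing this quantity over the N training sequences gives L, which a gradient-based optimizer can then maximize.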

It has been shown that the L-BFGS quasi-Newton method converges much faster when learning the parameters $\theta$ than traditional iterative scaling algorithms such as GIS and IIS [6]. L-BFGS avoids explicitly computing the Hessian matrix of the log-likelihood by building up an approximation of it from successive evaluations of the gradient. After training, the most probable label sequence $s^*$ given an observation sequence $o$ is inferred by:


$$s^* = \arg\max_{s} p_\theta(s \mid o) = \arg\max_{s} \exp\Big(\sum_{t=1}^{T} F(s, o, t)\Big)$$

$s^*$ can be efficiently calculated by the Viterbi algorithm, which finds the highest-scoring label sequence with a dynamic-programming recursion over the positions of the sequence [7].

3. CRFs based Event Detection

In this section, we propose the CRF-based event detection approach. The method first detects mid-level semantic keywords from low-level audio/visual features, and then CRFs jointly infer semantic event labels from the multiple keyword sequences.

3.1 Framework of event detection

In general, events appear with particular patterns. Similar to the progression from content (word) to context (sentence) in text mining, the events of a video are characterized by certain multimedia content elements and their temporal layout. For example, music and camera motion are two important kinds of content elements that often appear in movies; movie segments with high tempo, extreme emotion and tense music usually indicate highlight scenes. Through semantic keyword/concept detection, the content elements can be extracted as keywords, and their temporal layout forms the keyword sequence. Event detection can therefore be viewed as a sequence labeling problem, i.e., inferring the semantic event labels $s$ from the observed multiple keyword sequences $o_k = [x_{k1}, x_{k2}, \ldots, x_{kN}]$, with $k = 1, \ldots, K$ keywords and N time slices.

Fig. 2 illustrates our event detection framework. The framework consists of three levels, i.e., low-level feature extraction, mid-level semantic keyword detection and high-level event detection. In processing, the low-level module first extracts audio/visual features from the video stream. Then the mid-level module detects semantic keywords from the low-level features. Finally, the high-level module infers events in the semantic space of these keyword sequences. In the framework, we take advantage of mid-level keywords to bridge the large semantic gap between low-level features and high-level events [11]. Since mid-level keywords convert video streams into sequences of text symbols, high-level event detection can be conveniently processed and analyzed like text mining.

Fig 2. Overview framework: low-level feature extraction (visual and audio features) yields multimodal features, mid-level keyword detection yields the keyword streams x1, ..., xN, and high-level event detection outputs the prediction result.
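The Viterbi decoding step used for inference in Section 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the FlexCRFs implementation: the score tables `trans` (transition scores) and `state` (per-slice state scores), the function names and the toy numbers are all hypothetical.

```python
import numpy as np

def path_score(labels, trans, state):
    """Total score of one label path (no transition score at the first slice)."""
    score = state[0, labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1], labels[t]] + state[t, labels[t]]
    return float(score)

def viterbi(trans, state):
    """Return the highest-scoring label path and its score."""
    T, K = state.shape
    delta = state[0].copy()               # best score of any path ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans     # cand[a, b]: best path ending in a, then a -> b
        backptr[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + state[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):         # follow the back-pointers from the last slice
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy example (hypothetical scores): label 0 = non-highlight, 1 = highlight, 4 slices.
trans = np.array([[0.5, -0.5], [-0.5, 0.5]])
state = np.array([[1.0, 0.0], [0.2, 0.0], [0.0, 1.5], [0.0, 0.4]])
best_path, best_score = viterbi(trans, state)
print(best_path)  # [0, 0, 1, 1]
```

Because the decoded path assigns a label to every slice, the event boundary falls out of the labeling itself: here the highlight starts at the third slice, with no sliding window needed to propose candidate regions.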





3.2. Mid-level keyword detection

The mid-level module detects relevant semantic keywords from the low-level audio/visual features of videos. Keywords denote basic semantic concepts in a frame or a shot, such as subject (face, car, building, road, sky, water, grass), place (indoor and outdoor), sound type (silence, speech, music, applause, explosion), camera motion (pan, tilt, zoom), and view type (global view, medium view and close-up view). In general, the relevant keywords depend on the application. For instance, car and speed are relevant keywords for the surveillance of vehicle traffic, while face and speech are important for cast indexing in movies.

In the case of soccer highlight detection, we detect the following multi-modal keywords for high-level semantic inference. In the visual domain, there are three relevant keywords: semantic view type, play-position, and replay. In the audio domain, we detect two significant keywords: the commentator's excited speech and the referee's whistle. The keywords are generated as follows:

- x1 View type: View type plays a critical role in video understanding. We predefine four view types: global view, medium view, close-up view and out of view [1][11]. We then use the playfield area and the player size to determine the view type of each key frame. The corresponding low-level processing includes playfield segmentation by the HSV dominant color of the playfield and connected-component analysis. The dominant color of the playfield is adaptively trained by accumulating HSV color histograms over a large number of frames [14]. Fig. 3 shows examples of these view types.
- x2 Play-position: Play-position indicates potential highlights in field-ball sports video (play near the penalty area, or a transition from the middle area to the penalty area). We classify the play-position in global views into 5 regions, as shown in Fig. 4(b). “LL” denotes the left region, including the two corners and the penalty area in the left half-field, “ML” the middle-left region, and “MM” the middle region. In the implementation, we first apply the Hough transform to detect playfield line marks, including boundary lines, the middle line and penalty box lines. We then use a decision tree to determine the play-position from the slopes and positions of the lines [1][10].
- x3 Replay: Replay is an important video editing technique in broadcast programs. It is usually used to play back important or interesting segments in slow motion so that the audience can enjoy the details. Generally, a logo flies across the screen at high speed at the beginning and end of each replay (see Fig. 3, right). We detect logo patterns by color and optical-flow motion features and then identify similar replay segments by dynamic programming [17].
- x4, x5 Audio keywords: We detect two types of audio keywords: the commentator's excited speech and the referee's whistle. They are strongly related to soccer highlights such as goals, shots and fouls. Gaussian mixture models (GMMs) are used to detect these two keywords from low-level audio features including Mel-frequency cepstral coefficients (MFCC), energy and pitch [11][15].

It is worth pointing out that the proposed mid-level module is an open framework. More advanced features and mid-level keywords can readily be incorporated for special applications.
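To make the view-type keyword more concrete, the sketch below shows one possible way to turn a playfield-color mask and a player-size cue into one of the four view types. It is a simplified stand-in rather than the paper's procedure: the fixed hue range replaces the adaptively trained dominant color, connected-component analysis is omitted, and every threshold, function name and the synthetic frame are hypothetical.

```python
import numpy as np

def playfield_ratio(hsv_frame, hue_range=(35, 85), min_sat=60):
    """Fraction of pixels whose hue and saturation fall in an assumed 'grass' range."""
    hue = hsv_frame[..., 0]
    sat = hsv_frame[..., 1]
    mask = (hue >= hue_range[0]) & (hue <= hue_range[1]) & (sat >= min_sat)
    return float(mask.mean())

def classify_view(field_ratio, player_height_ratio):
    """Map playfield area and relative player height to a view type.

    All thresholds are illustrative guesses, not values from the paper.
    """
    if field_ratio < 0.05:
        return "out of view"
    if player_height_ratio > 0.5:
        return "close-up"
    if field_ratio > 0.6 and player_height_ratio < 0.15:
        return "global"
    return "medium"

# Synthetic 90x160 HSV frame: top 20 rows are stands, the remainder is grass-colored.
frame = np.zeros((90, 160, 3), dtype=np.uint8)
frame[20:, :, 0] = 60    # hue inside the assumed grass range
frame[20:, :, 1] = 200   # saturated
ratio = playfield_ratio(frame)
print(round(ratio, 3), classify_view(ratio, player_height_ratio=0.05))
```

In practice the hue range would be estimated from accumulated color histograms as described above, and the player size would come from the segmented non-field blobs inside the playfield.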

Fig. 3. From left to right: examples of global view, medium view, close-up view, out of view, and a replay logo.

(a) (b) Fig. 4. (a) Hough line detection on a segmented playfield (b) Five detected regions in the playfield

3.3. High-level event detection using CRFs

The goal of semantic event detection is to find meaningful events (event type) and their starting and ending times (event boundary). In classifier-based event detection, SCA depends on a candidate event region, e.g., a sliding window, to decide whether an event happened in the corresponding video segment. Since different events occur at arbitrary times and with varying durations, SCA has difficulty determining accurate candidate event regions over the whole video without prior knowledge. In CRF-based event detection, labeling the mid-level keyword sequences automatically decodes the semantic event labels and outputs both the event type and the detailed event region from the whole observed keyword sequence. CRF-based event detection is therefore more convenient in practice and more likely to achieve better performance, because inference is performed jointly over the entire video sequence.

For CRF-based event detection, the transition feature functions $f_i(s_{t-1}, s_t)$ (i.e., the edges of the undirected graph) are defined by the model. Users only need to specify the state feature functions $g_j(o, s_t)$. At each time position t of a given sequence, we assume a Markov blanket and extract combined state features within it. Given two keyword sequences, the context-combination state features at position t are defined and illustrated in Fig 5 and Table 1. Here, v and w are two kinds of mid-level keywords, and $s_t$ is the event label. By allowing arbitrary dependencies on the observation sequence, CRFs are more expressive for modeling temporal sequences than HMMs and SVMs. In addition, the chosen features may be at different levels of granularity of the same observations, or aggregate properties of the observation sequence.

The CRF model is flexible and extensible. For different applications, one can incorporate different keyword sequences and state features $g_j(o, s_t)$, and set an appropriate size of the Markov blanket according to prior knowledge or feature selection methods. In the case of soccer highlight detection, we empirically set the size of the Markov blanket to 5, take the view type x1 as one keyword sequence v, and encode all the other mid-level keywords [x2 x3 x4 x5] as another sequence w. We define $s_t = 1$ if the current slice t belongs to a highlight event, and $s_t = 0$ otherwise.

Fig 5. Markov blanket of CRFs at position t with size=5. In the example of soccer highlights detection, G: global view; M: medium view; C: close-up view; LL: left region; ML: middle left region; R: replay; E: excited speech.

Table 1. CRF feature templates at position t.

| Feature type | Label part | Observation part |
|---|---|---|
| Transition features | $s_{t-1} s_t$ | (none) |
| State features | $s_t$ | $w_{t-2}$, $w_{t-1}$, $w_t$, $w_{t+1}$, $w_{t+2}$, $w_{t-1}w_t$, $w_t w_{t+1}$ |
| State features | $s_t$ | $v_{t-2}$, $v_{t-1}$, $v_t$, $v_{t+1}$, $v_{t+2}$, $v_{t-2}v_{t-1}$, $v_{t-1}v_t$, $v_t v_{t+1}$, $v_{t+1}v_{t+2}$ |
| State features | $s_t$ | $v_{t-2}v_{t-1}v_t$, $v_{t-1}v_t v_{t+1}$, $v_t v_{t+1}v_{t+2}$ |
| State features | $s_t$ | $v_{t-1}v_t v_{t+1}w_t$ |
| State features | $s_t$ | $v_{t-1}w_{t-1}$, $v_t w_t$, $v_{t-1}v_t w_{t-1}$, $v_{t-1}v_t w_t$, $v_{t-1}w_{t-1}w_t$, $v_t w_{t-1}w_t$ |
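The state-feature templates of Table 1 can be instantiated mechanically at each position. The sketch below is one possible encoding under stated assumptions: each template is a list of (sequence name, offset) pairs, out-of-range positions are filled with a hypothetical `<pad>` symbol, and the string format of the features and the example keyword sequences are illustrative choices that the paper does not specify.

```python
# Each template lists (sequence, offset) pairs relative to position t, row by row as in Table 1.
TEMPLATES = [
    # w unigrams and bigrams
    [("w", -2)], [("w", -1)], [("w", 0)], [("w", 1)], [("w", 2)],
    [("w", -1), ("w", 0)], [("w", 0), ("w", 1)],
    # v unigrams and bigrams
    [("v", -2)], [("v", -1)], [("v", 0)], [("v", 1)], [("v", 2)],
    [("v", -2), ("v", -1)], [("v", -1), ("v", 0)], [("v", 0), ("v", 1)], [("v", 1), ("v", 2)],
    # v trigrams
    [("v", -2), ("v", -1), ("v", 0)], [("v", -1), ("v", 0), ("v", 1)], [("v", 0), ("v", 1), ("v", 2)],
    # v trigram combined with the current w
    [("v", -1), ("v", 0), ("v", 1), ("w", 0)],
    # mixed v/w combinations
    [("v", -1), ("w", -1)], [("v", 0), ("w", 0)],
    [("v", -1), ("v", 0), ("w", -1)], [("v", -1), ("v", 0), ("w", 0)],
    [("v", -1), ("w", -1), ("w", 0)], [("v", 0), ("w", -1), ("w", 0)],
]

def state_features(v, w, t, pad="<pad>"):
    """Instantiate every template at position t; positions outside the sequence use `pad`."""
    seqs = {"v": v, "w": w}
    feats = []
    for template in TEMPLATES:
        parts = []
        for name, off in template:
            k = t + off
            sym = seqs[name][k] if 0 <= k < len(seqs[name]) else pad
            parts.append(f"{name}[{off:+d}]={sym}")
        feats.append("|".join(parts))
    return feats

# Hypothetical keyword sequences using the symbols from the Fig 5 legend.
v = ["G", "G", "G", "C", "C"]
w = ["LL", "LL", "E", "R", "R"]
features_at_2 = state_features(v, w, t=2)
print(len(features_at_2), features_at_2[2], features_at_2[19])
```

Each resulting string, paired with a candidate label $s_t$, corresponds to one binary state feature $g_j(o, s_t)$. The window of offsets from -2 to +2 matches the Markov blanket size of 5 chosen for the soccer experiments.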

4. Experiments

In our experiments, we used the libSVM [13], Intel OpenPNL [8] and FlexCRF [7] toolkits for the training and inference of linear SVMs, first-order HMMs and first-order CRFs, respectively. To demonstrate the effectiveness of the proposed approach, soccer highlight detection experiments were conducted on eight soccer matches totaling 12.6 hours of video. Highlights are video segments in which the user has special or elevated interest. We define the semantic events “goal”, “shot”, “foul”, “free kick” and “corner kick” as highlight events, and all others as non-highlights. Five matches are used as training data, and the others as testing data. The ground truth is labeled manually.

To compare the performance of SVM, HMM and CRFs fairly, we use the same video segments as input for highlight detection. Since televised soccer generally uses a close-up view or a replay as a break to emphasize a highlight event, we first filter the multiple keyword sequences to find candidate event segments, i.e., play-break units, using the algorithm described in [1]. Then each keyword stream of a play-break unit is sampled in time and represented by an N-dimensional feature vector $x_k = [x_{k1}, x_{k2}, \ldots, x_{kN}]$, where $k = 1, 2, \ldots, K$, with K=5 keywords and N=40 time slices. These vectors are used by all approaches to detect events.

The most widely used performance measures in information retrieval are precision (Pr) and recall (Re). Based on Pr and Re, the F-score = 2*Pr*Re/(Pr+Re) evaluates the overall performance. Since events always happen in particular time regions, a good detector should not only predict the correct event type, i.e., the event label, but also detect accurate event boundaries that cover the full event content. We therefore further define a segment-level measure and a slice-level measure to evaluate event detection performance. For a predicted video segment, the segment-level measure is true if the segment has at least 80% overlap with the real event region. Similarly, for a predicted video slice, the slice-level measure is true if the predicted slice also lies in the real event region. The performance for a whole video is then the average over all predicted segments/slices. The segment-level measure is suitable for evaluating recall because it does not depend on accurate event boundaries. The slice-level measure, on the other hand, better evaluates how accurate the predicted event boundaries are. Table 2 and Table 3 summarize the highlight detection performance under the segment-level and slice-level measures, respectively.

Table 2. Comparison on soccer highlight detection in segment level measure. Ground truth of highlights is 205.

| Method | Miss | False | Precision | Recall | F-Score |
|---|---|---|---|---|---|
| SVM | 26 | 27 | 86.9% | 87.3% | 87.10% |
| HMM | 30 | 45 | 79.5% | 85.4% | 82.34% |
| CRFs | 28 | 20 | 89.8% | 86.3% | 88.02% |

Table 3. Comparison on soccer highlights detection in slice level measure.

| Method | Precision | Recall | F-Score |
|---|---|---|---|
| SVM | 73.4% | 60.4% | 66.3% |
| HMM | 70.0% | 60.7% | 65.3% |
| CRFs | 78.5% | 71.6% | 74.9% |

From the two tables, the following observations can be made:

- For the segment-level measure, CRFs achieve a slightly better F-score than the SVM approach (88.02% vs. 87.10%) and a clearly better one than HMMs (82.34%), since CRFs relax the strong first-order Markov dependence assumptions and avoid a fundamental limitation of directed graphical models. The lower precision of HMMs reflects their inability to capture long-term interactions in sequence labeling.
- For the slice-level measure, CRFs clearly outperform the other approaches (F-score of 74.9% vs. 66.3% for SVM and 65.3% for HMMs), since sequence learning approaches can automatically predict the event boundary by sequence labeling. This can also explain why CRFs obtain the highest precision in the segment-level measure of Table 2. In contrast, SCA methods (SVM) cannot detect accurate event boundaries because they depend on prior knowledge to decide the possible event regions. The lower precision of HMMs further demonstrates their deficiency in semantic event detection.

Both SCA (SVM) and SLA (HMMs and CRFs) have their advantages and disadvantages, and can be applied in different applications. For example, if we only care about the rough position of each highlight, or if prior knowledge can filter possible event segments accurately, then SCA is sufficient. However, if we require precise event boundaries and have no prior knowledge with which to segment the videos, then we have to use SLA.
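As a worked check of the measures defined in this section, the precision and recall in Table 2 can be recomputed from the miss and false-alarm counts. The sketch assumes that the number of correct detections equals the ground-truth count minus the misses. This is an assumption, but it reproduces the reported precision and recall to one decimal place. The function name is hypothetical.

```python
def precision_recall_f(ground_truth, miss, false_alarm):
    """Precision, recall and F-score from segment-level counts."""
    correct = ground_truth - miss              # assumed: every non-missed highlight is a hit
    precision = correct / (correct + false_alarm)
    recall = correct / ground_truth
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# (miss, false) counts from Table 2; the ground truth contains 205 highlights.
table2 = {"SVM": (26, 27), "HMM": (30, 45), "CRFs": (28, 20)}
results = {m: precision_recall_f(205, miss, fa) for m, (miss, fa) in table2.items()}
for method, (p, r, f) in results.items():
    print(f"{method}: Pr={p:.1%} Re={r:.1%} F={f:.2%}")
```

F-scores computed from the unrounded precision and recall can differ from the tabulated values in the second decimal place (for example about 88.06% versus 88.02% for CRFs). One explanation, consistent with the numbers, is that the tabulated F-scores were computed from the rounded percentages.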

5. Conclusion

In this paper, we propose a CRF-based semantic event detection approach. The method first extracts mid-level keyword sequences from low-level multi-modality features, and then employs CRFs to jointly infer semantic event labels from the multiple keyword sequences. Compared with traditional approaches, e.g., HMMs and SVMs, CRFs offer several particular advantages, including the ability to relax the strong independence assumptions on state transitions and to avoid a fundamental limitation of directed graphical models. The experiments on soccer highlight detection demonstrate that CRFs achieve better performance than SVM and HMM, particularly under the slice-level measure. It is worth pointing out that the proposed method can be applied to other kinds of event-based video applications, e.g., video surveillance, video summarization and content-based retrieval.

6. Acknowledgements

The authors are thankful to YangBo, WangFei, Sun Yi, Prof. Sun Lifeng, and Prof. Ou Zhijian of the Depts. of CS and EE of Tsinghua University for their research on mid-level audiovisual keyword detection.

7. References

[1] A. Ekin, A. M. Tekalp, and R. Mehrotra. Automatic soccer video analysis and summarization. IEEE Trans. on Image Processing, 12(7):796-807, 2003.
[2] A. McCallum. Efficiently inducing features of conditional random fields. In Proc. of Conf. on Uncertainty in Artificial Intelligence (UAI), 2003.
[3] C. G. Snoek and M. Worring. Multimedia event-based video indexing using time intervals. IEEE Trans. on Multimedia, 7(4):638-647, 2005.
[4] D. Q. Phung, T. V. Duong, S. Venkatesh, and H. H. Bui. Topic transition detection using hierarchical hidden Markov and semi-Markov models. In ACM Multimedia, pp. 11-20, 2005.
[5] D. Zhang, D. G. Perez, S. Bengio, and I. McCowan. Semi-supervised adapted HMMs for unusual event detection. In IEEE Conf. on CVPR, vol. 1, pp. 611-618, 2005.
[6] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proc. of HLT/NAACL, 2003.
[7] FlexCRFs: Flexible Conditional Random Fields.
[8] Intel Open Source Probabilistic Network Library (OpenPNL).
[9] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pp. 282-289, 2001.
[10] J. Wang, C. Xu, E. Chng, K. Wan, and Q. Tian. Automatic replay generation for soccer video broadcasting. In ACM Multimedia Conference, 2004.
[11] L. Duan, M. Xu, T.-S. Chua, Q. Tian, and C. Xu. A mid-level representation framework for semantic sports video analysis. In ACM Multimedia Conference, 2003.
[12] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun. Structure analysis of soccer video with hidden Markov models. In Proc. ICASSP, 4:4096-4099, 2002.
[13] LIBSVM: A Library for Support Vector Machines.
[14] M. Luo, Y. Ma, and H. J. Zhang. Pyramidwise structuring for soccer highlight extraction. In ICICS-PCM, pp. 1-5, 2003.
[15] M. Xu, N. Maddage, C. Xu, M. Kankanhalli, and Q. Tian. Creating audio keywords for event detection in soccer video. In IEEE ICME 2003, vol. 2, pp. 281-284, 2003.
[16] N. Haering, R. J. Qian, and M. I. Sezan. A semantic event-detection approach and its application to detecting hunts in wildlife video. IEEE Trans. on Circuits and Systems for Video Technology, 10(6):857-868, 2000.
[17] X. Yang, P. Xue, and Q. Tian. Repeated video clip identification system. In ACM Multimedia 2005, pp. 227-228, 2005.
