2010 International Conference on Pattern Recognition
Action Recognition Using Spatial-Temporal Context

Qiong Hu 1, Lei Qin 1, Qingming Huang 1,2,*, Shuqiang Jiang 1, Qi Tian 3
1 Key Lab of Intelli. Info. Process., Inst. of Comput. Tech., CAS, Beijing 100190, China
2 Graduate University of Chinese Academy of Sciences, Beijing 100190, China
3 Department of Computer Science, University of Texas at San Antonio, TX 78249, U.S.A.
{qhu, lqin, qmhuang, sqjiang}@jdl.ac.cn, [email protected]
Abstract—Spatial-temporal local features and the bag-of-words representation have been widely used in the action recognition field. However, this framework usually neglects the internal spatial-temporal relations between video-words, resulting in ambiguity in the action recognition task, especially for videos "in the wild". In this paper, we address this problem by utilizing the volumetric context around a video-word. A local histogram of the video-word distribution is calculated, which is referred to as the "context" and further clustered into contextual words. To effectively use the contextual information, descriptive video-phrases (ST-DVPs) and descriptive video-cliques (ST-DVCs) are proposed. A general framework for ST-DVP and ST-DVC generation is described, and action recognition can then be done based on all these representations and their combinations. The proposed method is evaluated on two challenging human action datasets: the KTH dataset and the YouTube dataset. Experiment results confirm the validity of our approach.

Keywords—spatial-temporal local features; spatial-temporal context; descriptive video-phrases; descriptive video-cliques; action recognition
I. INTRODUCTION
Action recognition is an active research field in computer vision, which has extended from videos captured in constrained scenarios, e.g. the KTH dataset [9], to videos "in the wild". The latter has recently attracted increasing attention from researchers: Liu et al. [5] collected an action dataset from YouTube, and Laptev et al. [2] learned realistic actions from Hollywood movies. Due to cluttered backgrounds, camera motion, occlusion and the appearance variance of objects in wild videos, successful extraction of good features [5] and robust feature representation [8] are crucial for action recognition.

In [8], a bag-of-video-words model for action recognition is proposed, and it has been widely exploited in the literature [2][4][5][6][10]. This model treats a video sequence as a collection of video-words extracted by a spatial-temporal interest point detector, e.g. an extended 3D Harris detector [3] or a separable linear filter [1]. The statistical distribution of video-words provides an abstract and compact representation for videos, but the spatial-temporal layout information is lost, resulting in a significant performance degradation for action recognition in unconstrained videos.

To enhance discrimination, state-of-the-art approaches fall into two categories. The first incorporates more visual cues: Liu et al. [4] employed Fiedler Embedding to embed local motion features and quantized spin images into a common Euclidean space, and Lin et al. [7] proposed an action prototype tree built in a joint shape-motion space. Another line of work tries to capture the spatial-temporal structural information of video-words: a modified spatial correlogram [6] and a spatial-temporal pyramid matching method [2] have been utilized for this task. More elaborately, Savarese et al. [10] introduced spatial-temporal correlatons to capture the underlying distribution of video-words. However, the membership of ST-correlatons is lost, and finding the optimal size/shape of the set of kernels is another open issue in their method.

Motivated by these problems and inspired by Zhang's work in [12], we propose descriptive video-phrases and descriptive video-cliques in the spatial-temporal domain, or ST-DVPs and ST-DVCs for short. ST-DVPs are defined as the distinctive and frequently co-occurring video-word pairs in videos, while ST-DVCs are the representative video cliques composed of video-words and their surrounding contextual words. Intuitively, the context helps capture and understand the semantic meaning of a video-word. The proposed action recognition framework is shown in Fig. 1. The traditional bag-of-video-words model is first constructed, and the context information is then modeled by building a local histogram within a confined cube. The local histograms are further clustered into contextual words. Based on these, ST-DVPs and ST-DVCs are selected according to the Term Frequency-Inverse Document Frequency (TF-IDF) criterion for each action category, respectively. Action recognition can then be done based on all these representations and their combinations.

The rest of the paper is organized as follows. Section 2 introduces the concept of spatial-temporal video-phrases and video-cliques (ST-VPs/ST-VCs). Section 3 illustrates the selection process of ST-DVPs and ST-DVCs. The action recognition method based on these representations is evaluated in Section 4. Finally, Section 5 concludes the paper.
* Corresponding author. This work was supported in part by the National Natural Science Foundation of China: 60773136, in part by the National Basic Research Program of China (973 Program): 2009CB320906, and in part by the Beijing Natural Science Foundation: 4092042.
Figure 1. The flowchart of the proposed action recognition framework (better viewed in color). Stages shown: STIP extraction, codebook generation, local histogram building (context), DVPs/DVCs selection, and recognition; panels (a)-(e) depict the video-word codebook, labeled STIPs within a local cuboid, the contextual word codebook, per-type ST-DVPs/ST-DVCs selected by TF-IDF, action model training, and the output action categories.
II. SPATIAL-TEMPORAL VIDEO-PHRASES AND SPATIAL-TEMPORAL VIDEO-CLIQUES

In this paper, we extend the concept of visual phrases [12] in images to the spatial-temporal domain and apply it to the action recognition task. As shown in Fig. 1(a) and (b), Space-Time Interest Points (STIPs) are detected by the method in [3] and then clustered to form the video-word vocabulary. Each interest point obtains a label by associating it with the nearest video-word, so a video can be represented as a collection of labeled STIPs, as indicated in Fig. 1(c), where different colors and shapes represent different words. The local label distribution pattern can thus be used as the "context" of a video-word, and it is further quantized to form contextual words. Based on these primitive elements, namely video-words and contextual words, representations at different levels of abstraction can be extracted from videos.

A spatial-temporal video-phrase (ST-VP) refers to a co-occurring video-word pair within some spatial-temporal extent. In order to capture the co-occurrence of two video-words, the local histogram of [10] is adopted and modified here. As illustrated in Fig. 1(c), the inner pink cube centered at a video-word instance p is provided by the scale estimation during the feature extraction phase in [3]. A surrounding black cube proportional to it in scale is then chosen as the "context", and a local histogram defined below is calculated:

H(p, s) = [t_1, t_2, \ldots, t_n]   (1)

where p is the video-word for which the local histogram is built, s is the scale factor that defines the neighborhood extent, t_i is the frequency of the video-word with label i occurring in the confined neighborhood, and n is the total number of video-words. The size of the neighborhood cube is computed as:

size = s \times size_p   (2)

where size_p is the scale of STIP p obtained in [3]. It is easy to verify that a larger value of s incurs higher computational cost and admits more noise. The point p and each video-word in its local histogram constitute a video-word pair, or an ST-VP.

As ST-VPs only model the relationship between two video-words, more contextual information should be utilized. We therefore propose another concept, the spatial-temporal video-clique (ST-VC). Referring to Fig. 1(c), the video-words enclosed by the black cube, encapsulated in the local histogram described above, constitute the context of video-word p. To capture more information about an action from them, we further cluster the local histograms computed at each interest point into contextual words and define a video-word together with its surrounding contextual word as an ST-VC.

Video-words, ST-VPs and ST-VCs depict the characteristics of an action at different granularities. It is reasonable to believe that they are complementary to each other in the action recognition task, so a general action recognition framework based on them is proposed and evaluated in Section 4.
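For illustration, the following Python sketch (not the authors' implementation) computes the local histogram of Eq. (1) for every STIP using a cubic neighborhood whose side follows Eq. (2), and then quantizes these histograms into contextual words. The array layout, the exclusion of the center point from its own histogram, and the use of scikit-learn k-means are assumptions made here for concreteness.

import numpy as np
from sklearn.cluster import KMeans

def local_histograms(points, labels, scales, s, n_words):
    # points: (N, 3) numpy array of (x, y, t) STIP centers; labels: (N,) numpy array
    # of video-word labels in [0, n_words); scales: (N,) STIP scales; s: scale factor.
    hists = np.zeros((len(points), n_words))
    for idx in range(len(points)):
        half = 0.5 * s * scales[idx]                   # Eq. (2): size = s * size_p
        inside = np.all(np.abs(points - points[idx]) <= half, axis=1)
        inside[idx] = False                            # assumed: p does not count toward its own context
        for lab in labels[inside]:                     # Eq. (1): H(p, s) = [t_1, ..., t_n]
            hists[idx, lab] += 1
    return hists

def contextual_words(hists, n_contextual):
    # Quantize the local histograms into contextual words (k-means is an assumption).
    km = KMeans(n_clusters=n_contextual, n_init=10).fit(hists)
    return km.labels_                                  # one contextual-word label per STIP

Each STIP then yields an ST-VC as the pair (video-word label, contextual-word label), and every non-zero bin of its local histogram yields an ST-VP with the center video-word.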
III. ST-DVPS AND ST-DVCS SELECTION
Intuitively, every action category has its unique representation elements or element combinations that differentiate it from all other action categories. Identifying these category-specific descriptive patterns is of great significance for action classification and recognition. To this end, the selected patterns should have the following properties: (1) they appear frequently in the action category they are meant to describe; (2) to be unique to one category, they appear rarely in all other categories. These two properties conform to the TF-IDF weighting of information retrieval theory, so TF-IDF is employed here to select such patterns. We call the selected patterns ST-DVPs and ST-DVCs, respectively, according to the pools from which they are selected, i.e. ST-VPs and ST-VCs. The selection process can be decomposed into two steps: co-occurrence calculation and TF-IDF weighting based selection.
Figure 2. Influence of parameters: (a) influence of the video-word vocabulary size on average accuracy (results on KTH and YouTube); (b)-(c) influence of the scale factor s for the DVPs-based and DVCs-based methods, with settings AR on KTH (vw_200_DVP_400), AR on YouTube (vw_150_DVP_400), AR on KTH (vw_100_cw_50_DVC_400), and AR on YouTube (vw_150_cw_50_DVC_400).
A. Co-occurrence Calculation

The co-occurrence information is captured in the local histogram described in Section 2. The co-occurrence frequency of two video-words (i.e. a video-phrase), labeled i and j, respectively, in video clip v is calculated as:

T_v^{dvp}(i, j) = \sum_{p \in \{p_i\}} t_j   (3)

where {p_i} is the set of all video-words labeled i in video clip v, and t_j is the j-th element of H(p_i, s), the local histogram centered at video-word p_i.

Similarly, the co-occurrence frequency of a video-word i and a contextual word j (i.e. a video-clique) in video clip v is:

T_v^{dvc}(i, j) = f(vw_i, cw_j)   (4)

In (4), vw_i and cw_j denote the video-word labeled i and the contextual word labeled j, respectively, and f(vw_i, cw_j) is their co-occurrence frequency in video clip v: whenever the center video-word is vw_i and its surrounding contextual word is cw_j, one is added to f(vw_i, cw_j).

T_v^*(i, j) (either T_v^{dvp}(i, j) or T_v^{dvc}(i, j)) is divided by the total number of STIPs in video clip v to offset the unequal lengths of video clips, yielding NT_v^*(i, j). This value is then averaged over all clips of action category C to obtain the intra-class frequency:

T_C^*(i, j) = \frac{1}{|T_v|} \sum_{v \in C} NT_v^*(i, j)   (5)

where |T_v| is the total number of video clips belonging to action category C. As a result, T_C^{dvp} is a VW_num x VW_num sized matrix, and T_C^{dvc} is a VW_num x CW_num sized matrix.
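As a concrete illustration of Eqs. (3)-(5) (a sketch, not the authors' code), the function below accumulates the two co-occurrence matrices for one action category from per-clip STIP labels, contextual-word labels and local histograms; the clip data structure is hypothetical.

import numpy as np

def category_cooccurrence(clips, n_vw, n_cw):
    # clips: list of dicts with numpy arrays 'labels' (per-STIP video-word labels),
    # 'ctx_labels' (per-STIP contextual-word labels) and 'hists' (per-STIP local
    # histograms of shape (N, n_vw)). Returns (T_C^dvp, T_C^dvc) for the category.
    T_dvp = np.zeros((n_vw, n_vw))
    T_dvc = np.zeros((n_vw, n_cw))
    for clip in clips:
        labels, ctx, hists = clip['labels'], clip['ctx_labels'], clip['hists']
        n_stips = float(len(labels))
        Tv_dvp = np.zeros((n_vw, n_vw))
        Tv_dvc = np.zeros((n_vw, n_cw))
        for i in range(n_vw):
            sel = (labels == i)
            Tv_dvp[i] = hists[sel].sum(axis=0)         # Eq. (3): sum of t_j over p in {p_i}
            for cw in ctx[sel]:
                Tv_dvc[i, cw] += 1                     # Eq. (4): f(vw_i, cw_j)
        T_dvp += Tv_dvp / n_stips                      # normalize by clip length -> NT_v^*
        T_dvc += Tv_dvc / n_stips
    return T_dvp / len(clips), T_dvc / len(clips)      # Eq. (5): average over clips in C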
B. TF-IDF Weighting Based Selection

With the co-occurrence frequencies within one action category calculated by formula (5), the frequency over all categories is:

T^* = \frac{1}{|C|} \sum_{C} T_C^*   (6)

Here, |C| is the total number of action types in an action dataset, and T^* is a uniform notation for T^{dvp} and T^{dvc}. A score is then obtained by:

S^*(i, j) = T_C^*(i, j) / T^*(i, j)   (7)

Finally, for each action category, the scores of ST-VPs and ST-VCs are sorted, respectively, and the top N and M ones are selected as ST-DVPs and ST-DVCs.
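A minimal sketch of the selection step, assuming per-category co-occurrence matrices (as produced above) are available; returning index pairs is an illustration choice, and a small epsilon is added to avoid division by zero for bins that never occur.

import numpy as np

def select_descriptive(T_per_category, top_k, eps=1e-12):
    # T_per_category: dict mapping action category -> co-occurrence matrix T_C^*.
    # Returns, per category, the top_k (i, j) index pairs ranked by the score of Eq. (7).
    T_all = np.mean(list(T_per_category.values()), axis=0)       # Eq. (6)
    selected = {}
    for cat, T_C in T_per_category.items():
        S = T_C / (T_all + eps)                                  # Eq. (7)
        order = np.argsort(S, axis=None)[::-1][:top_k]           # sort scores, keep top k
        selected[cat] = [np.unravel_index(o, S.shape) for o in order]
    return selected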
IV. EXPERIMENTS AND DISCUSSION

In this section, we assess the performance of our action recognition scheme on two challenging datasets: the KTH dataset [9] and the YouTube dataset [5]. Results are reported for a set of experiments. STIPs are extracted by the method in [3] and described by the histogram of gradients (HoG) descriptor. Recognition is carried out through the leave-one-out cross-validation (LOOCV) scheme. Feature points of YouTube videos are preprocessed by Liu's motion feature pruning method [5], which brings a nearly 2% improvement in average accuracy, i.e. from 57.3% to 59.1%, when the vocabulary size is 1000.

First, video-word based action recognition is performed on both datasets with varying vocabulary sizes, and the best outcomes are used as the baseline. The performance saturates when the number of video-words reaches a certain level (Fig. 2(a)).

We then test the performance of ST-DVP and ST-DVC based action recognition, respectively. As ST-DVPs and ST-DVCs are selected from the VW_num x VW_num sized and VW_num x CW_num sized co-occurrence matrices calculated under some scale factor s, the parameters VW_num, CW_num, DVPs/category, DVCs/category and s are closely related to the final outcomes. In our experiments, different values of these parameters are tested and the results compared, leading to the following conclusions: (1) the average accuracy improves with increasing DVPs/category or DVCs/category and also saturates at some value; (2) before saturation, with a fixed DVPs/category or DVCs/category value, the smaller the number of elements in the co-occurrence matrix, the better the results, since a small co-occurrence matrix captures more compact and stable relationships between video elements; (3) the best scale factor s differs from dataset to dataset and is also closely related to the other parameters mentioned here; (4) for a co-occurrence matrix with more video elements, more ST-DVPs/ST-DVCs are needed to represent an action category, and the results improve with a larger scale factor s. A part of the experimental outcomes is shown in Fig. 2(b) and (c) to illustrate these statements.
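For completeness, a sketch of the leave-one-out cross-validation protocol on video-level histograms; the paper does not name the classifier or the exact leave-out unit, so a linear SVM over individual clips is assumed here purely for illustration.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import LinearSVC

def loocv_accuracy(histograms, labels):
    # histograms: (num_clips, dim) video-level feature histograms;
    # labels: (num_clips,) action-category labels.
    X, y = np.asarray(histograms), np.asarray(labels)
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])        # classifier choice is an assumption
        correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)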
Figure 3. Video-words and ST-DVPs.

Figure 4. Results on the YouTube dataset (per-category accuracy for the BoWs, BoWs&DVPs and BoWs&DVCs based methods over categories including b_shooting, cycling, diving, g_swinging, h_riding, s_juggling, swinging, t_swinging, t_jumping, v_spiking and walking). The average accuracy for the BoWs, BoWs&DVPs and BoWs&DVCs based methods is 59.1%, 64.3% and 62.1%, respectively.
Denoting the parameter set of the ST-DVP based method as the triplet (VW_num, DVPs/category, s), the KTH dataset peaks at (200, 400, 3.5) with 84.3% average accuracy, while the YouTube dataset reaches its maximum average accuracy of 62.1% at (150, 400, 1.5). Similarly, expressing the parameters of the ST-DVC based method as the quadruple (VW_num, CW_num, DVCs/category, s), applying (100, 50, 400, 1) and (150, 50, 400, 2.5) on the KTH and YouTube datasets, respectively, yields the peak average accuracies of this method on the two datasets, i.e. 84.1% and 58.1%.

We notice that the ST-DVP based method outperforms the baseline on the YouTube dataset. This is reasonable because ST-DVPs, as selected stable co-occurrence patterns, can remove some of the noise in the STIPs, as indicated in Fig. 3, where the extracted STIPs and the centers of the ST-DVPs are shown for two sample clips from the YouTube dataset (better viewed in color). Most STIPs located on the dynamic background are excluded.

Finally, action recognition based on representation combinations is evaluated. With different weights assigned to the VW, ST-DVP and ST-DVC features, the resulting classification vector is a weighted concatenation:

H = [\alpha H_{vw} \;\; \beta H_{ST\text{-}DVPs} \;\; \gamma H_{ST\text{-}DVCs}]   (8)

where \alpha, \beta and \gamma are the weights assigned to H_{vw}, H_{ST-DVPs} and H_{ST-DVCs}, the distribution histograms of video-words, ST-DVPs and ST-DVCs, respectively. The length of the resulting histogram is the sum of their individual lengths. Two weight combinations, (0.5, 0.5, 0) and (0.5, 0, 0.5) for (\alpha, \beta, \gamma), are tested on the two datasets. The results reach 91.2% for both combinations on the KTH dataset. On the YouTube dataset, the results reach 64.3% and 62.1%, respectively, compared to 59.1% using video-words alone, with a 1% to 13% increase in almost all action categories (Fig. 4). These results substantiate that ST-DVPs and ST-DVCs enhance action recognition performance, and the outcomes are comparable with the state-of-the-art results on the two datasets, i.e. 91.2% [11] on KTH and 65.4% [5] on YouTube.
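The following sketch illustrates the weighted concatenation of Eq. (8); L1-normalizing each histogram before concatenation is an assumption, as the paper does not detail the normalization.

import numpy as np

def combine_histograms(h_vw, h_dvp, h_dvc, alpha=0.5, beta=0.5, gamma=0.0):
    # Eq. (8): H = [alpha*H_vw  beta*H_ST-DVPs  gamma*H_ST-DVCs].
    def l1(h):
        h = np.asarray(h, dtype=float)
        return h / h.sum() if h.sum() > 0 else h
    return np.concatenate([alpha * l1(h_vw), beta * l1(h_dvp), gamma * l1(h_dvc)])

# Example: the (0.5, 0.5, 0) combination reported on the KTH dataset.
# H = combine_histograms(h_vw, h_dvp, h_dvc, 0.5, 0.5, 0.0)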
V. CONCLUSION
In this paper, we introduce spatial-temporal context into action recognition and propose two compact and discriminative video representations, ST-DVPs and ST-DVCs. Their effectiveness has been evaluated on the KTH and YouTube datasets, and the influence of the parameters has been explored extensively. Experiment results verify that considerable performance improvements are achieved in almost all categories when spatial-temporal context is added. More sophisticated context modeling methods are worth exploring in our future work.

REFERENCES
[1] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," In VS-PETS, pp. 65-72, Oct. 2005.
[2] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," In Proc. CVPR, pp. 1-8, 2008.
[3] I. Laptev, "On space-time interest points," International Journal of Computer Vision (IJCV), 2005.
[4] J. Liu, S. Ali, and M. Shah, "Recognizing human actions using multiple features," In Proc. CVPR, 2008.
[5] J. Liu, J. Luo, and M. Shah, "Recognizing realistic actions from videos 'in the wild'," In Proc. CVPR, 2009.
[6] J. Liu and M. Shah, "Learning human actions via information maximization," In Proc. CVPR, 2008.
[7] Z. Lin, Z. Jiang, and L. S. Davis, "Recognizing actions by shape-motion prototype trees," In Proc. ICCV, 2009.
[8] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," In Proc. BMVC, 3: 1249-1258, Sept. 2006.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," In Proc. ICPR, 2004.
[10] S. Savarese, A. Del Pozo, J. C. Niebles, and L. Fei-Fei, "Spatial-temporal correlatons for unsupervised action classification," In Proc. BMVC, Jan. 2008.
[11] Y. Wang and G. Mori, "Human action recognition by semi-latent topic models," PAMI, 2009.
[12] S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, "Descriptive visual words and visual phrases for image applications," In ACM Multimedia, pp. 229-238, Oct. 2009.