EFFICIENT SEARCH OF TOP-K VIDEO SUBVOLUMES FOR MULTI-INSTANCE ACTION DETECTION

Norberto A. Goussies
DC - FCEyN, Univ. de Buenos Aires, Argentina
[email protected]

Zicheng Liu
Microsoft Research, Redmond, WA, USA
[email protected]

Junsong Yuan
School of EEE, Nanyang Technological University, Singapore 639798
[email protected]

ABSTRACT

Action detection was formulated as a subvolume mutual information maximization problem in [8], where each subvolume identifies where and when the action occurs in the video. Although the proposed branch-and-bound algorithm can find the best subvolume efficiently for low-resolution videos, it is still not efficient enough to perform multi-instance detection in videos of high spatial resolution. In this paper we develop an algorithm that further speeds up the subvolume search and targets real-time multi-instance action detection for high-resolution videos (e.g., 320 × 240 or higher). Unlike the previous branch-and-bound search technique, which restarts a new search for each action instance, we find the Top-K subvolumes simultaneously in a single round of search. To handle the larger spatial resolution, we down-sample the video volume for a more efficient upper-bound estimation. To validate our algorithm, we perform experiments on a challenging dataset of 54 video sequences, where each video consists of several actions performed by different people in a crowded environment. The experiments show that our method is not only efficient but also capable of handling action variations caused by changes in performing speed and style, spatial scale changes, as well as cluttered and moving backgrounds.

Keywords: branch-and-bound, action recognition

(Junsong Yuan is supported in part by the Nanyang Assistant Professorship (SUG M58040015).)

1. INTRODUCTION

Detecting human actions in video sequences is an interesting yet challenging problem. It has a wide range of applications, including video surveillance, tele-monitoring of patients and the elderly, medical diagnosis and training, video indexing, and intelligent human-computer interaction. Different from action recognition or categorization [3] [4] [5], where each action video is classified into one of the pre-defined action classes, the task of action detection [6] [7] [8] [9] needs to


identify not only which type of action occurs, but also where (spatial location in the image) and when (temporal location) it occurs in the video; it is in general a more challenging problem. Considering that human actions are complex patterns characterized by spatio-temporal volumes in videos, and in analogy to sliding-window search for object detection, detecting actions amounts to searching for subvolumes in the volumetric video space [8].

Despite previous successes of subvolume-based action detection, the spatio-temporal branch-and-bound search method proposed in [8] has two limitations. First, it searches for only one action instance at a time. To find the top K actions with the highest detection scores, it requires K rounds of branch-and-bound search, eliminating the detected action each time and then searching the rest of the video space for another action, and is thus inefficient for finding multiple actions. In general, for both action detection [8] and object detection [10], the branch-and-bound search returns only the best solution from the candidate set, so it must be performed many times for multiple detections. Second, to achieve close to real-time detection speed, the videos in [8] are constrained to a limited spatial resolution, e.g., 160 × 120. Because the search speed is in practice linear in the screen size m × n (with worst case quadratic in m × n), more computing power is required to handle videos of high resolution; for example, the search on a 320 × 240 screen can be more than four times slower than on a 160 × 120 screen. Thus a more efficient method is required.

To address these two challenges, we first improve the branch-and-bound search method to directly perform Top-K volume search in videos, which enables the detection of multiple action instances simultaneously. Moreover, to handle the larger screen size, we down-sample the video frames for a more efficient upper-bound estimation. Our proposed action detection method has several advantages: (1) it is efficient for multi-instance action detection and achieves close to real-time performance at a frame size of


320 × 240; (2) it is robust to scale changes, subject changes, cluttered backgrounds, performing-speed variations, and even partial occlusions. The Top-K volume search algorithm is general and can be applied to other types of pattern search problems in videos. The experimental results on a challenging dataset of 54 video sequences validate the efficiency and effectiveness of the multiple action detection method.

2. RELATED WORK

Despite the large body of work on action recognition, action detection is less addressed in the literature. Compared with action recognition, action detection is more challenging, as it needs to locate the actions in cluttered or dynamic backgrounds [6] [7] [8] [11] [9]. Even with a good action recognition scheme, it can be time-consuming to search the video space and accurately locate the action. For effective action detection, some early methods apply template-based pattern matching [12] [6] [13]. Two types of temporal templates are proposed in [12] for characterizing actions: (1) the motion energy image (MEI), a binary image recording where motion has occurred in an image sequence; and (2) the motion history image (MHI), a scalar-valued image whose intensity is a function of the recency of motion. In [14], the motion history volume (MHV) is introduced as a free-viewpoint representation for human actions. To better handle cluttered and dynamic backgrounds, an input video is over-segmented into many spatio-temporal video volumes in [7], and an action template is matched by searching among these over-segmented volumes. However, because only one template is used in matching, previous template-based methods usually have difficulty handling intra-class action variations. Besides template matching, discriminative methods have been applied as well. Haar features are extended to 3-dimensional space in [15] [16], and boosting is applied to integrate these features for final classification. In [9], multiple-instance learning is presented for learning a human action detector; however, head detection and tracking are required to help locate the person. In [17], a prototype-based approach is introduced, where each action is treated as a sequence of prototypes and efficient action matching is performed in long video sequences. In [18], a successive convex matching scheme is proposed for action detection.

3. MULTI-INSTANCE ACTION DETECTION

Given a video sequence and an action class, we are interested in finding the spatial and temporal locations of all the actions performed in the video sequence. Following the notation in [8], we represent an action as a collection of spatio-temporal interest points (STIPs) [1]. We use d ∈ R^N to denote an N-dimensional feature vector describing a STIP. Denote by C = {1, 2, ..., C} the class label set. For each class c ∈ C, a score s_c(d) is associated with d, which


is the mutual information between the class c and the feature vector d. The score of a subvolume V is defined as the sum of the scores of all the STIPs in V, that is,

    f(V) = \sum_{d \in V} s_c(d).

Note that the class c is omitted in the notation f(V); in addition, we write f(d) for s_c(d). A subvolume V is said to be maximal if there does not exist any subvolume V' such that f(V') > f(V) and V' ∩ V ≠ ∅. The multi-instance action detection problem is to find all the maximal subvolumes whose scores are above a certain threshold.

In [8], the multi-instance detection problem was converted into a series of single-subvolume search problems: first find the optimal subvolume V1 such that f(V1) = max_V f(V), then set the scores of all the points in V1 to 0 and find the optimal subvolume V2, and so on. A spatio-temporal branch-and-bound algorithm was proposed in [8] to solve the single-subvolume search problem. Instead of performing a branch-and-bound search directly in the 6-dimensional parameter space Λ, the method performs the search in the 4-dimensional spatial parameter space; that is, it finds the spatial window W* that maximizes

    F(W) = \max_{T \subseteq \mathcal{T}} f(W \times T),    (1)

where \mathcal{T} = [0, t]. To further speed up the search during the branch-and-bound iterations, a heuristic was used in [11]: once a candidate window W with a score larger than the detection threshold is found, subsequent searches are limited to the subwindows contained in W. This guarantees that a valid detection will be found, but the detected subvolume is not guaranteed to be optimal.

One advantage of separating the parameter space is that the worst-case complexity is reduced from O(m^2 n^2 t^2) to O(m^2 n^2 t). The complexity is linear in t, which is usually the largest of the three dimensions, so the method is efficient for processing long videos; but when the spatial resolution of the video increases, the complexity goes up quickly. The method of [8] was tested on low-resolution videos (160 × 120). In this paper we are interested in higher-resolution videos (320 × 240 or higher). We found that for videos taken under challenging lighting conditions with crowded backgrounds, such as those in the publicly available MSR Action dataset(1), the action detection rate at 320 × 240 resolution is much better than at 160 × 120. Unfortunately, the subvolume search for 320 × 240 videos is much slower: for example, the heuristic algorithm of Yuan et al. [11] takes 20 hours to search the MSR Action dataset, which consists of 54 video sequences at 320 × 240, each about one minute long. In the next two sections, we present two techniques to speed up the subvolume search algorithm; their combination allows us to perform subvolume search on 320 × 240 videos in real time.

(1) The MSR Action dataset is available at http://research.microsoft.com/en-us/downloads/fbf24c35-a93e-4d22-a5fe-bc08f1c3315e/
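To make the scoring function f(V) defined at the start of this section concrete, here is a toy sketch that sums STIP scores inside a subvolume. The STIP record and the numbers are our own stand-ins; in [8], s_c(d) is a learned mutual-information score.

```python
# Toy sketch of f(V): the sum of class-c scores of the STIPs inside V.
from dataclasses import dataclass

@dataclass
class STIP:
    x: int
    y: int
    t: int
    score: float          # s_c(d) for the class of interest (stand-in value)

def f(stips, x0, x1, y0, y1, t0, t1):
    # Score of the subvolume V = [x0, x1] x [y0, y1] x [t0, t1].
    return sum(p.score for p in stips
               if x0 <= p.x <= x1 and y0 <= p.y <= y1 and t0 <= p.t <= t1)

stips = [STIP(10, 12, 3, 0.7), STIP(11, 14, 5, 0.4), STIP(40, 8, 9, -0.6)]
print(f(stips, 0, 20, 0, 20, 0, 8))   # 1.1: only the first two fall inside V
```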

4. SPATIAL DOWN-SAMPLING

The technique is to spatially down-sample the video space by a factor s before the branch-and-bound search. Note that the interest point detection, descriptor extraction, and the scores are all computed on the original video sequence. For a video volume \mathcal{V} of size m × n × t, the down-sampled volume \mathcal{V}^s with scale factor s has size (m/s) × (n/s) × t. For any point (i, j, k) ∈ \mathcal{V}^s, where i ∈ [0, m/s − 1], j ∈ [0, n/s − 1], and k ∈ [0, t − 1], its score is defined as the sum of the scores of the corresponding s × s points in \mathcal{V}, that is,

    f^s(i, j, k) = \sum_{x=0}^{s-1} \sum_{y=0}^{s-1} f(s i + x, s j + y, k).    (2)

Given any subvolume V^s = [L, R] × [T, B] × [B, E] ⊂ \mathcal{V}^s (here [B, E] denotes the temporal extent), denote by ξ(V^s) its corresponding subvolume in \mathcal{V}, that is,

    \xi(V^s) = [s L, \, s(R+1) - 1] \times [s T, \, s(B+1) - 1] \times [B, E].    (3)

It is easy to see that

    f^s(V^s) = f(\xi(V^s)).    (4)
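A minimal numpy sketch of the down-sampling in Eq. (2) and the mapping of Eq. (3), assuming m and n are divisible by s; the variable names are our own.

```python
import numpy as np

def downsample_scores(f, s):
    # f: score volume of shape (m, n, t) -> f_s of shape (m//s, n//s, t),
    # where each entry sums an s x s spatial block, as in Eq. (2).
    m, n, t = f.shape
    return f.reshape(m // s, s, n // s, s, t).sum(axis=(1, 3))

rng = np.random.default_rng(1)
f = rng.normal(size=(240, 320, 50))
fs = downsample_scores(f, s=8)                     # 30 x 40 x 50

# Check Eq. (4): an s-aligned subvolume scores the same in both volumes.
# (B0/E stand for the temporal interval written [B, E] in the paper.)
L, R, T, B, B0, E = 5, 12, 3, 20, 10, 30
small = fs[T:B + 1, L:R + 1, B0:E + 1].sum()
big = f[8 * T:8 * (B + 1), 8 * L:8 * (R + 1), B0:E + 1].sum()
assert np.isclose(small, big)                      # f^s(V^s) == f(xi(V^s))
```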

From Eq. (4) it follows that

    \max_{V^s \subset \mathcal{V}^s} f^s(V^s) \le \max_{V \subset \mathcal{V}} f(V).    (5)

A subvolume V = [X1, X2] × [Y1, Y2] × [T1, T2] ⊂ \mathcal{V} is called an s-aligned subvolume if X1 and Y1 are multiples of s and the width X2 − X1 + 1 and height Y2 − Y1 + 1 are also multiples of s. Equation (3) provides a one-to-one mapping between the subvolumes in \mathcal{V}^s and the s-aligned subvolumes in \mathcal{V}.

Let V* denote the optimal subvolume in \mathcal{V}, that is, f(V*) = max_{V ⊂ \mathcal{V}} f(V). Assume V* = [x1, x1 + w − 1] × [y1, y1 + h − 1] × [t1, t2], where w and h are the width and height of V*, respectively, and let |V| denote the number of pixels in V. It can be shown that there exists an s-aligned subvolume Ṽ = [x̃1, x̃1 + w̃ − 1] × [ỹ1, ỹ1 + h̃ − 1] × [t1, t2] such that

    |(V^* \setminus \tilde{V}) \cup (\tilde{V} \setminus V^*)| \le (s h + s w + s^2)(t_2 - t_1).    (6)

Therefore,

    \frac{|(V^* \setminus \tilde{V}) \cup (\tilde{V} \setminus V^*)|}{|V^*|} \le \frac{s h + s w + s^2}{w h}.    (7)

If we assume the total score of a subvolume is on average proportional to its size, then

    \frac{f((V^* \setminus \tilde{V}) \cup (\tilde{V} \setminus V^*))}{f(V^*)} \le \frac{s h + s w + s^2}{w h},    (8)

and therefore

    \frac{f(V^*) - f(\tilde{V})}{f(V^*)} \le \frac{s h + s w + s^2}{w h}.    (9)

Let Ṽ* = argmax_{V ⊂ \mathcal{V}^s} f^s(V) denote the optimal subvolume in \mathcal{V}^s. Equation (9) yields

    \frac{f(V^*) - f^s(\tilde{V}^*)}{f(V^*)} \le \frac{s h + s w + s^2}{w h}.    (10)

Note that the left-hand side of Equation (10) is the relative error of the optimal solution in the scaled video volume \mathcal{V}^s. As an example, suppose the spatial dimension of \mathcal{V} is 320 × 240 and the scale factor is s = 8, so the down-sampled volume has spatial dimension 40 × 30. If the window size of the optimal subvolume V* is 64 × 64, then the average relative error is

    \frac{s h + s w + s^2}{w h} = \frac{8 \cdot 64 + 8 \cdot 64 + 8^2}{64^2} \approx 25\%.    (11)

We also ran numerical experiments to measure the relative error of the optimal solutions in the down-sampled volumes, using 30 video sequences of resolution 320 × 240 and three action types. For each video sequence and each action type, we obtain a 3D volume of scores as described in Section 3, choose s = 8, and down-sample each 3D volume to a spatial resolution of 40 × 30. There are 113 actions in total. For each action, we compute its corresponding down-sampled subvolume and evaluate the relative error, i.e., the score difference divided by the original action score. The mean is 23% and the standard deviation is 26%, reasonably consistent with the theoretical analysis.

Unfortunately, after down-sampling, the heuristic used to speed up the branch-and-bound search (Section 3) no longer gives good results. For example, we found that down-sampling 320 × 240 video volumes to 40 × 30 lowers the PR curve by 30% on average, because down-sampling smooths the scores and thus produces many more subvolumes with scores above the threshold. On the other hand, without the heuristic, the exact search algorithm is quite slow even on the down-sampled 40 × 30 volumes: as shown in Table 1, it takes 10 hours. To address this problem, we present a new technique for multi-instance subvolume search in the next section.
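As a quick check of the analysis above, the right-hand side of Eqs. (7)-(10) can be transcribed directly; the second call below is our own illustration of how the bound tightens for larger actions.

```python
# Direct transcription of the relative-error bound (s*h + s*w + s^2) / (w*h).
def relative_error_bound(s, w, h):
    return (s * h + s * w + s ** 2) / (w * h)

print(relative_error_bound(s=8, w=64, h=64))    # 0.2656..., the ~25% of Eq. (11)
print(relative_error_bound(s=8, w=128, h=128))  # ~0.13: larger actions, tighter bound
```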

5. TOP-K SEARCH ALGORITHM

The multi-instance search algorithm in [8] repeatedly applies the single-instance algorithm until some stop criterion is met. In practice, two different stop conditions are typically used: stop after k iterations, where k is a user-specified integer, or stop when the detection score falls below a user-specified detection threshold λ. In either case, if the number of detected instances is k, the worst-case complexity of the algorithm is O(k n^2 m^2 t).

Note that in the 1D case, Kadane's algorithm [2] can be extended to find the Top-K subarrays in O(n + k) time, which is much more efficient than repeatedly applying the single-instance algorithm k times at complexity O(k n). In the 3D case, we would likewise prefer an algorithm that is more efficient than simply applying the single-instance algorithm k times. We consider two variants corresponding to the two stop criteria. The first, called λ search, applies when we are interested in finding all the subvolumes above a user-specified threshold λ. The second, called Top-K search, applies when we are interested in finding the Top-K subvolumes.

5.1. λ Search

In this section we describe an algorithm that finds all the subvolumes with scores larger than a user-specified threshold λ; its pseudo-code is shown in Algorithm 1. Following the notation in [8], we use W to denote a collection of spatial windows, defined by four intervals that specify the parameter ranges for the left, right, top, and bottom positions, respectively. Given any set of windows W, we use F̂(W) to denote its upper bound, estimated in the same way as in [8], and Wmax to denote the largest window among all the windows in W. Initially, W is the set of all possible windows on the image.

In terms of worst-case complexity, the number of branches of this algorithm is at most O(n^2 m^2), since the algorithm does not restart the priority queue P. Each branch requires computing an upper bound, which costs O(t), so the worst-case cost of the branch-and-bound itself is the same as in [8]: O(n^2 m^2 t). In addition, each time a subvolume is detected, the algorithm must update the scores of the video volume at cost O(n m t); with k detected subvolumes, the total update cost is O(k n m t). Overall, the worst-case complexity of this algorithm is O(n^2 m^2 t) + O(k n m t), which for large k is much better than O(k n^2 m^2 t).

Algorithm 1 λ search
 1: Initialize P as an empty priority queue
 2: Set W = [T, B, L, R] = [0, m] × [0, m] × [0, n] × [0, n]
 3: Push (W, F̂(W)) into P
 4: repeat
 5:     Initialize current best solution {W*, F*}
 6:     repeat
 7:         Retrieve top state W from P based on F̂(W)
 8:         if F̂(W) > λ then
 9:             Split W into W1 ∪ W2
10:             if F̂(W1) > λ then
11:                 Push (W1, F̂(W1)) into P
12:                 Update current best solution {W*, F*}
13:             end if
14:             if F̂(W2) > λ then
15:                 Push (W2, F̂(W2)) into P
16:                 Update current best solution {W*, F*}
17:             end if
18:         end if
19:     until F̂(W) ≤ F*
20:     T* = argmax_{T ⊆ [0, t]} f(W*, T)
21:     Add V* = [W*, T*] to the list of detected subvolumes
22:     For each point (i, j, k) ∈ V*, set f(i, j, k) = 0
23: until F̂(W) ≤ λ
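The following is a runnable sketch of the λ search on a toy score volume. Two details are our own simplifications, not taken from the paper: the upper bound scores the largest window of a state over positive entries only (valid but looser than the bound of [8]), and bounds are lazily recomputed when a state is popped, so that they remain valid after a detected subvolume has been zeroed out.

```python
import heapq
import numpy as np

def best_interval(series):
    # Kadane's algorithm: maximum-sum contiguous interval of a 1D sequence.
    best, best_rng, cur, start = series[0], (0, 0), series[0], 0
    for i in range(1, len(series)):
        cur, start = (series[i], i) if cur < 0 else (cur + series[i], start)
        if cur > best:
            best, best_rng = cur, (start, i)
    return best, best_rng

def is_single(ws):
    return all(lo == hi for lo, hi in ws)

def bound(vol, ws):
    # ws = (top, bottom, left, right) parameter ranges, each a (lo, hi) pair.
    (top_lo, _), (_, bot_hi), (left_lo, _), (_, right_hi) = ws
    win = vol[top_lo:bot_hi + 1, left_lo:right_hi + 1, :]
    if not is_single(ws):
        win = np.maximum(win, 0.0)      # largest window, positive scores only
    return best_interval(win.sum(axis=(0, 1)))

def split(ws):
    # Branch: halve the widest of the four parameter ranges.
    i = max(range(4), key=lambda j: ws[j][1] - ws[j][0])
    lo, hi = ws[i]
    mid = (lo + hi) // 2
    a, b = list(ws), list(ws)
    a[i], b[i] = (lo, mid), (mid + 1, hi)
    return tuple(a), tuple(b)

def feasible(ws):
    (top_lo, _), (_, bot_hi), (left_lo, _), (_, right_hi) = ws
    return top_lo <= bot_hi and left_lo <= right_hi

def lambda_search(vol, lam):
    vol = vol.copy()
    m, n, _ = vol.shape
    root = ((0, m - 1), (0, m - 1), (0, n - 1), (0, n - 1))
    heap, detections = [(-bound(vol, root)[0], root)], []
    while heap:
        key, ws = heapq.heappop(heap)
        f, (ts, te) = bound(vol, ws)        # refresh: scores may have changed
        if abs(f + key) > 1e-9:
            heapq.heappush(heap, (-f, ws))  # stale bound: re-queue and retry
            continue
        if f <= lam:
            break                           # nothing above threshold remains
        if is_single(ws):                   # a single window: a detection
            top, bot, left, right = (r[0] for r in ws)
            detections.append(((top, bot, left, right, ts, te), f))
            vol[top:bot + 1, left:right + 1, ts:te + 1] = 0.0  # suppress it
            continue
        for child in split(ws):
            if feasible(child):
                b, _ = bound(vol, child)
                if b > lam:                 # prune, as in Algorithm 1
                    heapq.heappush(heap, (-b, child))
    return detections

# Toy demo: two positive blocks planted in a noisy 12 x 16 x 40 score volume.
rng = np.random.default_rng(0)
scores = rng.normal(-0.1, 0.2, size=(12, 16, 40))
scores[2:5, 3:7, 5:15] += 1.0
scores[7:10, 9:14, 20:32] += 1.0
for box, score in lambda_search(scores, lam=20.0):
    print("(top, bottom, left, right, t_start, t_end) =", box, "score = %.1f" % score)
```

The Top-K variant (Algorithm 2 below) keeps the same queue but replaces the fixed threshold lam with the score of the current k-th best solution and stops after k detections.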

5.2. Top-K Search

In this section we describe how to modify Algorithm 1 for the case where we are interested in finding the Top-K actions and do not know the threshold λ; the pseudo-code is shown in Algorithm 2. The algorithm is similar to Algorithm 1, with four main differences. First, instead of maintaining a single current best solution, it maintains the k best current solutions. Second, it replaces the criterion F̂(W) > λ with F̂(W) > F*_k to determine whether W1 or W2 needs to be inserted into the queue P. Third, it replaces the inner-loop stop criterion F̂(W) ≤ F* with F̂(W) ≤ F*_c. Finally, the outer-loop stop criterion F̂(W) ≤ λ is replaced with c > k. In this algorithm the number of outer loops is k, so the worst-case complexity is again O(n^2 m^2 t) + O(k n m t).
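Before the pseudo-code, a small sketch of the first difference: the k best current solutions can be kept in a fixed-size min-heap whose smallest score plays the role of F*_k, the pruning threshold that replaces λ. The naming is our own.

```python
import heapq

class KBest:
    def __init__(self, k):
        self.k = k
        self.heap = []                      # (score, window) pairs, min at top

    def threshold(self):
        # F*_k: prune states whose upper bound does not exceed this value.
        return self.heap[0][0] if len(self.heap) == self.k else float("-inf")

    def update(self, score, window):
        # window should be an orderable value (e.g. a tuple) to break score ties.
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, window))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, window))
```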


Algorithm 2 Top-K search
 1: Initialize P as an empty priority queue
 2: Set W = [T, B, L, R] = [0, m] × [0, m] × [0, n] × [0, n]
 3: Push (W, F̂(W)) into P
 4: c = 1
 5: repeat
 6:     Initialize {W*_i, F*_i}_{i=c,...,k} where F*_k ≤ ... ≤ F*_c
 7:     repeat
 8:         Retrieve top state W from P based on F̂(W)
 9:         if F̂(W) > F*_k then
10:             Split W into W1 ∪ W2
11:             if F̂(W1) > F*_k then
12:                 Push (W1, F̂(W1)) into P
13:                 Update {W*_i, F*_i}_{i=c,...,k}
14:             end if
15:             if F̂(W2) > F*_k then
16:                 Push (W2, F̂(W2)) into P
17:                 Update {W*_i, F*_i}_{i=c,...,k}
18:             end if
19:         end if
20:     until F̂(W) ≤ F*_c
21:     T* = argmax_{T ⊆ [0, t]} f(W*, T)
22:     Output V*_c = [W*, T*] as the c-th detected subvolume
23:     For each point (i, j, k) ∈ V*_c, set f(i, j, k) = 0
24:     c = c + 1
25: until c > k

6. EXPERIMENTS

To validate our algorithm, we perform experiments on a challenging dataset of 54 video sequences, where each video consists of several actions performed by different people in a crowded environment. Each video is approximately one minute long. The videos contain three different types of actions: (i) handwaving, (ii) handclapping, and (iii) boxing. For all of our experiments we fix λ = 3.0 and K = 3, and unless explicitly mentioned we down-sample the volume to 40 × 30 pixels.

6.1. Recognition Rates

Figure 1 compares the precision-recall curves of the following methods (the original videos are of high resolution, 320 × 240): (i) the accelerated λ search of [11] on low-resolution videos (frame size 40 × 30); (ii) the accelerated λ search of [11] on high-resolution videos; (iii) the multi-round branch-and-bound search of [8] on low-resolution videos (frame size 40 × 30); (iv) Top-K search with branch-and-bound at the original size 320 × 240;

(v) Top-K search with branch-and-bound at the down-sampled size 40 × 30; and (vi) λ search with branch-and-bound at the down-sampled size 40 × 30.

For the purpose of generating precision-recall curves, we modified the outer-loop stop criterion (line 25 of Algorithm 2) to repeat until F̂(W) ≤ λ, where λ is a small threshold. In this way, the algorithm outputs more than K subvolumes, which is necessary for plotting the precision-recall curve. The measurement of precision and recall is the same as described in [8]. For the computation of precision, we count a true detection if Volume(V* ∩ G) / Volume(G) > 1/8, where G is the annotated ground-truth subvolume and V* is the detected subvolume. For the computation of recall, we count a hit if Volume(V* ∩ G) / Volume(V*) > 1/8.

From the precision-recall curves in Fig. 1 we can see that although the accelerated search of [11] provides excellent results on high-resolution videos, its performance on down-sampled low-resolution videos is poor compared with the other search schemes. Moreover, all the methods applied to the high-resolution videos provide similar performance; in particular, Top-K search with branch-and-bound at the down-sampled size (v) and λ search with branch-and-bound at the down-sampled size (vi) are among the best. These results justify our proposed λ search and Top-K search algorithms: although the branch-and-bound is performed on the down-sampled videos, they still provide good performance while searching much faster.
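A sketch of the two overlap tests above, with subvolumes represented as (x0, x1, y0, y1, t0, t1) boxes; the representation and names are our own.

```python
def volume(b):
    x0, x1, y0, y1, t0, t1 = b
    return max(0, x1 - x0 + 1) * max(0, y1 - y0 + 1) * max(0, t1 - t0 + 1)

def intersect(a, b):
    # Lower bounds sit at even indices, upper bounds at odd indices.
    return tuple(max(a[i], b[i]) if i % 2 == 0 else min(a[i], b[i])
                 for i in range(6))

def true_detection(v, g):                 # counted for precision
    return volume(intersect(v, g)) / volume(g) > 1 / 8

def hit(v, g):                            # counted for recall
    return volume(intersect(v, g)) / volume(v) > 1 / 8
```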


6.2. Running Times

Table 1 shows the time consumed by each of the following methods, all implemented in C++ and run on a single dual-core PC with 4 GB of main memory: (a) the accelerated λ search of [11] on low-resolution videos (frame size 40 × 30); (b) the accelerated λ search of [11] on high-resolution videos; (c) the multi-round branch-and-bound search of [8] on low-resolution videos (frame size 40 × 30); (d) λ search with branch-and-bound at the down-sampled size 40 × 30; (e) Top-K search with branch-and-bound at the down-sampled size 80 × 60; and (f) Top-K search with branch-and-bound at the down-sampled size 40 × 30.

Method                                        Running Time
(a) Low resolution [11]                       40 mins
(b) High resolution [11]                      20 hours
(c) Down-sampled B&B (40 × 30)                10 hours
(d) λ search + down-sampled B&B (40 × 30)     1 hour 20 mins
(e) Top-K + down-sampled B&B (80 × 60)        6 hours
(f) Top-K + down-sampled B&B (40 × 30)        26 mins

Table 1. Time consumed by each method for finding the actions in the 54 videos.

Table 1 shows that although the method of [11] works well for low-resolution videos, its search speed becomes much slower on high-resolution videos. Moreover, when performed on the down-sampled videos, both the multi-round λ search and our proposed λ search produce the same results, namely the video subvolumes with scores above λ; however, our proposed λ search is more than seven times faster, because it finds multiple instances in a single round of branch-and-bound search. Among all the search schemes, the fastest is Top-K search with branch-and-bound at the down-sampled size 40 × 30: it takes only twenty-six minutes to process the 54 sequences, which total around one hour of video.

Fig. 1. Precision-recall curves for the three actions in the database ((a) handclapping, (b) handwaving, (c) boxing) and the different methods.

7. CONCLUSIONS

Although the video subvolume search method proposed in [8] provides promising results for action detection, it has limitations in handling multi-instance action detection and videos of high spatial resolution. To address these two problems, we improve the spatio-temporal branch-and-bound search of [8] to efficiently detect multiple actions simultaneously. To accelerate the search in high-resolution videos, we down-sample the frame size and then perform the branch-and-bound search on the low-resolution videos. We analyze the quality of the upper-bound estimation in the down-sampled videos and present two novel multi-instance search algorithms: the λ search and the Top-K search. Our proposed algorithm dramatically reduces search complexity without significantly degrading the quality of the detection results: with features and detection scores already computed, it can search one hour of video in less than half an hour. As a data-driven approach to action detection, our method does not rely on human detection and tracking, and can handle scale variations, cluttered and moving backgrounds, and even partial occlusions.

8. REFERENCES

[1] Ivan Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, no. 2-3, pp. 107-123, 2005.

[2] Jon Bentley, "Programming pearls: Algorithm design techniques," Commun. ACM, vol. 27, no. 9, pp. 865-873, 1984.

[3] Christian Schuldt, Ivan Laptev, and Barbara Caputo, "Recognizing human actions: A local SVM approach," in Proc. IEEE Conf. on Pattern Recognition, 2004.

[4] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2008.

[5] Kishore K. Reddy, Jingen Liu, and Mubarak Shah, "Incremental action recognition using feature-tree," in Proc. IEEE Intl. Conf. on Computer Vision, 2009.

[6] Eli Shechtman and Michal Irani, "Space-time behavior based correlation," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2005.

[7] Y. Ke, R. Sukthankar, and M. Hebert, "Event detection in crowded videos," in Proc. IEEE Intl. Conf. on Computer Vision, 2007.

[8] Junsong Yuan, Zicheng Liu, and Ying Wu, "Discriminative subvolume search for efficient action detection," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2009.

[9] Yuxiao Hu, Liangliang Cao, Fengjun Lv, Shuicheng Yan, Yihong Gong, and Thomas S. Huang, "Action detection in complex scenes with spatial and temporal ambiguities," in Proc. IEEE Intl. Conf. on Computer Vision, 2009.

[10] Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann, "Beyond sliding windows: Object localization by efficient subwindow search," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2008.

[11] Junsong Yuan, Zicheng Liu, Ying Wu, and Zhengyou Zhang, "Speeding up spatio-temporal sliding-window search for efficient event detection in crowded videos," in ACM Multimedia Workshop on Events in Multimedia, 2009.

[12] Aaron F. Bobick and James W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 23, no. 3, pp. 257-267, 2001.

[13] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2008.

[14] Daniel Weinland, Remi Ronfard, and Edmond Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 207-229, 2006.

[15] Yan Ke, R. Sukthankar, and M. Hebert, "Efficient visual event detection using volumetric features," in Proc. IEEE Intl. Conf. on Computer Vision, 2005.

[16] Ming Yang, Fengjun Lv, Wei Xu, Kai Yu, and Yihong Gong, "Human action detection by boosting efficient motion features," in IEEE Workshop on Video-oriented Object and Event Classification, in conjunction with ICCV, Kyoto, Japan, 2009.

[17] Zhe Lin, Zhuolin Jiang, and Larry S. Davis, "Recognizing actions by shape-motion prototype trees," in Proc. IEEE Intl. Conf. on Computer Vision, 2009.

[18] Hao Jiang, Mark S. Drew, and Ze-Nian Li, "Successive convex matching for action detection," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2006.