AN ACTIVE ACTION PROPOSAL METHOD BASED ON REINFORCEMENT LEARNING

Tao Zhang, Nannan Li, Jingjia Huang, Jia-Xing Zhong, Ge Li
School of Electronic and Computer Engineering, Peking University, Shenzhen, China
[email protected], [email protected]

ABSTRACT
Detecting human activities in untrimmed video is a significant yet challenging task. Existing methods usually generate temporal action proposals by searching extensively at multiple preset scales or by combining many short video snippets. However, we argue that the localization of action instances should be a process of observation, refinement and determination: observe the attended temporal window, refine its position and scale, then determine whether a true action region has been accurately found. To this end, we formulate the temporal action localization task as a Markov Decision Process and propose an active temporal action proposal model based on reinforcement learning. Our model learns to localize actions in videos by automatically adjusting the position and span of a temporal window via a sequence of transformations. We train an action/non-action binary classifier to determine whether a temporal window contains an action instance. Validation results on the THUMOS'14 dataset show that our proposed method achieves competitive performance in both accuracy and efficiency compared with state-of-the-art methods, while using far fewer proposals.

Index Terms— action proposal, temporal action detection, deep reinforcement learning, Q-learning

Thanks to National Natural Science Foundation of China (61602014), Shenzhen fundamental research program (JCYJ20170818141120312), Shenzhen Peacock Plan (20130408-183003656), and National Natural Science Foundation of China and Guangdong Province Scientific Research on Big Data (U1611461) for funding.

1. INTRODUCTION

Temporal action detection aims not only to figure out what kinds of actions occur in a video but also to locate the time period of each action. Detecting human actions in long untrimmed videos is vital for many applications such as security surveillance and human-machine interaction. However, the massive dynamic information in videos, the varied durations of action instances and the huge computational overhead keep this task difficult and unsolved. Recently, temporal action detection has drawn much interest in the computer vision community [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. At present, the dominant action detection approaches usually follow a two-step strategy.
First, a large number of proposals likely to contain actions of interest is generated; then the proposals are refined and classified. Since state-of-the-art action recognition methods have already achieved remarkable performance [11, 12, 3, 13, 14], developing an effective and efficient proposal generation model is crucial for breaking through the performance bottleneck of temporal action detection. Existing works have tried different ways to generate class-independent proposals, such as sliding windows [5], DAPs [1] and SST [15]. Although some of these works achieve good performance on existing action detection datasets, they struggle with huge computational overhead and a lack of adaptivity in the search.

Consider how humans find a target segment quickly in a long untrimmed video: generally, we first go through the entire video briefly, then select a temporal span likely to contain an action instance, and finally adjust the temporal boundaries repeatedly until the action instance is located accurately. This is in fact a procedure of observation and refinement, which can be decomposed into a sequence of boundary transformations such as "move right" and "move left". Inspired by this observation, in this paper we propose an active action proposal model to locate actions in untrimmed videos rapidly and accurately. We build a Markov Decision Process (MDP) for this task and train an agent to act with a reinforcement learning algorithm. The trained model learns a policy to continuously adjust the position and scale of a temporal window. Given a randomly placed initial window, the trained agent adjusts its position and span automatically by taking a sequence of transformations until a true action instance is covered. The transformations are decided by the agent based on both the content covered by the current window and the transformations taken in the last several steps. A class-independent action/non-action binary classifier is trained to determine whether the attended temporal window covers an action instance during the search. Unlike the dominant proposal methods that follow fixed search paths, our method generates different search paths for different action instances based on the learned policy and the initial location. We validate our method on the challenging public THUMOS'14 dataset: the proposals generated by the trained agent within about 15 steps achieve higher recall while using far fewer proposals.
[Fig. 1 diagram: the C3D encoder produces a 4096-d "fc6" feature concatenated with a 6*4 action-history vector; the DQN agent (fc layers: 4096+6*4 → 1024 → 6) selects actions α_t that transform temporal windows ψ_t; the binary classifier BCN (fc: 4096 → 1024 → 1, sigmoid) labels windows as action/non-action; the regression network RGN (fc: 4096 → 1024 → 2) predicts position offsets to produce the final proposals.]
Fig. 1. The framework of our action proposal model. α refers to an action and ψ represents a temporal window (Section 3.1). The yellow circle represents executing action α on window ψ. The green windows stand for positive temporal segments considered to contain action instances, while the gray ones represent negative segments. Both the BCN and the RGN contain two fc layers.

The main contributions of this paper can be summarized as twofold: (1) we propose an active action proposal model based on reinforcement learning, which can automatically locate action instances effectively and efficiently; (2) the proposed method achieves competitive detection results compared with the state of the art without any refinement processing.

2. RELATED WORKS
Action Recognition. Many methods have been proposed for this problem [11, 12, 3, 14, 13]. The two-stream architecture [11] utilizes optical flow and RGB images to extract both spatial and temporal motion information. Instead of processing spatial and temporal information separately, Tran et al. [14] use a 3D CNN (C3D) to directly extract spatio-temporal features. In our method, we use TSN [12] as the action classifier and take the "fc6" output of C3D as the video feature representation.

Temporal Action Localization. Most existing approaches focus on feature representation and classifier construction [1, 3, 4, 5, 6, 7, 10]. Shou et al. [5] utilize a multi-stage CNN detection network for action localization, generating proposals via sliding windows and filtering out background with a binary action/background classifier. Yeung et al. [4] propose an attention model based on reinforcement learning, which predicts action positions through a few glimpses. Following [5], we also construct a multi-stage detection structure in our method. Unlike [4], we generate proposals by gradually adjusting the attended window instead of directly predicting the position and scale of the action region.

3. THE PROPOSED METHOD

Fig. 1 presents an overview of our proposed framework. It consists of three parts: the Deep Q-Network (DQN), which is trained to search for action instances by altering the position and scale of the currently attended window in a few steps; the Binary Classification Network (BCN), which determines whether the currently attended window contains an action instance; and the regression network (RGN) [5], which adjusts the position offsets between predicted proposals and the groundtruths.

3.1. Problem Definition

A video containing N frames is denoted as X = {x_i}_{i=1}^{N}. The groundtruths within X are represented as G = {g_i}_{i=1}^{π}, where π is the number of groundtruths and g_i is the i-th one. A temporal window is represented as ψ_t = [x_t^l, x_t^r], where x_t^l and x_t^r are its left and right boundaries respectively. An MDP can be represented by a quadruple M = {S, A, P_{s,a}, R}. S = {s_t}_{t=1}^{T} represents the state space, where T is the number of all possible states the agent may hold; A = {a_i}_{i=1}^{C} stands for the C actions the agent may adopt; P_{s,a} signifies the probability that the agent chooses action a at state s; and R = {r_t}_{t=1}^{T} indicates the feedback the agent gets from the environment when executing action a_t at state s_t. The total reward the agent gains from time step t to T, denoted as R_t, can be written as:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{T−t} r_T    (1)
where the discount factor γ (γ ∈ [0, 1)) determines how much future reward is taken into consideration. Solving an MDP amounts to finding an action sequence that leads the agent to the terminal state while maximizing R_t. By designing a proper state set S, action set A and reward R, we can turn the problem of temporal action localization into an MDP, in which the agent interacts with the environment and adopts a series of actions to achieve the designated goal. We solve the MDP with the one-step Q-learning algorithm [16].
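As a concrete illustration of the one-step Q-learning update, the following minimal Python sketch computes the Bellman target y = r + γ max_a' Q(s', a') toward which Q(s, a) is regressed; the function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

GAMMA = 0.99  # discount factor gamma in Eq. (1); the paper sets it to 0.99


def one_step_target(reward, next_state, done, q_net):
    """Bellman target y = r + gamma * max_a' Q(s', a') for one-step Q-learning.

    `q_net` is assumed to map a state vector to a vector of Q-values,
    one per action; `done` flags a terminal state.
    """
    if done:
        return float(reward)
    return float(reward) + GAMMA * float(np.max(q_net(next_state)))


# Toy usage with a dummy Q-network over the six window-transformation actions.
dummy_q_net = lambda s: np.zeros(6)
y = one_step_target(reward=1.0, next_state=np.zeros(4096 + 24), done=False, q_net=dummy_q_net)
```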
[Fig. 2 diagram: operations on position — move left, move right, jump; operations on scale — left expand, right expand, shorten.]
Fig. 2. Illustration of the actions adopted by the agent for motion search. Orange windows with dashed lines represent the next window after taking the corresponding action.

3.2. MDP Formulation

In this section, the sets S, A and R are discussed in detail.

States. The state set is the representation of the environment, i.e., the given video clip in our task. To make the network converge more quickly, we encode a single state — in other words, the feature vector fed into the DQN — as the concatenation of the feature extracted from the "fc6" layer of C3D (the green part of the feature vector in Fig. 1) and the latest four actions taken before time step t (the yellow part of the feature vector in Fig. 1). Each historical action is represented as a 6-dimensional one-hot vector, in which the entry corresponding to the taken action is set to 1 and the others to 0.

Actions. A proper action set is important for the agent to find a target as fast as possible. In our method, the action set consists of six actions, which fall into two categories, as shown in Fig. 2. After the agent takes the selected action a_t, ψ_t is modified by a fixed ratio α (α ∈ [0, 1)). For example, when a "right expand" action is taken, ψ_t = [x_t^l, x_t^r] is transformed to ψ_{t+1} = [x_t^l, x_t^r + α·(x_t^r − x_t^l)]. In particular, the "jump" action is adopted to prevent the agent from being trapped in a region containing no action instance and to encourage it to explore unvisited regions, so that action instances are located more efficiently. As a trade-off between efficiency and accuracy, we set α = 0.2 during both training and testing. In [17], a "trigger" action is introduced to determine whether the target object has been accurately located. However, since an untrimmed video presents a much more complex environment for the agent to interact with than a single image, we replace the "trigger" action with the BCN.

Rewards. During training, the agent receives an evaluation called a reward r_t when it takes action a at state s. A positive r_t encourages the agent to choose action a at state s, while a negative r_t punishes it for executing the wrong action. We evaluate the reward of action a via a simple yet indicative metric, the Intersection over Union (IoU) between the attended window and a groundtruth, defined as:

IoU(ψ, g) = area(ψ ∩ g) / area(ψ ∪ g)    (2)
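A minimal Python sketch of the six window transformations and of the temporal IoU in Eq. (2) is given below, using the fixed ratio α = 0.2 from the paper; the exact displacement of the "jump" action and the symmetric form of "shorten" are assumptions made for illustration, as the paper does not specify them.

```python
ALPHA = 0.2  # fixed modification ratio alpha used in both training and testing


def apply_action(window, action, alpha=ALPHA, jump_stride=2.0):
    """Apply one of the six transformations to a temporal window (x_l, x_r).

    Position operations: "move_left", "move_right", "jump".
    Scale operations: "left_expand", "right_expand", "shorten".
    The jump displacement and the symmetric shrink are illustrative choices.
    """
    x_l, x_r = window
    span = x_r - x_l
    if action == "move_left":
        return (x_l - alpha * span, x_r - alpha * span)
    if action == "move_right":
        return (x_l + alpha * span, x_r + alpha * span)
    if action == "jump":  # leap away toward an unexplored region
        return (x_l + jump_stride * span, x_r + jump_stride * span)
    if action == "left_expand":
        return (x_l - alpha * span, x_r)
    if action == "right_expand":  # the example given in the text
        return (x_l, x_r + alpha * span)
    if action == "shorten":
        return (x_l + 0.5 * alpha * span, x_r - 0.5 * alpha * span)
    raise ValueError("unknown action: %s" % action)


def temporal_iou(a, b):
    """Temporal IoU of Eq. (2) between two windows a = (a_l, a_r) and b = (b_l, b_r)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```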
The reward function depends on the difference between the IoUs of two successive states s and s', where the agent moves from state s to s' by executing action a. Specifically, it is formulated as follows:

r_t = max_{1≤i≤π} sign( IoU(ψ_{t+1}, g_i) − IoU(ψ_t, g_i) )    (3)
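For concreteness, a direct transcription of Eq. (3) into Python follows, reusing the temporal_iou helper from the sketch above; groundtruths are assumed to be a list of (start, end) pairs.

```python
def _sign(x):
    """Return +1, 0 or -1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def step_reward(prev_window, next_window, groundtruths):
    """Reward of Eq. (3): the maximum over groundtruths of the sign of the IoU
    change obtained by moving from the previous window to the next one."""
    return max(_sign(temporal_iou(next_window, g) - temporal_iou(prev_window, g))
               for g in groundtruths)
```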
Equation (3) indicates that the agent gets a reward of +1 if ψ_{t+1} overlaps any of the groundtruths more than ψ_t does, and a reward of -1 otherwise. This greedy strategy ensures that the trained agent discovers an action region as fast as possible, without unnecessary steps.

3.3. Binary Classification Network

We train an action/non-action binary classifier, called the BCN, to determine whether a video segment contains an action. We construct the training data D = {(d_m, k_m)}_{m=1}^{M}, where the label k_m ∈ {0, 1}, d_m is a single training segment and M is the number of training segments. We assign a label to each segment by computing its IoU with all the groundtruths G. Specifically, we use the following criterion:

k_m = 0, if max_{g_i∈G} IoU(d_m, g_i) < th_top;   k_m = 1, if max_{g_i∈G} IoU(d_m, g_i) > th_bottom    (4)

where th_top and th_bottom are the IoU thresholds for generating negative and positive samples respectively. We generate positive samples by applying transformations such as shrinking or translation to the groundtruths. As action instances are spread sparsely over the given video clip, for efficiency we produce negative samples simply by randomly executing the "jump" action described in Section 3.2. We keep the ratio of positive to negative samples at 1:1 during training.

4. EXPERIMENTAL RESULTS

4.1. Implementation

Dataset. We validate our method on the widely used untrimmed video dataset THUMOS'14 [18]. It contains 20 action classes, with 200 clips for training and 213 clips for testing.

Training Details. We implement our model on the Torch7 [19] and Caffe [20] platforms and carry out experiments on Tesla K80 GPUs. We augment the data by horizontally flipping every frame and pre-process it by down-sampling all clips to the same frame rate (10 fps). γ is set to 0.99 and P_{s,a} = 1. th_top is 0.6 and th_bottom is 0.4. The DQN is trained with memory replay; the replay buffer size is 2000. The initial learning rates for the DQN and the RGN are 5e-3 and 1e-4 respectively, both with a decay of 5e-4. When training the BCN, the learning rate for the pre-trained C3D conv layers is 2e-4 and for the fc layers is 5e-3.
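The following Python sketch illustrates the memory-replay mechanism used to train the DQN, with the buffer size of 2000 reported above; the batch size and the reuse of one_step_target from the earlier sketch are illustrative assumptions, not details taken from the source.

```python
import random
from collections import deque

REPLAY_SIZE = 2000  # replay buffer size reported in the paper
BATCH_SIZE = 32     # illustrative; not specified in the paper

replay_buffer = deque(maxlen=REPLAY_SIZE)


def store_transition(state, action, reward, next_state, done):
    """Append one (s, a, r, s', done) interaction to the replay memory."""
    replay_buffer.append((state, action, reward, next_state, done))


def sample_targets(q_net, batch_size=BATCH_SIZE):
    """Sample a mini-batch of past transitions and compute their one-step targets.

    Returns (state, action, target) triples; regressing Q(s, a) toward the target
    (e.g., with an L2 loss) is left to the hosting deep-learning framework.
    """
    batch = random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))
    return [(s, a, one_step_target(r, s_next, d, q_net)) for s, a, r, s_next, d in batch]
```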
Fig. 3. Evaluation of recall performance on the THUMOS'14 dataset compared with SCNN [5], DAPs [1] and sliding windows. In panel (a), proposal recalls are calculated at a fixed IoU of 0.5; in panel (b), the number of proposals is fixed at 100.

Methods    Recall@50    Recall@100    FPS
DAPs       0.35         0.54          134
SCNN       0.37         0.58           60
Ours       0.43         0.54          240

Table 1. Comparison of proposal generation performance with existing methods in terms of recall at IoU = 0.5 using different numbers of proposals. Our method achieves the highest recall with 50 proposals.
Testing Details. During testing, the agent starts its search at the beginning of each clip and takes actions continuously according to the content of the attended region. Once a groundtruth has been found, the agent restarts its search in a region far to the right of the current window. We post-process candidate proposals with non-maximum suppression.

4.2. Experiment Results

First, we evaluate our temporal proposal model by comparing localization results with state-of-the-art approaches, as shown in Fig. 3. A good proposal method is expected to achieve a high recall while using few proposals; fast processing speed also matters. We measure the ability of our model to retrieve proposals using average recall (AR). Our proposal method reaches a higher recall for the top 100 proposals. The growth of recall slows down toward the end of the curve because the agent stops refining a found action instance toward a higher IoU once the binary classifier determines that the currently attended region belongs to an action class. Our approach also achieves comparable recall to state-of-the-art methods under various IoU settings, and it processes 240 frames per second on average, which is much faster than most existing methods.
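A minimal sketch of this test-time search procedure is shown below, assuming an agent callable that greedily picks one of the six actions from the encoded state, a bcn callable that scores action/non-action, and the apply_action helper from the earlier sketch; the step budget, initial span, restart offset and BCN threshold are illustrative assumptions.

```python
def generate_proposals(video_length, agent, bcn, encode_state,
                       max_steps=200, init_span=100.0, bcn_threshold=0.5):
    """Greedy proposal search: adjust the window until the BCN accepts it,
    then restart the search well to the right of the accepted window."""
    proposals = []
    window = (0.0, min(init_span, video_length))  # search starts at the clip beginning
    history = []                                  # last four actions, part of the state
    for _ in range(max_steps):
        if window[0] >= video_length:             # ran past the end of the clip
            break
        state = encode_state(window, history)     # C3D "fc6" feature + action history
        action = agent(state)                     # action with the highest Q-value
        window = apply_action(window, action)
        history = (history + [action])[-4:]
        if bcn(window) > bcn_threshold:           # window judged to contain an action
            proposals.append(window)
            span = window[1] - window[0]
            # restart far to the right of the found instance (offset is illustrative)
            window = (window[1] + span, window[1] + 2.0 * span)
    return proposals
```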
Model @ Proposal Number    mAP (%)
SCNN @NA                   19.0
Yeung @NA                  17.1
DAPs @NA                   13.9
CDC* @NA                   23.3
SSN* @NA                   29.8
Ours @50                   22.8
Ours @NA                   24.4
Table 2. Action detection results compared with state-of-the-art methods: mAPs are calculated with a fixed number of proposals on THUMOS'14, with IoU fixed at 0.5. @NA means the proposal number is not specified for the corresponding method. * marks methods that focus on proposal refinement, which promises better detection results than unrefined proposals.

We also evaluate our method on the temporal action localization task using mean Average Precision (mAP) at IoU 0.5 and compare the results with state-of-the-art approaches in Table 2. All proposals are classified by TSN [12]. With the proposal number fixed at 50, our method achieves an mAP of 22.8, higher than the compared methods. When the proposal number is not specified, our approach further improves the detection accuracy to an mAP of 24.4.

5. CONCLUSION

This paper proposes an active action proposal method based on reinforcement learning. We formulate the temporal action localization task as an MDP and learn an efficient policy that generates action proposals in a few steps by adaptively adjusting the position and scale of a temporal window. Furthermore, we train an action/non-action binary classification network to determine whether a proposal is a true action or background. Extensive experiments on the THUMOS'14 dataset validate that the proposed approach attains competitive detection results with fewer proposals compared with state-of-the-art methods.
6. REFERENCES

[1] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem, "DAPs: Deep action proposals for action understanding," in European Conference on Computer Vision. Springer, 2016, pp. 768–784.

[2] Georgia Gkioxari and Jitendra Malik, "Finding action tubes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 759–768.

[3] Dan Oneata, Jakob Verbeek, and Cordelia Schmid, "Action and event recognition with Fisher vectors on a compact feature set," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1817–1824.

[4] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei, "End-to-end learning of action detection from frame glimpses in videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2678–2687.

[5] Zheng Shou, Dongang Wang, and Shih-Fu Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049–1058.

[6] Huijuan Xu, Abir Das, and Kate Saenko, "R-C3D: Region convolutional 3D network for temporal activity detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[7] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin, "Temporal action detection with structured segment networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[8] Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem, "Fast temporal activity proposals for efficient detection of human actions in untrimmed videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1914–1923.

[9] Yi Zhu and Shawn Newsam, "Efficient action detection in untrimmed videos via multi-task learning," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2017, pp. 197–206.

[10] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang, "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos," arXiv preprint arXiv:1703.01515, 2017.
[11] Karen Simonyan and Andrew Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.

[12] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision. Springer, 2016, pp. 20–36.

[13] Joao Carreira and Andrew Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4724–4733.

[14] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.

[15] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles, "SST: Single-stream temporal action proposals," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, February 2015.

[17] Juan C. Caicedo and Svetlana Lazebnik, "Active object localization with deep reinforcement learning," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2488–2496.

[18] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar, "THUMOS challenge: Action recognition with a large number of classes," http://crcv.ucf.edu/THUMOS14/, 2014.

[19] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet, "Torch7: A MATLAB-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.

[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.