Improving human activity detection by combining multi-dimensional motion descriptors with boosting

Takehito Ogata†, William Christmas‡, Josef Kittler‡ and Seiji Ishikawa†
† Department of Mechanical and Control Engineering, Kyushu Institute of Technology, Japan
‡ Centre for Vision, Speech and Signal Processing, University of Surrey, U.K.
{ogata, ishikawa}@ss10.cntl.kyutech.ac.jp, {W.Christmas, J.Kittler}@surrey.ac.uk

Abstract

A new, combined human activity detection method is proposed. Our method is based on Efros et al.’s motion descriptors [2] and Ke et al.’s event detectors [3]. Since both methods use optical flow, it is easy to combine them. However, the computational cost of the training increases considerably because of the increased number of weak classifiers. We reduce this computational cost by extending Ke et al.’s weak classifiers to incorporate multi-dimensional features. The proposed method is applied to off-air tennis video data, and its performance is evaluated by comparison with the original two methods. Experimental results show that the proposed method offers a good compromise between detection rate and the computation time of testing and training.
1 Introduction

Detecting and understanding human activity is important for high-level analysis of videos that contain humans. Human activity recognition has been studied in many research domains, and sports video analysis is one of them. Many high-level applications of sports video analysis, such as content-based retrieval or automatic annotation [1], require human activity recognition. One of the difficulties of human activity recognition in sports video is that the image regions associated with players are usually small, and the quality of broadcast video is often poor.

Efros et al. [2] have proposed a method for recognizing human actions in low-resolution sports video. In order to eliminate effects caused by camera motion, a player is tracked and optical flow is calculated from player-centred images. The optical flow is then separated into 4 channels to construct motion descriptors. In their experiments, they attempt to recognize players’ activities in a football video. One of the bottlenecks of Efros et al.’s method is the computational cost of recognition. Since they adopt a k-nearest
Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06) 0-7695-2521-0/06 $20.00 © 2006
neighbour method in the recognition stage, the recognition rate depends on the number of training examples. However, if the number of training examples increases, the computational cost increases proportionally.

Recently, Ke et al. [3] have developed a new event detection framework. This is an extension of the Viola-Jones face detector [5]. The Viola-Jones face detector is based on the AdaBoost algorithm, and Haar-like features are used for the weak classifiers of the AdaBoost. Ke et al. extend this 2-D face detector into the time dimension to detect a 3-D pattern. In order to detect the 3-D pattern, volumetric features are used for the weak classifiers. Although the AdaBoost training is computationally expensive, the detection is very fast.

Since Ke et al.’s method is based on the two components of the optical flow, it is easy to combine the detectors with Efros et al.’s four-component motion descriptors. However, in the training, we need to generate twice as many weak classifiers as in the original method, because each of their weak classifiers looks at only one of the components. Since the computational cost of the AdaBoost training depends on the numbers of boosting stages, weak classifiers and training examples, this makes the training slower.

In this paper, we propose a new human activity detection method based on a combination of the above two methods: Efros et al.’s motion descriptors and Ke et al.’s event detectors. In order to reduce the computational cost of the training, we extend Ke et al.’s weak classifiers to multiple dimensions. In our experiments, we apply this method to tennis video to detect 4 types of strokes. Experimental results show that (a) the combination of the two methods exploits the advantages of both methods in terms of detection rate and computation time, and (b) we can decrease the computational cost of the training by using the multi-dimensional weak classifiers. The structure of this paper is as follows.
The idea of multi-dimensional weak classifiers is described in Section 2. An overview of our activity detection system is presented in Section 3. Section 4 describes the application of the proposed activity detectors to tennis stroke detection. The experimental results are shown in Section 5. Finally, conclusions are drawn in Section 6.

Figure 1. The structure of the activity detection system. [Block diagram: input images → target tracker → target-centred images → motion descriptor generator → 4-channel motion descriptors → activity detectors.]
2 Multi-dimensional weak classifiers

Ke et al. [3] use optical flow for event detection; however, each of their boosted weak classifiers uses only one of the optical flow components ($v_x$ or $v_y$) to calculate the volumetric feature, so the feature is a scalar. We believe that both components should be used together as a vector, because the vector carries information about both the magnitude and the direction of the motion. For this reason, we construct the weak classifier in a multi-dimensional feature space.
3 System overview

Our activity detection system consists of three parts, shown in Fig. 1. First, a target is tracked and target-centred images are extracted. The motion descriptors are then generated. Finally, activities are detected by activity detectors.

3.1 Target tracker

Our tracker is based on the colour-based particle filter algorithm [4]. Currently, a colour model of the target is initialized manually in the first frame. Once the target is tracked, target-centred images can be extracted.
3.2 Motion descriptor generator

The motion descriptors are generated from the optical flow of the target-centred images as described in [2]. The horizontal and vertical components of the optical flow field, $F_x$ and $F_y$, are separated into the positive components ($F_x^+$, $F_y^+$) and the negative components ($F_x^-$, $F_y^-$). Note that if one of the positive components is non-zero, the corresponding negative component is set to zero, and vice versa. Each component is then smoothed using a 2-D spatial Gaussian filter to construct the motion descriptors $Fb_x^+$, $Fb_x^-$, $Fb_y^+$ and $Fb_y^-$. Finally, each motion descriptor is rescaled to a fixed size.

Figure 2. The calculation of the multi-dimensional weak classifier.
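The separation and smoothing steps of Section 3.2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian width `sigma`, the kernel `radius` and the border handling (edge replication) are all assumptions made for illustration, and the final rescaling to a fixed size is omitted. The flow components are plain nested lists so that the example needs only the standard library.

```python
import math


def half_wave_rectify(flow):
    """Split one flow component into non-negative positive and negative channels."""
    pos = [[max(v, 0.0) for v in row] for row in flow]
    neg = [[max(-v, 0.0) for v in row] for row in flow]
    return pos, neg


def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating the border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for j, w in enumerate(kernel):
                xx = min(max(x + j - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable 2-D Gaussian smoothing: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows_done = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(rows_done), kernel))


def motion_descriptors(flow_x, flow_y, sigma=1.0, radius=2):
    """Return the four smoothed channels, keyed 'x+', 'x-', 'y+' and 'y-'."""
    channels = {}
    for name, flow in (("x", flow_x), ("y", flow_y)):
        pos, neg = half_wave_rectify(flow)
        channels[name + "+"] = gaussian_blur(pos, sigma, radius)
        channels[name + "-"] = gaussian_blur(neg, sigma, radius)
    return channels
```

Because each flow value contributes to exactly one of the two rectified channels, the difference between the positive and negative channels before smoothing reproduces the original flow component.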
3.3 Activity detector

Each of the four motion descriptors serves as a basis for extracting many volumetric features, which are defined by randomly placed Haar-like volumetric operators of different scales. Each operator thus generates a local representation of the motion in the form of a feature vector. Hence $a_w(k)$, the vector generated by weak classifier $w$ at frame $k$, is calculated from the following equation:

$$
a_w(k) =
\begin{bmatrix}
f_w\big(Fb_x^+(k),\, Fb_x^+(k-1),\, \cdots\big) \\
f_w\big(Fb_x^-(k),\, Fb_x^-(k-1),\, \cdots\big) \\
f_w\big(Fb_y^+(k),\, Fb_y^+(k-1),\, \cdots\big) \\
f_w\big(Fb_y^-(k),\, Fb_y^-(k-1),\, \cdots\big)
\end{bmatrix}
$$

Here, the function $f_w()$ computes the volumetric feature [3] of the weak classifier $w$. Fig. 2 shows this calculation schematically. These feature vectors are then used to construct weak classifiers during the AdaBoost classifier training. The weak classifier $w$ classifies a query by comparing the Euclidean distances between (a) the query vector and the average vector of the positive training examples, $a_w^p$, and (b) the query vector and the average vector of the negative training examples, $a_w^n$. Thus, the output of the weak classifier $w$ is given by:

$$
h_w(k) =
\begin{cases}
1 & \text{if } \lVert a_w^p - a_w(k) \rVert < \lVert a_w^n - a_w(k) \rVert \\
0 & \text{otherwise}
\end{cases}
$$

where $h_w(k)$ is the output of the weak classifier $w$ at frame $k$. Finally, a subset of these weak classifiers is selected to constitute the AdaBoost strong classifier.
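The decision rule for $h_w(k)$ and the subsequent selection step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the class means are unweighted, the selection loop follows the discrete AdaBoost update used by Viola and Jones [5], the weak-classifier outputs are precomputed, and the names `NearestMeanWeakClassifier`, `adaboost_select` and the clamp `eps` are hypothetical. Computing the volumetric feature $f_w$ itself is left out, so feature vectors are taken as given.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


class NearestMeanWeakClassifier:
    """Outputs 1 if the query is closer to the positive mean than to the negative mean."""

    def __init__(self, positive_vectors, negative_vectors):
        self.mean_pos = mean_vector(positive_vectors)  # a_w^p
        self.mean_neg = mean_vector(negative_vectors)  # a_w^n

    def predict(self, a_w):
        closer_to_pos = math.dist(self.mean_pos, a_w) < math.dist(self.mean_neg, a_w)
        return 1 if closer_to_pos else 0


def adaboost_select(weak_outputs, labels, rounds, eps=1e-10):
    """Pick one weak classifier per round by lowest weighted error.

    weak_outputs[j][i] is the 0/1 output of candidate j on training example i.
    Returns a list of (candidate index, vote weight alpha) pairs.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    chosen = []
    for _ in range(rounds):
        total = sum(weights)
        weights = [w / total for w in weights]
        # Every candidate is scored on every example: this is the costly step.
        errors = [sum(w for w, h, y in zip(weights, outputs, labels) if h != y)
                  for outputs in weak_outputs]
        best = min(range(len(errors)), key=errors.__getitem__)
        err = min(max(errors[best], eps), 1.0 - eps)  # clamp to keep log finite
        beta = err / (1.0 - err)
        chosen.append((best, math.log(1.0 / beta)))
        # Down-weight the examples that the chosen classifier got right.
        weights = [w * (beta if h == y else 1.0)
                   for w, h, y in zip(weights, weak_outputs[best], labels)]
    return chosen
```

The inner scoring loop visits every candidate and every training example in each round, which is why the training cost grows with the numbers of boosting stages, weak classifiers and training examples.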
4 Implementation for tennis stroke detection

We have implemented the activity detectors described above for tennis strokes. We use four types of volumetric feature, which are the same as those in [3]. The size of the activity detector is set to 32 × 32 pixels by 20 frames. In training, all positive training examples are synchronized manually using the instant at which the player hits the ball. The negative examples are all of the video data that do not contain the stroke that the detector is designed to detect.
5 Experiment

5.1 Experimental environments

We tested our proposed method on off-air tennis video data. We used the men’s final match of the 2003 Australian tennis tournament. In order to evaluate the performance of the proposed method, we compared it with both Efros et al.’s method and Ke et al.’s method. We implemented the stroke detectors for both players in the video. Here, we call the player who appears towards the top of the image the ‘far player’, and the other player, who appears at the bottom of the image, the ‘near player’.

We selected 18 playing sequences, each containing several strokes. For the far player, we constructed ‘serve’, ‘forehand stroke’, ‘backhand stroke’ and ‘backhand volley’ detectors, and the sequences contain the far player’s 9 serves, 19 forehand strokes, 24 backhand strokes and 10 backhand volleys. Similarly, for the near player, we constructed ‘serve’, ‘forehand stroke’ and ‘backhand stroke’ detectors, and the sequences contain the near player’s 9 serves, 27 forehand strokes and 24 backhand strokes. Before the experiment, we marked all of the players’ hitting points manually, and labelled the strokes to provide the ground truth. Since the far player is the same person in all of the sequences, as is the near player, the leave-one-out method was used in this experiment.

The implementation details of Efros et al.’s method were as follows. We used 1-, 3- and 5-nearest neighbours with the k-nearest neighbour (k-NN) classification method. First, each of the k most similar frames selected was compared to the ground truth. If a frame was close to the hitting point of any of the strokes (within ±5 frames), the corresponding stroke label was assigned to it. Then the stroke labels were put to a vote. If ‘no label’ received a majority vote, no stroke was detected.

During training of the AdaBoost-based detectors, we computed 50,000 volumetric features, which were randomly generated within the 32 × 32 pixels by 20 frames window. We compared three different schemes.
(1) Ke et al.’s detectors were constituted from 100,000 weak classifiers (50,000 each in $F_x$ and $F_y$).
(2) Another version, using Ke et al.’s detectors but with Efros et al.’s descriptors, was constituted from 200,000 weak classifiers (50,000 each in $Fb_x^+$, $Fb_x^-$, $Fb_y^+$ and $Fb_y^-$). We call this the unmodified combined method.
(3) Multi-dimensional detectors were constituted from 50,000 weak classifiers: each of our weak classifiers has four volumetric features corresponding to the four motion descriptors. We call this the modified combined method.

All three detectors were constructed from single boosted classifiers, and were trained with the same training examples. In Ke et al.’s method, the unmodified combined method and the modified combined method, the numbers of weak classifiers selected for testing were set to 100, 200 and 50, respectively.

All four kinds of detection results were compared to the ground truth. Since the detection results of the AdaBoost detectors were in the form of a binarized signal over time, all connected positive results were merged into a single detection. In order to smooth the results, we applied a 1-D median filter of size 3. Then, if one of these groups overlapped with the true hitting point or was near it (i.e. within ±5 frames), it was counted as a true detection. Otherwise it was counted as a false detection.
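The scoring of the AdaBoost detector outputs described above can be sketched in a few lines. The sketch makes two assumptions that the text does not state: the median filter is applied before the connected positive frames are merged, and the signal is padded at both ends by repeating the edge value. The function names are hypothetical.

```python
from statistics import median


def median_filter3(signal):
    """1-D median filter of size 3; edges are padded by repetition (an assumption)."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [int(median(padded[i:i + 3])) for i in range(len(signal))]


def merge_runs(signal):
    """Merge connected positive frames into (first_frame, last_frame) groups."""
    runs = []
    start = None
    for frame, value in enumerate(signal):
        if value and start is None:
            start = frame
        elif not value and start is not None:
            runs.append((start, frame - 1))
            start = None
    if start is not None:
        runs.append((start, len(signal) - 1))
    return runs


def count_detections(runs, hit_frames, tolerance=5):
    """Count groups that overlap a hitting point or lie within `tolerance` frames of one."""
    true_detections = 0
    false_detections = 0
    for first, last in runs:
        if any(first - tolerance <= hit <= last + tolerance for hit in hit_frames):
            true_detections += 1
        else:
            false_detections += 1
    return true_detections, false_detections


raw = [0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
groups = merge_runs(median_filter3(raw))
print(groups)                                       # [(4, 6), (15, 16)]
print(count_detections(groups, hit_frames=[8]))     # (1, 1)
```

In this toy signal the isolated positive at frame 1 is removed by the filter, the group at frames 4 to 6 lies within 5 frames of the hitting point at frame 8 and counts as a true detection, and the group at frames 15 to 16 counts as a false detection.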
5.2 Experimental results

ROC curves of the stroke detectors for the far player are shown in Fig. 3: (a) shows the ROC curves of the ‘forehand stroke’ detector, (b) those of the ‘backhand stroke’ detector and (c) those of the ‘backhand volley’ detector. The ROC curves of the AdaBoost detectors were calculated by varying the threshold of the strong classifiers. We did not need to calculate ROC curves for the ‘serve’ detectors, because all serves were detected perfectly by all four methods. The detection results of the k-NN method are shown in the ROC curves as black points. In (a), 1-NN, 3-NN and 5-NN achieved detection rates of 84.2%, 73.7% and 63.2%, respectively. In (b), all three k-NN variants achieved 83.3%, with different numbers of false positives. In (c), only 5-NN achieved 70%, while 1-NN and 3-NN achieved 60%.

Figure 3. The ROC curves of stroke detection results: (a) forehand stroke detection, (b) backhand stroke detection, (c) backhand volley detection. [Each panel plots the correct detection rate (0 to 100) against false detections/second (0 to 0.25) for Ke et al.’s, the unmodified combined, the modified combined and Efros et al.’s methods.]

According to the calculated ROC curves, although the unmodified and modified combined detectors did not always perform as well as Efros et al.’s method, they outperformed Ke et al.’s detectors. In terms of detection rate, although the unmodified combined method showed better performance than the modified combined method, we cannot see a significant difference between them. However, the modified combined detectors have an advantage in terms of the AdaBoost training. Although both detectors used the same number of volumetric features, the numbers of selected weak classifiers and candidate weak classifiers in the training of the modified combined detectors were one quarter of those in the training of the unmodified combined detectors. Therefore, the modified combined detectors can be constructed faster than the unmodified combined detectors.

Table 1 shows the computation time of the ‘backhand stroke’ detector’s training running on our computer. The computation time depends on the specification of the computer; however, the ratio of the computation times of two processes running on the same computer does not change radically. Therefore, the training of the ‘backhand stroke’ detector for the modified combined method was approximately 3 times faster than for the unmodified combined method.

Table 1. Computation time of the ‘backhand stroke’ detector training

Method                 Computation time [sec]
Unmodified combined    74098
Modified combined      23621

Since the complexity of our weak classifiers is greater than that of Ke et al.’s weak classifiers, their detectors and the unmodified combined detectors are faster at detection time than the modified combined detectors. However, note that the computation time of Efros et al.’s method was a few minutes, while all three AdaBoost-based methods needed only a few seconds.

One of our stroke detection results is shown in Fig. 4. The white lines show the trajectories of the players, and the red, green and blue boxes show correctly detected serves, forehand strokes and backhand strokes, respectively.

Figure 4. An example of the stroke detection results.
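The "approximately 3 times faster" figure can be checked directly against Table 1 and the candidate counts given in Section 5.1. The short calculation below uses only numbers reported in the paper; the variable names are illustrative.

```python
# Training times from Table 1 (seconds) and candidate weak classifiers from Section 5.1.
train_seconds = {"unmodified": 74098, "modified": 23621}
candidates = {"unmodified": 200_000, "modified": 50_000}

speedup = train_seconds["unmodified"] / train_seconds["modified"]
candidate_ratio = candidates["unmodified"] / candidates["modified"]

print(f"training speed-up: {speedup:.2f}x")        # training speed-up: 3.14x
print(f"candidate ratio:   {candidate_ratio:.0f}x")  # candidate ratio:   4x
```

The measured speed-up (about 3.1) is somewhat smaller than the fourfold reduction in candidates. One plausible reading, consistent with the remark that the multi-dimensional weak classifiers are more complex, is that each candidate costs more to evaluate; the paper does not break the training time down further.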
6 Conclusion

In this paper, we proposed a human motion detection method based on a combination of Efros et al.’s method and Ke et al.’s method. Multi-dimensional classifiers were used to reduce the computational cost of the training. We applied the proposed method to off-air tennis video data, and its performance was evaluated by comparison with both of the original methods. In terms of the ROC curves, the AdaBoost-based detectors did not always perform as well as Efros et al.’s method; however, the computational cost of the latter when testing was prohibitive. Of the remaining AdaBoost methods, the ROC curves show that the unmodified and modified combined methods were more accurate than Ke et al.’s method, indicating that the use of the 4-dimensional motion feature was worthwhile. Although the unmodified and the modified combined methods showed similar performance, the modified combined method has an advantage in terms of computational cost when training. From the experimental results, we can conclude that (a) the combination of the two methods has exploited the advantages of the original methods and eliminated their disadvantages, and (b) we have decreased the computation required for training by introducing the multi-dimensional weak classifiers.
References

[1] W. J. Christmas, A. Kostin, F. Yan, I. Kolonias, and J. Kittler. A system for the automatic annotation of tennis matches. In Fourth International Workshop on Content-Based Multimedia Indexing, 2005.
[2] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In Proc. of ICCV, pages 726–733, 2003.
[3] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Proc. of ICCV, pages 166–173, 2005.
[4] K. Nummiaro, E. B. Koller-Meier, and L. Van Gool. A color-based particle filter. In Proc. of 1st International Workshop on Generative-Model-Based Vision (GMBV’02), pages 53–60, 2002.
[5] P. Viola and M. Jones. Robust real-time object detection. In Proc. of IEEE Workshop on Statistical and Computational Theories of Vision, pages 1–25, 2000.