View-Invariant Human-Body Detection with Extension to Human Action Recognition using Component-Wise HMM of Body Parts
Bhaskar Chakraborty, Ognjen Rudovic
Computer Vision Center & Dept. Ciències de la Computació, Edifici O, Campus UAB, 08193 Bellaterra, Spain
Jordi González
Institut de Robòtica i Informàtica Industrial (UPC - CSIC)
[email protected]
Abstract
This paper presents a technique for view-invariant human detection and extends this idea to recognize basic human actions such as walking, jogging, hand waving and boxing. To achieve this goal we detect the human by its body parts and then learn the changes of those body parts for action recognition. Human-body-part detection across different views is an extremely challenging problem due to the drastic changes of the 3D pose of the human body, self-occlusions, etc. while performing actions. To cope with these problems we have designed three example-based detectors that are trained to find separately three components of the human body, namely the head, legs and arms. We incorporate 10 sub-classifiers for head, arm and leg detection; each sub-classifier detects body parts under a specific range of viewpoints. View invariance is then achieved by combining the results of these sub-classifiers. Subsequently, we extend this approach to recognize actions based on component-wise Hidden Markov Models (HMMs). This is achieved by designing an HMM for each action, trained on the detected body parts. Consequently, we are able to distinguish between similar actions by considering only the body parts which make the major contribution to those actions, e.g. the legs for walking and running, and the hands for boxing and waving.
978-1-4244-2154-1/08/$25.00 (c) 2008 IEEE

1. Introduction
Human-detection-based action recognition is a constantly expanding research area due to the number of applications in surveillance (behaviour analysis), security (pedestrian detection), control (human-computer interfaces), content-based video retrieval, etc. It is, however, a complex and difficult-to-resolve problem because of the enormous differences that exist between individuals, both in the way they
Figure 1. Images from the KTH database showing some of the challenges involved in people detection in still images, where the 3D pose of the body parts changes greatly while performing actions such as clapping, walking, boxing, and jogging.
move and in their physical appearance, viewpoint, and the environment where the action is carried out. Fig. 1 shows some images from the KTH database [12], demonstrating the variation of human poses w.r.t. different camera views and different actions. Several approaches can be found in the literature [7]. Some are based on holistic body information, where no attempt is made to identify individual body parts. Authors such as [3] use HMMs and AdaBoost for the recognition of 3-D human actions, considering joint positions or pose angles. However, there are actions which can be better recognized by considering only body parts, such as the dynamics of the legs for walking, running and jogging [2]. Consequently, action recognition can be based on a prior detection of the human body parts [9]. In this context, the human body parts must first be detected in the image: authors such as [6, 11] describe human detection algorithms based on probabilistic body-part assembly. However, these approaches use motion information, explicit models, or a static camera, assume a single person in the image, or implement tracking rather than pure detection. Mohan et al. [8] used Haar-like features and SVMs for component-wise object detection. However, Haar-like features cannot capture certain structural features that are useful for designing a view-invariant human detector. Moreover, they provide no method to select the best features for the SVM, so performance could be improved by minimizing the size of the feature vector. Lastly, these works do not cope well with the detection of profile poses. Due to these difficulties in view-invariant body-part detection, action recognition has typically been restricted to analyzing the human body as a whole from multiple views. For example, Mendoza and Pérez de la Blanca [5] detect human actions using Hidden Markov Models (HMMs) based on contour histograms of the full body. The authors of [1] combine shape information and optical flow based on the silhouette to achieve this goal, and [14] uses the sum of silhouette pixels. However, as demonstrated in this paper, a higher detection rate can be achieved when knowledge of body-part detection is applied to action recognition. Our approach introduces a technique for view-invariant human detection and subsequently learns the stochastic changes of the body-part components to recognize different actions. We use a hierarchy of multiple example-based classifiers for each body-part component, and view-invariant human detection is achieved by combining the results of those detectors in a meaningful way. Since a human action can be viewed as a combination of the motions of different body parts, action detection can be analyzed as a stochastic process by learning the changes of such components using HMMs. Moreover, in this way we can consider features only from those body-part components which actually take part in the action, e.g. the legs for walking, jogging and running. We have compared our results with the current state of the art [12, 5]. Lastly, our method is also able to detect the direction of motion from the likelihood map obtained from the HMM.
This paper is organised as follows. Section 2 presents the view-invariant human detection method in detail. Section 3 describes the component-wise HMM method for the detection of human actions. Section 4 reports on the performance of our system. Finally, conclusions are drawn in Section 5.
2. View-Invariant Human Detection
Towards human action detection, our system first detects the full human by detecting body parts. To ensure view invariance, more than one detector is designed for each body part, and the outputs of these body-part detectors are finally combined to increase the robustness of the whole system. Component-based human detection has some inherent advantages over existing techniques. After the body parts are detected, geometric constraints are applied, which supplements the visual information present in the image. In contrast, a full-body person detector relies solely on visual information and does not take advantage of the known geometric properties of the human body. Other problems in full-body detection, e.g. partial occlusion, can also be avoided with this approach.
2.1. Detection of Human Body Parts
The system starts detecting people in images by selecting rectangular windows of a specific size for each body part, scanning from the top-left corner of the image, as input to the head, leg and arm detectors. These windows are independently classified as either the respective body part or a non-body part, and the detected parts are finally combined in a proper geometrical configuration, within a bigger window, as a person. All candidate regions are processed by the respective component detectors to find the strongest candidate components. The image itself is processed at several sizes, which allows the system to detect people of various sizes at any location in an image.
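As a rough illustration, the multi-scale window scan described above can be sketched as follows. The stride, the scale set and the nearest-neighbour subsampling are illustrative assumptions (the paper does not specify them); the classifier itself is applied afterwards to each candidate patch.

```python
import numpy as np

def sliding_windows(image, win_h, win_w, stride=8):
    """Yield (top, left, patch) for each fixed-size candidate window."""
    H, W = image.shape[:2]
    for y in range(0, H - win_h + 1, stride):
        for x in range(0, W - win_w + 1, stride):
            yield y, x, image[y:y + win_h, x:x + win_w]

def scan_at_scales(image, win_h, win_w, scales=(1.0, 1.5, 2.0), stride=8):
    """Scan subsampled copies of the image so that differently sized people
    map onto the fixed-size component window; window positions are reported
    in original-image coordinates."""
    candidates = []
    for s in scales:
        h, w = int(image.shape[0] / s), int(image.shape[1] / s)
        if h < win_h or w < win_w:
            continue  # a person at this scale is smaller than the window
        # nearest-neighbour subsampling stands in for proper rescaling
        ys = np.linspace(0, image.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, image.shape[1] - 1, w).astype(int)
        small = image[np.ix_(ys, xs)]
        for y, x, patch in sliding_windows(small, win_h, win_w, stride):
            candidates.append((int(y * s), int(x * s), s, patch))
    return candidates
```

Each candidate patch would then be scored by the corresponding head, leg or arm classifier, keeping the strongest candidates.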
2.2. Feature Extraction and Selection
In our approach, the training images of each body-part detector are divided into 8x8 blocks after applying a Sobel mask to them. HOG features are extracted from those blocks by dividing the gradient orientations into 6 bins from -90° to +90°. In this way, a 6-dimensional feature vector is obtained for each 8x8 block. The next step is to select the best 6-feature packs among them; this feature selection is based on the standard deviation (σ). Suppose we have N training image samples of size WxH, and each training image is divided into n 8x8 blocks. For each block we have 6 gradient-based features g_i, with i = 1, 2, ..., 6. We define the standard deviation of the j-th gradient feature of the i-th block, σ_ij, as:

σ_ij = ( (1/N) Σ_{t=1}^{N} (g_ij^(t) − µ_ij)² )^{1/2}    (1)

where i = 1, 2, ..., n; j = 1, 2, ..., 6; t = 1, 2, ..., N; g_ij^(t) is the j-th gradient feature of the i-th block of the t-th training image, and µ_ij is the mean of the j-th gradient feature of the i-th block over all training images. The σ values of (1) are then sorted, and those 6-feature packets whose σ is smaller than a predefined threshold are retained; in our case we ran the experiment several times to obtain that threshold. In this way the feature size is minimized.
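A minimal sketch of this feature-extraction and σ-based selection step is given below. The 16x16 test-image size, the magnitude weighting of the histogram, and the use of the per-block maximum σ as the selection criterion are illustrative assumptions, not details stated in the paper.

```python
import numpy as np

def block_histograms(img, bins=6):
    """6-bin gradient-orientation histograms (-90° to +90°) for each
    8x8 block of a grayscale image, after Sobel filtering."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # Sobel masks
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode='edge')
    H, W = img.shape
    gx = sum(kx[i, j] * pad[i:i + H, j:j + W] for i in range(3) for j in range(3))
    gy = sum(ky[i, j] * pad[i:i + H, j:j + W] for i in range(3) for j in range(3))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx + 1e-9))
    # fold orientations into the (-90°, +90°] range
    ang = np.where(ang > 90, ang - 180, np.where(ang < -90, ang + 180, ang))
    feats = []
    for by in range(0, H - H % 8, 8):
        for bx in range(0, W - W % 8, 8):
            a = ang[by:by + 8, bx:bx + 8].ravel()
            m = mag[by:by + 8, bx:bx + 8].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(-90, 90), weights=m)
            feats.append(hist)
    return np.array(feats)  # shape: (n_blocks, 6)

def select_stable_blocks(train_feats, threshold):
    """Keep the 6-feature packs whose standard deviation over the training
    set (Eq. 1) stays below `threshold`."""
    sigma = train_feats.std(axis=0)            # (n_blocks, 6)
    return np.where(sigma.max(axis=1) < threshold)[0]
```

`train_feats` stacks the per-image outputs of `block_histograms` over the training set; the returned indices identify the stable blocks kept for the SVM.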
2.3. Training Body-part Detectors
To identify a human at one particular image scale, the head and leg detectors are applied simultaneously.

Figure 2. Training database for each body-part detector, i.e. head, arm and leg. Each body-part detector has more than one sub-classifier; e.g. the head detector has four sub-classifiers: frontal, backward, left and right profile.

In the present system there are four head detectors, two leg detectors and four arm detectors. The four head detectors cover the view angles 45°-135°, 135°-225°, 225°-315° and 315°-45°. We chose this division so that consecutive ranges have a smooth transition of head poses. For the arms, there are four classifiers corresponding to different arm positions. Detecting arms is a difficult task, since arms have more degrees of freedom than the other components. For walking and jogging we used four major arm poses; using pose symmetry, the remaining pose possibilities can be detected. For actions like hand waving and boxing, more than four sub-classifiers have been designed, since in those cases the arm pose changes rapidly. To detect the legs, two sub-classifiers have been added: one for open legs and the other for profile legs. We use an SVM to classify the data vectors resulting from the feature extraction and selection of the components. The SVM algorithm uses structural risk minimization to find the hyperplane that optimally separates two classes of objects, which is equivalent to minimizing a bound on the generalization error. The optimal hyperplane is computed as a decision surface of the form:

f = sgn(g(x))    (2)

where

g(x) = Σ_{i=1}^{l*} α_i y_i K(x, x_i) + b    (3)

In (3), K is one of many possible kernel functions, y_i ∈ {-1, 1} is the class label of the data point x_i, {x_1, x_2, ..., x_{l*}} is the set of training examples with Lagrange multipliers α_i, and x is the data point to be classified by the SVM. In our method we have used a polynomial kernel. Fig. 2 shows a few training samples from our body-part component
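The decision surface of Eqs. (2)-(3) can be written out directly as a short numpy sketch. The kernel degree, the support vectors and the multipliers below are made-up toy values for illustration; the trained SVM would supply the real ones.

```python
import numpy as np

def poly_kernel(x, xi, degree=2, coef0=1.0):
    """Polynomial kernel, the family named in the paper; the degree
    used here is an assumption."""
    return (float(np.dot(x, xi)) + coef0) ** degree

def decision(x, support_vecs, alphas, labels, b):
    """Eq. (3): g(x) = sum_i alpha_i * y_i * K(x, x_i) + b,
    followed by Eq. (2): f = sgn(g(x))."""
    g = sum(a * y * poly_kernel(x, sv)
            for sv, a, y in zip(support_vecs, alphas, labels)) + b
    return 1 if g >= 0 else -1

# Toy illustration with made-up support vectors and multipliers:
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
alphas = [0.5, 0.5]
labels = [1, -1]
b = 0.0
print(decision(np.array([2.0, 2.0]), svs, alphas, labels, b))    # prints 1 (body part)
print(decision(np.array([-2.0, -2.0]), svs, alphas, labels, b))  # prints -1 (background)
```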
training database. The HumanEva database¹ [13] has been used to create those training images. In our method, the window sizes for the head, leg and arm detectors have been fixed to 72 x 48, 184 x 108 and 124 x 64 pixels, respectively; a 264 x 124 pixel window is used for the full human. For training, each component detector uses 10,000 true positives and 20,000 false positives. The results of the component detectors are combined based on geometrical configuration. Since the head is the most stable part of the body, the geometric combination starts with the head. Subsequently, the leg component is taken into account and, after that, both arms are combined. We include the Leg Bounding Box (LBB, 184 x 108) after the Head Bounding Box (HBB, 72 x 48) is computed from the head detector. This is done by checking that the x component of the centre of the LBB lies within the width of the HBB, and that its y component is greater than the y component of the centre of the Full Human Bounding Box (FHBB, 264 x 124). After the FHBB is found, the Arm Bounding Box (ABB, 124 x 64) is searched for within that area. This gives computational efficiency, since we only scan one specific window area instead of the whole image. We choose the result from the sub-classifier which gives the best score. When a person moves along a circular path, in some cases two arm sub-classifiers give a best score, while in other cases only one does, since an arm can be occluded behind the body. Fig. 3 shows the results of the component detectors on the HumanEva database [13]; full detections are shown in Fig. 4. These detections are a strong example of view-invariant detection, since in the HumanEva database the agents perform actions along a circular path and our component detectors show robust performance in detecting them.
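The geometric check that attaches the leg box to the head box can be sketched as below. The (x_center, y_center, width, height) box convention and the concrete coordinates in the comments are assumptions for illustration; only the window sizes come from the text.

```python
def inside_width(x, box):
    """True if x lies within the horizontal extent of `box`,
    where box = (x_center, y_center, width, height)."""
    cx, cy, w, h = box
    return cx - w / 2 <= x <= cx + w / 2

def combine_components(hbb, lbb, fhbb):
    """Geometric constraint of Sec. 2.3: accept the leg bounding box only
    if its centre lies horizontally within the head bounding box and
    vertically below the centre of the full-human bounding box
    (image y grows downward)."""
    lx, ly = lbb[0], lbb[1]
    fy = fhbb[1]
    return inside_width(lx, hbb) and ly > fy
```

For example, a head box of width 48 centred at x = 50 accepts a leg box centred at x = 52 and rejects one centred at x = 150.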
3. Human Action Recognition using HMMs
The aforementioned component-wise, view-invariant human detection is next extended to human action recognition. Toward this end, we learn the stochastic changes of the detected body parts using HMMs to recognize human actions like walking and jogging. Our system uses only those body-part components which make a major contribution to the action. From the HMM likelihood map we can also detect the direction of motion, from which we can infer the viewpoint of the person with respect to the camera for actions where the agent moves from one place to another, e.g. walking.
¹ http://vision.cs.brown.edu/humaneva/

Figure 3. Results of the head, arm and leg detectors during validation. Detections of profile head and legs are shown, as well as arm detections where both arms are visible and where one is occluded.

Figure 4. Validation of full human detection. Experiments are performed on the HumanEva database.

Figure 5. Performance of full human detection on the KTH database. Detections in profile images with different head, arm and leg positions are shown.

Figure 6. Performance of full human detection on the HERMES indoor sequence. Detection performance is shown when body parts are tilted; in the first image the arm detector fails to detect one visible arm.

3.1. HMM Learning
The feature set used to learn the component HMMs is almost the same as that used for the SVM of each body-part detector. Instead of selecting features, here we take the mean of each 6-bin angle-histogram vector. So if there are n blocks (8 x 8) in a training image, then we have the feature set {µ_1, µ_2, ..., µ_n}, where µ_j is defined as:

µ_j = (1/6) Σ_{i=1}^{6} g_i    (4)

where g_i is the gradient histogram of the i-th bin of the j-th block. The significance of taking the mean is to capture the general orientation of the body part, which in turn signifies one pose, or a series of similar poses, in an action. We fit a Gaussian Mixture Model (GMM) to those feature values to obtain a key-pose alphabet for a particular action; the key poses are the centres of the clusters obtained from the GMM. To detect actions like walking, jogging and running only the legs are used, while for boxing, hand waving and hand clapping we take only the hands to learn the component HMMs. We used the same training set from the HumanEva database as for the body-part detectors. For actions like hand waving, hand clapping and running we used sequences from the KTH database for training and the rest for testing, since the HumanEva database does not contain those actions. We chose a sequence of frames to define one cycle for every action, and then mapped those frame sequences onto the pose alphabet to obtain one sequence of states for HMM learning. We computed P(O_t | s_t = s), i.e. the probability of an observation O_t given the state s at time t, using both the Forward and the Baum-Welch algorithms for HMMs [10]. We use the HumanEva database as our validation database. When testing the performance of the HMMs, a likelihood map is computed from the probability values obtained from the HMM for each frame of the test sequence; maxima of the likelihood map indicate the end of each action cycle.
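A minimal numpy sketch of the likelihood computation behind this scheme is given below: the scaled forward algorithm [10] scores a key-pose sequence under each action's HMM, and the highest-scoring model wins. The two single-state toy models in the usage note are illustrative, not the trained models.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm.
    pi: (S,) initial state probabilities; A: (S, S) transition matrix;
    B: (S, K) emission matrix; obs: key-pose symbols (0..K-1) drawn from
    the GMM pose alphabet."""
    alpha = pi * B[:, obs[0]]
    log_l = np.log(alpha.sum())
    alpha /= alpha.sum()          # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        log_l += np.log(s)
        alpha /= s
    return log_l

def classify_action(obs, models):
    """Pick the action whose HMM assigns the highest likelihood;
    `models` maps an action name to its (pi, A, B) parameters."""
    return max(models, key=lambda a: forward_loglik(obs, *models[a]))
```

For instance, with a toy "walking" model that mostly emits pose symbol 0 and a toy "boxing" model that mostly emits symbol 1, the sequence [0, 0, 0, 0] is classified as walking.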
4. Experimental Results
We have used the KTH database² [12] and the HERMES indoor sequence dataset [4] to test the performance of our system. Fig. 5 shows some detection examples from the KTH database. The KTH database contains several image sequences in which it is impossible to detect the arms due to differences in clothing; in most cases, however, we are able to detect the head and legs, and when they are in the proper geometrical order we detect the full human. There are some sequences where only the legs can be detected due to low resolution: in low-resolution image sequences much information is lost through scaling, and the Sobel mask hardly finds important edges in the body parts. The images shown are of different sizes; the detected boxes have been drawn after rescaling to a 1:1 scale.

² http://www.nada.kth.se/cvap/actions/

Figure 7. ROC curves of the different component detectors, obtained from (a) the HumanEva database, (b) the KTH database and (c) the HERMES indoor sequence. The positive detection rate is plotted as a percentage against the false alarm rate, i.e. the number of false positive detections per window inspected.

Fig. 6 shows examples from the HERMES indoor sequences, where we detect humans that are tilted and at different scales. The ROC curves for the three component detectors (head, legs and arms) on the different databases are shown in Fig. 7. These ROCs are generated considering the overall performance of all the sub-classifiers within each of the three detector groups. The ROC analysis reveals that head and leg detection are very robust, whereas the arm detection rate is not as high, since arm pose varies drastically while performing actions. On the KTH database the ROC shows a detection rate of more than 90% for the legs; the rate for the head is lower, due to the low resolution of the head in several sequences. On the HERMES sequence the detection rate is slightly lower than on the other two databases, because this sequence has a real background and the humans are sometimes tilted and wear different clothing; arm detection, however, is better than on the other two. The KTH database contains different types of actions: walking, jogging, boxing, hand waving, etc. These actions were performed by 25 different people of both sexes in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. We applied our method to all those sequences for action detection. Table 1 compares our action detection rates with two other methods, namely local features with SVM [12] and contour histograms [5]. Our method performs well for walking, running, boxing and clapping, and achieves a better detection rate than the full-body contour-histogram-based action recognition. Table 2 shows the confusion matrix of our approach, in which misclassifications with other actions can be observed.
Since our method considers the stochastic change of the body parts which make the major contribution to an action, recognition using HMMs becomes difficult when the movement of those components is similar across different actions. We found that some jogging sequences were misclassified as walking. The leg motions of jogging and running are very similar, and in some cases problems of resolution and contrast make them difficult to distinguish. Actions involving hand motion have similar problems: some agents perform waving in a way similar to clapping, causing ambiguity. The advantage of the component-wise HMM is that the detection of body parts is very robust, and even if detection fails in certain frames of a test sequence the HMM can still detect the action, since over a long sequence the overall action topology is maintained.
Table 1. Comparison of action detection rates (%). Column (a): contour histogram of full body based detection [5]; (b): local feature and SVM based detection [12]; (c): our approach. Dashes denote actions not reported.

Action     (a)    (b)    (c)
Walking    75.0   83.8   100.0
Jogging    96.3   60.4   60.0
Running    66.7   54.6   76.9
Boxing     48.0   97.9   100.0
Waving     -      59.7   73.4
Clapping   -      73.6   66.7
Table 2. Confusion matrix (%) of action recognition using our component-wise HMM.

           Walking  Jogging  Running  Boxing  Waving  Clapping
Walking    100.0    0.0      0.0      0.0     0.0     0.0
Jogging    40.0     60.0     0.0      0.0     0.0     0.0
Running    0.0      23.1     76.9     0.0     0.0     0.0
Boxing     0.0      0.0      0.0      100.0   0.0     0.0
Waving     0.0      0.0      0.0      20.0    73.4    6.6
Clapping   0.0      0.0      0.0      13.5    19.8    66.7
5. Conclusions
This paper presents an approach to viewpoint-invariant human detection using example-based body-part detectors. As an extension, we learn the stochastic behaviour of the body-part components for human action recognition. We have focused on distinguishing very similar actions, such as walking and jogging (considering features from the legs) or boxing and hand waving (considering features from the hands), by learning the pose changes using HMMs. Human-detection performance can be improved by adding more training samples and introducing more angular views. One difficulty is that there is no good database for the different body-part components, so building a component database is an important task. For action detection, higher-order HMMs could be applied, and information from other body parts (e.g. the arms for walking) could be included to reduce the misclassification rate.
Acknowledgements
This work is supported by EC grants IST-027110 for the HERMES project and IST-045547 for the VIDI-Video project, and by the Spanish MEC under projects TIN2006-14606 and CONSOLIDER-INGENIO 2010 MIPRCV CSD2007-00018. Jordi González also acknowledges the support of a Juan de la Cierva Postdoctoral fellowship from the Spanish MEC.
References
[1] M. Ahmad and S. Lee. Human action recognition using multi-view image sequence features. In FGR, pages 10–12, 2006.
[2] J. Davis and S. Taylor. Analysis and recognition of walking movements. Quebec, Canada, pages 11–15, 2002.
[3] L. Fengjun and N. Ramkant. Recognition and segmentation of 3-d human action using HMM and multi-class AdaBoost. In Computer Vision - ECCV, pages 359–372, 2006.
[4] J. González, F. Roca, and J. J. Villanueva. Hermes: A research project on human sequence evaluation. In Computational Vision and Medical Image Processing (VipIMAGE), pages 359–372, 2007.
[5] M. Mendoza and N. Pérez de la Blanca. HMM-based action recognition using contour histograms. In IbPRIA, Part I, LNCS 4477, pages 394–401, 2007.
[6] A. Micilotta, E. Ong, and R. Bowden. Detection and tracking of humans by probabilistic body part assembly. In British Machine Vision Conference, pages 429–438, 2005.
[7] T. Moeslund, A. Hilton, and V. Krüger. A survey of advances in vision-based human motion capture and analysis. CVIU, 104:90–126, 2006.
[8] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.
[9] S. Park and J. Aggarwal. Semantic-level understanding of human actions and interactions using event hierarchy. In CVPR Workshop on Articulated and Non-Rigid Motion, Washington DC, USA, 2004.
[10] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[11] D. Ramanan, D. Forsyth, and A. Zisserman. Tracking people by learning their appearance. IEEE Transactions on PAMI, 29(1):65–81, 2007.
[12] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, volume III, pages 32–36, 2004.
[13] L. Sigal and M. Black. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University, 2006.
[14] A. Sundaresan, A. RoyChowdhury, and R. Chellappa. A hidden Markov model based framework for recognition of humans from gait sequences. In ICIP, pages 93–96, 2003.