Unusual Activity Recognition in Noisy Environments

Matti Matilainen, Mark Barnard, and Olli Silvén

University of Oulu, Department of Electrical and Information Engineering, Machine Vision Group
{matti.matilainen,olli.silven}@ee.oulu.fi, [email protected]
http://www.ee.oulu.fi/mvg/

Abstract. In this paper we present a method for unusual activity recognition intended for home environment monitoring. Monitoring systems are needed in elderly persons' homes to generate automatic alarms in case of emergency. The unusual activity recognition method presented here is based on a body part segmentation algorithm that estimates how similar the current pose is to the poses in the training data. As there is an arbitrary number of possible unusual activities, it is impossible to train a system to recognize every one of them. Instead, we train our system to recognize a set of normal poses and consider everything else unusual. Normal activities in our case are walking and sitting down.

Keywords: Computer vision, body part segmentation, unusual activity recognition.
1 Introduction
Unusual activity recognition has many possible applications in automatic surveillance. We concentrate on monitoring elderly people in home environments to generate automatic alarms when unusual activity is detected. There are other solutions for raising alarms, such as passive infrared sensors, accelerometers, and pressure pads. These kinds of sensors must be worn all the time, so the automatic video monitoring approach is much more comfortable for the person using it. The problem with computer vision based systems is that they usually have to be trained for each installation location separately, which increases the cost of the system significantly, and in domestic environments the furniture causes difficult occlusions. We have developed a solution that requires neither training nor adjustment of parameters in a new location. Our system uses only one uncalibrated camera, and the features and statistical methods we use to model the actions work well in very noisy environments and under occlusions.

J. Blanc-Talon et al. (Eds.): ACIVS 2009, LNCS 5807, pp. 389–399, 2009. © Springer-Verlag Berlin Heidelberg 2009

Unusual activity recognition has been addressed in several publications. Chowdhury and Chellappa [1] tracked persons and then classified their actions as normal or unusual based on the trajectories: they obtained a set of basis shapes from the training trajectories, and unknown activities were recognized by projecting onto those basis shapes. Salas et al. [2] also used only the trajectories of objects; they presented a method that detects forbidden turns and red light infringements at vehicular intersections. Nait-Charif et al. [3] trained Gaussian Mixture Models (GMMs) for recognizing inactivity zones from an overhead camera: when the monitored person becomes inactive outside an inactivity zone, it is considered unusual activity. Mahajan et al. [4] used several layers of finite state machines (FSMs) to model the activities; they label an unknown activity unusual if the logical layer FSM fails to recognize it. Töreyin et al. [5] calculated the height and width ratios of bounding boxes; this sequence was wavelet transformed and used as features in HMM based classification to discriminate between walking and falling over. The method has problems telling the difference between sitting down and falling over, so they also analyzed the audio tracks of the videos to find the high amplitude sounds produced by falling over.

In statistical pattern recognition the patterns are D-dimensional feature vectors that can be considered points in a D-dimensional feature space. Each class forms a cluster of points in the feature space, and an unknown sample vector can be classified into one of these classes depending on how close it is to the corresponding cluster. If the distributions in the training data are the same as in the real world, the classifier performs optimally. The classifier needs enough training data to be able to generalize to patterns it has never seen before; if there is not enough training data, the classifier cannot recognize anything but the patterns included in the training data. This is called overfitting. Complex models need large amounts of training data, and in some problems the required training set is very large and must be labelled. Labelling video data can be very time consuming.
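The feature-space picture above can be made concrete with a minimal sketch. The two clusters and the nearest-centre decision rule below are purely illustrative, not the classifier used in this paper:

```python
import numpy as np

# Two illustrative classes as clusters of 2-D feature vectors.
rng = np.random.default_rng(0)
clusters = {
    "walk": rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    "sit": rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
}
centres = {name: pts.mean(axis=0) for name, pts in clusters.items()}

def classify(x):
    # Assign an unknown feature vector to the class with the nearest cluster centre.
    return min(centres, key=lambda name: np.linalg.norm(x - centres[name]))
```

With too few training points per cluster the centres are poorly estimated, which is a simple illustration of the overfitting problem described above.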
Sometimes it is impossible to label the required data by hand, and synthetic training data can be used instead. We propose the use of artificial training data to generate a large number of training examples. Our synthetic data was created through motion capture: we captured sequences with 5 test subjects and from these sequences rendered a training database of over 50000 frames, in which the body parts are automatically labelled by color. It would be impossible to label this much data by hand.

Synthetic training data has been used in many applications. Varga and Bunke [6] created a method for generating synthetic training data from handwritten text by applying geometrical transforms and thinning/thickening operations to the text lines. Heisele and Blanz [7] used morphable models for creating more training data from a small set of face images: they built a 3D model from the example image and then rendered it under varying lighting conditions and from different viewing angles.

In Section 2 we present our unusual activity recognition method. We use Hidden Markov Models and Gaussian Mixture Models to segment the body parts from background subtracted silhouette images. The body part segmentation algorithm gives an estimate of how well a given silhouette corresponds to the ones in the training data, and this information is used at a higher level to detect the unusual activities. The experiments conducted are described in Section 3 and discussed in Section 4.
2 Our Approach
Our unusual activity recognition method builds on the body part segmentation algorithm proposed by Barnard et al. [8]. Before body part segmentation, a background subtraction algorithm is applied to each frame. Figure 1 illustrates some example frames where the body parts are segmented. The frames are from five different sources: each is shot with different equipment, and the person performing the actions is different in each frame. The resolution and lighting conditions vary. The first column is the original input frame, the second column shows the result of background subtraction, and the third column is the final result. In some of the frames the arms are very close to the body and they are still correctly recognized. There is also some noise that causes the limbs to be cut out of the silhouette. The silhouette edges are often distorted by varying lighting conditions and shadows that cause the background subtraction to fail.

Fig. 1. Frames where the body part segmentation is applied

The body part segmentation algorithm uses Hidden Markov Models (HMMs) [9] and GMMs. These statistical models need a lot of labelled training data to avoid overfitting. We trained the models only with synthetic training data created through motion capture. In the motion capture process 5 subjects (3 male and 2 female) were used. 16 optical markers were attached to each person and the markers were tracked with three cameras. The marker trajectories were used to animate a 3D model in which each body part was labelled with a different color. This way we had to label the model only once to create a training database of over 50000 frames; labelling that much data by hand would be very time-consuming. Some example frames are shown in Figure 2, where the model is rendered from different viewing angles. We can automatically re-render the synthetic training data from new viewing angles, under new lighting conditions and with added occlusions. We also used some data from the CMU motion capture database. The motion capture process is discussed in detail in [8].

Fig. 2. Examples from the training data. The labelled 3D model rendered from 12 different viewing angles with different offsets.

We used the shape context features proposed by Belongie et al. [10] to represent the silhouette edges. Shape context features have been used in shape matching and in defining the aligning transformation between two objects. The silhouette edge is sampled at regular intervals. For each sampled edge point, the distance and direction to every other edge point that falls within the maximum distance are calculated and stored in a histogram. Belongie et al. used a log-polar histogram, in which the spacing is equal in logarithmic space; it is illustrated on the left of Figure 3. We conducted experiments using modified features [11]: instead of a log-polar histogram we used weighted radial bins, as illustrated on the right of Figure 3. The width of the radial bin at distance r from the center is given by

\[
d(r) =
\begin{cases}
\dfrac{R}{2N} & \text{if } \dfrac{R}{3} < r < \dfrac{2R}{3}\\[4pt]
\dfrac{R}{N} & \text{otherwise}
\end{cases}
\tag{1}
\]
Fig. 3. Illustration of the original and weighted bins
where N is the number of radial bins with equal spacing and R is the overall radius of the shape context descriptor. This way the middle radial bins are emphasized and the locality of the descriptor is maintained.

A GMM was trained for each body part from the shape context features. The GMMs form the states of an HMM. We consider each silhouette outline as a sequence of shapes corresponding to body parts (head, arms, legs and body). Using an HMM we can constrain the shape recognition by taking into account the transitions between different body parts using Viterbi decoding [9]. The state transition matrix is estimated from the labelled training data. The state transitions are illustrated in Figure 4.
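Returning to Eq. (1): one plausible reading is that it gives the width of each radial bin, so that bins in the middle third of the descriptor radius are half as wide and more bins fall there. Under that assumed reading, the bin edges could be computed as:

```python
def radial_bin_edges(R, N):
    """Radial bin edges under the assumed reading of Eq. (1): the spacing is
    R/(2N) when the current radius lies in the middle third of the overall
    radius R, and R/N elsewhere, so the middle bins are emphasized."""
    edges = [0.0]
    r = 0.0
    while r < R:
        step = R / (2 * N) if R / 3 < r < 2 * R / 3 else R / N
        r = min(r + step, R)
        edges.append(r)
    return edges

# Extra, narrower bins appear in the middle third of the radius.
edges = radial_bin_edges(R=9.0, N=4)
```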
Fig. 4. Transitions between body parts
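The constrained decoding step can be sketched as a standard Viterbi pass over the body-part states. The emission scores below are made-up numbers standing in for the per-body-part GMM likelihoods, and the uniform transition matrix is only a placeholder for the one estimated from the labelled training data:

```python
import numpy as np

states = ["head", "arms", "body", "legs"]

def viterbi(log_emit, log_trans, log_init):
    # log_emit: (T, S) per-frame log-likelihood of each state,
    # log_trans: (S, S) log transition matrix, log_init: (S,) log priors.
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):       # backtrack through the pointers
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

# Toy sequence of three edge segments with illustrative emission scores.
log_emit = np.log(np.array([
    [0.90, 0.05, 0.03, 0.02],   # looks like a head
    [0.10, 0.80, 0.05, 0.05],   # looks like an arm
    [0.05, 0.05, 0.10, 0.80],   # looks like a leg
]))
log_trans = np.log(np.full((4, 4), 0.25))
log_init = np.log(np.full(4, 0.25))
path = viterbi(log_emit, log_trans, log_init)
```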
394
M. Matilainen, M. Barnard, and O. Silv´en
The body part segmentation algorithm outputs a label (head, body, arms or legs) for each silhouette edge pixel together with its likelihood. We calculated the average likelihood over all the edge pixels: when the person is in a pose that is not present in the training data, the average likelihood drops, and the pose is then considered abnormal. The average likelihood of the pose is given by

\[
L = \frac{1}{m} \sum_{i=1}^{m} l_i
\tag{2}
\]
where L is the average likelihood, m is the number of edge pixels in the silhouette and l_i is the likelihood of the i-th pixel. We used three sequences containing walking (normal activity) and falling over (unusual activity) as the training set to find the statistically optimal threshold for unusual poses; these sequences were not included in the testing set. The average likelihoods for walking and falling over were 55.0 (standard deviation 7.5) and 21.0 (standard deviation 9.3) respectively, and the threshold was set to 39.5.

The sequence of thresholded likelihoods can be used for higher level recognition. We propose majority voting over a large number of consecutive decisions, because the actions occur over a period of time. This way the influence of single incorrectly decided frames is not significant: as the majority voting window moves through the signal, it works like a low pass filter. If the decision were made from only one frame at a time, a large number of false alarms would be raised. False alarms are not as harmful as a missed unusual activity, but to keep the system running robustly they must be suppressed. The method can also be used with several cameras if the whole room cannot be covered with a single camera: the frames from each camera are concatenated into a single buffer and the method is applied to all the frames in the buffer.
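Putting Eq. (2), the threshold of 39.5 and the majority vote together, a minimal sketch of the frame-level decision logic might look like this (the constants come from the paper; the function names are ours):

```python
import numpy as np

THRESHOLD = 39.5   # from the paper: between walking (mean 55.0) and falling (mean 21.0)
WINDOW = 50        # majority vote over the 50 most recent frames

def frame_is_normal(pixel_likelihoods):
    # Eq. (2): the average likelihood over all silhouette edge pixels.
    return float(np.mean(pixel_likelihoods)) >= THRESHOLD

def raise_alarm(frame_decisions):
    """Majority vote over the most recent WINDOW per-frame decisions; an
    alarm is raised only when more than half of them are 'unusual', which
    low-pass filters isolated misclassified frames."""
    recent = frame_decisions[-WINDOW:]
    return sum(not normal for normal in recent) > len(recent) / 2
```

A single below-threshold frame in an otherwise normal window then no longer triggers an alarm.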
3 Experiments
In our experiments we used 16 test sequences that were shot from only one viewing angle. In 14 of the sequences a person walks, falls over and gets up. In one sequence the person walks, sits down and then gets up, and in the last sequence the person simply walks around the room; these sequences without any unusual activities are used to test whether the method gives false alarms. The test sequences were recorded in different locations using different hardware to get more variation in the data, and we used both male and female subjects. The same models and parameters were used in every test; the models were trained using only the synthetic training data created through motion capture.

We also tested the method with majority voting over the 50 previous frames: if the average likelihood is below the threshold in over 50% of the frames, the activity is considered unusual. The frame rate of the sequences varies from 10 fps to 20 fps, so the buffer covers from 5.0 s to 2.5 s respectively. If at least one frame in a walking sequence is classified as unusual, then a false alarm
has been made. Without any higher level decisions the system is prone to false alarms: there were 5 false alarms in the walking sequences (16 sequences in total) without the voting method, and none with it. All the falling over activities were detected correctly both with and without voting. Figure 5 shows the average likelihood plot of a walking sequence together with the thresholded sequence, where 1 and 0 denote normal and unusual activity respectively. There are several frames where the likelihood drops below the threshold, so false alarms would have been generated; when the voting method is applied to this sequence, no false alarms are generated.

Fig. 5. Average likelihood plot of a walking sequence and the thresholded sequence

The majority voting can be applied over any number of frames from different cameras. In addition to the 16 single camera tests we ran tests with International Conference on Distributed Smart Cameras (ICDSC, http://wsnl2.stanford.edu/icdsc09challenge/) videos (four viewing angles) and our own motion capture videos (three viewing angles) that were not used in the training. We chose to use a buffer of 10 frames from each video stream, resulting in buffers of 40 and 30 frames for the ICDSC and motion capture videos respectively. Figure 6 illustrates frames from each of the four cameras of the ICDSC videos. In these sequences the background subtraction failed: the person moves static objects in the scene and is badly occluded by furniture. The voting method raised one false alarm in this sequence. There were no false alarms in the motion capture videos, and a correct alarm was raised when the person fell over; Figure 7 shows frames from the motion capture sequence where the person has fallen over. Our experiments show that the voting method reduces the overall frame classification rate, but this is acceptable: it is not necessary that every single frame is recognized correctly as long as the alarms are raised correctly.

Fig. 6. ICDSC frames with corresponding background subtracted frames

Fig. 7. Frames from our motion capture database

Fig. 8. Frames from a fall over sequence and the corresponding average likelihood plot
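The multi-camera setup described above (10 frames per stream, concatenated into one buffer) can be sketched as follows; the camera count and the decision values are illustrative:

```python
from collections import deque

FRAMES_PER_STREAM = 10   # as in the ICDSC experiments: 4 cameras -> 40-frame buffer

class MultiCameraVoter:
    """Keeps the last FRAMES_PER_STREAM per-frame decisions from each camera
    and majority-votes over the concatenated buffer."""

    def __init__(self, n_cameras):
        self.buffers = [deque(maxlen=FRAMES_PER_STREAM) for _ in range(n_cameras)]

    def push(self, camera, is_normal):
        self.buffers[camera].append(is_normal)

    def alarm(self):
        # Concatenate the per-camera buffers and vote over the pooled decisions.
        pooled = [d for buf in self.buffers for d in buf]
        return sum(not d for d in pooled) > len(pooled) / 2

voter = MultiCameraVoter(n_cameras=4)
for cam in range(4):
    for _ in range(10):
        voter.push(cam, False)   # every camera sees an unusual pose
```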
4 Discussion
In the testing data 93.41% of the frames were classified correctly when using the statistically optimal threshold. Most of the misclassified frames were those where the person was falling over or getting up. Figure 8 shows a few frames of a falling over sequence from the test data and the corresponding average likelihood plot; an example sit down sequence is illustrated in Figure 9. In the sit down sequence the likelihood drops when the activity changes from walking to sitting, but it stays over the threshold, whereas the average likelihood drops sharply below the threshold when the person falls over.
Fig. 9. Frames from a sit down sequence and the corresponding average likelihood plot
Fig. 10. The receiver operator characteristic
We ran the tests using different thresholds. The resulting receiver operator characteristic (ROC) curve is plotted in Figure 10, in which the true negatives (correctly recognized walking) are plotted against the false negatives (missed unusual activities). It is important to have a low false negative rate, because missing an unusual activity in a monitoring application such as this could be disastrous. While false positives are not as critical as false negatives, it is still important to keep their number low so that the system runs robustly.

The system presented here has many applications. It can be used in elderly persons' homes to monitor for medical emergencies, and it does not require any sensors to be worn by the monitored person. It is cheap to install in a new location because the training is done only once and the trained models can be used in each installation location.
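The two rates plotted in the ROC curve can be computed from per-frame average likelihoods as sketched below; the score distributions here are synthetic, drawn to match the means and standard deviations reported in Section 2:

```python
import numpy as np

def roc_points(normal_scores, unusual_scores, thresholds):
    """For each threshold: the true negative rate (walking frames whose score
    stays above the threshold) and the false negative rate (unusual frames
    missed because their score is also above the threshold)."""
    normal_scores = np.asarray(normal_scores)
    unusual_scores = np.asarray(unusual_scores)
    return [(t,
             float(np.mean(normal_scores >= t)),    # true negative rate
             float(np.mean(unusual_scores >= t)))   # false negative rate
            for t in thresholds]

# Synthetic scores matching the reported statistics (walking: 55.0 +/- 7.5,
# falling over: 21.0 +/- 9.3).
rng = np.random.default_rng(1)
pts = roc_points(rng.normal(55.0, 7.5, 1000),
                 rng.normal(21.0, 9.3, 1000),
                 thresholds=[30.0, 39.5, 50.0])
```

Sweeping the threshold trades missed alarms against false ones; the paper's operating point of 39.5 sits between the two score distributions.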
5 Conclusions
We have presented a solution for unusual activity recognition that uses a body part segmentation algorithm to determine how similar an unknown pose is to the poses in the training data. If the pose is close enough to the trained models, it is considered normal. The unknown activity sequences were processed with a majority voting window to ignore the single frames where the recognition was incorrect. The body part segmentation algorithm was trained using only synthetic training data that we created through motion capture. We have shown that the solution works in very noisy environments: the models were trained only once and the same models were used in all the test cases, even though the test sequences were shot in different locations and with different equipment.
References

1. Chowdhury, A., Chellappa, R.: Factorization Approach for Activity Recognition. In: Computer Vision and Pattern Recognition Workshop, vol. 4(4), p. 41 (2003)
2. Salas, J., Jiménez, H., González, J., Hurtado, J.: Detecting Unusual Activities at Vehicular Intersections. In: IEEE International Conference on Robotics and Automation, pp. 864–869 (2007)
3. Nait-Charif, H., McKenna, S.J.: Activity Summarisation and Fall Detection in a Supportive Home Environment. In: International Conference on Pattern Recognition, pp. 323–326 (2004)
4. Mahajan, D., Kwatra, N., Jain, S., Kalra, P., Banerjee, S.: A framework for activity recognition and detection of unusual activities. In: Proceedings of Indian Conference on Computer Vision, Graphics and Image Processing, pp. 15–21 (2004)
5. Töreyin, U.B., Dedeoglu, Y., Cetin, A.E.: HMM Based Falling Person Detection Using Both Audio and Video. In: Signal Processing and Communications Applications, pp. 1–4 (2006)
6. Varga, T., Bunke, H.: Generation of synthetic training data for an HMM-based handwriting recognition system. In: Proceedings International Conference on Document Analysis and Recognition, p. 618 (2003)
7. Heisele, B., Blanz, V.: Morphable models for training a component-based face recognition system. In: Computer Vision and Pattern Recognition, p. 1055 (2004)
8. Barnard, M., Matilainen, M., Heikkilä, J.: Body Part Segmentation of Noisy Human Silhouette. In: International Conference on Multimedia and Expo, pp. 1189–1192 (2008)
9. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
10. Belongie, S., Malik, J., Puzicha, J.: Shape Matching and Object Recognition Using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
11. Barnard, M., Heikkilä, J.: On Bin Configuration of Shape Context Descriptors in Human Silhouette Classification. In: International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 850–859 (2008)