On Bin Configuration of Shape Context Descriptors in Human Silhouette Classification
Mark Barnard and Janne Heikkilä
Machine Vision Group, Department of Electrical and Information Engineering, University of Oulu, Finland
Abstract. Shape context descriptors have been a valuable tool in shape description since their introduction. In this paper we examine the performance of shape context descriptors in the presence of noisy human silhouette data. Shape context descriptors have been shown to be robust to Gaussian noise in the task of shape matching. We implement four different configurations of shape context by altering the spacing of the histogram bins and then test the performance of these configurations in the presence of noise. The task used for these tests is recognition of body part shapes in human silhouettes. The noise in human silhouettes is principally from three sources: the noise from errors in silhouette segmentation, noise from loose clothing and noise from occlusions. We show that in the presence of this noise a newly proposed spacing for the shape context histogram bins has the best performance.
1
Introduction
We have selected the task of identifying body parts in human silhouettes because it is an important first step for many tasks, such as action recognition, tracking and pose estimation. Methods for silhouette classification fall into two broad categories: model-based and learning-based methods. Model-based approaches are top-down methods in which prior knowledge of the silhouette shape is used for part identification; for human silhouettes this is knowledge of the human body in various poses and the location of the limbs in these poses. These models are often used for tracking the location of the limbs in video sequences [4] [13]. Mittal et al. [9] present an algorithm for body part segmentation from silhouette images. Their approach divides the body based on negative minimum curvature. The authors report that their algorithm performed poorly when the body parts, particularly the arms, are close to the body. Methods such as this, which rely on the overall shape of the human body, are not affected by small amounts of noise but would be significantly affected by the type of large deformations caused by the segmentation errors and occlusions we examine in this paper.
Most learning-based methods rely on the recognition or matching of individual pixels in a silhouette. Li et al. [7] use a general geometric transform to match body part pixels in silhouette images. Shape context descriptors are a generic set of descriptors used to describe a shape; most commonly the shape is a line delineating the boundary of the shape. The original application, by Belongie et al. [2], was shape matching by defining the aligning transformation between two objects. Mori and Malik [10] used shape context descriptors to match silhouettes in order to reconstruct the 3-D pose of a person from a single uncalibrated 2-D image. Agarwal and Triggs [1] also used shape context descriptors for recovering the 3-D pose of a person from a single 2-D image. Their algorithm recovers the pose by direct nonlinear regression against shape descriptor vectors. The 2-D shape context descriptors have also been extended for use in 3-D shape matching by Körtgen et al. [6]. Poppe and Poel [11] compared shape context descriptors to Fourier descriptors and Hu moments. They recovered human pose from a large database of human silhouettes under different viewing angles, body dimensions and noise. The Hu moments were considerably inferior to the compared methods under all deformations. While these approaches are less susceptible to errors from occlusions, they can still be affected by noise close to the pixel in question, as we demonstrate in our experiments.
In this paper we specifically examine the effect of noise on the performance of shape context features in a classification problem. While we are looking at human body part segmentation, we believe the results are pertinent to any silhouette classification task, as silhouettes are generally produced using background subtraction. In Section 2 we discuss the nature of the noise that can occur during silhouette segmentation. We describe shape context descriptors in more detail in Section 3, together with a proposed modification to the bin configuration that improves the performance of the descriptors in noisy environments. In Section 4 we present experiments on both artificial and real data using shape context descriptors with various bin configurations.
J. Blanc-Talon et al. (Eds.): ACIVS 2008, LNCS 5259, pp. 850–859, 2008. © Springer-Verlag Berlin Heidelberg 2008
2
Noise in Human Silhouette Data
Silhouettes are commonly used in human pose and action recognition tasks. They are chosen for a number of reasons. Firstly, a binary silhouette image contains no texture or colour information, and this gives a degree of robustness to different subjects and environments. Additionally, silhouettes are relatively easy to produce, given a simple method of background subtraction. Silhouettes are, however, prone to noise such as shadow attachment caused by poor background segmentation performance. In human silhouette data we can broadly classify noise as coming from three sources: poor silhouette segmentation, clothing and occlusions. These three classes of noise are not necessarily disjoint; for example, in extreme cases poor background segmentation can also cause occlusion, such as a limb being detached from the main silhouette.
Noise can be caused by poor silhouette segmentation in the form of shadows, changes in lighting conditions or the background colour matching the foreground colour we wish to segment. This noise can cause dramatic changes in the shape of the silhouette and so seriously degrade the performance of any shape recognition method. In Figure 1(a) it can be seen how the black background causes very poor segmentation, almost making the legs detach from the main silhouette. In real applications we also encounter a variety of clothing being worn by subjects. The effects of loose clothing on the silhouette are shown in Figure 1(b). In this instance it can be seen that the sleeves of the shirt change the outline of the subject's arms. Another source of error in silhouette segmentation is occlusions. This is particularly true in the case of surveillance and monitoring applications where a subject may be entering or leaving a room. Figure 1(c) shows a subject leaving the frame; while the legs are still clearly visible, the head and parts of the upper body have disappeared.
Fig. 1. Examples of real test data used in our experiments: (a) segmentation noise, (b) clothes noise, (c) occlusion noise. The top row shows the original image and the bottom row shows the labelled silhouette image used to evaluate models.
In our experiments we used simple background subtraction to produce silhouettes. While it can produce poorly segmented, and therefore noisy, data, we feel it gives us a good, easily reproducible baseline for testing the shape context feature extraction. Background subtraction is still very much an open problem, and clearly any improvement in the segmentation of the human silhouettes will improve the overall recognition performance.
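The paper does not specify which background subtraction method was used. As a minimal sketch, assuming a static background image and a fixed threshold (the function name `extract_silhouette` and the threshold value are hypothetical), a thresholded absolute difference produces a binary silhouette and also shows how foreground pixels that resemble the background are lost:

```python
import numpy as np

def extract_silhouette(frame, background, threshold=30):
    """Binary silhouette from a thresholded absolute difference.

    `threshold` is a hypothetical value; the paper does not state one.
    """
    # Cast to a signed type so subtracting uint8 images cannot wrap around.
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    if diff.ndim == 3:
        # Colour image: use the largest per-channel difference.
        diff = diff.max(axis=2)
    return diff > threshold

# Toy 6x6 greyscale scene with a uniform background.
background = np.full((6, 6), 100, dtype=np.uint8)
frame = background.copy()
frame[2:5, 1:4] = 200   # 3x3 subject, clearly brighter than the background
frame[4, 2] = 105       # subject pixel that nearly matches the background
frame[0, 5] = 110       # small lighting change, below the threshold

mask = extract_silhouette(frame, background)
print(mask.astype(int))
```

In this toy scene the subject pixel at row 4, column 2 is missed because its intensity is close to the background, which is the kind of segmentation error discussed above.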
3
Shape Context Descriptors
We will now look in more detail at the implementation of shape context descriptors. Shape context descriptors measure the distribution of points in a shape relative to each point in that shape. So for every point in the shape, in our case a human silhouette outline, a feature vector is generated that captures the shape relative to that point. We define our silhouette as a sequence of points $P = (p_1, p_2, \ldots, p_T)$ of length $T$. The space around each point $p_t$ is divided into $N_{ang}$ angular sectors originating from the point and $N_{rad}$ radial shells, each placed at a certain radius $r$ from the point, as shown in Figure 2. These sectors and shells form four-sided bins, the number of which is given by $N_{bins} = N_{ang} \times N_{rad}$. A feature vector is given by $\mathbf{x}_t = \{x_1, x_2, \ldots, x_i, \ldots, x_{N_{bins}}\}$, where $x_i$ is the number of silhouette outline points located in the $i$th bin of the shape context. A sequence of shape context descriptors is thus generated, one for each point on the silhouette, giving $X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$.
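The definition above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `shape_context`, the argument `radial_edges` (the shell boundaries, innermost first) and the bin ordering are assumptions, and points beyond the outermost shell are simply ignored.

```python
import numpy as np

def shape_context(points, radial_edges, n_ang=12):
    """Return a (T, n_ang * n_rad) array of bin counts, one row per point."""
    points = np.asarray(points, dtype=float)
    edges = np.asarray(radial_edges, dtype=float)
    n_rad = len(edges) - 1
    T = len(points)

    # diff[t, j] is the vector from point t to point j.
    diff = points[None, :, :] - points[:, None, :]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Angular sector and radial shell index of every point pair.
    ang_bin = np.minimum((theta / (2 * np.pi) * n_ang).astype(int), n_ang - 1)
    rad_bin = np.searchsorted(edges, dist, side="right") - 1

    # Keep pairs that fall inside the descriptor and exclude the point itself.
    valid = (rad_bin >= 0) & (rad_bin < n_rad) & ~np.eye(T, dtype=bool)

    X = np.zeros((T, n_ang * n_rad), dtype=int)
    for t in range(T):
        idx = rad_bin[t, valid[t]] * n_ang + ang_bin[t, valid[t]]
        X[t] = np.bincount(idx, minlength=n_ang * n_rad)
    return X

# Four outline points on a unit square, two radial shells, four sectors.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
X = shape_context(square, radial_edges=[0.0, 1.2, 2.0], n_ang=4)
print(X.shape)        # one row of N_ang * N_rad counts per point
print(X.sum(axis=1))  # each point sees the three other points
```

Passing the shell boundaries as an argument keeps the counting code independent of the bin spacing, so the same function can be reused for each configuration.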
In the initial definition of shape context descriptors the spacing between the radial shells is equal in log-polar space, as shown in Figure 2(a). The authors state that the reason for this is to be more sensitive to differences in nearby pixels than in more distant pixels [3] [2]. This spacing gives shape context descriptors the property of being localised and thus less affected by noise or artifacts elsewhere in the shape. It should be noted that shape context descriptors were first proposed in order to perform a point-to-point matching between two shapes. In that application distinguishing two nearby points is important, so emphasis is given to close pixels in order to achieve this. However, in other cases we wish to classify a point as belonging to a particular class. In this application the position of closer pixels is less important than the shape of the entire body part and its context in the overall shape of the body.
We examine the performance of four bin configurations for shape context descriptors. The first is the log spacing as specified in the original shape context proposal and shown in Figure 2(a). The second configuration is a reverse log spacing, in which the shell width decreases logarithmically away from the pixel, as shown in Figure 2(b). Thirdly, we implement equal spacing of the shells, seen in Figure 2(c). As the fourth configuration we introduce a new modification to the spacing of the histogram bins in shape context descriptors. We propose weighting the radial bin size as shown in Figure 2(d). We define the overall radius of the shape context descriptor as $R$, the width of each radial bin as $d$ and the distance from the centre of the descriptor as $r$. Instead of the original logarithmic spacing we use a weighted spacing of the radial bins according to
$$
d = \begin{cases} \dfrac{R}{2N} & \text{if } \dfrac{R}{3} < r < \dfrac{2R}{3} \\[1ex] \dfrac{R}{N} & \text{otherwise,} \end{cases} \tag{1}
$$
where $N$ is the number of radial shells with equal spacing.
This places more emphasis on the pixels in the middle distance, where the angular bins are wider, whilst still maintaining a degree of locality in the descriptor.
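The four radial spacings can be sketched as functions that return shell boundaries. Only `weighted_edges` follows Equation (1) directly. The paper gives no formula for the log and reverse log spacings, so the inner-radius fraction `r_min_frac` and the construction of those two are assumptions, and all function names are hypothetical.

```python
import numpy as np

def equal_edges(R, n):
    """n shells of equal width R / n."""
    return np.linspace(0.0, R, n + 1)

def log_edges(R, n, r_min_frac=0.05):
    """n shells whose outer radii are equally spaced in log space.

    r_min_frac (radius of the innermost shell as a fraction of R) is an
    assumed parameter, not a value taken from the paper.
    """
    return np.concatenate(([0.0], np.geomspace(r_min_frac * R, R, n)))

def reverse_log_edges(R, n, r_min_frac=0.05):
    """Mirror of log_edges: shell width decreases away from the centre."""
    return R - log_edges(R, n, r_min_frac)[::-1]

def weighted_edges(R, n):
    """Equation (1): width R/(2n) for R/3 < r < 2R/3 and R/n elsewhere."""
    if n % 3 != 0:
        raise ValueError("n must be divisible by 3 so the thirds align")
    third = n // 3
    inner = np.linspace(0.0, R / 3, third + 1)             # width R / n
    middle = np.linspace(R / 3, 2 * R / 3, 2 * third + 1)  # width R / (2n)
    outer = np.linspace(2 * R / 3, R, third + 1)           # width R / n
    return np.concatenate((inner, middle[1:], outer[1:]))

edges = weighted_edges(R=1.0, n=6)
print(np.diff(edges))  # shell widths: 1/6 in the outer thirds, 1/12 in the middle
```

With this construction the weighted descriptor has 4N/3 radial shells in total, because the middle third of the radius holds twice as many shells as it would under equal spacing.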
4
Experiments
In these experiments we test the performance of the different configurations of shape context descriptors described in Section 3 in the task of body part segmentation, initially with artificial data, then with real images. The aim is to classify each pixel on the silhouette outline into one of four body part classes: Head, Body, Arms and Legs. The entire body silhouette outline is modelled as a sequence using Hidden Markov Models (HMMs) [12], with each body part being a state of the HMM and modelled with a Gaussian Mixture Model (GMM). During recognition, Viterbi decoding is used to give the most likely sequence of body parts given the trained model. Specifics of this approach are described in more detail in previous work [8].
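The decoding step can be illustrated with a small log-domain Viterbi routine. This is a sketch under stated assumptions, not the authors' code: the per-point emission likelihoods, which in the paper come from the GMM of each state, are replaced here by made-up values, and the transition probabilities are likewise illustrative.

```python
import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """Most likely state sequence for a (T, S) array of emission log-likelihoods."""
    T, S = log_emis.shape
    delta = log_init + log_emis[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-probability of being in state i, then moving to j.
        scores = delta[:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    # Trace the best path backwards from the final point.
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

states = ["Head", "Body", "Arms", "Legs"]
S = len(states)

# Ergodic (fully connected) transitions that favour staying in the same part.
trans = np.full((S, S), 0.03)
np.fill_diagonal(trans, 0.91)

# Illustrative emissions: each outline point favours one state; point 3 is noisy.
favoured = [0, 0, 0, 1, 0, 0, 2, 2, 2]
emis = np.full((len(favoured), S), 0.1)
emis[np.arange(len(favoured)), favoured] = 0.7

path = viterbi(np.log(emis), np.log(trans), np.log(np.full(S, 0.25)))
print([states[s] for s in path])
```

In this toy sequence the isolated point that favours Body is relabelled as Head, since two extra state changes cost more than one poorly explained observation. This is the smoothing effect that sequence modelling adds over classifying each outline point independently.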
Fig. 2. Shape context descriptors showing the four bin spacing configurations used in our experiments: (a) log spacing, (b) reverse log spacing, (c) equal spacing, (d) weighted spacing.
In this case the model is trained using artificial labelled training data. This data was generated using the Carnegie-Mellon University Graphics Lab Motion Capture Database, available at http://mocap.cs.cmu.edu/. The motion capture data was used to animate a 3D figure rendered using the Maya rendering package.
The training and validation sets consisted of three different subjects performing a series of actions in five different sequences, totaling more than 11000 images. The test sets consisted of both artificially generated data and real data; these are described in the following sections. Descriptors for training and testing were extracted from the data using the different shape context configurations described in Section 3. This gives us four models, one for each of the shape descriptor configurations. Each model is a fully connected, or ergodic, HMM with four states, one for each body part, and each state consists of a GMM with 15 Gaussians.
4.1
Artificial Data
In the first set of experiments we measured the performance of each of the shape context configurations using artificially generated data. We again used data from the Carnegie-Mellon University Graphics Lab Motion Capture Database, in this case a video sequence of a subject not seen in the training data. The body parts in each image are automatically labelled during the rendering process, so we can easily generate a relatively large amount of labelled test data. In this case we have 1000 labelled images of a person in different poses. In order to test the performance of the different descriptor configurations in noisy conditions we add Gaussian distributed noise to each image. We offset the position of each pixel in the silhouette by a Gaussian distributed amount and increase the variance of the Gaussian distribution to increase the amount of noise. This is the procedure used by Belongie et al. [3] to test the log spaced shape context descriptors and taken from the procedure described by Chui and Rangarajan [5]. The effects of different levels of Gaussian noise can be seen in Figure 3.
Fig. 3. Synthetically generated figure with different levels of Gaussian distributed noise: (a) σ = 0, (b) σ = 0.004, (c) σ = 0.02, (d) σ = 0.05.
The performance of the different shape descriptor configurations in the presence of different levels of Gaussian noise can be seen in Figure 4. This plot shows the pixel recognition rate for each configuration. We can see that most of the shape context configurations are reasonably robust to the presence of Gaussian noise. Indeed, there is little difference in the performance of log spacing, equal spacing and weighted spacing, with all having an approximately 6% drop in recognition rate at the highest level of noise. In the case of the reverse log spacing, however, the performance even on clean silhouettes is relatively poor in comparison to the other configurations. This is probably due to the lack of locality in the descriptor: with each bin having equal area, the pixels further away are given the same importance as those close to the pixel. Interestingly, although the reverse log configuration starts from a lower base, it seems to be less affected by the Gaussian noise, with a 4% drop in the recognition rate.
Fig. 4. Results of testing on artificially generated figures with different levels of Gaussian noise. [Plot "Results of Gaussian noise on artificial shapes": edge pixel recognition rate (0.84 to 0.96) against variance of Gaussian noise function (0 to 0.05) for weighted spacing, equal spacing, log spacing and reverse log spacing.]
4.2
Real Data
The next set of experiments involved testing the different shape descriptor configurations on real data. This data was divided into four categories based on the sources of noise in the image. As stated in Section 2, an image can be affected by more than one source of noise; however, we have attempted to classify the images based on the main source of noise. The data is divided into:
Tight clothes. This is video data taken from the CMU Graphics Lab Motion Capture Database of a subject dressed in tight fitting clothing moving slowly around an empty room. An image from this sequence can be seen in Figure 5(a).
Normal clothes. This is video data taken from various sources, some recorded with good lighting and a relatively clean background (see Figure 5(d)), while others are noisier surveillance sequences (see Figure 5(b)). In these sequences all subjects are wearing various types of normal clothing.
Occlusions. In this data the subject is partially occluded. Some of the occlusions are natural, due to the subject leaving the frame (see Figure 5(c)), and some have been created artificially by removing a section of the silhouette (see Figure 5(d)). This class is sub-divided into tight clothes with occlusions and normal clothes with occlusions.
Fig. 5. Examples of real test data used in our experiments: (a) tight clothes, (b) normal clothes, (c) tight clothes with occlusions, (d) normal clothes with occlusions. The top row shows the original image and the bottom row shows the labelled silhouette image used to evaluate models.
To test the performance of the models on this real data it was necessary to hand label the body parts in a number of extracted silhouettes. Examples of this labelling can be seen in Figures 5(a), 5(b), 5(c) and 5(d). This hand labelling is time consuming and expensive, so the number of test images is smaller than in the case of artificial data. In all, 80 images of each of the four image classes were labelled, giving a total of 320 images in the real test set. The results of testing each of the shape context configurations on each of the four noise classes are shown in Table 1; the recognition rate for artificial data is also shown as a baseline.
Table 1. Recognition results for the four noise classes tested with each of the shape context configurations. Results are given as per pixel percentage recognition rates.
                      Log space   Rev Log space   Equal space   Weighted space
Artificial (σ = 0)    93.6        88.0            94.5          94.4
Tight clothes         82.7        81.9            83.9          87.0
Normal clothes        78.5        79.0            82.7          85.8
Occluded tight        77.2        76.1            77.3          81.8
Occluded normal       71.2        71.2            74.9          77.7
While all the configurations were relatively robust to Gaussian noise in artificial data, the results in Table 1 show very different results on real data. Even the noise from a tightly clothed figure with no occlusions produces a significant drop in performance for all the configurations. This fall in recognition rate is not equal across all configurations, however, with log and equal spacing dropping by 11.1% and 10.6% respectively, while the reverse log spacing and weighted spacing fell by 6.1% and 7.4% respectively. This difference may be explained by segmentation errors causing dramatic changes to the local shape of a body part, so the configurations with more emphasis on close pixels are more adversely affected by this noise. The addition of noise produced by subjects wearing normal clothes has less impact on the performance. The log spacing configuration again has the largest fall in recognition rate, 4.2%. This would again seem to be due to the emphasis being placed on close pixels, so small changes in shape produced by clothing will have a larger effect than on the other configurations. If we look at the effect of occlusions it can be seen that the log spaced configuration is most robust by a small margin in the case of both tight and normal clothing. This is to be expected, as the locality of the descriptor configuration will minimise the effect of occlusions in one part of the shape on shape contexts in other parts of the shape. On balance, taking into account all sources of noise, the proposed weighted configuration has the best performance.
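One simple way to summarise Table 1 across the noise classes is an unweighted mean over the four real-data rows. The paper does not state how the overall comparison was made, so the averaging here is an assumption; it is reasonable only because each class contains the same number of images (80).

```python
# Per pixel recognition rates (%) for the four real-data rows of Table 1.
configs = ["Log", "Reverse log", "Equal", "Weighted"]
table1_real = {
    "Tight clothes":   [82.7, 81.9, 83.9, 87.0],
    "Normal clothes":  [78.5, 79.0, 82.7, 85.8],
    "Occluded tight":  [77.2, 76.1, 77.3, 81.8],
    "Occluded normal": [71.2, 71.2, 74.9, 77.7],
}

# Unweighted mean per configuration (an assumed summary, not from the paper).
means = {
    name: sum(row[i] for row in table1_real.values()) / len(table1_real)
    for i, name in enumerate(configs)
}
best = max(means, key=means.get)

for name in configs:
    print(f"{name:12s} {means[name]:.2f}")
print("best on average:", best)
```

Under this summary the weighted spacing has the highest mean rate on real data, in line with the conclusion drawn from the table.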
5
Conclusions
We have seen in these experiments the performance of four different configurations of shape context descriptors on both artificial and real data. The first point to make is the difference in performance between Gaussian noise and real noise. Clearly, real noise caused by poor silhouette segmentation does not follow a Gaussian distribution, and conclusions drawn from tests using Gaussian distributed noise should not be transferred to data containing real noise. It is also clear that in this application the log spaced configuration is not the best performing: its locality causes errors when segmentation artifacts or clothing cause large deformations in local pixels. The proposed weighted configuration performed best overall, being more robust to both segmentation and clothing noise. This configuration can be seen as a trade-off between having some degree of locality and having some context from pixels further away. While we have demonstrated the performance of the proposed shape context descriptor configuration in the task of human silhouette segmentation, it can equally apply to any application with noisy silhouette data.
References
1. Agarwal, A., Triggs, B.: Recovering 3d human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7), 1052–1062 (2006)
2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
3. Belongie, S., Malik, J., Puzicha, J.: Shape context: A new descriptor for shape matching and object recognition. In: NIPS, pp. 831–837 (2000)
4. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: CVPR 1998: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, p. 8. IEEE Computer Society, Los Alamitos (1998)
5. Chui, H., Rangarajan, A.: A new algorithm for non-rigid point matching. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2000)
6. Körtgen, M., Park, G.-J., Novotni, M., Klein, R.: 3d shape matching with 3d shape contexts. In: Proc. 7th Central European Seminar on Computer Graphics, Budmerice, Slovakia (April 2003)
7. Li, J., Zhou, S.K., Chellappa, R.: Appearance modeling under geometric context. In: ICCV 2005: Proceedings of the Tenth IEEE International Conference on Computer Vision, Washington, DC, USA, pp. 1252–1259. IEEE Computer Society, Los Alamitos (2005)
8. Barnard, M., Matilainen, M., Heikkilä, J.: Body part segmentation of noisy human silhouette images. In: Proc. IEEE International Conference on Multimedia and Expo (ICME 2008), Hannover, Germany (2008)
9. Mittal, A., Zhao, L., Davis, L.S.: Human body pose estimation using silhouette shape analysis. In: Proceedings of IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 263–270 (July 2003)
10. Mori, G., Malik, J.: Recovering 3d human body configurations using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7), 1052–1062 (2006)
11. Poppe, R., Poel, M.: Comparison of silhouette shape descriptors for example-based human pose recovery. In: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (2006)
12. Rabiner, L.R.: A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
13. Sidenbladh, H., Black, M.J., Sigal, L.: Implicit probabilistic models of human motion for synthesis and tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 784–800. Springer, Heidelberg (2002)