Feature Extraction for Pedestrian Classification Under the Presence of Occlusions

Laurens van der Maaten∗    Guido de Croon∗

MICC, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands {l.vandermaaten,g.decroon}@micc.unimaas.nl

Abstract

The identification of pedestrians is an important problem in a wide range of computer vision applications. Previous work on pedestrian identification mainly focuses on feature extraction approaches. In this study, we investigate four new features for pedestrian identification that aim to overcome the limitations of features employed in previous studies. In particular, we focus on features that are to some extent robust to the presence of occlusions in the pedestrian images. The four new feature extraction approaches are based on: (1) nonlinear dimensionality reduction, (2) texton frequency histograms, (3) Latent Dirichlet Allocation, and (4) local receptive field networks. We test the performance of our feature extraction approaches on the NiSIS competition dataset. From the obtained results, we conclude that Support Vector Machines trained on PCA features perform best, most likely due to the exploitation of biases in the NiSIS dataset.

1 Introduction

The identification of pedestrians (or, in general, persons) is an important problem in computer vision applications such as surveillance, robotics, and control systems. The overall appearance of an observed pedestrian is subject to a variety of transformations, such as viewing angle, lighting conditions, physical properties of the pedestrian, and occlusions. Humans are known to perform very well on the identification of pedestrians under these transformations and distortions. In order to foster the development of robust techniques for pedestrian identification, NiSIS organized a competition that aims at the development of such techniques. In particular, the development of biologically plausible techniques is encouraged.

The number of previous studies on pedestrian classification is limited [9, 16, 18]. An approach to pedestrian classification using two high-definition cameras is presented by Scotti et al. [18], who identify pedestrians by means of their specific geometric properties. In [9], Gavrila and Munder present an approach to pedestrian identification that detects pedestrians by combining multiple cues such as shape and texture features. In later work [16], Munder and Gavrila compare three different feature extraction approaches and two different classifiers on a dataset of very small (36 × 18 pixels) pedestrian and non-pedestrian images. The three feature extraction approaches that are studied are PCA, Haar wavelets, and local receptive fields, whereas the classification is performed by means of nearest neighbor classifiers and Support Vector Machines (SVMs). The results in [16] reveal that a combination of local receptive field features and SVMs performs best.

The three features investigated in [16] suffer from various weaknesses. For instance, PCA features assume that the pedestrian images lie on or near a linear manifold in the image space, an assumption that is likely to be violated by the data. Haar wavelets are very sensitive to translations in images. The local receptive field features are obtained using a training procedure that is likely to get stuck in local minima of the objective function, and is therefore likely to be suboptimal. In addition, all three features are sensitive to the presence of occlusions in the data.

In this paper, we present an experimental study that aims to address the weaknesses of the features studied in [16]. The quality of the features is evaluated on the NiSIS competition dataset, which is similar to the dataset employed in [16] except for the presence of partial occlusions in the data. As a result, our selection of features focuses on addressing the susceptibility to occlusions of the features investigated in [16]. In particular, we focus on four alternative feature extraction approaches: (1) nonlinear dimensionality reduction, (2) texton frequency histograms, (3) Latent Dirichlet Allocation, and (4) a local receptive field network.

The outline of the remainder of this paper is as follows. In section 2, the setup of the NiSIS competition and the characteristics of the data are discussed. Section 3 presents the four feature extraction approaches. In section 4, we present the results of experiments with classifiers trained on the four feature types. These results are discussed in more detail in section 5. In section 6, we conclude that (despite their weaknesses) PCA features perform best, most likely due to a strong bias in the training set.

2 NiSIS Competition

The NiSIS competition aims at promoting the development of robust techniques for pedestrian classification and, in particular, of techniques that are robust against partial occlusions of the pedestrians. The NiSIS competition dataset consists of three main parts: (1) a labeled training set of 1,225 images, (2) a labeled test set of 2,450 images, and (3) an unlabeled test set of 6,125 images that is used in the assessment of the submissions to the competition. The dataset contains images of 36 × 18 pixels that either do or do not depict a pedestrian. In addition, a subset of the images contains artificial occlusions. Some example images from the dataset, one of which is partially occluded, are shown in Figure 1. The aim of the competition is to develop a system that recognizes whether or not a pedestrian is shown in an unlabeled image (i.e., the classification problem is binary). In general, such a problem is addressed by extracting certain features from the images and training a classifier on these feature representations. In the remainder of the paper, we mainly focus on the feature extraction from the images.


Figure 1: Four images from the NiSIS dataset.

3 Feature extraction

The aim of feature extraction from images is twofold: (1) obtaining a meaningful representation of the images by exploiting the spatial structure apparent in such images, and (2) reducing the overall dimensionality of the image data. By achieving these two aims, it becomes possible to successfully train a classifier on the feature representations. In the feature extraction from images in the NiSIS competition dataset, there are two complicating factors. First, the small size of the images does not allow for, e.g., feature extraction approaches that employ filter banks. Feature extraction from small images has not yet received much attention, except for an interesting study in [20]. In particular, a disadvantage of such small images is that they do not contain contextual information that can be employed (as is done in [1]). Second, the extracted features should be robust to the presence of the artificial occlusions in the data. In our study, we focus on four feature extraction approaches: (1) nonlinear dimensionality reduction, (2) texton frequency histograms, (3) Latent Dirichlet Allocation, and (4) a local receptive field network. Our selection of features is motivated by the two complicating factors discussed above, and by the limitations (such as the linearity of the dimensionality reduction techniques) of the features that were employed in [16]. We discuss the four feature extraction approaches separately in subsections 3.1 to 3.4.

3.1 Nonlinear dimensionality reduction

Although PCA has been successfully applied to problems such as face recognition [21], coin classification [13], and pedestrian classification [16], its performance is limited by the assumption that the data lies on or near a linear manifold in the image space [19]. It is very likely that this assumption is violated by the data, and that the data in fact lies on or near a nonlinear manifold in the image space. (For instance, a dataset of images depicting a face under a large number of orientations between 0 and 360 degrees comprises a nonlinear manifold in the image space that is not isometric to Euclidean space.) In recent years, a large number of techniques have been proposed that are claimed to be able to identify nonlinear manifolds [12, 17, 19], most of which are reviewed in [23]. In our study, we focus on three techniques for nonlinear dimensionality reduction, viz., Isomap [19], LLE [17], and autoencoders [12].

Isomap is a technique that identifies the underlying manifold by fitting a neighborhood graph through the data, and computes a geodesic distance matrix by approximating the geodesic distance (i.e., the distance over the manifold) by the length of the shortest path between two points in the graph (e.g., using Dijkstra's algorithm). Subsequently, a low-dimensional feature representation is computed by performing multidimensional scaling on the geodesic distance matrix.

Similar to Isomap, LLE identifies the low-dimensional manifold by fitting a neighborhood graph through the data. LLE differs from Isomap in that it solely aims to preserve local properties of the manifold, whereas Isomap retains global geodesic distances. In particular, LLE retains the weights in the linear combination that reconstructs a datapoint from its nearest neighbors. The error function that is minimized in LLE is a convex function that can easily be optimized using spectral techniques.

Autoencoders are neural networks with an odd number of layers whose structure is illustrated in Figure 2. Autoencoders are trained to minimize the mean squared error between the input and the output of the network. We use a training procedure that learns the network weights layer-by-layer using a Restricted Boltzmann Machine (RBM) training procedure [11]. In this greedy procedure, each layer is trained on the activations of the previously trained layer. The training of autoencoders using an RBM training procedure is biologically plausible, in contrast to traditional neural network training procedures such as backpropagation. RBM training learns a generative model instead of a discriminative one [10]. The brain is generally believed to have generative capabilities, which are exhibited in, e.g., vivid imaginations and dreams, but also play an important role in visual processes [25]. Furthermore, the weight updates in the training procedure have a very local nature (similar to Hebbian learning) and do not require propagation of errors through the network. In addition, RBM training does not require labeled training instances. In order to further improve the performance of the autoencoders, we fine-tune the weights in a supervised way using backpropagation. The final features of an image are formed by the activations in the middle layer when the image is used as input to the trained autoencoder.

Figure 2: Schematic structure of an autoencoder.

The reader should note that these dimensionality reduction techniques are global techniques, and therefore, it is not possible to explicitly model the occlusions present in the data.
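To make the dimensionality reduction step concrete, the sketch below shows how 30-dimensional Isomap features could be extracted with scikit-learn. This is an illustrative reconstruction rather than the authors' code; the file names and the neighborhood size of 12 are assumptions.

```python
# Illustrative sketch (not the original implementation): 30-dimensional Isomap
# features from flattened 36x18 grayscale images using scikit-learn.
import numpy as np
from sklearn.manifold import Isomap

# Hypothetical files holding the (N, 36, 18) image arrays.
X_train = np.load("train_images.npy").reshape(-1, 36 * 18)
X_test = np.load("test_images.npy").reshape(-1, 36 * 18)

# n_neighbors=12 is an assumed value; the paper tunes such parameters by grid search.
isomap = Isomap(n_neighbors=12, n_components=30)
F_train = isomap.fit_transform(X_train)  # MDS embedding of geodesic distances
F_test = isomap.transform(X_test)        # out-of-sample extension for test images
```

LLE features could be obtained analogously via sklearn.manifold.LocallyLinearEmbedding.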

3.2 Texton frequency histograms

In recent years, approaches to texture classification that view texture as a probabilistic generator of small texture patches, called textons, have gained in popularity [24, 22]. We can view an image as a superposition of such textons, which may be considered the fundamental building blocks of texture [15]. We may thus extract features from images based on the intuition that an image is a composition of overlapping textons. In order to extract such features, we first need to construct a set of fundamental textons that underlie the images in our data. We do so by performing vector quantization (using k-means clustering) on a large set of small image patches that were randomly drawn from the dataset, in order to construct a texton codebook consisting of V textons. Subsequently, an image can be characterized by means of a texton frequency histogram, which measures the relative frequencies of the textons from the codebook in the image. The texton frequency histogram is computed by identifying the most similar texton from the codebook for every image patch in the image and incrementing an accumulator for that texton. Afterwards, the texton frequency histogram is normalized and used as a feature describing the image. The reader should note that the texton frequency histogram does not employ any spatial information, and is thereby invariant to, e.g., translations of an object. Furthermore, for partially occluded images, the artificial occlusions can simply be ignored in the computation of the texton frequency histograms.
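As an illustration of the procedure just described, the sketch below builds a texton codebook by k-means and computes normalized texton frequency histograms. The 4 × 4 patch size, the codebook size of 64, and the file name are assumptions, not values from the paper.

```python
# Sketch of texton frequency histogram extraction (assumed parameter values).
import numpy as np
from sklearn.cluster import KMeans

def sample_patches(images, patch=4, n=50000, rng=np.random.default_rng(0)):
    """Randomly draw n small patches from a stack of HxW images."""
    H, W = images.shape[1:]
    out = np.empty((n, patch * patch))
    for j in range(n):
        img = images[rng.integers(len(images))]
        r, c = rng.integers(H - patch + 1), rng.integers(W - patch + 1)
        out[j] = img[r:r + patch, c:c + patch].ravel()
    return out

def texton_histogram(image, codebook, patch=4):
    """Assign every overlapping patch to its nearest texton and return the
    normalized frequency histogram. Occluded patches could be skipped here."""
    H, W = image.shape
    hist = np.zeros(len(codebook))
    for r in range(H - patch + 1):
        for c in range(W - patch + 1):
            p = image[r:r + patch, c:c + patch].ravel()
            hist[np.argmin(((codebook - p) ** 2).sum(axis=1))] += 1
    return hist / hist.sum()

images = np.load("train_images.npy")  # hypothetical (N, 36, 18) array
codebook = KMeans(n_clusters=64, n_init=4).fit(sample_patches(images)).cluster_centers_
features = np.array([texton_histogram(im, codebook) for im in images])
```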

3.3 Latent Dirichlet Allocation

In the extraction of texton frequency histograms, an image is viewed as a sample from a simple generative model, whose graphical model is shown schematically in Figure 3(a). In the model, x indicates the distribution of D textons over an image, c indicates the texton distribution of one of the N images in the data, α indicates the prior distribution over the textons, and β represents the prior distribution for the C classes. The main drawback of such a simple generative model for images is that it neglects the presence of various themes in the image. For instance, a beach photograph is composed of a beach theme, a water theme, and a sky theme, each of which has its own specific texton distribution. In texton frequency histograms, we simply average over the underlying texton distributions, and thereby lose important information. This information can be modeled by a generative model called Latent Dirichlet Allocation (LDA) [2]. In [6], LDA has already been successfully applied to a scene recognition task.

The generative model underlying LDA is shown schematically in Figure 3(b). In the diagram, α represents a multinomial prior distribution over the K themes for each class c of the N images in the dataset. The variable θ indicates the belief distribution over the collection of multinomial theme mixture distributions and is therefore Dirichlet distributed. An image consists of D textons x, each of which is associated with a theme drawn from z, which is in turn a theme mixture drawn from θ once per image. Furthermore, the distribution over x depends on the prior β over the textons given the class c ∈ {1, . . . , C}.

Figure 3: Generative models: (a) the texton model, (b) the LDA model.

The key inferential problem that needs to be solved, both in training the model and in performing inference with it, is computing the posterior distribution over the hidden variables given an image

$$p(\theta, z \mid x, \alpha, \beta) = \frac{p(\theta, z, x \mid \alpha, \beta)}{p(x \mid \alpha, \beta)} \qquad (1)$$

which can only be computed by marginalizing over the hidden variables

$$p(x \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{d=1}^{D} \sum_{z_d} p(z_d \mid \theta)\, p(x_d \mid z_d, \beta) \right) d\theta \qquad (2)$$

$$= \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \int \left( \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \right) \left( \prod_{d=1}^{D} \sum_{k=1}^{K} \prod_{i=1}^{V} \left( \theta_k \beta_{ki} \right)^{x_d^i} \right) d\theta \qquad (3)$$

where i is the texton index and V is the number of textons in the codebook. This equation is not tractable due to the coupling between θ and β in the summation [4], but can be approximated using variational inference. Variational inference approximates the posterior over the hidden variables by minimizing the Kullback-Leibler divergence between some variational distribution q(θ, z|γ, φ) and the posterior over the hidden variables p(θ, z|x, α, β) [14]. Herein, the variational distribution q(θ, z|γ, φ) is the factorized distribution

$$q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{d=1}^{D} q(z_d \mid \phi_d) \qquad (4)$$

The details of the minimization of KL(q(θ, z|γ, φ) || p(θ, z|x, α, β)) fall outside the scope of this paper, but can be found in [2, 14]. The parameters (γ*(x), φ*(x)) that minimize the Kullback-Leibler divergence are specific to the image at hand, and can thus be used as its feature representation. In particular, we employ the Dirichlet parameters γ*(x) as features describing the original image [2], since they correspond to the estimate of the theme mixture in the image. As in the computation of texton frequency histograms, we simply ignore the artificial occlusions in the images when computing the features.
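A rough analogue of this feature extraction can be built with scikit-learn's LatentDirichletAllocation, which performs the same kind of variational inference but in the standard unsupervised (class-independent) form rather than the class-conditional model described above. The number of themes and the input file below are assumed.

```python
# Sketch: LDA theme-mixture features from texton counts (sklearn's standard,
# class-independent LDA, not the class-conditional variant of the paper).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# (n_images, V) matrix of unnormalized texton counts, i.e. the accumulators
# from the texton-histogram step before normalization (hypothetical file).
texton_counts = np.load("texton_counts.npy")

lda = LatentDirichletAllocation(n_components=10, max_iter=50, random_state=0)
lda.fit(texton_counts)

# transform() runs variational inference per image; the resulting theme-mixture
# estimates play the role of the gamma*(x) features described above.
features = lda.transform(texton_counts)
```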

3.4 Local receptive field network

Instead of manually constructing features (as in the approaches of the previous subsections), feedforward neural networks with local receptive fields can extract features that are tuned to the data, as in the neocognitron [8]. The main problem of such approaches is the large number of network weights that have to be learned, which very likely causes the learning algorithm to get stuck in local optima. In our approach, we use a network architecture that is loosely based on the structure of the neocognitron, but we train it using a new learning algorithm that is less likely to get stuck in local optima. The local receptive field (LRF) network we employ consists of three layers, and is illustrated in Figure 4. The first layer consists of local receptive fields of size 4 × 4 (hence 16 neurons) that have an overlap of 2 pixels. The second layer processes the outputs of the local receptive fields, and reduces their dimensionality from 16 to 10. In the third layer, the outputs of the second layer are combined into the 30 final outputs of the network.

Figure 4: Schematic structure of a neocognitron (not all connections are shown).

In contrast to the learning approach proposed in [8], we do not couple any weights during the training of the network. Instead, we train our local receptive field networks in an unsupervised way by training Restricted Boltzmann Machines (RBMs) layer-by-layer [11]. This learning algorithm is similar to the one we employed in the training of the autoencoders. The features describing an image are given by the outputs of the trained network when the image is used as input to the network. Because of the locality of the feature detectors in LRF networks, the network should be capable of modeling the presence of partial occlusions in its weights.
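The layer-wise RBM training used for both the autoencoder and the LRF network can be sketched as follows. This is a minimal illustration of CD-1 training on binary units; the learning rate, number of epochs, and layer sizes are chosen for illustration, not taken from the paper.

```python
# Minimal numpy sketch of greedy layer-wise RBM training with CD-1.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, n_hidden, lr=0.05, epochs=10, batch=100):
    """Train a binary RBM on data V (n_samples, n_visible) with one step of
    contrastive divergence per mini-batch; returns weights and biases."""
    n_vis = V.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    a, b = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        for s in range(0, len(V), batch):
            v0 = V[s:s + batch]
            h0 = sigmoid(v0 @ W + b)                # positive phase
            hs = (rng.random(h0.shape) < h0) * 1.0  # sample hidden states
            v1 = sigmoid(hs @ W.T + a)              # reconstruction
            h1 = sigmoid(v1 @ W + b)                # negative phase
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            a += lr * (v0 - v1).mean(axis=0)
            b += lr * (h0 - h1).mean(axis=0)
    return W, b

# Greedy stacking: each layer is trained on the activations of the previous one.
X = np.load("train_images.npy").reshape(-1, 36 * 18) / 255.0  # assumed [0, 1] scaling
acts, layers = X, []
for n_hidden in (200, 100, 30):  # assumed layer sizes
    W, b = train_rbm(acts, n_hidden)
    layers.append((W, b))
    acts = sigmoid(acts @ W + b)  # features for the next layer (or final output)
```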

4 Experiments

In order to evaluate the performance of the proposed features for pedestrian classification, we performed experiments on the NiSIS competition dataset described in Section 2. In subsection 4.1, we describe the setup of these experiments. The results of our experiments are presented in subsection 4.2.

4.1 Experimental setup

The NiSIS competition dataset is already divided into a training set and a test set, making our experimental setup rather straightforward. We extract features from both datasets and subsequently train a classifier on the training set. The generalization performance of the trained classifier is evaluated on the test set. We selected two classifiers for our experiments: (1) the 1-nearest neighbor classifier and (2) a Support Vector Machine (SVM) using a Gaussian kernel. The SVM was trained iteratively using the technique described in [5]. The main parameters of the employed techniques (such as the target dimensionality in the dimensionality reduction, the patch size in the texton and LDA models, and the variance in the Gaussian kernel of the SVM) were optimized by means of an exhaustive grid search.
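The evaluation protocol can be summarized in a few lines of scikit-learn; note that the batch SVC below stands in for the incremental SVM of [5], and the parameter grids and file names are assumptions.

```python
# Sketch of the evaluation: 1-NN and an RBF-kernel SVM with a grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical files with precomputed feature matrices and labels.
F_train, y_train = np.load("F_train.npy"), np.load("y_train.npy")
F_test, y_test = np.load("F_test.npy"), np.load("y_test.npy")

knn = KNeighborsClassifier(n_neighbors=1).fit(F_train, y_train)
print("1-NN generalization error:", 1.0 - knn.score(F_test, y_test))

# Grid search over the Gaussian kernel width (and C); grids are assumed.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"gamma": np.logspace(-4, 1, 6), "C": [1, 10, 100]}, cv=5)
grid.fit(F_train, y_train)
print("SVM generalization error:", 1.0 - grid.score(F_test, y_test))
```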

4.2 Results

In Table 1, we present the generalization errors of classifiers that were trained using the features we proposed in Section 3. The best-performing feature for each classifier is marked with an asterisk.

5 Discussion

Two aspects of the presented experiments deserve closer discussion. First, we discuss the characteristics of the pedestrian classification task (subsection 5.1). Second, we discuss the results of our experiments (subsection 5.2).


Features                     1-NN      SVM
Intensity values             0.1069    0.0776
PCA (30D)                    0.1049    0.0641*
Isomap (30D)                 0.1645    0.1396
LLE (30D)                    0.2527    0.2208
Autoencoders (30D)           0.1669    0.1212
Texton freq. histograms      0.1188    0.0820
Latent Dirichlet features    0.1408    0.0804
LRF features                 0.1032*   0.0845

Table 1: Generalization errors of classifiers trained using various features; the best result per classifier is marked with an asterisk.

5.1 Pedestrian classification task

The NiSIS competition has two main goals: (1) to improve upon state-of-the-art techniques for pedestrian classification and (2) to stimulate the application of biologically inspired computer vision methods. In light of the first goal, we think that there are three aspects of the pedestrian classification task that deserve attention for future competitions.

First, the dataset for the competition does not completely represent the final task a pedestrian detector has to perform. This final task consists of determining the presence and location of pedestrians in (low-resolution) images. The intuition behind the current dataset is that a traditional object detector is applied to this final task. A traditional object detector slides a window over the image and determines for each window whether an object is present or not. Such a detector will encounter image patches that contain only a part of a pedestrian. Such patches do not occur in the NiSIS dataset, since all pedestrians are located in the center of the images. Our suggestion is to include the entire images to which an object detector will be applied. Performing pedestrian classification on such a dataset also facilitates achieving the second goal, since many biologically plausible computer vision methods put forward in the last decade are based on the use of contextual information [3].

Second, the occlusions that play a major role in the pedestrian classification task leave room for improvement. In the dataset, occlusions always occur in the bottom part of the image and have a constant intensity value (i = 205, i ∈ [0, 255]). We presume that the occlusions are meant to model a camera that scans the image line by line and is not always fast enough to render the entire image. This type of occlusion is not the only type that a real-life pedestrian detector will encounter: in the real world, persons may be partially occluded by objects in the world, such as cars parked on the side of the road. More importantly, in the dataset, only pedestrian images contain occlusions. We surmise that this bias is an important reason for the strong performance of PCA, since it conflicts with the results in other studies. In particular, Munder and Gavrila [16] present results on a pedestrian classification task (without occlusions) in which PCA performs relatively poorly.

Third, the pedestrian classification task exhibits a ceiling effect. A straightforward method, such as a 1-nearest neighbor classifier trained on the raw intensity values, already achieves an error of 10.69% (see Table 1). The best method in our study has an error rate that is only 4.28 percentage points lower (6.41%), indicating the presence of a ceiling effect that makes it difficult to compare the various methods. The changes to the dataset that we suggested above may help to alleviate this problem.

5.2 Results of the experiments

Table 1 shows that the combination of PCA and an incremental SVM leads to the best results on the pedestrian classification task. This result is surprising for two main reasons.

First, the use of PCA leads to better results than the use of nonlinear dimensionality reduction techniques such as Isomap and LLE. This is surprising, since PCA (in contrast to the other techniques) makes the strong assumption that the data lies on or near a linear manifold. In a recent study [23], we argue that the inferior results of the nonlinear dimensionality reduction techniques are due to fundamental weaknesses in the locality of learning. The results of the autoencoders may be improved by further fine-tuning of the training procedure.

Second, our results show that holistic methods (such as PCA, Isomap, LLE, and autoencoders) are not outperformed by patch-based methods (such as texton frequency histograms, LDA features, and LRF features), despite the presence of occlusions. One reason for this may be that the occlusions occur only in pedestrian images, making the classification of these images easy for holistic methods. Another reason may be that the patch-based methods leave room for improvement. For example, the texton frequency histograms and the LDA features do not employ spatial information in the image. Constellation-based methods that take the spatial relations between image patches into account [7] may be more suitable for the task at hand. On the other hand, PCA also ignores most spatial context, a problem that might be addressed by the application of 2D PCA [26].

6 Conclusions

In this paper, we presented a study in which we proposed a number of features for the identification of pedestrians in images, motivated by weaknesses in the features discussed in earlier studies [16]. In particular, we performed experiments on the NiSIS competition dataset with four types of features. Our results show that an SVM trained on PCA features yields the best results, despite the weaknesses of the PCA technique. Most likely, this result is due to a bias in the artificial occlusions in the NiSIS dataset. In future work, we plan to perform experiments on a dataset in which pedestrians can be observed in their real-world context, which would allow for (biologically plausible) methods that employ context [1, 3]. Furthermore, we aim at incorporating more spatial information into our patch-based approaches (i.e., texton frequency histograms and LDA features). The results with PCA features could be further improved by variants of PCA that retain spatial structure in their embeddings [26].


References

[1] N.H. Bergboer, E.O. Postma, and H.J. van den Herik. Context-based object detection in still images. Image and Vision Computing, 24(9):987–1000, 2006.

[2] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[3] G. de Croon and E.O. Postma. Active object detection. In Proceedings of the 2nd International Conference on Computer Vision Theory and Applications (VISAPP), pages 97–103, 2007.

[4] J. Dickey. Multiple hypergeometric functions: Probabilistic interpretations and statistical uses. Journal of the American Statistical Association, 78:628–637, 1983.

[5] C.P. Diehl and G. Cauwenberghs. SVM incremental learning, adaptation and optimization. In Proceedings of the International Joint Conference on Neural Networks, volume 4, 2003.

[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 524–531, 2005.

[7] R. Fergus, P. Perona, and A. Zisserman. Weakly supervised scale-invariant learning of models for visual recognition. International Journal of Computer Vision, 71:273–303, 2007.

[8] K. Fukushima, S. Miyake, and T. Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 13, 1983.

[9] D.M. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. International Journal of Computer Vision, 73(1):41–59, 2005.

[10] G.E. Hinton. Learning multiple layers of representation. Trends in Cognitive Sciences, 11(10):428–434, 2007.

[11] G.E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[12] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[13] R. Huber, H. Ramoser, K. Mayer, H. Penz, and M. Rubik. Classification of coins using an eigenspace approach. Pattern Recognition Letters, 26(1):61–75, 2005.

[14] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

[15] B. Julesz. Textons, the elements of texture perception, and their interactions. Nature, 290:91–97, 1981.

[16] S. Munder and D.M. Gavrila. An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):??–??, 2006.

[17] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.

[18] G. Scotti, A. Cuocolo, C. Coelho, and L. Marchesotti. A novel pedestrian classification algorithm for a high definition dual camera 360 degrees surveillance system. In Proceedings of the IEEE International Conference on Image Processing, volume 3, page 880, 2005.

[19] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[20] A. Torralba, R. Fergus, and W.T. Freeman. Tiny images. Technical Report TR2007-024, MIT-CSAIL, 2007.

[21] M.A. Turk and A.P. Pentland. Face recognition using eigenfaces. In Proceedings of the Computer Vision and Pattern Recognition 1991, pages 586–591, 1991.

[22] L.J.P. van der Maaten and E.O. Postma. Texton-based texture classification. In Proceedings of the Belgium-Netherlands Artificial Intelligence Conference (in press), 2007.

[23] L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik. Dimensionality reduction: A comparative review. Preprint, 2007.

[24] M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 691–698, 2003.

[25] H.L.F. von Helmholtz. Handbuch der Physiologischen Optik. Leopold Voss, 1867.

[26] H. Yu and M. Bennamoun. 1D-PCA, 2D-PCA to nD-PCA. In Proceedings of the 18th International Conference on Pattern Recognition, volume 4, pages 181–184, 2006.
