Pyramidal Statistics of Oriented Filtering for Robust Pedestrian Detection

Min Li, Zhaoxiang Zhang, Kaiqi Huang, Tieniu Tan
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
{mli, zxzhang, kqhuang, tnt}@nlpr.ia.ac.cn

Abstract

We study the problem of robust pedestrian detection. A new descriptor, Pyramidal Statistics of Oriented Filtering (PSOF), is proposed for shape representation. Unlike one-scale gradient-based methods, the PSOF descriptor constructs an image pyramid and uses a Gabor filter bank to obtain multi-scale pixel-level orientation information. Locally normalized pyramidal statistics of these Gabor responses are then used to represent object shape. After feature extraction, the AdaBoost training algorithm is adopted to train a classifier for the final pedestrian detector. We show experimentally that the PSOF descriptor is much more robust to image blur and noise than the HOG (Histograms of Oriented Gradients) descriptor, while matching HOG's excellent detection performance under normal imaging conditions. We also study the influence of various parameter settings, concluding that multi-scale information and statistic combination are two important factors in the robustness of the PSOF descriptor.
Image blur is a much overlooked yet important property, since it makes privacy-protecting surveillance possible. As more and more surveillance cameras are deployed in daily life, people are increasingly concerned about their privacy. As shown in the blurred image of Figure 1 (b), privacy-related details, such as people’s appearance, what they are wearing, and whom they are with, cannot be identified easily. Moreover, image blur can also be used to smooth noise under bad imaging conditions.
1. Introduction
Pedestrian detection has been a popular topic in the computer vision community during the last ten years [15, 17, 7, 11, 24, 4, 9, 25, 18], because humans are one of the most important types of targets in many applications, such as visual surveillance, intelligent transportation systems and robotics. However, pedestrian detection is still an open problem due to the many challenges in real applications, such as the wide range of human poses, partial occlusion, image blur, and low contrast with much noise. This paper aims to propose a full-body detection method that not only has excellent detection performance under good imaging conditions, as shown in Figure 1 (a), but also works well under bad imaging conditions, such as blur and low contrast with much noise, as shown in Figure 1 (b) and (c). The challenge of partial occlusion is beyond the scope of this paper, since it is usually addressed by part-based detection methods [11, 25].
Figure 1. Our method’s detections in images under different conditions. (a) normal image (b) blurred image (c) low-contrast image with much noise.

The challenge of low contrast with strong noise often arises in nighttime visual surveillance because of poor illumination. As shown in Figure 1 (c), human shapes in night surveillance images are often not very clear and there is also much noise around them. This causes much trouble for pedestrian detection algorithms, because shape is often the most important cue for pedestrian detection in surveillance scenes. As we know, features based on gradients and edges are often used for shape representation. However, they are often sensitive to image noise and greatly affected by image blur.

To avoid the disadvantages of gradient/edge features, we propose a new kind of shape descriptor, Pyramidal Statistics of Oriented Filtering (PSOF), to describe human shape. Similar to the S1 layer in the BIM (Biologically Inspired Model) [20, 14], we construct a multi-layer image pyramid and convolve it with a Gabor filter bank to obtain multi-scale pixel-level orientation information. Then, each pair of adjacent layers of the resulting pyramid is evenly divided into a number of sub-pyramids, in which lp-norm-like statistics for every orientation are calculated. These scale-fused pyramidal statistics possess certain robustness to image blur and noise, as well as a strong ability to describe shape. To make the final descriptor robust to illumination variability, these pyramidal statistics are all locally normalized. After feature extraction, AdaBoost is used to train a classifier for pedestrian detection. We show experimentally that the PSOF descriptor is much more robust to image blur and noise than one of the state-of-the-art shape descriptors, the HOG (Histograms of Oriented Gradients) descriptor [4], while matching HOG's excellent detection performance under normal imaging conditions. We also study the influence of various parameter settings on the PSOF descriptor, concluding that multi-scale information and statistic combination are two important factors in its robustness.

The remainder of this paper is organized as follows. After reviewing previous work in Section 2, we give a detailed description of the computation of the PSOF descriptor in Section 3. The details of how to train a pedestrian classifier and the postprocessing of detections are given in Section 4. Experimental results are presented in Section 5 and we conclude in Section 6.
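As a minimal sketch of the pyramid construction described above (the function name and the use of SciPy's `zoom` for resampling are our assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.ndimage import zoom

def build_pyramid(image, n_scales=3):
    """Build an image pyramid in which each higher scale is a factor
    of sqrt(2) smaller than the adjacent lower one."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(n_scales - 1):
        pyramid.append(zoom(pyramid[-1], 1 / np.sqrt(2), order=1))
    return pyramid
```

For a 64 × 128 pedestrian window and N = 3 this yields layers of roughly 128 × 64, 91 × 45 and 64 × 32 pixels.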
2. Previous Work

Due to its importance and challenges, much work has been done on pedestrian detection. Papageorgiou et al. [17] propose a Haar-wavelet-based detector for pedestrian detection, with a parts-based variant in [12]. One of the most important advantages of Haar-like features is that they can be computed very fast with the integral image technique [23]. However, their performance is poor in real surveillance scenes [24]. To improve the performance of Haar-feature-based detectors, Viola et al. [24] integrate intensity information with motion information (both calculated in a Haar-wavelet way) in their detection system, which greatly reduces false alarms in real surveillance videos. Relative to these wavelet-based feature sets [12, 24], the HOG (Histograms of Oriented Gradients) descriptor proposed by Dalal et al. [4] provides much better pedestrian detection performance in complex scenes with a wide range of poses. Zhu et al. [26] extend the HOG descriptor and use a cascade classifier structure to increase detection speed. Leibe et al. [9] address the problem of human detection in crowded scenes: they use a learned ISM (Implicit Shape Model) [8] to generate object hypotheses and refine them by top-down segmentation to detect humans. An extension of the ISM approach to pedestrian detection in crowded image sequences can be seen in [19]. In [25], a set of edgelet (short segments of lines or curves) features is proposed to train part detectors, and pedestrians are detected in a joint-likelihood way. Sabzmeydani and Mori [18] propose a set of shapelet features (mid-level features) for pedestrian detection, generated from low-level gradient information by AdaBoost. In [22], a pedestrian detection method based on the covariance matrix descriptor [21] is proposed and shows better performance on the INRIA dataset [4] than the HOG descriptor, but an experimental study by Paisitkriangkrai et al. [16] shows that the covariance matrix descriptor is slightly inferior to the HOG descriptor on the DaimlerChrysler pedestrian benchmark dataset created in [13].
3. Pyramidal Statistics of Oriented Filtering

The PSOF (Pyramidal Statistics of Oriented Filtering) feature set is computed hierarchically in four layers: L1 (image layer), L2 (Gabor layer), L3 (statistical layer) and L4 (local normalization layer). The L1 and L2 layers are similar to the S1 layer in the BIM (Biologically Inspired Model) [20, 14]: both our method and the BIM approach use a Gabor filter bank to obtain basic pixel-level orientation information. The difference is that in the subsequent steps, we use locally normalized orientation statistics as local descriptors to represent object shape, while the BIM approach computes more complex high-level vision information based on the "bag of features" method. Figure 2 shows the flowchart of the computation of the PSOF descriptor. Details are given as follows.

L1 (Image layer): The input image is resized to 64 × 128 (the size of the pedestrian samples in the INRIA dataset [4], on which our main experiments are conducted) and an image pyramid of N scales is created, in which each higher scale is a factor of √2 smaller than the adjacent lower one. For each pair of adjacent scales (N − 1 combinations in total), go to the following layers.

L2 (Gabor layer): A Gabor filter bank is used to convolve the two-scale image pyramid from the L1 layer at each position and every scale (for RGB images, the maximum response of the three channels at every pixel is kept). Gabor filters are widely used in object recognition [10] because of their excellent orientation and spatial-frequency selectivity. A Gabor filter is the product of an elliptical Gaussian envelope and a harmonic function:

G(x, y) = exp(−(X² + γ²Y²) / (2σ²)) cos(2πX / λ)    (1)

where X = x cos θ + y sin θ, Y = −x sin θ + y cos θ, and θ controls the orientation of the filter. Our filter bank consists of four Gabor filters with 4 orientations: 0, π/4, π/2 and 3π/4. Each filter is 5 × 5 in size, with x and y varying between −2 and 2. The parameters γ (aspect ratio), σ (effective width) and λ (wavelength) are set to 1.0, 3.0 and 5.0 respectively. Note that the components of each filter are normalized so that their mean is 0 and variance is 1. The response R of an image I to a Gabor filter G is given by:

R = |G ∗ I|    (2)

L3 (Pyramidal statistics layer): To obtain the L3 layer, evenly divide the L2 layer into a set of sub-pyramids with a bottom size of 8 × 8 and a top size of 5 × 5 (a factor of about √2 smaller than the bottom); in each sub-pyramid, some lp-norm-like statistics are computed for each orientation. The reason why we choose 8 × 8 as the bottom size of the sub-pyramids (note that the limb width is about 6-8 pixels) is that this is the optimal "cell" size in the HOG descriptor [4]. Both our sub-pyramids and the "cells" in the HOG descriptor are basic local regions in which orientation statistics are calculated; the only difference is that the statistics used are totally different. So some size parameters related to local regions in the PSOF descriptor, including the size of the region used for local normalization in the following L4 layer, can also be set to the corresponding optimal parameters of the HOG descriptor. The basic lp-norm-like statistics are defined as follows:

Sp = ((1/n) Σᵢ xᵢᵖ)^(1/p)    (3)

where n is the total number of units in a sub-pyramid, xᵢ is the value of a specified orientation of a unit in the sub-pyramid, and p ∈ {1, 2, ..., ∞}. Note that there is such a statistic for every orientation, so it describes a kind of local statistical information at a specific orientation. The lp-norm-like statistics have at least two advantages. First, they can be computed very fast with the integral image technique. Moreover, some common statistics, like mean, standard deviation (called sigma here) and max, can be easily computed from the lp-norm-like statistics:

• mean: Sm = S1;
• standard deviation: Ss = √(S2² − S1²);
• max: Sx = S∞.

Here, the lp-norm-like statistics also generally include their variants, like the three common statistics mentioned above. These lp-norm-like statistics have a strong ability to describe orientation information, because there are many possible combinations of them that describe the magnitude and variance of oriented responses from many different aspects. Furthermore, these statistics are computed in a scale-fused way, and thus possess certain robustness to image blur and noise.

Figure 2. The flowchart of the computation of Pyramidal Statistics of Oriented Filtering.

L4 (Local normalization layer): The L4 layer is the locally normalized version of the L3 layer. Divide the L3 layer into a number of overlapping blocks which are 3 × 3 units in size and overlap by one unit between two adjacent blocks, and then use the L2-norm to normalize the feature set f in each block: f → f / √(Σᵢ f(i)² + ε), where ε is a small real number (we set it to 1.0) and m is the total number of features in a block. Overlapping local normalization makes the resulting feature set robust to local illumination variability. The final PSOF feature set is the concatenation of the N − 1 L4 layers, containing statistical orientation information at different scales.
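Equations (1)-(3) can be sketched compactly in NumPy/SciPy (function names, grayscale input and `same`-mode convolution are our assumptions):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(theta, size=5, gamma=1.0, sigma=3.0, lam=5.0):
    """Eq. (1): elliptical Gaussian envelope times a cosine carrier,
    normalized to zero mean and unit variance."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    X = x * np.cos(theta) + y * np.sin(theta)
    Y = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(X ** 2 + gamma ** 2 * Y ** 2) / (2 * sigma ** 2)) \
        * np.cos(2 * np.pi * X / lam)
    return (g - g.mean()) / g.std()

def gabor_responses(image):
    """Eq. (2): R = |G * I| for the orientations 0, pi/4, pi/2, 3pi/4."""
    thetas = [k * np.pi / 4 for k in range(4)]
    return [np.abs(convolve2d(image, gabor_filter(t), mode='same'))
            for t in thetas]

def lp_statistic(x, p):
    """Eq. (3): S_p = ((1/n) * sum_i x_i^p)^(1/p); p = inf gives the max."""
    x = np.asarray(x, dtype=float)
    if np.isinf(p):
        return x.max()
    return np.mean(x ** p) ** (1.0 / p)

def common_statistics(x):
    """mean, sigma and max derived from the l_p-norm-like statistics."""
    s1, s2 = lp_statistic(x, 1), lp_statistic(x, 2)
    return {"mean": s1,
            "sigma": np.sqrt(max(s2 ** 2 - s1 ** 2, 0.0)),
            "max": lp_statistic(x, np.inf)}
```

Note how sigma falls out of S1 and S2 alone, which is what makes the integral-image trick applicable to all three statistics.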
4. AdaBoost Training and Postprocessing of Detections
AdaBoost Training: Considering that the dimension of the PSOF feature set may be up to several thousand and the number of training samples is very large (more than 10,000), we use AdaBoost [5] to learn the classification function, because AdaBoost is an effective and efficient learning algorithm for training on large, high-dimensional datasets. Compared with other statistical learning approaches (e.g., SVM), which try to learn a single powerful discriminant function from all the specified features extracted from training samples, the AdaBoost algorithm combines a collection of simple weak classifiers on a small set of critical features to form a strong classifier using a weighted majority vote. This means the AdaBoost classifier can work very fast in the testing stage. Furthermore, AdaBoost is not prone to overfitting and provides strong bounds on generalization, which guarantees performance comparable to SVM. In the AdaBoost learning procedure, classification and regression trees (CARTs) [2] are adopted as the weak learners.

As we know, gathering a representative set of negative samples is very difficult. To overcome the problem of defining this extremely large negative class, bootstrapping training is adopted: a preliminary classifier is trained on an initial training set and then used to predict the class categories of a large set of patches randomly sampled from many pedestrian-free images. False alarms are collected and added to the negative training set for the next iteration of training.
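The bootstrapping loop can be sketched as follows, with scikit-learn's AdaBoost over shallow decision trees standing in for the OpenCV CART-based training used in the paper (the data layout and the number of retraining rounds are illustrative):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_with_bootstrapping(X_pos, X_neg_init, negative_pool, rounds=2):
    """Train, mine false positives from a pedestrian-free pool,
    fold them into the negative set, and retrain."""
    X_neg = X_neg_init
    clf = None
    for _ in range(rounds):
        X = np.vstack([X_pos, X_neg])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
        clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                                 n_estimators=50)
        clf.fit(X, y)
        # every pool sample predicted positive is a hard negative
        false_alarms = negative_pool[clf.predict(negative_pool) == 1]
        if len(false_alarms) == 0:
            break
        X_neg = np.vstack([X_neg, false_alarms])
    return clf
```

Each round augments the negative set only with windows the current classifier actually gets wrong, which is what keeps the negative class tractable.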
Figure 3. Our method’s detections in an image. (a) before postprocessing. (b) after postprocessing.

Postprocessing of Detections: For pedestrian detection, we scan the trained classifier over all sliding windows at all possible scales in the input image. As a result, as shown in Figure 3 (a), there will be many detections around each target. To obtain the number of targets and the exact location of each from these detection windows, postprocessing is necessary. First, the quadratic algorithm of [3] (see the chapter on data structures for disjoint sets) is used to split all the detection windows into several subsets. Then, in each subset, the average window position and size are calculated. After postprocessing, as shown in Figure 3 (b), we obtain the exact number of pedestrians and more accurate target locations than the raw detection windows provide.
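The grouping step can be sketched with a small disjoint-set (union-find) structure; note that the center-distance overlap criterion below is our assumption — the paper only specifies splitting the windows into subsets and averaging each subset:

```python
import numpy as np

def overlap(a, b):
    """Treat two windows (x, y, w, h) as the same target if their
    centers are closer than half the average window extent."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    thresh = (a[2] + b[2] + a[3] + b[3]) / 8
    return abs(ax - bx) < thresh and abs(ay - by) < thresh

def merge_detections(windows):
    """Split detection windows into subsets with union-find, then
    return the average window of each subset."""
    parent = list(range(len(windows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(windows)):
        for j in range(i + 1, len(windows)):
            if overlap(windows[i], windows[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, w in enumerate(windows):
        groups.setdefault(find(i), []).append(w)
    return [np.mean(g, axis=0) for g in groups.values()]
```

The pairwise pass over all windows is what makes the algorithm quadratic, matching the description above.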
5. Experimental Results

In this section, we compare the performance of our PSOF-Boosting detector with that of the HOG-SVM detector [4], a state-of-the-art pedestrian detector, on the INRIA dataset [4] and two poor-quality variants of this dataset. Then, we make a detailed study of the effects of different parameter settings on the detection performance of the PSOF descriptor.
5.1 The Datasets and Methodology
Datasets: The INRIA dataset is a representative and challenging dataset for pedestrian detection and is widely used [26, 18, 22]. It consists of 1239 pedestrian images (2478 with their left-right reflections) and 1218 person-free images for training. The test set contains 566 pedestrian examples (566 × 2 = 1132 with reflections) and 453 person-free images. The pedestrian image size is 64 × 128, while the sizes of the person-free images vary largely, from 320 × 240 to 640 × 480. To test the ability of our descriptor to resist image blur and noise, we construct another two test sets based on the INRIA test set: a blurred test set, obtained by convolving the INRIA test set with a rotationally symmetric Gaussian lowpass filter of size 10 × 10 with standard deviation 10, and a noised test set, obtained by adding zero-mean Gaussian white noise with standard deviation 0.07 to the INRIA test set. Figure 4 shows some typical samples from the normal INRIA test set and the two newly constructed test sets.

Figure 4. Typical pedestrian samples from three test sets: the normal INRIA test set (Row 1), the blurred one (Row 2), and the noised one (Row 3).

Methodology: The aim of this study is to evaluate both the effectiveness and the robustness of the proposed PSOF descriptor in pedestrian detection, so each detector is tested on the normal INRIA test set and on its two poor-quality variants (the blurred one and the noised one), but trained only on the normal INRIA training set. In the bootstrapping training process, a fixed set of 9,000 patches sampled randomly from the 1218 person-free training images provides the initial negative set. For each detector, a preliminary classifier is trained on the initial training set and the 1218 negative training images are searched for false positives (hard examples). Then a new classifier is re-trained using this augmented negative set (initial 9,000 + hard examples) to produce the final detector. To quantify detector performance, as in [4], we plot Detection Error Tradeoff (DET) curves, i.e. miss rate (1 − Recall) versus false positives per window (FPPW); lower values are better. DET plots present the same information as Receiver Operating Characteristic (ROC) curves, but allow small probabilities to be distinguished more easily. For convenience, we will often use the miss rate at 10⁻⁴ FPPW as a reference point for comparisons, unless otherwise specified.
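The two degraded test sets can be reproduced along these lines (the kernel construction mirrors MATLAB's `fspecial('gaussian')`; pixel values in [0, 1] are assumed):

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(size=10, sigma=10.0):
    """Rotationally symmetric Gaussian lowpass filter, normalized to sum 1."""
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur_image(image):
    """The blurred test set: 10 x 10 Gaussian lowpass with sigma = 10."""
    return convolve2d(image, gaussian_kernel(), mode='same', boundary='symm')

def add_noise(image, sigma=0.07, seed=None):
    """The noised test set: zero-mean Gaussian white noise, sigma = 0.07."""
    rng = np.random.default_rng(seed)
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)
```

With σ = 10 on a 10 × 10 support, the kernel is nearly flat, so this blur acts close to a box average — a fairly aggressive degradation.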
5.2 Comparison with HOG
The HOG descriptor [4] uses locally normalized histograms of gradient orientation to represent an object's local structure, and shows excellent performance in pedestrian detection. The details of the computation of this descriptor can be seen in [4]. The parameter setting for the PSOF descriptor is as follows: the number of scales N in the L1 layer is set to 3, and two statistics, mean and standard deviation (sigma), are chosen during pyramidal statistical filtering in the L3 layer. As studied in the next subsection, this is the optimal parameter setting for our PSOF descriptor. Other parameters are set as described in Section 3. The dimension of the final descriptor is 3112.
Figure 5. Performance comparison of the HOG-SVM classifier and the PSOF-Boosting classifier on three test sets: the normal INRIA test set, the blurred one and the noised one.

Figure 5 shows the DET curves of our PSOF-Boosting classifier and the HOG-SVM classifier on the three prepared test sets. As we can see, on the normal test set, our classifier is just slightly better than the HOG-SVM classifier, with a miss rate about 3.5% lower. However, on the poor-quality test sets, our classifier greatly outperforms the HOG-SVM classifier. From normal data to blurred data, the miss rate of the HOG-SVM classifier increases from about 11% to 56%, while ours increases only from about 7.5% to about 15%;
and on the noised test set, the miss rate of the HOG-SVM classifier increases drastically to 87%, while ours increases only to 30%. These results suggest that the HOG descriptor is very sensitive to image blur and noise, mainly because gradients are very sensitive to noise and often shift in blurred images. The success of the PSOF descriptor lies in its avoidance of gradients for estimating pixel-level orientation information, and in its use of multi-scale information. Relative to the simple 1-D point-derivative method used to estimate gradients in the HOG descriptor (see [4] for details), using a Gabor filter bank to extract orientation information, as in the PSOF descriptor, is more robust. Moreover, as studied in the following experiments, the negative effect of noise or blur can be decreased by using higher image scales and pyramidal statistical filtering. Figure 7 shows some detection results of the PSOF-Boosting detector in challenging images from real surveillance scenes and daily-life photos. As we can see, there are extremely few false alarms and missed detections in both the normal images and the poor-quality images (blurred images, or low-contrast images with much noise).
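The miss rate / FPPW numbers above are operating points on DET curves; a minimal sketch of how such a curve is computed from raw classifier scores (the scoring interface is our assumption):

```python
import numpy as np

def det_curve(scores_pos, scores_neg, n_windows):
    """Miss rate vs. FPPW over all score thresholds.

    scores_pos: classifier scores of the annotated pedestrian windows
    scores_neg: scores of the false-alarm candidates among the
                n_windows scanned background windows
    """
    thresholds = np.sort(np.concatenate([scores_pos, scores_neg]))
    miss = np.array([(scores_pos < t).mean() for t in thresholds])
    fppw = np.array([(scores_neg >= t).sum() / n_windows for t in thresholds])
    return miss, fppw
```

Sweeping the threshold trades missed pedestrians against false positives per window, which is exactly the axis pair of the DET plots.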
5.3 Effect of Parameter Setting
There are two important factors that may greatly influence the PSOF descriptor. One is the number of scales N in the L1 layer, and the other is which statistics are used in the L3 layer. Here, we study in detail how these parameter settings influence detector performance. To study the effect of the number of image scales on classification performance, we vary N from 1 to 4 and choose three fixed statistics: mean, sigma and max. Other parameters are set as described in Section 3. Note that when N = 1, we do not have to construct an image pyramid, and the pyramidal statistical filtering in the L3 layer degenerates to rectangular statistical filtering. Figure 6 (a)-(c) shows the detector performance for different numbers of scales on the three prepared test sets. As shown in Figure 6 (a), on the normal test set, the classification performance hardly changes as N increases. This means that under normal imaging conditions, additional higher image scales bring little benefit to detector performance, and that three statistics of four-orientation Gabor responses at one image scale capture enough orientation information for pedestrian detection. Interesting things happen on the blurred test set. As shown in Figure 6 (b), N = 1 performs the worst (miss rate of 20%) and N = 2 performs the best (miss rate of 15%). Thus, for blurred images, additional image scales can improve detector performance, but more scales are not always better. However, on the noised test set, more image scales are definitely important. As shown in Figure 6 (c), the miss rate decreases monotonically as N increases. From N = 1 to N = 4, the miss rate decreases from 62% to 33% (nearly one half); the second best, N = 3, has a miss rate of 37%. Therefore, additional
Figure 6. Performance comparison of PSOF descriptors under different parameter settings and on different test sets. The first, second and third columns correspond to the normal INRIA test set, the blurred one and the noised one respectively; the first, second and third rows correspond to the effects of the number of scales, each single statistic, and statistic combinations respectively.
Figure 7. Our method’s detections in three kinds of images: normal (Row 1), blurred (Row 2), and low-contrast with much noise (Row 3).
higher image scales play the most important role in improving PSOF's ability to resist image noise. There are several reasons for this result. First, noise is reduced in higher-scale space; second, as scale increases, local orientation statistics are calculated and normalized over larger and larger corresponding original-image regions, which significantly decreases the negative effects of local operations; last but not least, pyramidal filtering acts as a denoising method because it fuses orientation information across adjacent scales. Since N = 3 has performance comparable to N = 4 and requires less computation, we choose N = 3 in the following experiments.

To study the effects of different choices of statistics on detector performance, we divide statistics into two classes: the "single statistic" class and the "statistic combination" class. Each element of the first class is one single statistic from the set {mean, S2, S3, sigma, max}; each element of the second class is a two- or three-element combination from the set {mean, sigma, max}. Figure 6 (d)-(f) shows the classification performance of each single statistic on the three prepared test sets. As shown in Figure 6 (d), on the normal test set, sigma performs the best (miss rate of about 6.3%), max performs the worst (miss rate of about 10%), and mean, S2 and S3 have nearly equal performance (miss rate of about 8%). On the blurred test set, as shown in Figure 6 (e), S3 performs the best (miss rate of 17%), mean performs the worst (miss rate of 24%), and the others have similar performance. On the noised test set, as shown in Figure 6 (f), the five statistics have nearly equal performance, but relatively speaking, max performs the worst (miss rate of about 48%) and sigma and S2 perform better (miss rate of about 44%). Therefore, taking all results into consideration, max is the worst statistic in the "single statistic" class, and sigma is the best.
Figure 6 (g)-(i) shows the classification performance of statistic combinations on the three prepared test sets. On the normal test set, as shown in Figure 6 (g), the "mean+sigma" combination performs the best (miss rate of 7.5%), while the "mean+max" combination gives the worst result (miss rate of 9%). On the blurred test set, shown in Figure 6 (h), the "mean+max" combination also performs the worst (miss rate of about 21%), while the others have similar performance (miss rate of about 15%). On the noised test set, shown in Figure 6 (i), the "mean+max" combination still performs the worst (miss rate of 44%), while the "mean+sigma" combination performs clearly the best (miss rate of about 30%). From all the results, we can see that the "mean+sigma" combination has the best classification performance among all the combinations. This is because it captures the majority of the local statistical information while being more robust than the combinations containing the max statistic.
From Figure 6, we can also see that statistic combinations often outperform single statistics under bad imaging conditions. With the statistic combination method, the lowest miss rate on the blurred test set decreases from 17% (given by the S3 statistic) to 15% (given by the "mean+sigma" combination), and on the noised test set it decreases significantly from 44% (given by the sigma statistic) to 30% (given by the "mean+sigma" combination).
5.4 Analysis of Feature Selection
We use the OpenCV [6] implementation of the AdaBoost training algorithm, with decision trees adopted as the weak learners. Each non-leaf node in a decision tree is a stump classifier associated with a selected feature and a weight (a positive weight votes for the positive class, a negative one for the negative class). To analyze the feature selection process of our optimal detector, we accumulate the weights of the selected features from the same local area and plot the corresponding weight maps in Figure 8. As expected, as shown in Figure 8 (b) and (c), features around the head-shoulder region, the torso and the legs have high weights. Features that contribute to negative samples are mostly located in the inner part of the torso and outside the human silhouette, as shown in Figure 8 (d) and (e).
Figure 8. Weight maps of local features. (a) average edge image of the positive training set. (b) and (c) are positive weight maps of features from "mean" and "sigma" respectively. (d) and (e) are the corresponding negative weight maps.

We find that 60.1% of the selected features (with positive weights) come from the mean statistic and the others from the sigma statistic, so mean and sigma are both important for human detection. The same holds for low-scale and high-scale features: 60.2% of the selected features come from the first L4 layer, and the others from higher L4 layers.
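The weight-map construction amounts to accumulating stump weights per local area; a sketch under an assumed record format of (cell_row, cell_col, weight) for each selected feature:

```python
import numpy as np

def weight_map(selected, grid=(16, 8), sign=+1):
    """Accumulate AdaBoost stump weights into a spatial map.

    selected: iterable of (cell_row, cell_col, weight) records for the
              features chosen during training; positive weights vote for
              the pedestrian class, negative ones for the background.
    grid:     16 x 8 cells for a 64 x 128 window with 8 x 8 cells.
    """
    m = np.zeros(grid)
    for row, col, w in selected:
        if np.sign(w) == sign:
            m[row, col] += abs(w)
    return m
```

Calling it once with sign=+1 and once with sign=-1 yields the positive and negative maps of Figure 8, up to the per-statistic split.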
5.5 Discussions
From the experimental results we can draw some useful findings: 1) The PSOF descriptor outperforms gradient-based descriptors, like HOG, in pedestrian detection. 2) Higher image scales in the PSOF descriptor make it more robust to image blur and noise. 3) Standard deviation is the best single statistic for the PSOF descriptor, and max is the worst. 4) "mean+sigma" is the best statistic combination for the PSOF descriptor, and "mean+max" is the worst. 5) Statistic combinations often outperform single statistics under bad imaging conditions.

Implemented in C++, our detector takes about 440 µs/window on a computer with a P4 2.5 GHz CPU (a little slower than the HOG-SVM detector, which takes about 250 µs/window). Perhaps this can be improved with the cascade classifier structure described in [23]. Demonstration videos can be seen in [1].
6. Conclusions

In this work, the PSOF (Pyramidal Statistics of Oriented Filtering) descriptor is proposed for pedestrian detection. Unlike gradient-based methods, which use blur- and noise-sensitive gradients to obtain orientation information, the PSOF descriptor constructs an image pyramid and uses a Gabor filter bank to obtain multi-scale pixel-level orientation information. Locally normalized pyramidal statistics of these Gabor responses are then used to describe object shape. We showed experimentally that the PSOF descriptor is much more robust to image blur and noise than the HOG (Histograms of Oriented Gradients) descriptor, while matching HOG's excellent detection performance under normal imaging conditions. We also studied the influence of various parameter settings and concluded that multi-scale information and statistic combination are two important factors in the robustness of the PSOF descriptor.
Acknowledgement This work is funded by research grants from the National Basic Research Program of China (2004CB318110), the National Science Foundation (60605014, 60875021), the National Natural Science Foundation of China (60736018, 60723005) and National Laboratory of Pattern Recognition (2008NLPRZY-2). The authors also thank the anonymous reviewers for their valuable comments.
References [1] Authors. Demos. http://limin81.cn/HumanDetDemo.zip. [2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, California, USA, 1984. [3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, second edition, 2001. [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. of CVPR’05, 2005.
[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 1997. [6] Intel. http://www.sourceforge.net/projects/opencvlibrary. [7] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1), 2001. [8] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV’04 Workshop on Stat. Learn. in Comp. Vis., 2004. [9] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In Proc. of CVPR’05, 2005. [10] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced Fisher linear discriminant model. IEEE Trans. PAMI, 11(4):467–476, 2002. [11] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In Proc. of ECCV’04, 2004. [12] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4), 2001. [13] S. Munder and D. Gavrila. An experimental study on pedestrian classification. PAMI, 28(11), 2006. [14] J. Mutch and D. G. Lowe. Object class recognition and localization using sparse features with limited receptive fields. IJCV, 80(1):45–57, 2008. [15] M. Oren, C. Papageorgiou, et al. Pedestrian detection using wavelet templates. In Proc. of CVPR’97, 1997. [16] S. Paisitkriangkrai, C. Shen, and J. Zhang. An experimental study on pedestrian classification using local features. In IEEE Inter. Symp. on Circuits and Systems (ISCAS’08), 2008. [17] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1), 2000. [18] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In Proc. of CVPR’07, 2007. [19] E. Seemann, M. Fritz, and B. Schiele. Towards robust pedestrian detection in crowded image sequences. In Proc. of CVPR’07, 2007. [20] T. Serre, L. Wolf, and T. Poggio.
Object recognition with features inspired by visual cortex. In Proc. of CVPR’05, 2005. [21] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In Proc. of ECCV’06, 2006. [22] O. Tuzel, F. Porikli, and P. Meer. Human detection via classification on Riemannian manifolds. In Proc. of CVPR’07, 2007. [23] P. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. of CVPR’01, 2001. [24] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 63(2), 2005. [25] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In Proc. of ICCV’05, 2005. [26] Q. Zhu, S. Avidan, M. Yeh, and K. Cheng. Fast human detection using a cascade of histograms of oriented gradients. In Proc. of CVPR’06, 2006.