A Crowd Modelling Framework using Fast Head Detection and Shape-Aware Matching
Tao Zhou,a Jie Yang,a Artur Loza,b,c Harish Bhaskar,b,c Mohammed Al-Muallab

a Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China. b Department of Electrical and Computer Engineering, Khalifa University of Science, Technology and Research, Abu Dhabi, U.A.E. c School of Engineering, University of Bristol, U.K.
Abstract. In this paper, a novel framework for crowd modelling using a combination of multiple kernel learning (MKL)-based fast head detection and shape-aware matching is proposed. First, the MKL technique is used to train a classifier for head detection using a combination of Histogram of Oriented Gradient (HOG) and Local Binary Patterns (LBP) feature sets. Further, the head detection process is accelerated by implementing the classification procedure only at those spatial locations in the image where the gradient points overlap with moving objects. Such moving objects are determined using an adaptive background subtraction technique. Finally, the crowd is modelled as a deformable shape through connected boundary points (head detections) and matched, with the subsequent detections from the next frame, in a shape-aware manner. Experimental results obtained from crowded videos show that the proposed framework, while being characterized by a low computational load, performs better than other state-of-the-art techniques and results in reliable crowd modelling in the tested video sequences.

Keywords: crowd analysis, macro-modelling, shape-context, shape-aware matching, head-detection.

Address all correspondence to: Harish Bhaskar, Department of Electrical and Computer Engineering, Khalifa University of Science, Technology and Research, Abu Dhabi, U.A.E.; E-mail:
[email protected]
1 Introduction
Crowd detection and modelling in videos are emerging fields of research that have gained significant attention in recent years, motivated primarily by the security issues surrounding surveillance in crowded environments. Such crowd analysis frameworks are known to have a critical impact in applications including intelligent surveillance, anomaly detection, behavioural understanding, situation awareness and flow analysis, among many others.3 Although much research effort has been spent in developing techniques for crowd analysis, several unsolved problems remain in vision-based crowd analysis, particularly concerning detection, tracking, occlusion handling, crowd modelling and inference.
Crowd modelling approaches can be categorized into two major classes: a) macro-level and b) micro-level.3 At the macro-level, crowd flow is often used to represent dynamics, whereas micro-level models are driven by scene semantics and are fairly target specific. Although these categories may seem well separated, most of the literature indicates the use of a combination of techniques from both categories (dubbed hybrid) for crowd analysis. Crowd modelling has conventionally remained density-dependent.10 Techniques for density-dependent crowd analysis have been performed at three distinct levels. At the low level, detection of individual targets is performed by extracting primitive attributes such as motion cues, using either optical flow or background subtraction, and tracking using filtering or regression models.8, 25 A number of pedestrian detection methods proposed in the literature fall under this category. Examples include the work of Garcia-Bunster et al.18 and Chen and Huang,11 who exploited the Stauffer and Grimson approach32 for foreground extraction and the detection of regions where people exist. At the mid level, pattern recognition-based methods, including classification and clustering schemes, are popular. For example, methods such as Support Vector Machines (SVM)28 and AdaBoost35 have been applied to various classification problems, where Haar features24 and the Histogram of Oriented Gradient (HOG)14, 19 are widely used as feature descriptors. In the present context, Wu et al.38 used body part detectors to detect multiple people in the presence of partial occlusion in crowded and cluttered scenes. Their method can overcome some partial occlusion, but it requires high-resolution images. Dalal et al.14 used the HOG descriptor and a linear SVM classifier for pedestrian detection. Compared with the part-based detector of,38 their method achieves better detection performance on lower-resolution images. In order to handle heavy occlusion, Senior et al.30 established an appearance-based model with observation probabilities for all pixels in the foreground regions to detect interacting pedestrians. Although this appearance
model for pedestrian detection is updated in every frame, the method is not effective for detecting humans in long video sequences under illumination changes, since their background model is not adaptive.22 Examples of other similar detection approaches include the work of.1, 16, 26, 36 However, these methods cannot be applied to very crowded scenes with significant occlusion, because all pedestrians need to be detected and segmented5 reliably for accurate modelling. Finally, at the high level, techniques such as dynamic texture models9 and the Lagrangian-based crowd flow segmentation approach of2 have been used. In,2 Lagrangian particle dynamics is used for the segmentation of high-density crowd flows, where the moving crowd is treated as an aperiodic dynamical system. Irrespective of the level at which modelling is performed, results have demonstrated that crowd modelling approaches have been fairly density-dependent.
Whilst person or head detection is complicated, tracking is a much more challenging problem, as targets must first be detected and segmented and then be tracked jointly across multiple video frames. Furthermore, tracking in crowds needs to handle a large number of targets simultaneously, which randomly enter or leave the field of view. This problem can be addressed using a joint state space of variable dimension, where a number of targets can be inferred in parallel with specific configurations.15, 17, 21, 39, 41 On one hand, most techniques in the literature perform single object tracking, and many of these are sensitive to appearance variations and heavy occlusions. On the other hand, the computational burden becomes very high when tracking multiple targets. The Lucas-Kanade and Kanade-Lucas-Tomasi (KLT) feature trackers are very common optical flow algorithms in crowded scene analysis applications and have been popularly used in the work of.12, 20, 31 Cheriyadat and Radke12 used the KLT algorithm to track low-level features extracted using the Shi-Tomasi-Kanade and Rosten-Drummond detectors. He and Liu20 applied the KLT algorithm to obtain motion that facilitated the extraction of key feature points. Similarly, in31 the Lucas-Kanade optical flow algorithm together with particle advection was exploited to define a dynamic system for the crowd. Despite sustained efforts in tracking crowds, it remains very difficult to achieve accurate and effective person detection and tracking in crowds.
2 Contributions & Structure
The main aim of this paper is to propose a novel framework for modelling crowds in videos using fast head detection and shape-aware matching. The framework shall provide the necessary means for analysing crowds in videos such that behavioural analysis for situation awareness becomes possible. For the deployment of such crowd analysis systems in real time, efficient yet accurate detection of the crowd is often mandatory. One important contribution of this paper is the implementation of an MKL-based classification method fusing HOG-based appearance features with LBP-based texture features, deployed only at those spatial locations where the gradient points in the frame overlap with moving objects, extracted using a background modelling technique. Such a deployment is anticipated to improve the overall accuracy of detection by reducing false positives while simultaneously providing compelling computational advantages. Furthermore, crowd modelling on a frame-to-frame basis is achieved by matching the detected crowd, represented as a connected graph, in a shape-aware manner. The advantage of such shape-aware matching of crowds is that it facilitates inferring holistic flow changes of the crowd that are indicative of real-life behavioural changes, for decision making during situation awareness.
The remaining part of this paper is organized as follows. The proposed MKL-based fast head detection algorithm (Section 3.1) and shape-aware matching of crowds (Section 3.2) are described under Section 3.
Fig 1 Proposed framework for head detection and tracking in crowded scenes.
Experimental results and comparative analysis are discussed in Section 4. Finally, concluding remarks and directions for future work are presented in Section 5.
3 Proposed Methodology
Our proposed method for crowd detection and modelling is illustrated in Fig. 1. The process consists of two distinct phases: a) MKL-based crowd detection and b) shape-aware matching for crowd modelling. The former is a supervised kernel-based feature fusion technique that integrates the complementary strengths of HOG and LBP. The latter is a method for simultaneously estimating the correspondences and a smooth non-rigid transformation between graphically modelled shapes of crowds, factorised in a shape-aware manner.
3.1 Crowd Detection
The proposed implementation of crowd detection through head detection is illustrated in Fig. 2. The process is accomplished in two stages: training and detection. During training, a combination of HOG and LBP features is extracted from positive and negative samples, indicating the presence and absence of a head region, respectively. These feature sets are then learnt by an MKL-based classifier. A coarse-to-fine localization of interest points is used to speed up the implementation of head detection.
Fig 2 Process flow of the proposed head detection algorithm.
The MKL-based classification technique is a parametric kernel-based learning method6, 33 popular for inference using feature fusion. The main idea is to combine multiple kernels, through the construction of kernel matrices with relevant parametrization for each chosen feature representation individually, instead of using a single kernel. The primal of the MKL problem can be formulated as the learning of a function of the form:
\[ f(x) = w^t \phi_d(x) + b \tag{1} \]

with the kernel

\[ k_d(x_i, x_j) = \phi_d^t(x_i)\,\phi_d(x_j) \tag{2} \]
While the goal in SVM-based classification is to learn the parameters w and b from the training set {(x_i, y_i)}, MKL additionally requires the estimation of the kernel parameter d. Therefore, the MKL formulation is represented as follows:
\[ \min_{w,b,d} \; \frac{1}{2} w^t w + \sum_i l(y_i, f(x_i)) + r(d) \quad \text{subject to } d \geq 0 \tag{3} \]
where l is a loss function. Varma et al.33 followed a simple two-step optimization procedure wherein: a) in the outer loop, the kernel is learnt by optimizing over d, and b) with the kernel fixed, the parameters w and b are learnt in the inner loop. Therefore, this optimization process can be implemented by rewriting the previous MKL formulation as follows:
\[ \min_{d} \; T(d) \quad \text{subject to } d \geq 0 \tag{4} \]
where

\[ T(d) = \min_{w,b} \; \frac{1}{2} w^t w + \sum_i l(y_i, f(x_i)) + r(d) \tag{5} \]
In the proposed method, traditional MKL formulations such as that of33 can be easily extended to learn general kernel combinations subject to regularization (either the L1 or the L2 norm) of the kernel parameters. Generalized kernel learning can be implemented very efficiently using a gradient descent optimization method.
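To make the two-step procedure concrete, the following is a minimal sketch of the alternating optimization, assuming an SVM hinge loss and an L1 regularizer r(d); the RBF base kernels, learning rate and helper names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of two-step generalized MKL (Eqs. 3-5): the inner loop fits an
# SVM on the combined kernel; the outer loop updates the kernel weights
# d by gradient descent, keeping d >= 0. Hyperparameters are assumed.
import numpy as np
from sklearn.svm import SVC

def rbf(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_mkl(X_hog, X_lbp, y, iters=20, lr=0.1, sigma=1.0):
    kernels = [rbf(X_hog, X_hog), rbf(X_lbp, X_lbp)]   # one per feature set
    d = np.array([0.5, 0.5])                           # kernel weights
    for _ in range(iters):
        K = d[0] * kernels[0] + d[1] * kernels[1]
        svm = SVC(kernel='precomputed', C=1.0).fit(K, y)  # inner loop: w, b
        a = np.zeros(len(y))
        a[svm.support_] = svm.dual_coef_.ravel()          # signed dual coefs
        for k, Kk in enumerate(kernels):                  # outer loop: d
            grad = sigma - 0.5 * a @ Kk @ a               # dT/dd_k (L1 reg.)
            d[k] = max(d[k] - lr * grad, 0.0)             # keep d >= 0
    return svm, d
```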
The success of supervised detection strategies depends not only on the learning strategy itself, i.e., the fusion process, but more importantly on the features chosen from the observation set, which must be capable of reliably distinguishing one form of data from another. For our problem of crowd detection, occlusion is inevitable; hence, detecting people as a whole (typically using body-part models) is fairly unrealistic. However, head detection is a promising alternative for detecting people in crowded environments. Our formulation of head detection, and in turn crowd detection, is based on the two complementary features listed below.
Appearance & Texture Features: HOG features have been widely used in object detection and have been successful in encapsulating edge and local shape information of targets.37 In this paper, the HOG is implemented in the manner suggested in.37 The HOG features are extracted from the R, G & B channels of the video frames and concatenated into a single feature vector. The computation of the HOG features is made very efficient by the use of the VLFeat library.34 In the context of the crowd detection problem, the HOG features show increased sensitivity to backgrounds with cluttered edges and hence, at times, can perform poorly. On the other hand, the LBP is a highly discriminative texture descriptor that is well known for its computational efficiency and invariance to monotonic grey-level changes.27 The LBP feature is complementary to the HOG in this respect. The LBP texture features are implemented using the algorithm proposed in27 and are also extracted from the R, G & B channels. In crowded scenes, the combination of edge-type shape information with texture features can provide improved recognition of heads on a wide range of data samples.
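As a concrete illustration, the following sketch builds the fused per-patch descriptor, using skimage's HOG and LBP extractors as assumed stand-ins for the paper's VLFeat-based implementation; the 8-pixel cells and 8-neighbour LBP follow the experimental settings, while the bin counts are illustrative.

```python
# Sketch of the fused HOG + LBP patch descriptor (assumption: skimage
# extractors; the paper uses VLFeat). Features are computed per R, G, B
# channel and concatenated into one vector.
import numpy as np
from skimage.feature import hog, local_binary_pattern

def describe_patch(patch_rgb):
    feats = []
    for c in range(3):                                  # R, G, B channels
        ch = patch_rgb[:, :, c]
        feats.append(hog(ch, orientations=9,
                         pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
        lbp = local_binary_pattern(ch, P=8, R=1, method='uniform')
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        feats.append(hist)                              # normalized LBP hist
    return np.concatenate(feats)                        # fused descriptor
```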
Without loss of generality, the feature space could be extended using other low-level features, including colour, motion, etc. Since our crowd detection and modelling problem is aimed at situation awareness applications, typically hosted within a surveillance environment, real-time processing expectations are generally high. In this paper, in order to deal with the computational complexity constraints of conventional detection techniques, a novel optimization mechanism based on a coarse-to-fine implementation of interest point localization is proposed.
3.1.1 Speeded-Up Crowd Detection
The detection process consists of applying the trained MKL classifier to a test image, which often involves the use of a sliding window approach. Although the sliding window is a popular tool for detection, it is computationally expensive, as it scans all possible locations at pre-defined step sizes. In this paper, a speeded-up head detection method is proposed, consisting of a coarse-to-fine searching process. The principle behind this fast technique is to restrict the detections to those spatial locations that exhibit the maximum probability of hosting a potential target. This reduced computational effort is realised using the procedure summarized in Algorithm 1.
Algorithm 1. Fast detection method
Coarse steps:
1. Select gradient points using a soft threshold;
2. Find moving objects through background subtraction;
3. Obtain candidate positions by combining gradient points with moving objects.
Fine steps:
4. Apply the head detector at all candidate positions;
5. Apply a clustering algorithm to the detection results.

In the coarse step, interest points are extracted in the image using gradients, under the assumption that points corresponding to a potential head region normally have a higher gradient value than other points in the neighbouring region. Note that the gradient points can be computed once, while the HOG features are calculated. Further, moving objects are segmented through an adaptive background subtraction technique. Finally, the candidate positions are obtained by combining the interest points from the gradient method with the localized moving targets from the background subtraction technique. In the fine step, the head detection classifier is applied around the candidate positions, and the results are then clustered to generate the final detections. In order to validate our gradient assumption, Fig. 3 is considered. In the chosen dataset, it is clear that the relationships between the current point (head) and its neighbouring points lead to higher gradient values. Therefore, by selecting the points that have a high gradient value, it is possible to localize head regions. Given an image I, with its gradients in the x and y directions denoted by g_x and g_y, respectively, the gradient value is computed as:
\[ g(i, j) = \sqrt{g_x(i, j)^2 + g_y(i, j)^2} \tag{6} \]

where g(i, j) corresponds to the gradient value at the point (i, j) in the image I. A soft threshold is used to extract the gradient points that are likely to correspond to head locations: g(i, j) > T_0, where T_0 is the gradient threshold. In our experiments, T_0 is set to 80.
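A minimal sketch of this coarse gradient-point selection (Eq. 6) is shown below; the Sobel derivatives are an assumed stand-in for the gradients already computed for HOG, while T_0 = 80 follows the setting above.

```python
# Sketch of gradient-point selection (Eq. 6) with the soft threshold T0.
import numpy as np
from scipy import ndimage

def gradient_points(gray, T0=80.0):
    gx = ndimage.sobel(gray.astype(float), axis=1)   # g_x
    gy = ndimage.sobel(gray.astype(float), axis=0)   # g_y
    g = np.sqrt(gx ** 2 + gy ** 2)                   # gradient magnitude
    return np.argwhere(g > T0)                       # (row, col) candidates
```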
Fig 3 Points from the head region showing the gradient relationship between the current point and its neighbouring points.

Fig 4 Detection results on gradient points. Left: gradient points obtained by gradient value; Right: detection results obtained by implementing the head detection algorithm on these points of interest.
The use of the gradient as the only criterion for generating interest points can result in a large number of false positives, leading to poor detection accuracy and increased computational complexity, as illustrated in Fig. 4. Therefore, moving object detection using an adaptive background subtraction technique is used to further narrow the search space where head detection shall be applied. During adaptive background subtraction, first the background model in the t-th frame is initialized by b_t. Second, the difference between the background model b_t and the next frame I_{t+1} is computed, and the result is binarized using a threshold function. Finally, the background model is updated as follows:
\[ b_{t+1} = \lambda I_{t+1} + (1 - \lambda) b_t \tag{7} \]
where I_{t+1} denotes the current image, b_t and b_{t+1} represent the previous and new background templates, respectively, and λ is the learning rate, set to 0.15. In this fast detection method, gradient points are combined with moving objects to determine the candidate positions. Thus, the head detection algorithm is implemented only on these candidate positions, as described in Fig. 5.
Fig 5 Fast detection method. Left: moving object detection by background subtraction; Middle: combining the points of interest with moving objects to find candidate positions; Right: detection results obtained by implementing the head detection algorithm at all candidate positions.
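The following sketch ties the coarse steps together: the running-average background update of Eq. (7) with λ = 0.15, and the intersection of gradient points with the moving-object mask; the difference threshold and helper names are assumptions for illustration.

```python
# Sketch of the adaptive background update (Eq. 7) and the coarse
# candidate-selection step (gradient points restricted to movers).
import numpy as np

def update_background(b_t, frame, lam=0.15):
    return lam * frame + (1.0 - lam) * b_t           # Eq. (7)

def candidate_positions(gray, b_t, grad_pts, diff_thresh=25.0):
    moving = np.abs(gray - b_t) > diff_thresh        # foreground mask
    keep = moving[grad_pts[:, 0], grad_pts[:, 1]]    # gradient pts on movers
    return grad_pts[keep]                            # fine-step positions
```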
3.2 Crowd Modelling
In order to adequately model crowds in video sequences, the head detections from the previous frame must be associated with the detections in the current frame. Such associations between consecutive detections can be materialized through the implementation of a robust matching or tracking algorithm.
This paper is focused on hybrid modelling of crowds, i.e., incorporating the micro-detection of targets through individual head detection into a holistic model for crowd tracking. Conventionally, two approaches are possible to accomplish this. One is to implement a tracking algorithm for each individual head detection and cluster the trajectories to model the crowd. This method requires tracking individual heads, which can fail significantly under heavy occlusion and is hence inefficient. A more reasonable approach is to perform matching between selected head detections in subsequent frames. The selection criteria should impose constraints on the representation of the general shape of the crowd itself and provide the flexibility to transform to the desired level of granularity when required. Our proposed framework is based on this underlying principle for crowd modelling. Our matching method works in two distinct steps: a) representation of the crowd shape (in accordance with the hybrid principles outlined above), and b) shape matching.
3.3 Crowd Shape Representation
A robust representation of the crowd is often desired in order to facilitate accurate matching and association. Given the head detections from the previous step, the simplest shape representation of a crowd in the given image can be constructed using a graphical model of connected boundary points (heads). The advantage of such a graphical representation is that it permits both coarse and fine analysis during matching and makes the representation flexible, particularly during the appearance and disappearance of targets. In addition, the graphical representation of the shape of the crowd permits systematic graph matching and inference methods to be applied. Given that the chosen problem in this paper is rather simplistic, it is beyond the scope of this paper to address such graphical model matching methods. We choose to propose a rather simple representation and modelling strategy based primarily on spatial constraints of target positions. Therefore, the selection of boundary points (heads) from the set of all detections using simple spatial constraints is critical. The selection of appropriate boundary points is sensitive to the presence of outliers generated by false detections, detections of new targets, or mis-detections of existing targets.
Examples of such scenarios are shown in Fig. 6(a). In order to cope with such changes, each boundary point from the current frame is chosen and its neighbouring points from the previous frame are searched, as shown in Fig. 6(b). The final shape representation is shown in Fig. 6(c).
Fig. 7 shows the flow of the basic shape representation method: boundary points are selected in the current frame and related neighbour points are searched around the boundary points in the previous frame. In this method, a boundary point is deleted if its neighbour point is not found within a small neighbourhood region in the previous frame. However, a true boundary point corresponding to a correct head detection may then fail to be added to the boundary point set, which introduces inaccuracies into the process of building the shape and applying the shape matching. To handle this issue, an improved shape representation method is proposed, as shown in Fig. 8. The main aspect distinguishing this method from the previous one is that a point from the head detection point set in the current frame is deleted if it does not have a neighbour point in the previous frame. The search for boundary points is then repeated on the current frame until every boundary point has its neighbour points, resulting in the two shapes being computed.
Fig 6 Shape representation by selecting boundary points (previous frame vs. current frame): (a) select boundary points directly; (b) search neighbour points by the nearest neighbour method; (c) shapes are represented by a set of boundary and neighbour points.
Fig 7 Basic shape representation method.
There are some specific differences between the two methods shown in Fig. 7 and Fig. 8, outlined below (a sketch of the improved scheme follows the list):
• A point that has no neighbour point in the previous frame is deleted from the boundary point set in the basic method, whereas it is deleted from the detection point set in the improved method.
• Boundary points are found and neighbour points searched only once in the basic shape representation method, while in the improved method the search is performed multiple times.
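The following is a minimal sketch of the improved scheme under stated assumptions: a convex hull stands in for the paper's boundary-point selection, and a fixed radius defines the neighbourhood in the previous frame; both are illustrative choices, not the authors' exact criteria.

```python
# Sketch of the improved shape representation: outliers are removed from
# the *detection* set and the boundary is re-found until every boundary
# point has a neighbour in the previous frame.
import numpy as np
from scipy.spatial import ConvexHull, cKDTree

def improved_shape(curr_pts, prev_pts, radius=15.0):
    tree = cKDTree(prev_pts)
    pts = curr_pts.copy()
    while len(pts) >= 3:                             # hull needs >= 3 points
        hull = pts[ConvexHull(pts).vertices]         # candidate boundary
        dist, _ = tree.query(hull)                   # nearest previous point
        bad = hull[dist > radius]                    # points with no neighbour
        if len(bad) == 0:
            return hull                              # all boundary pts matched
        drop = (pts[:, None] == bad[None, :]).all(-1).any(-1)
        pts = pts[~drop]                             # delete from detection set
    return pts
```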
Fig 8 Improved shape representation method.
3.4 Shape-aware Matching
In a crowded sequence, it is often difficult to track each head detection due to factors such as occlusion, illumination changes, pose variation and motion blur. Therefore, in this work the crowd is modelled and tracked using shape matching. For shape matching, it is mandatory to define an objective cost function in order to establish correspondences between the features extracted using shape descriptors in consecutive frames. In this paper, the shape context is used as a way of describing the shape of the crowd, thereby facilitating the measurement of shape similarity and the recovery of point correspondences. The shape context is a feature descriptor used in object detection and recognition.7 For a point p_i of the shape, consider the neighbouring relations between p_i and the remaining points on the shape. A coarse histogram h_i of the relative coordinates of the remaining n − 1 points can be computed as:
\[ h_i(k) = \#\{\, q \neq p_i : (q - p_i) \in \mathrm{bin}(k) \,\} \tag{8} \]
where h_i is the shape context feature of p_i, and k represents the k-th bin. The bins are normally taken to be uniform in log-polar space. This histogram describes the neighbourhood relationships between the current point and all the remaining points, and constitutes a rich and robust feature descriptor for shape matching.
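A minimal sketch of the log-polar histogram of Eq. (8) follows; the 5 radial and 12 angular bins and the mean-distance scale normalization follow the common shape-context convention7 and are assumptions here.

```python
# Sketch of the log-polar shape context histogram (Eq. 8) for point i.
import numpy as np

def shape_context(points, i, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    rel = np.delete(points, i, axis=0) - points[i]      # relative coordinates
    r = np.linalg.norm(rel, axis=1)
    r = r / (r.mean() + 1e-10)                          # scale normalization
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r + 1)
    t_edges = np.linspace(0, 2 * np.pi, n_theta + 1)
    h, _, _ = np.histogram2d(r, theta, bins=(r_edges, t_edges))
    return h.ravel() / max(h.sum(), 1)                  # normalized K-bin hist
```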
During matching, consider a point p_i in the shape from the current frame and another point q_j in the shape from the previous frame, together with their normalized K-bin histograms (i.e., shape contexts) h_i(k) and h_j(k). Since shape contexts are distributions represented as histograms, it is possible to use the χ2 test statistic to compute the cost of matching the two points:
\[ C_{ij} = C(p_i, q_j) = \frac{1}{2} \sum_{k=1}^{K} \frac{[h_i(k) - h_j(k)]^2}{h_i(k) + h_j(k)} \tag{9} \]

where h_i(k) and h_j(k) represent the K-bin histograms for p_i and q_j, respectively. After obtaining the cost C_ij for all pairs of points, the total cost of shape matching is minimized:
\[ H(\pi) = \sum_i C(p_i, q_{\pi(i)}) \tag{10} \]
where the permutation π can be constrained to ensure one-to-one matching. As this formulation is an instance of the square assignment problem, an efficient solution method, as in,23 is utilized.
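The sketch below combines the χ2 cost of Eq. (9) with a one-to-one assignment for Eq. (10); scipy's Hungarian solver is used here as an assumed stand-in for the efficient square-assignment method cited.23

```python
# Sketch of the chi-squared cost matrix (Eq. 9) and one-to-one matching
# (Eq. 10) via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_shapes(H_curr, H_prev, eps=1e-10):
    # H_curr: (n, K) histograms, current frame; H_prev: (m, K), previous.
    num = (H_curr[:, None, :] - H_prev[None, :, :]) ** 2
    den = H_curr[:, None, :] + H_prev[None, :, :] + eps
    C = 0.5 * (num / den).sum(-1)                    # cost matrix C_ij
    rows, cols = linear_sum_assignment(C)            # minimizes H(pi)
    return rows, cols, C[rows, cols].sum()           # correspondences + cost
```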
Since this process only finds correspondences between pairs of points, the next step is to estimate the shape change using a transformation T. Given two shapes P and Q, the distance between P and Q is calculated as a symmetric sum of shape context matching costs over the best matching points:

\[ D_{sc}(P, Q) = \frac{1}{n} \sum_{p \in P} \min_{q \in Q} C(p, T(q)) + \frac{1}{m} \sum_{q \in Q} \min_{p \in P} C(p, T(q)) \tag{11} \]
where T(·) is the thin plate spline (TPS) transform that maps the points of the shape Q onto the shape P. The two shapes are thus matched by minimizing the shape distance. Although global shape matching can produce considerable accuracy for crowd modelling in sequences, the dynamics of the crowd can have a major impact on shape changes over time. Such changes in the shape of the crowd can lead to crowd splitting, merging or simply deforming. Therefore, there is a compelling need to perform shape matching in a shape-aware manner, so that the restrictions the framework imposes on capturing the macro-dynamics of the crowd are overcome.
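As an illustration, the sketch below fits a TPS warp from matched pairs and evaluates the symmetric distance of Eq. (11); scipy's RBFInterpolator with a thin-plate-spline kernel is an assumed stand-in for the paper's TPS implementation.

```python
# Sketch of the TPS warp and the symmetric shape distance of Eq. (11);
# cost[i, j] is assumed to hold the shape-context cost C(p_i, T(q_j))
# of Eq. (9) recomputed on the warped points.
import numpy as np
from scipy.interpolate import RBFInterpolator

def fit_tps(src, dst):
    # Smooth non-rigid map src -> dst from matched point pairs.
    return RBFInterpolator(src, dst, kernel='thin_plate_spline',
                           smoothing=1e-3)

def shape_distance(cost):
    # Symmetric sum over best matching points, Eq. (11).
    return cost.min(axis=1).mean() + cost.min(axis=0).mean()
```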
Fig 9 Shape-aware matching framework.
Here, a shape-aware matching extension is proposed using sub-shape matching, thereby encapsulating local (dis)order in the crowd and leading to improved matching and accurate crowd modelling. The method is based on factorizing the overall shape of the crowd by triangulation, where the joint relationships between interconnected nodes dominate the matching criteria. A block diagram of the proposed shape-aware matching technique is shown in Fig. 9. The method first determines boundary points from the head detections in the current frame. It then selects neighbouring head detections and segments them into different triangles. The current triangle (or group of boundary points) is used for matching against a similar neighbourhood of grouped points from the previous frame. Finally, the sub-shapes are matched and the matching score is calculated.
4 Experiments & Comparisons
In this section, the proposed method is validated and compared against state-of-the-art baseline methods through systematic experiments using publicly available datasets. These experiments, in addition to highlighting the capabilities of the proposed algorithm, also test the effect of its key components to demonstrate their importance within the proposed framework. The approach proposed in this paper has been evaluated on the PETS 2009 database, containing images of size 576 × 768 × 3. The image patch size during the training and testing steps has been fixed to 33 × 33 pixels. The HOG descriptor is extracted with the cell size fixed to 8 pixels, and the normalized LBP histogram has been constructed for each pixel in the cell by comparing it against its 8 neighbours. During background modelling for moving object detection, a constant learning rate λ of 0.15 has been assumed. All experiments have been implemented in MATLAB on a dual-core 2.10 GHz PC with 2 GB RAM.
4.1 Detection Results
In order to appropriately test and compare the crowd detection module of the framework, several experimental evaluations have been designed. To enable proper benchmarking of the results against other competing baselines, both qualitative and quantitative evaluations have been undertaken. Qualitative evaluation has been performed by analysing results through visual inspection, whereas for quantitative evaluation the following metrics are calculated: a) the Mean Absolute Error (MAE) and b) the Mean Relative Error (MRE). The MAE and MRE evaluation metrics are defined according to13 as:
\[ MAE = \frac{1}{M} \sum_{i=1}^{M} |D(i) - T(i)| \tag{12} \]

\[ MRE = \frac{1}{M} \sum_{i=1}^{M} \frac{|D(i) - T(i)|}{T(i)} \tag{13} \]
where M is the number of frames in the test scene, and D(i) and T(i) are the detected number and the true number of people in the i-th frame, respectively. The MAE criterion is very useful for exactly quantifying the error in the estimated number of people in the focus of the camera, while the MRE criterion takes into account the estimation error relative to the true number of people.
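For reference, a minimal sketch of the two count-error metrics of Eqs. (12)-(13):

```python
# Sketch of the MAE/MRE evaluation metrics (Eqs. 12-13).
import numpy as np

def mae_mre(detected, truth):
    detected = np.asarray(detected, float)    # D(i): detected counts
    truth = np.asarray(truth, float)          # T(i): ground-truth counts
    err = np.abs(detected - truth)
    return err.mean(), (err / truth).mean()   # MAE, MRE
```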
First, generic results of crowd detection through the proposed speeded-up head detection process on various samples from the PETS 2009 dataset are presented in Fig. 10.
A detailed visual inspection of the results in Fig. 10 shows that the qualitative accuracy of the proposed head detection is very high.
Fig 10 Results of the head detection module on different frames from the PETS 2009 dataset.

In order to further validate the performance of the proposed detection method, a comparative analysis against other state-of-the-art detection methods is performed. For this purpose, two selected subsets of baseline detection strategies are chosen: a) a group of pedestrian detection algorithms based on learned appearance models, and b) a generic set of crowd detection methods (presented in comparison with the shape-aware matching based modelling results). For the first set of comparative experiments, the following baselines are chosen: Boosting-decision-trees+HOG, Boosting-decision-trees+LBP, Binary-features+Ferns and RCS-LBP+SVM. First, the boosted decision tree method proposed in4 is chosen, a principled approach for speeding up the training of boosted decision trees (Boosting-decision-trees). The method is built on a novel bound on the classification or regression error, guaranteeing that gains in speed do not come at a loss in classification performance. In another interesting study, Mustafa et al.29 proposed a general method for image patch recognition that is effective for object pose estimation, using a Ferns classifier with binary features to obtain better recognition performance (Binary-features+Ferns). An alternative combination of features for fusion-based detection (RCS-LBP+SVM) has been proposed in,40 where LBP and relational color similarity (RCS) features are jointly used to obtain better detection performance using a linear support vector machine (SVM). Fig. 11 illustrates the comparison of head detection results using the aforementioned detection methods on the S2 L2 14-55 scene. It is clearly evident that the proposed method obtains better detection performance than the other baselines. The reduction in accuracy of the baseline methods in frames #43, #76 and #158 can be attributed to the detection of non-heads (a higher number of false positives).
For a simple quantitative analysis of performance, Fig. 12 shows the count of detected heads using the different detection methods on the S2 L2 14-55 and S1 L1 13-57 scenes. As shown in Fig. 12, the overall detection count is lower than the ground truth for all detection methods. However, the plot of the number of detections using the proposed method matches the ground truth more closely than the others. Crowded scenes are typically complicated by large occlusions and severe changes in appearance. Hence, it is crucial to choose feature sets that can adequately cope with such changes. The higher performance of the proposed detection strategy can be attributed to the robustness of the combined HOG and LBP feature sets, whereas some baselines have noticeably failed in this regard.
Finally, the key components of the detection framework are evaluated. An important contribution of this paper is the speeded-up detection process, initiated by the computation of interest points through the intersection of gradient-based detections with moving objects obtained from a background model.
Fig 11 Comparison of head detection results using different detection methods on the S2 L2 14-55 scene (frames #43, #76, #158): (a) Boosting-decision-trees+HOG; (b) Boosting-decision-trees+LBP; (c) Binary-features+Ferns; (d) RCS-LBP+SVM; (e) Ours.
Fig 12 Plot of detection count results using different detection methods on the (a) S2 L2 14-55 and (b) S1 L1 13-57 scenes (legend: Ground truth, Boosting-decision-trees+HOG, Boosting-decision-trees+LBP, Binary-features+Ferns, RCS-LBP+SVM, Ours).
Table 1 Performance of the different detection methods in terms of the MAE and the MRE (as a percentage) indices.

Scene          Boosting-decision-trees+HOG  Boosting-decision-trees+LBP  Binary-features+Ferns  RCS-LBP+SVM     Our method
S1 L1 13-57    14.03 (64.81%)               27.5 (55.92%)                15.25 (70.05%)         14.46 (65.32%)  4.98 (22.67%)
S1 L1 13-59    26.40 (79.39%)               19.74 (61.08%)               13.69 (45.18%)         24.5 (55.92%)   10.80 (25.69%)
S1 L2 14-06    25.70 (69.31%)               26.17 (54.43%)               24.24 (60.87%)         28.5 (65.32%)   11.18 (27.30%)
S1 L2 14-31    21.04 (52.32%)               28.5 (75.11%)                18.30 (24.14%)         27.15 (65.02%)  12.21 (25.15%)
S1 L3 14-17    17.24 (54.18%)               19.06 (60.75%)               26.36 (70.16%)         28.35 (65.42%)  13.22 (24.10%)
S1 L3 14-33    19.67 (55.78%)               26.43 (58.13%)               19.19 (35.87%)         23.52 (47.44%)  12.01 (26.43%)
S2 L1 12-34    23.96 (55.98%)               26.14 (60.23%)               18.22 (33.87%)         29.61 (59.02%)  14.18 (29.17%)
S2 L2 14-55    22.13 (68.51%)               23.35 (72.42%)               18.73 (58.12%)         15.74 (48.74%)  5.56 (17.23%)
It was earlier hypothesized that, by narrowing the potential locations for implementing the head detector, a significant computational gain can be obtained while simultaneously rejecting a large number of false detections. To substantiate this claim, the detection performance of our method is verified against the conventional sliding window method with different search step sizes. In these experiments, the search step sizes are set to 4, 8 and 16. Table 2 reports the comparison of execution time between the proposed speeded-up detection method and the sliding window method. As shown in Table 2, the sliding window method requires a significant amount of time to scan all positions in each tested image, while our method only scans the candidate positions obtained from moving objects. Fig. 13 shows the corresponding head detection results of the two techniques. The sliding window method is the most common scanning technique in object detection. Although some specific adaptations are possible, in general the method advocates scanning all positions of a test image in a sequential manner. The two main limitations of this method are: (1) the high computational demand of implementing the detection at all possible spatial locations of the test image; and (2) the increased chance of falsely interpreting outliers when implementing the detection algorithm at all possible positions in the test image. In comparison, the proposed method, by incorporating moving object detection at high-gradient locations, excludes outliers (or potential non-object positions in the test image) and further, by implementing the head detector only at those interest locations, reduces the computational demand of the detection procedure in general. Next, variations of the classification method using individual and combined feature sets are studied. Fig. 14 shows the detection results using an SVM classifier with the individual feature sets
Fig 13 Comparison of head detection results using the proposed speeded-up detection method against the sliding window method with different search step sizes (step = 4, 8, 16; Ours).
Table 2 Execution time comparison (seconds)

Method                              Time
Detection on gradient points        472.40
Our fast detection method           213.13
Sliding window method, step = 4     3600+
Sliding window method, step = 8     1382.19
Sliding window method, step = 16    325.27
Fig 14 Comparison of detection results between the MKL and SVM classifiers using different features: (a) SVM+HOG; (b) SVM+LBP; (c) MKL+HOG+LBP.
Table 3 Performance of the different detection methods in terms of the MAE and the MRE (as a percentage) indices.

Scene           SVM+HOG         SVM+LBP         Our method
S1.L1 view 001  17.0 (40.22%)   27.5 (55.92%)   5.88 (25.10%)
S1.L1 view 002  16.40 (39.39%)  18.34 (51.08%)  3.29 (25.18%)
S1.L1 view 003  15.70 (39.31%)  21.67 (54.43%)  10.29 (30.87%)
S2.L2 view 001  21.0 (50.02%)   30.5 (55.92%)   8.80 (25.94%)
S2.L2 view 002  7.24 (54.18%)   10.06 (69.58%)  6.36 (69.06%)
S2.L2 view 003  9.66 (35.78%)   16.14 (48.13%)  8.29 (30.87%)
compared against the MKL classifier with the combined HOG and LBP features. From Fig. 14, the advantages of the MKL algorithm combining HOG and LBP features are evident, as it distinctly outperforms the SVM classifier with any single feature. Table 3 reports the head detection performance for the above-mentioned combinations of methods. From Table 3, it can be observed that the SVM classifier with HOG features yields a high MAE, as does the detection method using the LBP feature with an SVM classifier. However, our detection method obtains the best detection performance in all considered sequences.
4.2 Matching Results In this subsection of the paper, the shape-aware matching of crowd models is presented. To evaluate the matching performance, we compute the matching score metric. The matching score is defined as:
\[ \mathrm{Score} = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}} \tag{14} \]
where N_success denotes the number of successfully matched points, and N_total denotes the total number of matched points. A successful match between a point p_i in the first shape and another point q_j in the second shape is decided by the following criterion:
\[ \text{match}(p_i, q_j) = \begin{cases} \text{success}, & d_{ij} < T \\ \text{failure}, & \text{otherwise} \end{cases} \tag{15} \]

with

\[ d_{ij} = \| f(p_i) - q_j \|_2 \tag{16} \]
where d_{ij} denotes the Euclidean distance between the two points matched by the correspondence matrix, and T is a threshold. In the experiments, T has been set to 5.
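A minimal sketch of this matching-score evaluation (Eqs. 14-16) is given below; f is the fitted transform (e.g., the TPS sketch above) applied to points already paired by the correspondence matrix, and T = 5 follows the setting above.

```python
# Sketch of the matching score (Eqs. 14-16) over matched point pairs.
import numpy as np

def matching_score(P, Q, f, T=5.0):
    d = np.linalg.norm(f(P) - Q, axis=1)   # d_ij, Eq. (16)
    return np.mean(d < T)                  # N_success / N_total, Eq. (14)
```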
Results of the head detection and shape-aware matching algorithm on the S2.L2 view 001 sequence from the PETS 2009 dataset are presented in Fig. 15. It can be seen that this method is capable of achieving near-perfect shape matching results. This can be further verified quantitatively through analysis using the matching score.
Fig. 16 shows the process of boundary point selection and sub-shape segmentation, and Fig. 17 shows the sub-shape matching results for the sub-shape segmentation of Fig. 16. Fig. 18 illustrates the shape matching score at each frame on the current database. We can deduce from Fig. 18 that about 85% of the matching scores exceed 0.9, indicating good shape matching performance.
Further shape matching results are presented in Fig. 19 and Fig. 20. Also, results of sub-shape matching on other sequences are presented in Fig. 21 and Fig. 22. Fig. 19 and Fig. 20 also illustrate some failure modes of the shape matching process, indicating the need for further improvements, as mentioned in our future work.
Modelling of crowds does not function from a micro-analysis perspective alone. Other, more generic approaches to crowd detection and modelling have been reported in the literature. Here, in Fig. 23, a comparison of the proposed head detection and baseline crowd detection strategies is presented. The chosen baselines include the following algorithms: a) a motion-based crowd modelling algorithm combining a background model, as in18 or,11 with Kalman filtering; b) an appearance-based model that combines a HOG detector with pedestrian tracking;19 c) the dynamic texture crowd modelling method of;9 and d) the Lagrangian crowd segmentation strategy of.2
Fig 15 Shape matching based head tracking results (frame pairs #20-#21, #21-#22 and #22-#23; matching score = 1 in each case).
Fig 16 Boundary point selection and sub-shape segmentation (frames #20, #21): (a) detection results; (b) boundary points; (c) sub-shape segmentation.
Fig 17 Sub-shape matching results (frames #20, #21; matching score = 1 for each sub-shape).
Fig 18 Shape matching score at each frame (Score axis from 0.5 to 1.0; threshold Score ≥ 0.9 marked).