ICCTA 2017, 28-30 October 2017, Alexandria, Egypt
People Counting with Extended Convolutional Neural Network

Htet Htet Lin
University of Computer Studies, Mandalay, Myanmar
[email protected]

Kay Thi Win
University of Computer Studies, Mandalay, Myanmar
[email protected]
Abstract— Counting people in crowded scenes is a challenging task because of appearance variation, perspective distortion and severe occlusion. Most previous works trade accuracy against missed detections: although they achieve high performance, their accuracy degrades under complex occlusion. Occlusion handling is therefore an important issue and remains an active research area. To address it, this paper establishes an effective people counting framework that automatically estimates the number of people. Unlike earlier approaches, it proposes an extended Convolutional Neural Network (CNN) that contributes region localization and segmentation. First, the foreground is segmented in noisy scenes by foreground detection and the detected areas are localized. Finally, each localized region is fed into the CNN framework to handle the correlated size, occlusion and density level. Experiments on a challenging crowd counting dataset show better performance than state-of-the-art results, demonstrating the effectiveness of the proposed framework.

Keywords— Crowd Counting; Foreground Segmentation and Detection; CNN
I. INTRODUCTION
Estimating the accurate number of people in heavily cluttered scenes is an essential task in many applications such as visual surveillance, operations management, traffic analysis and safety monitoring. Counting many people, especially in large dense crowd areas, is therefore an active area of research. In the literature, existing approaches are clustered into three groups: counting by detection based models, counting by regression based (map-based) models and counting by density estimation based models. Counting by detection involves training a visual object detector to find and count each person in the scene. Counting by regression, on the other hand, learns a mapping between low-level features and the overall number of people in a frame or within a frame region. Counting by density estimation instead learns a mapping from local features to an object density map. In the latter two approaches, individual people are not explicitly detected or tracked, so visual occlusion has less impact on counting accuracy. While generally more computationally efficient than counting by detection, regression-based techniques have historically suffered from overfitting due to a lack of varied training data.
Moreover, existing methods operate either on image patches or on the whole image, and extract local or global features. Feature fusion, frequency-domain analysis, interest points and edge orientations have all been used for people counting. However, the performance of such detectors degrades in complex and noisy scenes because of appearance variation, heavy occlusion and perspective distortion. Occlusion handling is one of the most important open issues. This motivates learning a framework that can handle occlusion when counting in crowded scenes, especially under heavy occlusion. Given a video, the two main objectives of the proposed system are: 1) to find more significant and distinctive features that ensure effectiveness and efficiency, and 2) to obtain an accurate count of people.
The proposed system first calculates a mean-shape background to extract accurate foreground detections; using the mean shape reduces illumination problems. It then segments the foreground people areas with a k-means clustering algorithm. Unlike other works, this paper builds the mean background instead of taking a single frame as the background. For accurate counting, the proposed system uses an extended (multi-task) CNN framework to reduce estimation error: it takes a localized, segmented region of the whole image as input and automatically outputs the count. This paper thus addresses an effective methodology for people counting and shows the distinct progress achieved by the proposed method.

The paper is organized as follows. Section II reviews earlier methods and technologies for crowd counting. Section III describes the proposed approach and its contribution to the crowd sensing field. Section IV investigates and discusses the experimental results compared with state-of-the-art results. Section V concludes the paper.

II. RELATED WORK
A. Counting by detection based model

Most state-of-the-art work in this category relies on a detection framework using features (such as Haar wavelets, histograms of oriented gradients, edgelets and shapelets) extracted from the full body or from body parts [16, 5, 18, 15].
These features are then used to train various classifiers (such as AdaBoost, random forests and Support Vector Machines) [17]. These methods achieve high performance only in low-density scenes, i.e. with four or five people in the scene. Even with part-based or shape-based detectors, they do not mitigate the occlusion problem and are not suitable for crowded scenes.

B. Counting by regression based (map-based) model

To overcome the occlusion issue, many researchers turned to regression, or mapping, approaches. They count by learning a mapping between features extracted from local image patches and the actual counts [14, 4]. This avoids learning detectors, which is a relatively complex task. These frameworks use various regressors (such as linear regression, piecewise linear regression, Gaussian process regression, ridge regression and neural networks) [8] to relate low-level features to the crowd count.
Such frameworks have also exploited discriminative attributes to address sparse training data, and they are inherently capable of handling imbalanced data. However, they provide only a weak estimate of where people are located in real-time video.
Fig. 1. Example images of PET 2009 dataset. Challenges include severe occlusion, clutter and similar appearance of people
C. Counting by density estimation based model

It has been observed that most regression-based methods ignore global spatial information. Lempitsky et al. [9] presented a linear framework that learns a mapping between local patch features and object density maps, posed as a convex optimization solved with a cutting-plane method. Pham et al. [12] presented a non-linear framework that uses a random forest to map local patch features to density maps. Generating high-quality density maps with low count estimation errors remains an important issue to be addressed in the future. This model is the most sophisticated and time-consuming of the three.

To tackle occlusion and to achieve low count estimation error, this paper proposes an accurate people counting framework that segments the foreground region with gradient k-means and trains an extended CNN.
III. PROPOSED APPROACH
This paper describes a people counting framework that addresses the occlusion problem: the input is a video and the output is each frame labeled with the number of people it contains. Output quality is measured with count error metrics computed on the counts assigned to the detected foreground localization regions. In the experiments, the proposed system achieves a lower mean absolute error rate and mean relative error rate, which captures our main concern.

A. Overview

The system overview of the proposed approach is depicted in Fig. 3. The original image is the input of the proposed system. The foreground is detected on each input frame and first-order features are extracted. Once the foreground is clearly detected, the proposed system segments the localized regions that contain only people using gradient k-means segmentation. The segmented regions are then used as region proposals and fed into the CNN classifier, which serves both as feature extractor and feature learner to obtain the best result. The output is the image labeled with the estimated count.
Fig. 2. Pseudo code form representation of the proposed work
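The pseudo code of Fig. 2 is not reproduced here; as a hedged illustration of the processing order it describes, the sketch below chains the stages detailed in the rest of this section. The callables `segment_regions` and `count_region` stand in for the gradient k-means segmentation and the extended CNN sketched later; all names are placeholders, not the authors' code.

```python
import numpy as np

def count_video(frames, segment_regions, count_region, threshold=30.0):
    """Label every frame of a grayscale stack (N, H, W) with an estimated people count.

    `segment_regions(mask, frame)` returns localized region proposals and
    `count_region(patch)` returns a per-region count; they are passed in as
    callables so this outline stays self-contained.
    """
    frames = np.asarray(frames, dtype=np.float64)
    background = frames.mean(axis=0)                   # mean-shape background (Sec. III-B)
    counts = []
    for frame in frames:
        mask = np.abs(frame - background) > threshold  # foreground detection, Eq. (1)
        regions = segment_regions(mask, frame)         # gradient k-means region proposals
        counts.append(sum(count_region(r) for r in regions))  # extended CNN counting
    return counts
```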
B. Foreground Detection and Segmentation

Given a video, the proposed system first resizes and enhances the original input image. To detect foreground regions that serve as candidate detections more accurately, a clean background is essential. Previous works either run an object detector or perform background subtraction with various approaches, typically taking the first frame of the video as the background. In this paper, the proposed system instead calculates the mean shape over all frames and applies it to foreground detection. As in other approaches, a threshold must be set in the foreground detection process. The proposed system obtains foreground detections with reasonable recall using the following test:

|I_o(i) − M| > T        (1)

where I_o(i) is the image object of frame i, M is the mean-shape background, T is the threshold and i ranges from 1 to the number of frames.
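As a minimal sketch of Eq. (1), assuming the frames are grayscale NumPy arrays, the mean-shape background and the thresholded foreground mask could be computed as follows; the threshold value is illustrative.

```python
import numpy as np

def mean_background(frames):
    """Mean-shape background M computed over all frames of shape (N, H, W)."""
    return np.asarray(frames, dtype=np.float64).mean(axis=0)

def foreground_mask(frame, background, threshold=30.0):
    """Binary mask of pixels satisfying |I_o(i) - M| > T, as in Eq. (1)."""
    diff = np.abs(np.asarray(frame, dtype=np.float64) - background)
    return diff > threshold
```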
Fig. 3. System overview of the proposed system
A notable fact is that there is often significant overlap between the foreground and the detections, which the optimization process must account for, and the resulting image still contains a noisy background, so convergence is hard to achieve in practice. In reality, foreground detection is difficult because of occlusion, illumination variation, shadows, etc. The proposed system therefore applies a noise removal algorithm and morphological operations to estimate the foreground segments. Most people regions are covered with white pixels after the dilation process. We tested various radii for the structuring element and found that a radius of 2 works best for dilation; larger values produce fewer regions that contain people. This paper extracts the foreground area and segments the localized foreground regions by k-means clustering, with the threshold for segmenting people regions set adaptively. The system also estimates the number of people in each segmented foreground area. It aggregates gradients into the corresponding cells, builds a histogram on each cell and normalizes the histogram along four directions. Unlike other histogram-of-gradients approaches, this paper sums the cell features along the four directions instead of concatenating them, which reduces the dimensionality of the feature vector to one fourth. The detected people gradients are then clustered to obtain the whole area where people are located. K-means is an essential technique in vector quantization, originally from signal processing, and is computed as the following equation:
arg min_S Σ_{i=1..k} Σ_{x∈S_i} ||x − μ_i||² = arg min_S Σ_{i=1..k} |S_i| Var(S_i)        (2)
where x is an observation and μ_i is the mean of the points in observation set S_i. The clustering is intended to find clusters of comparable white pixels, which represent clusters of people with different shapes and sizes. We use 2 as the number of cluster centroids, which proved best for clustering all the people locations. The corresponding region proposals of the original image are then segmented and used as the input of the CNN framework. This region extraction is the first contribution of the proposed system.
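A possible implementation of this segmentation step, assuming the foreground mask from Eq. (1): dilate with a radius-2 structuring element as discussed above, then cluster the white-pixel coordinates with k-means (k = 2, Eq. (2)) and crop one region proposal per cluster. The use of scikit-image and scikit-learn here is an assumption, not the authors' toolchain.

```python
import numpy as np
from skimage.morphology import binary_dilation, disk
from sklearn.cluster import KMeans

def region_proposals(foreground, frame, n_clusters=2):
    """Cluster foreground pixels into people regions and crop them from the frame."""
    dilated = binary_dilation(foreground, disk(2))      # radius-2 structuring element
    ys, xs = np.nonzero(dilated)                        # white-pixel coordinates
    if len(ys) < n_clusters:                            # not enough foreground pixels
        return []
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        np.column_stack([ys, xs]))                      # k-means objective of Eq. (2)
    patches = []
    for c in range(n_clusters):
        cy, cx = ys[labels == c], xs[labels == c]
        patches.append(frame[cy.min():cy.max() + 1, cx.min():cx.max() + 1])
    return patches
```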
C. Extended CNN Framework

CNNs are among the most popular types of deep neural networks and eliminate the need for manual feature extraction. Previous CNN-based counting works extract features directly from the input images. They fall into two main categories: network-based models and training methodologies. The first category focuses on network models such as Hydra, VGG-16, GoogLeNet, AlexNet and ResNet. The basic model uses an initial CNN layer; scale-aware models are more complex and use multi-column or multi-resolution architectures; context-aware models fuse local and global information and therefore obtain lower estimation errors; multi-task models learn several vision tasks and estimate crowd count and density with different approaches. The weakness of this category is that it depends strictly on the hardware and network model, requires powerful GPUs and needs many large datasets. The second category concerns whether training is patch-based or whole-image based. Patch-based methods crop patches from the input image using different strategies. Whole-image methods use the entire image as input and avoid computationally expensive sliding windows, but suffer from the highest mean absolute error because both people regions and non-people regions are trained on.

The proposed extended CNN model with switchable objectives is a patch-based method of the second category, shown in Fig. 4. Unlike other works, the framework pre-segments regions as region proposals: the input of the extended CNN is the image patch of a segmented, localized region. Extracting the localization region as a region proposal first reduces both time and inefficiency, and this is the highlighted contribution of the proposed system compared with other works. To capture pedestrians, the size of each patch at different locations is chosen according to the perspective value of its center pixel, and the patches are warped to 32 by 32 pixels as the input of the extended CNN framework. The proposed network uses two convolution layers and three fully-connected layers, with the region proposal of the original image as the input layer and four hidden layers in total. The first convolution layer consists of 20 filters of size 5x5x3 and the second convolution layer has 50 filters of size 5x5x26; the corresponding feature maps are 6x6x20 and 12x12x30, respectively. After the receptive fields are localized in the framework, the important weights and biases are calculated. The weight of the perceptron for the four layers can be arranged into a matrix as follows:
y = f(Σ_{i=1..n} w_i x_i)        (3)

where w_i are the weights of the proposed system, x_i the inputs of the layer, f the activation function and n the number of neurons of the hidden layer. The rectified linear unit is calculated as the following equation:
f(x) = max(0, x)        (4)
where x is the input pixel value. A 2x2 max pooling is then applied. To avoid over-fitting and to train the network sufficiently, this paper surveys different values for the seed, the number of epochs and the batch size of Layer 1. As one of the contributions, the proposed system selects the best option and seed number by analyzing various ranges of values. The number of epochs (passes over the training set) is important for calculating the number of batches; to reduce runtime, the proposed system also reduces the number of pixel patches. Unlike other works, the various values are analyzed against the resulting mean absolute error. In a traditional neural network, every neuron of the input layer is fully connected to the neurons of the hidden layer; here, only a small region-proposal neighborhood of the input layer is connected to the hidden layer, so the extended CNN gains efficiency and tolerance. In the training process, the proposed system uses 39 images of the dataset sequence; since there is no official training and validation split, the system arranges the necessary information itself. In the testing process, all images of the video sequence are tested.
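A minimal sketch of the described network, assuming a 32x32 RGB region proposal as input, two convolution layers of 20 and 50 filters of size 5x5, ReLU as in Eq. (4), 2x2 max pooling and three fully-connected layers; PyTorch, the hidden widths and the single-value count regression head are assumptions, since the paper does not specify the framework or the fully-connected sizes.

```python
import torch
import torch.nn as nn

class ExtendedCNN(nn.Module):
    """Two convolution layers plus three fully-connected layers counting people in a 32x32 region proposal."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 20, kernel_size=5),   # layer 1: 20 filters of 5x5x3
            nn.ReLU(),                         # rectified linear unit, Eq. (4)
            nn.MaxPool2d(2),                   # 2x2 max pooling
            nn.Conv2d(20, 50, kernel_size=5),  # layer 2: 50 filters of 5x5
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(        # three fully-connected layers (widths assumed)
            nn.Linear(50 * 5 * 5, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),                  # estimated count for the region
        )

    def forward(self, x):
        x = self.features(x)
        return self.regressor(torch.flatten(x, 1))

# Example: one 32x32 RGB region proposal -> estimated count.
patch = torch.randn(1, 3, 32, 32)
print(ExtendedCNN()(patch).item())
```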
IV. EXPERIMENTAL RESULTS
To evaluate the framework, we conduct experiments on the challenging PET 2009 crowd counting dataset [7].

A. Dataset

This paper evaluates the accuracy of the proposed work on the challenging PET 2009 dataset, which consists of several sections designed to test different surveillance scenes. In this experiment, the proposed system uses the "S1" and "S2" sections, which are used for the people counting benchmark. These video sequences have good crowd properties and challenging variations, such as both walking and running people, illumination changes and diverse crowd densities and sizes, and they come with ground-truth annotations [11]. Example images from the PET 2009 dataset are shown in Fig. 1.
B. Experiment

In this experiment, the proposed system uses MAE (Mean Absolute Error) and MRE (Mean Relative Error) to evaluate performance. A low MAE indicates that the predicted count is close to the ground truth; otherwise, the miss rate is high. MRE is a relative error: it compares the predicted value with the actual value. MAE and MRE are calculated as:

MAE = (1/N) Σ_{t=1..N} |h_t − ĥ_t|        (5)

MRE = (1/N) Σ_{t=1..N} |h_t − ĥ_t| / h_t        (6)

where N is the total number of testing frames, and h_t and ĥ_t are the ground-truth and estimated counts of people in frame t.

C. Results

The performance of the proposed system compared with previous methods is shown in Table I. The proposed system obtains the best result (lowest MAE and MRE) among the state-of-the-art results; the results of the methods in [6, 13, 10, 1, 3, 19, 2] are taken directly from the corresponding papers. The MAE of the proposed system is lower than that of [6], and its MRE is 0.34% lower. Compared with the other methods in Table I, its MAE is 0.53 lower than [13], 0.55 lower than [10], 1.24 lower than [1], 1.64 lower than [3], 1.43 lower than [19] and 1.82 lower than [2]. The plots comparing the ground truth with the estimated values are shown in Figs. 5, 6 and 7. The estimated count is almost always the same as the ground truth, although some frames still need correction, as can be seen in Fig. 8: the first image frame shows an accurately predicted count that matches the ground truth, while the second shows a count that misses the ground truth. Further experiments on the other two sequences are reported in Tables II and III, and Table IV shows the average results over the three sequences. As future work, we will evaluate the proposed system on other datasets such as Mall, UCSD or WorldExpo to further assess and clarify the correctness of our approach.
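For reference, Eqs. (5) and (6) translate directly into the following small NumPy sketch, assuming per-frame ground-truth and estimated counts are available as arrays; the example counts are hypothetical.

```python
import numpy as np

def mae_mre(ground_truth, estimated):
    """Mean absolute error (Eq. 5) and mean relative error (Eq. 6) over N frames."""
    h = np.asarray(ground_truth, dtype=np.float64)
    h_hat = np.asarray(estimated, dtype=np.float64)
    mae = np.mean(np.abs(h - h_hat))
    mre = np.mean(np.abs(h - h_hat) / h)   # error relative to the ground-truth count
    return mae, mre

# Example with hypothetical counts for five frames.
print(mae_mre([12, 15, 20, 18, 22], [11, 15, 21, 17, 23]))
```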
Fig. 4. The extended CNN framework
V. CONCLUSION

A robust system for counting people in crowds, based on segmented region proposals and an extended CNN, is proposed to handle occlusion, illumination changes and varied orientations, and its experimental results significantly outperform state-of-the-art methods. The input image is first preprocessed to reduce noise and illumination effects. A clean background is then extracted to obtain discriminative foreground detections, and the localized regions containing people are extracted by foreground segmentation. Finally, these regions are fed to the extended CNN framework to obtain an accurate count of people.
As a discussion, we first tested the counting system with the CNN framework alone: the whole original image was fed into the CNN, CNN features were extracted and the CNN was also used as the classifier, giving an MAE of 3.15. Having identified this error level, we considered how to improve performance. Next, we extracted first-order features from the whole original input image and fed them into the CNN, again used as the classifier, which gave an MAE of 2.42. We therefore applied foreground detection first and then segmented the localized regions, and found that foreground detection and segmentation combined with the CNN significantly outperforms state-of-the-art methods. Training takes very little time, so the proposed system achieves a good trade-off between processing time and results. As future work, we will evaluate the proposed system on denser and more complex crowd datasets.
Fig. 8. Example output images of the proposed crowd counting system
TABLE I. COMPARISON OF DIFFERENT APPROACHES ON "PET 2009" DATASET (S1-L1-13-57)

Methods                   MAE     MRE%
Proposed Method           0.48    2
Hashemzadeh [6]           0.89    2.34
Liang et al. [13]         1.01    4.97
Hashemzadeh et al. [10]   1.03    9.31
Albiol et al. [1]         1.72    N/A
Rao et al. [3]            2.17    37.61
Li et al. [19]            1.91    N/A
Chan et al. [2]           2.30    N/A
Fig. 5. Crowd counting result obtained by the proposed method for S1-L1-13-57
TABLE II. COMPARISON OF DIFFERENT APPROACHES ON "PET 2009" DATASET (S1-L1-13-59)

Methods                   MAE     MRE%
Proposed Method           0.56    3.62
Hashemzadeh [6]           0.84    4.47
Liang et al. [13]         1.17    9.30
Hashemzadeh et al. [10]   1.14    12.84
Albiol et al. [1]         1.89    N/A
Rao et al. [3]            1.62    35.41
Li et al. [19]            2.02    N/A
Chan et al. [2]           1.64    N/A

Fig. 6. Crowd counting result obtained by the proposed method for S1-L1-13-59
Fig. 7. Crowd counting result obtained by the proposed method for S1-L2-14-36
TABLE III. COMPARISON OF DIFFERENT APPROACHES ON "PET 2009" DATASET (S1-L2-14-06)

Methods                   MAE     MRE%
Proposed Method           0.82    8.06
Hashemzadeh [6]           2.32    8.12
Liang et al. [13]         4.33    18.76
Hashemzadeh et al. [10]   1.68    14.23
Albiol et al. [1]         1.95    N/A
Rao et al. [3]            2.47    41.11
Li et al. [19]            3.28    N/A
Chan et al. [2]           4.32    N/A
TABLE IV. COMPARISON OF DIFFERENT APPROACHES ON "PET 2009" DATASET (AVERAGE OF THREE SETS)

Methods                   MAE     MRE%
Proposed Method           0.62    4.56
Hashemzadeh [6]           1.35    4.97
Liang et al. [13]         2.17    6.2
Hashemzadeh et al. [10]   1.28    12.12
Albiol et al. [1]         5.56    N/A
Rao et al. [3]            2.08    38.04
Li et al. [19]            2.40    N/A
Chan et al. [2]           2.75    N/A
REFERENCES

[1] A. Albiol, M. J. Silla, A. Albiol, J. M. Mossi, Video analysis using corner motion statistics, in: Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2009, pp. 31-38.
[2] A. B. Chan, M. Morrow, N. Vasconcelos, Analysis of crowded scenes using holistic properties, in: Proceedings of the Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2009, pp. 101-108.
[3] A. Rao, J. Gubbi, S. Marusic, M. Palaniswami, Estimation of crowd density by clustering motion cues, Vis. Comput. 31 (2015) 1533-1552.
[4] K. Chen, C. C. Loy, S. Gong, T. Xiang, Feature mining for localised crowd counting, in: European Conference on Computer Vision, 2012.
[5] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886-893.
[6] M. Hashemzadeh, N. Farajzadeh, Combining keypoint-based and segment-based features for counting people in crowded scenes, Information Sciences 345 (2016) 199-216.
[7] http://www.cvg.rdg.ac.uk/PETS2009.
[8] A. Kowcika, S. Sridhar, A literature study on crowd (people) counting with the help of surveillance videos, International Journal of Innovative Technology and Research, 2016, pp. 2353-2361.
[9] V. Lempitsky, A. Zisserman, Learning to count objects in images, in: Advances in Neural Information Processing Systems, 2010, pp. 1324-1332.
[10] M. Hashemzadeh, G. Pan, M. Yao, Counting moving people in crowds using motion statistics of feature-points, Multimed. Tools Appl. 72 (2014) 453-475.
[11] A. Milan, Data, 2012, http://research.milanton.de/data.html.
[12] V. Q. Pham, T. Kozakaya, O. Yamaguchi, R. Okada, Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3253-3261.
[13] R. Liang, Y. Zhu, H. Wang, Counting crowd flow based on feature points, Neurocomputing 133 (2014) 377-384.
[14] D. Ryan, S. Denman, C. Fookes, S. Sridharan, Crowd counting using multiple local features, in: Digital Image Processing: Techniques and Applications, 2009.
[15] J. D. Sime, Crowd psychology and engineering, Safety Science 21 (1995) 1-14.
[16] P. Viola, M. J. Jones, Robust real-time face detection, International Journal of Computer Vision 57 (2004) 137-154.
[17] P. Viola, M. J. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, International Journal of Computer Vision 63 (2005) 153-161.
[18] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: Tenth IEEE International Conference on Computer Vision (ICCV), 2005, pp. 90-97.
[19] Y. Li, E. Zhu, X. Zhu, J. Yin, J. Zhao, Counting pedestrian with mixed features and extreme learning machine, Cognit. Comput. 6 (2014) 462-476.