Face Tracking with Convolutional Neural Network Heat-Map

Nhu-Tai Do, Soo-Hyung Kim, Hyung-Jeong Yang, Guee-Sang Lee, In-Seop Na
School of Electronics and Computer Engineering, Chonnam National University
77 Yongbong-ro, Buk-gu, Gwangju 500-757, Korea
[email protected], [email protected]

ABSTRACT
In this paper, we apply a heat-map approach to human face tracking. We utilize the heat-map extracted from convolutional neural networks (CNNs) trained for the face/non-face classification problem. The CNN architecture we build is a shallow network that extracts information meaningful for locating an object. In addition, we build several CNNs with different pool-sizes in the last pooling layer to obtain a well-defined heat-map. Experiments on the Online Object Tracking benchmark dataset show that the results of the method are very encouraging, demonstrating the effectiveness of our proposed method.
CCS Concepts • Computing methodologies~Face Tracking • Computing methodologies~Convolutional Neural Networks
Keywords
Convolutional neural network; heat-map; face tracking
1. INTRODUCTION
Face tracking is a fundamental problem for many applications in image processing and computer vision. It provides facial position information for a wide range of applications such as face recognition, facial expression analysis, and gaze tracking. It remains a challenging task in unconstrained environments with light changes, motion blur, pose changes, clutter, and occlusion [1][2].

Most visual object tracking methods (the general problem behind face tracking) follow one of two approaches: motion-based or appearance-based tracking. Motion-based tracking methods predict the probability of target states based on their previous states, typically with a Particle Filter [3]. Appearance-based tracking methods are divided into two groups: generative and discriminative. Generative tracking tries to represent visual observations of the target object, for example by sparse representation [4], and then finds the optimal position of the object in the image region of interest. Discriminative tracking methods select robust features for binary classification and construct a classifier that distinguishes the tracked object from the background. Typically, Ensemble tracking [5] uses the AdaBoost classifier in conjunction with color features and local histograms.
Figure 1. Rows 1, 2, 3 of column 1 show images containing faces in challenging cases such as multiple faces, occlusion, and motion blur. Columns 2 and 3 show the heat-map obtained from the feedback of the shallow CNNs and the mask image identifying the human face based on the specified threshold.

With the successful application of convolutional neural networks to image classification [6] and object detection [7], robust features extracted from CNN layers have been applied to the tracking problem [8][9]. This addresses the difficulties that tracking-by-detection approaches encounter with deep learning [10], which arise from the fundamental heterogeneity between the tracking and detection problems: the robustness of CNNs in semantic classification leads to an inefficient representation of spatial details such as object location. Moreover, CNNs need large amounts of data to learn object classification and detection, whereas visual object tracking provides only a small number of samples during processing. This approach therefore stops using a detection CNN as a black box applied to visual object tracking; instead, heat-maps extracted at each layer of networks pre-trained for object detection, such as the VGG network [11], are analyzed to discover the significant layers from which robust features can be extracted for the tracking problem.
[Figure 2 diagram: Input Image 64 x 64 x 3 → Norm. -1..1 → Conv (ReLU) 3 x 3 x 10 → Conv (ReLU) 3 x 3 x 10 (padding same) → Max Pooling s x s → Dropout 0.25 → Conv (ReLU) 64/s x 64/s x 128 (padding valid) → Dropout 0.5 → Conv (tanh) 1 x 1 x 1 → Flatten (training model only, mse loss). (a) Training Model; (b) Heat-map Model.]
Figure 2. Shallow CNN for binary face/non-face classification, applied in (a) the training process, with a Flatten layer and a mean-square-error loss, and (b) the heat-map building process, without the Flatten layer, the loss function, or a fixed input image size.

CNN features in the high-level layers are often very powerful for classifying object semantics robustly under data deformation and corruption, while the low-level layers contain more local spatial details that accurately locate the object [8]. Deep learning methods such as MDNet [12] and FCNT [13] use CNN features derived from pre-trained CNNs and apply various tracking algorithms in the online update and target localization stages to take advantage of these robust features.

In this paper, we propose a method to build a heat-map that is better suited to face tracking than generic visual object tracking. The heat-map and face image mask are illustrated in Figure 1. We build a shallow face CNN for binary face/non-face classification. We then remove the fully connected layer and apply the shallow face CNN to extract robust features from the last layer. Because this network is shallow, these features carry both the semantics for classification and the spatial details for object localization. To improve the accuracy of the features from the shallow face CNN, we experimented with changes of the pool-size in the Max Pooling layer near the last layers. We observed that increasing the pool-size improves the localization of spatial details but decreases the accuracy of the face/non-face decision in the heat-map. We therefore propose to build the heat-map from several shallow networks with different pool-sizes. The final heat-map is based on the weighted votes fed back from the shallow networks, together with a definite threshold. From this, we build a mask to identify the object quickly in the target tracking window. The contribution of this paper is three-fold.
Firstly, we construct a face/non-face shallow CNN and analyze the effect of the pool-size of the Max Pooling layer on face/non-face classification and face localization. Secondly, we build a heat-map based on voting from several shallow networks with the same structure but different pool-sizes, to increase the accuracy of the heat-map. Finally, we evaluate our method on the Online Object Tracking dataset [14] with encouraging results.

The rest of this paper is organized as follows. Section II describes the shallow face heat-map approach to the face tracking problem. Experiments and results are given in Section III, followed by the conclusion in Section IV.
2. PROPOSED METHOD

2.1 Binary classification for face/non-faces
Many CNN architectures for face detection have been proposed in the past decade. Li et al. [15] built three cascaded CNNs, from low to high complexity, to quickly remove background regions in a sliding window and refine positions through calibration nets. Zhang et al. [16] reduced complexity by using kernel size 3 and proposed multitasking that integrates face classification, bounding box regression, and five facial landmarks.

The goal of this paper is to construct a CNN for binary face/non-face classification that avoids the complexity of the VGG object detection network [11], yet is strong enough to predict both semantics and face location. The method of building the heat-map from the CNN should also avoid a sliding window when computing the face-appearance probability density, to reduce the time consumed. We therefore build a CNN with input image size 64 x 64, able to recognize large faces, as shown in Figure 2. The top two convolution layers are 3 x 3 because face/non-face classification is simpler than general object classification, and the number of features in these layers is 10 [16]. Next comes the max pooling layer of size s x s, used to reduce the image size and highlight important features. We also use two Dropout layers [17] to further prevent overfitting. The last two layers of the CNN play the role of fully connected layers. The first has kernel size 64/s with no padding, which reduces the output width and height to 1, with 128 features; it is followed by the last convolution layer, of size 1 x 1 and depth 1. The depth is 1 because of the binary classification purpose, with face (value 1) and non-face (value -1). Replacing the usual Dense and Flatten layers with these last two convolution layers allows the face recognition model to be reused in the heat-map generation process.
In the training and face detection model, the input image size is 64 x 64, so after the max pooling layer it drops to 64/s. With kernel size 64/s and no padding, the image size is reduced to 1, with 128 features in depth. The classifier uses the mean-squared-error loss function and an adaptive learning rate method, with classes -1/1 for non-face and face.
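To make the layer arithmetic above concrete, the following pure-Python sketch traces the output shape of each layer for a given input size and pool-size s. It is our own walkthrough of the shapes described in this section, not the authors' published code; the layer names are ours.

```python
def shallow_face_cnn_shapes(h, w, s):
    """Trace (height, width, depth) through the shallow face CNN of
    Section 2.1 for an h x w x 3 input and pool-size s."""
    shapes = [("input", (h, w, 3))]
    # Two 3x3 convolutions with 'same' padding keep the spatial size.
    shapes.append(("conv3x3_relu_1", (h, w, 10)))
    shapes.append(("conv3x3_relu_2", (h, w, 10)))
    # Max pooling with pool-size s divides height and width by s.
    h2, w2 = h // s, w // s
    shapes.append(("maxpool", (h2, w2, 10)))
    # 'Valid' convolution with a 64/s x 64/s kernel shrinks each side
    # by 64/s - 1; for a 64 x 64 input this collapses to 1 x 1.
    k = 64 // s
    h3, w3 = h2 - k + 1, w2 - k + 1
    shapes.append(("conv_relu_128", (h3, w3, 128)))
    # The final 1x1 tanh convolution gives one score per location.
    shapes.append(("conv1x1_tanh", (h3, w3, 1)))
    return shapes

# Training input: 64 x 64 collapses to a single face/non-face score.
print(shallow_face_cnn_shapes(64, 64, 4)[-1])   # -> ('conv1x1_tanh', (1, 1, 1))
```

Note that the final spatial size is 1 x 1 for any pool-size s, since the 64/s kernel exactly covers the pooled 64/s map; this is what lets the same weights act as a fully connected classifier during training.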
2.2 Building a heat-map
During the construction of the heat-map, the model accepts an input image of unspecified size. After the max pooling layer, the image size therefore decreases to (w/s, h/s), where (w, h) are the width and height of the image. The image is further reduced by the 64/s kernel due to the lack of padding. The final 1 x 1 convolution layer then yields a probability density image, or heat-map, with every pixel value in [-1, 1]. Each pixel in the heat-map corresponds to a 64 x 64 square in the input image; its position, measured from the upper left-hand corner, is calculated by multiplying by the pool-size value s of the max pooling layer, as shown in Figure 3.
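The size and pixel-to-patch geometry just described can be sketched as follows. This is a minimal sketch under the shape rules of Section 2.1 (integer division, no padding); the function names are ours, not the paper's.

```python
def heatmap_size(w, h, s):
    """Heat-map width/height for a w x h input with pool-size s:
    pooling gives (w/s, h/s); the 'valid' 64/s kernel then shrinks
    each side by 64/s - 1."""
    k = 64 // s
    return (w // s - k + 1, h // s - k + 1)

def receptive_square(x, y, s):
    """Upper-left corner and side length of the 64 x 64 input square
    scored by heat-map pixel (x, y); positions scale by pool-size s."""
    return (x * s, y * s, 64)

print(heatmap_size(128, 128, 2))    # -> (33, 33)
print(receptive_square(3, 5, 2))    # -> (6, 10, 64)
```

For a 64 x 64 input, `heatmap_size` returns (1, 1) for every pool-size, recovering the training configuration as a special case.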
Figure 3. Mapping between the probability density heat-map and the original image by a shallow face CNN.

We build five heat-map models with pool-size values s = 2, 4, 8, 16, 32, respectively. Every pixel with value greater than 0.9 in these heat-maps contributes to a common joint heat-map according to the formula below, with the results shown in Figure 5:
H(x, y) = Σ_s w_s · H_s(x/s, y/s),  where only pixels with H_s(x/s, y/s) > 0.9 contribute    (1)
where H is the joint heat-map, H_s is the heat-map from the CNN with max pooling size s in {2, 4, 8, 16, 32}, and w_s is the weight of H_s, with respective values 1, 2, 3, 4, 5. After building, the joint heat-map H is normalized into [0, 1].

Figure 5. Heat-map building process by shallow face CNNs with pool-size s = 32, 16, 8, 4, 2. The rows correspond to the pool-sizes, and the columns correspond to the heat-maps obtained from prediction, thresholding at 0.9, and remapping the heat-map to the same size as the original image. The results and the original image are illustrated in Figure 4.

Figure 1 shows positive results in determining the face mask image in cases such as multiple faces, facial occlusion, and facial blur due to motion. However, in some situations the heat-map determines wrong face regions, as shown in Figure 4; these should be addressed by the face detection model.
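The weighted vote of Eq. (1) can be sketched in NumPy as follows. We assume here that the five per-pool-size heat-maps have already been remapped to the original image size (the "remapping" column of Figure 5), so they can be summed elementwise; the function name is ours.

```python
import numpy as np

def joint_heatmap(remapped, weights=(1, 2, 3, 4, 5), thresh=0.9):
    """Eq. (1): combine the remapped heat-maps H_s (one per pool-size
    s = 2, 4, 8, 16, 32, all resized to the original image size) into
    the joint heat-map H. Only pixels with H_s > thresh vote, weighted
    by w_s; H is then normalized into [0, 1] as in Section 2.2."""
    H = np.zeros_like(remapped[0], dtype=float)
    for Hs, ws in zip(remapped, weights):
        H += ws * Hs * (Hs > thresh)      # thresholded, weighted vote
    if H.max() > 0:                       # normalize into [0, 1]
        H /= H.max()
    return H

# Five simulated heat-maps with values in [-1, 1].
maps = [np.random.uniform(-1, 1, (48, 64)) for _ in range(5)]
H = joint_heatmap(maps)
print(H.min() >= 0 and H.max() <= 1)      # -> True
```

Giving larger weights to larger pool-sizes reflects the observation above that large-pool networks localize spatial detail better, while the 0.9 threshold keeps only confident face responses in the vote.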
2.3 Target Localization
Based on the joint heat-map, we obtain a mask image of the regions where human faces may appear, based on a defined threshold t, determined by the following formula:

Mask(x, y) = 1 if H(x, y) ≥ t, and 0 otherwise    (2)

In this paper, we set the threshold t to 0.3 to delineate the human face.

Figure 4. A situation in which the determined face area is redundant.

2.4 Single Face Tracking
The joint heat-map has been tested for efficiency, in terms of both performance and accuracy, through single face tracking, which is commonly used on mobile platforms.

At the beginning of the program, the algorithm uses face detection to find the position of the face and initiate tracking. At each time step, the algorithm maintains a viewport larger than half of the observed face area, with a minimum size of (128, 128). The center of the subject in the next frame is taken as the highest value in the heat-map within the viewport, and the size of the region is the maximal region containing the face-mask image. If the heat-map cannot obtain a face region, the program calls face detection on the small region to find a face and update the heat-map again, expanding to the whole frame if the small region is unsuccessful.
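The mask of Eq. (2) and the viewport update step above can be sketched as follows. This is a simplified sketch of the per-frame update only (no face-detection fallback); function names and the viewport convention (x0, y0, w, h) are our assumptions.

```python
import numpy as np

def face_mask(H, t=0.3):
    """Eq. (2): binary mask of pixels where the joint heat-map H
    exceeds the threshold t (0.3 in this paper)."""
    return (H >= t).astype(np.uint8)

def next_center(H, viewport):
    """Next face center = location of the highest heat-map value
    inside the current viewport (x0, y0, w, h), in image coordinates."""
    x0, y0, w, h = viewport
    win = H[y0:y0 + h, x0:x0 + w]
    dy, dx = np.unravel_index(np.argmax(win), win.shape)
    return (x0 + dx, y0 + dy)

H = np.zeros((240, 320))
H[100, 150] = 0.8                            # simulated face response
print(next_center(H, (120, 80, 128, 128)))   # -> (150, 100)
```

In the full tracker, the bounding box would then be grown to the maximal connected mask region around this center, with the face-detection fallback invoked when the windowed mask is empty.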
3. EXPERIMENTS AND RESULTS

3.1 Training the shallow face CNN model
We use the Keras framework with a TensorFlow backend on Python 3.5 to build the shallow face CNNs. We download the WIDER FACE dataset [18] and pre-process the faces in the training and validation sets, eliminating faces that are smaller than 128 pixels, blurred, or heavily occluded. In total we collect 29,965 faces for training and 7,682 faces for validation, resized to width and height 64. To obtain non-face images, we generate background patches from WIDER FACE images in regions that do not intersect the face ground truth, yielding 30,197 non-face images for training and 9,419 for testing. We create five models differing in pool-size and train each with batch size 128 for 1,000 epochs, using the mean-square-error loss function and an adaptive learning rate algorithm. The results of the training process are shown in Table 1.

Table 1. Training results of the shallow CNNs

CNN with pool-size    2       4       8       16      32
Loss                  0.089   0.071   0.069   0.070   0.114
Accuracy              0.967   0.969   0.961   0.957   0.929

3.2 Single Face Tracking
For a quantitative comparison, we use a measure based on the Euclidean distance to the ground truth, given by the following formula:

Error(O, G) = D_E(C_O, C_G) / S_G    (3)

where O is the tracked object, G is the ground truth, D_E(C_O, C_G) is the Euclidean distance between the centers of the target and the ground truth, and S_G is the size of the ground-truth window. We compare our method with four algorithms from the OpenCV library: Boosting (BT), Median Flow (ML), MIL, and TLD. The results are presented in Table 2.

Table 2. Comparison between the proposed method and other methods in the OpenCV library for the face tracking problem

Video         Heat Map   BT       ML       MIL      TLD
BlurFace      0.0015     0.0068   0.0035   0.008    0.0025
Boy           0.0029     0.0384   0.0179   0.0184   0.0067
David2        0.0025     0.005    0.0288   0.063    0.0151
Dragon Baby   0.0072     0.0173   0.0259   0.0192   0.032
Dudek         0.0007     0.0008   0.0008   0.0011   0.0015
FaceOcc1      0.0006     0.0047   0.0025   0.0015   0.0015
FaceOcc2      0.0009     0.0014   0.006    0.0021   0.0087
FleetFace     0.0012     0.0031   0.0016   0.0015   0.002
Freeman1      0.0045     0.0928   0.0396   0.086    0.0657
Girl          0.0066     0.0179   0.0117   0.0099   0.0112
Jumping       0.006      0.0362   0.009    0.0219   0.0135
Man           0.0019     0.0016   0.0021   0.0073   0.0041
Mhyang        0.0009     0.0014   0.0013   0.0039   0.0038
Trellis       0.001      0.0281   0.0288   0.026    0.0307

The results show that our method is more efficient than the compared methods. The average processing speed of the algorithm is 12 frames/s on the above videos.

4. CONCLUSION
This paper has suggested some ideas for adopting a CNN to create and use a heat-map. The proposed heat-map is capable of localizing the object as well as providing semantic information for tracking algorithms. The method achieves promising initial results, but still requires improvements in the accuracy of the heat-map for effectively determining the object in unconstrained environments.

ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1A4A1015559). This work was also partly supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00383, Smart Meeting: Development of Intelligent Meeting Solution based on Big Screen Device).
REFERENCES
[1] M. H. Yang, D. J. Kriegman, and N. Ahuja. Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 1, pp. 34–58, 2002.
[2] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1442–1468, 2014.
[3] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, 2002.
[4] W. Hu, W. Li, X. Zhang, and S. Maybank. Single and multiple object tracking using a multi-feature joint sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 4, pp. 816–833, 2015.
[5] S. Avidan. Ensemble tracking. IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 261–271, 2007.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, pp. 1–9, 2012.
[7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, pp. 91–99, 2015.
[8] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. Proc. IEEE Int. Conf. Comput. Vis., pp. 3074–3082, 2015.
[9] H. Li, Y. Li, and F. Porikli. DeepTrack: Learning discriminative feature representations online for robust visual tracking. IEEE Trans. Image Process., vol. 25, no. 4, pp. 1834–1848, 2016.
[10] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang. Video tracking using learned hierarchical features. IEEE Trans. Image Process., vol. 24, no. 4, pp. 1424–1435, 2015.
[11] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[12] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. Proc. CVPR, pp. 4293–4302, 2016.
[13] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. Proc. IEEE Int. Conf. Comput. Vis., pp. 3119–3127, 2015.
[14] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2411–2418, 2013.
[15] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. Proc. CVPR, pp. 5325–5334, 2015.
[16] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, 2016.
[17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[18] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5525–5533, 2016.
Authors' background

Name               Title            Research Field
Nhu-Tai Do         PhD Candidate    Pattern Recognition, Deep Learning
Soo-Hyung Kim      Full Professor   Pattern Recognition, Deep Learning
Hyung-Jeong Yang   Full Professor   Pattern Recognition, Deep Learning
Guee-Sang Lee      Full Professor   Pattern Recognition, Deep Learning
In-Seop Na         PhD              Pattern Recognition, Deep Learning

Personal website: http://pr.jnu.ac.kr/shkim/