
Face Recognition Using the SR-CNN Model

Yu-Xin Yang 1,2, Chang Wen 1,*, Kai Xie 2,3, Fang-Qing Wen 2,3, Guan-Qun Sheng 2,3 and Xin-Gong Tang 4

1 School of Computer Science, Yangtze University, Jingzhou 434023, China; [email protected]
2 National Demonstration Center for Experimental Electrical and Electronic Education, Yangtze University, Jingzhou 434023, China; [email protected] (K.X.); [email protected] (F.-Q.W.); [email protected] (G.-Q.S.)
3 School of Electronic and Information, Yangtze University, Jingzhou 434023, China
4 Key Laboratory of Exploration Technologies for Oil and Gas Resources, Yangtze University, Jingzhou 434023, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-136-9731-5482

Received: 27 October 2018; Accepted: 28 November 2018; Published: 3 December 2018

 

Abstract: Face recognition in complex environments is vulnerable to changes in illumination, object rotation, and occlusion, which degrade the precision of target localization. To address this, a face recognition algorithm based on multi-feature fusion is proposed. This study presents a new robust face-matching method, named SR-CNN, which combines the rotation-invariant texture feature (RITF) vector, the scale-invariant feature transform (SIFT) vector, and a convolution neural network (CNN). Furthermore, a graphics processing unit (GPU) is used to parallelize the model for optimal computational performance. The Labeled Faces in the Wild (LFW) database and a self-collected face database were used for the experiments. The results show that, on the LFW database, the true positive rate is improved by 10.97–13.24% and the acceleration ratio (the ratio between central processing unit (CPU) operation time and GPU time) reaches 5–6. On the self-collected database, the true positive rate is improved by 12.65–15.31%, and the acceleration ratio reaches 6–7.

Keywords: face matching; scale-invariant feature transform (SIFT); rotation-invariant texture feature (RITF); convolution neural network (CNN); graphic processing unit (GPU); parallel computing

1. Introduction

Face matching is a biometric technology that is widely used in a variety of areas [1], such as public security control, intelligent video monitoring, identity verification, and robot vision [2]. Geometrically, it mimics what the human eye does when distinguishing one person from another by implicitly extracting morphological features [3,4]. It has also been suggested that multimodal emotion recognition can combine speech and facial expressions for emotion analysis [5].

Face matching is the most crucial stage in face recognition. In this stage, local features are invariant to image scale, rotation, illumination, and viewing angle, and still provide good matching performance under the influence of noise, occlusion, and other factors. The extraction of local features mainly includes two steps: detection of feature points and generation of feature descriptions. Compared with the choice of detection algorithm, the local feature description algorithm has a more significant impact on the performance of the local feature. Constructing a feature descriptor that is both robust and distinctive is therefore of paramount importance in local feature research. The purpose of a feature descriptor is to encode the neighborhood information of a feature point, thereby constructing a descriptive vector for it. These descriptors are based either on the regional distribution, the spatial frequency,


differences, or other features. Mikolajczyk et al. [6] made a systematic analysis and comparison of various descriptors. Their experiments showed that the scale-invariant feature transform (SIFT) descriptor based on gradient distribution and the gradient location and orientation histogram (GLOH) are clearly superior to other descriptors. Therefore, most feature descriptors are built on the distribution of local area information (including gradient, gray level, and so on) [7,8].

The convolutional neural network (CNN) has demonstrated high efficiency in image classification [9], object detection [10], and voiceprint recognition [11]. Through its deep architecture, it can learn abstract representations close to human cognition. Recent studies showed that CNNs trained on large and diverse datasets such as ImageNet [12] can be used to solve tasks for which they have not been explicitly trained. Many works adopt ready-made features extracted by pre-trained CNNs, which achieve good results on popular benchmarks [13–16]. Farfade et al. proposed a fully convolutional neural network to detect faces in a wide range of orientations [17]. Reference [18] applied the faster region-based CNN (R-CNN), a state-of-the-art generic object detector, and achieved promising results. Reference [19] used fully convolutional networks (FCNs) to generate heat maps of facial parts, and then used the heat maps to generate face proposals. Reference [20] used a unified end-to-end FCN framework to directly predict bounding boxes and object class confidences. These methods, however, are relatively slow even on a high-end graphics processing unit (GPU). In References [21,22], faster R-CNN was used for face detection. Reference [23] improves the faster R-CNN framework by combining a number of strategies, including feature concatenation, hard negative mining, multi-scale training, model pre-training, and proper calibration of key parameters.

Proposed by Lowe, the SIFT feature [24,25] is one of the most typical feature descriptors; it generates a 128-dimensional eigenvector by constructing a three-dimensional gradient histogram of the neighborhood of a feature point. The SIFT feature adapts well to scale variation, illumination change, image transformation, and noise, and thus offers good fault tolerance in face matching. However, there is still room for improvement, and the related problems have attracted considerable attention from researchers. Susan et al. [26] applied a fuzzy matching factor to the SIFT model to form an efficient descriptor of fuzzy filters. Park et al. [27] used a combination of Gabor local binary pattern (LBP) histograms and SIFT-based feature points. Narang et al. [28] proposed facial recognition methods based on SIFT and a Levenberg–Marquardt back-propagation (LMBP) neural network classifier. Zhang et al. [29] extracted SIFT features from each image; the feature matrix consisting of the extracted SIFT feature vectors was sent to a CNN for learning. These studies effectively optimized the application of the SIFT model in face recognition. Nonetheless, the traditional SIFT model easily leads to mismatches when the image contains multiple similar regions. When the target image is a grayscale image, a CNN may not be as effective as SIFT, because SIFT operates on the grayscale image without resorting to color information; the same is true when the color of an object changes dramatically [30]. Vijay et al. [31] showed that neither approach was consistently better than the other, and that retrieval gains could be obtained by combining both features.

The CNN is widely used in face recognition due to its good recognition performance. To date, few studies retrain an entire CNN; the widely used alternative is to fine-tune a network that has already been pre-trained on a large number of labeled images. Network fine-tuning is a simple form of transfer learning, in which data from the target dataset are usually selected for training [32]. To this end, we propose a robust method of facial feature matching using an SR-CNN model, which fuses the rotation-invariant texture feature (RITF) eigenvector and the SIFT eigenvector with CNN features, and thereby improves the reliability of feature matching. Moreover, a GPU implementation substantially shortens the computing time of the method, making it more efficient.


In this paper, we combine the local features of the original image with the global image representation learnt by the CNN. High-precision face matching can be obtained by integrating low-level and high-level features [33,34]. The contributions of this paper are summarized below. (1) We introduce a deep face recognition model, SR-CNN, which combines multiple features; it can effectively handle a wide range of tasks while being more robust and discriminative. (2) We propose a SIFT/RITF model that improves the rotation invariance of the traditional SIFT model while maintaining the advantages of the original SIFT and RITF models. (3) We investigate a new CNN model, in which a batch normalization layer is introduced, the parametric rectified linear unit (PReLU) is chosen as the activation function, and the network parameters are adjusted using the cross-entropy cost function. (4) The validity of the proposed method is verified by testing on two datasets.

The remainder of this paper is organized as follows: Section 2 proposes the SR-CNN model and illustrates it in five steps. In Section 3, the proposed parallel optimization architecture of the SR-CNN model is introduced. In Section 4, the experimental results are presented and discussed. The conclusions are given in Section 5.

2. SR-CNN Model

2.1. Framework of the Proposed Method

The traditional SIFT model extracts the local features of an image. Because only the local feature descriptor of the image is used in the feature matching process, mismatches inevitably result from the lack of texture and the small size of the target. Due to the similarity of local features, the ability to eliminate these mismatched point pairs is limited [35,36]. SIFT simply describes the local gradient distribution, which leads to potential mismatches. Some studies use spatial constraints to test and verify SIFT matches [37,38]; these methods reduce the impact of mismatching but introduce additional complexity. When there are multiple regions with similar texture information in a face image, good results can be achieved only by using the rotation-invariant texture features of the image region around the key points.

In most instances, a CNN needs a great deal of training data. Large datasets and the cheap computing power provided by GPUs have increased the popularity of CNNs. Although the results provided by SIFT and other handcrafted methods are not as accurate as those provided by CNNs [39,40], they do not need large datasets. However, the modeling ability of handcrafted approaches is limited by fixed filters, which remain the same across different data sources. Therefore, a novel SR-CNN model with two parts is proposed.

The first part involves the extraction of SIFT and RITF features, which consists of four steps [41–44]. Firstly, the Gaussian pyramid and the difference of Gaussian (DOG) pyramid were established, and the space extrema were detected. Secondly, the Taylor formula was adopted to determine the location of each extreme point through five iterations. Thirdly, for the key points detected in the DOG, the gradients and direction distributions of the pixels in the 3σ neighborhood of the pyramid were collected. Finally, the SIFT/RITF descriptors were constructed. The second part involves the CNN, which first uses a large sample for pre-training and then uses the pre-trained model to extract features.

The input of the CNN is the face image after scale normalization; the image is processed by convolution to extract image features. Then, the results of the convolution layer are processed by the normalization layer to prevent network overfitting. After several convolutional and pooling layers, the features are processed by the fully connected layer, and the CNN feature representation is obtained. The two normalized feature groups are concatenated to form the fusion eigenvector, and the random forest method is used to construct a feature classifier for feature classification, as sketched in the example below. The schematic diagram of the SR-CNN algorithm is shown in Figure 1.
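To make the overall flow concrete, the following Python sketch outlines the fusion pipeline described above. It is a minimal illustration, not the authors' implementation: the CNN feature extractor, the RITF routine, and the helper names (extract_ritf, extract_cnn_features, fuse_features) are hypothetical placeholders, and OpenCV's SIFT plus scikit-learn's random forest stand in for the paper's components.

```python
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier

ALPHA = 0.6  # relative weight between SIFT and RITF (see Section 2.2.4)

def extract_sift(gray):
    """128-D SIFT descriptors for the detected key points."""
    sift = cv2.SIFT_create()
    keypoints, desc = sift.detectAndCompute(gray, None)
    return keypoints, desc  # desc: (n_keypoints, 128)

def extract_ritf(gray, keypoints):
    """Placeholder for the 64-D rotation-invariant texture feature
    around each key point (Section 2.2.4)."""
    raise NotImplementedError

def extract_cnn_features(image, cnn_model):
    """Placeholder: output of the second fully connected layer (F_cnn)."""
    raise NotImplementedError

def fuse_features(f_sift, f_ritf, f_cnn, mean, std):
    """Weighted SIFT/RITF descriptor concatenated with the CNN feature vector."""
    f_sr = np.concatenate([ALPHA * f_sift, (1.0 - ALPHA) * f_ritf])
    fused = np.concatenate([f_sr, f_cnn])
    return (fused - mean) / (std + 1e-8)   # mean/std estimated on the training set

def train_classifier(fused_train, labels):
    # 100 decision trees, as stated in Section 2.4
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(fused_train, labels)
    return clf
```

In practice, the per-keypoint SIFT/RITF descriptors would be aggregated (or matched pairwise as in Section 2.4) before being concatenated with the single global CNN vector.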


Figure 1. The schematic diagram of the SR-CNN model, combining the rotation-invariant texture feature (RITF) vector, the scale-invariant feature transform (SIFT) vector, and the convolution neural network (CNN).

2.2. Feature Extraction with SIFT/RITF Model

2.2.1. Detection of Scale-Space Extrema

Firstly, the Gaussian pyramid and the difference of Gaussian (DOG) pyramid were established, and then the space extrema were detected. In the construction of the Gaussian pyramid, the size of the image was first doubled; on this basis, Gaussian blur was applied to the image at this size, and the resulting set of blurred images constituted one octave. Then, the most blurred image of this octave was down-sampled, halving its length and width so that its area became a quarter of the original. The down-sampled image served as the initial image of the next octave, on which Gaussian blurring was repeated; this process continued until all octaves were constructed, thus forming the Gaussian pyramid. Then, two adjacent images of the same octave were subtracted to obtain the difference image. Finally, the DOG pyramid was built from the collection of these difference images over all octaves.

\[ D(x, y, \sigma) = [G(x, y, k\sigma) - G(x, y, \sigma)] \otimes I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \tag{1} \]

where k is the scale factor of the Gaussian kernel; the initial value of k is set to 1, and the scale is incremented by a factor of k at each level. The establishment of the DOG pyramid is shown in Figure 2.




Figure 2. Establishment of the difference of Gaussian (DOG) pyramid.


For each octave of scale space, the initial image was repeatedly convolved with Gaussian kernels to obtain the image set shown on the left of Figure 2. Subtracting adjacent Gaussian-filtered images yields the difference-of-Gaussian images shown on the right. After each octave, the Gaussian image was down-sampled and the process was repeated. The next step was to traverse and detect the extremum points in the three-dimensional space (two image dimensions plus one scale dimension). Specifically, the gray value of each candidate point was compared with its 26 neighbors: the eight points adjacent to it at the current scale, and the nearest 9 × 2 points at the scales immediately above and below. A minimal sketch of this step is given below.
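The following Python sketch illustrates this construction for a single octave, assuming an OpenCV/NumPy environment; the number of scales and the neighbor handling are simplified relative to the full SIFT procedure.

```python
import cv2
import numpy as np

def dog_octave(gray, num_scales=5, sigma0=1.6, k=2 ** 0.5):
    """Gaussian-blur an image at increasing scales and return the DOG stack."""
    img = gray.astype(np.float32)
    gaussians = [cv2.GaussianBlur(img, (0, 0), sigma0 * (k ** i))
                 for i in range(num_scales)]
    # Difference of adjacent Gaussian images, Equation (1)
    return [gaussians[i + 1] - gaussians[i] for i in range(num_scales - 1)]

def is_extremum(dog, s, y, x):
    """Compare a pixel with its 26 neighbours (8 in-scale + 2 x 9 across scales)."""
    value = dog[s][y, x]
    cube = np.stack([d[y - 1:y + 2, x - 1:x + 2] for d in dog[s - 1:s + 2]])
    return value == cube.max() or value == cube.min()
```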

2.2.2. Extrema Localization of Key Points

Since the image is a discrete space, the coordinates of the feature points are all integers, but the coordinates of a key point detected in scale space are not always integers. To precisely locate the coordinates of key points, the Taylor formula was adopted, and the location of the extreme points was determined through five iterations.

\[ D(\mathbf{x}) = D + \frac{\partial D^{T}}{\partial \mathbf{x}} \mathbf{x} + \frac{1}{2} \mathbf{x}^{T} \frac{\partial^{2} D}{\partial \mathbf{x}^{2}} \mathbf{x} \tag{2} \]

The known point A is used to estimate the value of a nearby point B in the above formula, where x = (x, y, σ)^T is the offset of point B relative to point A. The extreme value of Equation (2) is found by taking the derivative of the formula and setting it to 0, as follows:

\[ \mathbf{x}_{0} = -\left( \frac{\partial^{2} D}{\partial \mathbf{x}^{2}} \right)^{-1} \frac{\partial D}{\partial \mathbf{x}} \tag{3} \]

If the offset is larger than 0.5 in any dimension, the point is closer to another sample point C, and the candidate is updated to C; otherwise, the iteration ends and x0 is added to the current integer coordinates of the feature point. Substituting Equation (3) into Equation (2) gives

\[ D(\mathbf{x}_{0}) = D + \frac{1}{2} \frac{\partial D^{T}}{\partial \mathbf{x}} \mathbf{x}_{0} \tag{4} \]

When D(x0) ≤ 0.03 [25], the point is deemed to have low contrast; such a point is susceptible to noise and unstable, so it is removed. After precise positioning, some of the remaining feature points are unstable edge points caused by noise, and they also need to be removed. The characteristic of an edge point is that its principal curvature is large in the direction perpendicular to the edge and small in the direction tangent to the edge. According to this characteristic, the edge points can be removed. The principal curvature can be obtained from the second-order Hessian matrix, as follows:

\[ H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \tag{5} \]

where D_xx, D_xy, and D_yy are obtained from differences between the current pixel and its surrounding pixels. The principal curvature of a point is proportional to the eigenvalues of its Hessian matrix, so it can be assessed by computing those eigenvalues. Let α and β be the eigenvalues of the matrix (assuming α > β), giving the following equations:

\[ \mathrm{Tr}(H) = D_{xx} + D_{yy} = \alpha + \beta \tag{6} \]

\[ \mathrm{Det}(H) = D_{xx} D_{yy} - (D_{xy})^{2} = \alpha\beta \tag{7} \]


where Det(H) is the determinant of the matrix and Tr(H) is its trace. The ratio α/β is calculated to assess the principal curvature of the feature point; if α/β > 10, the point is removed.
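As a concrete illustration, the sketch below applies this contrast-and-edge filtering to a DOG image with NumPy. It is an assumption-laden simplification: it uses the standard trick of testing Tr(H)^2/Det(H) against (r+1)^2/r (with r = 10) instead of computing the eigenvalues explicitly, which is equivalent to the α/β test above.

```python
import numpy as np

CONTRAST_THRESHOLD = 0.03   # low-contrast rejection, Equation (4)
EDGE_RATIO = 10.0           # alpha/beta threshold from Section 2.2.2

def keep_keypoint(dog, y, x):
    """Return True if the candidate at (y, x) survives the contrast and edge tests."""
    if abs(dog[y, x]) <= CONTRAST_THRESHOLD:
        return False
    # Second-order finite differences give the 2-D Hessian entries.
    dxx = dog[y, x + 1] + dog[y, x - 1] - 2.0 * dog[y, x]
    dyy = dog[y + 1, x] + dog[y - 1, x] - 2.0 * dog[y, x]
    dxy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
           - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                 # curvatures of opposite sign: unstable, edge-like
        return False
    # Equivalent to alpha/beta < EDGE_RATIO (Equations (6) and (7)).
    return tr * tr / det < (EDGE_RATIO + 1.0) ** 2 / EDGE_RATIO
```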

2.2.3. Orientation Assignment of the Feature Point

For the key points detected in the DOG, the gradient and the distribution of direction of the pixels in the 3σ neighborhood of the pyramid are collected.

\[ l(x, y) = \left\{ [F(x+1, y) - F(x-1, y)]^{2} + [F(x, y+1) - F(x, y-1)]^{2} \right\}^{1/2} \tag{8} \]

\[ \rho(x, y) = \tan^{-1} \frac{F(x, y+1) - F(x, y-1)}{F(x+1, y) - F(x-1, y)} \tag{9} \]

where l(x, y) and ρ(x, y) are the modulus and direction of the gradient at (x, y) in the DOG, respectively.

2.2.4. Construction of the SIFT/RITF Descriptor

For the traditional SIFT feature descriptor, the image is first selected from the pyramid according to the feature point scale; then, the gradient direction and magnitude of each point within the neighborhood are calculated. Each feature point is described by 16 (4 × 4) sub-regions, each of which accumulates vector data for eight directions, generating a 128-dimensional (4 × 4 × 8) eigenvector. Finally, to reduce the influence of illumination, the eigenvector is normalized to unit length and denoted F_sift.

For the construction of the RITF feature vector, a circular region is built with the feature point as the center and the longest diagonal D of the image as the diameter, and this region serves as the neighborhood of the feature point. The neighborhood is divided into a number of concentric circular rings; even without Gaussian blurring and zeroing operations, it maintains good rotation invariance while avoiding the computational cost of rotating the image. On the circle centered on the key point with radius r (typically D/5, D/4, D/3, etc.), m points are sampled at equal intervals. The m sampled pixels (denoted I_0(x, y), I_1(x, y), ..., I_{m-1}(x, y)) are compared with the center pixel (denoted I_c(x, y)), and the image is binarized: if I_c(x, y) < I_i(x, y) (i = 0, 1, ..., m − 1), the sampling position is set to 0; otherwise, it is set to 1. The rotation-invariant code of the central pixel, denoted RITF^{ri}_{m,r}, is then calculated as follows:

\[ \mathrm{RITF}^{ri}_{m,r} = \min \{ \mathrm{ROR}(\mathrm{RITF}_{m,r}, k) \mid k = 0, 1, \cdots, m-1 \} \tag{10} \]

where

\[ \mathrm{RITF}_{m,r} = \sum_{i=0}^{m-1} F(I_i - I_c)\, 2^{i}, \qquad F(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \tag{11} \]

and ROR(x, k) denotes a circular right shift of the m-bit binary number x by k bits.
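The following NumPy sketch shows one way to compute such a rotation-invariant code for a single center pixel, following Equation (11) for the bit pattern and Equation (10) for the minimization over circular shifts; the sampling geometry (nearest-neighbor lookup on a circle) is a simplifying assumption.

```python
import numpy as np

def ritf_code(gray, yc, xc, radius, m=8):
    """Rotation-invariant binary code of the pixel at (yc, xc), Equations (10)-(11)."""
    center = float(gray[yc, xc])
    bits = []
    for i in range(m):
        angle = 2.0 * np.pi * i / m
        yi = int(round(yc + radius * np.sin(angle)))
        xi = int(round(xc + radius * np.cos(angle)))
        bits.append(1 if float(gray[yi, xi]) - center >= 0 else 0)  # F(I_i - I_c)
    code = sum(b << i for i, b in enumerate(bits))
    # ROR: minimum over all circular right shifts of the m-bit pattern
    rotations = [((code >> k) | (code << (m - k))) & ((1 << m) - 1) for k in range(m)]
    return min(rotations)
```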

The image region is rotated to the reference orientation according to the main direction of the key point, and an 8 × 8 image patch is taken as the area to be described. In this area, for each pixel I_i(x, y) (i = 1, 2, ..., 64), the RITF code RITF^{ri}_{m,r} centered on it is computed and denoted ritf_i (i = 1, 2, ..., 64). The further I_i(x, y) is from the center point I_c(x, y), the less descriptive information it provides. Therefore, each ritf_i is weighted, with the weight coefficient given by

\[ \lambda_{i} = \frac{\exp\{ -[(x_{i} - x_{c})^{2} + (y_{i} - y_{c})^{2}] / (2\beta)^{2} \}}{2\pi\beta^{2}} \tag{12} \]

where (x_i, y_i) and (x_c, y_c) are the coordinates of the pixel point I_i(x, y) and the center point I_c(x, y), respectively, and β is a constant. Then, all the weighted RITF eigenvalues are assembled into a one-dimensional eigenvector, denoted R_i, giving the following:

\[ R_{i} = [\lambda_{1} \times ritf_{1} \;\; \lambda_{2} \times ritf_{2} \;\; \cdots \;\; \lambda_{64} \times ritf_{64}] \tag{13} \]

Finally, in order to eliminate the effect of illumination changes, R_i was normalized as R_i → R_i / ||R_i||. Thereby, a 64-dimensional eigenvector F_ritf was obtained, which is the RITF feature descriptor of the area surrounding the key point I_c(x, y).

The SIFT/RITF feature descriptor is defined as follows:

\[ F_{sr} = \left[ \alpha F_{sift}, \; (1 - \alpha) F_{ritf} \right] \tag{14} \]

where F_sift is the 128-dimensional SIFT feature descriptor, F_ritf is the 64-dimensional RITF feature descriptor, F_sr is the resulting 192-dimensional SIFT/RITF feature descriptor, and α (α < 1) is a relative weighting factor that depends on the pixel size and the type of image. In this paper, based on experimental optimization, α was set to 0.6.

2.3. Feature Extraction with CNN Model

The CNN model proposed in this paper is derived from AlexNet. It consists of five convolution layers, three max-pooling layers, and two fully connected layers; the last layer adopts a Softmax classifier [9]. After the network reaches the optimal classification performance on all image classes, the parameters of each layer of the convolutional neural network are fixed. Global features are extracted according to these parameters, and the features of the third layer are extracted. The CNN first uses a large sample set for pre-training and then uses the pre-trained model to extract features. Figure 3 shows the structure of the CNN network.


Figure 3. The CNN network structure consists of five convolution layers, three max-pooling layers, and two fully connected layers. The input is the scale-normalized image.

2.3.1. Batch Normalization Layer

The original AlexNet consists of five convolution layers, three max-pooling layers, and three fully connected layers, and its input is the scale-normalized image. Training a deep network is a complex process: if the first few layers of the network change slightly, the changes are accumulated and magnified in the following layers. Once the distribution of the input data changes, each layer has to keep adapting to the new distribution, which slows down training. Stochastic gradient descent (SGD) is a simple and efficient method for training deep networks; however, it requires careful tuning of model parameters such as the learning rate, initial values, weight decay coefficient, dropout ratio, and so on. The choice of these parameters is crucial to the training result, so a great deal of time was spent on this work. To this end, we removed the local response normalization (LRN) layer and introduced a batch normalization (BN) layer [45]. Because BN improves network generalization, the dropout layer could be removed. A normalization layer is inserted in each layer of the network, performing a normalization (to zero mean and unit variance).

Consider a mini-batch α of size n. Since the normalization is applied to each activation independently, we can take any dimension as an example. We have n values of this activation in the mini-batch, α = {x1, ..., xn}. The elements of each mini-batch are sampled from the same distribution. Let the normalized values be {x̂1, ..., x̂n}, and their linear transformations be β = {y1, ..., yn}. The forward pass of the batch normalization layer is shown in Algorithm 1.

Algorithm 1. Batch normalization (BN)
Input: Values of x over a mini-batch: α = {x1, ..., xn}; parameters to be learned: η, λ.
Output: β = {y1, ..., yn}, with yi = BN_{η,λ}(xi).

1. Mini-batch mean: μ = (1/n) Σ_{i=1}^{n} x_i;
2. Mini-batch variance: σ² = (1/n) Σ_{i=1}^{n} (x_i − μ)²;
3. Normalized value: x̂_i = (x_i − μ)/σ;
4. Scale and shift: y_i = η x̂_i + λ ≡ BN_{η,λ}(x_i).
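A minimal NumPy sketch of this forward pass is shown below; the small epsilon added inside the square root is an implementation detail assumed here for numerical stability and is not part of Algorithm 1 as written.

```python
import numpy as np

def batch_norm_forward(x, eta, lam, eps=1e-5):
    """Forward pass of batch normalization for one activation dimension.

    x   : (n,) mini-batch values
    eta : learnable scale (step 4 of Algorithm 1)
    lam : learnable shift
    """
    mu = x.mean()                          # step 1: mini-batch mean
    var = x.var()                          # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalized value
    return eta * x_hat + lam               # step 4: scale and shift
```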

In this algorithm, BN increases the training speed: a larger initial learning rate can be chosen, and the learning-rate decay can proceed faster, so that an appropriate learning rate is reached more quickly. Moreover, BN alleviates the problem of internal covariate shift.

2.3.2. Activation Function Layer

The main role of an activation function in a neural network is to add nonlinear factors, so that the network can solve more complex problems. Without an activation function, the weights and biases would only perform linear transformations. The activation function also makes backpropagation possible, because its error gradient can be used to adjust the weights and biases; this would not be possible without a differentiable nonlinear function. When applied in a deep neural network, the activation function mainly affects training in the forward-propagation and backward-propagation passes. Figure 4 shows the PReLU activation function.


Figure 4. Activation function: parametric rectified linear unit (PReLU).


The PReLU is a variant of ReLU in which the slope of the negative region is determined by the data rather than predefined. It has been found that PReLU converges faster than ReLU [46]. PReLU was chosen over the ordinary ReLU to avoid the dying-ReLU problem; its mathematical representation is as follows:

\[ y_{i} = \max(0, x_{i}) + a_{i} \times \min(0, x_{i}) \tag{15} \]

where a_i is a learnable parameter; when a_i is a fixed nonzero small value, PReLU is equivalent to LeakyReLU, and when a_i = 0, it is equivalent to ReLU. Its functional expression is as follows:

\[ f(y_{i}) = \begin{cases} y_{i}, & \text{if } y_{i} > 0 \\ a_{i} y_{i}, & \text{if } y_{i} \leq 0 \end{cases} \tag{16} \]
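To show how these pieces fit together, the following PyTorch sketch defines one convolutional block and the truncated AlexNet-style feature extractor in the spirit of Figure 3. It is an illustrative assumption rather than the authors' code: the layer sizes are taken from the standard AlexNet configuration, and nn.BatchNorm2d / nn.PReLU stand in for the BN layer and activation described above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, **kw):
    """Convolution followed by batch normalization and PReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, **kw),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(out_ch),
    )

class SRCNNFeatures(nn.Module):
    """AlexNet-style backbone: 5 conv layers, 3 max-pool layers, 2 FC layers."""
    def __init__(self, feature_dim=4096):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 96, kernel_size=11, stride=4),
            nn.MaxPool2d(3, stride=2),
            conv_block(96, 256, kernel_size=5, padding=2),
            nn.MaxPool2d(3, stride=2),
            conv_block(256, 384, kernel_size=3, padding=1),
            conv_block(384, 384, kernel_size=3, padding=1),
            conv_block(384, 256, kernel_size=3, padding=1),
            nn.MaxPool2d(3, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, feature_dim), nn.PReLU(),
            nn.Linear(feature_dim, feature_dim), nn.PReLU(),
        )

    def forward(self, x):                    # x: (batch, 3, 227, 227)
        return self.fc(self.features(x))     # F_cnn, used later for fusion
```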

2.3.3. Backpropagation (BP) of CNN Networks

BP neural networks have strong nonlinear mapping capabilities, a high capacity for self-learning and self-adaptation, the ability to apply learned results to new knowledge, and a certain fault tolerance. Based on these advantages, we used BP to adjust the network parameters in the backward pass. Since the gradient of the cross-entropy loss function is proportional to the error, it effectively prevents the gradient from vanishing and overcomes the slow updating caused by the mean-variance cost function. In this paper, we therefore used the cross-entropy loss function instead of the mean-variance loss function, and the last activation function was replaced by the sigmoid function, because the cross-entropy loss is well matched to it. The cross-entropy cost function is defined as follows:

\[ C = -\frac{1}{n} \sum_{i=1}^{n} [y \ln a + (1 - y) \ln(1 - a)] \tag{17} \]

where y is the expected output and a is the actual output. Its derivatives are as follows:

\[ \frac{\partial C}{\partial w_{i}} = \frac{1}{n} \sum_{i=1}^{n} x_{i} (\sigma(z) - y) \tag{18} \]

\[ \frac{\partial C}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\sigma(z) - y) \tag{19} \]

The weight update is driven by σ(z) − y, that is, by the error itself. Therefore, when the error is large, the weights update quickly, and when the error is small, they update slowly.
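The sketch below evaluates Equations (17)-(19) numerically for a single sigmoid unit; it is a stand-alone NumPy illustration under the assumption of one weight vector and a batch of inputs, not part of the training code described in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_and_grads(x, y, w, b):
    """Cross-entropy cost (Eq. 17) and its gradients (Eqs. 18-19) for one sigmoid unit.

    x : (n, d) inputs, y : (n,) targets in {0, 1}, w : (d,) weights, b : scalar bias
    """
    a = sigmoid(x @ w + b)                     # actual output
    n = len(y)
    cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    error = a - y                              # sigma(z) - y drives the update
    grad_w = x.T @ error / n                   # Equation (18)
    grad_b = error.mean()                      # Equation (19)
    return cost, grad_w, grad_b
```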

2.3.4. Feature Vector Extraction

The input face image passes through the convolution layers, pooling layers, and fully connected layers. The output of the second fully connected layer is taken as the feature vector extracted by the CNN network and is denoted F_cnn. The fully connected layer operates on the output of the previous layer (the activation maps representing high-order characteristics) and determines which features are most relevant to each particular class. The structure of the fully connected layer is shown in Figure 5.


Figure 5. Structure of the fully connected layer.

2.4. Feature Fusion and Classification

The SR-CNN descriptors were constructed from 128-dimensional SIFT descriptors, 64-dimensional RITF descriptors, and 4096-dimensional CNN features, which contain different kinds of image information and therefore call for different matching strategies. For the SIFT local feature vector S, the Euclidean distance was used as the matching strategy:

\[ d_{S} = \| S_{i} - S_{j} \| = \sqrt{ \sum_{m} ( S_{i,m} - S_{j,m} )^{2} } \tag{20} \]

where S_{i,m} and S_{j,m} denote the m-th component of the i-th feature point and of the j-th feature point (m ≤ 128) of the SIFT local feature, respectively.

For the RITF local feature vector R, the χ² statistic was used as the matching strategy:

\[ d_{R} = \chi^{2} = \frac{1}{2} \sum_{m} \frac{ ( R_{u,m} - R_{v,m} )^{2} }{ R_{u,m} + R_{v,m} } \tag{21} \]

where R_{u,m} and R_{v,m} denote the m-th component of the u-th feature point and of the v-th feature point (m ≤ 64) of the RITF local feature, respectively.
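These two matching criteria translate directly into NumPy, as in the short sketch below; the small constant in the χ² denominator is an assumption added to avoid division by zero and is not part of Equation (21).

```python
import numpy as np

def sift_distance(s_i, s_j):
    """Euclidean distance between two 128-D SIFT descriptors, Equation (20)."""
    return np.sqrt(np.sum((s_i - s_j) ** 2))

def ritf_distance(r_u, r_v, eps=1e-12):
    """Chi-square statistic between two 64-D RITF descriptors, Equation (21)."""
    return 0.5 * np.sum((r_u - r_v) ** 2 / (r_u + r_v + eps))
```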

For a given face image, two groups of features are extracted, denoted F_sr and F_cnn. Considering the large differences between the features of different categories, the mean and variance of the two feature groups are used to normalize the corresponding feature vectors; the mean and variance of each group are estimated on the training set. The two normalized feature groups are concatenated to form the fusion feature vector, which is expressed as

\[ F = \{ F_{sr}, F_{cnn} \} \tag{22} \]

For the fusion feature vector F extracted from each face image, the random forest (RF) learning method [47] is used to construct a classifier for feature classification. The number of decision trees in the random forest was set to 100; with this feature-screening mechanism, the classification results have high accuracy.

3. Parallel Optimization of the SR-CNN Model

3.1. Architecture of Parallel Computing

In 2007, NVIDIA released the Compute Unified Device Architecture (CUDA), an official development platform for GPU programming [48]. The core idea of CUDA is based on three important abstractions: shared memory, barrier synchronization, and the thread-group hierarchy. The GPU can be used for parallel data computation, so that many data elements are processed in parallel with a high computational density. The CUDA programming model unlocks the GPU's parallel capabilities. Owing to its distinctive hardware architecture, the GPU has strong parallel data computing capability (a large fraction of its transistors are devoted to arithmetic logic operations). Since many data elements run independently on separate threads and have high computational density, large data caches are not needed.

3.2. Design of GPU Implementation

This paper introduces the optimization and acceleration of the SR-CNN model using the scheme delineated in Figure 6. In this framework, the left side of the dotted line indicates that the program is running on the CPU (host) side, and the right side indicates that it is running on the GPU (device) side.


Figure 6. Framework of parallel computing in the SR-CNN model.

3.2.1. Establishment of the DOG and Detection of Extrema

In the construction of the Gaussian pyramid, the image at each level must be Gaussian filtered, which is time-consuming when computed as a full two-dimensional convolution. Exploiting the separability of the Gaussian kernel, the convolution is split into two one-dimensional passes (rows and columns), which are computed by parallel convolution kernel functions.

\[ I_{column}(x, y) = I(x, y) \times g_{x}(t) \tag{23} \]

\[ I_{row}(x, y) = I_{column}(x, y) \times g_{y}(t) = I(x, y) \times g_{x,y}(t) \tag{24} \]

Since the column convolution is similar to the row convolution, the parallel construction of the Gaussian pyramid is described below in detail using the column convolution operation as an example.
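The separability in Equations (23) and (24) can be checked with a few lines of NumPy/SciPy, as sketched below; the kernel size and sigma are arbitrary assumptions for the illustration.

```python
import numpy as np
from scipy.ndimage import convolve, convolve1d

def gaussian_kernel_1d(sigma, radius=None):
    radius = radius or int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))
    return g / g.sum()

image = np.random.rand(128, 128)
g = gaussian_kernel_1d(sigma=1.6)

# Two 1-D passes (column then row), Equations (23)-(24)
separable = convolve1d(convolve1d(image, g, axis=0), g, axis=1)

# Equivalent full 2-D convolution with the outer-product kernel
full_2d = convolve(image, np.outer(g, g))

assert np.allclose(separable, full_2d, atol=1e-8)
```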


The block was set to two dimensions and the thread to one dimension. The portion of image data read by each block after each calculation is stored in shared memory, so that data communication between threads speeds up the parallel computation. After the Gaussian pyramid is built, the DOG is obtained by directly subtracting the pixels of images at adjacent scales. During the subtraction, non-adjacent images do not affect each other, so they can be processed in parallel on the GPU. In the same way, the extreme points can be detected in parallel over the pixels of each DOG layer to find the potential key points. The locations and scales of the key points are then marked in a binary bitmap, inserted into a sequence, and uploaded to the host.

3.2.2. Histogram Statistics

In this paper, a circular area with a radius of 15 pixels is used as the neighborhood for the key-point descriptors. To compensate for the feature-point instability caused by the lack of affine invariance, the gradient amplitude is first weighted by Gaussian smoothing over the whole circular region; the region is then divided by splitting 2π into 36 equal sectors, with each sector corresponding to a block. The histogram of a feature point is accumulated from the gradient information of the pixels in every block, and the bin with the maximum value in the histogram gives the main direction of the feature point. The data are imported into red/green/blue/alpha (RGBA) texture storage, and the multi-target rendering capability of the graphics card is used to accelerate the processing further. Finally, the calculated results are compressed into a one-dimensional array and passed back from the device to the host, after which the gradient direction is updated on the CPU and the SIFT descriptor is established.

3.2.3. Construction of the RITF Descriptor

Since an image typically has hundreds of feature points, and the calculation of the RITF features of each feature point is time-consuming, the RITF is processed on the GPU. Firstly, the image and the feature-point data are copied from the host to the device and saved in global memory; each block is then assigned to calculate the RITF eigenvalue of one feature point. As a grid can hold up to 65,535 × 65,535 blocks, the number of blocks is much larger than the number of feature points, so data overflow is not a concern. From Equations (10), (12), and (13), the calculations of the RITF eigenvalue and of the weighting coefficient do not affect each other. Therefore, different kernel functions were designed to be called in parallel in two streams, and the RITF eigenvalues and weighting coefficients were stored in separate arrays. To ensure that their values correspond to each other, the RITF eigenvalues were computed in the same order as the weighting coefficients. Finally, before the RITF descriptor is constructed, it must be ensured that all calculations have completed; to do this, the kernel functions return not only the calculation results but also status information about the calculations.

3.2.4. Feature Matching of SR-CNN

In the feature-matching process, the face image data are uploaded from the host to the device and processed in parallel on the GPU. The SIFT/RITF feature vector of each feature point is placed in a block of shared memory, and each block is assigned 192 threads; the first 128 threads compute the Euclidean distance, and the following 64 threads compute the χ² statistic. For the CNN features, each feature map is mapped to a thread block, and each neuron on the feature map is mapped to a thread of that block. The three dimensions of the thread grid (x, y, z) correspond to the width, height, and number of feature maps of each layer, respectively. The kernel is therefore launched with z thread blocks of x × y threads each (a configuration of the form kernel<<<z, dim3(x, y)>>>), for z × x × y threads in total. Because the number of threads in a thread block was limited to 512, multiple thread blocks were allocated where necessary. Finally, the __syncthreads() function is called to achieve thread synchronization, so that every thread has finished before the subsequent processing is executed.
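The thread-to-neuron mapping described above can be sketched in Python with Numba's CUDA backend, as below. This is an illustrative assumption rather than the authors' CUDA C implementation: a 3-D launch grid covers (width, height, number of feature maps), and cuda.syncthreads() plays the role of __syncthreads().

```python
from numba import cuda
import numpy as np

@cuda.jit
def relu_feature_maps(fmaps, out):
    # One thread per neuron: (x, y) inside a map, z = feature-map index.
    x, y = cuda.grid(2)
    z = cuda.blockIdx.z
    if z < fmaps.shape[0] and y < fmaps.shape[1] and x < fmaps.shape[2]:
        out[z, y, x] = max(fmaps[z, y, x], 0.0)
    cuda.syncthreads()   # barrier: all threads of the block finish before reuse

fmaps = np.random.randn(256, 13, 13).astype(np.float32)   # e.g., conv5 output
out = np.zeros_like(fmaps)

threads_per_block = (13, 13)          # x * y threads per block
blocks = (1, 1, fmaps.shape[0])       # z blocks, one per feature map
relu_feature_maps[blocks, threads_per_block](fmaps, out)
```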

4. Experimental Results and Analysis

This section evaluates the performance of the SR-CNN model on a public dataset and a self-collected dataset. The public dataset is the Labeled Faces in the Wild (LFW) dataset, which is described in detail later. To verify the validity of the model, we compared the experimental results with those of other methods. The specific flow is shown in Figure 7.


Figure 7. Flow diagram of experimental settings.


4.1. Experimental Settings

4.1.1. Experimental Database

The LFW dataset contains 13,000 images of human faces collected from the web, each named after the person photographed. Among these, 1680 people have two or more different photographs. We used these images for training and testing; the splits were generated randomly and independently for 10-fold cross-validation. All images are 250 × 250 pixels. Figure 8 shows some sample images from the LFW database [49].

Figure 8. Sample images from the Labeled Faces in the Wild (LFW) dataset.

The self-collected face database includes frontal, profile, and tilted face images, among others. It contains 1002 images of 50 people, with each image labeled. The images were taken under different levels of illumination and with different poses and expressions. All images are in color and were manually cropped to 300 × 300 pixels.

4.1.2. Experimental Platform

The CPU was an Intel Core i5-6300U processor clocked at 2.40 GHz, with 8 GB of memory. The GPU was a GeForce GTX 1060M with 640 CUDA cores, a memory bandwidth of 80.16 GB/s, and a core frequency of 1097 MHz. The operating system was Microsoft Windows 10, and the software environment comprised Visual Studio 2010, CUDA Toolkit 8.0, and OpenCV 3.4.

4.1.3. Experimental Evaluation Index

Each example is divided into the positive or the negative class. If a positive instance is predicted as positive, it is counted as TP; if a negative example is predicted as positive, it is counted as FP. Correspondingly, a negative sample predicted as negative is counted as TN, and a positive sample predicted as negative is counted as FN. T_CPU denotes the serial execution time on the CPU, T_GPU denotes the parallel execution time on the GPU, Sr is the speed-up ratio, TPR is the true positive rate, FPR is the false positive rate, and ACC is the recognition accuracy; the formulas are given below.

\[ TPR = \frac{TP}{TP + FN} \tag{25} \]

\[ FPR = \frac{FP}{FP + TN} \tag{26} \]

\[ ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{27} \]

\[ S_{r} = \frac{T_{CPU}}{T_{GPU}} \tag{28} \]
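For reference, these four indices can be computed as follows in Python; the confusion-matrix counting is written out explicitly rather than relying on any particular library.

```python
import numpy as np

def evaluation_indices(y_true, y_pred, t_cpu, t_gpu):
    """TPR, FPR, ACC (Equations 25-27) and speed-up ratio Sr (Equation 28)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, acc, t_cpu / t_gpu
```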

4.2. Experimental Analysis

In this subsection, we analyze the performance of the CNN network under different settings, including the number of training epochs, the activation function, and whether the CNN network was pre-trained or fine-tuned. The specific flow is shown in Figure 9.


Figure 9. Flow diagram of the experimental analysis.

4.2.1. Analysis of the Accuracy in Different Training Epochs

The number of training epochs of the CNN network inevitably affects the accuracy. Figure 10 shows that the accuracy rate increases as the number of epochs increases.


Figure 10. Comparison of accuracy for different training epochs.

When the network reached the convergence state, the accuracy tended to become stable as the number of epochs increased further. The larger the epoch value, the easier it is for the network to converge, but the longer the training takes. As can be seen from the graph, when the number of epochs was 20, the network had already converged, and SR-CNN achieved the best performance on the self-collected database. Therefore, in this paper, the number of training epochs of the CNN network was set to 20.

[Figure: recognition accuracy (%) on the LFW and self-collecting datasets for the Sigmoid, Tanh, ReLU, ELU, and PReLU activation functions.]

Figure 11. The matching accuracy of the activation function.


As can be seen from Figure 11, the matching accuracy with the PReLU activation function was higher than that with the other functions. Meanwhile, with the same activation function, the accuracy was higher when using the self-collecting database.
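To make the comparison concrete, the following minimal PyTorch sketch (not the authors' code; the layer sizes and input shape are placeholder assumptions) shows how the activation of a small CNN block can be swapped between Sigmoid, Tanh, ReLU, ELU, and PReLU so that each variant is trained and evaluated under otherwise identical settings:

```python
import torch
import torch.nn as nn

# Hypothetical CNN block; only the activation module is varied between runs.
ACTIVATIONS = {
    "Sigmoid": nn.Sigmoid,
    "Tanh": nn.Tanh,
    "ReLU": nn.ReLU,
    "ELU": nn.ELU,
    "PReLU": nn.PReLU,   # learnable negative slope, the activation adopted in this paper
}

def make_block(name: str, in_ch: int = 3, out_ch: int = 32) -> nn.Sequential:
    """Convolution + batch normalization + the chosen activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        ACTIVATIONS[name](),
    )

if __name__ == "__main__":
    x = torch.randn(4, 3, 64, 64)          # dummy batch standing in for face crops
    for name in ACTIVATIONS:
        y = make_block(name)(x)
        print(name, tuple(y.shape))
```

Because only the activation module changes between runs, any accuracy difference in such an experiment can be attributed to the activation function itself.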

4.2.3. Analysis of Pre-Trained and Fine-Tuned CNN Networks

Generally, convolutional neural networks trained from scratch are prone to problems, and fine-tuning can quickly converge the network to an ideal state, but there is a large difference between the target dataset and the pre-trained dataset. In the recognition task of the target dataset, it is difficult to extract the visual features of the image using the pre-trained CNN model. In order to make the pre-trained CNN parameters more appropriate to the target dataset, we fine-tuned the CNN models based on the target dataset. With the exception of the last layer, the parameters of the pre-training model were used to initialize the other layers, and the output layer was initialized by a zero-mean Gaussian distribution with a standard deviation of 0.01. Unlike the pre-training, we assumed that the weights of the pre-trained CNN were relatively good; thus, we set a small learning rate for the CNN weights to be fine-tuned. We did not want to change them too quickly or too much; thus, we kept our learning rate and learning rate attenuation very small.

As can be seen from Table 1, the fine-tuned network helped boost the accuracy, while the performance could be further improved. Interestingly, fully connected layer 7 (FC7) worked best after fine-tuning, unlike the pre-trained CNN. Therefore, the output of the FC7 layer was taken as the feature vector.

Table 1. Comparison between pre-trained convolution neural network (CNN) and fine-tuned CNN for the proposed method. LFW—Labeled Faces in the Wild; Conv—convolution layer; FC—fully connected layer.

Target Dataset     Model         Conv5   FC6    FC7
LFW                Pre-trained   85.3    86.8   86.0
LFW                Fine-tuned    86.8    87.7   88.7
Self-Collection    Pre-trained   83.4    85.8   85.4
Self-Collection    Fine-tuned    85.8    87.5   88.6
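The fine-tuning recipe described above can be summarized in a short, hypothetical PyTorch sketch (illustrative only; the AlexNet backbone, class count, and optimizer settings are assumptions, not the exact configuration used in this work): every layer except the output layer keeps its pre-trained weights, the output layer is re-initialized from a zero-mean Gaussian with a standard deviation of 0.01, and a small, slowly decaying learning rate is used for the pre-trained weights.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Hypothetical backbone standing in for the pre-trained CNN used in the paper.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

num_classes = 100                                          # assumed size of the target face dataset
out_layer = nn.Linear(net.classifier[6].in_features, num_classes)
nn.init.normal_(out_layer.weight, mean=0.0, std=0.01)      # zero-mean Gaussian, sigma = 0.01
nn.init.zeros_(out_layer.bias)
net.classifier[6] = out_layer                              # only the last layer is replaced

# Small learning rate for the pre-trained weights, a larger one for the new output layer,
# plus gentle decay so the fine-tuned weights change slowly.
params = [
    {"params": [p for n, p in net.named_parameters() if "classifier.6" not in n], "lr": 1e-4},
    {"params": net.classifier[6].parameters(), "lr": 1e-3},
]
optimizer = torch.optim.SGD(params, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)

# After fine-tuning, the FC7 activations (the penultimate fully connected layer) can be
# read out as the feature vector by truncating the classifier.
fc7_extractor = nn.Sequential(net.features, net.avgpool, nn.Flatten(1), *list(net.classifier)[:5])
features = fc7_extractor(torch.randn(1, 3, 224, 224))
print(features.shape)   # torch.Size([1, 4096])
```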

4.3. Comparative Experiment on Performance

In this subsection, we evaluate the model proposed in this paper in different situations. The comparisons included face matching methods before and after improvement, our model with other matching models, and the GPU before and after optimization. The details are shown in Figure 12.

[Figure: three comparative experiments on performance — comparison of face matching methods before and after improvement; comparison of our model with other face matching models; comparison of the GPU before and after optimization.]

Figure 12. Comparative experiments on performance.

4.3.1. Comparison of Face Matching Methods Before and After Improvement

The face databases used in this experiment were the LFW database (the size of each face was 250 × 250 pixels) and the self-collection face database (the size of each face was 300 × 300 pixels), including front face, side face, and rotated face images, among others.

The method of face matching under the traditional SIFT model is only suitable for matching frontal faces and images with small rotation (tilt) angles, and the matching effect is poor for faces with large rotation angles. The improved method combined RITF and SIFT, which improved the matching for faces with a large rotation (tilt) angle while maintaining the effect of the original matching. In this experiment, the improved face matching method was compared with the traditional SIFT model on the same face databases. The experimental results were averaged over 10 experiments.

For the LFW face database, as shown in Figure 13, the SIFT model could match three pairs of feature points, but only two pairs of matching points were correct. The improved SIFT/RITF model could match 10 pairs of feature points, and eight pairs of matching points were correct.

[Figure: (a) SIFT model and (b) SIFT/RITF model matching results for a face pair from the LFW database.]

Figure 13. Comparison of matching effect using the LFW face database: (a) SIFT Model; (b) SIFT/RITF Model.

The matching effects of the two methods for the LFW face database are shown in Figure 14 (acc represents the recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 10.97–13.24% and the true positive rate improved by 8.50–10.11% (due to the low number of feature points in LFW's face database, the threshold of matching point pairs m was set to 3).

[Figure: acc and tpr (%) of the SIFT and SIFT/RITF models on the LFW face database for 200–1400 face images.
Number of face images:   200    500    800    1100   1400
SIFT-acc:                86.35  85.41  84.72  83.92  83.34
SIFT-tpr:                90.62  90.47  90.30  89.28  89.03
SIFT-RITF-acc:           97.32  96.73  96.52  96.42  96.58
SIFT-RITF-tpr:           99.12  99.15  99.23  99.10  99.14]

Figure 14. Comparison between SIFT and SIFT/RITF models (LFW face database).

For the self-collection face database, as shown in Figure 15, the SIFT model had 19 pairs of matching points, of which only nine pairs were correct. The improved SIFT/RITF model had 31 pairs of matching points, and 23 pairs of matching points were correct. The matching effects of both methods for the self-collection face database are shown in Figure 16 (acc represents the recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 12.65–15.31% and the true positive rate improved by 7.49–9.09% (the threshold of matching point pairs m was set to 16).

[Figure: (a) SIFT model and (b) SIFT/RITF model matching results for a face pair from the self-collection database.]

Figure 15. Comparison of matching effect using the self-collection face database: (a) SIFT Model; (b) SIFT/RITF Model.

[Figure: acc and tpr (%) of the SIFT and SIFT/RITF models on the self-collection face database for 200–1400 face images.
Number of face images:   200    500    800    1100   1400
SIFT-acc:                84.96  84.09  83.13  82.23  82.84
SIFT-tpr:                91.92  91.23  90.72  90.47  90.23
SIFT-RITF-acc:           97.61  97.52  97.59  97.54  97.53
SIFT-RITF-tpr:           99.41  99.46  99.34  99.36  99.32]

Figure 16. Comparison between SIFT and SIFT/RITF models (self-collection face database).
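As a rough illustration of the decision rule used above (a face pair is accepted when the number of correct matching point pairs reaches the threshold m), the sketch below uses plain SIFT from OpenCV with Lowe's ratio test; the RITF component of the proposed method is not reproduced here, and the image paths are placeholders.

```python
import cv2

def count_sift_matches(img_path_a: str, img_path_b: str, ratio: float = 0.75) -> int:
    """Count SIFT keypoint matches that survive Lowe's ratio test."""
    img_a = cv2.imread(img_path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(img_path_b, cv2.IMREAD_GRAYSCALE)
    if img_a is None or img_b is None:
        raise FileNotFoundError("could not read one of the input images")
    sift = cv2.SIFT_create()
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good)

if __name__ == "__main__":
    m = 3   # threshold of matching point pairs (3 for LFW, 16 for the self-collection set in the text)
    n_matches = count_sift_matches("face_a.jpg", "face_b.jpg")   # placeholder paths
    print("match" if n_matches >= m else "non-match", n_matches)
```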

4.3.2. Comparison of Our Model with Other Face Matching Models

The LFW face database and the self-collection database were used to experiment with different improved SIFT models. In this experiment, we compared the proposed SR-CNN model with eleven other models: principal component analysis with SIFT (PCA/SIFT) [50], Gabor/SIFT [51], speeded up robust features with SIFT (SURF/SIFT) [52], partial-descriptor SIFT (PDSIFT) [53], person-specific SIFT [54], SIFT/Kepenekci [55], the wavelet transform of the SIFT feature [56], hexagonal SIFT (H-SIFT) [57], SIFT/RITF, CNN, and Faster R-CNN. For PCA/SIFT, gradient patches were used around the key points instead of the original patches to make the representation robust to changes in lighting and to reduce the changes that PCA needs to model [50]. Reference [51] proposes to combine Gabor and SIFT for face recognition. The SURF/SIFT method uses a SURF detector with a SIFT descriptor to augment the proficiency of contemporary face recognition systems [52]. The proposed PDSIFT method retains key points at large or near-face boundaries to achieve better performance than the original SIFT [53]. The person-specific SIFT model uses the SIFT features of a particular person and a non-statistical matching strategy to solve the face recognition problem in combination with the local and global similarities on the key point group [54]. SIFT/Kepenekci is a SIFT model-based Kepenekci approach [55]. The method proposed in Reference [56] is based on the wavelet transform of the SIFT features. H-SIFT takes advantage of hexagonally transformed image pixels and applies processing on a hexagonal coordinate system, rather than using SIFT on the square image coordinates, displaying better performance [57]. The experimental results are shown in Table 2.

Table 2. Recognition performance of different improved scale-invariant feature transform (SIFT) models using the LFW dataset. ACC—recognition accuracy rate; TPR—true positive rate; FPR—false positive rate; PCA—principal component analysis; SURF—speeded up robust features; PD—partial descriptor; H—hexagonal; RITF—rotation-invariant texture feature; R-CNN—region-based CNN; SR-CNN—CNN with SIFT and RITF.

Method                                   ACC     TPR@FPR = 1%    TPR@FPR = 0.1%
PCA/SIFT                                 92.34   89.53           80.46
Gabor/SIFT                               94.37   91.76           82.72
SURF/SIFT                                95.12   92.35           82.75
PDSIFT                                   95.68   92.65           83.07
Person-specific SIFT                     95.72   92.94           83.39
SIFT/Kepenekci                           96.35   93.37           83.79
Wavelet transform of the SIFT feature    96.37   93.36           83.51
H-SIFT                                   96.42   93.44           83.86
SIFT/RITF                                98.56   95.51           85.94
CNN                                      97.98   95.18           85.59
Faster R-CNN                             98.45   95.67           85.97
SR-CNN                                   98.98   95.96           86.47
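For readers unfamiliar with the TPR@FPR metric reported in Tables 2 and 3, the following small NumPy sketch (illustrative only; the score arrays are synthetic) shows how the true positive rate at a fixed false positive rate, e.g. 1% or 0.1%, can be read off from genuine and impostor similarity scores:

```python
import numpy as np

def tpr_at_fpr(genuine: np.ndarray, impostor: np.ndarray, fpr: float) -> float:
    """TPR at a fixed FPR: the threshold is set so that `fpr` of impostor scores pass."""
    thr = np.quantile(impostor, 1.0 - fpr)     # (1 - fpr) quantile of impostor scores
    return float(np.mean(genuine >= thr))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genuine = rng.normal(0.8, 0.1, 10_000)     # synthetic same-person similarity scores
    impostor = rng.normal(0.4, 0.1, 100_000)   # synthetic different-person similarity scores
    for f in (0.01, 0.001):
        print(f"TPR@FPR={f:.1%}: {tpr_at_fpr(genuine, impostor, f):.4f}")
```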

According to the experimental results, the SR-CNN model presented in this paper obtained the optimal face recognition performance compared with the other eleven methods. As we can see from Tables 2 and 3, the SIFT/RITF model performed well on the LFW database, and the CNN model performed well on the self-collection database. Correspondingly, SR-CNN performed well on both databases, which shows that the proposed SR-CNN model is very effective for face recognition. We fully considered the complementary information of the different characteristics, which helps to overcome limitations that the CNN alone cannot avoid. We fused these complementary features, and the experimental results show that the fusion was a success.

Table 3. Recognition performance of different improved models using the self-collection dataset.

Method                                   ACC     TPR@FPR = 1%    TPR@FPR = 0.1%
PCA/SIFT                                 92.51   89.45           80.52
Gabor/SIFT                               93.86   90.81           82.84
SURF/SIFT                                94.96   91.45           82.83
PDSIFT                                   95.37   92.74           83.17
Person-specific SIFT                     94.84   93.45           83.48
SIFT/Kepenekci                           96.12   93.34           83.95
Wavelet transform of the SIFT feature    96.35   93.38           83.76
H-SIFT                                   96.58   93.56           83.97
SIFT/RITF                                98.23   95.67           86.08
CNN                                      98.87   95.25           85.89
Faster R-CNN                             99.10   95.86           86.53
SR-CNN                                   99.28   96.04           87.01

Based on the above experiments, compared with the traditional method, the facial feature matching method using the SR-CNN model is greatly improved. For the LFW face database, the size of the face images is smaller; although there are fewer matching points, the optimization effect was better, and the recognition accuracy rate could still be increased to 97.32%. For the self-collection database, the size of the face images was larger, and all of them were color images; therefore, there were more pairs of matching points, and the optimization effect was better, with the recognition accuracy rate as high as 97.61%. In addition, the improved method also increased the number of matching point pairs, such that the true positive rate on both databases could reach about 99%.

4.3.3. Comparison of GPU before and after Optimization

To carry out the comparison experiment on multi-group data, the LFW face database and the self-collection face database were arranged into five groups according to the number of images. In each experiment, the serial computing of the CPU was compared with the parallel computing of the GPU, and the test time of each group was the average of 10 experiments.

According to Figure 17 and Table 4, it can clearly be seen that, with the increase in the number of face images, the calculating time exhibited an upward trend, and the acceleration ratio was a factor of about 6–7 for the LFW database.

[Figure: calculating time (s) of the CPU and GPU for 200–1400 face images from the LFW database.
Number of face images:   200    500    800     1100    1400
CPU time (s):            30.16  82.95  131.11  180.09  237.15
GPU time (s):            4.98   13.02  20.14   26.92   34.27]

Figure 17. Comparison of central processing unit (CPU) and graphics processing unit (GPU) operation time (using the LFW database).

Table 4. Acceleration ratios of central processing unit (CPU) and graphics processing unit (GPU) using the LFW face database.

Number of Face Images    Acceleration Ratio
200                      6.06
500                      6.37
800                      6.51
1100                     6.69
1400                     6.92

Similarly, as can be seen from Figure 18 and Table 5, for the self-collection database, the acceleration ratio was a factor of about 5–6.

[Figure: calculating time (s) of the CPU and GPU for 200–1400 face images from the self-collection database.
Number of face images:   200    500     800     1100    1400
CPU time (s):            41.31  108.34  175.88  250.04  327.14
GPU time (s):            7.88   20.10   31.92   44.12   56.21]

Figure 18. Comparison of CPU and GPU operation time (using the self-collection database).

Table 5. Acceleration ratios of CPU and GPU using the self-collection face database.

Number of Face Images    Acceleration Ratio
200                      5.24
500                      5.39
800                      5.51
1100                     5.66
1400                     5.82
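The acceleration ratio reported in Tables 4 and 5 is simply the CPU running time divided by the GPU running time for the same workload. A minimal PyTorch sketch of this kind of measurement is shown below (illustrative only; the matrix-multiplication workload is a stand-in for the feature-extraction pipeline, and a CUDA-capable GPU is assumed to be available):

```python
import time
import torch

def time_workload(device: str, n: int = 4096, repeats: int = 10) -> float:
    """Average seconds for a large matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the asynchronous GPU kernels to finish
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    cpu_t = time_workload("cpu")
    if torch.cuda.is_available():
        gpu_t = time_workload("cuda")
        print(f"CPU {cpu_t:.3f}s  GPU {gpu_t:.3f}s  acceleration ratio {cpu_t / gpu_t:.2f}")
    else:
        print(f"CPU {cpu_t:.3f}s  (no CUDA device found)")
```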

Based on the above experiments, the use of GPU parallel computing could effectively improve the running speed of the program and reduce the execution time. Compared with the self-collection database, the face size in the LFW database was small; thus, the calculation time of the feature extraction was lower for each image, the acceleration ratio was relatively large, and the optimization effect was more distinct. When the number of pictures was small and there was no comprehensive use of the resources of the GPU threads, the acceleration was small and the speed-up was not very high. However, when the number of images gradually increased, all the threads on the GPU were called, and the acceleration ratio for both databases gradually increased. Therefore, GPU parallel computing is optimal for situations where there are many pictures.

5. Conclusions and Future Work

This paper introduced a robust face matching method based on the SR-CNN model. It improved the rotation invariance of the traditional SIFT model while maintaining the advantages of the original SIFT model, the CNN model, and the RITF model, which tackles the problem caused by the traditional SIFT model when there are multiple similar regions in an image. Furthermore, the parallel acceleration based on the GPU could not only ensure the matching accuracy, but also effectively improve the running efficiency and shorten the running time. However, in the case of complex backgrounds, the accuracy of face recognition methods will be affected, which will be further studied in the future.

Author Contributions: Y.-X.Y. conceived and initialized the research, conceived the algorithms, and designed the experiments; K.X. performed and evaluated the experiments; C.W. and F.-Q.W. analyzed the data; G.-Q.S. and X.-G.T. reviewed the paper; Y.-X.Y. wrote the paper.

Funding: This research was supported by the National Natural Science Foundation of China under Grants 61701046, 41674107, 41874119, and 41574064, the Youth Fund of Yangtze University (2016CQN10), the Oil and Gas Resources and Exploration Technology, Ministry of Education Key Laboratory Open Fund in 2018 (K2018-15), and the Undergraduate Training Programs for Innovation and Entrepreneurship of Yangtze University under Grant 2017009.

Conflicts of Interest: The authors declare no conflicts of interest.

References


Martino, L.D.; Preciozzi, J.; Lecumberry, F. Face matching with an a-contrario false detection control. Neurocomputing 2016, 173, 64–71. [CrossRef] Napoléon, T.; Alfalou, A. Pose invariant face recognition: 3D model from single photo. Opt. Lasers Eng. 2016, 89, 150–161. [CrossRef] Vezzetti, E.; Marcolin, F. Geometrical descriptors for human face morphological analysis and recognition. Robot. Auton. Syst. 2012, 60, 928–939. [CrossRef] Moos, S.; Marcolin, F.; Tornincasa, S. Cleft lip pathology diagnosis and foetal landmark extraction via 3D geometrical analysis. Int. J. Interact. Des. Manuf. 2017, 11, 1–18. [CrossRef] Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis. In Proceedings of the IEEE International Conference on Data Mining, New Orleans, LA, USA, 18–21 November 2017. Mikolajczyk, K.; Schmid, C. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1615–1630. [CrossRef] [PubMed] Lu, J.; Liong, V.E.; Zhou, J. Cost-Sensitive Local Binary Feature Learning for Facial Age Estimation. IEEE Trans. Image Process 2015, 24, 5356–5368. [CrossRef] Perez, C.A.; Cament, L.A.; Castillo, L.E. Methodological improvement on local Gabor face recognition based on feature selection and enhanced Borda count. Pattern Recognit. 2011, 44, 951–963. [CrossRef] Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105. Ren, S.; He, K.; Girshick, R. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [CrossRef] Sun, C.; Yang, Y.; Wen, C. Voiceprint Identification for Limited Dataset Using the Deep Migration Hybrid Model Based on Transfer Learning. Sensors 2018, 18, 2399. [CrossRef] Deng, J.; Dong, W.; Socher, R. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. Razavian, A.S.; Azizpour, H.; Sullivan, J. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 806–813. Li, J.; Qiu, T.; Wen, C. Robust Face Recognition Using the Deep C2D-CNN Model Based on Decision-Level Fusion. Sensors 2018, 18, 2080. [CrossRef] Gong, Y.; Wang, L.; Guo, R.; Lazebnik, S. Multi-scale Orderless Pooling of Deep Convolutional Activation Features. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Volume 8695, pp. 392–407. Xie, L.; Hong, R.; Zhang, B. Image Classification and Retrieval are ONE. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, Shanghai, China, 23–26 June 2015; pp. 3–10.


Farfade, S.S.; Saberian, M.J.; Li, L.J. Multi-view face detection using deep convolutional neural networks. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, Shanghai, China, 23–26 June 2015; pp. 643–650. Jiang, H.; Learnedmiller, E. Face Detection with the Faster R-CNN. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 30 May–3 June 2017. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3676–3684. Huang, L.; Yang, Y.; Deng, Y.; Yu, Y. Densebox: Unifying landmark localization with end to end object detection. arXiv, 2015, arXiv:1509.04874. Guo, J.; Xu, J.; Liu, S.; Huang, D.; Wang, Y. Occlusion-Robust Face Detection Using Shallow and Deep Proposal Based Faster R-CNN. In Proceedings of the Chinese Conference on Biometric Recognition, Chengdu, China, 14–16 October 2016; pp. 3–12. Qin, H.; Yan, J.; Li, X.; Hu, X. Joint training of cascaded CNN for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 456–3465. Sun, X.; Wu, P.; Hoi, S.C.H. Face detection using deep learning: An improved faster RCNN approach. Neurocomputing 2018, 299, 42–50. [CrossRef] Lowe, D.G. Object recognition from local scale invariant features. In Proceedings of the International Conference on Computer Vision (ICCV), Kerkyra, Greece, 21–22 September 1999; pp. 1150–1157. Lowe, D.G. Distinctive image features from scale-invariant key-points. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef] Susan, S.; Jain, A.; Sharma, A. Fuzzy match index for scale-invariant feature transform (SIFT) features with application to face recognition with weak supervision. IET Image Process. 2015, 9, 951–958. [CrossRef] Park, S.; Yoo, J.H. Real-Time Face Recognition with SIFT-Based Local Feature Points for Mobile Devices. In Proceedings of the International Conference on Artificial Intelligence, Modelling and Simulation IEEE, Kota, Kinabalu, Malaysia, 3–5 December 2013; pp. 304–308. Narang, G.; Singh, S.; Narang, A. Robust face recognition method based on SIFT features using Levenberg-Marquardt Backpropagation neural networks. In Proceedings of the International Conference on Image and Signal Processing, Hangzhou, China, 16–18 December 2013; pp. 1000–1005. Zhang, T.; Zheng, W.; Cui, Z. A Deep Neural Network-Driven Feature Learning Method for Multi-view Facial Expression Recognition. IEEE Trans. Multimed. 2016, 18, 2528–2536. [CrossRef] Zheng, L.; Yang, Y.; Tian, Q. SIFT Meets CNN: A Decade Survey of Instance Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 1224–1244. [CrossRef] [PubMed] Chandrasekhar, V.; Lin, J.; Goh, H. A practical guide to CNNs and Fisher Vectors for image instance retrieval. Signal Process. 2016, 128, 426–439. [CrossRef] Tajbakhsh, N.; Shin, J.Y.; Gurudu, S.R. Convolutional Neural Networks for Medical Image Analysis: Fine Tuning or Full Training? IEEE Trans. Med. Imaging 2017, 35, 1299–1312. [CrossRef] Judd, T.; Ehinger, K.; Durand, F. Learning to predict where humans look. In Proceedings of the IEEE International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 2106–2113. Zhao, Q.; Koch, C. Learning a saliency map using fixated locations in natural scenes. J. Vis. 2011, 11, 74–76. [CrossRef] Jeon, B.H.; Sang, U.L.; Lee, K.M. 
Rotation invariant face detection using a model-based clustering algorithm. In Proceedings of the International Conference on Multimedia and Expo, New York, NY, USA, 30 July–2 August 2000; pp. 1149–1152. Amiri, M.; Rabiee, H.R. A novel rotation/scale invariant template matching algorithm using weighted adaptive lifting scheme transform. Pattern Recognit. 2010, 43, 2485–2496. [CrossRef] Philbin, J.; Chum, O.; Isard, M. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8. Shen, X.; Lin, Z.; Brandt, J. Spatially-Constrained Similarity Measurefor Large-Scale Object Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1229–1241. [CrossRef]


Dosovitskiy, A.; Fischer, P.; Springenberg, J. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1734–1747. [CrossRef] [PubMed] Li, J.; Lam, E.Y. Facial expression recognition using deep neural networks. In Proceedings of the 2015 IEEE International Conference on Imaging Systems and Techniques (IST), Macau, China, 16–18 September 2015; pp. 1–6. Hossein, N.; Nasri, M. Image registration based on SIFT features and adaptive RANSAC transform. In Proceedings of the Conference on Communication and Signal Processing, Beijing, China, 6–8 April 2016; pp. 1087–1091. Elleuch, Z.; Marzouki, K. Multi-index structure based on SIFT and color features for large scale image retrieval. Multimed. Tools Appl. 2017, 76, 13929–13951. [CrossRef] Vinay, A.; Kathiresan, G.; Mundroy, D.A. Face Recognition using Filtered Eoh-sift. Procedia Comput. Sci. 2016, 79, 543–552. [CrossRef] Yu, D.; Yang, F.; Yang, C. Fast Rotation-Free Feature-Based Image Registration Using Improved N-SIFT and GMM-Based Parallel Optimization. IEEE Trans. Biomed. Eng. 2016, 63, 1653–1664. [CrossRef] [PubMed] Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456. He, K.; Zhang, X.; Ren, S. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015; pp. 1026–1034. Vens, C. Random Forest. Encycl. Syst. Biol. 2013, 45, 157–175. Devani, U.; Nikam, V.B.; Meshram, B. Super-fast parallel eigenface implementation on GPU for face recognition. In Proceedings of the Conference on Parallel, Distributed and Grid Computing, Solan, India, 11–13 December 2014; pp. 130–136. Huang, G.B.; Ramesh, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognitionin Unconstrained Environments; No. 2. Technical Report 07-49; University of Massachusetts: Amherst, MA, USA, 2007; Volume 1. Ke, Y.; Sukthankar, R. PCA-SIFT: A more distinctive representation for local image descriptors. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 27 June–2 July 2004; pp. 506–513. Parua, S.; Das, A.; Mazumdar, D. Determination of Feature Hierarchy from Gabor and SIFT Features for Face Recognition. In Proceedings of the Conference on Emerging Applications of Information Technology, Kolkata, India, 19–20 February 2011; pp. 257–260. Vinay, A.; Hebbar, D.; Shekhar, V.S. Two Novel Detector-Descriptor Based Approaches for Face Recognition Using SIFT and SURF. Procedia Comput. Sci. 2015, 70, 185–197. Geng, C.; Jiang, X. Face recognition using SIFT features. In Proceedings of the Conference on Image Processing, Cairo, Egypt, 7–10 November 2009; pp. 3313–3316. Luo, J.; Ma, Y.; Takikawa, E. Person-Specific SIFT Features for Face Recognition. In Proceedings of the Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, USA, 15–20 April 2007; pp. 593–596. Lenc, L. Automatic face recognition system based on the SIFT features. Comput. Electr. Eng. 2015, 46, 256–272. [CrossRef] Omidyeganeh, M.; Shirmohammadi, S.; Laganiere, R. Face identification using wavelet transform of SIFT features. In Proceedings of the Conference on Multimedia and Expo, Seattle, WA, USA, 15–19 July 2013; pp. 1–6. 
Azeem, A.; Sharif, M.; Shah, J.H. Hexagonal scale invariant feature transform (H-SIFT) for facial feature extraction. J. Appl. Res. Technol. 2015, 13, 402–408. [CrossRef] © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).