Learning Pairwise SVM on Hierarchical Deep Features for Ear Recognition
Ibrahim Omara 1,2, Xiaohe Wu 1, Hongzhi Zhang 1, Yong Du 3, Wangmeng Zuo 1
1 School of Computer Science and Technology, Harbin Institute of Technology (HIT), Harbin, China
2 Department of Mathematics, Faculty of Science, Menoufia University, Shebin El-koom, Egypt
3 Department of Electrical and Information Engineering, Northeast Agricultural University, Harbin, China
E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract: Deep features based on convolutional neural networks (CNNs) have demonstrated remarkable performance in various computer vision tasks, such as image classification and face verification. Compared with hand-crafted descriptors, deep features exhibit a more powerful representation ability. Typically, higher-layer features contain more semantic information, while lower-layer features provide more low-level description. In addition, it turns out that fusing features from different layers leads to superior performance. In this paper, we investigate a novel approach for human ear identification by combining hierarchical deep features. First, hierarchical deep features are extracted from ear images using CNN models pre-trained on a large-scale dataset. To enhance the feature representation and reduce the high dimension of the deep features, the discriminant correlation analysis (DCA) algorithm is adopted to fuse deep features from different layers. Due to the limited number of ear images per person, we propose to transform the ear identification problem into a binary classification problem by composing pairwise samples, and solve it with a pairwise SVM. Experiments are conducted on four public databases: USTB I, USTB II, IIT Delhi I and IIT Delhi II. The proposed method achieves promising recognition rates and exhibits decent performance in comparison with state-of-the-art methods.

Keywords: Ear recognition, Feature level fusion, CNNs, Pairwise SVM.
1 Introduction
Biometric systems have attracted much attention in various applications, especially in the fields of security and forensics [1]. In general, biometric traits can be grouped into two categories, i.e., physiological and behavioral. Physiological traits include the face, fingerprint, ear, hand geometry, finger vein, and iris, while behavioral traits involve gait, keystroke dynamics, signature, etc. In the last few decades, a variety of biometric recognition methods have been studied for the face, fingerprint, and iris [2]. Recently, the ear print has received considerable research interest in the biometrics community due to several prominent advantages [3]. Human ears are large and visible for acquisition, stable across age and expressions, and different even for identical twins. The ear satisfies the required biometric characteristics of universality, uniqueness, permanence and measurability. In [4], Iannarelli showed that no two ears have the same helix. In Figure 1, an example ear image is provided to illustrate the structure of the human ear, which can be divided into the outer and inner ear shapes and includes the inner and outer helix, antihelix, tragus, antitragus, concha, crus of helix and lobe. Despite their intensive application in computer vision, hand-crafted features (e.g., the Scale Invariant Feature Transform (SIFT) [5]) and visual descriptors such as Bag of Visual Words (BoVWs) [6] are usually limited in discrimination and generalization. Recently, driven by the advances in convolutional neural networks (CNNs), deep features have been drawing much attention due to their excellent representation and discrimination ability, and they have been widely adopted with impressive performance. Typically, the top layers of a CNN contain more semantic information and describe the global appearance of an image, while the intermediate layers describe local features and the bottom layers contain more low-level information describing textures and edges. Compared with hand-crafted descriptors, deep features can provide more discriminative information and lead to better performance, especially in object detection, object tracking
and image classification [7, 8]. Moreover, CNNs have been studied intensively in various fields, such as face and handwritten digit recognition, object detection, medical image analysis [9, 10], and eye detection [11]. However, less effort has been paid to the deep feature representation of ear images for ear recognition. On the other hand, fusion algorithms have been widely adopted in pattern recognition. Different levels of fusion have been studied, such as pixel-level [12], feature-level [12] and decision-level fusion [13]. Among these, feature-level fusion has been studied intensively [14]. Different features may describe the object from different aspects and contain complementary information. By combining these features, we can not only keep the effective discriminative information of multiple features, but also eliminate redundant information to a certain degree. All of these benefit the classification performance. Typical feature fusion methods include canonical correlation analysis (CCA) based fusion [15], parallel feature fusion [16] and serial feature fusion [14]. Liu et al. [14] concatenate different feature vectors into one, and Yang et al. [16] combine two feature vectors into a complex vector. The CCA based fusion method aims to maximize the correlation between two sets of features and has been widely used in feature fusion [17]. Instead of CCA, in this work we use discriminant correlation analysis (DCA) [18] for fusing different layers of features of the ear image due to its simplicity and efficiency. Matching is performed after feature extraction and aims to generate a score indicating the similarity of two ear images. In nearest neighbor (NN) classification, similarity is measured with a distance metric, such as the Euclidean distance [19, 20] or the Hamming distance [21]. For ear identification, discriminative classifiers, such as the Support Vector Machine (SVM), can be adopted for classification [22]. However, because the number of ear images per person is very limited (e.g., 3∼5 in most public datasets), SVM cannot achieve the desired performance. Thus, we propose to apply pairwise SVM [23] to solve the problem. Pairwise SVM provides an extension of the binary SVM classifier to multi-class
classification, which has been demonstrated to achieve excellent results on classification problems.

Fig. 1: Anatomy of the ear.

To sum up, we present an ear identification method based on hierarchical CNN features, which is an extension of our previous work [24]. Specifically, we focus on the feature extraction and classification modules. Compared with [24], DCA is adopted to combine hierarchical features from different layers, and more experiments are conducted on four public ear databases. The contributions can be summarized as follows:
• Benefiting from the powerful representation ability of deep features, we propose to represent ear images with deep features for ear recognition.
• Discriminant correlation analysis (DCA) [18] is utilized to perform dimensionality reduction and to fuse different layers of features for a better representation of ear images. Furthermore, to handle the deficiency of ear images in training, we suggest using the pairwise SVM to transform ear identification into a binary classification task.
• Extensive experiments are conducted to evaluate the proposed method on four public databases, i.e., USTB collections I and II, and IIT Delhi collections I and II. The results show that the proposed method performs favorably against the state-of-the-art methods.
The rest of the paper is organized as follows: Section 2 introduces the related work on feature extraction and matching methods for ear recognition. Section 3 presents the proposed method based on deep features, discriminant correlation analysis, and pairwise SVM. Section 4 reports the experimental results. Finally, Section 5 draws the conclusion of the paper.
2 Related Work
In this section, we separately discuss the methods for ear feature extraction, feature fusion, and matching and classification.

2.1 Ear Feature Extraction
In this work, we use a deep CNN for ear feature extraction. Apart from deep features, many hand-crafted features, including both holistic [25] and local [5, 26–28] ones, have been suggested for ear images. Compared with hand-crafted features, deep CNN features are learned in a data-driven manner and have proven their effectiveness in many vision tasks. Moreover, it has been demonstrated that the features learned by CNNs such as AlexNet [29] and VGG-Net [30] pre-trained on ImageNet can be well transferred to biometric recognition tasks, such as face recognition [31], iris recognition [32], signature verification [33] and finger-vein identification [34]. In this work, we show that hierarchical CNN features can also be extended to ear feature representation with promising performance.

2.2 Feature Fusion
Different layers of CNN features may contain complementary information for ear recognition. Feature fusion can thus be explored to improve the representation ability of deep CNN features from different layers. For general feature fusion, Yang et al. [16] propose a parallel scheme that combines two sets of features into a complex feature vector and then performs a complex linear projection to improve classification accuracy. Sun et al. [15] adopt the idea of canonical correlation analysis (CCA) and propose a feature fusion method that maximizes the correlation between two sets of features. Based on CCA, Haghighat et al. [18] propose discriminant correlation analysis (DCA) for multimodal biometric feature fusion, which considers the class associations in the feature sets, eliminates the between-class correlations and restricts the correlations to be within classes. In this paper, we exploit the DCA fusion algorithm to fuse different layers of features extracted from VGG-Net [30] to further improve the discriminability and robustness of the representation for ear recognition.
2.3 Ear Matching and Classification
After feature extraction, a matching method or classifier is needed to recognize the test ear image. Commonly adopted matching criteria are distance metrics, such as the Euclidean distance and the Hamming distance [19–21]. SVM has also been widely adopted as a classifier for ear recognition, e.g., based on PCA and ICA features in [35]. However, due to the lack of training images and the multi-class nature of ear recognition, SVM may not achieve satisfactory performance. Pairwise SVM [23] recasts the multi-class problem as a binary one from the similarity matching perspective and exhibits better inter-class generalization, which makes it well suited to handling the many classes of an ear image database. Therefore, in this paper, we apply pairwise SVM for ear recognition.
3 The Proposed Method
In this section, we describe the proposed method for ear identification in detail. Figure 2 illustrates the framework of the proposed method, which includes deep feature extraction from VGG-M [36], the DCA [18] algorithm for fusing deep features from different layers, and the pairwise SVM classifier for ear recognition. In the following subsections, we describe each part of the proposed method.

3.1 Deep Features Extraction
In this work, VGG-M [36] pre-trained on the ILSVRC dataset is employed for feature extraction based on the following considerations. First, the ear image usually has limited size, which implies that a deep model of proper depth should be considered; for ear identification, the depth of VGG-M is sufficient to extract the local and global features of ear images. Second, considering the computational budget and memory storage, very deep models, such as VGG-16, VGG-19 [30] and ResNet [37], usually require more resources and may be difficult to deploy in practical biometric recognition systems. Finally, due to its excellent performance and promising transferability, VGG-M has been widely adopted in various vision tasks, such as multi-biometric recognition [38]. Moreover, we conduct experiments to compare the performance of VGG-M and VGG-16, and find that VGG-M is on par with VGG-16 and the fine-tuned VGG-M in terms of recognition accuracy. As illustrated in Figure 2, VGG-M contains eight layers, where the first five are convolution layers and the remaining three are fully-connected layers. Specifically, the first and second convolution layers are each followed by a response normalization layer, and max-pooling layers are stacked upon the response normalization layers as well as the fifth convolution layer. Furthermore, the rectified linear unit (ReLU) is applied to the output of each convolution and fully-connected layer. Table 1 summarizes the VGG-M architecture. Due to the fixed-size input (224 × 224 × 3) of VGG-M, we copy the gray image to 3 channels and resize each ear image to 224 × 224 in the experiments. For a better understanding of the deep features of ear images, Figure 3 visualizes several outputs of different convolution layers for four ear images from the USTB I, II and IIT Delhi I, II datasets, respectively. We can see that these features do capture some
Fig. 2: The framework for ear recognition based on the deep feature representation, DCA and pairwise SVM.

Table 1 The VGG-M Net architecture.
Name  | Type            | Filter Size \ Stride | Output Size
conv1 | Convolution     | 7×7 \ 2              | 109×109×96
relu1 | ReLU            | -                    | 109×109×96
norm1 | LRN             | -                    | 109×109×96
pool1 | Max-Pooling     | 3×3 \ 2              | 54×54×96
conv2 | Convolution     | 5×5 \ 2              | 26×26×256
relu2 | ReLU            | -                    | 26×26×256
norm2 | LRN             | -                    | 26×26×256
pool2 | Max-Pooling     | 3×3 \ 2              | 13×13×256
conv3 | Convolution     | 3×3 \ 1              | 13×13×512
relu3 | ReLU            | -                    | 13×13×512
conv4 | Convolution     | 3×3 \ 1              | 13×13×512
relu4 | ReLU            | -                    | 13×13×512
conv5 | Convolution     | 3×3 \ 1              | 13×13×512
relu5 | ReLU            | -                    | 13×13×512
pool5 | Max-Pooling     | 3×3 \ 2              | 6×6×512
fc6   | Fully-Connected | -                    | 4096×1
relu6 | ReLU            | -                    | 4096×1
fc7   | Fully-Connected | -                    | 4096×1
relu7 | ReLU            | -                    | 4096×1
fc8   | Fully-Connected | -                    | 1000×1
meaningful information about ears. Specifically, the bottom features contain more contour and edge information. Due to the properties of CNNs, the top layers of features convey more semantic information describing the global appearance of the ear image, while the middle and bottom layers contain more local features, such as edges and texture details. In Section 4, we conduct evaluation experiments on the convolution layers and the fully-connected layers separately to analyze the effect of features from different layers on ear recognition. Note that different layers of deep features convey complementary information about the ear image. We therefore perform feature fusion with the DCA algorithm for an enhanced ear representation.
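To make the extraction step concrete, the sketch below collects such hierarchical activations with forward hooks. It is only a minimal illustration: the paper uses VGG-M in MatConvNet, whereas torchvision ships VGG-16, which is used here as a stand-in; the tapped layer indices and the image path are assumptions.

```python
# Hypothetical sketch: collect hierarchical activations from a pre-trained CNN.
# torchvision's VGG-16 stands in for the VGG-M model used in the paper.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

# Assumed taps: last conv of blocks 3-5 and the fc6/fc7 linear layers of VGG-16.
taps = {
    "conv3": model.features[14],
    "conv4": model.features[21],
    "conv5": model.features[28],
    "fc6":   model.classifier[0],
    "fc7":   model.classifier[3],
}

feats = {}
def make_hook(name):
    def hook(module, inp, out):
        feats[name] = out.detach().flatten(1)   # one feature row per image
    return hook
for name, layer in taps.items():
    layer.register_forward_hook(make_hook(name))

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.Grayscale(num_output_channels=3),   # gray ear image copied to 3 channels
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("ear.png")).unsqueeze(0)   # "ear.png" is a placeholder path
with torch.no_grad():
    model(img)

for name, f in feats.items():
    print(name, tuple(f.shape))   # e.g. fc6 -> (1, 4096)
```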
3.2 Feature Fusion with Discriminant Correlation Analysis (DCA)

Discriminant correlation analysis (DCA) [18] was proposed for feature-level fusion with low computational complexity. It aims to maximize the pairwise correlation across two different feature sets, while eliminating between-class correlations and restricting the correlations to be within classes. DCA can be used to fuse feature vectors extracted either from multiple traits or from a single trait. Since the different layers of features extracted from VGG-M have their own characteristics, we adopt the DCA algorithm to fuse different features of ear images to further enhance the representation ability.
Suppose that there are $n$ ear images belonging to $c$ subjects. Denote by $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ the training set, where $x_i$ is the $i$-th sample and $y_i \in \{1, 2, \dots, c\}$ is the corresponding class label. Let $X_1$ and $X_2$ denote the training set matrices composed of two different layers of features extracted from VGG-M. The ear images have similar appearance and the number of training images is limited to 3∼5 per subject, which further increases the difficulty of training a discriminative classifier. To distinguish ear images from different subjects, DCA integrates the class associations into the correlation analysis, which motivates transforming the features based on the between-class scatter matrix
$$S_B = \sum_{i=1}^{c} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^\top,$$
where $n_i$ denotes the number of ear images belonging to the $i$-th subject, $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ denotes the center of the ear images of the $i$-th subject, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{c} n_i \bar{x}_i$ denotes the center of all ear samples. However, deep features are of high dimension, which leads to high computational complexity. According to [39], instead of decomposing the between-class scatter matrix $S_B$, the covariance matrix $C = \Phi^\top \Phi$ with $\Phi = \left[\sqrt{n_1}(\bar{x}_1 - \bar{x}), \dots, \sqrt{n_c}(\bar{x}_c - \bar{x})\right]$ can be decomposed due to its semi-positive definite (SPD) property:
$$C = U \Lambda U^\top, \quad \Lambda = \mathrm{Diag}(\lambda_1, \lambda_2, \dots, \lambda_c), \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_c, \qquad (1)$$
where $\lambda_i$ is the $i$-th eigenvalue of $C$ and the $i$-th column of $U$ is the corresponding eigenvector. Considering the $r$ largest non-zero eigenvalues and their corresponding eigenvectors $U_r$, the transformation matrix is defined as $P = \Phi U_r \Lambda_{r \times r}^{-1/2}$. The feature set $X_1$ is projected to $Z_1 = P_1^\top X_1$, and the feature set $X_2$ is transformed to $Z_2 = P_2^\top X_2$ correspondingly.

Deep feature vectors are generally of high dimension, which implies that they may contain redundant features. One conventional strategy is to use PCA whitening [40] for dimensionality reduction. Note that the goal of DCA is to fuse different feature sets; thus, after PCA whitening, DCA further requires that the features in one set have nonzero correlation only with their corresponding features in the other set. In the implementation, we adopt an adaptive method to set the PCA dimension by keeping the ratio of the sum of the selected eigenvalues to the sum of all eigenvalues larger than 0.95. Based on the transformed feature sets $Z_1$ and $Z_2$, the between-set covariance matrix is defined as $S_b = Z_1 Z_2^\top$. Due to the SPD property, the singular value decomposition (SVD) of $S_b$ is $S_b = V \Sigma V^\top$, and the transform matrix is defined as $T = V \Sigma^{-1/2}$. $Z_1$ and $Z_2$ are further transformed to $X_1' = T^\top Z_1$ and $X_2' = T^\top Z_2$. Finally, the fused feature produced by DCA is defined as the summation of the transformed feature vectors, $X' = X_1' + X_2'$, for the sake of low dimension. The detailed process is given in Algorithm 1.
Fig. 3: Visualization of different layers of deep features for one ear image from each of the USTB I, II and IIT Delhi I, II databases.
Algorithm 1 The DCA algorithm for deep feature fusion
Input: $X_1$, $X_2$. Output: $X'$.
1: Compute the mean vector of each class, $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$, and the mean of the training data, $\bar{x} = \frac{1}{n}\sum_{i=1}^{c} n_i \bar{x}_i$.
2: Compute the covariance matrix $C = \Phi^\top \Phi$, where $\Phi = \left[\sqrt{n_1}(\bar{x}_1 - \bar{x}), \dots, \sqrt{n_c}(\bar{x}_c - \bar{x})\right]$.
3: Compute the SVD of $C$: $C = U \Lambda U^\top$, $\Lambda = \mathrm{Diag}(\lambda_1, \lambda_2, \dots, \lambda_c)$, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_c$.
4: Define the transform matrix $P = \Phi U_r \Lambda_{r \times r}^{-1/2}$. (Steps 1∼4 are performed on $X_1$ and $X_2$ separately to obtain their transform matrices $P_1$ and $P_2$.)
5: Compute the transformed data $Z_1 = P_1^\top X_1$ and $Z_2 = P_2^\top X_2$.
6: Compute the between-set covariance matrix of the transformed feature sets: $S_b = Z_1 Z_2^\top$.
7: Compute the SVD of $S_b$: $S_b = V \Sigma V^\top$.
8: Define the transform matrix $T = V \Sigma^{-1/2}$.
9: Compute the transformed data $X_1' = T^\top Z_1$ and $X_2' = T^\top Z_2$.
10: Output the fused feature $X' = X_1' + X_2'$.
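The following is a minimal NumPy sketch of Algorithm 1, assuming feature matrices with one column per training image. The function names are ours, the PCA-whitening step mentioned above is omitted for brevity, and the between-set step keeps the general two-sided SVD of [18] rather than the single transform $T$ written above under the symmetry assumption on $S_b$.

```python
# Hypothetical sketch of Algorithm 1 (DCA fusion of two feature sets).
# Columns of X1, X2 are samples (d x n); y holds the class label of each column.
import numpy as np

def class_transform(X, y, tol=1e-10):
    """Return P (d x r) that diagonalizes the between-class scatter of X."""
    classes, counts = np.unique(y, return_counts=True)
    mean_all = X.mean(axis=1)
    # Phi = [sqrt(n_i) * (class mean_i - overall mean)], shape (d, c)
    Phi = np.column_stack([
        np.sqrt(n_i) * (X[:, y == c].mean(axis=1) - mean_all)
        for c, n_i in zip(classes, counts)
    ])
    C = Phi.T @ Phi                      # (c x c) surrogate of the scatter matrix
    lam, U = np.linalg.eigh(C)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    r = int(np.sum(lam > tol))           # keep the non-zero eigenvalues
    return Phi @ U[:, :r] @ np.diag(lam[:r] ** -0.5)

def dca_fuse(X1, X2, y):
    P1, P2 = class_transform(X1, y), class_transform(X2, y)
    r = min(P1.shape[1], P2.shape[1])    # common dimensionality for both sets
    Z1, Z2 = P1[:, :r].T @ X1, P2[:, :r].T @ X2
    Sb = Z1 @ Z2.T                       # between-set covariance (r x r)
    V, sig, Wt = np.linalg.svd(Sb)
    sig = np.clip(sig, 1e-10, None)      # guard against tiny singular values
    T1, T2 = V @ np.diag(sig ** -0.5), Wt.T @ np.diag(sig ** -0.5)
    return T1.T @ Z1 + T2.T @ Z2         # summation fusion (step 10)
```

In our setting, X1 and X2 would hold, for instance, the fc6 and fc7 features of the training images.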
3.3 Pairwise SVM

Given an ear image, the identification task aims to determine which person it belongs to. In this work, we transform the identification task into a binary classification problem and introduce the pairwise SVM [23] to solve it. A pairwise SVM takes a pair of input samples and predicts whether they come from the same subject or not. Denote the ear image set by $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i$ is the feature representation of one ear image and $y_i$ indicates which person it belongs to. First, pairwise inputs are composed, consisting of positive and negative pair samples. For each subject, we compose a pairwise example $[x_i, x_j]$ from the training images. If $x_i$ and $x_j$ belong to the same subject, then $[x_i, x_j]$ is defined as a positive pair with label $p_{ij} = +1$, which is further extended to the positive pairs $[x_j, x_i]$, $[x_i, x_i]$ and $[x_j, x_j]$. On the contrary, if $x_i$ and $x_j$ belong to different persons, then $[x_i, x_j]$ is assigned the negative label $p_{ij} = -1$, together with the pairwise sample $[x_j, x_i]$. Due to the lack of ear images for each subject, the positive and negative pairs may be unbalanced. Therefore, we take several strategies to relieve this issue. First, we extend $(x_i, x_j)$ to $(x_j, x_i)$ to produce more positive pairs, which also satisfies the symmetric training set constraint required by the pairwise SVM. Second, we replicate the original positive pairs to generate more positive data. In addition, we integrate the KNN algorithm into the composition of negative pairs: instead of using all negative pairs or selecting some of them randomly, we choose the $k$ nearest ear images from different persons to form the set of negative pairs. Furthermore, to alleviate the class imbalance, we manually control the ratio of positive and negative pairs during the training of the pairwise SVM. Through the strategies above, we balance the positive and negative pairs, and we also experimentally evaluate the effect of the ratio of positive and negative pairs. Benefiting from the kernel trick $k(x, z) = \langle\phi(x), \phi(z)\rangle$, $\phi: X \rightarrow H$, the pairwise kernels are defined as
$$K : (X \times X) \times (X \times X) \rightarrow \mathbb{R}, \qquad (2)$$
and the decision function is
$$f(x_i, x_j) = \sum_{(m,n)\in I} \alpha_{mn}\, p_{mn}\, K\big((x_m, x_n), (x_i, x_j)\big) + b, \qquad (3)$$
where $I \subseteq \{1,\dots,n\} \times \{1,\dots,n\}$. In pairwise SVM, the goal is to decide whether the examples of a pair $(x_i, x_j)$ belong to the same class or not, which implies a symmetry property of the decision function: $f(x_i, x_j) = f(x_j, x_i)$. Balanced kernels are therefore motivated to preserve the symmetry of the decision function; more details and discussions about balanced kernels can be found in [23]. In addition to balanced kernels, a symmetric pairwise decision function can also be achieved by symmetric training sets. Instead of exploiting the different balanced kernels, e.g., the metric learning pairwise kernel $K_{ML}$, the tensor learning pairwise kernel $K_{TL}$, the direct sum learning pairwise kernel $K_{DL}$, and the tensor metric learning pairwise kernel $K_{TM}$, we adopt symmetric training sets for learning the pairwise SVM. One reason is to address the lack of training samples and the requirement of a symmetric training set simultaneously, by exploiting the positive data augmentation strategy discussed above. Moreover, the symmetric training set makes it easy to utilize an off-the-shelf SVM toolbox. The LibSVM toolbox is utilized for implementing the pairwise SVM.
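As an illustration of the pair-composition strategies above, the following hedged sketch builds a symmetric, ratio-controlled training set of concatenated pairs. scikit-learn's NearestNeighbors stands in for the KNN selection of negatives, and the helper name build_pairs is ours (the paper itself uses the LibSVM toolbox); pos_repeats=2 and neg_ratio=3 mirror the settings used later in the experiments.

```python
# Hypothetical sketch of the training-pair composition for the pairwise SVM.
# X: (n, d) fused features, y: (n,) subject labels. A pair (xi, xj) is represented
# by the concatenation [xi, xj]; positives are symmetric and replicated, negatives
# come from the k nearest images of other subjects and are sub-sampled to a ratio.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_pairs(X, y, k=5, pos_repeats=2, neg_ratio=3, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = [], []
    # positive pairs: same subject, both orders, including self pairs
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] == y[j]:
                pos.append(np.hstack([X[i], X[j]]))
    # negative pairs: k nearest neighbours from different subjects, both orders
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X))).fit(X)
    _, idx = nn.kneighbors(X)
    for i, neighbours in enumerate(idx):
        for j in neighbours:
            if y[i] != y[j]:
                neg.append(np.hstack([X[i], X[j]]))
                neg.append(np.hstack([X[j], X[i]]))
    pos = np.repeat(np.asarray(pos), pos_repeats + 1, axis=0)   # replicate positives
    neg = np.asarray(neg)
    keep = rng.choice(len(neg), size=min(len(neg), neg_ratio * len(pos)), replace=False)
    neg = neg[keep]                                             # enforce ~1:neg_ratio
    pairs = np.vstack([pos, neg])
    labels = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
    return pairs, labels
```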
Fig. 4: The original ear images from USTB I, USTB II, IIT Delhi I and IIT Delhi II, respectively. To describe the identification procedure, we take the USTB I as an example with 2 ear images of each subject for training. Suppose that the training set is denoted as X = {x11 , x12 , x21 , x22 , . . . , xc1 , xc2 }, where c means the number of subjects and xji presents the i-th ear image that belongs to the j-th subject. The test set can be denoted as Z = {z1 , . . . , zc }. Given the pairwise SVM model, we compose pairwise samples {(zi , x11 ), (zi , x12 ), . . . , (zi , xc1 ), (zi , xc2 )} for each test data zi where i ∈ {1, 2, . . . , c}. According to the decision values of the pairwise SVM, we choose the label of the one with the highest positive score as the predicted label for the test sample zi .
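A minimal sketch of this identification step is given below, assuming a trained binary classifier clf exposing a decision_function (e.g. an sklearn.svm.SVC fitted on concatenated pairs such as those produced by the sketch above); the variable names are ours.

```python
# Hypothetical sketch of the identification step for one test image.
# X_train / y_train are the gallery features and subject labels, z is one test feature.
import numpy as np

def identify(clf, z, X_train, y_train):
    pairs = np.hstack([np.tile(z, (len(X_train), 1)), X_train])  # (z, x_j) for every gallery image
    scores = clf.decision_function(pairs)                        # pairwise similarity scores
    return y_train[int(np.argmax(scores))]                       # label of the best match

# rank-1 prediction for a whole test set
# y_pred = np.array([identify(clf, z, X_train, y_train) for z in X_test])
```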
4 Experimental Results

4.1 Datasets and evaluation metrics
The experiments are performed on four publicly available ear databases, i.e., USTB collections I and II, and IIT Delhi collections I and II. The USTB databases were collected from students and teachers at the Department of Information Engineering, University of Science and Technology of Beijing (USTB) [42]. The IIT Delhi databases were acquired from students and staff at IIT Delhi, New Delhi [43]; all images were collected from volunteers between October 2006 and June 2007 in an indoor environment. Figure 4 shows original ear images from these four databases. A brief summary of the databases is given as follows.

USTB I Database. It consists of 180 images of 60 subjects, with 3 images per subject. The images are taken under standard illumination and trivial rotation, and all images have the same size of 150×80 pixels.

USTB II Database. It comprises 308 images of 77 subjects, taken under varying illumination and angle. Each subject has 4 images: the first is a frontal ear image under standard illumination, the second and third are obtained with rotations of +30° and −30°, respectively, and the fourth is taken under weak illumination. All images in the USTB II database were acquired between November 2003 and January 2004 and have the same size of 400×300 pixels.

IIT Delhi I Database. This database contains 493 images of 125 subjects. Each subject has at least 3 images with different pose and lighting conditions, and each image has a size of 272×204 pixels.

IIT Delhi II Database. It includes images of 221 persons, with at least 3 images per subject. The images are automatically segmented and normalized for the 125 subjects from the IIT Delhi I database and for an additional set of subjects, and all images have the same size of 180×50 pixels.

Experimental settings. In our experiments, we set the default $k$ to 5 to address the class imbalance problem. The trade-off parameter $C$ of the pairwise SVM is determined by 10-fold inner cross-validation (CV) over the grid $\{2^{c_{\min}}, 2^{c_{\min}+c_{step}}, \dots, 2^{c_{\max}}\}$ with $c_{\min} = -5$, $c_{step} = 1$, and $c_{\max} = 15$. In the ablative experiments, we randomly select $r(=2)$ ear images per subject to constitute the training set and use the remaining images as the test set. To reduce the effect of particular training/testing partitions, average performance metrics are reported over 20 such runs. When comparing with competing methods, we adopt the protocol widely used in the literature for fair comparison.
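A small sketch of the per-subject random partition used in the ablative experiments is shown below, assuming y holds one subject label per image; run_once is a placeholder name for training and evaluating the full pipeline on one partition.

```python
# Hypothetical sketch of the r-images-per-subject train/test split repeated over runs.
import numpy as np

def split_per_subject(y, r=2, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        train_idx.extend(rng.choice(members, size=r, replace=False))
    train_idx = np.asarray(train_idx)
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx

# averaged over 20 random partitions, as in the ablative experiments
# accs = [run_once(*split_per_subject(y, r=2, seed=s)) for s in range(20)]
```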
Evaluation metrics. In the experiments, we consider both the closed-set and the open-set identification settings [54], and adopt the following evaluation metrics.

Closed-set identification. Given a test image $x_i^{test}$, the pairwise SVM is used to compute its similarity score to each training image. By sorting the similarity scores, we obtain the rank of the ground-truth subject, i.e., $rank(x_i^{test})$. The rank-$k$ recognition rate [54, 56] is then defined as
$$Acc(k) = \frac{\left|\{x_i^{test} \mid rank(x_i^{test}) \le k\}\right|}{N^{test}}, \qquad (4)$$
where $|\cdot|$ denotes the cardinality of a set and $N^{test}$ denotes the number of test images. In our experiments, the closed-set identification performance is measured by the rank-1 recognition rate $Acc(1)$.

Open-set identification. Given a test image $x_i^{test}$, the system first performs a detection sub-task to determine whether it belongs to the set of enrolled subjects using a decision threshold $\tau$. If so, it further conducts an identification sub-task to decide the subject of $x_i^{test}$. Following [54, 55], we define the detection and identification rate (DIR) at $\tau$ as
$$DIR(\tau) = \frac{\left|\{x_i^{test} \mid rank(x_i^{test}) = 1,\ f(x_i^{test}, x_j^{train}) \ge \tau\}\right|}{N^{test}}, \qquad (5)$$
where $x_j^{train}$ denotes the matched training image and the similarity function $f(x_i^{test}, x_j^{train})$ is defined in Eqn. (3). Since non-registered test images are unavailable, in our experiments we slightly modify the definition of the false alarm rate (FAR) in [54, 55] as
$$FAR(\tau) = \frac{\left|\{x_i^{test} \mid \max_{ID(x_j^{train}) \ne ID(x_i^{test})} f(x_i^{test}, x_j^{train}) \ge \tau\}\right|}{N^{test}}, \qquad (6)$$
where $ID(\cdot)$ denotes the subject of the sample. The ROC curve for identification can then be obtained by plotting DIR versus FAR while varying the decision threshold $\tau$, namely the DIR-FAR curve. Analogous to [56], we also report the rank-1 recognition rate at 1% FAR (i.e., DIR@1%).

To evaluate the proposed approach, we further conduct comparison experiments with several state-of-the-art methods. For comprehensive evaluation, we consider three scenarios and compare the proposed method with competing methods reported under similar settings. In Scenario 1, one image of each subject is taken for training, and the remaining images are used for testing. In Scenarios 2 and 3, two and three images per subject are selected for training, respectively, and the remaining images are used for testing.
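The following sketch computes Acc(k), DIR(τ) and FAR(τ) from a score matrix S whose entry S[i, j] is the pairwise-SVM similarity between test image i and training image j. The threshold in Eqn. (5) is interpreted here as applying to the best-matching gallery score, and the function names are ours.

```python
# Hypothetical sketch of the closed-set and open-set metrics (Eqns. 4-6).
# S[i, j]: similarity of test image i to gallery image j; y_test / y_gallery: subject labels.
import numpy as np

def rank_k_accuracy(S, y_test, y_gallery, k=1):
    order = np.argsort(-S, axis=1)                       # gallery indices sorted by score
    ranked = y_gallery[order]                            # labels in rank order
    hits = [np.any(ranked[i, :k] == y_test[i]) for i in range(len(y_test))]
    return float(np.mean(hits))                          # Eqn. (4)

def dir_far(S, y_test, y_gallery, tau):
    best = np.argmax(S, axis=1)                          # rank-1 gallery match
    top_score = S[np.arange(len(S)), best]
    genuine = y_gallery[best] == y_test
    dir_rate = float(np.mean(genuine & (top_score >= tau)))           # Eqn. (5)
    # best score among wrong-subject gallery images, per test sample (Eqn. 6)
    impostor = np.where(y_gallery[None, :] != y_test[:, None], S, -np.inf)
    far_rate = float(np.mean(impostor.max(axis=1) >= tau))
    return dir_rate, far_rate
```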
4.2 Deep features analysis
To analyze the effect of different layers of features, we conduct experiments on the USTB I and IIT Delhi I datasets using two images per subject for training. To evaluate the generalization ability of deep features, we evaluate three CNN models: VGG-M, VGG-16, and VGG-M fine-tuned on the training images. In the fine-tuning stage of VGG-M, we use the softmax loss. Specifically, the learning rate starts from 0.01 and is divided by 10 every 40 epochs, the training is terminated after 120 epochs, and the batch size is set to 16.
Fig. 5: Recognition rates (%) for different layer features of VGG-M, VGG-16 and the fine-tuned VGG-M on the USTB I and IIT Delhi I datasets, without positive pair repetition (left) and with 2 repetitions of positive pairs (right). The ratio of positive and negative pairs is set as 1:3.
Fig. 6: Evaluation of the ratio of the positive and negative pairs without the positive pairs repetition.
Fig. 7: Evaluation of the ratio of the positive and negative pairs with 2 replications of positive pairs.
Furthermore, data augmentation, including random flipping, shifting, and rotation, is adopted during training. Considering the computational efficiency and memory storage, we employ the convolutional features {conv3, conv4, conv5} and fully-connected features {fc6, fc7, fc8} of VGG-M and the fine-tuned VGG-M. For VGG-16, the deep features from {conv3_3, conv4_3, conv5_3, fc6, fc7, fc8} are utilized; for simplicity, we denote them uniformly as {conv3, conv4, conv5} and {fc6, fc7, fc8}. Figure 5 shows the ear recognition performance obtained with the different layers of features and CNN models. It can be observed that, without positive pair augmentation, the convolutional features are superior to the fully-connected features for all CNN models. We then extend the experiments to include positive pair augmentation: by replicating the original positive pairs twice, we triple the number of positive pairs, and the ratio of positive to negative data is fixed to 1:3. It is obvious that the positive data augmentation leads to excellent performance for every layer. However, some test images are still not correctly predicted using the fully-connected features. In particular, a recognition rate of around 80.00% is obtained using the fc8 layer features on the USTB I dataset, which means that about 10 test images are misclassified by the pairwise SVM. The fully-connected features of VGG-16 perform better than those of VGG-M on USTB I, but worse on the IIT Delhi I dataset, which can be attributed to the characteristics of the ear images in the different datasets. As shown in Figure 4, the images in USTB I contain the ear only, which means that all activations in the fully-connected layers correspond to the ear, while for IIT Delhi I some background is included and may distract the classifier. With the convolutional features, VGG-16 performs favorably against VGG-M on both datasets. Furthermore, the pre-trained VGG-M performs comparably to the fine-tuned VGG-M model. The results validate the generalization ability of deep models pre-trained on large-scale datasets, which can be well transferred to extract deep features for ear images. Considering the comparable ear recognition rate and the feature extraction efficiency discussed in Section 4.7, VGG-M serves as a suitable choice for ear image feature extraction.
Fig. 8: Evaluation of the positive pair replication. The ratio of positive and negative pairs is set as 1:3.

4.3 Data imbalance analysis
From Figure 5, one can note that the fully-connected features decrease the recognition rate on the USTB I dataset, and on IIT Delhi I the features from fc8 also lead to worse performance. Better recognition rates can be obtained using the convolutional features; however, their higher dimension may affect the efficiency. Thus, we aim to further improve the performance obtained with the fully-connected features, under the assumption that the imbalance between positive and negative pairs has a larger influence on the fully-connected features than on the convolutional ones. To address the class imbalance problem, we propose to replicate the positive pairs and to integrate KNN into the selection of negative pairs, as described in Section 3.3. We further conduct experiments to evaluate the effects of positive pair replication and of the ratio of positive and negative pairs. Without positive pair replication, we perform experiments with different ratios of positive and negative pairs on the USTB I and IIT Delhi I datasets. From Figure 6, we note that the proper ratio lies in the range 1:1∼1:7; more negative samples decrease the recognition rate. However, from Figure 7, we note that slightly more negative data improves the performance on the USTB I dataset when positive pair replication is used. With positive pair replication, the recognition rate on IIT Delhi I reaches excellent performance and is more robust to the ratio.
Table 2 Evaluation of the DCA feature fusion on the USTB I and IIT Delhi I datasets without positive pair replication. The ratio of the positive and negative pairs is set as 1:3. The values on the diagonal correspond to the average recognition rate obtained with a single layer of features.

USTB I
Layers | conv3 | conv4 | conv5 | fc6   | fc7   | fc8
conv3  | 98.00 | 98.33 | 98.33 | 98.67 | 98.67 | 98.33
conv4  | -     | 98.33 | 98.67 | 98.67 | 98.67 | 99.67
conv5  | -     | -     | 97.67 | 98.33 | 98.33 | 99.00
fc6    | -     | -     | -     | 70.00 | 99.33 | 98.33
fc7    | -     | -     | -     | -     | 51.33 | 98.00
fc8    | -     | -     | -     | -     | -     | 35.00

IIT Delhi I
Layers | conv3 | conv4 | conv5 | fc6   | fc7   | fc8
conv3  | 91.28 | 98.35 | 98.68 | 98.85 | 98.93 | 99.42
conv4  | -     | 94.03 | 98.85 | 98.56 | 99.52 | 99.51
conv5  | -     | -     | 97.74 | 99.25 | 99.51 | 99.23
fc6    | -     | -     | -     | 94.92 | 99.23 | 99.12
fc7    | -     | -     | -     | -     | 91.98 | 99.23
fc8    | -     | -     | -     | -     | -     | 79.63
We also perform an extreme experiment on USTB I with ratios of 1:33 and 1:36; the recognition rates drop to 56.67% and 36.67%, respectively, demonstrating the necessity of controlling the ratio. We further note that the replication of positive pairs has a larger influence on the recognition rate. Therefore, we evaluate the positive pair replication with the ratio fixed to 1:3. The results in Figure 8 validate the benefit of positive pair augmentation on both the USTB I and IIT Delhi I datasets. To sum up, class imbalance has a serious effect on the recognition rate due to the data deficiency; balancing the positive and negative pairs benefits the performance, and positive data augmentation is also helpful according to the experimental results. In the following experiments, we fix the ratio to 1:3 and augment the positive pairs with 2 replications unless explicitly stated otherwise.

4.4 Feature fusion analysis

From Figures 5 and 8, we have two observations. First, convolutional features outperform fully-connected features: a single layer of convolutional features without positive pair replication can achieve better performance than the fully-connected features. Second, positive pair replication largely improves the recognition rate of the fully-connected features; however, the improvement depends on the dataset and is related in a complex way to the ratio of positive and negative pairs, which has to be tuned manually. To further enhance the ear image representation, we employ the DCA algorithm for feature fusion. To analyze its effect, experiments are conducted on the USTB I and IIT Delhi I datasets; for fair comparison, the evaluation of DCA-based feature fusion is performed without positive pair replication. It is well known that different layers of features contain complementary information: the lower convolutional layers contain more low-level information such as textures and edges, while the higher fully-connected layers convey more semantic information. With two images per subject for training, we fuse any two layers from the convolutional features {conv3, conv4, conv5} and the fully-connected features {fc6, fc7, fc8}. According to the previous evaluation, we fix the ratio of positive and negative pairs to 1:3. The results are listed in Table 2. Any fusion of two layers improves the recognition rate. While a single layer of convolutional features achieves considerable performance, the fully-connected features benefit more from the DCA fusion scheme; in particular, the fusion gains an obvious improvement over the use of a single fully-connected layer on both the USTB I and IIT Delhi I datasets. In the following, we utilize the fusion of features from the fc6 and fc7 layers.

4.5 Occlusion analysis

In real-world applications, ear images are likely to be occluded by hair or other distractors. To evaluate the robustness to occlusion, we conduct experiments with random occlusion on the USTB I and IIT Delhi I datasets. Following [44], we add random occlusion with different percentages (10%, 20%, 30%, and 40%) to the ear images. We perform three groups of experiments based on the DCA fusion of fc6 and fc7, i.e., occlusion on the training images, occlusion on the test images, and occlusion on both. As shown in Figure 9, the proposed method is quite robust to random occlusion. With 10% occlusion on the training or test data, pairwise SVM still obtains a high average recognition rate over 20 random runs on both the USTB I and IIT Delhi I datasets. As the occlusion further increases, the recognition rate begins to drop, especially on the USTB I dataset. One possible explanation is that the ear images in the IIT Delhi I dataset contain more background, so the random occlusion has a higher probability of falling on the background region. Moreover, even with heavy occlusion (e.g., 40%), our method can still achieve a competitive average recognition rate of more than 90%, further demonstrating its robustness to occlusion.

Fig. 9: Recognition rates (%) under different percentages of random occlusion (on the training images, on the test images, and on both) on the USTB I and IIT Delhi I datasets.

Fig. 10: The recognition rates of the 20 runs on the USTB I and IIT Delhi I datasets.
4.6 Robustness to training/testing partitions
Using the USTB I and IIT Delhi I datasets, experiments are conducted to evaluate the robustness of the proposed method to training/testing partitions. For each subject, we randomly select two ear images to constitute the training set and use the remaining images as the test set to obtain the recognition rate. By repeating this experiment 20 times, the average recognition rate and the standard deviation (std.) are used to assess the robustness. Figure 10 shows the recognition rate of each run on the two datasets. On the USTB I dataset, the average recognition rate and std. are 99.33% and 0.97%, while on the IIT Delhi I dataset they are 99.23% and 1.01%. The results indicate that the proposed method is robust to the selection of training/test sets.
Table 3 Run time of feature extraction on CPU and GPU.
Models  | CPU (s)   | GPU (s)
VGG-M   | 0.08∼0.1  | 0.003∼0.005
VGG-16  | 0.25∼0.55 | 0.007∼0.009
Table 4 Recognition rates (%) for Scenario 1.
Methods                            | IIT Delhi I | IIT Delhi II | USTB I | USTB II
Mu et al. [45]                     | −           | −            | −      | 85.0
Benzaoui et al. [22]               | 94.3        | 92.6         | 96.0   | −
Benzaoui et al. [46]               | 91.3        | 90.3         | −      | −
Our approach    SVM                | 92.4        | 93.7         | 91.7   | 83.1
                K-NN, rank-1       | 92.4        | 93.7         | 91.7   | 83.1
                K-NN, rank-2       | 94.4        | 96.4         | 94.2   | 88.3
                K-NN, rank-3       | 94.8        | 98.2         | 95.8   | 91.3

Table 5 Recognition rates (%) for Scenario 2.
Methods                            | IIT Delhi I | IIT Delhi II | USTB I | USTB II
Zhang et al. [47]   PCA            | −           | −            | 85.0   | −
                    ICA            | −           | −            | 88.3   | −
Nosrati et al. [48] PCA            | −           | −            | −      | 84.3
                    ICA            | −           | −            | −      | 87.7
                    2D wavelet+PCA | −           | −            | −      | 90.5
Kumar and Wu [49]   Shape features | 29.1        | 30.6         | −      | −
                    Gabor          | 90.4        | 88.4         | −      | −
                    Log Gabor      | 96.3        | 95.8         | −      | −
Benzaoui et al. [22]               | 97.3        | 97.3         | 98.3   | −
Benzaoui et al. [46]               | 96.7        | 97.3         | −      | −
Ghoualmi et al. [20]               | −           | −            | 97.2   | −
Tariq et al. [50]                  | 95.2        | −            | 98.3   | −
Our approach        SVM            | 94.4        | 96.8         | 96.7   | 85.1
                    pairwise SVM   | 99.9        | 99.8         | 99.4   | 99.6
                    K-NN, rank-1   | 94.4        | 97.7         | 96.7   | 87.7
                    K-NN, rank-2   | 94.4        | 98.6         | 96.7   | 90.3
                    K-NN, rank-3   | 94.4        | 99.1         | 96.7   | 90.9

Table 6 Recognition rates (%) for Scenario 3.
Methods                              | IIT Delhi I | USTB II
Zhang and Mu [35]       PCA          | −           | 81.8
                        ICA          | −           | 92.2
Boodoo and Baichoo [27] PCA          | 85.0        | −
                        LBP          | 93.0        | −
Ghoualmi et al. [20]    HE           | 99.1        | 93.5
                        CLAHE        | 98.2        | 94.2
                        ABC          | 99.6        | 94.8
Guo and Xu [51]                      | −           | 93.2
Tariq et al. [50]                    | −           | 96.1
Our approach CNN+DCA    SVM          | 97.5        | 97.4
                        pairwise SVM | 99.5        | 99.0
                        K-NN, rank-1 | 95.0        | 97.4
                        K-NN, rank-2 | 96.6        | 98.7
                        K-NN, rank-3 | 97.5        | 98.7
4.7 Run time
Although our method is promising in terms of identification performance, the introduction of deep features requires more computational budget and may restrict the application of our method in real-time systems. In Table 3, we report the time cost of feature extraction on CPU and GPU, respectively. On a computer with a 3.4 GHz Intel Xeon E3-1230 CPU and the MatConvNet toolbox, our method takes 0.08∼0.10 seconds (s) to extract the deep features of one image with VGG-M, and 0.25∼0.55 s with VGG-16. Fortunately, the speed can be significantly accelerated when a GPU is deployed. With an NVIDIA GeForce GTX 1080 Ti GPU, the feature extraction time decreases to 0.003∼0.005 s for VGG-M and 0.007∼0.009 s for VGG-16, and can thus undoubtedly be performed in real time. Furthermore, even for embedded biometric systems, one can exploit an FPGA-based implementation [41] to meet the real-time requirement.

4.8 Comparison with the state-of-the-arts
In this subsection, we compare the proposed method with several state-of-the-art ear recognition approaches. The results of the proposed method are obtained by fusing the deep features from the fc6 and fc7 layers; the composed positive pairs are repeated twice for data augmentation, and the ratio of positive to negative pairs is set to 1:3.

Scenario 1. In this scenario, one ear image of each subject is chosen for training and the remaining ones are used for testing. Since the subjects in these four databases have at least three images, most competing methods [22, 46] conduct three experiments by respectively using the first, the second, or the third image per subject for training, and report the average recognition rate [22, 46]. We adopt the same protocol and report the average recognition rates in Table 4.
In addition, due to the requirement of pairwise data for the pairwise SVM, in this scenario we only evaluate SVM and the k-nearest neighbor (k = 1) (K-NN) classifier on the fc6 and fc7 deep features fused with the DCA algorithm. For the K-NN algorithm, we provide the rank-1, rank-2 and rank-3 recognition rates. As shown in Table 4, the performance obtained by SVM and K-NN with feature fusion compares favorably with the other state-of-the-art methods on all datasets.

Scenario 2. This scenario is widely adopted to evaluate the ear recognition task. However, some competing methods only state that about two ear images per subject are used for training, without providing further details on the training/testing partitions [20, 47, 48]. In [22, 46, 49], the average recognition rates are obtained over three training/testing partitions, respectively using the first and second (the first and third, or the second and third) images per subject for training. For fair comparison, we adopt the same protocol as [22, 46, 49] and report the average recognition rate of the proposed method in Table 5. From Table 5, we can observe the performance gain of pairwise SVM over the other state-of-the-art methods on the four databases. The pairwise SVM achieves recognition rates of 99.86%, 99.82%, 99.44% and 99.57% on the USTB I, USTB II, IIT Delhi I and IIT Delhi II datasets, respectively. Furthermore, compared with the standard SVM and K-NN classifiers, pairwise SVM shows its superiority by a large margin, which verifies the effectiveness of the feature fusion strategy and the pairwise SVM classifier. We also report the DIR-FAR curves of the proposed method for Scenario 2. Due to the limited number of test images, we perform 20 random experiments, as in the ablative evaluation, to obtain Figure 11. Since the source codes of the competing methods are not available, Figure 11 only shows the DIR-FAR curves of the proposed method on the four datasets. One can see that the DIRs of our method are generally high across the datasets. Specifically, our method achieves a DIR@1% of 98.17%, 100.00%, 99.59% and 99.75% on the USTB I, USTB II, IIT Delhi I and IIT Delhi II datasets, respectively.
Fig. 11: The DIR-FAR curves on the four ear datasets for Scenario 2.
Scenario 3. Finally, we utilize three images per subject for training and the remaining images for testing on the IIT Delhi I and USTB II databases, and conduct comparison experiments with several state-of-the-art approaches. It should be noted that the competing methods adopt different settings and do not provide details on the training/testing partitions. For fair comparison, we use a protocol similar to that of Scenario 2 and perform three experiments based on three partitions of the training/test sets to obtain the average recognition rate; in each partition, three images per subject are randomly selected to constitute the training set. As presented in Table 6, the proposed method achieves an average recognition rate above 99.00% on both datasets and performs favorably against the state-of-the-art methods. Furthermore, we conduct 20 random experiments to obtain Figure 12, which shows the DIR-FAR curves of the proposed method on the two datasets. The proposed method achieves a DIR@1% of 99.94% and 99.87% on the USTB II and IIT Delhi I datasets, respectively.

Fig. 12: The DIR-FAR curves on two ear datasets for Scenario 3.
5 Conclusion
In this paper, we present an ear recognition method by fusing deep features from different layers for enhanced ear representation. With the detailed analyses on the features from different layers, we apply the discriminant correlation analysis (DCA) method to fuse different features. Finally, due to the lack of training samples of ear images, we propose to solve the ear identification problem by employing pairwise SVM. Experiments are performed on four public databases, i.e. USTB I, USTB II, IIT Delhi I, and IIT Delhi II. The results show that the proposed method can achieve promising recognition rates and performs favorably over the state-of-the-art ear recognition methods.
6 Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61671182, 61471146 and 51308096. The authors thank the anonymous reviewers for their valuable and constructive suggestions.
7 References
1 Jain, A.K., Ross, A., Pankanti, S.: 'Biometrics: A tool for information security', IEEE Transactions on Information Forensics and Security, 2006, 1, (2), pp. 125–143
2 Jain, A., Flynn, P., Ross, A.A.: 'Handbook of biometrics'. (Springer Science & Business Media, 2007)
3 Pflug, A., Busch, C.: 'Ear biometrics: A survey of detection, feature extraction and recognition methods', IET Biometrics, 2012, 1, (2), pp. 114–129
4 Iannarelli, A.V.: 'Ear identification'. (Paramont Publishing Company, 1989)
5 Kisku, D.R., Mehrotra, H., Gupta, P., Sing, J.K.: 'SIFT-based ear recognition by fusion of detected keypoints from color similarity slice regions', International Conference on Advances in Computational Tools for Engineering Applications, 2009, pp. 380–385
6 Dammak, M., Mejdoub, M., Amar, C.B.: 'A survey of extended methods to the bag of visual words for image categorization and retrieval', International Conference on Computer Vision Theory and Applications, 2014, 2, pp. 676–683
7 Ciregan, D., Meier, U., Schmidhuber, J.: 'Multi-column deep neural networks for image classification', Computer Vision and Pattern Recognition, 2012, pp. 3642–3649
8 Ulu, A., Ekenel, H.K.: 'Convolutional neural network-based representation for person re-identification', Signal Processing and Communication Application Conference, 2016, pp. 945–948
9 Kim, Y.: 'Convolutional neural networks for sentence classification', arXiv preprint arXiv:1408.5882, 2014
10 Lo, S.C.B., Chan, H.P., Lin, J.S., Li, H., Freedman, M.T., Mun, S.K.: 'Artificial convolution neural network for medical image pattern recognition', Neural Networks, 1995, 8, (7), pp. 1201–1214
11 Karahan, Ş., Akgül, Y.S.: 'Eye detection by using deep learning', Signal Processing and Communication Application Conference, 2016, pp. 2145–2148
12 Nirmala, D.E., Vaidehi, V.: 'Comparison of pixel-level and feature level image fusion methods', International Conference on Computing for Sustainable Global Development, 2015, pp. 743–748
13 Prabhakar, S., Jain, A.K.: 'Decision-level fusion in fingerprint verification', Pattern Recognition, 2002, 35, (4), pp. 861–874
14 Liu, C., Wechsler, H.: 'A shape- and texture-based enhanced Fisher classifier for face recognition', IEEE Transactions on Image Processing, 2001, 10, (4), pp. 598–608
15 Sun, Q.S., Zeng, S.G., Liu, Y., Heng, P.A., Xia, D.S.: 'A new method of feature fusion and its application in image recognition', Pattern Recognition, 2005, 38, (12), pp. 2437–2448
16 Yang, J., Yang, J.Y., Zhang, D., Lu, J.F.: 'Feature fusion: parallel strategy vs. serial strategy', Pattern Recognition, 2003, 36, (6), pp. 1369–1381
17 Pong, K.H., Lam, K.M.: 'Multi-resolution feature fusion for face recognition', Pattern Recognition, 2014, 47, (2), pp. 556–567
18 Haghighat, M., Abdel-Mottaleb, M., Alhalabi, W.: 'Discriminant correlation analysis: Real-time feature level fusion for multimodal biometric recognition', IEEE Transactions on Information Forensics and Security, 2016, 11, (9), pp. 1984–1996
19 Omara, I., Li, F., Zhang, H., Zuo, W.: 'A novel geometric feature extraction method for ear recognition', Expert Systems with Applications, 2016, 65, pp. 127–135
20 Ghoualmi, L., Draa, A., Chikhi, S.: 'An ear biometric system based on artificial bees and the scale invariant feature transform', Expert Systems with Applications, 2016, 57, pp. 49–61
21 Chan, T.S., Kumar, A.: 'Reliable ear identification using 2-D quadrature filters', Pattern Recognition Letters, 2012, 33, (14), pp. 1870–1881
22 Benzaoui, A., Hadid, A., Boukrouche, A.: 'Ear biometric recognition using local texture descriptors', Journal of Electronic Imaging, 2014, 23, (5), pp. 053008
23 Brunner, C., Fischer, A., Luig, K., Thies, T.: 'Pairwise support vector machines and their application to large scale problems', Journal of Machine Learning Research, 2012, 13, (Aug), pp. 2279–2292
24 Omara, I., Wu, X., Zhang, H., Du, Y., Zuo, W.: 'Learning pairwise SVM on deep features for ear recognition', IEEE/ACIS International Conference on Computer and Information Science (ICIS), 2017
25 Nanni, L., Lumini, A.: 'Fusion of color spaces for ear authentication', Pattern Recognition, 2009, 42, (9), pp. 1906–1913
26 Kumar, A., Chan, T.S.T.: 'Robust ear identification using sparse representation of local texture descriptors', Pattern Recognition, 2013, 46, (1), pp. 73–85
27 Boodoo-Jahangeer, N., Baichoo, S.: 'LBP-based ear recognition', International Conference on Bioinformatics and Bioengineering, 2013, pp. 1–4
28 Prakash, S., Gupta, P.: 'An efficient ear recognition technique invariant to illumination and pose', Telecommunication Systems, 2013, pp. 1–14
29 Krizhevsky, A., Sutskever, I., Hinton, G.E.: 'ImageNet classification with deep convolutional neural networks', Advances in Neural Information Processing Systems, 2012, pp. 1097–1105
30 Simonyan, K., Zisserman, A.: 'Very deep convolutional networks for large-scale image recognition', International Conference on Learning Representations, 2015
31 Parkhi, O.M., Vedaldi, A., Zisserman, A.: 'Deep face recognition', British Machine Vision Conference, 2015, p. 6
32 Minaee, S., Abdolrashidiy, A., Wang, Y.: 'An experimental study of deep convolutional features for iris recognition', Signal Processing in Medicine and Biology Symposium, 2016, pp. 1–6
33 Hafemann, L.G., Sabourin, R., Oliveira, L.S.: 'Writer-independent feature learning for offline signature verification using deep convolutional neural networks', International Joint Conference on Neural Networks, 2016, pp. 2576–2583
34 Radzi, S.A., Hani, M.K., Bakhteri, R.: 'Finger-vein biometric identification using convolutional neural network', Turkish Journal of Electrical Engineering & Computer Sciences, 2016, 24, (3), pp. 1863–1878
35 Zhang, H., Mu, Z.: 'Compound structure classifier system for ear recognition', IEEE International Conference on Automation and Logistics, 2008, pp. 2306–2309
36 Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: 'Return of the devil in the details: Delving deep into convolutional nets', British Machine Vision Conference, 2014
37 He, K., Zhang, X., Ren, S., Sun, J.: 'Deep residual learning for image recognition', IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778
38 Omara, I., Xiao, G., Amrani, M., Yan, Z., Zuo, W.: 'Deep features for efficient multi-biometric recognition with face and ear images', Ninth International Conference on Digital Image Processing, 2017, 10420, p. 104200D
39 Turk, M., Pentland, A.: 'Eigenfaces for recognition', Journal of Cognitive Neuroscience, 1991, 3, (1), pp. 71–86
40 Kessy, A., Lewin, A., Strimmer, K.: 'Optimal whitening and decorrelation', The American Statistician, 2017
41 Lacey, G., Taylor, G.W., Areibi, S.: 'Deep learning on FPGAs: Past, present, and future', arXiv preprint arXiv:1602.04283, 2016
42 Mu, Z.: 'USTB ear image database', 2005
43 Kumar, A.: 'IIT Delhi ear image database version 1.0', New Delhi, India, 2007
44 Chen, L., Mu, Z.: 'Partial data ear recognition from one sample per person', IEEE Transactions on Human-Machine Systems, 2016, 46, pp. 799–809
45 Mu, Z., Yuan, L., Xu, Z., Xi, D., Qi, S.: 'Shape and structural feature based ear recognition', Advances in Biometric Person Authentication, 2004, pp. 663–670
46 Benzaoui, A., Hezil, N., Boukrouche, A.: 'Identity recognition based on the external shape of the human ear', International Conference on Applied Research in Computer Science and Engineering (ICAR), 2015, pp. 1–5
47 Zhang, H.J., Mu, Z.C., Qu, W., Liu, L.M., Zhang, C.Y.: 'A novel approach for ear recognition based on ICA and RBF network', International Conference on Machine Learning and Cybernetics, 2005, pp. 4511–4515
48 Nosrati, M.S., Faez, K., Faradji, F.: 'Using 2D wavelet and principal component analysis for personal identification based on 2D ear structure', International Conference on Intelligent and Advanced Systems, 2007, pp. 616–620
49 Kumar, A., Wu, C.: 'Automated human identification using ear imaging', Pattern Recognition, 2012, 45, (3), pp. 956–968
50 Tariq, A., Anjum, M.A., Akram, M.U.: 'Personal identification using computerized human ear recognition system', International Conference on Computer Science and Network Technology, 2011, pp. 50–54
51 Guo, Y., Xu, Z.: 'Ear recognition using a new local matching approach', IEEE International Conference on Image Processing, 2008, pp. 289–292
52 Kong, A.: 'Palmprint identification based on generalization of IrisCode', University of Waterloo, 2007
53 Zhang, D., Zuo, W., Yue, F.: 'A comparative study of palmprint recognition algorithms', ACM Computing Surveys, 2012, 44, (1), p. 2
54 Phillips, P.J., Grother, P., Micheals, R.: 'Evaluation methods in face recognition', Handbook of Face Recognition, 2011, pp. 551–574
55 Liao, S., Jain, A.K., Li, S.Z.: 'Partial face recognition: Alignment-free approach', IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35, (5), pp. 1193–1205
56 Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: 'Web-scale training for face identification', Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2746–2754