SOFT-BIOMETRICS, PERFORMANCE AND PRESENTATION ATTACK DETECTION IN IRIS RECOGNITION

A Proposal

Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by Andrey Kuehlkamp

Kevin Bowyer, Director

Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
September 2017

CONTENTS

FIGURES
TABLES
CHAPTER 1: INTRODUCTION
  1.1 Biometrics
  1.2 Iris Recognition
  1.3 Motivation
CHAPTER 2: RELATED WORK
  2.1 Soft Biometrics
    2.1.1 Gender Prediction
  2.2 Search in Iris Databases
    2.2.1 1:N and 1:First search
    2.2.2 Indexing and Limitations
  2.3 Iris Spoofing
CHAPTER 3: GENDER FROM IRIS
  3.1 Using Neural Networks to Predict Gender from Iris Images
    3.1.1 Methods
      3.1.1.1 Person-Disjoint Train and Test
      3.1.1.2 The influence of the presence of cosmetics
      3.1.1.3 Occlusion masks
    3.1.2 Two approaches for Feature Extraction
      3.1.2.1 Data-Driven Features
      3.1.2.2 Hand-Crafted Features
    3.1.3 Neural Network Topologies
      3.1.3.1 Convolutional Neural Networks
    3.1.4 Discussion
  3.2 A new dataset to investigate Gender From Iris
    3.2.1 GFI-C
    3.2.2 Experiments and Results
      3.2.2.1 Dataset Analysis
      3.2.2.2 Classification Results
    3.2.3 Discussion
CHAPTER 4: 1:FIRST SEARCH
  4.1 A Basic Approach
    4.1.1 Dataset
    4.1.2 Method
    4.1.3 Results
    4.1.4 Discussion
  4.2 Experiments in Extended Scenarios
    4.2.1 Closed-set versus Open-set
    4.2.2 Matching output
    4.2.3 Dataset
      4.2.3.1 Simulating more individuals for increased gallery size
    4.2.4 Gallery and Probe Set Formation
    4.2.5 Threshold Selection
    4.2.6 Results
      4.2.6.1 Open Set Scenarios
      4.2.6.2 Gallery Permutations
    4.2.7 Discussion
CHAPTER 5: PROPOSED RESEARCH & TIMELINE
  5.1 Gender From Iris
  5.2 1:First Search
  5.3 Iris Presentation Attack Detection
  5.4 Timeline
BIBLIOGRAPHY

FIGURES

1.1 Four steps in the extraction of iris templates for recognition.
3.1 Box plots show accuracy distributions on a non-person-disjoint dataset, using an MLP classifier operating on raw pixel intensity. The green dotted line and shaded area represent the average accuracy and deviation on a person-disjoint dataset.
3.2 Segmentation and normalization process in two images of the same eye, with and without mascara.
3.3 Threshold on average image intensity can achieve 60% correct classification of males and FWC.
3.4 Gender classification accuracy for different training groups, using pixel intensity.
3.5 Accuracy using only the binary occlusion mask of the normalized iris. The dotted green line and shaded area denote the average accuracy and deviation for classifying normalized irises using LBP, and the dotted red line and shaded area are average accuracy and deviation using Gabor filtering. The box plots are accuracy distributions using simply the occlusion masks for the same normalized irises.
3.6 Accuracy based on pixel intensity for MLP and CNN classifiers. Average and standard deviation across 10 randomized train/test repetitions.
3.7 Classification accuracy distributions for hand-crafted features in an MLP classifier.
3.8 Accuracy of MLP classifier on gender prediction, using different features according to network topology. Average and standard deviation across 10 randomized repetitions.
3.9 Comparison of CNN with MLP results. Blue and black boxes represent accuracy using the entire eye image, respectively for CNN and MLP. Red, green and cyan boxes denote the accuracy using the normalized iris image.
3.10 Mean intensity distribution in both datasets. GFI presents a perceptible shift in the intensity of the Females With Cosmetics group.
3.11 Comparison of the occlusion ratio for both datasets: GFI has a slight, but statistically significant, higher occlusion ratio than GFI2.
3.12 Mean intensity distribution by image band (starting from the pupil boundary to the sclera boundary): in GFI the FWC group presents a growing shift to dark, consistent with cosmetics interference in the outer region of the iris. The same does not occur in GFI2.
3.13 Pupil and iris size and dilation ratio. There is no clear separation of the group distributions, suggesting these factors are not associated with gender prediction ability.
3.14 Age distribution in GFI and GFI-C. GFI-C has a marginally younger population than GFI.
3.15 Sharpness assessment of images in GFI and GFI-C. The FWC group has a distinct distribution from the other groups in both datasets.
3.16 Gender prediction accuracy on different datasets using VGG features and Linear SVM. The shaded areas show the standard deviation from the mean.
3.17 Accuracy comparison across different classifiers on both datasets. Results suggest that VGG+SVM is more robust to the influence of cosmetics than LBP+MLP.
4.1 Iris rotation tolerance limits for the experiments.
4.2 Example of false match: despite the evident segmentation error in 4.2a, the Hamming Distance to 4.2b was 0.298701, a score low enough to be considered a match.
4.3 Example of false non-match: 4.3a and 4.3b had a Hamming Distance of 0.423313, and were considered a non-match. The translucent ring that appears on the subject's contact lens was partially classified as an occlusion on 4.3a, which might have contributed to the high score.
4.4 FMR for 1:N and 1:First as gallery size increases.
4.5 FMR for varying HD and gallery size.
4.6 ROC curve for matching without eye rotation tolerance. The colors represent the gallery size.
4.7 Matching performance with ±3–14 rotational shifts. Note that the 1:First ROCs on the right have an unusual aspect. This occurs because the FMR increases and the TMR decreases proportionally with larger threshold values. As the FMR increases with the increase in the threshold, the curve leans to the right; at the same time, the drop in the TMR makes the curve go down. This effect is particularly accentuated with larger gallery sizes.
4.8 Comparison of HD distributions between original and artificial images.
4.9 CDF-based threshold selection. Observe that the cumulative distribution curves are very different in shape for IrisBee (top row) and VeriEye (bottom row): this happens because the matchers use dissimilarity and similarity scales, respectively.
4.10 True Positive Identification Rate comparison for IrisBee and VeriEye, using 1:N and 1:First search, over a range of gallery sizes. Higher means better.
4.11 FMR in closed-set scenario with different rotation tolerances. Higher means worse. While at more restrictive settings the difference between 1:N and 1:First is negligible, the problems in 1:First start to appear when a target accuracy is stricter than a certain limit.
4.12 FMR in open-set scenario with different rotation tolerances. The presence of unenrolled identities in the probe set poses a harder challenge, regardless of matcher or search method: even 1:N search can yield higher FMR if the target accuracy is lenient enough.
4.13 Overall IrisBee performance in open-set scenario.
4.14 Overall VeriEye performance in open-set scenario.
4.15 IrisBee mean performance in closed-set scenario, with 20 gallery permutations.
4.16 IrisBee mean performance in open-set scenario, with 20 gallery permutations.
4.17 VeriEye mean performance in closed-set scenario, with 20 gallery permutations.
4.18 VeriEye mean performance in open-set scenario, with 20 gallery permutations.
5.1 An example of Class Activation Maps in image classification. Predicted classes are shown along with the activation map that originated them. Source: Zhou et al. [60]
5.2 Meta-recognition based on Extreme Value Theory. While the threshold t0 would have falsely rejected the score denoted by the red dot, post-recognition analysis of the non-match score distribution reveals this score is at one of the extremes, and could be considered a match. Source: Scheirer et al. [51]
5.3 Proposed structure for a modular classifier for iris PAD.

TABLES

2.1 Overview of gender prediction from iris images.
3.1 MLP neural network topologies used. P is the number of input pixels.
4.1 Galleries and probe set sizes.
4.2 Average performance scores (in %) for thresholds between 0.26 and 0.35, with increments of 0.01, no rotation tolerance.
4.3 Conventional possible matching outputs.
4.4 Possible outputs for matching against a gallery in closed-set and open-set scenarios.
4.5 Empirically selected threshold for closed set with 14 rotations.
5.1 Proposed timetable for dissertation research and writing.

CHAPTER 1 INTRODUCTION

1.1 Biometrics

Humans frequently use body characteristics like face, voice and gait to recognize each other individually. It is an innate ability performed with ease by most people, and it allows us to carry out social activities by identifying acquaintances, friends and family. On a more formal level, identity management is a task of crucial importance in a large variety of activities like border crossing, access control to sensitive locations and shared resources. To perform identity management, it is necessary to associate a person with an identity.

According to Jain et al. [33], there are three ways to recognize an individual: 1) by knowledge the person has; 2) by objects the person possesses; and 3) by who the person is intrinsically. A typical example of the first type of recognition is the use of a password or Personal Identification Number (PIN) to grant access to a bank account. The person who wants access to the bank account needs to know the password in order to confirm that they actually are who they claim to be. The second case is typically used to access mailboxes: in order to get what is inside the mailbox, the person has to possess the corresponding key. The problem with these two forms of verification is that they are transferable. The third way to recognize an individual is more complex, because it requires verifying who that person in fact is. In order not to fall back on the previous cases, it is necessary to assess non-transferable physical characteristics of that person. This kind of measurement is known as biometric recognition.

Automatic biometric systems have become available only in the last 60 years, although the concept of biometric identification is much older than that. Among the earliest evidence of the use of biometric traits as a form of identification are the numerous handprints found in prehistoric cave paintings, which seem to have functioned as a signature of their authors. Another indication is the fact that Babylonian business transactions, registered on clay tablets, seem to have included fingerprints as a form of authentication [42]. After the industrial revolution, rapid urban growth reinforced the need for formal and accurate identification of people. To address this need, two main approaches emerged: the Bertillon system of body measurements, called anthropometrics, and the formal use of fingerprints by police departments. Towards the end of the 1800s, a method that allowed record retrieval based on fingerprint patterns and ridges was created by Francis Galton, and it allowed the popularization of fingerprint identification. It was only after the first half of the twentieth century, however, that truly automated systems started to be developed.

The emergence of these automated systems is clearly related to the appearance of computer systems. In the early 1960s, several biometric modalities, such as face, speech, fingerprint and signature, began to be studied for automation [42]. These systems continued to evolve gradually until the 1990s, when, along with the increase and popularization of computing power, automated biometric systems became widely adopted. In the following decade the foundations were mature enough to allow mass utilization of biometric identification in everyday systems. In summary, biometric traits are fairly stable, non-transferable and more difficult to forge or manipulate than other types of identification tokens. Technology has allowed these properties to be exploited in an automated way, fostering the development of a growing and exciting field.


1.2 Iris Recognition

One of the most successful biometric recognition implementations is iris recognition. It is based on captured images of the colored area surrounding the pupil, called the iris. Each iris contains a complex pattern composed of elements like crypts, freckles, filaments, furrows, pits, striations and rings. These texture details are what make the iris particularly useful for recognition [33].

Since its first live demonstration in 1993 [15], iris recognition has evolved and become one of the most well-known biometric modalities. The largest biometric database in the world, the Aadhaar program in India, has already collected the irises of 1.13 billion people for enrollment [2]. In 2016, Somaliland started to register voters using iris biometrics [29]. The reason for doing so is to prevent voting fraud, after authorities found a large number of duplicate registrations even with the use of facial and fingerprint recognition [50]. The decision was made after months of testing and preparation, helped by a team of academic researchers [7]. Since 2002, countries like the United Kingdom, Canada and Singapore have used iris biometric systems to perform border crossing checks on frequent travelers. Similarly, the United Arab Emirates (UAE) has employed an iris-based biometric system to keep track of banned travelers since 2001. The UAE system is known for performing approximately 14 billion IrisCode comparisons daily [16].

The idea of using iris texture as a biometric is usually attributed to Bertillon [3] (as cited in [5]), over 100 years ago. James Doggart [20] (as cited in [15]) suggested around 1949 that irises could be used in the same way as fingerprints. Inspired by Doggart's idea, American ophthalmologists Flom and Safir patented an automated iris biometric system in 1987, but it was not implemented. In 1992, a report by the Los Alamos National Laboratory confirmed the potential of the iris for verification and identification scenarios. The implementation of iris-based biometric recognition

only became practical in 1993, when John Daugman [18] described his approach in detail. This method was later patented [19], and ended up as a de facto standard for iris recognition for the next two decades.

The iris template extraction, as described in Daugman [12], can be split into four distinct steps: acquisition, segmentation, normalization and encoding (Fig. 1.1).

Acquisition – Naturally, this is the first step in the template extraction pipeline. The image is captured by a Charge-Coupled Device (CCD) camera under Near-Infrared (NIR) illumination. Some darker-colored irises absorb so much light in the visible spectrum that their patterns are hardly detectable. For this reason, NIR illumination is used in order to capture the iris patterns independently of their pigmentation. In general the iris images are 480 × 640 pixels in size, taken from a distance of no more than 1 meter, and they frame the entire eye socket. In such an image, the iris usually has a diameter of 200–280 pixels.

Figure 1.1. Four steps in the extraction of iris templates for recognition: (a) original image, (b) segmentation, (c) normalization, (d) template.


Segmentation – Since the captured image contains the whole periocular region, it is necessary to segment the iris region. Several methods can be used, but the most popular is to apply some form of circle detection to find the sclera and pupil boundaries. Daugman [12] proposes the use of an integro-differential operator to locate these circles, based on the intensity gradient present at the iris/sclera and iris/pupil borders. An additional task that can be performed in this step is the masking of pixels within the pupil and iris circles that correspond to occlusions, like eyelids or eyelashes.

Normalization – The computational handling of a circular image is not trivial. To simplify the process and allow the use of traditional image processing tools, the circular ring that contains the iris is "unwrapped" by performing a conversion from polar to cartesian coordinates. This process also adjusts for iris size variations, resulting in a size-invariant, rectangular image that can be easily processed.

Encoding – Given the normalized iris image, it is possible to extract its features. This is accomplished by 2D Gabor wavelet filtering of the image. The result of this filtering is the encoding of certain frequencies present in the complex iris texture into a matrix of 2,048 binary values, commonly referred to as the IrisCode.

After the IrisCode extraction, the comparison can be done by counting the agreements between the corresponding bits in two templates. This operation can be performed very efficiently through the Boolean Exclusive-OR (XOR) of the templates. The result is used to calculate a fractional Hamming Distance (HD) as the measure of dissimilarity between two irises: while two different irises will tend to disagree in about 50% of the IrisCode bits (random), two templates extracted from the same iris will tend to disagree consistently below that. Daugman's work showed that inferring whether two templates come from the same iris or not can be approached by looking at the level of agreement in their IrisCodes in a statistical context.
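To make the matching arithmetic concrete, the sketch below computes a fractional Hamming Distance between two binary IrisCodes using NumPy. The 2,048-bit templates, the all-valid occlusion masks and the 0.32 decision threshold are illustrative assumptions for this example, not values prescribed by any particular implementation.

    import numpy as np

    def fractional_hd(code_a, code_b, mask_a, mask_b):
        """Fraction of disagreeing bits, counted only where both masks are valid."""
        valid = mask_a & mask_b                    # bits usable in both templates
        disagree = (code_a ^ code_b) & valid       # XOR flags the differing bits
        n_valid = valid.sum()
        return 1.0 if n_valid == 0 else disagree.sum() / n_valid

    rng = np.random.default_rng(0)
    code_a = rng.integers(0, 2, 2048).astype(bool)  # toy 2,048-bit IrisCode
    code_b = code_a.copy()
    code_b[:100] ^= True                            # flip 100 bits to simulate noise
    mask = np.ones(2048, dtype=bool)                # assume no occlusion

    hd = fractional_hd(code_a, code_b, mask, mask)
    print(f"HD = {hd:.3f}")                         # about 0.049 for this toy pair
    print("match" if hd < 0.32 else "non-match")    # 0.32 is an illustrative threshold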


Several improvements to the process have been made since its proposal in 1993, but its essence is still the same. Daugman [14] wrote a summary of several changes that were later proposed to improve the accuracy and performance of iris recognition.

1.3 Motivation

Extensive surveys like [5] and [6] show that iris recognition is still a very active research subject. The activity in the research field and the popularization of iris recognition systems all over the world feed off each other, each stimulating the other. As these systems become more widely used, weaknesses and opportunities for improvement emerge, driving research to address these issues. This is the scope in which this work fits: the analysis of recently raised questions, in an attempt to provide a better understanding of these issues. Furthermore, concrete proposals are made in order to solve or mitigate the motivating issues. This work intends to make contributions to several distinct aspects of iris recognition: soft biometrics, performance and presentation attack detection.

Soft biometrics can complement strong biometric traits without placing additional strain on the system. One open question is the possibility of determining the gender of a person from properties of their iris texture. This topic was initially studied over a decade ago, but has not yet been convincingly answered. Despite the absence of anatomical or genetic evidence for such a link [38, 54], computational attempts to perform gender prediction have shown promising results, as seen in [36]. The present work intends to deepen the knowledge on gender from iris, contributing to settle the issue.

Iris recognition is being deployed in an increasing number of applications, and with larger and larger databases. The growth both in size and number of iris-based systems is likely to lead to performance issues. Although iris matching can be performed extremely rapidly, the need for optimization tends to

become more evident. In this sense, we analyze one search technique that is known to have been used in practice, but whose performance and accuracy have not been formally evaluated.

For reasons similar to those driving the need for speed optimization, robustness to malicious or unintentional attacks is another area in which advancements will be required. Current Presentation Attack Detection (PAD) techniques for iris systems can detect a specific type of attack or work with a specific type of sensor, but they do not generalize well. The number and sophistication of presentation attack attempts are expected to grow, so it would be important to integrate detection abilities in a way that simplifies their application. To this end, we expect to propose a deep-learning-based method to perform PAD in cross-sensor scenarios.


CHAPTER 2 RELATED WORK

2.1 Soft Biometrics

Most biometric recognition systems are based on a single trait, and they are called unimodal biometric systems. Since unimodal systems can present problems due to noisy data or low distinctiveness of the trait, more than one biometric trait (e.g. face and fingerprint) can be combined and used for recognition, resulting in multimodal biometric systems [32]. However, the additional complexity of adding a second biometric trait to improve recognition may not be justifiable. The use of additional information like age, weight, height or gender may help improve the performance of unimodal biometric systems without the overhead of processing an additional trait. As defined by Dantcheva et al. [11], "[s]oft biometric traits are physical, behavioral, or material accessories, which are associated with an individual, and which can be useful for recognizing an individual." Still according to them, these attributes can be automatically extracted and classified into human-understandable categories. The extraction of such ancillary information from biometric traits is known as soft biometrics.

2.1.1 Gender Prediction

Gender is one soft biometric attribute, and gender recognition has been explored using biometric traits such as faces, fingerprints, gait and irises. The earliest work on gender-from-iris [59] used a classifier based on decision trees, and reported an accuracy of about 75%. They extracted hand-crafted geometric and texture features from log-Gabor-filtered images in a dataset of over 57,000 images. The training and testing sets were not person-disjoint, which typically results in a higher estimated accuracy than can be expected for new persons.

TABLE 2.1: Overview of gender prediction from iris images.

| Authors | Classifier | Accuracy (%) | Features | Dataset Size | Person-Disjoint | Cross-Validation | Cosmetics | Breakdown by Gender |
| Thomas et al. (2007) [59] | Decision Tree | 75 | Gabor filtering + Hand-Crafted | 57,137 | No | 10f | No | No |
| Lagree et al. (2011) [37] | SVM | 47.67–62.17 | Gabor filtering + Hand-Crafted | 600 | Yes | 2, 5 and 10f | No | No |
| Bansal et al. (2012) [1] | SVM | 83.06 | Hand-Crafted + DWT | 300 | No | 10f | No | No |
| Tapia et al. (2014) [57] | SVM | 96.67 | LBP | 3,000 | No | 80/20 | No | Yes |
| Fairhurst et al. (2015) [24] | Various (individual and combined) | 49.61–89.74 | Geometric + Texture Hand-Crafted | 1,600 | Yes | 72/28 | No | No |
| Tapia et al. (2016) [58] | SVM | 91; 85.33 | IrisCode | 3,000; 3,000 (1) | No; Yes | 80/20 | No | Yes |
| Bobeldyk et al. (2016) [4] | SVM | 66 ± 2 (2) | BSIF | 3,314 | Yes | 60/40, 5x | No | Yes |
| Kuehlkamp et al. (2017) [36] | MLP/CNN | 66 ± 2.7 | Intensity, Gabor filtering, LBP | 3,000 | Yes | 80/20, 10x | Yes | Yes |

(1) The first set of 3,000 images was not person-disjoint, so the authors used another person-disjoint set of 3,000 images.
(2) The authors report higher accuracies, but using the entire eye image.

Later, [37] used a Support Vector Machine (SVM) classifier with features extracted using spot and line detectors and Laws' texture measures. They used a dataset of 600 images and a cross-validation protocol with 2, 5 and 10 folds, with person-disjoint partitions. They considered both race-from-iris and gender-from-iris, and their classification accuracy on gender-from-iris ranged from 47% to 62%. A similar approach was used by [1], who used the 2D Discrete Wavelet Transform (DWT) in combination with hand-crafted statistical features to extract texture information from the images. Using an SVM to classify the irises, they reported accuracy up to 83% on a small dataset of 300 images.

In the work of [57], using an SVM to classify Local Binary Pattern (LBP) features extracted from 3,000 iris images yielded an accuracy of 96.67%. This was for an 80/20 train/test split, on non-person-disjoint partitions. The same authors used a similar technique to perform gender classification based on the IrisCode used for identification in [58]. In this work, they performed evaluation on two different datasets: one was person-disjoint, while the other was not, and the reported accuracy changed considerably. The person-disjoint dataset, called the Gender-from-Iris (GFI) dataset, is available to the research community. In another study, [24] used an SVM in a combined consensus with other classifiers to achieve 81% accuracy on a person-disjoint dataset. They used a combination of geometric and texture features, selected via statistical methods, and a 72/28 training/testing split to prevent overfitting.

An overview of the techniques and results used so far is presented in Table 2.1. None of these works has looked systematically at the effect of cosmetics on the accuracy of predicting gender-from-iris. Most of the works do not use subject-disjoint training and testing, especially those reporting the highest accuracy. And these works report accuracy from a single random split into train-test data, rather than a mean over N random splits. Apart from [24], no other research has employed neural networks for this task.

2.2 Search in Iris Databases

Iris recognition, like other biometric modalities in general, offers two types of identity management functionality [33]:

Verification is the term employed when a user presents him or herself to the recognition system and claims to be a certain person, and the task of the system is to determine if the claim is true. In this case, the biometric template from the

user is compared to a single template in the database (one-to-one matching [41]). If the distance between the two templates meets a determined threshold, the claim is considered "genuine" and the user is accepted. If the distance is above that threshold, the user is considered an "impostor" and the claim is rejected.

Identification is the other type of functionality provided by biometric systems. This term is used when the user presents him or herself to the recognition system, but does not explicitly claim an identity. The system then has to compare the user's biometric template with the templates of potentially all the persons enrolled in the database (one-to-many matching [41]). The result of this process will be: a) the system "accepts" the user, and assumes that his or her identity is the one with the smallest below-threshold match out of all enrollments in the database; or b) the system "rejects" the user, indicating that the user is not enrolled in the database.

Within identification it is possible to distinguish closed-set and open-set identification tasks. In a closed set, the user is known to be enrolled in the database, and the system is responsible for determining his or her identity. On the other hand, when doing open-set identification, the system must, before trying to identify a user, determine whether he or she is enrolled in the database [41]. This work is concerned with one-to-many matching as used in an identification system and, particularly, with exploring the difference between two possible implementations of one-to-many search.

The matching procedure is a core part of every biometric identification or verification system. In this procedure, the system compares the biometric sample acquired from the user against previously stored templates and scores the level of similarity between them. According to a predetermined threshold, the system then makes a decision about the user: either it is a match or a non-match. Declaring a match means that the system accepts both biometric samples as originating from the same human source [41].
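As a minimal sketch of the two decision rules, the functions below assume a dissimilarity score (lower is better, such as the fractional Hamming Distance) and an illustrative threshold; the names and example scores are ours, not taken from any specific matcher.

    def verify(claimed_score, threshold=0.32):
        """One-to-one: accept the identity claim if the distance to the single
        claimed template meets the threshold."""
        return claimed_score <= threshold

    def identify(scores_by_identity, threshold=0.32):
        """One-to-many (open set): return the enrolled identity with the smallest
        below-threshold distance, or None to reject an unenrolled user."""
        best_id = min(scores_by_identity, key=scores_by_identity.get)
        return best_id if scores_by_identity[best_id] <= threshold else None

    # Distances of one probe against a three-identity gallery.
    scores = {"subj_017": 0.46, "subj_102": 0.29, "subj_233": 0.41}
    print(verify(scores["subj_102"]))   # True
    print(identify(scores))             # subj_102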


A biometric system may produce two types of errors: false matches and false non-matches. A false match occurs when two samples from different individuals are declared by the system as a match. A false non-match is when two samples from the same individual fail to be considered a match by the system [33].

Not every captured image of an iris has the same head tilt, camera angle and rotational torsion of the eye, which can cause it to be misinterpreted as a non-match. Iris matchers usually offer some tolerance to iris rotation, in the form of "best-of-M" comparisons: a comparison between a pair of iris codes is performed several times, over a range of relative rotations, and the best match is chosen as the distance score for that pair [13]. As an example, the NEXUS border-crossing program considers a range of 14 rotation values in the initial scan of the enrollment database, but the range is widened by an additional 28 rotation values if no match was found on the initial scan [9].
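The "best-of-M" tolerance can be sketched as below: the probe code is compared at several relative rotations, implemented here as circular shifts, and the smallest distance is kept. The single-bit shift granularity and the ±7 range are simplifying assumptions; real matchers typically shift in filter-band increments.

    import numpy as np

    def hd(a, b, valid):
        """Fractional Hamming Distance over mutually valid bits."""
        return ((a ^ b) & valid).sum() / valid.sum()

    def best_of_m(probe, enrolled, valid, max_shift=7):
        """Best score over relative rotations in [-max_shift, +max_shift]."""
        return min(hd(np.roll(probe, s), enrolled, valid)
                   for s in range(-max_shift, max_shift + 1))

    rng = np.random.default_rng(1)
    enrolled = rng.integers(0, 2, 2048).astype(bool)
    probe = np.roll(enrolled, 3)                       # same iris, rotated by 3 positions
    valid = np.ones(2048, dtype=bool)

    print(f"{hd(probe, enrolled, valid):.2f}")         # ~0.50 without rotation tolerance
    print(f"{best_of_m(probe, enrolled, valid):.2f}")  # 0.00 once the shift is undone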

2.2.1 1:N and 1:First search

In practice, the application of biometric identification may encounter restrictions when implementing one-to-many matching. If the enrollment list is large, it may be slow to sweep it entirely every time a user is presented to the system. The traditional approach for implementing one-to-many identification is the exhaustive 1-to-N (1:N) search, which is probably the only form of one-to-many matching to receive attention in the research literature.

To speed up the search process in one-to-many identification, a common approach is to perform a search known as 1-to-First (1:First), in which the system sweeps the enrollment database until it finds the first template for which the distance score is within a defined threshold, and declares a match [44]. This approach yields a lower number of comparisons on average, compared with the 1:N method. On the other hand, this technique may lead to a higher error rate, since when a match

pair is found the search is stopped, ignoring other potentially better matches. Other efforts have been made to improve search performance in iris databases. Rakvic et al. [47] proposed the parallelization of the algorithms involved in the iris recognition process, including template matching. Their parallelized version, although more efficient than a sequential CPU implementation at performing a single match, still has its overall performance directly tied to the size of the database. In another attempt to address the issue, Hao et al. [28] propose an approach based on Nearest Neighbor search to reduce the search range and thus improve performance. For the sake of clarity, this work will refer to one-to-many as the general identification procedure, while considering 1:N and 1:First as two different types of one-to-many matching.
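The difference between the two one-to-many strategies can be sketched as follows. Here dist() stands for any template comparison (for example, the best-of-M Hamming Distance above), the gallery is an unordered list of (identity, template) pairs, and the threshold is illustrative; neither function reflects a specific product's implementation.

    def search_1_to_n(probe, gallery, dist, threshold):
        """Exhaustive 1:N: scan every enrollment and keep the best (lowest) score,
        declaring a match only if that best score meets the threshold."""
        best_id, best_score = None, float("inf")
        for identity, template in gallery:
            score = dist(probe, template)
            if score < best_score:
                best_id, best_score = identity, score
        return best_id if best_score <= threshold else None

    def search_1_to_first(probe, gallery, dist, threshold):
        """1:First: stop at the first enrollment whose score meets the threshold,
        ignoring potentially better matches later in the gallery."""
        for identity, template in gallery:
            if dist(probe, template) <= threshold:
                return identity
        return None

Because 1:First commits to the first acceptable score, both its speed and its error rate depend on where acceptable matches happen to sit in the unordered gallery, which is what the experiments in Chapter 4 examine.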

2.2.2 Indexing and Limitations

Unlike numeric or lexicographic data, biometric samples do not have any natural ordering [49]. Therefore, attempts to index biometric databases are considerably hindered. The task of automatic iris identification is based on the comparison of a probe to an enrollment gallery. Since there is no order for the gallery samples, the obvious approach is to exhaustively scan the entire gallery and compare each enrolled iris to the probe.

Mukherjee and Ross [40] define the problem of iris identification in terms of comparing a query iris sample q with enrolled iris samples D = {d_1, d_2, ..., d_n}, in order to determine the identity y of the query sample. Each gallery sample d_j, j = 1, 2, ..., n, is associated with an identity y_j. Consequently, the computational complexity of the process is directly linked to the number of enrolled samples |D| = n in the gallery. Matching iris samples based on Daugman's approach (IrisCodes) is an operation that involves the accumulation of bitwise XOR operations between the samples,

and can be done quite efficiently. Despite this fact, the computational complexity of the task grows linearly with gallery size. Additionally, the complexity of de-duplication grows quadratically with the size of the database, as noted by Proença [45]. With many iris recognition deployments already reaching millions of enrollment samples, a tendency to grow in popularity, and the need to process less-than-ideal iris images [40], one can expect the computational demands to increase. Thus, the need for optimizations and alternative approaches to reduce the computational requirements is justified. Nevertheless, only a small number of proposed indexing methods are found in the literature [45].

The simplest approach to reduce the computational demand of an iris search is implemented in 1:First. Instead of scanning the entire gallery and finding the best possible match for a sample, the method proposes to scan until a sufficiently good match is found. If the samples in the gallery are randomly ordered, it is likely that acceptable matches will be found before the end of the list, thus reducing the search time. However, this approach raises some questions which have not been answered so far:

1. How good does a match have to be to be considered acceptable?
2. Is an acceptable match always a correct match (i.e., the probe and the enrollment samples correspond to the same identity)?
3. Empirically, does 1:First perform faster than 1:N? If so, how much faster?
4. Does 1:First result in the same identification accuracy as 1:N?

Bowyer et al. [6] surveyed several different attempts to improve the matching speed of iris codes and reduce the time required for database scans. These works report different degrees of success, but there seems to be no clear trend in performance for iris matching and searching. Furthermore, to the best of our knowledge, [35] is the only work to have evaluated the 1:First search technique.

2.3 Iris Spoofing

The evolving field of biometrics has drawn interest not only from the research community, but from society as a whole. As could be expected, malicious attempts to circumvent the rules of biometric systems are not rare. These attempts are usually denominated attacks [48] (as cited in [23]). In such attempts, the attacker tries to use some kind of fabricated artifact (e.g. rubber fingers or a printed face/iris) to mimic the biometric features of a genuine user. This practice, commonly called spoofing, has motivated the emergence of a new research area: "biometric anti-spoofing" [26].

Erdogmus and Marcel [23] classify attacks into two types: indirect, which are usually performed from within the system, by cyber-criminal hackers or even insiders, and direct, which are the attacks performed at the sensor level. Since indirect attacks are in many ways similar to the digital attacks that occur in any type of computer system, biometric anti-spoofing research focuses on direct attacks. There are three main categories into which direct attacks can be classified: obfuscation is when an impostor tries to change his biometric characteristic to prevent the system from performing identification; zero-effort attacks occur when the attacker claims to be an authorized user, but does not try to modify his biometric characteristics; finally, spoofing is when the attacker presents a counterfeit biometric template of an authorized user.

Formally known as Presentation Attack Detection or simply PAD, this line of research studies the susceptibility of all biometric modalities to such malicious attacks. More specifically, researchers try to develop countermeasures that enable biometric systems to detect fake samples and automatically reject them. A large number of approaches have already been used to perform PAD, and it is possible to arrange them into three groups, according to the mechanism they employ [23]:

• Inherent features: authentic biometric samples have certain intrinsic visual or physical characteristics which can be used to distinguish them from fake samples. Among other features, it is possible to use color, shape, density, elasticity, electrical conductivity and electromagnetic absorbance.

• Aliveness detection: another way to avoid capturing forged biometric traits is to ensure the presenter is a living person. Several voluntary or involuntary signals can be used to detect aliveness: pulse, blood pressure, pupillary light reflex or even face movements.

• Forgery detection: unlike the previous methods, the third group of PAD approaches does not prevent the sensor from capturing a forged sample. Instead, it consists of analyzing the sample after its acquisition, looking for specific cues that are found exclusively in real or in fake samples, and using those cues to classify it as legitimate or not.

Since iris biometrics is more recent than other modalities, it also has a shorter history of studies of spoofing attacks. Galbally and Gomez-Barrero [26] conducted a survey and concluded that most iris attacks can be categorized as one of three types: 1) photo attacks, in which the attacker presents a photo of a genuine iris to the sensor; 2) contact lens attacks, in which the pattern of a genuine iris is reproduced on a contact lens worn by the attacker during presentation; and 3) artificial eye attacks, in which an artificial eye made of plastic or glass is used for presentation.

Most PAD approaches for iris are specifically engineered for a particular type of attack, or even a particular data set, but perform poorly when presented with different situations. In other words, most PAD methods are trained for a specific type of fake sample, and fail to generalize to other types of spoofed samples. In the last couple of years the research community has become increasingly aware of this problem, and several works have tried to address it.

In an attempt to create a robust and comprehensive method to prevent iris spoofing, Sun and Tan [55] presented a software method to assess samples after capture. They evaluate spoofing detection against three types of attack: printed irises, contact lenses and prosthetic eyes. Four different texture analysis methods are used: Gray Level Co-occurrence Matrices (GLCMs), Iris Texton, selected-LBP and weighted-LBP. The reported overall EER for these methods was between 0.8% and 6.5%, but the authors recognize the difficulty of identifying all spoofing attacks using software-level techniques, and recommend combining them with sensor-level liveness detection to obtain better results. Raghavendra and Busch [46] used a scheme based on multiscale BSIF to extract features and a Linear SVM to classify real and fake images. They created a database of 3,300 images of normal and artifact iris samples. More recently, Kohli et al. [34] proposed the fusion of Zernike moments and LBP with variance, classified by an Artificial Neural Network (ANN). The use of structural and textural features aims to detect different types of spoofing attacks in real-world scenarios. They evaluated their method using the Combined Spoofing Database (CSD), which was obtained by merging images of real and fake irises from at least 5 datasets, including different types of spoofed irises (printed, textured contact lenses and synthetically generated). They report an accuracy of 82.2% on this dataset, and acknowledge the need for improvement in the detection of multiple kinds of attacks.

Sequeira et al. [52] propose an alternative approach to the way attacks are usually considered. Instead of training a classifier on negative (real) and positive (fake) samples of a few different types, they suggest training the classifier only on the negative samples. This way, instead of learning to distinguish between negative samples and a few specific types of positive samples, the system should be able to separate real eyes from all (or at least most) kinds of forgeries. The proposed method was evaluated on the Visible Spectrum Iris Artefact (VSIA) database, which contains five different types of iris artefacts in addition to the real irises. The results obtained with the one-class classifier were in general inferior to those of binary classifiers, illustrating how the latter can be overly optimistic. There is a tendency towards the creation of PAD techniques that are more effective and, at the same time, more broadly applicable. Recent research has shown that this goal is yet to be achieved.
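The one-class idea of Sequeira et al. [52] can be illustrated with a generic outlier detector trained only on features of genuine irises. The feature vectors below are random placeholders, and the OneClassSVM hyperparameters are illustrative assumptions rather than the settings used in their work.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_real = rng.normal(0.0, 1.0, size=(500, 64))    # texture features of genuine irises only

    # Train on the negative (real) class only; no spoof samples are seen during training.
    detector = make_pipeline(StandardScaler(),
                             OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"))
    detector.fit(X_real)

    # At test time, +1 means "looks like a genuine iris" and -1 flags a possible
    # artefact, regardless of which kind of presentation attack produced it.
    X_probe = rng.normal(1.0, 1.5, size=(5, 64))     # placeholder probe features
    print(detector.predict(X_probe))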


CHAPTER 3 GENDER FROM IRIS

3.1 Using Neural Networks to Predict Gender from Iris Images

Several attempts have already been made to predict gender from iris images, with varying results. In this section, we describe our attempt to approach the problem, which was published in Kuehlkamp et al. [36].

3.1.1 Methods

We use the "Gender from Iris" (GFI) dataset (1) used in [58], which to our knowledge is the only publicly available dataset for this problem. It consists of 1,500 left-eye and 1,500 right-eye images, 3,000 in total, representing 750 male and 750 female subjects. The 480 × 640 near-infrared images were obtained with an LG 4000 iris sensor.

Previous work generally reported accuracy based on a single random split of the data into train and test. The problem with this is that a single partitioning of the data into train and test can easily give an "optimistic" estimate of true accuracy. For this reason, in our experiments, a basic trial is a random 80/20 split into train and test data, and reported accuracy is averaged over ten trials. Each trial uses person-disjoint training and testing. With this approach, we expect to obtain a truer estimate of accuracy.

(1) https://sites.google.com/a/nd.edu/public-cvrl/data-sets

The iris images were processed using IrisBee [39] to segment and normalize the iris region. Normalized iris images were stored at different resolutions: 40 × 240, 20 × 240, 10 × 240, 5 × 120, 3 × 60 and 2 × 30 pixels. As a result of the segmentation, a mask is generated for each image, marking where the iris texture is occluded, usually by eyelids or eyelashes. In the experiments that used raw pixel intensities as the features, the normalized iris images were used directly as the inputs of the classifier, so the sizes of the feature vectors were 9,600, 4,800, 2,400, 600, 180 and 60, respectively. After training on a portion of the images, we use the test set to perform the evaluation, based on a simple criterion: given an unlabeled normalized iris, can we correctly predict the subject's gender? Two main feature extraction techniques were explored: data-driven features using raw pixel intensity, and hand-crafted features using Gabor filtering and LBP. A more detailed description of these feature extraction approaches is given in Section 3.1.2. Classification experiments were performed using MLP neural networks and CNNs. The details about the topology of the networks are described in Section 3.1.3.
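A sketch of this evaluation protocol is given below, assuming the flattened normalized-iris feature vectors X, gender labels y and per-image subject identifiers are available as arrays; GroupShuffleSplit keeps every subject entirely in either the train or the test partition, and accuracy is averaged over ten random 80/20 splits. The data and the MLP topology here are placeholders.

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit
    from sklearn.neural_network import MLPClassifier

    def person_disjoint_accuracy(X, y, subject_ids, n_trials=10, seed=0):
        """Mean/std of test accuracy over random 80/20 person-disjoint splits."""
        splitter = GroupShuffleSplit(n_splits=n_trials, test_size=0.2, random_state=seed)
        accs = []
        for train_idx, test_idx in splitter.split(X, y, groups=subject_ids):
            clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
            clf.fit(X[train_idx], y[train_idx])
            accs.append(clf.score(X[test_idx], y[test_idx]))
        return np.mean(accs), np.std(accs)

    # Toy usage with random data standing in for 2 x 30 normalized irises.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 60))
    subject_ids = np.repeat(np.arange(100), 2)       # two images per subject
    y = (subject_ids % 2).astype(int)                # placeholder gender labels
    print(person_disjoint_accuracy(X, y, subject_ids))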

3.1.1.1 Person-Disjoint Train and Test

We performed the same experiment on the person-disjoint GFI dataset, and on a previous version of that dataset that is not person-disjoint. For the GFI dataset, there is one image per iris, so the training and testing are necessarily person-disjoint. For the second dataset, there are a varying number of images per iris, of a smaller number of different irises, so the training and testing are not person-disjoint. For both sets of results, accuracy is averaged over 10 trials, with each trial using a random 80/20 split for train/test data.

The estimated accuracy using the subject-disjoint training and testing enforced by the GFI dataset is 61% ± 2.9. The estimated accuracy with the non-person-disjoint training and testing allowed by the other dataset with multiple images per iris is 77% ± 2.6. This is an average over ten trials; Figure 3.1 shows that a single non-person-disjoint trial could easily result in an estimated accuracy of 100%. The higher estimated accuracy for the non-person-disjoint train/test apparently results from the classifier learning subject-specific features, rather than generic gender-related texture features.

Figure 3.1. Box plots show accuracy distributions on a non-person-disjoint dataset, using an MLP classifier operating on raw pixel intensity. The green dotted line and shaded area represent the average accuracy and deviation on a person-disjoint dataset.

This experiment makes the point that it is impossible to meaningfully compare non-person-disjoint results with person-disjoint results. Higher (but optimistic) accuracies are reported for works using a non-subject-disjoint methodology, and lower (but more realistic) accuracies are reported using a subject-disjoint methodology. Also, in general, accuracies are reported for a single split of the data. A more useful accuracy estimate is computed over N trials using random person-disjoint splits of the data.

3.1.1.2 The influence of the presence of cosmetics

Mascara causes the eyelashes to appear thicker and darker in the iris image. Figure 3.2 shows a female eye with and without mascara. The use of eye makeup has been shown to affect iris recognition accuracy [22]. The basic mechanism is that if eyelash segmentation is not perfect, the segmented iris region may include some eyelash occlusion. To the degree that eyelash occlusion is present in the iris region, the use of mascara will generally increase the magnitude of the artifact in the texture computation. The same effect can also happen with other types of makeup like eyeliner, although eyeliner is applied to the eyelid instead of the eyelashes.

To investigate how mascara might affect gender-from-iris results, we reviewed the GFI dataset and annotated which images show evidence of mascara or eyeliner. Just over 60% of the female iris images show visible evidence of cosmetics, compared to 0% of the male iris images. The annotation allowed us to perform experiments using three categories of images: Male, Female With Cosmetics (FWC) and Female with No Cosmetics (FNC).

One simple observation is that the average image intensity for FWC is darker than for FNC or for Males (Fig. 3.3). This is true whether one considers the image as a whole, or only the segmented iris region. For Males and FNC, the distributions of average image intensity are almost identical; see Fig. 3.3a and 3.3b. For Males and FWC, there is a noticeable separation between the distributions; see Fig. 3.3c and 3.3d. Based on this separation, we could apply a simple threshold and achieve better than 60% accuracy distinguishing Males from FWC (EER of about 37%). However, a similar threshold for Males and FNC results in only about 50% accuracy.
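The thresholding experiment can be sketched as follows: each image is reduced to its mean intensity, and the cut-off that best separates Males from FWC is chosen on training data. The intensity distributions below are synthetic stand-ins chosen only to mimic the kind of overlap seen in Fig. 3.3, not values measured on GFI.

    import numpy as np

    def best_threshold(mean_intensity, is_fwc):
        """Brute-force the intensity cut-off that best separates FWC (darker)
        from Male images on the given data."""
        candidates = np.unique(mean_intensity)
        accuracy = [np.mean((mean_intensity < t) == is_fwc) for t in candidates]
        return candidates[int(np.argmax(accuracy))]

    rng = np.random.default_rng(0)
    male = rng.normal(120, 15, 500)     # mean pixel intensities of male images (synthetic)
    fwc = rng.normal(110, 15, 500)      # females with cosmetics: slightly darker (synthetic)
    intensity = np.concatenate([male, fwc])
    labels = np.concatenate([np.zeros(500, bool), np.ones(500, bool)])

    t = best_threshold(intensity, labels)
    acc = np.mean((intensity < t) == labels)
    print(f"threshold = {t:.1f}, accuracy = {acc:.2f}")   # around 0.6 for this overlap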

Figure 3.2. Segmentation and normalization process in two images of the same eye, with and without mascara: (a) eye without mascara, (b) eye with mascara, (c) segmented eye without mascara, (d) segmented eye with mascara, (e) normalized eye without mascara, (f) normalized eye with mascara, (g) resized normalized eye without mascara, (h) resized normalized eye with mascara.

This experiment shows how the presence of mascara can potentially make the gender-from-iris problem appear easier to solve than it is in reality.

We also trained MLP networks to classify gender-from-iris. We considered both using the whole eye image and using only the normalized iris region. We also considered training with and without images containing mascara. The results are summarized in Figure 3.4. When training with the full dataset (Males, FNC and FWC), the accuracy achieved with the whole image is greater than the accuracy achieved with the iris region alone. Also, the accuracy achieved is highest for the FWC subgroup, and lowest for the FNC subgroup. The trained MLP is apparently able to use the presence of mascara to correctly classify a higher fraction of the females in the FWC subgroup, at the expense of lower classification accuracy for the FNC subgroup.

Figure 3.3. Threshold on average image intensity can achieve 60% correct classification of males and FWC: (a) Males and Females with No Cosmetics, whole eye images; (b) Males and Females with No Cosmetics, segmented irises; (c) Males and Females With Cosmetics, whole eye images; (d) Males and Females With Cosmetics, segmented irises.

23

1

1 Males FNC FWC Overall

Accuracy

0.9

1 Males FNC FWC Overall

0.9

0.8

0.8

0.8

0.7

0.7

0.7

0.6

0.6

0.6

0.5

0.5

0.5

0.4

0.4

0.4

0.3

0.3

Whole Eye

Normalized Iris

(a) Training on All subjects

Whole Eye

Normalized Iris

(b) Training on Males+FNC

Males FNC FWC Overall

0.9

0.3

Whole Eye

Normalized Iris

(c) Training on Males+FWC

Figure 3.4. Gender classification accuracy for different training groups, using pixel intensity.

the degree to which the classifier learns gender from iris texture versus gender from mascara. Future research on gender-from-iris should use datasets that include annotations for the presence of mascara, and new mascara-free datasets are needed.

3.1.1.3 Occlusion masks

Eyelids and eyelashes frequently occlude portions of the iris. Ideally, the segmentation step would result in these occlusions becoming part of the "mask" for the image. Results in the previous section indicate that eyelash occlusion is generally not perfectly segmented. It appears that mascara causes the "noise" resulting from un-masked eyelash occlusion to become a feature that can be correlated with gender. If this is the case, mascara may also cause more eyelash occlusion to be identified and segmented (Fig. 3.2). In that case, the size and shape of the masked region would be a feature correlated with gender.

In order to determine whether the occlusion mask contains gender-related information, we performed an experiment in which the only information given to the MLP classifier is the (binary) occlusion mask. Figure 3.5 shows the result of this experiment. Despite the fact that the MLP has no access to any iris texture information, the accuracy achieved is similar to that achieved on the iris images.

Figure 3.5. Accuracy using only the binary occlusion mask of the normalized iris. The dotted green line and shaded area denote the average accuracy and deviation for classifying normalized irises using LBP, and the dotted red line and shaded area are average accuracy and deviation using Gabor filtering. The box plots are accuracy distribution using simply the occlusion masks for the same normalized irises.

The results of this experiment suggest that there are two paths by which mascara can make it easier to identify female iris images. To the degree that eyelash occlusion is not well segmented, the eyelash occlusion that contaminates the iris texture will be darker with mascara than it is without. To the degree that mascara makes it easier to segment more of the eyelash occlusion, the masked area of the iris will be larger. By whichever path, when high gender-from-iris accuracy is found using a dataset in which many women wear mascara, it is difficult to know if the accuracy is truly due to gender-related patterns in the iris texture, or simply due to the presence of mascara.

3.1.2 Two approaches for Feature Extraction

Approaches explored to extract discriminative features from the normalized iris images include hand-crafted features (e.g., Gabor filtering, LBP) and data-driven features, in which the raw pixel intensity is fed into neural networks that may "learn" features. All the experiments followed the same methodology described in Section 3.1.1.

3.1.2.1 Data-Driven Features

Neural networks are an example of a classifier that can learn features from raw data. Data-driven features are “learned” from a dataset through a training procedure, and are dependent on the characteristics of the data and the classification goal. Here we present results of this approach, obtained through MLP and CNN classifiers. Details on the implementation of these networks are in Section 3.1.3. Pixel intensity is the simplest feature. The pixel values of the masked, normalized image are fed directly to the neural network. Despite no explicit texture information being given to the network, the average accuracy of this approach was approximately 60%. This accuracy is similar to what could be achieved using simple intensity thresholding on the images, as seen in Section 3.1.1.2. This suggests that in this instance the neural network may be learning to predict gender based on a measure of average pixel intensity, or some other feature that is no more powerful. Figure 3.6 shows a plot of the average accuracy obtained by this technique across different image resolutions. It is worth observing that at low resolutions like 2 × 30 and 3 × 60 the images contain very little texture information because of the averaging of pixels.

Figure 3.6. Accuracy based on pixel intensity for MLP and CNN classifiers. Average and standard deviation across 10 randomized train/test repetitions.
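To illustrate the general shape of this pixel-intensity experiment, the sketch below flattens the masked, normalized iris images and feeds them to a small MLP over several person-disjoint train/test splits. It is a simplified stand-in using scikit-learn rather than our actual networks (described in Section 3.1.3), and the variables X, y and subject_ids are hypothetical placeholders.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import GroupShuffleSplit

    # X: (n_images, rows*cols) flattened normalized-iris intensities in [0, 1]
    # y: gender labels (0/1); subject_ids: used to keep splits person-disjoint
    def pixel_intensity_mlp(X, y, subject_ids, n_trials=10):
        accuracies = []
        splitter = GroupShuffleSplit(n_splits=n_trials, test_size=0.2)
        for train_idx, test_idx in splitter.split(X, y, groups=subject_ids):
            clf = MLPClassifier(hidden_layer_sizes=(int(1.5 * X.shape[1]), 20),
                                activation='tanh', max_iter=500)
            clf.fit(X[train_idx], y[train_idx])
            accuracies.append(clf.score(X[test_idx], y[test_idx]))
        return np.mean(accuracies), np.std(accuracies)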

3.1.2.2 Hand-Crafted Features

Gabor filtering and LBP are popular examples of hand-crafted feature extraction techniques. Gabor filtering is done as part of the standard approach to creating the “iris code” [12, 17]. In our experiments here, 1-dimensional Gabor filtering was performed for each row of the normalized iris. We chose to explore a range of wavelengths similar to those used for iris recognition. LBP has been used in previous work on gender-from-iris [57].

For Gabor-filtered iris images, the average accuracy was 57% ± 3 across all wavelengths, and there was no significant difference between the wavelengths considered. The fact that Gabor filtering resulted in worse classification than pixel intensity may seem surprising, but there is a possible explanation. Gabor filtering highlights the local occurrence of certain frequencies in the image by maximizing its response to these frequencies, while minimizing the response to other frequencies. If these low-response frequencies are related to features like occlusions or mascara, it makes sense that their attenuation has a negative effect on accuracy. As shown in Section 3.1.1.2, the presence of eye cosmetics or even occlusion masks may artificially enhance gender classification accuracy, and their removal makes the problem harder. So these results are consistent with the idea that a significant part of the information used for gender classification may not come from the iris texture. It is also important to mention that this work was limited to testing a certain range of parameters, based on those used for iris recognition. Since the main objective of iris recognition is to maximize the distinction between individual subjects and attenuate all non-person-specific features (such as gender, race, eye color, etc.), these parameters may not be the most appropriate for gender classification.

Local Binary Patterns (LBP) is a well-known method for texture analysis [43, 27]. We took some of the same LBP variations and parameters used in [57], and used MLP neural networks to perform gender prediction. In general, the best performances were achieved by uniform patterns and their variations (ULBP, CULBP-Mag and CULBP-Sign). ULBP histograms, with and without patch overlapping, had the highest accuracy, with an average of 66%.

Figure 3.7 shows an overall comparison between the three different feature extraction techniques. Gabor filtering had the worst results, with an average accuracy a little above 58%. In this graph, LBP extraction is divided into two categories because of the significant performance difference between them. LBP images yield better accuracy than Gabor filtering or pixel intensity, but still well below concatenated LBP histograms.
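To make the two hand-crafted pipelines concrete, the listing below sketches a plausible implementation of the row-wise 1-D Gabor filtering and of uniform-LBP histogram extraction on a normalized iris. The specific wavelength, LBP radius and patch size are illustrative placeholders, not the exact parameters used in our experiments.

    import numpy as np
    from skimage.feature import local_binary_pattern

    def gabor_1d_rows(norm_iris, wavelength, sigma=None):
        """Convolve each row of the normalized iris with a 1-D complex Gabor kernel
        and return the magnitude of the response."""
        sigma = sigma or 0.5 * wavelength
        half = int(3 * sigma)
        x = np.arange(-half, half + 1)
        kernel = np.exp(-x**2 / (2 * sigma**2)) * np.exp(2j * np.pi * x / wavelength)
        rows = [np.convolve(row, kernel, mode='same') for row in norm_iris.astype(float)]
        return np.abs(np.array(rows))

    def ulbp_histograms(norm_iris, n_points=8, radius=1, patch=(16, 64)):
        """Uniform-LBP codes, histogrammed per patch and concatenated into one vector."""
        codes = local_binary_pattern(norm_iris, n_points, radius, method='uniform')
        n_bins = n_points + 2                      # uniform patterns + one non-uniform bin
        ph, pw = patch
        hists = []
        for r in range(0, codes.shape[0] - ph + 1, ph):
            for c in range(0, codes.shape[1] - pw + 1, pw):
                block = codes[r:r + ph, c:c + pw]
                hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
                hists.append(hist / block.size)
        return np.concatenate(hists)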

3.1.3 Neural Network Topologies

It is difficult to characterize the specific geometric or texture features that can be used to distinguish male from female irises. Thus, we decided to use an approach


Figure 3.7. Classification accuracy distributions for hand-crafted features in a MLP classifier.

based on neural networks, so that they could learn the features best suited to this classification. The first portion of these experiments consists of an exploratory attempt to classify gender, training arbitrarily-sized MLP neural networks using backpropagation. As a rule of thumb for structuring the networks, all of them had a first hidden layer of 1.5 × P neurons, where P is the number of input features. The subsequent layers of each network were defined as shown in Table 3.1. For example, for a 10 × 240 image, the first network was configured as 3600 × 20 × 1, the second as 3600 × 40 × 1, and so on. In the cases where resolutions higher than 20 × 240 were used, the size of the MLP had to be reduced due to memory limitations. In these cases, we limited the size of Layer 1 to 5,000 neurons.


TABLE 3.1
MLP neural network topologies used. P is the number of input pixels.

Layer 1    Layer 2    Layer 3    Layer 4    Output
P × 1.5       20         -          -         1
P × 1.5       40         -          -         1
P × 1.5      300         40         -         1
P × 1.5      300         80         -         1
P × 1.5      600         80         -         1
P × 1.5      300         80         20        1

The activation function used in each hidden layer of the network was the hyperbolic tangent; the output layer used a sigmoid activation, in order to produce an output between 0 and 1 corresponding to the gender. Network topology, within the range of options explored here, seems to have very little effect on the classification accuracy. Figure 3.8 shows how little variation occurs across different topologies with different types of image features. These results also emphasize that LBP features perform better than raw intensity or Gabor features.
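A minimal sketch of how these topologies could be instantiated is shown below, using Keras purely as an illustrative framework (our own implementation details may differ): the first hidden layer has 1.5 × P neurons (capped at 5,000 for large inputs), the remaining layers follow one row of Table 3.1, hidden layers use the hyperbolic tangent, and the output is a single sigmoid unit. The optimizer choice is an assumption, since only backpropagation training is specified above.

    from tensorflow import keras
    from tensorflow.keras import layers

    # Hidden layers after the first (1.5 x P) layer, one tuple per row of Table 3.1.
    TOPOLOGIES = [(20,), (40,), (300, 40), (300, 80), (600, 80), (300, 80, 20)]

    def build_mlp(n_pixels, extra_layers, max_first_layer=5000):
        """MLP with a 1.5*P first hidden layer, tanh hidden units and a sigmoid output."""
        first = min(int(1.5 * n_pixels), max_first_layer)
        model = keras.Sequential()
        model.add(layers.InputLayer(input_shape=(n_pixels,)))
        model.add(layers.Dense(first, activation='tanh'))
        for units in extra_layers:
            model.add(layers.Dense(units, activation='tanh'))
        model.add(layers.Dense(1, activation='sigmoid'))
        model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # Example: the 3600 x 300 x 80 x 1 network for a 10 x 240 normalized iris.
    model = build_mlp(10 * 240, TOPOLOGIES[3])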

3.1.3.1 Convolutional Neural Networks

We also experimented with a CNN architecture. These architectures have seen great progress in prominent image recognition benchmarks [53, 56], and their success in N-way image classification makes them promising for binary image classification


Figure 3.8. Accuracy of MLP classifier on Gender prediction, using different features according to network topology. Average and standard deviation across 10 randomized repetitions.


Figure 3.9: Comparison of CNN with MLP results. Blue and black boxes represent accuracy using the entire eye image, respectively for CNN and MLP. Red, green and cyan boxes denote the accuracy using the normalized iris image.


as well. For the purposes of this paper, a CNN was used to classify gender based on two inputs: the full image and the segmented iris image with black occlusion masks (Figure 3.9, blue and red plots). While the networks described in [53] and [56] are extremely large, experimentation and the relative difficulty of the task (1000-way multi-scale classification vs. 2-way single-scale classification) led to the conclusion that a smaller architecture would suffice in this environment. The network used consists of 3 sets of CNN layers, followed by 2 fully-connected (FC) layers and a softmax output. Each CNN set consisted of a convolutional layer with a (4, 4) kernel and a (1, 1) stride, followed by a max pooling layer with a (2, 2) kernel and a (2, 2) stride. The number of features in each CNN layer was 16, 32 and 64 respectively, and the number of neurons in the FC layers was 1024 and 1536. Each neuron in the CNN and FC layers used the activation function max(0, X), commonly known as a Rectified Linear Unit, or ReLU activation. Like before, GFI data was randomly split into 80/20 person-disjoint subsets for training/testing, and the network trained on 2,500 batches of 32 images before testing in all cases. The training was carried out separately for left and right eyes on three different resolutions: 95 × 120, 190 × 240 and 285 × 360 for the entire eyes, and 40 × 240, 80 × 480 and 120 × 720 for normalized irises. Twenty randomized trials for each resolution and eye were performed. Surprisingly, the results were virtually the same for all eyes and resolutions, and almost identical to the accuracies obtained from the MLP networks. This may be because the data embedded in the image is low-level and separable by the MLP network, so the CNN layers simply transfer the underlying data to the final FC layers instead of extracting more information through their convolutions. This phenomenon would result in similar accuracies across network topologies and input resolutions, like those produced in this paper's experiments. If we look at the resolutions used in the experiments with CNN and MLP on


the entire eyes (Fig. 3.9, blue and black boxes), the lower resolution used with the MLP shows there is no accuracy gain from using larger images or a more complex classifier. This means that classification relies on image blobs that are large enough to be detected in a 60 × 69 image, once again suggesting that the fine details of iris texture do not contribute to gender classification as much as was initially thought. When we look comparatively at normalized iris images (Fig. 3.9, red, green and cyan boxes), a significant portion of the gender-related information is lost. Again, CNNs do not seem to have a substantially higher accuracy.
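The CNN just described is small enough to sketch in full. The listing below is an illustrative Keras version of the stated configuration (three convolution/max-pooling blocks with 16, 32 and 64 features, 4 × 4 convolution kernels with stride 1, 2 × 2 pooling with stride 2, fully-connected ReLU layers of 1,024 and 1,536 units, and a softmax output); the padding mode and optimizer are assumptions, since they are not specified above.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_gender_cnn(input_shape):
        """Gender CNN: 3 conv/pool blocks, 2 fully-connected layers, 2-way softmax."""
        model = keras.Sequential()
        model.add(layers.InputLayer(input_shape=input_shape))
        for n_filters in (16, 32, 64):
            model.add(layers.Conv2D(n_filters, kernel_size=(4, 4), strides=(1, 1),
                                    padding='same', activation='relu'))
            model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
        model.add(layers.Flatten())
        model.add(layers.Dense(1024, activation='relu'))
        model.add(layers.Dense(1536, activation='relu'))
        model.add(layers.Dense(2, activation='softmax'))   # male / female
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        return model

    # Example: whole-eye images at the smallest resolution used (95 x 120, grayscale).
    cnn = build_gender_cnn((95, 120, 1))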

3.1.4 Discussion

We showed how the use of non-person-disjoint training and test sets can result in estimated gender-from-iris accuracy that is biased high. We also showed the importance of averaging over multiple trials. Using a single random train-test split of the data, the estimated accuracies ranged from 40% to 100%. We showed that the presence of eye makeup results in higher estimated gender-from-iris accuracy. We also showed that classification based on the occlusion masks, disregarding completely the iris texture, results in an accuracy of approximately 60%. And we showed that simple averaging of the iris image intensity and thresholding can result in approximately 60% gender-from-iris accuracy. Our experiments showed hand-crafted features like LBP to yield better prediction accuracy (66%) than data-driven features (60%) when using MLP networks. On the other hand, CNNs (using data-driven features) had performance comparable to MLPs+LBP. In a similar experiment using the entire eye images, CNNs and MLPs had equivalent performance (around 80%) using learned features. Previous research may have misjudged the complexity of gender-from-iris, especially because of the subtle but important factors explored here. For future work, we suggest the creation of a subject-disjoint, mascara-free dataset. Currently, it is

not clear what level of gender-from-iris accuracy is possible based solely on the iris texture.

3.2 A new dataset to investigate Gender From Iris

The findings in [36] raised suspicion about some aspects of the gender-from-iris problem. First of all, how well can we actually predict gender based on iris images? More objectively, to what extent are factors external to the actual iris texture interfering with gender prediction? Although some works have shown very optimistic results, it is possible that these results have been influenced by problems in the experimentation, or by unnoticed biases in the datasets, like the presence of cosmetics in images of female subjects. In fact, most of the datasets used so far were not put together with the specific objective of gender classification. We believe that in order to better understand gender-from-iris prediction, a new dataset needed to be constructed, taking into account all possible intervening factors present in the images during its collection and selection. In this section, we describe the process of building a new iris dataset for gender prediction, while performing a series of analyses and comparisons to ensure its soundness and avoid potential biases. Furthermore, we comparatively evaluate the prediction performance using convolutional features and SVM classification.

3.2.1 GFI-C

The Gender From Iris (GFI) dataset [57] was the first dataset created specifically for the problem of determining the gender of an individual based on the iris texture. It was later updated [58] in order to be completely person-disjoint (i.e., there is no more than one image per eye for each subject). Previous work [36] reported significant interference in the classification potentially caused by the presence of eye cosmetics in the female population. The GFI

dataset was not collected with potential distortions caused by the use of cosmetics in mind, and yet approximately 60% of the female subjects wear some kind of eye makeup. To address this issue more carefully, we selected a new dataset from available images in the Biometrics Research Grid of the University of Notre Dame. We selected a number of female subjects for which there were both positive and negative samples for eye cosmetics on each eye. The new dataset, called GFI-C, is composed of one image without cosmetics and one with cosmetics for each eye of 116 female subjects, totaling 464 images and 232 distinct female identities. We also selected two images of each eye (left and right) from each of 116 male subjects without cosmetics, adding another 464 images and 232 distinct male identities. The complete dataset totals 928 images of 464 distinct subjects. For each subject, images were manually selected to favor good images in terms of alignment, focus, and visible iris size. The iris images were segmented and normalized using IrisBee [39] to a resolution of 64 × 512 pixels. For classification, features were extracted using a CNN implemented after the VGG-E topology (described in [53]), trained for object recognition on ImageNet. The network activations were obtained at the first fully connected layer of the network (FC1), resulting in 4,096 features for classification. An additional preprocessing step had to be taken to fit the rectangular normalized irises to VGG's 224 × 224 × 3 input. The single-channel normalized iris was resized to 224 × 28, and the remaining 196 rows were filled with black pixels. This image was then replicated into the remaining two channels, resulting in an image size compatible with the VGG input. The classification experiments were performed using a Linear SVM, with an 80/20 train/test split. Classification trials were composed of 30 random train/test splits to reduce the effect of biased partitions, as highlighted in [36]. Each of these 30 train/test splits was ensured to be person-disjoint, and the number of subjects in each


group (Males, Females With Cosmetics and Females No Cosmetics) was balanced, to avoid any type of bias in the process.
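The preprocessing and feature-extraction step just described can be sketched as follows. The listing assumes a Keras VGG19 model (the VGG-E configuration) pretrained on ImageNet, whose first fully-connected layer is exposed under the name 'fc1'; it is meant to illustrate the resizing/padding/replication scheme rather than reproduce our exact code.

    import numpy as np
    from skimage.transform import resize
    from tensorflow.keras.applications import VGG19
    from tensorflow.keras.applications.vgg19 import preprocess_input
    from tensorflow.keras.models import Model

    # Truncate VGG19 (VGG-E) at its first fully-connected layer: 4,096 activations.
    base = VGG19(weights='imagenet', include_top=True)
    fc1_extractor = Model(inputs=base.input, outputs=base.get_layer('fc1').output)

    def iris_to_vgg_input(norm_iris):
        """Fit a 64 x 512 normalized iris into VGG's 224 x 224 x 3 input: resize to
        224 x 28, pad the remaining 196 rows with black, replicate the channel."""
        small = resize(norm_iris, (28, 224), preserve_range=True)
        canvas = np.zeros((224, 224), dtype=np.float32)
        canvas[:28, :] = small
        rgb = np.repeat(canvas[..., np.newaxis], 3, axis=-1)
        return preprocess_input(rgb[np.newaxis, ...])

    def extract_fc1_features(norm_iris):
        return fc1_extractor.predict(iris_to_vgg_input(norm_iris))[0]   # 4,096-D vector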

3.2.2 Experiments and Results

This section details the analysis and comparisons performed on the GFI-C dataset, as well as the gender prediction experiments and comparisons made.


Figure 3.10. Mean intensity distribution in both datasets. GFI presents a perceptible shift in the intensity of Females With Cosmetics group.

3.2.2.1 Dataset Analysis

In order to assess the soundness of the new GFI-C dataset, we performed an analysis of aspects that may cause interference in gender prediction, and compared the results with the previous GFI dataset.


Image intensity is the primary information in an image. It has been shown that, in some datasets, simply by establishing a threshold on the average intensity it is possible to predict gender with approximately 60% accuracy [36]. Figure 3.10 shows a comparison of the mean intensity distribution for the different groups in both datasets. While in GFI there is a perceptible shift toward darker intensities in the Females With Cosmetics (FWC) group, the same does not occur in GFI-C.


Figure 3.11. Comparison of the occlusion ratio for both datasets: GFI has a slight but statistically significant higher occlusion ratio than GFI-C.

Next we analyzed iris occlusion in the images. In the segmentation/normalization


process, an occlusion mask is created to exclude objects that obstruct the iris texture, typically eyelids and eyelashes. Figure 3.11 compares the two datasets in terms of the portion of the iris that is covered by the occlusion mask. The graph shows that GFI-C has a slightly smaller occlusion than the previous dataset. Although small, the notches in the boxplot indicate the difference is statistically significant. In Figure 3.12 the mean intensity distribution of the image was calculated in five separate bands: the first band is adjacent to the pupil boundary, while the fifth is adjacent to the sclera boundary of the iris. In the GFI dataset (Fig. 3.12, top row) the FWC group presents a shift to darker intensities that grows from the first to the last band. This shift to dark is consistent with the presence of dark, unmasked eyelashes, and is much more subtle in the GFI-C dataset. This can be related to the fact that GFI-C has a lower occlusion ratio (shown in Fig. 3.11).


Figure 3.12: Mean intensity distribution by image band (from the pupil boundary to the sclera boundary): in GFI the FWC group presents a growing shift to dark, consistent with cosmetics interference in the outer region of the iris. The same does not occur in GFI-C.


Trying to better understand the visual cues that are used to predict gender, we looked at pupil and iris area, as well as the pupil dilation ratio with respect to the iris. For obvious reasons, this analysis was performed after segmentation and before the normalization step. The most evident difference between distributions occurs in iris area for GFI-C (Fig. 3.13), where the distribution of FWC shows a shift to the right. Nevertheless, there seems to be little separation between the group distributions, suggesting this could hardly be a central factor in the prediction. The age of the person may influence some aspects of their eyes, such as eyelid drooping, cornea shape, or pupil size [25]. Figure 3.14 shows the age distribution of the populations in GFI and GFI-C, based on the year of birth. GFI-C is composed of a slightly younger set of people. The difference is not large, but the notches in the boxplots indicate it to be statistically significant. On the other hand, the difference between the median age for each dataset is less than two years. It does not seem reasonable that such a small difference in age might cause a perceptible effect on the irises, and consequently on gender classification. Focus of the image is another potential latent aspect that may introduce blur into the iris texture. Image sharpness could cause intensity distortions in the iris texture and thus disturb classification. We performed a sharpness assessment of the images in both datasets following ISO 29794-6:2014 [31], shown in Figure 3.15. There is no important difference in the score distributions across datasets, but the FWC group presents a sharper distribution than the other groups. Since the assessment is performed on the whole image (and not on the normalized iris), it may be that eyelids and eyelashes with cosmetics display a stronger edge contrast, causing this shift.

3.2.2.2 Classification Results

Using VGG-generated features, we trained a simple Linear SVM to perform gender prediction on both GFI and GFI-C. Two experiments were performed: one using the


Figure 3.13. Pupil and Iris Size and Dilation Ratio. There is no clear separation of the group distributions, suggesting these factors are not associated with gender prediction ability.


Figure 3.14. Age distribution in GFI and GFI-C. GFI-C has a marginally younger population than GFI.

whole near-infrared eye image, and the other using the segmented and normalized irises. Figure 3.16 shows the results of each of the 30 trials in these experiments. Whole eye results are plotted in green and normalized irises are plotted in purple. To isolate possible effects of eye cosmetics, the experiments were performed with three training groups: All subjects, Males + Females With Cosmetics, and Males + Females No Cosmetics. As could be expected from a smaller dataset, GFI-C presents a higher variance in both experiments. With mean accuracies of 81% and 84% on GFI and GFI-C respectively, gender prediction from the whole eye had the best results. This confirms the findings of [4] and [36], which suggest the majority of the cues used in the prediction come from


Figure 3.15. Sharpness assessment of images in GFI and GFI-C. The FWC has a distinct distribution from the other groups in both datasets.

the periocular region, instead of the iris texture. There is a larger difference between the mean accuracies on each dataset when using the normalized irises as the input for prediction. GFI-C had an average accuracy of 65%, suggesting the classification is somewhat easier on that dataset than on the previous one (60%). Figure 3.17 shows an accuracy comparison across different classifiers on both datasets (top row is GFI and bottom row is GFI-C), over 30 random train/test trials. The left column shows classification results for VGG features on a Linear SVM, and the right column shows results for LBP features on an MLP network. GFI-C accuracy results are noticeably better, despite the higher variance. Another clear result is that the accuracy distributions across training sets are more uniform for VGG+SVM than they are for LBP+MLP in both datasets. This suggests VGG features might be more robust to the presence of cosmetics in the images.


Figure 3.16. Gender prediction accuracy on different datasets using VGG features and Linear SVM. The shaded areas show the standard deviation from the mean.

3.2.3 Discussion

In the second part of this work on Gender From Iris we present a new dataset, selected to ensure, to the best of our ability, that no bias in the data could cause distortions in classification results. Also, we perform gender classification using VGG features, which seems to be the first application of such an approach to the problem. Results indicate marginally better overall accuracy on GFI-C. Nevertheless, given the small difference in average accuracy and the significant difference in the size of the datasets, it is possible that this improvement does not translate into a concrete advance. Similar trends are observed for classification performed using normalized irises (60–65% accuracy) and using the entire eye image (81–84%). This in turn corroborates once more the notion that the periocular area contains the majority of the clues for gender prediction. The detailed analysis of incidental aspects of the images, performed while GFI-C was constructed, allowed us to mitigate suspicion about some of these aspects. On the other hand, the question regarding the structure and location of gender-specific cues in the iris texture remains: it is still possible to obtain better than random performance


from the normalized irises, even though we are not able to identify these cues.


Figure 3.17. Accuracy comparison across different classifiers on both datasets. Results suggest that VGG+SVM are more robust to the influence of cosmetics than LBP+MLP.


CHAPTER 4

1:FIRST SEARCH

This chapter will describe experiments conducted in order to evaluate the performance of 1:First search, a technique that has been used in real-world applications to reduce search time. In Section 4.1, we report on our initial analysis of the performance of 1:First search. This analysis assumes a “closed set” context. Section 4.2 describes the results of more comprehensive experiments that generalize to the more realistic “open set” context.

4.1 A Basic Approach

Our initial approach to evaluating 1:First search consisted of creating enrollment galleries of different sizes and corresponding sets of probes, and using IrisBee to match these sets with both 1:N and 1:First approaches. The results of this initial approach were published in [35], and those experiments will be described in this section. The goal of this work was to perform an empirical investigation to explore the difference in accuracy between systems that use the practical approach of 1:First search and the more traditional 1:N. More specifically, the investigation assessed how 1:First scales when applied to a range of different gallery sizes and distance thresholds. To do so, an environment for closed-set identification was set up using an available iris image dataset.


4.1.1 Dataset

In the biometrics dataset available to us, a total of 57,232 iris images were captured with an LG-4000 sensor. With this dataset, enrollment galleries were created with the earliest image of each iris. Different size galleries were created for different experimental runs. The remaining (later) images of these subjects were used as a probe set to match against the gallery. Thus, it is assumed that each iris is enrolled with just one image, and that each probe has a corresponding enrolled identity. The fact that each probe has a corresponding enrollment in the gallery makes this a “closed-set” scenario [30, 41]. The ordering of the images in each gallery was defined randomly, and the same order was used in all experiments. The process was then repeated to create different-sized galleries and probe sets, as shown in Table 4.1. The number of images in each probe set is not uniform, because it tries to maximize the number of probes but depends on the number of available images for each subject. The difference between left and right probe set sizes is, however, no larger than 0.15%. We assume both 1:N and 1:First are based on sequential search from first to last, rather than some more elaborate search strategy. As a consequence, the order of enrollments is important in 1:First search, because it affects when the search stops. For this reason, we randomized the order of the galleries for every experiment.

4.1.2 Method

The iris matcher used to perform the identification was the IrisBEE baseline iris recognition algorithm [39]. The resulting log of performing a 1:N search allows us to figure out what would have happened with a 1:First search. Because this is a closed-set identification, each probe has a corresponding enrollment in the gallery, and therefore the number of true non-matches is zero, by


TABLE 4.1
Galleries and Probe Set Sizes

        Left Eye                  Right Eye
Gallery    Probe Set       Gallery    Probe Set
    100        7,745           100        7,740
    200       12,529           200       12,515
    400       18,644           400       18,624
    600       22,395           600       22,364
    800       24,555           800       24,521
  1,000       25,582         1,000       25,547
  1,200       26,078         1,200       26,040
  1,400       26,478         1,400       26,437

definition. In order to account for rotational tolerance, the experiments allowed up to ±14 rotation shifts. This number was chosen because it is similar to that used for matching in the NEXUS program [8]. In the IrisBEE implementation, each rotation step corresponds to 1.5°, meaning that rotations of up to 42° can be allowed in the most extreme case. Figure 4.1 illustrates this situation. For each gallery, the matcher tested each probe set using 1:N search. After that, the analysis of the 1:N matching results lets us understand what would happen using 1:First matching. This procedure was repeated using a range of values for the Hamming Distance threshold, and also for the number of eye rotations allowed during matching.
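To make the two search strategies concrete, the sketch below shows a simplified version of the procedure: the fractional Hamming distance between binary iris codes is minimized over a range of circular shifts (the rotation tolerance discussed above), 1:N scans the whole gallery and keeps the best score, and 1:First accepts the first enrollment whose score falls below the threshold. This is a schematic illustration, not the IrisBEE implementation, and the data structures are hypothetical.

    import numpy as np

    # probe and each gallery entry: (code, mask), both 2-D boolean arrays;
    # gallery: ordered list of (identity, (code, mask)) pairs.
    def hamming_distance(code_a, mask_a, code_b, mask_b, max_shifts=14):
        """Fractional Hamming distance, minimized over circular column shifts."""
        best = 1.0
        for s in range(-max_shifts, max_shifts + 1):
            code, mask = np.roll(code_a, s, axis=1), np.roll(mask_a, s, axis=1)
            valid = mask & mask_b
            if valid.any():
                hd = np.count_nonzero((code ^ code_b) & valid) / np.count_nonzero(valid)
                best = min(best, hd)
        return best

    def search_1n(probe, gallery, threshold):
        """1:N: compare against every enrollment, return the best match under threshold."""
        best_hd, best_id = min((hamming_distance(*probe, *entry), ident)
                               for ident, entry in gallery)
        return best_id if best_hd <= threshold else None

    def search_1first(probe, gallery, threshold):
        """1:First: stop at the first enrollment whose distance falls below threshold."""
        for ident, entry in gallery:
            if hamming_distance(*probe, *entry) <= threshold:
                return ident
        return None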


Figure 4.1. Iris rotation tolerance limits for the experiments.

4.1.3 Results

After running the matching, the accuracy was measured in terms of performance scores for each of the possible results: True Match Rate (TMR), False Match Rate (FMR) and False Non-Match Rate (FNMR). The presented TMR, FMR and FNMR results are for the identification decisions made as the result of 1:N or 1:First search. Figures 4.2 and 4.3 show examples of a false match and a false non-match that occurred during the execution of the experiment. Table 4.2 presents the performance scores for each gallery size, averaged over every threshold between 0.26 and 0.35, in 0.01 increments. From this table, it is easy to perceive that the matching method had little or no effect (< 0.4%) on the TMR. As for the FNMR, the choice of matching method had no effect at all. This might be explained by the fact that most false non-matches are the result of factors external to the matching process (e.g. a low quality image of the probe or the enrollment, or a segmentation error). These external factors can also account for the high FNMR, especially because IrisBEE is not as accurate in the segmentation of the images as commercial matchers. However, the FMR for 1:First identification is up to 2.5 times larger than the


Figure 4.2. Example of false match: despite the evident segmentation error in 4.2a, the Hamming Distance to 4.2b was 0.298701, a score low enough to be considered a match.


Figure 4.3. Example of false non-match: 4.3a and 4.3b had a Hamming Distance of 0.423313, and the pair was considered a non-match. The translucent ring that appears on the subject's contact lens was partially classified as an occlusion in 4.3a, which might have contributed to the high score.


TABLE 4.2
Average Performance Scores (in %) for Thresholds between 0.26 and 0.35, with increments of 0.01, no rotation tolerance.

Gallery          TMR                          FMR                          FNMR
Size       1:N           1:First        1:N          1:First        1:N           1:First
 100    86.65 ± 4.49  86.65 ± 4.48   0.01 ± 0.03  0.02 ± 0.04   13.33 ± 4.50  13.33 ± 4.50
 200    85.77 ± 4.55  85.76 ± 4.53   0.02 ± 0.03  0.03 ± 0.05   14.21 ± 4.57  14.21 ± 4.57
 400    86.21 ± 4.44  86.18 ± 4.40   0.04 ± 0.05  0.07 ± 0.12   13.75 ± 4.48  13.75 ± 4.48
 600    86.61 ± 4.20  86.65 ± 4.48   0.05 ± 0.08  0.11 ± 0.18   13.34 ± 4.26  13.34 ± 4.25
 800    86.77 ± 4.13  86.65 ± 4.48   0.06 ± 0.10  0.14 ± 0.24   13.16 ± 4.20  13.17 ± 4.20
1000    86.99 ± 4.05  86.65 ± 4.48   0.08 ± 0.11  0.22 ± 0.38   12.93 ± 4.14  12.93 ± 4.14
1200    87.12 ± 3.99  86.65 ± 4.48   0.10 ± 0.13  0.23 ± 0.36   12.78 ± 4.09  12.78 ± 4.09
1400    87.28 ± 3.94  86.65 ± 4.48   0.12 ± 0.15  0.38 ± 0.58   12.61 ± 4.05  12.61 ± 4.05

FMR for 1:N. This fact is illustrated by the graph presented in Figure 4.4. The same Hamming distance thresholds were used for both methods, hence it was expected that the FMR would increase more for 1:First as the gallery size increased. Empirically, for the conditions used in this experiment, the rate of increase of errors between 1:N and 1:First grows with the size of the gallery. Since the FNMR suffered no change, this means that a portion of the comparisons that resulted in True Matches under 1:N yielded False Matches when using the 1:First method. This fact can be verified by observing the correlation between the drop in the TMR and the increase in the FMR. To get a better idea of how the 1:First method affects the matching process, experiments with a range of Hamming Distance thresholds and gallery sizes were performed. Figure 4.5 shows the FMR for each of these experiments, averaged over the left and right eyes. In this graph it is easily perceived that the FMR increases both when the threshold is increased (made less strict) and when the gallery size is increased. More importantly, it is possible to see that the FMR for 1:First is larger than for 1:N in most cases. Figure 4.6 shows the corresponding ROC charts for both matching methods. It is also possible here to notice the effect that the gallery size has on performance: in the worst case, 1:N resulted in a TMR of approximately 92% while


Figure 4.4. FMR for 1:N and 1:First as Gallery Size increases.

maintaining a FMR around 0.6%. With the same parameters, 1:First produced a TMR a little higher than 90%, and a FMR of nearly 2.2%. It is also important to notice that in all cases for 1:First, the drop in the TMR when the gallery size is increased is reflected directly in an increase of the FMR, but the same does not happen with 1:N matching. The above results were obtained without rotation of the iris codes. As explained earlier, relative rotation of the eye between the enrollment and probe is handled by matching the iris codes for a range of relative circular shifts between the codes. In this case, due to the increase in the number of comparisons, the probability of getting a lower Hamming distance score is higher. In order to get a more realistic assessment, rotation shifts were introduced. Figure 4.7 presents the ROC curves obtained using ±3, 5, 9 and 14 rotational shifts. The effects of rotational shifts are clearly visible in both cases. As the rotation


Figure 4.5. FMR for varying HD and gallery size.

shifts are increased, the TMR lower bound of 1:N rises from around 82% with ±3 shifts, to close to 94% with ±9 shifts. In most of these cases, the TMR is above 98% for the largest galleries. At the same time, the FMR, which is at most 1% with ±3 shifts, is actually reduced to a little less than 0.8% when ±5 shifts are used. Additional rotation up to 14 shifts causes the FMR to grow to a little over 1.2%. With 1:First, on the other hand, the TMR initially rose to around 96% for small gallery sizes, but the performance deteriorated for larger galleries, dropping down to below 85% when using ±9 shifts, and below 75% with ±14 shifts. At the same time, the FMR reaches up to nearly 20% with ±9 shifts and a gallery of 1,400 irises. In its worst case, the FMR reaches approximately 26% with ±14 shifts.


Figure 4.6: ROC curve for matching without eye rotation tolerance. The colors represent the gallery size.

4.1.4 Discussion

The results point to interesting and impactful issues, especially for large scale applications of iris identification. Although not much can be inferred from the small changes in the TMR, a closer look at the FMR reveals not only larger values for 1:First, but also an exponential growth curve (Figure 4.4). FNMR was unaffected by the matching method. This was expected, because in order for a false non-match to occur, all the enrolled samples must be examined, regardless of whether 1:N or 1:First is used. On the other hand, there is an inverse relation between the true match and false match rates when using 1:First: as the FMR goes up, the TMR degrades. Regarding the scaling ability, the results indicate a steep growth in the FMR, as can be seen in Figure 4.5. This strongly suggests 1:First might not be adequate for large databases, as it tends to generate many more false matches than 1:N. The analysis of the ROC curves without rotation shifts shown in Figure 4.6 reveals small differences between the two methods in the TMR (under 2%), but with the exception of the two smaller galleries, 1:First is the lowest in all cases. Accordingly, the False Match Rate of 1:First is higher in nearly all cases, and it tends to get worse


Figure 4.7: Matching performance with ±3 to ±14 rotational shifts. Note that the 1:First ROCs on the right have an unusual aspect. This occurs because the FMR increases and the TMR decreases proportionally with larger threshold values. As the FMR increases with the increase in the threshold, the curve leans to the right; at the same time, the drop in the TMR makes the curve go down. This effect is particularly accentuated with larger gallery sizes.


with the gallery size growth. In the worst case, the FMR for 1:First is more than 3 times higher than for 1:N. When rotation shifts are introduced (Figure 4.7), the problems with 1:First become more evident. In 1:N matching, as the number of allowed rotation shifts is increased, the TMR rises to a little more than 94% in the worst scenario, in comparison with the results shown in Figure 4.6. At the same time, the FMR goes up to 1.2% in the worst case. The rotational shifting has improved the TMR without increasing the FMR. In comparison, when the same number of rotations is used in 1:First, the TMR is increased for smaller galleries (fewer than 800 images), but at the same time the FMR begins to increase proportionally. Ultimately, the FMR gain ends up forcing down the TMR and yielding worse results. For the larger galleries, 1:First had TMRs below 90%, with a FMR up to 25%. The behavior of 1:First matching is not as similar to 1:N as might be expected, and unfortunately the differences are for the worse. The low performance of 1:First in the false match scores when compared to 1:N would, by itself, argue against the method. But it is with larger galleries and the addition of rotational shifts that 1:First performance was really disturbed, raising the question of whether the reduction in search time could outweigh the loss in accuracy. Results indicate that while the FNM errors are the same for both 1:N and 1:First, FM errors are quite different. Furthermore, the FM errors are generally larger in 1:First than in 1:N, and the problem is worsened by 1) larger gallery sizes, and 2) more shifts to handle a wider range of eye rotation.

4.2 Experiments in Extended Scenarios

The results obtained in the previous section helped us to better understand the implications of the use of 1:First, but they also raised other questions. The results suggest

a proportional growth of the FMR for 1:First when the size of the gallery is increased. But gallery sizes in the range of 100–1,400 subjects are hardly representative of real-world applications of iris recognition. Another limitation of the previous experiment was that it only considered closed-set scenarios, which may not reflect real-world scenarios well, since these are frequently open-set (i.e., not all identities presented for identification are in the enrollment list). Furthermore, the academic matcher IrisBee does not perform as well as commercial matchers. We wanted to be able to compare results with a commercial-grade matcher, in order to verify whether the same trends would occur. Finally, different orderings of the same gallery may introduce variability in the results of 1:First searches, and our results only accounted for a single search. Another set of experiments was then conceived to address all of these issues: 1) gallery sizes ranging from 500 to 11,000 distinct subjects; 2) closed- and open-set scenarios; 3) use of a commercial matcher in addition to IrisBee for comparison; 4) matching a probe set against several random permutations of the same gallery.

4.2.1 Closed-set versus Open-set

The National Science and Technology Council Subcommittee on Biometrics defines the task of closed-set identification as a case in which the system tries to determine the identity of an unidentified individual who is known to be enrolled in the database. Conversely, open-set identification is when identification attempts are made for individuals who are not enrolled. In this case, the system is required to perform two tasks: first, to determine if the user is in the database; and second, to find the corresponding identity record for that user [41]. The introduction of non-enrolled users in an evaluation scenario can generate interesting results, especially if we look into the possible outcomes of the identification system.


4.2.2 Matching output

Typically, biometric systems produce two types of errors: False Match (FM) and False Non-Match (FNM). A FM occurs when two samples from different individuals are incorrectly classified as a match. Conversely, a FNM occurs when two biometric samples of the same individual are not recognized as a match [33]. These errors are very similar to Type I (false-positive) and Type II (false-negative) statistical errors. This resemblance is illustrated in Table 4.3. However, this traditional standpoint usually does not contemplate a scenario variation: open set vs. closed set [30]. In both of these cases, there is a list of enrolled identities (a gallery G), and the comparisons made against that gallery come from a list of probes (the probe set P). If all the identities in P are contained in G (P ⊆ G), then the scenario is said to be closed set. On the other hand, if any of the identities in P are not contained in G (P ⊄ G), the scenario is called open set.

TABLE 4.3
Conventional possible matching outputs.

                                            Matcher Result
True State                      2 samples same identity    2 samples NOT same identity
2 samples same identity         True Match                 False Non-Match
2 samples NOT same identity     False Match                True Non-Match


As briefly mentioned in ISO 19795-1:2006 [30], the conventional definition of possible matching results is a little different when considering open-set and closed-set scenarios. Table 4.4 lists the possible outcomes in both. In a closed set, we have the typical cases of True Match (TM) and FNM for enrolled probes. Even though in this scenario there are no unenrolled probes, false matches can occur when an enrolled impostor sample is similar enough to a different identity's sample in the gallery. We call these occurrences Enrolled False Matches (EFM) (Table 4.4, note a). Another interesting peculiarity of this scenario is that TNMs cannot happen, because all the probes are enrolled (Table 4.4, note b). On the other hand, in an open-set scenario, all four typical cases occur, but there is a distinction to be made: false matches (Table 4.4, note c) can occur either as EFMs, as in a closed set, or as Unenrolled False Matches (UFM), when an image of an unenrolled individual is similar enough to match one of the enrolled identities. ISO 19795-1:2006 [30] also defines standards for biometric performance testing and reporting. According to it, the fundamental performance metrics for matching are False Non-Match Rate (FNMR) and False Match Rate (FMR). Since our evaluation is made at the algorithm level, we do not take into account acquisition or enrollment failures. FNMR is defined as the proportion of genuine samples that are incorrectly declared not to match the enrolled template of the same identity, while FMR is the proportion of impostor samples that are incorrectly declared to match an enrolled template of some other identity. The norm also defines metrics specific to the identification task, namely: True-Positive Identification Rate (TPIR), False Negative Identification Rate (FNIR) and False Positive Identification Rate (FPIR). TPIR is the proportion of identification transactions by an enrolled user in which the identification was correctly made by the system. FNIR is the proportion of identification transactions by enrolled users, in


TABLE 4.4
Possible outputs for matching against a gallery in Closed-set and Open-set scenarios.

                          Matching Result
                       Closed-set               Open-set
                       TRUE       FALSE         TRUE          FALSE
Enrolled Probe         TM         FNM           TM            FNM
Non-Enrolled Probe     EFM (a)    N/A (b)       EFM/UFM (c)   TNM

(a) Enrolled False Match: an enrolled impostor image was similar enough to be considered a match.
(b) True Non-Match cannot happen, because there are no unenrolled probes.
(c) Unenrolled False Match: an unenrolled image was similar enough to be considered a match.

which the correct identity was not found by the system, while FPIR is the proportion of identification transactions by non-enrolled users (thus, in an open-set scenario) in which a wrong, enrolled identity was returned by the system. In our algorithm-level experiments, an identification transaction is the presentation of one probe template to the identification system, which in turn responds with one of two possible outcomes: a) an identity label assigned to that probe, or b) no identity assigned to the probe, in the case that no sufficiently good match was found in the enrollment list. The output of the transaction is then classified according to its correctness into five possible categories:

1. True Match (TM) – the system returns an identity that corresponds to the identity of the presented probe;

2. Enrolled False Match (EFM) – the system returns an identity that does not correspond to that of the presented probe, and the probe corresponds to an enrolled identity;

3. True Non-Match (TNM) – the system returns no identity and the probe identity was not enrolled (this situation can only happen in an open-set scenario);

4. False Non-Match (FNM) – the system returns no identity, but the probe identity was in the enrollment list;

5. Unenrolled False Match (UFM) – the system returns an identity, but the presented probe identity does not correspond to any enrolled identity (this situation only occurs in an open-set scenario).

Using these result categories, we calculate the accuracy metrics in terms of TPIR, FMR, FNMR, FNIR and FPIR to evaluate accuracy in our experiments.
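The tabulation of these outcomes and of the corresponding identification rates can be sketched as follows; variable names are placeholders, and FNIR is computed as the proportion of enrolled-user transactions in which the correct identity was not returned, following the definition above.

    from collections import Counter

    def classify_outcome(returned_id, true_id, enrolled_ids):
        """Label one identification transaction as TM, EFM, UFM, FNM or TNM."""
        probe_enrolled = true_id in enrolled_ids
        if returned_id is None:
            return 'FNM' if probe_enrolled else 'TNM'
        if not probe_enrolled:
            return 'UFM'
        return 'TM' if returned_id == true_id else 'EFM'

    def identification_rates(outcomes):
        """TPIR, FNIR and FPIR from a list of outcome labels."""
        c = Counter(outcomes)
        enrolled = c['TM'] + c['EFM'] + c['FNM']       # transactions by enrolled users
        unenrolled = c['TNM'] + c['UFM']               # transactions by non-enrolled users
        return {'TPIR': c['TM'] / enrolled if enrolled else 0.0,
                'FNIR': (c['FNM'] + c['EFM']) / enrolled if enrolled else 0.0,
                'FPIR': c['UFM'] / unenrolled if unenrolled else 0.0}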

4.2.3 Dataset

The iris images were selected from the Biometrics Research Grid (http://ccl.cse.nd.edu/operations/bxgrid/) at the University of Notre Dame. Images were captured using an LG-4000 sensor, during acquisition sessions performed over the years 2008 through 2013. In addition to the 57,232 images used in the first set of experiments, the University's repository contains 51,234 images captured with the same sensor after a firmware update. Thus, a total of 108,466 images of left and right eyes were selected, comprising a total of 1,991 people and 3,982 individual eyes. Since the firmware update can introduce significant changes in the pre- and post-acquisition processing of the images done by the sensor, these images were appropriately labeled in the repository. In order to add these images to our iris image pool, it was necessary to analyze the two sets of images to ensure the sensor firmware update would not introduce new characteristics that could interfere with the results. Four different aspects of these groups of images were analyzed and compared: a) intensity distributions; b) matching score distributions; c) subject distributions; and d) image focus. While intensity and focus did not reveal significant changes after the upgrade, the analysis of matching score and subject distributions revealed some distortions. An examination of the data revealed the distortions were caused


not because of the firmware update, but because of a few subjects who had a much higher image count than the average. These subjects participated in data collections for specific experiments and were removed from this dataset.

4.2.3.1 Simulating more individuals for increased gallery size

In order to perform experiments on the largest possible galleries and probe sets, we used a data augmentation technique to increase the number of unique eyes in our dataset. It is known that there is no correlation between the left and right iris of the same person, and similarly, two different images of the same iris must be correctly aligned in order to generate an identity match [10]. Based on this, we performed two spatial transformations on the original images of the set: 180° rotation and horizontal flipping. To make sure the spatial transformations do not result in a set that diverges from the properties of the original, we selected the oldest image of each eye of each subject, to which we applied rotation and flipping, creating four additional sets of 1,991 images each. Next, we performed all-versus-all matching using IrisBee and compared the Hamming Distance distributions. Each set of images resulted in approximately 2 million one-way comparisons. As shown in Figure 4.8, there is very little difference between the six sets. The final data set is composed of the union of these six sets, amounting to 11,946 unique eyes, which, for the purposes of this work, were then considered as unique subjects. Finally, the rotation and flipping transformations were applied to all available images, resulting in a total of 325,398 images that were used in the experiments.
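The two spatial transformations are straightforward; a minimal NumPy sketch is shown below (file handling and naming conventions are omitted).

    import numpy as np

    def augment_iris_image(img):
        """Create the two synthetic variants used to enlarge the subject pool:
        a 180-degree rotation and a horizontal flip of the original image."""
        return {'rotated': np.rot90(img, k=2),   # 180-degree rotation
                'flipped': np.fliplr(img)}       # horizontal (left-right) flip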


Figure 4.8. Comparison of HD distributions between original and artificial images.

4.2.4 Gallery and Probe Set Formation

From this augmented data set, we created 22 subject-disjoint galleries varying in size from 500 to 11,000 subjects, in increments of 500. To select these galleries, we picked the single oldest image for each person, for each eye, and for each transformation (original, rotated and flipped). We call this set the Gallery Pool. From the Gallery Pool, images were randomly drawn to form each of the galleries. After having formed the galleries, the remaining 313,452 images were used as a Probe Pool to create closed probe sets for each of the galleries. Then, for each subject in the gallery, the corresponding images in the Probe Pool were added to the probe set. The total size of each probe set varied according to the number of images available for each subject. The smallest probe set has 10,836 probes, and the maximum size was constrained to 20,000 in order to limit the computational time of our experiments.


To create open probe sets, a similar procedure was adopted: for each gallery, a closed probe set of size N × 1.5 was randomly drawn from the Probe Pool, where N is the size of the gallery. Next, a set of (N × 1.5)/2 images from subjects not in the gallery was added to the selection to form the open fraction of the probe set. As an example, a gallery of 500 subjects yields a closed probe set of 500 × 1.5 = 750 images, and 750/2 = 375 images of unenrolled subjects are added as the open (or unenrolled) portion of the probe set. The final probe set size in this case is 750 + 375 = 1,125 images, of which ∼33% correspond to unenrolled subjects. The enlargement factor of 1.5 was arbitrarily chosen because it resulted in a reasonable combination of size and computational requirements.
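The probe-set construction can be summarized in a short sketch; the gallery and pool structures below are hypothetical placeholders, and the 1.5 enlargement factor follows the description above.

    import random

    def build_open_probe_set(gallery_subjects, probe_pool, factor=1.5, seed=None):
        """probe_pool: list of (subject_id, image_id) pairs not used for enrollment.
        Returns a probe set in which roughly one third of the probes are unenrolled."""
        rng = random.Random(seed)
        enrolled = [p for p in probe_pool if p[0] in gallery_subjects]
        unenrolled = [p for p in probe_pool if p[0] not in gallery_subjects]
        n_closed = int(len(gallery_subjects) * factor)
        probes = rng.sample(enrolled, n_closed) + rng.sample(unenrolled, n_closed // 2)
        rng.shuffle(probes)
        return probes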

4.2.5 Threshold Selection

In order to compare the performance of different matchers, it is necessary to standardize the metrics. In addition to the matcher evaluation metrics defined by ISO 19795-1:2006 [30], it is also necessary to ensure the equivalence of the matcher input parameters. The “strictness” of an iris recognition application is regulated by a threshold that stipulates the minimum similarity required between two samples so that they can be considered a match. This threshold is usually arbitrary, and it is defined as a value in the scale of the matcher output. Daugman-based iris matchers like IrisBee use Hamming Distance as the scale of dissimilarity between samples. This scale goes from 0.0 (no dissimilarity) to 1.0 (complete dissimilarity). VeriEye, on the other hand, uses a similarity scale to compare samples: it ranges from 0 (minimal similarity) to 9,443 (maximal similarity). Since there is no direct relation between the two scales, it is necessary to establish an equivalence between values in these two scales so that we can make a fair comparison of the two matchers' results. To do so, we ran matching using the largest available gallery (11,000 identities)


Figure 4.9. CDF-based threshold selection. Observe that the cumulative distribution curves are very different in shape for IrisBee (top row) and VeriEye (bottom row): this happens because the matchers use dissimilarity and similarity scales, respectively.

and its corresponding probe set (20,000 images), resulting in more than 200 million comparisons. Using the matching score output for each comparison, we plotted the Cumulative Distribution Function (CDF) for the genuine and impostor distributions. Figure 4.9 shows examples of the threshold selection for IrisBee and VeriEye. Based on the impostor CDFs, five threshold values were selected, corresponding to 0.0001%, 0.001%, 0.01%, 0.1% and 1% of the impostor comparisons. This ensures we have thresholds corresponding to five distinct Accuracy Targets, regardless of the score metric that is used by the matcher. An example of this equivalence is shown in Table 4.5.
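Given the full set of impostor comparison scores, the threshold for each accuracy target is simply a quantile of the impostor distribution; a minimal sketch is shown below, where the higher_is_better flag distinguishes a similarity scale (VeriEye-style) from a distance scale (IrisBee-style).

    import numpy as np

    def thresholds_for_targets(impostor_scores,
                               targets=(1e-6, 1e-5, 1e-4, 1e-3, 1e-2),
                               higher_is_better=False):
        """For each accuracy target (maximum fraction of impostor comparisons allowed
        to match), return the score threshold taken from the impostor distribution."""
        scores = np.asarray(impostor_scores, dtype=float)
        out = {}
        for t in targets:
            if higher_is_better:
                out[t] = np.quantile(scores, 1.0 - t)   # similarity: accept top fraction t
            else:
                out[t] = np.quantile(scores, t)         # distance: accept bottom fraction t
        return out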


TABLE 4.5

Empirically selected thresholds for closed set with 14 rotations

Accuracy Target (Max. False Matches)   IrisBee Threshold   VeriEye Threshold
1/100                                  0.3780              21
1/1,000                                0.3263              31
1/10,000                               0.1475              42
1/100,000                              0¹                  84
1/1,000,000                            0¹                  306

¹ The matcher could not achieve such precision.

4.2.6 Results

The most essential metric for accuracy in an identification system is TPIR. Figure 4.10 compares the identification rates achieved by both matchers, using 1:N and 1:First search methods in a closed set scenario. One aspect to be considered here is the inferior accuracy of IrisBee when compared to VeriEye, especially combined with the relatively small size of our data set. The highest accuracy target that could be achieved with IrisBee was at a level of 1/100,000 using 3 and 7 rotational steps, against 1/1,000,000 for VeriEye. This is why the red and green series are absent on the left plot. Even so, when the rotation tolerance is increased to 14 rotation steps, IrisBee could only achieve the accuracy target of 1/10,000.

Still, it is possible to see that 1:N search (represented by circles) remains fairly constant across the different gallery sizes, while 1:First search (represented by X's) clearly suffers a degradation with the increase in the gallery size. It should be considered, however, that 1:First performance degradation only occurs at the less strict accuracy target settings. An example of that is the performance achieved by VeriEye at an accuracy target of 1/1,000,000 (red series): it corresponds to a very restrictive setting, and its circles and X's overlap almost perfectly, indicating there was no significant difference in accuracy between 1:N and 1:First. The same occurs for IrisBee at the accuracy target of 1/10,000 (blue series), but at a lower accuracy level than VeriEye. On the other hand, the performance degradation of 1:First search across larger gallery sizes is similar in both matchers at accuracy targets of 1/1,000 and 1/100.
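For reference, the sketch below illustrates the procedural difference between the two search strategies being compared. It is an illustrative sketch in Python, not the code of either matcher; `match` stands in for any function returning a similarity score (for a dissimilarity matcher the comparisons would be inverted).

```python
def search_1_to_n(probe, gallery, match, threshold):
    """Scan the entire gallery and return the best-scoring identity above the threshold."""
    best_id, best_score = None, None
    for identity, template in gallery:
        score = match(probe, template)
        if score >= threshold and (best_score is None or score > best_score):
            best_id, best_score = identity, score
    return best_id  # None means the probe is rejected (treated as unenrolled)

def search_1_to_first(probe, gallery, match, threshold):
    """Return the first identity whose score passes the threshold (depends on gallery order)."""
    for identity, template in gallery:
        if match(probe, template) >= threshold:
            return identity
    return None
```

Because 1:First can return before it ever reaches the true mate, enlarging the gallery increases the chance that some earlier entry exceeds a lenient threshold, which is consistent with the degradation observed in these experiments.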

(Panels: TPIR vs. gallery size for IrisBee (left) and VeriEye (right), closed set, 14 rotations; series for 1:N and 1:First at accuracy targets from 1/1,000,000 to 1/100.)

Figure 4.10. True Positive Identification Rate comparison for IrisBee and VeriEye, using 1:N and 1:First search, over a range of gallery sizes. Higher means better.

Figure 4.11 shows another performance comparison between 1:N and 1:First, in terms of their FMR, in closed set scenarios under different settings. Matching was performed using both matchers, over a range of rotation tolerance values and FMR accuracy targets. In this case, we can perceive an inverse trend in 1:First performance with relation to Fig. 4.10: as the gallery size increases, TPIR declines, while FMR grows proportionally.

Comparing Figs. 4.11a and 4.11b we can confirm the higher accuracy of the commercial VeriEye matcher as opposed to the research-software IrisBee matcher.² Both matchers, however, show the same types of trends when we compare 1:N and 1:First. 1:First search has a tendency to higher FMRs when the target accuracy is less strict than 1/10,000. On the other hand, if the target accuracy is more strict than a certain limit, 1:N and 1:First are very similar. As could be expected given the differences in overall accuracy between the matchers, this limit is different for each of them: with IrisBee, 1:N and 1:First have similar performances for target accuracies no less strict than 1/10,000, while with VeriEye the same occurs for target accuracies no less strict than 1/100,000. This is verified by the overlapping “X”s and “O”s of the same color in these plots. Rotation tolerance seems to have exerted little influence for both matchers, but this is explained by the fact that the acceptance thresholds were calculated individually for each rotation interval (as described in Section 4.2.5). Nevertheless, all situations considered, 1:N search never yielded an FMR higher than 10%.

² In Figure 4.11a, the accuracy target 1/100,000 overlaps completely with 1/10,000 for 3 and 7 rotation steps, and therefore is not visible.

4.2.6.1 Open Set Scenarios

Open set scenario results can be seen in Figure 4.12. In this case, 33.33% of the probes are identities that are not enrolled in the gallery. The same types of trends found in closed set scenarios are present here: if the accuracy target of the system is not highly restrictive, 1:First search yields worse results than 1:N, and the difference grows proportionally to the size of the gallery. Comparing the worst FMR open set cases with the corresponding closed set experiments, FMR seems to have decreased (in some cases, from ∼100% to ∼60%). This can be explained by the presence of unenrolled identities in the probe set, which correspond to one third of the probes. An interesting note is that despite the fact that VeriEye is in general more accurate than IrisBee, the FMR behavior manifested similarly in both matchers.

(Panels: FMR vs. gallery size in the closed set scenario at 3, 7 and 14 rotations; (a) IrisBee, (b) VeriEye; series for 1:N and 1:First at accuracy targets from 1/1,000,000 to 1/100.)

Figure 4.11. FMR in closed set scenario with different rotation tolerances. Higher means worse. While at more restrictive settings the difference between 1:N and 1:First is negligible, the problems in 1:First start to appear when the target accuracy is less strict than a certain limit.

At the accuracy target of 1/10,000 (blue series in Figs. 4.12a and 4.12b), FMR was generally higher for VeriEye than for IrisBee, both when using 1:N and 1:First search. Similarly to closed set scenarios, rotation tolerance had little noticeable influence on accuracy under all settings. As recommended by ISO 19795-1:2006 [30] for open set evaluations, FPIR and FNIR were also calculated for each case, and they do not show significant discrepancy between 1:N and 1:First search. In addition, we also calculated what we call “Enrolled FPIR”, which is, as opposed to FPIR, the false-positive identification rate for enrolled subjects. This way, it is possible to compare false match errors between the enrolled and unenrolled portions of the probe set. Figure 4.13 shows performance scores achieved by IrisBee in an open set scenario, under a range of target accuracies. Like in the closed set experiments, there is practically no difference between 1:N and 1:First scores at the most strict accuracy target (1/1,000,000).
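As a reference for how these rates are computed here, the sketch below shows one straightforward way to obtain FNIR, FPIR and Enrolled FPIR from a list of search outcomes; it is an illustrative assumption about the bookkeeping, not the actual evaluation code.

```python
def identification_rates(results):
    """results: list of (true_identity, returned_identity) pairs, where true_identity is
    None for an unenrolled probe and returned_identity is None when the search rejects
    the probe. Assumes both enrolled and unenrolled probes are present."""
    enrolled = [(t, r) for t, r in results if t is not None]
    unenrolled = [(t, r) for t, r in results if t is None]
    # FNIR: enrolled probe not identified as its true identity (rejected or mismatched)
    fnir = sum(1 for t, r in enrolled if r != t) / len(enrolled)
    # Enrolled FPIR: enrolled probe matched to some *other* enrolled identity
    enrolled_fpir = sum(1 for t, r in enrolled if r is not None and r != t) / len(enrolled)
    # FPIR: unenrolled probe matched to anyone at all
    fpir = sum(1 for t, r in unenrolled if r is not None) / len(unenrolled)
    return fnir, enrolled_fpir, fpir

example = [("A", "A"), ("B", None), ("C", "A"), (None, None), (None, "B")]
print(identification_rates(example))   # approximately (0.667, 0.333, 0.5)
```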


(Panels: FMR vs. gallery size in the open set scenario at 3, 7 and 14 rotations; (a) IrisBee, (b) VeriEye; series for 1:N and 1:First at accuracy targets from 1/1,000,000 to 1/100.)

Figure 4.12. FMR in open set scenario with different rotation tolerances. The presence of unenrolled identities in the probe set poses a harder challenge, regardless of matcher or search method: even 1:N search can yield higher FMR if the target accuracy is lenient enough.

At this level, although FPIR and Enrolled FPIR are very low, FNIR reaches approximately 20% for both methods. As the system strictness is relaxed, 1:First Enrolled FPIR starts to grow much higher than 1:N Enrolled FPIR, increasing with the size of the gallery. This starts to happen at the accuracy target of 1/100,000 (from top to bottom in the figure). At the same time, FNIR drops close to zero in these cases. Still, the divergence between the search methods in terms of FPIR or FNIR was negligible in all cases.

A similar overview is shown in Figure 4.14, which contains VeriEye performance scores in open set scenarios. All the scores present the same trends found in the IrisBee results. As expected from a high-accuracy matcher, FNIR for VeriEye was below 10% in all cases. On the other hand, its performance in the unenrolled portion of the probe set (FPIR) was worse than IrisBee by as much as 24%. 1:First search also had worse overall results than IrisBee regarding the Enrolled FPIR, but not as accentuated as FPIR.


(Panels: Enrolled FPIR, FPIR and FNIR vs. gallery size for IrisBee, open set, 14 rotations, 1 gallery permutation; one row per accuracy target, at thresholds 0.19, 0.27, 0.32, 0.35 and 0.38 for targets 1/1,000,000 through 1/100; series for 1:N and 1:First.)

Figure 4.13. Overall IrisBee performance in open set scenario.

(Panels: Enrolled FPIR, FPIR and FNIR vs. gallery size for VeriEye, open set, 14 rotations, 1 gallery permutation; one row per accuracy target, at thresholds 317, 67, 42, 31 and 21 for targets 1/1,000,000 through 1/100; series for 1:N and 1:First.)

Figure 4.14. Overall VeriEye performance in open set scenario.

4.2.6.2 Gallery Permutations

Contrary to what happens in 1:N search, 1:First accuracy can be affected by the ordering of the gallery. To better understand how this effect could interfere with search accuracy, we performed experiments in which the probe set was presented to different permutations of the same gallery. These experiments were also performed on both matchers, using the same matching thresholds presented so far.

Figure 4.15 shows the mean performance scores for IrisBee in a closed set scenario, across 20 random gallery permutations. Note that FPIR was omitted from this figure because it is a closed set scenario. In this plot, scores represent average values, and tails denote the standard deviation from the mean. As expected, 1:N results show no variance in performance. Using 1:First search, however, some degree of variance in Enrolled FPIR exists if the system accuracy target is permissive enough (Enrolled FPIR @ threshold 0.33). At the highest tolerance setting (Enrolled FPIR @ threshold 0.38), 1:First performance is so degraded that even the variance is very small. Like in previous experiments, IrisBee failed to operate at accuracy targets more strict than 1/10,000, mostly because of the rotation tolerance.

In an open set scenario, IrisBee was able to achieve a wider range of performance levels (Figure 4.16), from 1/1,000,000 to 1/100. The general tendencies for both search methods, however, remain the same as in the closed set: performance degradation starts to appear at a moderate level of strictness (1/10,000), and gets worse when the system is more lenient (1/1,000 and 1/100). 1:First presents a moderate amount of variation in Enrolled FPIR. Both in closed and open set scenarios, 1:First performance variance seems to decrease as the gallery size increases.

In general, the same trends found in IrisBee results are seen in VeriEye results (Figures 4.17 and 4.18). Some cases are interesting when we compare the results of both matchers. In the closed set scenario, at a target accuracy of 1/10,000, we observe no Enrolled FPIR increase for 1:First in IrisBee (Fig. 4.15, Enrolled FPIR @ threshold 0.15), while at the same tolerance level, 1:First in VeriEye reached up to nearly 40% (Fig. 4.17, Enrolled FPIR @ threshold 42).
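As an illustration of the protocol, the sketch below runs an order-dependent search over several random permutations of the same gallery and summarizes the spread of the error counts. It is an illustrative outline only, not the testing framework itself; the `search` argument can be any function with the signature of the 1:First sketch shown earlier.

```python
import random
import statistics

def permutation_spread(search, probes, gallery, match, threshold, n_perms=20, seed=0):
    """Mean and standard deviation of the false-match count of an order-dependent
    search (e.g. 1:First) across random re-orderings of the same gallery."""
    rng = random.Random(seed)
    false_matches = []
    for _ in range(n_perms):
        order = list(gallery)
        rng.shuffle(order)                       # only the gallery ordering changes
        errors = 0
        for true_id, probe in probes:
            returned = search(probe, order, match, threshold)
            if returned is not None and returned != true_id:
                errors += 1
        false_matches.append(errors)
    return statistics.mean(false_matches), statistics.pstdev(false_matches)
```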

(Panels: Enrolled FPIR and FNIR vs. gallery size for IrisBee, closed set, 14 rotations, averaged over 20 gallery permutations; rows at thresholds 0.15, 0.33 and 0.38 for targets 1/10,000, 1/1,000 and 1/100; series for 1:N and 1:First.)

Figure 4.15. IrisBee mean performance in closed set scenario, with 20 gallery permutations.

(Panels: Enrolled FPIR, FPIR and FNIR vs. gallery size for IrisBee, open set, 14 rotations, averaged over 20 gallery permutations; rows at thresholds 0.19, 0.27, 0.32, 0.35 and 0.38 for targets 1/1,000,000 through 1/100; series for 1:N and 1:First.)

Figure 4.16. IrisBee mean performance in open set scenario, with 20 gallery permutations.

(Panels: Enrolled FPIR and FNIR vs. gallery size for VeriEye, closed set, 14 rotations, averaged over 20 gallery permutations; rows at thresholds 306, 84, 42, 31 and 21 for targets 1/1,000,000 through 1/100; series for 1:N and 1:First.)

Figure 4.17. VeriEye mean performance in closed set scenario, with 20 gallery permutations.

Nevertheless, the good FMR performance of IrisBee in this situation is offset by very high FNIR scores (above 40%). Still in the closed set, 1:First Enrolled FPIR is very different between the two matchers at a target accuracy of 1/1,000: while IrisBee has larger variance, its maximum mean Enrolled FPIR is below 40% in the largest galleries (Fig. 4.15, Enrolled FPIR @ threshold 0.33). On the other hand, VeriEye had an FMR of almost 90% at the same tolerance level. In this case, FNIR for both matchers was below 5%.

Comparing the results obtained in open set scenarios (Figs. 4.16 and 4.18), we again find similar trends. The most distinguishing case is at target accuracy 1/10,000, where again the 1:First scores (Enrolled FPIR and FPIR) for VeriEye were in general higher than for IrisBee, while FNIR is under 5% for both matchers. At more tolerant levels (target accuracies 1/1,000 and 1/100), the difference in 1:First error rates is not as accentuated, but it is still a little higher in VeriEye than in IrisBee.

4.2.7 Discussion

The general tendency that can be apprehended from these experiments is that, all other parameters kept the same, 1:First search usually has worse accuracy than 1:N search. 1:First search accuracy degrades more quickly with increased gallery size than does 1:N accuracy: this trend appeared in all experiments and scenarios, regardless of the metric used (TPIR, FMR or FPIR).

Making the system tolerance for a match more strict can lower the FMR (FPIR in open set), but at the cost of increasing the FNMR (FNIR in open set). If the system is set to a very restrictive accuracy target of 1/1,000,000, there is no perceptible difference between the search methods in terms of False Matches, but False Non-Matches can start becoming too high for practical use.

Our second set of experiments, with larger gallery sizes, revealed no contradicting trends relative to the experiments with small galleries (up to 1,400 subjects).

(Panels: Enrolled FPIR, FPIR and FNIR vs. gallery size for VeriEye, open set, 14 rotations, averaged over 20 gallery permutations; rows at thresholds 317, 67, 42, 31 and 21 for targets 1/1,000,000 through 1/100; series for 1:N and 1:First.)

Figure 4.18. VeriEye mean performance in open set scenario, with 20 gallery permutations.

In fact, it illustrates how 1:First search accuracy degradation is closely connected to the gallery size, while the same does not happen with 1:N. If we look at how FMR grows in small galleries (Fig. 4.4), 1:N accuracy is also degrading, although at a clearly lower rate than 1:First. Experiments with larger galleries showed that the accuracy degradation in 1:N was a phenomenon specific to small galleries: once FMR stabilization occurs, at galleries of ∼1,500 subjects, it does not increase again, regardless of the gallery size (Fig. 4.11).

Perhaps one of the most unexpected results regards the behavior of the matchers in open set scenarios: although the error rate for the enrolled portion of the probe set remains similar to what was previously found, the unenrolled error rate revealed a different trend. Unlike with the enrolled probe set, the FPIR calculation showed there is no difference in accuracy between 1:N and 1:First on the unenrolled probe set.

Finally, the last set of experiments confirmed all the previous tendencies in the behavior of 1:N and 1:First searches. Performing 1:First search against different permutations of the same gallery introduces some degree of variance into the results, but the standard error from the mean does not exceed 6%.


CHAPTER 5 PROPOSED RESEARCH & TIMELINE

Aiming for the completion of my dissertation, the following sections present the work plan to be carried out in the next months. In the first section, the remaining stages of the Gender From Iris project are outlined. Next, Section 5.2 describes the planned approach to finalize the work on the 1:First search method. Section 5.3 defines the procedure through which iris presentation attack detection will be approached. Finally, a timeline for the execution of each of these projects and the writing of the dissertation is presented.

5.1 Gender From Iris

The work done so far has shown that gender prediction from the iris may have been regarded with excessive confidence. New research suggests that important cues used for gender prediction are located not in the iris itself, but in the periocular region. On the other hand, it still seems possible to extract some gender information from the iris texture. Therefore, the remaining portion of this project intends to answer the following questions: a) What accuracy can be expected in gender prediction based exclusively on the iris texture? and b) Is it possible to identify locations or physical structures in the iris that may contain gender cues? The research required to complete this project is listed below:

1. Extend the current GFI-C dataset. Some results obtained on our GFI-C dataset are still inconclusive because the variations from the older GFI dataset are small. Since GFI-C is a much smaller dataset, these variations can still be attributed to its size and consequent lack of diversity. The complete set of LG4000 images that is available to us contains 1,037 unique female subjects. Out of these, 116 were selected for GFI-C because they had images both with and without makeup. Due to limitations in schedule and personnel, our examination of these subjects was not finished. Approximately 800 subjects still remain to be examined, and they represent a possibility for a substantial increase in the size of GFI-C. With a larger dataset, the reliability of our results will increase, and the publication of another carefully constructed set focused on the gender-from-iris problem will be valuable to the community.

2. Perform classification experiments with CNNs using Class Activation Maps. As described in Zhou et al. [60], class activation maps are used to interpret the prediction decision made by the CNN. Figure 5.1 shows an example of class activation maps for object localization. It is expected that this work will allow us to identify regions or structures in the iris texture that correlate with gender (a minimal sketch of the CAM computation is given after this list).

Figure 5.1. An example of Class Activation Maps in image classification. Predicted classes are shown along with the activation map that originated them. Source: Zhou et al. [60]


3. Perform an inspection of the texture features involved in gender prediction. Using visualization tools like Class Activation Maps to identify these features, it will be possible to show if and how specific areas of the iris correlate with gender. The objective here is to better understand how automatic classifiers can produce reasonably accurate predictions from the iris texture, even when the cues are not perceptible to human observation.

4. Conduct a study to assess how well humans can predict gender based on a) the periocular image and b) the iris texture alone. Although the idea of predicting gender from the iris has gained momentum in the field of machine learning, it is currently unknown if human examiners can identify gender based on the iris texture, and to what extent. The idea is to assess the human ability to distinguish gender through visual inspection of the ocular region. In this experiment, we propose an approach similar to what Bobeldyk and Ross [4] used to train SVM classifiers. Human performance for gender prediction would be measured on three types of images: the entire periocular region, the cropped iris region, and the masked iris. The outcome of this experiment may confirm the results of automatic classification regarding the localization of gender cues in the eye image, and possibly provide new insight to help us better understand the problem.

5. Publish a journal paper with the results of the Gender From Iris project. The ideal venue for this publication will be decided at a later stage, together with my advisor.
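As referenced in item 2, the sketch below shows the core of the CAM computation from Zhou et al. [60]: a weighted sum of the last convolutional feature maps, using the fully connected weights of the target class. It is a minimal illustration that assumes a network ending in global average pooling followed by a fully connected layer; the array shapes and variable names are assumptions, not the planned implementation.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Compute a CAM in the style of Zhou et al. [60].

    feature_maps: array of shape (C, H, W) from the last convolutional layer
    fc_weights:   array of shape (num_classes, C) of the final fully connected layer
    class_idx:    index of the class of interest (e.g. 0 = female, 1 = male)
    """
    weights = fc_weights[class_idx]                      # (C,)
    cam = np.tensordot(weights, feature_maps, axes=1)    # weighted sum -> (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                                 # normalize to [0, 1]
    return cam  # upsample to the input size to overlay on the iris image

# Example with random values standing in for a trained network's outputs:
rng = np.random.default_rng(0)
fmaps = rng.random((256, 8, 64))        # e.g. features of a normalized iris strip (assumed size)
w_fc = rng.random((2, 256))             # binary gender classifier weights
heatmap = class_activation_map(fmaps, w_fc, class_idx=1)
print(heatmap.shape)                    # (8, 64)
```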

5.2 1:First Search

The exploratory work conducted as part of this proposal made important measurements of the performance and accuracy of the 1:First search method: a) it is, on average, considerably faster than the traditional 1:N scanning; and b) this performance improvement comes at the cost of accuracy loss, and the loss is proportional to the size of the database. At first, these findings discourage the use of 1:First search, but its combination with some other technique may allow us to take advantage of its performance while minimizing the accuracy loss.

For a long time, psychologists have delved into the human ability to know one's own cognitive processes and to use this knowledge for self-improvement. This “knowing about knowing” is called meta-cognition, and it can be adapted to what is defined by Scheirer et al. [51] as meta-recognition. In this sense, meta-recognition is a recognition system which is able to take corrective action to improve its own accuracy, based on the results obtained in the recognition task. In [51], the authors present a method to predict the performance of a recognition system based on the output for each match instance. Using Extreme Value Theory, the idea is to use the tails of the recognition score distributions to build a classifier that can predict whether a recognition decision is a success or a failure. Figure 5.2 illustrates this concept. Their evaluation demonstrates the applicability of the method to several biometric modalities based on similarity metrics. Our proposal is to combine their meta-recognition technique with 1:First search, improving the search accuracy while still dispensing with the need to scan the entire database.

Figure 5.2. Meta-recognition based on Extreme Value Theory. While the threshold t0 would have falsely rejected the score denoted by the red dot, post-recognition analysis of the non-match scores distribution reveals this score is at one of the extremes, and could be considered a match. Source: Scheirer et al. [51]


The main steps required for the conclusion of this work are described below:

1. Extend the testing framework which currently performs 1:First and 1:N search, implementing a third, hybrid search method. The additional search will be a version of 1:First search with Recognition Score Analysis [51]. That means that after locating a few match candidates in a search, a statistical assessment of the matching scores will allow the system to decide how good those candidates are (a minimal sketch of this idea is given after this list). Based on this score analysis, the system can then choose to return a result for the search or to continue the database scan. The set of galleries and probe sets that were used in this work would allow a comparative evaluation of this hybrid technique against the current 1:First and 1:N results.

2. Perform the same set of experiments for the extended version of 1:First, using the same input data, parameters and iris matchers. The objective is to conduct a complete comparison between the methods, allowing us to compare the proposed approach to both the original method, 1:First, and the traditional 1:N search. This comparison should take into account at least two main aspects: search accuracy and search performance. We expect the proposed method to be able to reduce the speed/accuracy tradeoff associated with 1:First search.

3. Publish the results of the comprehensive 1:First evaluation, along with the proposed modified version, in a journal paper. The ideal venue for this publication will be decided at a later stage, together with my advisor.
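As referenced in item 1, the sketch below outlines one way the hybrid search could work: after a candidate passes the acceptance threshold, the tail of the non-match scores seen so far is fit with a Weibull distribution, and the candidate is accepted only if its score is extreme with respect to that fit, in the spirit of the recognition score analysis of Scheirer et al. [51]. Everything here (function name, tail size, confidence level, the assumption of positive similarity scores) is an illustrative assumption, not the planned implementation.

```python
import numpy as np
from scipy.stats import weibull_min

def hybrid_first_search(probe, gallery, match, threshold, tail_size=20, confidence=0.99):
    """1:First search with a post-hoc score analysis step (illustrative sketch)."""
    scores_seen = []
    for identity, template in gallery:
        score = match(probe, template)          # positive similarity score (higher = more similar)
        scores_seen.append(score)
        if score < threshold or len(scores_seen) <= tail_size:
            continue
        # Fit a Weibull to the largest non-match scores, excluding the candidate itself.
        tail = np.sort(scores_seen[:-1])[-tail_size:]
        c, loc, scale = weibull_min.fit(tail, floc=0)
        if weibull_min.cdf(score, c, loc=loc, scale=scale) >= confidence:
            return identity                     # candidate is an outlier w.r.t. non-match scores
        # otherwise keep scanning: the candidate looks like a chance match
    return None
```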

5.3 Iris Presentation Attack Detection

Presentation Attack Detection methods are usually ad-hoc solutions. Several approaches have been shown to work very well to detect a specific type of attack, for a specific type of sensor, or on a distinct data set, but their ability to generalize across attacks, sensors or data sets is limited. In this sense, we propose the construction of a modular tool for the detection of iris spoofing with the use of textured contact lenses. The idea is to train a set of small CNNs to recognize textured lenses from images captured with different sensors. Each of these CNNs will be trained on a different set of features, from images coming from different sensors, with textured contact lenses and without them. Figure 5.3 shows a diagram of the organization of such a classifier. The final decision will be based on the fusion of results from the CNN ensemble, which will then be able to identify different types of spoofed images.

(Diagram: the original iris image and its BSIF-filtered version are fed to an ensemble of small CNNs, each producing a Real/Fake output; the fused result labels the sample as a genuine image or a spoofing attempt.)

Figure 5.3. Proposed structure for a modular classifier for iris PAD.

The modular characteristic of this approach is designed to be expandable by adding new CNNs to the ensemble. This way, new classifiers for different types of attack can be incrementally added to the tool: for example, a CNN for the detection of paper-printed irises or plastic eyes can be trained and added to the configuration. Another approach that could be used with this kind of ensemble would be to train different CNNs to recognize a specific class of image, and their combined result would produce a response in terms of spoof/non-spoof. For the execution of this project, the following steps will be necessary:

1. Composition of a data set of iris images containing three classes: a) no contact lenses, b) transparent lenses and c) textured contact lenses. The first two classes will be used as negative samples (real irises), and the last as positive samples (fake irises). The number of images of each type of sample (real/fake) should be roughly balanced, to prevent biasing the classifier. The objective of this dataset is to gather the largest possible number of iris images out of the University of Notre Dame biometric repository. This data set will allow the training of a CNN to perform detection of spoofing attacks (in this first case, textured contact lenses).

2. Definition and implementation of a simple CNN architecture to act as the base classifier. The objective of this CNN is to be able to learn texture patterns that differentiate a fake iris from a real one. Since the final objective of the project is to combine the classification results of several CNNs, we should define a simple architecture, keeping the number of layers and neurons to a minimum (a minimal sketch of such a base classifier is given after this list).

3. Perform an experiment to evaluate different types of input to the CNN. This experiment will consist in training and evaluating at least two CNNs on the new dataset: the first will use the raw iris images as input, and the second will use BSIF images. Given the types of texture features that could be highlighted or attenuated by different BSIF filters, we should make an empirical assessment based on several filter sizes and depths to find the most suited for the task. The CNNs should be able to perform a binary classification on the iris images, distinguishing fake irises from real ones. A strict cross validation protocol must be used to prevent overfitting the dataset. Objectively, this experiment should allow us to evaluate and compare how each individual classifier responds to each image class. Furthermore, it should provide an answer as to how much improvement can be expected in spoofing detection using BSIF filters, in comparison to the raw intensity.

4. Perform an experiment to evaluate an ensemble of CNN classifiers specialized in each class of image. Despite the fact that the final classification is binary (live or fake iris), the classifiers have to deal with at least three types of images: positive examples are images with cosmetic lenses, while negative examples may belong to two other classes, clear lenses or no lenses. Works like [52] suggest that the use of single-class classification may be beneficial in iris spoofing detection. Our experiment should train one single-class CNN classifier for each of the classes, and use the combined output of the ensemble as the final result of the classification.

5. Perform a comparative evaluation against other state-of-the-art techniques for iris spoofing detection, such as [21, 34, 46, 52].

6. Publish the results, preferentially in a journal like T-IFS.
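As referenced in step 2, the sketch below is one possible shape for such a small base classifier (written with PyTorch for illustration; the layer sizes and the single-channel 64×512 input are assumptions, not the final architecture):

```python
import torch
from torch import nn

def build_base_classifier():
    # Small texture classifier: input is a 1-channel normalized iris image
    # (64 x 512 assumed here), output is a single logit for "textured lens present".
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 1),
    )

model = build_base_classifier()
x = torch.randn(4, 1, 64, 512)     # a batch of 4 synthetic inputs
print(model(x).shape)              # torch.Size([4, 1])
```

The same skeleton could be trained on either the raw intensity images or their BSIF encodings, which is the comparison proposed in step 3.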

5.4 Timeline

Table 5.1 summarizes the timeline for the proposed execution of this thesis. The estimated time for conclusion is twelve months. Since the proposed line of work consists of three separate projects, their execution was kept mainly sequential, to accommodate eventual prioritization changes. Dates are provisional, and they should be adjusted as necessary within the time limit for conclusion.


TABLE 5.1

Proposed timetable for dissertation research and writing (September 2017 through August 2018)

Project        Tasks
Spoofing       Dataset composition; PAD implementation/evaluation; Publishing results
GFI            GFI-C dataset extension; Class Activation Map experiments; Human prediction experiment; Publishing results
1:First        Meta-Recognition implementation/evaluation; Publishing results
Dissertation   Writing; Defense


BIBLIOGRAPHY

1. A. Bansal, R. Agarwal, and R. K. Sharma. Svm based gender classification using iris images. Proceedings - 4th International Conference on Computational Intelligence and Communication Networks, CICN 2012, pages 425–429, 2012. doi: 10.1109/CICN.2012.192. 2. S. Bengali. India is building a biometric database for 1.3 billion people – and enrollment is mandatory. Los Angeles Times, 11 May 2017. Available http://www. latimes.com/world/la-fg-india-database-2017-story.html [Last accessed in: 14 May 2017.]. 3. A. Bertillon. La couleur de l’iris. Rev. Sci., 36:65–73, 1885. 4. D. Bobeldyk and A. Ross. Iris or periocular? exploring sex prediction from near infrared ocular images. In Biometrics Special Interest Group (BIOSIG), 2016 International Conference of the, pages 1–7. IEEE, 2016. 5. K. W. Bowyer, K. Hollingsworth, and P. J. Flynn. Image understanding for iris biometrics: A survey. Computer Vision and Image Understanding, 110 (2):281–307, 2008. ISSN 1077-3142. doi: http://dx.doi.org/10.1016/j.cviu. 2007.08.005. URL http://www.sciencedirect.com/science/article/pii/ S1077314207001373. 6. K. W. Bowyer, K. P. Hollingsworth, and P. J. Flynn. A survey of iris biometrics research: 2008–2010. In M. J. Burge and K. W. Bowyer, editors, Handbook of Iris Recognition, pages 15–54. Springer London, London, 2013. ISBN 978-1-44714402-1. doi: 10.1007/978-1-4471-4402-1 2. URL http://dx.doi.org/10.1007/ 978-1-4471-4402-1_2. 7. K. W. Bowyer, E. Ortiz, and A. Sgroi. Trial somaliland voting register de– duplication using iris recognition. Biometrics in the Wild Workshop 2015, BWild 2015, Ljubljana, Slovenia, May 2015. 8. Canada Border Services Agency. NEXUS Air – Entering Canada by Air. Canada Border Services Agency Website http://www.cbsa-asfc.gc.ca/prog/nexus/ air-aerien-eng.html [Last accessed in: 18 August 2017.]. 9. M. Chumakov. Canadian Border Services Agency. Private communication, 2015.


10. A. Czajka, K. Bowyer, M. Krumdick, and R. Vidal Mata. Recognition of imageorientation-based iris spoofing. IEEE Transactions on Information Forensics and Security, 6013(c), 2017. ISSN 1556-6013. doi: 10.1109/TIFS.2017.2701332. URL http://ieeexplore.ieee.org/document/7919203/. 11. A. Dantcheva, P. Elia, and A. Ross. What Else Does Your Biometric Data Reveal? A Survey on Soft Biometrics. IEEE Transactions on Information Forensics and Security, 11(3):441–467, mar 2016. ISSN 1556-6013. doi: 10.1109/TIFS.2015. 2480381. 12. J. Daugman. How Iris Recognition Works. In IEEE Transactions on Circuits and Systems for Video Technology2, volume 14, pages 21–30. IEEE, 2004. doi: 10.1109/TCSVT.2003.818350. 13. J. Daugman. Probing the Uniqueness and Randomness of IrisCodes: Results From 200 Billion Iris Pair Comparisons. Proceedings of the IEEE, 94(11):1927– 1935, Nov. 2006. ISSN 0018-9219. doi: 10.1109/JPROC.2006.884092. URL http: //ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4052470. 14. J. Daugman. New methods in iris recognition. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics : a publication of the IEEE Systems, Man, and Cybernetics Society, 37(5):1167–75, oct 2007. ISSN 1083-4419. URL http://www.ncbi.nlm.nih.gov/pubmed/17926700. 15. J. Daugman. Iris recognition. In A. Jain, P. Flynn, and A. Ross, editors, Handbook of Biometrics, chapter 4. Springer, 2008. 16. J. Daugman. Iris recognition at airports and border crossings. In S. Z. Li and A. K. Jain, editors, Encyclopedia of Biometrics, pages 998–1004. Springer US, Boston, MA, 2015. ISBN 978-1-4899-7488-4. doi: 10.1007/978-1-4899-7488-4 24. URL http://dx.doi.org/10.1007/978-1-4899-7488-4_24. 17. J. Daugman. Information Theory and the IrisCode. IEEE Transactions on Information Forensics and Security, 11(2):400–409, 2016. 18. J. G. Daugman. High confidence visual recognition of persons by a test of statistical independence. IEEE transactions on pattern analysis and machine intelligence, 15(11):1148–1161, 1993. 19. J. G. Daugman. Biometric personal identification system based on iris analysis, March 1994. US Patent 5,291,560. 20. J. H. Doggart. Ocular signs in slit-lamp microscopy. H. Kimpton, London, 1949. 21. J. S. Doyle and K. W. Bowyer. Robust Detection of Textured Contact Lenses in Iris Recognition Using BSIF. IEEE Access, 2015. ISSN 21693536. doi: 10.1109/ ACCESS.2015.2477470.


22. J. S. Doyle, P. J. Flynn, and K. W. Bowyer. Effects of mascara on iris recognition. In I. Kakadiaris, W. J. Scheirer, and L. G. Hassebrook, editors, Proc. SPIE 8712, Biometric and Surveillance Technology for Human and Activity Identification X, volume 8712, page 87120L, may 2013. doi: 10.1117/12.2017877. URL http://dx.doi.org/10.1117/12.2017877http://proceedings. spiedigitallibrary.org/proceeding.aspx?doi=10.1117/12.2017877. 23. N. Erdogmus and S. Marcel. Introduction. In S. Marcel, M. S. Nixon, and S. Z. Li, editors, Handbook of Biometric Anti-Spoofing: Trusted Biometrics under Spoofing Attacks, chapter 1, pages 1–11. Springer, London, 2014. 24. M. Fairhurst, M. Erbilek, and M. D. Costa-Abreu. Exploring gender prediction from iris biometrics. In Biometrics Special Interest Group (BIOSIG), 2015 International Conference of the, pages 1–11, Sept 2015. doi: 10.1109/BIOSIG.2015. 7314602. 25. S. P. Fenker, E. Ortiz, and K. W. Bowyer. Template Aging Phenomenon in Iris Recognition. IEEE Access, 1:266–274, 2013. ISSN 2169-3536. doi: 10.1109/ ACCESS.2013.2262916. URL http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=6516567. 26. J. Galbally and M. Gomez-Barrero. A review of iris anti-spoofing. In Biometrics and Forensics (IWBF), 2016 4th International Workshop on, pages 1–6. IEEE, 2016. 27. Z. Guo, L. Zhang, and D. Zhang. A completed modeling of local binary pattern operator for texture classification. IEEE Transactions on Image Processing, 19 (6):1657–1663, 2010. ISSN 1941-0042. doi: 10.1109/TIP.2010.2044957. 28. F. Hao, J. Daugman, and P. Zieli´ nski. A fast search algorithm for a large fuzzy database. IEEE Transactions on Information Forensics and Security, 3(2):203– 212, 2008. 29. Interpeace. Somaliland successfully launches voter registration. Interpeace, 21 January 2016. Available http://www.interpeace.org/2016/01/ somaliland-successfully-launches-voter-registration/ [Last accessed in: 06 June 2017.]. 30. ISO 19795-1:2006. International Standard ISO/IEC 19795-1 Biometric performance testing and reporting Part 1. Standard, International Organization for Standardization, Geneva, CH, 2006. 31. ISO 29794-6:2014. Information technology Biometric sample quality Part 6: Iris image data. Standard, International Organization for Standardization, Geneva, CH, 2014. 32. A. K. Jain, S. C. Dass, and K. Nandakumar. Soft Biometric Traits for Personal Recognition Systems. In D. Zhang and A. K. Jain, editors, Biometric 90

Authentication: First International Conference, ICBA 2004, Hong Kong, China, July 15-17, 2004. Proceedings, pages 731–738. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-540-25948-0. doi: 10.1007/978-3-540-25948-0_99. URL http://dx.doi.org/10.1007/978-3-540-25948-0_99.

33. A. K. Jain, A. A. Ross, and K. Nandakumar. Introduction to Biometrics. Springer, 2011. ISBN 9780387773254.

34. N. Kohli, D. Yadav, M. Vatsa, R. Singh, and A. Noore. Detecting medley of iris spoofing attacks using DESIST. pages 1–6, 2016. 35. A. Kuehlkamp and K. W. Bowyer. An analysis of 1-to-first matching in iris recognition. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–8, March 2016. doi: 10.1109/WACV.2016.7477687. 36. A. Kuehlkamp, B. Becker, and K. Bowyer. Gender-from-iris or gender-frommascara? In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1151–1159, March 2017. doi: 10.1109/WACV.2017.133. 37. S. Lagree and K. W. Bowyer. Predicting ethnicity and gender from iris texture. In IEEE International Conference on Technologies for Homeland Security (HST), 2011, pages 440–445, Nov 2011. doi: 10.1109/THS.2011.6107909. 38. M. Larsson, N. L. Pedersen, and H. Stattin. Importance of genetic effects for characteristics of the human iris. Twin research : the official journal of the International Society for Twin Studies, 6(3):192–200, 2003. ISSN 1369-0523. doi: 10.1375/twin.6.3.192. 39. X. Liu, K. W. Bowyer, and P. J. Flynn. Experiments with an Improved Iris Segmentation Algorithm. Fourth IEEE Workshop on Automatic Identification Advanced Technologies (AutoID’05), (October):118–123, oct 2005. doi: 10. 1109/AUTOID.2005.21. URL http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=1544411. 40. R. Mukherjee and a. Ross. Indexing iris images. 19th International Conference on Pattern Recognition, (December), 2008. ISSN 1051-4651. doi: 10.1109/ICPR. 2008.4761880. 41. National Science & Technology Council, Subcommittee on Biometrics. Biometrics Glossary. Available at http://www.nws-sa.com/biometrics/biooverview.pdf [Last accessed in: 07 July 2017.], . 42. National Science & Technology Council, Subcommittee on Biometrics. Biometrics History. Available at http://www.nws-sa.com/biometrics/biohistory.pdf [Last accessed in: 07 July 2017.], .


43. T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, jul 2002. ISSN 0162-8828. doi: 10.1109/TPAMI.2002.1017623. 44. E. Ortiz and K. W. Bowyer. Exploratory Analysis of an Operational Iris Recognition Dataset from a CBSA Border-Crossing Application. In IEEE Computer Society Workshop on Biometrics. IEEE, 2015. 45. H. Proen¸ca. Iris biometrics: Indexing and retrieving heavily degraded data. IEEE Transactions on Information Forensics and Security, 8(12):1975–1985, 2013. ISSN 15566013. doi: 10.1109/TIFS.2013.2283458. 46. R. Raghavendra and C. Busch. Robust scheme for iris presentation attack detection using multiscale binarized statistical image features. IEEE Transactions on Information Forensics and Security, 10(4), 2015. ISSN 15566013. doi: 10.1109/TIFS.2015.2400393. 47. R. N. Rakvic, B. J. Ulis, R. P. Broussard, R. W. Ives, and N. Steiner. Parallelizing iris recognition. IEEE Transactions on Information Forensics and Security, 4(4): 812–823, 2009. 48. N. K. Ratha, J. H. Connell, and R. M. Bolle. An analysis of minutiae matching strength. In International Conference on Audio-and Video-Based Biometric Person Authentication, pages 223–228. Springer, 2001. 49. C. Rathgeb and A. Uhl. Iris-Biometric Hash Generation for Biometric Database Indexing. 2010. doi: 10.1109/ICPR.2010.698. 50. L. Sandhana. Iris register to eyedentify voting fraud in somaliland. New Scientist, 03 September 2014. Available https://www.newscientist.com/article/ mg22329854-400-iris-register-to-eyedentify-voting-fraud-in-somaliland/ [Last accessed in: 06 June 2017.]. 51. W. J. Scheirer, A. Rocha, T. E. Boult, and R. J. Micheals. Meta-Recognition: The Theory and Practice of Recognition Score Analysis. IEEE transactions on Pattern Analysis and Machine Intelligence, 33(August):1689–1695, 2011. 52. A. F. Sequeira, S. Thavalengal, J. M. Ferryman, P. Corcoran, and J. S. Cardoso. A realistic evaluation of iris presentation attack detection. pages 660–664, 2016. 53. K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for LargeScale Image Recognition. In 3rd International Conference on Learning Representations (ICRL2015), May 2015. URL http://arxiv.org/abs/1409.1556v6. 54. R. A. Sturm and M. Larsson. Genetics of human iris colour and patterns. Pigment Cell and Melanoma Research, 22(5):544–562, 2009. ISSN 1755-1471. doi: 10. 1111/j.1755-148X.2009.00606.x. 92

55. Z. Sun and T. Tan. Iris anti-spoofing. In S. Marcel, M. S. Nixon, and S. Z. Li, editors, Handbook of Biometric Anti-Spoofing: Trusted Biometrics under Spoofing Attacks, chapter 6, pages 103–123. Springer, London, 2014. 56. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. 57. J. E. Tapia, C. A. Perez, and K. W. Bowyer. Gender Classification from Iris Images using Fusion of Uniform Local Binary Patterns. In European Conference on Computer Vision (ECCV) Workshops, 2014. Springer International Publishing, 2014. 58. J. E. Tapia, C. A. Perez, and K. W. Bowyer. Gender classification from the same iris code used for recognition. IEEE Trans. Information Forensics and Security, 11:1760–1770, 2016. 59. V. Thomas, N. V. Chawla, K. W. Bowyer, and P. J. Flynn. Learning to predict gender from iris images. In First IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS), 2007., pages 1–5, Sept 2007. doi: 10.1109/BTAS.2007.4401911. 60. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.

This document was prepared & typeset with pdfLaTeX, and formatted with the nddiss2e class file (v3.2013[2013/04/16]) provided by Sameer Vijay and updated by Megan Patnott.
