Evaluating Biometric Security: Understanding the Impact of Wolves in Sheep's Clothing

Daniel Lopresti†, Fabian Monrose‡, and Lucas Ballard‡

† Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
‡ Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA

Email: † [email protected], ‡ {fabian,lucas}@cs.jhu.edu

Abstract

Growing interest in biometric security has resulted in much work on systems that attempt to exploit the individuality of human behavior. In this paper, we survey our recent research examining issues arising when such biometrics are to be used for authentication or cryptographic key generation. We propose steps towards the development of more rigorous evaluation methodologies for behavioral biometrics that take into account threat models previously ignored in the literature. The pervasive assumption that adversaries are minimally motivated (or, even worse, naïve), or that attacks can only be mounted through manual effort, is too optimistic and even dangerous. The discussion is illustrated by summarizing our analysis of a handwriting-based key generation system showing that the standard evaluation methodology significantly overestimates its security. We also present an overview of our work on fully automated (generative) attack models that can be nearly as effective as skilled human forgers and thus present both a serious threat as well as a potential tool for improving the testing of biometric systems.

Keywords: biometric security, signature verification, generative models, cryptographic key generation, password hardening.

1. Introduction

Growing interest in biometric security has resulted in much work on systems that attempt to exploit the individuality of human behavior. Biometrics are, of course, a form of user input that is intended to be difficult for an attacker to reproduce. They can be employed as a mechanism for a user to authenticate him/herself to a reference monitor that becomes unresponsive after a certain number of failed attempts, or as

a means for generating user-specific cryptographic keys (see, for example, [25, 32]). In the event that the biometric measurement alone does not provide sufficient security, a possible alternative is to combine a secret password with biometric features recorded from the user during the process of entering it, a scheme known as password hardening [26, 27]. Although this work may someday pave the way for the widespread use of biometrics in a variety of applications, we believe the manner in which biometric systems are traditionally tested, at least as represented by the scientific literature, raises serious concerns. Our recent research in this area [1, 2, 21, 25] suggests a clear need for more powerful adversarial models in performing security analyses. There exists a disconnect between real-world threats and typical "best" practices [22] for reporting biometric performance, one that requires rethinking as industry and governments, supported by the research community, move towards the deployment of biometric technologies. The type of analysis we describe herein is of primary importance when biometrics are used for authentication and cryptographic key generation (e.g., [5, 11, 16, 25]), where weakest-link analysis is paramount. To provide a concrete example of this critical issue, we discuss a particular methodology where we assume that the adversary utilizes indirect knowledge of the targeted user's biometric features. That is, we presume the attacker has observed measurements of the biometric in contexts outside its use for security. For instance, if the biometric is derived from handwriting dynamics that arise when the user inputs text via a stylus, then we presume the attacker has samples of the user's handwriting in another context, a captured hard copy illustrating how the user writes, or handwriting from other users who employ a similar style.
We argue that this approach is more reflective of real threats to biometric security, as such information can be used by an attacker to build generative models that effectively mimic the behavior of the targeted user.

It is important to note that our ultimate goal in studying generative attacks is not simply to invent new threat models, but rather to improve biometric authentication and password hardening approaches so that it is possible to resist them. This said, some biometrics are intrinsically susceptible to generative attacks, static biometrics (e.g., fingerprints, retinal patterns, etc.) being prime examples: once recorded in another context, they can be trivially “generated” since they do not change (see, for example, [33]). In such cases, reference monitors must confirm that they are measuring a biometric feature from the human directly (e.g., by sensing blood flow) rather than a synthesized copy. For this reason, we are currently focusing our attention on behavioral biometrics, attempting to identify the shortcomings in current practices.

2. Traditional Biometric Evaluation

Despite the diversity of approaches examined by the biometrics community [3], several key points remain relatively constant. For instance, the traditional procedure for applying a biometric as an authentication paradigm involves sampling an input from a user, extracting an appropriate set of features, and comparing these to previously stored templates to confirm or deny the claimed identity. While a wide range of features have been investigated, it is universally true that system designers seek features that exhibit large inter-class variability and small intra-class variability. In other words, two different users should be unlikely to generate the same input features, while a single user ought to be able to reproduce his/her own features accurately and repeatably. Likewise, the evaluation of most biometric systems usually follows a standard model: enroll some number of users by collecting training samples, e.g., of their handwriting or speech. At a later time, test the rate at which users' attempts to recreate the biometric to within a predetermined tolerance fail. This failure rate is denoted as the False Reject Rate (sometimes called FRR or "Type I" error rate). Evaluation also involves assessing the rate at which another user's input (an impostor) is able to fool the system when presented as coming from the user in question (the target). This evaluation yields the False Accept Rate (or FAR or "Type II" error rate) for the system under consideration. A tolerance setting to account for natural human variation is also vital in assessing the limits within which a sample will be considered genuine while, at the same time, balancing the delicate trade-off of resistance to forgeries. Typically, one uses the Equal Error Rate (EER) – that is, the point at which the FRR and FAR are equal – to describe the accuracy of a given biometric system. Essentially, the lower the EER, the higher the accuracy.
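These quantities are easy to make concrete. The sketch below is illustrative only (the score semantics and threshold sweep are our own assumptions, not taken from any particular system): it computes FRR and FAR at a decision threshold, and locates the EER by a simple sweep.

```python
import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    """FRR and FAR at one decision threshold.

    Scores are similarity values: higher means "closer to the enrolled
    template". A sample is accepted when its score >= threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    frr = np.mean(genuine < threshold)    # genuine attempts rejected ("Type I")
    far = np.mean(impostor >= threshold)  # impostor attempts accepted ("Type II")
    return frr, far

def equal_error_rate(genuine_scores, impostor_scores, steps=1000):
    """Sweep the threshold and report the error rate where FRR ~ FAR."""
    lo = min(min(genuine_scores), min(impostor_scores))
    hi = max(max(genuine_scores), max(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in np.linspace(lo, hi, steps):
        frr, far = error_rates(genuine_scores, impostor_scores, t)
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer
```

A lower EER indicates that the genuine and impostor score distributions are better separated; the sections that follow argue that the impostor distribution itself depends heavily on how capable the assumed forger is.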
In the literature, researchers also commonly distinguish between forgeries that were never intended to defeat the system ("random" or naïve forgeries), and those created by a user who was instructed to make such an attempt given information about the targeted input (i.e., so-called "skilled" forgeries). However, the evaluation of biometrics under such weak security assumptions can be misleading. Indeed, it may even be argued that because there is no rigorous means by which one can define a "good" forger and prove his/her existence (or non-existence), such analysis is theoretically impossible [30]. Nevertheless, the biometric community continues to rely on relatively simple measures of adversarial strength, and most studies to date only incorporate unskilled adversaries and, very rarely, "skilled" impersonators [15, 19, 20, 23, 24, 30]. This general practice is troubling as the FAR is likely to be significantly underestimated [29, 30]. Moreover, we believe that this relatively ad hoc approach to evaluation misses a significant threat: the use of generative models to create synthetic forgeries which can form the basis for sophisticated, fully automated attacks on biometric security. Even rather simplistic attacks launched by successive replication of synthetic or actual samples from a representative population can have adverse effects on the FAR – particularly for the weakest users (i.e., the so-called "Lambs" in the taxonomy for a hypothetical menagerie of biometric users [6]). In what follows, we provide an overview of our recent work examining the extent of this problem and our proposals for solving it. The interested reader is referred to past papers [1, 2, 21, 25] for further details beyond what we are able to present here.

3. Handwriting as a Biometric

The key thesis behind biometric security is to exploit the individuality of human behavior. Signature verification, with its long, rich history coupled with hundreds of papers written on the subject (see, e.g., [15, 31]), is a perfect example. The use of signatures has well-known benefits: they are a natural and familiar way of confirming identity, have already achieved acceptance for legal purposes, and their capture is less invasive than most other biometric schemes [9]. Still, each individual has only one true signature, a severe limitation when it comes to certain security applications. As a result, researchers have recently begun to examine using arbitrary handwritten phrases, recasting the problem as one of computing cryptographic keys or biometric hashes (e.g., [10, 16, 35, 36]). In our work, we have adopted such a paradigm to support the study of threat models that have received insufficient attention in the literature. While some of the attacks we envision are more realistic than others, all are designed to "stress" handwriting biometrics in a variety of new ways, with our final goal being to build confidence in the inherent security of a given measure (or, conversely, to demonstrate that the measure suffers from flaws that should be addressed before it is put into practice). In addition to studying the abilities of human forgers when presented with varying degrees of knowledge regarding the targeted handwriting, we have asked whether an adversary can be successful in developing an accurate generative model for the user in question when given only limited information. In particular, we have discovered that such models, trained only on captured static (offline) samples combined with pen-stroke dynamics learned from general population statistics, are nearly as effective as the best human forgers we have encountered, and substantially better than our "average" forger [1]. This surprising result suggests that automated attacks should

receive serious attention in the evaluation of security systems that use handwriting as a biometric and, by obvious extension, of any biometric based on human behavior that can be modeled algorithmically. In considering the mathematical features that can be extracted from the incoming signal to perform authentication, it is important to distinguish between two different classes of inputs. Data captured by sampling the position of a stylus tip over time on a digitizing tablet or pen computer are referred to as online handwriting, whereas inputs that are presented in the form of a 2-D bitmap (e.g., scanned from a piece of paper) are referred to as offline handwriting. We also sometimes refer to the former as "dynamic" data and the latter as "static." Features extracted from static handwriting samples include bounding boxes and aspect ratios, stroke densities in a particular region, curvature measurements, etc. In the dynamic case, these features are also available and, in addition, timing and stroke order information allows the computation of pen-tip velocities, accelerations, etc. Studies on signature verification and the related topic of handwriting recognition often make use of 50 or more features and, indeed, feature selection is itself a topic for research. The features we use in our own work are representative of those commonly reported in the field. Repeatability of features over time is, of course, a key issue, and it has been found that dynamic and static features are equally repeatable [12].
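To make the static/dynamic distinction concrete, the sketch below computes a few representative features of each kind from a single online sample; the particular features and names are our own illustrative choices, not the paper's feature set.

```python
import numpy as np

def handwriting_features(x, y, t):
    """A few illustrative static and dynamic features from one online
    handwriting sample, given pen-tip coordinates sampled over time.
    Real systems typically use 50 or more such features."""
    x, y, t = (np.asarray(v, dtype=float) for v in (x, y, t))
    # -- static features (also computable from an offline bitmap) --
    width = x.max() - x.min()
    height = y.max() - y.min()
    aspect_ratio = width / height if height else 0.0
    # -- dynamic features (require the time channel) --
    dt = np.diff(t)
    vx = np.diff(x) / dt               # per-segment pen-tip velocity
    vy = np.diff(y) / dt
    speed = np.hypot(vx, vy)
    return {
        "aspect_ratio": float(aspect_ratio),
        "mean_speed": float(speed.mean()),
        "peak_speed": float(speed.max()),
        "duration": float(t[-1] - t[0]),
    }
```

Note that the dynamic features vanish entirely when only a scanned bitmap is available, which is exactly the gap the generative attack of Section 4 tries to fill by inferring timing from population statistics.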

4. Design of the Experiments

In this section, we review a series of experiments we performed to investigate our thesis in the context of a specific system for generating cryptographic keys from handwriting. It should be clear, however, that the same conclusions apply to many other proposed schemes for biometric security appearing in the literature. These results were first reported in two earlier papers [1, 2].

4.1. Biometric Hash for Handwriting

For this work, we adapted the biometric hash of Vielhauer et al. [35, 36] and evaluated the impact of forgeries produced by human adversaries with varying levels of skill and knowledge [2], as well as forgeries generated using a concatenative synthesis model we developed for this purpose [1]. To ensure that our results reflected a meaningful assessment, we also devoted some effort to studying the robustness of the underlying authentication system. In addition to testing the system as originally proposed in [35], we also examined several enhancements. First, we performed an empirical analysis of a wide range of features to find those that appear to be the most secure [2]. Specifically, we analyzed 145 state-of-the-art features and measured the security of each by noting the difference between the proportion of times legitimate users and forgers fail to reproduce a given feature. We then chose the features with the smallest such difference and used the resulting 37 online and offline features to drive the system. Second, we use smaller tolerance values than those described by Vielhauer et al. [35]. The approach of Vielhauer et al. assumes that each value in the biometric hash must be recovered perfectly. On the other hand, we assume that it will be possible to correct errors in a small number of hash positions, either through a search process or an error-correcting code. For example, depending on the range of values in the positions in question, correcting 5-10 errors is feasible in a few seconds of computation time on modern PCs. The number of positions in which errors have been corrected provides the x-axis in our performance graphs.

4.2. Data Collection

Our results are based on 11,038 handwriting samples collected over several weeks from 50 users at our two universities. We used NEC VersaLite and HP Compaq TC1100 tablet PCs as our writing platforms. To motivate our participants, incentives were awarded during each round of data collection.¹ We summarize our data collection efforts below. First, users were asked to write five phrases, comprising two-word oxymorons, ten times each. These passphrases were two-word sayings that are easy to recall ("perfect misfit," "solo concert," "crisis management," "least favorite," and "graphic language"). Each phrase was displayed in the top half of the screen, and the user's writing was collected in the bottom half. During that same session a second (disjoint) set of 65 oxymorons was also collected. The writings from these oxymorons represent our generative corpus. We refer to this data as Round I. Approximately two weeks later, participants re-wrote the original five phrases ten times each. Next, users were asked to forge representative samples from victims from the earlier round. The first challenge was to provide forgeries after seeing only a static representation of the writing; later the process was repeated, this time using real-time renderings of the targets. This data makes up Round II. A screen snapshot of a graphical tool we developed for managing the forgery collection process is shown in Figure 1.

Figure 1. icapture tool for collecting forgeries.

¹ E.g., snacks, gift certificates, and special prizes for the most consistent writers, best forgers, the most dedicated participants, etc.
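The search-based error correction mentioned in Subsection 4.1 can be sketched as follows. The verification oracle, value range, and brute-force strategy are illustrative assumptions on our part; a deployed system would verify a candidate hash by re-deriving the cryptographic key and checking it against stored validation data.

```python
from itertools import combinations, product

def correct_hash(candidate, verify, value_range, max_errors=5):
    """Brute-force correction of a small number of biometric-hash errors.

    candidate   : list of ints produced from a fresh biometric sample
    verify      : oracle returning True when a hash yields the right key
    value_range : iterable of plausible values for any hash position

    Tries all ways of rewriting up to max_errors positions. The cost grows
    as C(n, e) * |value_range|**e, which is why only a handful of errors
    (roughly 5-10, depending on the value range) is searchable in seconds.
    """
    if verify(candidate):
        return candidate
    n = len(candidate)
    for e in range(1, max_errors + 1):
        for positions in combinations(range(n), e):
            for values in product(value_range, repeat=e):
                trial = list(candidate)
                for p, v in zip(positions, values):
                    trial[p] = v
                if verify(trial):
                    return trial
    return None  # sample too far from the enrolled template
```

The number of positions the search is allowed to rewrite is exactly the "errors corrected" axis used in the performance graphs that follow.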

One week later, we singled out nine "skilled" (but untrained) forgers: three forgers per group, based on the writing style that they appeared to have a natural ability to duplicate. These forgers were then asked to forge 15 writing samples, with 60% of the samples coming from the weakest 10 targets and the other 40% chosen at random. As before, a real-time reproduction of the target sample was displayed, and the forger was allowed to attempt forgeries with the option of saving the attempts he/she liked. The forger could also select and replay forgeries and compare them to the target. Once satisfied, the forger chose the forgery that represented his/her best attempt and proceeded to the next target. This data was our Round III data. A point worth emphasizing is that our "skilled" forgers exhibited a high degree of self-motivation. In one instance, a forger wrote a single passphrase 106 times before being satisfied that what he had created depicted an accurate reproduction (the forgers were not provided with any direct feedback on the effectiveness of their attempts). Another replayed the target phrase 40 times. On average, each forger made 14 attempts and redrew the target five times, taking roughly 1-2 hours to forge the 15 passphrases. Clearly, such dedication plays an important role when evaluating the security of biometric systems, but this point has been explored only to a limited degree in the literature to date. One can expect determined adversaries to expend at least as much effort as the forgers in our study. Finally, we had four test subjects trace the parallel corpora to recover the overall shape, stroke order, and direction of the writing. These users were never asked to trace their own handwriting. Velocity information was discarded, to be re-synthesized as described in the next section.

Figure 2. Example forgeries created by a skilled human forger and our algorithm.

4.3. Concatenative Handwriting Synthesis

As suggested earlier, we group users into one of three categories based on the predominant style of their handwriting: cursive, block, or mixed. To generate forgeries, we use an algorithm that we call Concatenative-Synthesis with Temporal-Inference (CSTI). At a high level, the algorithm proceeds as follows. When attempting to forge a targeted writer, the adversary first computes some general statistics over the (annotated) parallel corpora of writers with the same writing style. Such statistics include inter-character spacing and timing, as well as stroke velocity. The adversary then traces the phrases in the target user's parallel corpus to recover the overall shape of the writing, as well as a "best-guess" estimate of stroke direction and order. The tracer is provided with an offline image of the writing and allowed unlimited attempts to replicate the strokes. However, there is no attempt to reproduce the dynamics of the writing; that information will be synthesized using a procedure to be described shortly. A search of the (traced) parallel corpus is then performed to extract n-grams that comprise the passphrase p and combine the x(t) and y(t) signals of each n-gram to form a static rendering of p (see [2] for a more detailed discussion of this process). Finally, using the temporal-inference algorithm discussed below, the elapsed time between each point in the synthesized passphrase is estimated, reconstructing the dynamics. Figures 2 and 3 show examples of forgeries generated in this manner.

Figure 3. Example forgeries created by a skilled human forger and our algorithm.
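To make the concatenation step concrete, here is a minimal sketch of how traced n-gram fragments might be chained into a static rendering. The corpus representation, the greedy longest-match policy, and the abutting spacing rule are our own illustrative assumptions, not details from the paper.

```python
def concatenate_passphrase(corpus, phrase, max_n=3):
    """Greedy sketch of concatenative synthesis: cover the target phrase
    with the longest character n-grams available in the traced corpus,
    then chain their stroke points left to right.

    corpus : dict mapping character n-grams to lists of (x, y) points
             (strokes traced from the target user's offline samples)
    Returns a single point list, each fragment shifted horizontally so
    it starts where the previous fragment ended (hypothetical spacing).
    """
    points, x_off = [], 0.0
    i = 0
    while i < len(phrase):
        # find the longest n-gram in the corpus starting at position i
        for n in range(min(max_n, len(phrase) - i), 0, -1):
            gram = phrase[i:i + n]
            if gram in corpus:
                break
        else:
            raise KeyError(f"no fragment covers {phrase[i]!r}")
        fragment = corpus[gram]
        base = fragment[0][0]  # normalize fragment to start at x = 0
        points.extend((x_off + (x - base), y) for x, y in fragment)
        x_off = points[-1][0]
        i += n
    return points
```

A real attack would additionally use the inter-character spacing statistics gathered from same-style writers rather than simply abutting fragments, and would carry per-point timestamps through the concatenation.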

To aid in the generation of forgeries, we first attempt to derive a set of measures that can be used to infer the pen-tip velocities for a user's writing. The intention is that these measures will be useful in computing a model at the character level that also reflects the subtleties of the underlying strokes. To do so, we assume access to the online corpus of handwriting for users with the same writing style as the target. For each of the annotated samples we apply linear re-sampling to ensure the Euclidean distance between each point within a stroke is separated by d units.² For a sliding window, we compute four functions to aid in generating velocity profiles, namely straightness, the offset of the window within the stroke, direction, and the number of extrema within the window. To partition this set into groupings with similar characteristics, we apply k-means to find representative clusters with accompanying velocities. To infer pen-up time, we apply a similar concept except using a function of the distance between pen-up and pen-down points. This process is applied to all characters in the corpus. After using Concatenative-Synthesis [2, 21] to create a static rendering of a forgery, the adversary uses the recorded velocity profiles to infer the elapsed time between each point in the forgery. Each stroke of each character in the static rendering of the target passphrase p is processed in this fashion. The adversary slides a window over the stroke and, using a k-nearest-neighbors approach, determines which cluster the set of points is most similar to. Once found, this velocity is used as an estimate of the pen speed for the point at the center of the window.

² The one exception is that we preserve the end points of each stroke.

5. Performance of Human Forgers

This section presents the results for five evaluation metrics that examine forgeries generated by humans. Before we computed the FRR and the FAR, we removed the outliers that are inherent to biometric systems. For each user, we removed all samples that had more than δ = 3 features that fell outside k = 2 standard deviations from that user's mean feature value. The parameters δ and k were empirically derived. We also excluded samples from users (the so-called "Goats" [6]) who had more than 25% of their samples classified as outliers. Such users "Failed to Enroll" [22]; the FTE rate was approximately 8.7%. After combining this with outlier removal, we still used 79.2% of the original data set. To determine the FRR and the FAR, we use the biometric hash system described in Subsection 4.1. The FRR was computed as follows: we repeatedly randomly partitioned a user's samples into two groups, used the first group to build a template, and authenticated the samples in the second group against the template. To compute the FAR, we used all of the user's samples to generate a template and then authenticated the forgeries against this template. Our experiments were designed to illustrate the discrepancy in perceived security when considering traditional forgery paradigms versus a more stringent, but realistic, adversarial model. In particular, we assumed that, at a very minimum, a realistic adversary: 1. attacks victims who have a writing style that the forger has a natural ability to replicate, 2. has knowledge of how biometric authentication systems operate, and 3. has a vested interest in accessing the system, and therefore is willing to devote significant effort towards these ends. Figure 4 presents ROC curves for forgeries from impersonators with varying levels of knowledge. The plot denoted FAR-naïve depicts results for the traditional case of naïve forgeries widely used in the literature [15, 19, 30]. In these cases, the impersonation attempts simply reflect taking one user's natural rendering of phrase p as an impersonation attempt on the targeted writing p. This scheme makes no differentiation based on the forger's or the victim's style of writing, and so may include, for example, block writers "forging" cursive writers. Arguably, such forgeries may not do as well as the less standard (but more reasonable) naïve* classification (FAR-naïve*), where a user only attempts to reproduce samples from writers of a similar style.


Figure 4. Overall ROC curves for naïve, naïve*, static, dynamic, and trained human forgers.

The FAR-static plots represent the success rate of forgers who receive access to only a static rendering of the passphrase. By contrast, FAR-dynamic forgeries are produced after seeing (possibly many) real-time renderings of the image. One can envision this being a realistic threat if we assume a motivated adversary could capture the handwriting using a digital video camera or, more likely, may have access to data written electronically in another context. Lastly, FAR-trained presents the resulting success rate of forgeries derived under our forgery model, which corresponds to a more worthy opponent. Intuitively, one would expect that forgers with access to dynamic and/or static representations of the target writing should be able to produce better forgeries than those produced under the naïve* classification. This is not necessarily the case, as we see in Figure 4 that at some points the naïve* forgeries do better than the forgeries generated by forgers who have access to static and/or dynamic information. This is primarily due to the fact that the naïve* classification reflects users' normal writing (as there is really no forgery attempt here). The natural tendencies exhibited in their writings appear to produce better "forgeries" than those of static or dynamic forgers (beyond some point), who may suffer from unnatural writing characteristics as a result of focusing on the act of forging. One of the most striking results depicted in the figures is the significant discrepancy in the FAR between standard evaluation methodologies and that of the trained forgeries captured under our strengthened model. While it is tempting to directly compare the results under the new model to those under the more traditional metrics (i.e., by contrasting the FAR-trained error rate at the EER under one of the older models), such a comparison is not valid.
This is because the forgers under the new model were more knowledgeable with respect to the intricacies of handwriting verification and had performed style-targeted forgeries.


However, the correct comparison considers the EERs under the two models. For instance, the EER for this system under FAR-trained forgeries is approximately 20.6% at four error corrections. However, for the more traditional metrics, one would arrive at EERs of 7.9%, 6.0%, and 5.5% under evaluations of dynamic, static, and naïve forgeries, respectively. These results are indeed in line with the current state of the art [15, 19, 30]. Even worse, under the most widely used form of adversary considered in the literature (i.e., naïve), we see that the security of this system would be over-estimated by nearly 375%. As was also reported in our earlier paper [2], we saw significant improvement in the performance of our human forgers between Round II and Round III. This improvement resulted from less than two hours of training and effort, which seems quite modest in comparison to what a truly dedicated and/or skilled forger might be willing to invest.
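The 375% figure is, on our reading, simply the ratio of the two equal error rates:

```latex
\frac{\mathrm{EER}_{\mathrm{trained}}}{\mathrm{EER}_{\mathrm{naive}}}
  = \frac{20.6\%}{5.5\%} \approx 3.75
```

That is, the error rate achievable by a trained forger at the system's operating point is nearly 3.75 times what a naïve evaluation would suggest.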


Figure 5. ROC curves using the 24 features described by Vielhauer et al. [36].

6. Performance of Generative Attacks

³ The system that used the features presented by Vielhauer et al. had an FTE of ≈ 4.7%, whereas the system based on the strengthened feature set exhibited an FTE of ≈ 8.7%.

In our studies of generative attacks based on concatenative synthesis, as described in Subsection 4.3, we tested both the original Vielhauer scheme, which uses 24 features [35], and the best 37 online and offline features chosen from a much larger set of 145 features identified by surveying the literature. We used the 20 renderings of the five passphrases from the first two rounds of data collection (recall Subsection 4.2) to generate a biometric template (i.e., the "interval matrix" in [35]) for each user and passphrase. After omitting data from users who failed to enroll,³ we computed the False Reject Rate (FRR) by generating templates using 75% of a user's writing and attempting to authenticate the remaining 25%. The False Accept Rate (FAR) was computed using templates generated from the writings of other users for the passphrase in question. The reported results are averaged values across 25 random partitions of the data. In our original paper [1], we reported results based on four distinct False Accept Rates: the Static and Dynamic cases are computed using forgeries from a group of non-trained forgers given access to static and dynamic renderings of the target passphrase, respectively. The Skilled case reflects forgeries from our skilled forgers. Lastly, we report the FAR for the concatenative-synthesis approach with inferred temporal information.

Figure 5 gives the results when using the features described by Vielhauer et al. [36]. The Static and Dynamic forgeries exhibit an EER of 8.6% and 12.8% at five and four errors corrected, respectively. Not surprisingly, the impact of Skilled forgeries is dramatic, and results in an EER of 23.2% at two errors corrected. Finally, the EER for forgeries generated using concatenative synthesis with temporal inference is 18.6% at three errors corrected on average. This rate is much higher than both Static and Dynamic forgeries, which is particularly alarming given that we infer online information from general population statistics.

The performance against the new, stronger feature set is shown in Figure 6. The Static and Dynamic forgeries exhibit an EER of 6.8% and 8.2% at eight and seven errors corrected, respectively. The EER for Skilled forgeries is 20.1% at four errors corrected. The EER for forgeries generated using our generative approach is 17.2% at five errors corrected, which is close to the rate of skilled forgers. As expected, the choice of a strengthened feature set reduced the EERs in each case. Nonetheless, forgeries generated using this method performed almost as well as skilled forgers, and were much more successful than forgeries generated under traditional assumptions.
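The interval-matrix template can be paraphrased roughly as follows. The tolerance handling here is deliberately simplified relative to Vielhauer et al. [35], so treat this only as a sketch of the quantization idea: each feature gets an offset and interval length chosen so that all enrollment samples quantize to the same integer.

```python
import numpy as np

def interval_matrix(enroll, tol=0.1):
    """Per-feature quantization intervals from enrollment samples.

    enroll : (num_samples, num_features) array of feature vectors.
    Returns an offset and interval length per feature, widened by a
    tolerance so that every enrollment sample maps to the same integer.
    """
    enroll = np.asarray(enroll, dtype=float)
    lo = enroll.min(axis=0)
    hi = enroll.max(axis=0)
    length = (hi - lo) * (1 + 2 * tol) + 1e-9  # widen so enrollment fits
    offset = lo - (hi - lo) * tol
    return offset, length

def biometric_hash(sample, offset, length):
    """Quantize a fresh sample into an integer hash vector.

    A sample authenticates when its hash matches the enrolled hash,
    possibly after correcting errors in a few positions."""
    return np.floor((np.asarray(sample, dtype=float) - offset) / length).astype(int)
```

With wider tolerances more impostor samples fall into the enrolled intervals, which is why we instead use tighter intervals plus explicit error correction, as described in Subsection 4.1.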


Figure 6. ROC curves using the 37 most secure features out of a set of 145.

These results are of significant practical value as we assume access to only a limited number of offline samples of the targeted user's writings (on average, 6.9 samples of length 1.83 characters for each forgery attempt), each written in a context outside its use for security. Again, recall that the individual samples have simply been traced (to infer stroke order) as a determined adversary would likely do. Therefore, stroke order, direction, and other user-specific spatial idiosyncrasies could be incorrectly reproduced. Nonetheless, our attack outperforms average forgers, even those with access to real-time representations of the passphrase as it was originally rendered. It is interesting to note that the handwriting of several users was especially susceptible to our attack (e.g., the targeted writer in Figure 2), whereas other users produced handwriting that was quite resilient to our forgeries (e.g., the targeted writer in Figure 3). This multi-modal behavior might be explained by the fact that our final velocities represent "average" velocity profiles. It seems likely that those writers who are close to the average are more susceptible than those with unique writing habits. Nevertheless, in the context of cryptographic key generation or authentication, the ability to accurately forge any writer can have significant ramifications, as an adversary often only needs to compromise the weakest link in the system.
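The empirical feature ranking described in Subsection 4.1 (selecting the 37 most secure of 145 candidate features) can be sketched as follows, under our reading that "smallest difference" means the legitimate-user failure rate minus the forger failure rate, so that the most negative scores mark features genuine users reproduce reliably but forgers tend to miss.

```python
import numpy as np

def select_features(legit_fail, forger_fail, k=37):
    """Rank features by (legitimate-user failure rate) minus
    (forger failure rate) and keep the k smallest scores.

    legit_fail, forger_fail : per-feature failure proportions,
    i.e., how often each population fails to reproduce the feature
    within tolerance. Returns the indices of the selected features.
    """
    legit_fail = np.asarray(legit_fail, dtype=float)
    forger_fail = np.asarray(forger_fail, dtype=float)
    score = legit_fail - forger_fail  # lower = more secure
    return np.argsort(score)[:k]
```

Note that the forger failure rates, and hence the selected feature set, depend on which class of forger supplied the samples; a set tuned against naïve forgeries may rank features differently than one tuned against trained or generative forgeries.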

Figure 7. Screen snapshot of our tool for collecting human judgments of authenticity.

7 Human Evaluation of Forgery Attempts

Beyond judgment by the algorithmic methods we have described, a natural question to ponder is whether the forgery attempts by our test subjects would fool a human expert. In pattern recognition and machine vision research, it is generally accepted that human perception provides the “gold standard” by which the so-called ground truth is defined. In this case, the truth is already known: either a sample is genuine or it is a forgery attempt (even if the forgery is a perfect reproduction of the original). Still, finding an alternate way of confirming that our subjects were producing good-quality forgeries is clearly useful.

To measure the effectiveness of the forgeries we collected, we conducted a relatively simple experiment: users were shown two handwritten passphrases and asked whether they were authored by the same individual. Each subject was shown 25 pairs of static handwriting samples and another 25 pairs of dynamic samples for which an unlimited number of real-time replays of the writing was allowed. Half of the time the two samples originated from the same writer, while for the other half, one of the samples was a forgery attempt based on knowledge of either the static or dynamic target, as appropriate. Figure 7 shows a screen snapshot of the data collection tool we developed to support the experiment. A total of 25 users participated in this study, with somewhat more than half not having been involved in our original data collection activities and hence new to the work we were doing.

The somewhat surprising results of this preliminary experiment are shown in Figure 8, grouped into three criteria (from left to right):

1. human judge performance (i.e., success is the ability to detect a forgery),
2. forger performance (i.e., success is the ability to have one’s forgery fool a human judge), and
3. target performance (i.e., success is the ability to have one’s handwriting resist a forgery attempt).

Our original goal had been to show that at least some of our test subjects were capable of producing good-quality forgeries. While this appears to be true, we also discovered that humans are less capable judges of handwritten forgeries than we had anticipated: the overall average accuracy for our 25 subjects was only 79%, which is much lower than for many other pattern recognition tasks where humans tend to excel. Moreover, this same level of performance held for both static and dynamic samples, with the latter only slightly easier for the test subjects to distinguish (82% vs. 75%). As can be seen in the leftmost chart in the figure, performance of the human judges on this task ranged from 66% to 92%.

Turning to the performance of our forgers, as shown in the center chart in Figure 8, we note that overall they were able to fool the judges in 14% of their attempts. There is a wide range in abilities, however, with seven forgers in particular succeeding at least a quarter of the time. Interestingly, some subjects were noticeably more effective at producing static forgeries, while others were better at dynamic forgeries. Lastly, looking at forgery targets in the rightmost chart in the figure, we note that ten users were never successfully forged, while the nine “weakest” users saw attempts against their handwriting pass human judgment 25% of the time or more.

Our general conclusion as a result of this exercise is that some of our forgers appear reasonably skilled, but that humans are not perfect judges of forgery quality; hence, better and more accurate means for automating the evaluation of biometric performance and its resilience to impersonation attacks are required.
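The three success criteria above reduce to straightforward bookkeeping over the pair-judgment trials. The sketch below shows that bookkeeping for the judge and forger criteria; the trial data is invented for illustration and is not our experimental results.

```python
# Sketch of the scoring behind the human-evaluation experiment: each
# trial records who judged, whose handwriting was targeted, who forged
# (None for a genuine pair), and whether the judge said "same writer".

def score_trials(trials):
    """trials: list of (judge, target, forger_or_None, judge_said_same).
    Returns per-judge accuracy and per-forger success rate."""
    judge_hits, judge_total = {}, {}
    forger_wins, forger_total = {}, {}
    for judge, target, forger, said_same in trials:
        genuine = forger is None
        # The judge is correct if they say "same" on genuine pairs
        # and "different" on forgery pairs.
        correct = said_same if genuine else not said_same
        judge_total[judge] = judge_total.get(judge, 0) + 1
        judge_hits[judge] = judge_hits.get(judge, 0) + int(correct)
        if not genuine:  # a forger "wins" when the judge says "same"
            forger_total[forger] = forger_total.get(forger, 0) + 1
            forger_wins[forger] = forger_wins.get(forger, 0) + int(said_same)
    judge_acc = {j: judge_hits[j] / judge_total[j] for j in judge_total}
    forger_rate = {f: forger_wins[f] / forger_total[f] for f in forger_total}
    return judge_acc, forger_rate

trials = [
    ("J1", "U1", None, True),   # genuine pair, judged same: correct
    ("J1", "U1", "F1", True),   # forgery fooled the judge
    ("J1", "U2", "F1", False),  # forgery detected
    ("J1", "U2", None, False),  # genuine pair, judged different: wrong
]
acc, rate = score_trials(trials)
print(acc["J1"], rate["F1"])  # -> 0.5 0.5
```

The target criterion is symmetric: group the forgery trials by `target` instead of `forger` and count how often the judge said "different".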

8 Related Work

Figure 8. Performance of human judges in detecting forgeries.

Particularly relevant to the research we are pursuing are a series of recent papers that use online handwriting for the computation of cryptographic keys. Feng and Wah, for example, describe a scheme for generating keys from handwritten signatures using an initial filtering based on dynamic time warping, followed by the extraction of 43 features, yielding an average key length of 40 bits [10]. The authors claim an Equal Error Rate (EER) of 8% and mention that their test database contains forgeries, but unfortunately provide no details on how these were produced or their quality.

Kuan, et al. present a method to generate cryptographic keys from online signatures [16]. Their approach was evaluated on the SVC dataset [38] and achieved EERs of between 6% and 14% given access to a stolen token.

Vielhauer, et al. present a biometric hash based on 24 integer-valued features extracted from an online handwriting signal [36]. Fourteen of the features are global, while the remaining ten are derived from segmented portions of the input, obtained by partitioning the bounding box surrounding the ink into five equal-sized regions in the x- and y-dimensions. Forgeries based on an offline image of the target signature were collected. The authors report achieving a False Accept Rate (FAR) of 0% at a False Reject Rate (FRR) of 7% in their studies, but only 10 subjects were used in the testing. A later paper discusses feature correlation and stability for a much larger number of features; however, the same number of test subjects is used [35]. This follow-on work differs in that the new system is more robust to natural variation, but the authors do not report a FAR.

As can be seen, performance figures (i.e., EERs) are difficult to compare directly, as the sample sizes are often small and test conditions quite dissimilar [8]. Furthermore, even when forgers are employed in such experiments, there is usually no indication of their proficiency.
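To give a feel for the hashing approach, here is a much-simplified sketch in the spirit of Vielhauer, et al.’s interval matrix [36]: for each feature, enrollment records an interval covering the writer’s observed values, and verification quantizes a fresh feature value by that interval’s width, so any value inside the enrolled interval maps to the same integer. This is our paraphrase with integer features and a fixed +1 tolerance, not the published algorithm.

```python
# Simplified interval-matrix style biometric hash (our paraphrase of the
# idea in Vielhauer et al. [36], not their code): values inside the
# enrolled interval for a feature all quantize to the same integer.

def enroll(samples):
    """samples: list of integer feature vectors from one writer.
    Returns the interval matrix: one (offset, width) pair per feature."""
    matrix = []
    for feature in zip(*samples):
        lo, hi = min(feature), max(feature)
        width = (hi - lo) + 1       # +1 is a crude tolerance margin
        offset = lo % width         # aligns the enrolled interval to a bin
        matrix.append((offset, width))
    return matrix

def hash_vector(matrix, sample):
    """Quantize each feature value by its enrolled interval."""
    return [(v - off) // width for (off, width), v in zip(matrix, sample)]

# Three hypothetical enrollment renderings with two features each.
enrolled = enroll([[10, 40], [12, 38], [11, 42]])
h = hash_vector(enrolled, [11, 40])
print(h)  # -> [3, 7]
```

A fresh genuine sample such as `[12, 39]` falls inside both enrolled intervals and produces the same hash, while values outside the intervals land in different bins; the resulting integer vector can then feed a key-derivation step.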
We note that the production of “skilled” forgeries for the SVC dataset [38] resembles the methodology we have used in our own studies, although the definition of “skilled” in that work is closer to “knowledgeable” than it is to “talented.” For the competition, a database incorporating 100 sets of signatures was created, with each set comprising 20 genuine signatures and 20 forgeries penned by at least four subjects. In the case of attempted forgeries, users could replay the dynamic sequence of the targeted handwriting on the screen of a software viewer before attempting to forge it. However, there was no attempt to distinguish effective vs. ineffective forgers (or hard-to-forge vs. easy-to-forge writers), or to measure the performance of forgeries based only on static (offline) data.

The first serious attempt we are aware of to provide a tool for training forgers to explore the limits of their abilities is the work by Zöbisch and Vielhauer [37]. In a small preliminary study involving four users, they found that showing an image of the target signature increased false accepts, and showing a dynamic replay doubled the susceptibility to forgeries yet again. However, since the verification algorithm used was simplistic and they do not report false reject rates, it is difficult to draw more general conclusions.

Lopresti and Raim reported the results of a simple concatenative-style attack in a small-scale study involving two test subjects writing four passwords 20 or more times each [21]. A parallel corpus of unrelated writing was also collected and labeled at the unigram level. This corpus was then used to attempt to break the original biometric hash proposed by Vielhauer, et al. [36]. After a feature-space search of at most one minute on a Pentium-class PC, the attack was found to be successful 49% of the time.

More recently, the authors of the present paper conducted a detailed analysis which examined the skill levels of human forgers along with their ability to improve via training [2]. This earlier work also included a more comprehensive study involving a particular kind of concatenative attack based on the rather strong assumption that the adversary has access to online samples of the targeted user’s handwriting. At that time, we found that our most successful forgers were able to achieve significantly higher success rates than the average “random” test subject and, furthermore, that they were able to increase their chances of breaking the biometric hash with a reasonable amount of practice. We also noted that the generative attack was even more successful than the human forgers. Our conclusion, then, was that such worst-case scenarios deserve much more attention than they have been receiving in the evaluation of biometric systems.
A slightly later paper [1] built on this earlier work by relaxing a key assumption: we allowed the attacker access only to offline samples of the user’s handwriting, as might be obtained through discarded correspondence or scraps of paper, for example. The temporal characteristics needed to recreate the writing accurately enough to defeat the biometric system were computed from general population statistics.

Other researchers have also attempted to infer dynamics from a static image of handwriting, or to synthesize realistic-looking writing using a model. The former problem arises in offline instances of handwriting recognition and signature verification, where one promising approach has been to try to extract both offline and online features from a scanned image of the writing (e.g., [4, 7, 17, 18]), although this work has largely focused on recovering stroke order rather than velocities or accelerations.

From the broadest perspective, our research relates to the topic of generative models for handwriting, which has received much attention over the years (e.g., [13, 28]). This work is often oriented toward automating the production of training data to improve recognition algorithms, but it is clearly relevant to the attack scenarios we have in mind. Particularly intriguing is recent research that combines the development of generative models with techniques for learning the dynamics of handwriting (e.g., [14, 34]). Our plans include studying the threat posed by such models in the context of biometric security.

9 Conclusions

Fundamental computer security mechanisms are predicated on the intended user generating an input that attackers are unable to reproduce. The security of biometric-based technologies hinges on this perceived inability on the part of the attacker. Even so, the evaluation of biometric technologies is usually conducted under fairly weak adversarial conditions. Unfortunately, this practice may significantly underestimate the real risk of accepting forgeries as authentic.

To address this issue, we described our work on studying the natural skill levels of human forgers, as well as the development of an automated technique for producing generative forgeries that can assist in the evaluation of biometric systems. We showed that our generative approach is nearly as effective as forgeries produced by skilled human forgers, and greatly exceeds the average case. This attack was based on only a small number of offline samples of the user’s writing in concert with general population statistics.

Our hope is that this work will serve as a solid foundation for the work of other researchers and practitioners, particularly as it pertains to evaluating biometric authentication or key-generation systems. Admittedly, such evaluations are difficult to undertake due to their reliance on recruiting large numbers of human subjects. In that regard, the generative approach presented herein should reduce the difficulty of this task and allow for more rigorous evaluations of biometric security. Additionally, there is much future work related to the topics presented here, including several directions for incorporating more sophisticated generative algorithms into the evaluation paradigm.

Our conclusions concerning the threat we have described apply to any attempt to extract distinguishing features from a user’s normal style of handwriting. As it stands, however, they do not directly apply to more stylized writing such as a true signature.
For the SVC competition [38], users were asked to create a “new” signature and “practice” it. It would be interesting to know whether such pseudo-signatures might provide a solution to this problem. We hope to explore such questions in the coming months.

In a broader sense, our work demonstrates that traditional approaches to evaluating biometric security, which are most often based on an average-case analysis, are insufficient to characterize the threats posed by determined adversaries who possess some degree of talent and/or automated attacks that exploit generative models of human behavior. Increased confidence will come only through more careful consideration of such worst-case scenarios.

10 Acknowledgments

We express our gratitude to Dishant Patel, Carolyn Buckley, Jay Zarfoss and Charles Wright for their assistance. We also thank the numerous people who participated in this study. This project is supported by NSF grant CNS-0430338.

References

[1] L. Ballard, D. Lopresti, and F. Monrose. Evaluating the security of handwriting biometrics. In Proceedings of the Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule, France, October 2006. To appear.
[2] L. Ballard, F. Monrose, and D. Lopresti. Biometric authentication revisited: Understanding the impact of wolves in sheep’s clothing. In Proceedings of the Fifteenth Annual USENIX Security Symposium, pages 29–41, Vancouver, BC, Canada, August 2006.
[3] The Biometrics Consortium. http://www.biometrics.org/.
[4] G. Boccignone, A. Chianese, L. P. Cordella, and A. Marcelli. Recovering dynamic information from static handwriting. Pattern Recognition, 26(3):409–418, 1993.
[5] Y.-J. Chang, W. Zhung, and T. Chen. Biometrics-based cryptographic key generation. In Proceedings of the International Conference on Multimedia and Expo, volume 3, pages 2203–2206, 2004.
[6] G. R. Doddington, W. Liggett, A. F. Martin, M. Przybocki, and D. A. Reynolds. Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. In Proceedings of the Fifth International Conference on Spoken Language Processing, November 1998.
[7] D. S. Doermann and A. Rosenfeld. Recovery of temporal information from static images of handwriting. International Journal of Computer Vision, 15(1-2):143–164, 1995.
[8] S. J. Elliott. Development of a biometric testing protocol for dynamic signature verification. In Proceedings of the International Conference on Automation, Robotics, and Computer Vision, Singapore, 2002.
[9] M. C. Fairhurst. Signature verification revisited: Promoting practical exploitation of biometric technology. Electronics & Communication Engineering Journal, pages 273–280, December 1997.
[10] H. Feng and C. Wah. Private key generation from on-line handwritten signatures. Information Management and Computer Security, 10(4):159–164, 2002.
[11] A. Goh and D. C. L. Ngo. Computation of cryptographic keys from face biometrics. In Proceedings of Communications and Multimedia Security, volume 2828 of Lecture Notes in Computer Science, pages 1–13. Springer-Verlag, 2003.
[12] R. M. Guest. The repeatability of signatures. In Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition, pages 492–497, October 2004.
[13] I. Guyon. Handwriting synthesis from handwritten glyphs. In Proceedings of the Fifth International Workshop on Frontiers of Handwriting Recognition, Colchester, England, 1996.
[14] G. Hinton and V. Nair. Inferring motor programs from images of handwritten digits. In Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 2006.
[15] A. K. Jain, F. D. Griess, and S. D. Connell. On-line signature verification. Pattern Recognition, 35(12):2963–2972, 2002.

[16] Y. W. Kuan, A. Goh, D. Ngo, and A. Teoh. Cryptographic keys from dynamic hand-signatures with biometric security preservation and replaceability. In Proceedings of the Fourth IEEE Workshop on Automatic Identification Advanced Technologies, pages 27–32, Los Alamitos, CA, 2005. IEEE Computer Society.
[17] P. M. Lallican, C. Viard-Gaudin, and S. Knerr. From offline to on-line handwriting recognition. In Proceedings of the Seventh International Workshop on Frontiers in Handwriting Recognition, pages 303–312, September 2000.
[18] K. K. Lau, P. C. Yuen, and Y. Y. Tang. Recovery of writing sequence of static images of handwriting using UWM. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, pages 1123–1128, August 2003.
[19] F. Leclerc and R. Plamondon. Automatic signature verification: The state of the art 1989–1993. International Journal of Pattern Recognition and Artificial Intelligence, 8(3):643–660, 1994.
[20] J. Lindberg and M. Blomberg. Vulnerability in speaker verification – a study of technical impostor techniques. In Proceedings of the European Conference on Speech Communication and Technology, volume 3, pages 1211–1214, Budapest, Hungary, September 1999.
[21] D. P. Lopresti and J. D. Raim. The effectiveness of generative attacks on an online handwriting biometric. In Proceedings of the International Conference on Audio- and Video-based Biometric Person Authentication, pages 1090–1099, Hilton Rye Town, NY, USA, 2005.
[22] A. J. Mansfield and J. L. Wayman. Best practices in testing and reporting performance of biometric devices. Technical Report NPL Report CMSC 14/02, Centre for Mathematics and Scientific Computing, National Physical Laboratory, August 2002.
[23] T. Masuko, T. Hitotsumatsu, K. Tokuda, and T. Kobayashi. On the security of HMM-based speaker verification systems against imposture using synthetic speech. In Proceedings of the European Conference on Speech Communication and Technology, volume 3, pages 1223–1226, Budapest, Hungary, September 1999.
[24] T. Masuko, K. Tokuda, and T. Kobayashi. Imposture using synthetic speech against speaker verification based on spectrum and pitch. In Proceedings of the International Conference on Spoken Language Processing, volume 3, pages 302–305, Beijing, China, October 2000.
[25] F. Monrose, M. Reiter, Q. Li, D. Lopresti, and C. Shih. Towards speech-generated cryptographic keys on resource-constrained devices. In Proceedings of the Eleventh USENIX Security Symposium, pages 283–296, 2002.
[26] F. Monrose, M. K. Reiter, Q. Li, and S. Wetzel. Cryptographic key generation from voice (extended abstract). In Proceedings of the 2001 IEEE Symposium on Security and Privacy, pages 12–25, May 2001.
[27] F. Monrose, M. K. Reiter, and S. Wetzel. Password hardening based on keystroke dynamics. International Journal of Information Security, 1(2):69–83, February 2002.
[28] R. Plamondon. A delta-lognormal model for handwriting generation. In Proceedings of the Seventh Biennial Conference of the International Graphonomics Society, pages 126–127, London, Ontario, Canada, 1995.
[29] R. Plamondon and G. Lorette. Automatic signature verification and writer identification – the state of the art. Pattern Recognition, 22(2):107–131, 1989.
[30] R. Plamondon and S. N. Srihari. On-line and off-line handwriting recognition: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):63–84, 2000.

[31] K. Price. Bibliography: On-line signatures, March 2006. http://iris.usc.edu/Vision-Notes/bibliography/char1011.html.
[32] C. Soutar, D. Roberge, A. Stoianov, R. Gilroy, and B. V. Kumar. Biometric Encryption™ using image processing. In Optical Security and Counterfeit Deterrence Techniques II, volume 3314, pages 178–188. IS&T/SPIE, 1998.
[33] U. Uludag and A. K. Jain. Fingerprint minutiae attack system. In The Biometric Consortium Conference, September 2004.
[34] T. Varga, D. Kilchhofer, and H. Bunke. Template-based synthetic handwriting generation for the training of recognition systems. In Proceedings of the 12th Conference of the International Graphonomics Society, pages 206–211, June 2005.
[35] C. Vielhauer and R. Steinmetz. Handwriting: Feature correlation analysis for biometric hashes. EURASIP Journal on Applied Signal Processing, 4:542–558, 2004.
[36] C. Vielhauer, R. Steinmetz, and A. Mayerhofer. Biometric hash based on statistical features of online signatures. In Proceedings of the Sixteenth International Conference on Pattern Recognition, volume 1, pages 123–126, 2002.
[37] C. Vielhauer and F. Zöbisch. A test tool to support brute-force online and offline signature forgery tests on mobile devices. In Proceedings of the International Conference on Multimedia and Expo, volume 3, pages 225–228, 2003.
[38] D.-Y. Yeung, H. Chang, Y. Xiong, S. George, R. Kashi, T. Matsumoto, and G. Rigoll. SVC2004: First international signature verification competition. In Proceedings of the International Conference on Biometric Authentication (ICBA), Hong Kong, July 2004.
