2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools
A Multicore Embedded Processor For Fingerprint Recognition
G. Danese, M. Giachero, F. Leporati, N. Nazzicari
Dip. di Informatica e Sistemistica, University of Pavia, via Ferrata 1, Italy
Tel. +39 0382 985350, Fax +39 0382 985373, e-mail: [email protected]

efficient and robust algorithm on an Altera Stratix II FPGA device that includes the Nios II processor, outperforming modern general purpose processors in computations on 2D input images. The use of this processing platform makes the system small, portable and fast, improving the performance of the demanding processing required by an automatic system for safety applications. This work represents a significant evolution of the paper [25], in which we implemented the Phase Only Correlation function on a single core CPU.
Abstract - Biometric identification systems exploit automated recognition methods based on physiological or behavioural characteristics of people. Among these, fingerprints are very affordable biometric identifiers. Building embedded systems that perform real-time authentication requires a fast computational unit for image processing. In this paper we propose a parallel architecture that efficiently implements the computationally demanding core of a matching algorithm based on Band Limited Phase Only spatial Correlation (BLPOC), executed by two concurrent computational units implemented on an Altera Stratix II family FPGA. The performance of the resulting device is competitive with that of similar hardware solutions described in the literature and exceeds the processing capabilities of general purpose PC processors.
2. STATE OF THE ART
Fingerprint authentication and the issues related to large database management are well-known topics [3]. The development of a fingerprint verification system on a low-cost embedded platform is an open issue, and FPGA technology appears to be a good candidate to achieve high performance/cost ratios [4, 5]. Most research projects focus on minutiae-related algorithms, showing that very good error rates can be reached at the cost of long processing times, due to both the sample preparation (enrollment) and the actual matching phase [6-10]. The enrollment time is a cost that is (typically) paid only once per new fingerprint, because already enrolled data can be stored in the database. High enrollment times are due to minutiae extraction, which is a highly demanding operation. Furthermore, the matching time is multiplied by the number of fingerprints in the database, so it must be as small as possible, especially for huge databases. Correlation-based algorithms have also been proposed [1, 11-12]. Efficient implementation of correlation is a very well-known problem, and it is possible to take advantage of the many good results available in the literature [13-14]. To the best of our knowledge, nobody has yet proposed an efficient correlation-based fingerprint matching architecture.
Keywords: FPGA, Application Specific processors, Biometrics
1. INTRODUCTION
A Biometric System is essentially a pattern recognition system that identifies a person by a specific physiological and/or behavioural trait (biometric identifier). At present, several biometric technologies have been developed and are employed in a variety of applications. Among them, fingerprints are one of the most commonly used, since they provide a good trade-off among the properties a recognition system should have. A typical Automatic Fingerprint Identification System (AFIS) must be very small, while at the same time a significant number of fingerprint template images need to be quickly searched and compared. For these reasons, dedicated devices implementing suitable recognition algorithms are needed to satisfy these requirements. The performance of matching algorithms is highly dependent on the fingerprint representation and on the image quality. Most of these methods are minutiae based and thus highly influenced by fingertip surface conditions. To avoid such problems, the matching algorithm we worked on, proposed by K. Ito et al. in [1, 2], evaluates the correlation between the input and template images, so as to achieve robust matching even for low-quality fingerprints. In this paper we present an FPGA-based system for fingerprint matching; the algorithm we implemented is based on the computation of the Band Limited Phase Only Correlation (BLPOC) function. Our architecture implements these

978-0-7695-4171-6/10 $26.00 © 2010 IEEE DOI 10.1109/DSD.2010.101
3. THE BLPOC ALGORITHM
The algorithm we chose to implement is known as Band Limited Phase-Only Correlation (BLPOC); it was proposed in [1, 2] and consists of the following processing steps:
1. the two fingerprints to compare are enhanced (to improve the results of the following steps);
2. one of the fingerprints is rotated by several angles, thus generating a number of fingerprints to compare the other with;
3. the enhanced fingerprints are transformed using the two-dimensional Discrete Fourier Transform (DFT);
4. the high-frequency components are discarded, so as to keep only the signal at frequencies compatible with the physiological characteristics of fingertips;
5. each sample is replaced with its phase, while the modulus is discarded;
6. for each comparison, a new complex signal is constructed where each sample has modulus 1 and a phase equal to the difference between the phases of the corresponding input signals;
7. the new signals (which are real) are inverse transformed;
8. the peak modulus present in the resulting signals (i.e. the largest of the peaks) is used as the matching score;
9. the score is compared to a threshold, which determines whether the two fingerprints are believed to belong to the same fingertip or not.

Fingerprint core detection and alignment is a typical problem in the fingerprint matching field. The algorithm described, as is, does not address the alignment of the fingerprints, and in the original papers [1, 2] core alignment is cited as a known issue. However, the following expression demonstrates that the POC algorithm is not influenced by translational displacement (the final result does not depend on k). In this formula, H and G are the 2D Fourier Transforms of the input image h and its template g, and φ is H/G divided by its norm.
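A one-dimensional sketch of this translation-invariance argument is given here under stated assumptions: the template is taken to be a circularly shifted copy of the input, g(n) = h((n - k) mod N), and H(u) is nonzero for every u. The index names n and u, the length N and the symbol r for the correlation are our notation and are not taken from the original expression.

```latex
\begin{align*}
G(u) &= H(u)\, e^{-j 2\pi u k / N}
  && \text{(shift theorem, with } g(n) = h((n-k) \bmod N)\text{)} \\
\varphi(u) &= \frac{H(u)/G(u)}{\lvert H(u)/G(u) \rvert} = e^{\,j 2\pi u k / N} \\
r(n) &= \frac{1}{N} \sum_{u=0}^{N-1} \varphi(u)\, e^{\,j 2\pi u n / N}
      = \begin{cases} 1 & \text{if } n \equiv -k \pmod{N} \\ 0 & \text{otherwise} \end{cases}
\end{align*}
```

Under these assumptions the displacement k only moves the position of the peak; its height, which is what the matching score measures, stays equal to 1.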
Fig. 1. A single core architecture memory interface
while the remaining parts make up the “matching” phase. The templates are then stored in their enrolled form (i.e. the database contains the results of the enrollments, not the fingerprints themselves), and the input image is enrolled only once. The time-critical part of the algorithm is therefore the matching, since it is the only part that is executed many times.

3.1. Enrollment
The enrollment phase consists of steps 1, 2, 3, 4 and 5 of the algorithm. Step 2 can in principle be performed on either the input or the template images, resulting in a space vs speed trade-off (rotating the templates means that each template fingerprint takes more space, but the input enrollment becomes significantly faster). When defining the image enhancement (step 1), we evaluated the performance of several filters that are reasonably cheap to implement in hardware. These include background elimination and contrast enhancement, both with static and adaptive strategies. The most rewarding filter turned out to be the adaptive elimination of the background, which cleared (blackened) 50% of the pixels. For the template rotation (step 2) we computed rotations from -16 to +15 degrees in steps of 1 degree. These numbers were initially derived from the original BLPOC articles [1, 2], and have been validated experimentally. The rotation algorithm itself is a very simple one based on integer pixels (no interpolation), which is both fast and simple to implement in hardware. Step 3 has been implemented as a sequence of one-dimensional transforms. The arithmetic used in this step is integer, as space and performance restrictions ruled out a floating point DFT. In step 4 we tested several low-pass filters and chose to reduce the signal from 256x256 to 64x64 samples, as this bandwidth gave the best matching results in terms of FAR/FRR and EER.
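The integer-pixel rotation of step 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of rotating about the image centre, the inverse-mapping strategy and the black fill for uncovered pixels are all assumptions made for the example. Only the angle range (-16 to +15 degrees in steps of 1 degree) and the absence of interpolation come from the text.

```python
import math


def rotate_nearest(img, degrees):
    """Rotate a square image about its centre, copying whole pixels (no interpolation)."""
    n = len(img)
    centre = (n - 1) / 2.0
    angle = math.radians(degrees)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[0] * n for _ in range(n)]  # pixels with no source stay black (assumption)
    for y in range(n):
        for x in range(n):
            # Inverse mapping: find the source pixel that lands on (x, y).
            dx, dy = x - centre, y - centre
            sx = int(round(cos_a * dx + sin_a * dy + centre))
            sy = int(round(-sin_a * dx + cos_a * dy + centre))
            if 0 <= sx < n and 0 <= sy < n:
                out[y][x] = img[sy][sx]
    return out


def rotated_variants(img, first=-16, last=15, step=1):
    """Return {angle: rotated image} for every angle in the enrollment range."""
    return {deg: rotate_nearest(img, deg) for deg in range(first, last + 1, step)}
```

With the default range this produces 32 variants per fingerprint, which is where the space cost of rotating the templates rather than the input comes from.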
For the sake of simplicity, the previous expression is limited to the one-dimensional case, but it is easily extended to two dimensions if needed. The BLPOC algorithm is not influenced by translational displacement for the same reasons. In many applications, a single fingerprint has to be compared to many reference ones stored in a database. This single fingerprint is often referred to as the “input” fingerprint, while the database is said to be composed of “template” fingerprints. Since both the input and the templates are used for many comparisons (the input image is compared to all template ones, while every template is compared to every input), it makes sense to perform in advance all those parts of the algorithm that do not need a second fingerprint. These parts of the algorithm form the “enrollment” phase,
4. HARDWARE IMPLEMENTATION
We decided to implement the matching part of the algorithm in hardware: it is the most critical part, and the execution time saved is multiplied by the number of fingerprints in the database. The enrollment phase could also easily be implemented in hardware if needed, but we decided to use a software implementation to save hardware resources, which can be more appropriately allocated to additional matching units. The main features of our hardware implementation are:
• stream2mem DMA configuration: a data structure, stored as a linked list element, containing the addresses where the matching scores are to be written.

Fig. 1 shows memory connections and data flows. We decided to dedicate a RAM to the fingerprint database, which is critical and read-intensive. DMA configurations and matching scores are much less critical: the system makes 18 accesses to this memory for every 4096 accesses to the database RAM, giving an effective utilisation of about 0.4%. They can therefore share a single RAM without compromising processing throughput and, in multi-core architectures, this RAM can be shared by several cores. During the set-up phase the central processor transmits the input image, writes the DMA configuration data and generates the start signal. After that, the processor is free to perform other operations, and only needs to check periodically whether processing has ended. Fig. 2 shows memory connections and data flows in the multi-core configuration. As long as different RAMs are available to contain the fingerprint database, the system scales very well: the only shared component is the configuration/result RAM, which reaches a significant utilisation only with at least 100 cores, and which can easily be split if it becomes the system bottleneck. Central processor commands and the enrolled input image are broadcast to every core, thus avoiding a bottleneck.
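The utilisation figures above follow from simple arithmetic, sketched here in Python. The linear model is our assumption, not a statement from the paper: every core is taken to issue 18 shared-RAM accesses per 4096 database accesses, and the shared RAM is taken to serve accesses at the same rate as a database RAM. The function names are hypothetical.

```python
def shared_ram_utilisation(cores, shared_accesses=18, database_accesses=4096):
    """Fraction of time the shared configuration/result RAM is busy.

    Assumption: the load of the cores adds up linearly and the shared RAM
    serves accesses at the same rate as each database RAM.
    """
    return cores * shared_accesses / database_accesses


def cores_before_saturation(shared_accesses=18, database_accesses=4096):
    """Largest core count for which the modelled utilisation stays below 100%."""
    return database_accesses // shared_accesses


one_core = shared_ram_utilisation(1)         # 18 / 4096, about 0.44%
hundred_cores = shared_ram_utilisation(100)  # about 44%, i.e. "significant"
```

Under this model a single core keeps the shared RAM busy for about 0.44% of the time, 100 cores for about 44%, and saturation would be reached only beyond 227 cores.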
• independence: after an initial set-up, the architecture is able to perform all the processing (that is, producing the matching scores) without external assistance;
• high throughput: images are read from the database back-to-back, without pauses. This guarantees the highest possible throughput and, after an initial latency, matching scores are produced at the same rate at which images from the database are loaded into the architecture; in this sense our architecture behaves as a pipeline;
• modularity: it is possible to instantiate several processing cores, each one computing part of the database matching scores.

4.1. Single-core and multi-core architectures
Our architecture uses an instance of Nios II [15], a general purpose RISC microprocessor developed by Altera, to coordinate several processing cores. Each core is able to store the input image, load several images from the database, compute the matching scores and store them back to memory. To do so, each core has two different configurable DMAs: one (named mem2stream DMA) loads data from the fingerprint database, the other (named stream2mem DMA) saves the results. There are four different types of data to be read and written:
• database fingerprints;
• matching scores;
• mem2stream DMA configuration: a data structure, stored as a linked list element, containing the address and size of each enrolled fingerprint;
4.2. Matching algorithm implementation
The hardware implementation of the matching phase of the algorithm is shown in Fig. 3. The processing chain is composed of modules communicating through Altera’s Avalon Streaming Interfaces [16]: each module is a sink for the previous one and a source for the following one. The back-pressure mechanism allows a busy module to pull down its sink_ready signal, thus informing its source that it cannot accept new data at the moment. In this way, if any problem appears, the whole chain can be stalled, avoiding data loss. The only module that can originate a stall is the final DMA, which may have to wait to gain write access to the configuration/result RAM. If this happens, the slow-down FIFO stores incoming data and, when full, pulls down its sink_ready signal, stalling the chain. This situation seldom (if ever) occurs, because of the low utilisation of the configuration/result RAM. None of the other modules can spontaneously originate a stall. Each module, as part of a streaming processing chain, receives data from the previous module and, after processing them, transmits new data to the next one. Their specific functions are:
Fig. 2. A multicore architecture memory interface
• speed-up and slow-down FIFOs separate two clock domains: the external, slow clock domain and the fast clock domain used in the computational core. Although data throughput is tied to the DMA (and memory) frequency, a higher internal clock reduces the initial latency. Since latency could be high in a long processing chain, a higher clock appears justified and useful;
• phase difference stores the input image transmitted by the Nios II processor and computes the phase difference between the stored image and each new fingerprint coming from the database;
• two iDFT modules and a matrix transpose module compute a transposed 2D inverse DFT;
• data comes from the iDFT in Cartesian form, but squared moduli are needed to find the correlation peak;
• peak finder looks for the maximum value, which is the correlation peak for fingerprints from the same finger, or a random (lower) value for uncorrelated images;
• the Nios II processor is a general purpose RISC microprocessor with an associated C compiler. A routine sets up the DMA configurations during the set-up phase and reads the results after the processing.

4.3. Software implementation
Together with the hardware implementation, a corresponding software one has been developed. This has been done for three main reasons:
• it allowed the validation and tuning of the algorithm before the hardware implementation was available (to do this, some parts of the implementation are written to mimic their hardware counterparts);
• it provided us with sample results to detect and diagnose potential problems in the hardware implementation;
• it was necessary to benchmark the algorithm on a PC, which is a reasonable term of comparison.

The software has been developed as a native Linux application on GNU/Linux platforms, and has been verified to build on FreeBSD. Since the most compute-intensive part of the processing is the DFT, our software has been written to be able to use FFTW3 [17, 18], which is the fastest free FFT implementation available. A custom implementation has been developed so that our software can be used even when FFTW3 is not available, but it turned out to be significantly slower than the library implementation. To exploit the computational power of recent multi-core/multi-thread processors, significant effort has been put into making the software multi-threaded, using the POSIX thread library on Linux-based systems. In the enrollment, the parallelism is at the level of a single rotation (meaning that each thread rotates, transforms and filters the fingerprint), while in the matching the threads work at the level of a comparison (meaning that each thread compares the input fingerprint with a full template, which includes all the rotational variants). A Windows port has been developed using the MinGW/mSYS [19] environment and the Pthreads-w32 [20] library. The result was measurably slower than the native implementation.
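As a software illustration of the matching chain of Section 4.2 (phase difference, inverse DFT, squared modulus, peak search), the following Python sketch computes a plain phase-only correlation score on small images. It is a toy model under stated assumptions: it uses a naive floating point DFT from the standard library instead of FFTW3 or the integer hardware transform, it omits the band limitation and the rotated variants, and all function names are hypothetical.

```python
import cmath
import math
import random


def dft2(img, inverse=False):
    """Naive 2D DFT of a square image given as a list of rows (illustration only)."""
    n = len(img)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]

    def dft1(vec):
        out = [sum(vec[t] * twiddle[(t * k) % n] for t in range(n)) for k in range(n)]
        return [v / n for v in out] if inverse else out

    rows = [dft1(row) for row in img]                                  # every row
    cols = [dft1([rows[r][c] for r in range(n)]) for c in range(n)]    # every column
    return [[cols[c][r] for c in range(n)] for r in range(n)]          # transpose back


def phase_only(spectrum):
    """Keep the phase of each sample and discard the modulus (step 5)."""
    return [[v / abs(v) if abs(v) > 1e-12 else 1 + 0j for v in row] for row in spectrum]


def matching_score(enrolled_input, enrolled_template):
    """Phase difference, inverse DFT, squared modulus and peak search (steps 6 to 8)."""
    diff = [[a * b.conjugate() for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(enrolled_input, enrolled_template)]
    corr = dft2(diff, inverse=True)
    return max(abs(v) ** 2 for row in corr for v in row)


def circular_shift(img, dy, dx):
    """Translate an image with wrap-around, to exercise the shift invariance."""
    n = len(img)
    return [[img[(r - dy) % n][(c - dx) % n] for c in range(n)] for r in range(n)]


random.seed(1)
image = [[random.random() for _ in range(8)] for _ in range(8)]
unrelated = [[random.random() for _ in range(8)] for _ in range(8)]
enrolled = phase_only(dft2(image))
score_shifted = matching_score(enrolled, phase_only(dft2(circular_shift(image, 2, 3))))
score_unrelated = matching_score(enrolled, phase_only(dft2(unrelated)))
```

In this toy setting a translated copy of the same image yields a score of 1 (up to rounding), while an unrelated image yields a much lower peak, which is the property the threshold comparison relies on.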
5. PERFORMANCE

5.1. Speed
Our hardware implementation has been deployed on an Altera Stratix II 2S60 FPGA board [21], which turned out to be able to host two parallel cores running at 100 MHz. This results in a matching time of 660 μs per fingerprint, computed from the total throughput of the 2 cores. To compare this number to a PC, we ran our software implementation on an AMD Quad-core Phenom 9750-based computer running Ubuntu Linux with the FFTW3 library available. Using 6 threads (which turned out to be the optimal thread count), a single matching required 3.5 ms; this figure is derived from the maximum throughput rather than from the latency of an isolated matching. These results show a 5x speed-up for the implementation we tested. This is however a huge understatement if we keep in mind that our focus is on comparing the architecture more than the technology. The Stratix II device is a 2004 product, while the Phenom processor has
Fig. 3. Matching algorithm architecture
been on the market since 2008. A fair comparison requires the use of contemporary technology. One easy additional benchmark was obtained by running our software on a computer based on an Intel Pentium IV processor running at 2.6 GHz. Using 2 threads, a single matching required 13 ms. This yields a 20x speed-up using technology of a similar age. A harder task is to deploy our system on a 2008 FPGA, namely a Stratix IV device, whose development board is not yet available in our laboratory. However, we estimate that a large Stratix IV device could easily host 8 matching components running at 250 MHz, yielding a >50x speed-up when compared to the Phenom processor. Such a large value (compared to the 20x result using 2004 technology) is easily justified by observing that in recent years FPGA technology has seen greater improvements than PC processors, and that our 2S60 FPGA is not the largest available in its class (the largest one could easily host 4 cores).
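The speed-up figures quoted in this section can be reproduced with a few lines of Python. The measured times come from the text; the Stratix IV projection relies on an assumption of ours, namely that matching time scales inversely with both the number of cores and the clock frequency. The names used are hypothetical.

```python
FPGA_2S60_US = 660.0    # Stratix II 2S60, two cores at 100 MHz (measured)
PHENOM_US = 3500.0      # AMD Phenom 9750, 6 threads (measured)
PENTIUM4_US = 13000.0   # Intel Pentium IV 2.6 GHz, 2 threads (measured)


def speed_up(reference_us, candidate_us):
    """How many times faster the candidate is than the reference."""
    return reference_us / candidate_us


def scaled_matching_time(base_us, base_cores, base_mhz, cores, mhz):
    """Projected matching time, assuming linear scaling with cores and clock."""
    return base_us * (base_cores / cores) * (base_mhz / mhz)


vs_phenom = speed_up(PHENOM_US, FPGA_2S60_US)        # about 5.3
vs_pentium4 = speed_up(PENTIUM4_US, FPGA_2S60_US)    # about 19.7
stratix4_us = scaled_matching_time(FPGA_2S60_US, 2, 100, 8, 250)  # about 66 us
stratix4_vs_phenom = speed_up(PHENOM_US, stratix4_us)             # about 53
```

Under the linear-scaling assumption, 8 cores at 250 MHz give a projected matching time of about 66 μs, which is consistent with the >50x estimate against the Phenom processor.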
Figure 4 shows how FAR and FRR change as the acceptance threshold grows. Note how rapidly the FAR curve drops and how slowly the FRR grows: since a low FAR is often more important than a low FRR, this behaviour of the curves indicates the good quality of the system. An EER of 6.16% is reached when the threshold is 37. It is impossible to select the “right” threshold without knowing the final architecture of the system. We consider two possible scenarios:
• first scenario: standalone algorithm. In this scenario the algorithm described is the only one operating. In that case it would be preferable to have a low FAR and a reasonably low FRR. In some cases an extremely low FAR (such as that obtained with threshold = 41) would be acceptable, even if roughly one matching fingerprint in six is rejected (that is, FRR = 14%). Repeating the fingerprint acquisition brings the probability of false rejection to roughly one in thirty-six, and a third acquisition would also fail in only 0.46% of cases;
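The repeated-acquisition figures of the first scenario can be checked with a short Python sketch. It assumes, as a simplification of ours, that successive acquisitions of the same finger are rejected independently; the function name is hypothetical.

```python
def consecutive_rejections(frr, attempts):
    """Probability that a matching finger is rejected `attempts` times in a row,
    assuming independent acquisitions (a simplification)."""
    return frr ** attempts


# Using the "one in six" approximation of the text.
two_in_a_row = consecutive_rejections(1 / 6, 2)     # 1/36, about 2.8%
three_in_a_row = consecutive_rejections(1 / 6, 3)   # 1/216, about 0.46%

# Using FRR = 14% directly gives slightly lower values.
two_exact = consecutive_rejections(0.14, 2)         # 0.0196
three_exact = consecutive_rejections(0.14, 3)       # 0.002744
```

The values quoted in the text correspond to the one-in-six approximation; with FRR = 14% taken literally, two consecutive rejections occur in about 2.0% of cases and three in about 0.27%.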
5.2. Accuracy
We measured the accuracy of our system on the FVC2002 databases [22], containing 256x256 8-bit images acquired from optical sensors and capacitive sensors, as well as synthetic images. To properly assess a matching algorithm operating on strongly unbalanced databases (as fingerprint databases always are, because matching fingerprints are far fewer than non-matching ones), it is not enough to simply count the number of errors. The following standard parameters have been calculated:
• second scenario: database pruning. In this scenario our architecture is used to rapidly eliminate fingerprints that definitely do not match, reducing the workload of another, more accurate (and slower) algorithm that follows in the processing chain. The main task of that other algorithm is to take care of all the false positive matches that our system was not able to eliminate. For this purpose it is important to have a low FRR: as an example, threshold = 33 would reject less than 2% of matching fingerprints and would eliminate nearly
• FAR (False Acceptance Rate) represents the probability of accepting non-matching fingerprints as matching and is calculated as FAR = false positives / (false positives + true negatives);
• FRR (False Rejection Rate) represents the probability of rejecting matching fingerprints as non-matching and is calculated as FRR = false negatives / (false negatives + true positives);
• EER (Equal Error Rate) is the error rate of the system at the threshold where FAR and FRR are equal. Note that a system is rarely operated at the EER point, since it is usually preferable to reduce FAR at the cost of a higher FRR;
• ROC (Receiver Operating Characteristic) is a plot of the sensitivity vs. (1 - specificity) as the discrimination threshold is varied. An ideal ROC has a vertical segment followed by a horizontal one, and describes a system that always chooses the correct answer. A completely random system would have a 45° segment;
• AUC (Area Under the ROC Curve) measures the discriminatory capability of the algorithm, and varies from 0.5 (clueless system) to 1 (ideal classifier). In our experiments it was equal to 0.983.
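The FAR, FRR and EER definitions above can be turned into a minimal Python sketch. The convention that a score greater than or equal to the threshold is accepted is an assumption of ours, the function names are hypothetical, and the score lists are made-up toy values, not data from our experiments.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR), accepting a comparison when score >= threshold."""
    false_pos = sum(1 for s in impostor_scores if s >= threshold)
    false_neg = sum(1 for s in genuine_scores if s < threshold)
    true_neg = len(impostor_scores) - false_pos
    true_pos = len(genuine_scores) - false_neg
    far = false_pos / (false_pos + true_neg)
    frr = false_neg / (false_neg + true_pos)
    return far, frr


def equal_error_point(genuine_scores, impostor_scores, thresholds):
    """Return (threshold, error rate) where FAR and FRR are closest to each other."""
    def gap(t):
        far, frr = far_frr(genuine_scores, impostor_scores, t)
        return abs(far - frr)

    best = min(thresholds, key=gap)
    far, frr = far_frr(genuine_scores, impostor_scores, best)
    return best, (far + frr) / 2


# Toy scores, for illustration only.
genuine = [50, 45, 40, 30]     # comparisons between prints of the same finger
impostor = [10, 20, 25, 35]    # comparisons between prints of different fingers
far_33, frr_33 = far_frr(genuine, impostor, 33)
eer_threshold, eer = equal_error_point(genuine, impostor, range(0, 60))
```

Sweeping the threshold over the whole score range and plotting the two rates gives curves of the kind shown in Fig. 4.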
Fig. 4. Evolution of FAR and FRR as the threshold varies
[2] K. Ito et al., “A Fingerprint Matching Algorithm Using Phase-Only Correlation”, IEICE Trans. Fundamentals, vol. E87-A, no. 3, 2004.
[3] N. K. Ratha, K. Karu, S. Chen, A. K. Jain, “A real-time matching system for large fingerprint databases”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 799-813, Aug. 1996.
[4] M. Barrenechea, J. Altuna, M. San Miguel, “A Low-Cost FPGA-based Embedded Fingerprint Verification and Matching System”, Fifth Workshop on Intelligent Solutions in Embedded Systems, Leganes, June 2007, pp. 250-261, ISBN 978-84-89315-47-1.
[5] M. Fons, F. Fons, E. Canto, M. Lopez, “Hardware-Software Codesign of a Fingerprint Matcher on Card”, IEEE International Conference on Electro/information Technology, East Lansing, May 2006, pp. 113-118, ISBN 0-7803-9592-1.
[6] C. Militello, V. Conti, F. Sorbello, S. Vitabile, “A Novel Embedded Fingerprints Authentication System Based on Singularity Points”, International Conference on Complex, Intelligent and Software Intensive Systems, pp. 72-78, 2008.
[7] M. Lopez, E. Canto, “FPGA implementation of a minutiae extraction fingerprint algorithm”, IEEE International Symposium on Industrial Electronics, pp. 1920-1925, 2008.
[8] F. Fons, M. Fons, E. Canto, “Approaching Fingerprint Image Enhancement through Reconfigurable Hardware Accelerators”, IEEE International Symposium on Intelligent Signal Processing, Alcala de Henares, October 2007, pp. 1-6, ISBN 978-1-4244-0830-6.
[9] M. Fons, F. Fons, E. Canto, “Design of FPGA-based Hardware Accelerators for On-line Fingerprint Matcher Systems”, Research in Microelectronics and Electronics, 2006, pp. 333-336, ISBN 1-4244-0157-7.
[10] M. L. Garcia, E. F. Canto Navarro, “FPGA Implementation of a Ridge Extraction Fingerprint Algorithm Based on Microblaze and Hardware Coprocessor”, International Conference on Field Programmable Logic and Applications, Madrid, August 2006, pp. 1-5, ISBN 1-4244-0312-X.
[11] V. A. Sujan, M. P. Mulqueen, “Fingerprint Identification Using Space Invariant Transforms”, Pattern Recognition Letters, vol. 23, no. 5, pp. 609-619, 2002.
[12] A. M. Bazen, G. T. B. Verwaaijen, S. H. Gerez, L. P. J. Veelenturf, B. J. Van der Zwaag, “A Correlation-Based Fingerprint Verification System”, 11th Annual Workshop on Circuits Systems and Signal Processing (ProRISC), 30 November - 1 December 2000, Veldhoven, the Netherlands, pp. 205-213, STW Technology Foundation, ISBN 90-73461-24-3.
[13] J. W. Cooley, J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series”, Mathematics of Computation, vol. 19, no. 90, pp. 297-301, 1965.
[14] S. G. Johnson, M. Frigo, “A modified split-radix FFT with fewer arithmetic operations”, IEEE Trans. Signal Processing, vol. 55, no. 1, pp. 111-119, 2007.
[15] http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf
[16] http://www.altera.com/literature/manual/mnl_avalon_spec.pdf
[17] http://www.fftw.org/
[18] M. Frigo, S. G. Johnson, “The Design and Implementation of FFTW3”, Proceedings of the IEEE, 2005, vol. 93, no. 2, pp. 216-231. Invited paper, Special Issue on Program Generation, Optimization, and Platform Adaptation.
[19] http://www.mingw.org/
[20] http://sourceware.org/pthreads-win32/
[21] http://www.altera.com/products/devkits/altera/kit-Nios II2S60.html
[22] http://bias.csr.unibo.it/fvc2002/databases.asp
[23] A. Lindoso, L. Entrena, J. Izquierdo, “FPGA-Based Acceleration of Fingerprint Minutiae Matching”, 3rd Southern Conference on Programmable Logic (SPL '07), pp. 81-86, 2007.
[24] A. Lindoso, L. Entrena, C. Lopez-Ongil, J. Liu, “Correlation-based fingerprint matching using FPGAs”, Proc. of IEEE International Conference on Field-Programmable Technology, pp. 87-94, 2005.
[25] G. Danese, M. Giachero, F. Leporati, G. Matrone, N. Nazzicari, “An FPGA-Based Embedded System for Fingerprint Matching Using Phase-Only Correlation Algorithm”, Proc. of Euromicro Conference on Digital System Design, Patras, 2009, pp. 672-679.
80% of the database, in practice giving the accurate algorithm a speed-up of between four and five.

6. CONCLUSIONS
In this paper we propose an architecture for fast fingerprint matching, and show the results obtained from a Stratix II FPGA implementation. The resulting device turns out to outperform even modern high-performance COTS processors by at least an order of magnitude in terms of matching time, while the algorithm itself is reasonably accurate, as shown in Sections 5.1 and 5.2. Moreover, our sub-ms matching time compares favorably with other currently available FPGA-based fingerprint matching implementations. In fact:
• in [6], Militello et al. propose an FPGA (Virtex II) matching system based on singularity point detection, with a false acceptance rate and a false rejection rate of around 3.48% and 6.32% respectively, an enrollment implementation requiring 31.2 ms and a matching time of 3.62 ms;
• in [7], Lopez et al. propose a minutiae extraction algorithm implemented on a Spartan III FPGA requiring 988 ms to process a fingerprint;
• in [8-9], Fons et al. propose a fingerprint enhancer and a matcher (using Atmel and again based on minutiae extraction) leading to a 25-40 ms enrollment time and a 7.2 ms matching time;
• in [10], Garcia et al. propose a ridge extraction algorithm implemented on a Virtex II FPGA requiring 261.9 ms to process a single fingerprint;
• in [23-24], Lindoso et al. propose two FPGA-based (Virtex IV) matching algorithms with low processing times, but do not provide usable precision metrics because they use fingerprint databases of unknown origin, so their results are not comparable with our work.

We are also aware of the importance of precision in terms of latency and accuracy. For our evaluation we used floating point arithmetic in the enrollment phase. The resulting data were truncated to 8 bits, which is enough to perform the matching on the resulting spectrum.
REFERENCES
[1] K. Ito, A. Morita, T. Aoki, T. Higuchi, H. Nakajima, K. Kobayashi, “A Fingerprint Recognition Algorithm Using Phase-Based Image Matching for Low-Quality Fingerprints”, IEEE International Conference On Image Processing, September 2005, ISBN 0-7803-9134-9.