Efficient Identification of Faces in Video Streams

CHAPTER 3.3 EFFICIENT IDENTIFICATION OF FACES IN VIDEO STREAMS USING LOW-POWER MULTI-CORE DEVICES

Donavan Prieur (1), Eric Granger (1), Yvon Savaria (2), and Claude Thibeault (3)

(1) Laboratoire d'imagerie, de vision et d'intelligence artificielle, École de technologie supérieure, Université du Québec, Montreal, Canada
(2) Groupe de recherche en microélectronique et microsystèmes, École Polytechnique de Montréal, Montreal, Canada
(3) Laboratoire de communications et d'intégration de la microélectronique, École de technologie supérieure, Université du Québec, Montreal, Canada
E-mail: [email protected]

The recognition of individuals based on their facial traits provides a powerful alternative to traditional methods for access control in many mobile and distributed applications. In these cases, the fuzzy ARTMAP neural network classifier is well adapted for fast and efficient matching of faces captured in video streams against the models of individuals enrolled to the access control system. In this paper, fuzzy ARTMAP networks jointly optimized for accuracy and efficiency are implemented and compared for one-against-many face matching using the Intel Core i3-530, the Intel Atom N270 and the Octasic Vocallo MGW processors. The performance of these implementations is studied from several standpoints, including processing time, memory requirements, energy consumption and classification accuracy. Experimental results obtained using real-world video data show that implementing fuzzy ARTMAP networks optimized using multi-objective PSO on low-power parallel processors significantly reduces energy consumption compared with traditional processor solutions, while maintaining a high level of classification accuracy.

1. Introduction

Automatic identification of individuals is an important function in a wide range of public and private sector applications that require secured access to resources through computers, phones, automatic teller machines, etc. [1]. A growing number of biometric systems are being deployed for the recognition of individuals based on biometric traits such as the face, fingerprint, iris and signature, rather than ID cards and access codes. Since the hardware required for biometric recognition


Prieur, Granger, Savaria and Thibeault

(e.g., camera sensors, microphones, accelerometers) is now integrated inside most mobile electronic devices, techniques that were once limited to controlling access to high-security areas can now be exploited for more routine tasks such as user identification for shared mobile devices. However, continual improvements in accuracy and transaction speed are required to ensure ease of use, cost effectiveness and enhanced security. To analyse the large amounts of audio, video and physiological data, specialized implementations are required for computer vision and pattern recognition techniques.

Systems for face recognition (FR) from video streams are relevant in different scenarios, ranging from open-set (see footnote a) video surveillance or screening applications, where individuals of interest in a watch list must be recognized within dense and moving crowds at major events and airports, to closed-set access control applications, where one of N individuals pre-enrolled to a system must be identified prior to accessing secured resources. In identification applications, an individual provides a reference still image or video for enrollment. Then, given operational video streams captured using one or more cameras, the FR system performs segmentation to isolate facial regions of interest (ROIs) in each frame. Invariant and discriminant features of each ROI are then extracted and assembled into a ROI pattern for classification. That is, ROI patterns are matched against the facial models of individuals enrolled to the biometric system. Finally, classification scores are output to provide application-specific decisions. In 1-to-N identification, the system selects the enrolled individuals with the highest matching scores for each face capture.

The performance of state-of-the-art systems applied to video-based FR typically declines in practice. Indeed, the faces captured in video frames are typically of poor quality and lower resolution.
The appearance of faces may vary considerably due to limited control over capture conditions (e.g., illumination, pose, blur, expression and occlusion) and changes in an individual's physiology (e.g., aging). There are additional limitations associated with the camera and signal processing techniques used for segmentation, scaling, filtering, feature extraction and classification [3-5]. Finally, the facial models used for matching are designed during a preliminary enrolment process, using a limited number of reference faces. All these factors contribute to a growing divergence between the stored facial model of an individual and its underlying class distribution [6]. Despite these difficulties, techniques that exploit spatio-temporal information extracted from video sequences have been shown to improve performance [4].

Footnote a: Applications where all input samples correspond to individuals enrolled to the system are called closed set, whereas in open-set applications some inputs correspond to unknown individuals.

Efficient Identification of Faces in Video Streams Using Low-Power Multi-Core Devices


Beyond the need for accurate video FR techniques, the efficient implementation of such techniques for low-power, distributed, and mobile devices constitutes a challenging problem. With distributed and mobile video applications, a cost-sensitive FR system must typically operate with limited resources yet sustain a high classification rate, a processing speed close to real time and good power efficiency to maximize battery life. Matching each input ROI to the facial models of several individuals increases the computational burden.

This paper focuses on fast and energy-efficient implementations of optimized fuzzy ARTMAP classifiers for video FR in closed-set identification applications, as needed for controlling access to shared resources. Fuzzy ARTMAP [7] is a versatile neural network classifier that has been shown to provide a high level of classification accuracy with moderate time and memory complexity [8]. As such, it has been successfully applied to a wide variety of pattern recognition problems [9]. It is well adapted to efficient matching of ROI patterns against facial models in 1-to-N identification applications due to its ability to perform fast, stable, on-line, unsupervised, supervised and incremental learning from a limited amount of training data [40]. A multi-objective particle swarm optimization (MOPSO) training strategy [10] is used to jointly optimize all fuzzy ARTMAP parameters such that both classification error and memory resources (for storage of facial models) are minimized [9].

Three different commercial processors are evaluated and compared for fast and low-power software implementation of fuzzy ARTMAP networks for face matching in video streams. They are: (1) the Intel Core i3-530 [11], one of the most common dual-core processors found in current workstations, (2) the Intel Atom N270 [12], a low-power processor intended for mobile applications, and (3) the Octasic Vocallo MGW [13], a low-power multi-core processor principally marketed for video transcoding applications.
The performance of these different implementations is compared in terms of processing time, memory requirements, energy consumption, and classification accuracy using real-world video data collected by the Institute of Information Technology of the Canadian National Research Council (IIT-NRC) for secured computer login [3].

The rest of this paper is structured as follows. The next section provides a brief review of video-based FR. Section 3 briefly outlines the main features of fuzzy ARTMAP, along with the MOPSO training strategy and algorithmic modifications for efficient software implementations. The hardware architectures of the three processors are reviewed in Section 4. Then, Section 5 describes the experimental methodology (data set, protocol and performance measures) used to compare the different implementations. Finally, simulation results obtained on real-world video streams are presented and discussed in Section 6.


2. Video-Based Facial Recognition

Figure 1 shows a typical system for automated video-based face recognition. Each digital camera captures a sequence of 2D images providing the system with a particular view of individuals populating the scene for each frame. First, the system performs segmentation to locate and isolate regions of interest (ROIs) corresponding to the individual faces in the current frame. Invariant and discriminant features are then extracted from the ROI and assembled into feature patterns a and b for classification (matching) and spatio-temporal tracking functions, respectively. During enrolment, one or more input patterns acquired for an individual are used to design a facial template or model for that individual, which is stored in a database. Matching is typically implemented with classifiers trained to map the input pattern space to one of L predefined classes, each one corresponding to an individual enrolled to the system. During operations, input patterns a are matched against the facial models of individuals enrolled to the system. The resulting matching score S_l(a) indicates the likelihood that input a corresponds to individual l, for l = 1, 2, ..., L, and is compared against a decision threshold γ_l to provide an application-specific decision. In verification applications, the system accepts or rejects a claimed identity, while in identification and surveillance applications, the system outputs a list of the most likely or of all possible matching identities, respectively. To reduce ambiguities during the decision process, some features are also assembled into an input pattern b for tracking of an individual's motion or appearance over successive ROIs.
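
The decision step described above can be illustrated with a minimal C sketch. It assumes that the scores S_l(a) and thresholds γ_l are held in plain arrays, and that returning -1 when no score reaches its threshold is an acceptable way to signal a rejection; the function name `identify` and this rejection convention are illustrative assumptions, not part of the system described in this chapter.

```c
#include <stddef.h>

/* Returns the index of the enrolled individual whose matching score is the
 * highest among those that meet their own decision threshold, or -1 when no
 * score qualifies. scores[l] plays the role of S_l(a) and thresholds[l] the
 * role of gamma_l (hypothetical array layout). */
int identify(const double *scores, const double *thresholds, size_t num_classes)
{
    int best = -1;
    for (size_t l = 0; l < num_classes; ++l) {
        if (scores[l] < thresholds[l])
            continue;                      /* score too low for individual l */
        if (best < 0 || scores[l] > scores[best])
            best = (int)l;                 /* keep the strongest valid match */
    }
    return best;
}
```

With a single shared threshold this reduces to selecting the highest score and rejecting it if it falls below the threshold.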

Figure 1: A generic system for video-based FR.

In still-to-video applications, facial models used for classification are designed during enrolment from one or more ROIs from reference still images.


Video-to-video applications differ in that facial models are designed using ROIs and spatio-temporal information extracted from reference video streams. In video surveillance, for instance, an analyst may decide to enroll individuals of interest in some video stream, and then match and track their activities over multiple video feeds (from various cameras).

A common approach to recognizing faces in video consists of exploiting only spatial information, and applying extensions of image-based techniques to high-quality ROIs isolated during segmentation. Several powerful techniques have been proposed to recognize frontal views of faces in 2D still images under controlled operational conditions. The predominant techniques are holistic or appearance-based methods like Eigenfaces, and local or feature-based methods like Elastic Bunch Graph Matching [14].

Despite the challenges of video-to-video FR, it is possible to exploit spatio-temporal information extracted from video sequences to improve performance (see Figure 1). Weak evidence in individual frames can be integrated over trajectories, potentially leading to more accurate recognition. For example, track-and-classify approaches combine information from the motion and appearance of faces in a scene to reduce ambiguity (e.g., partial occlusion) [4-5]. Faces of different individuals are regrouped using a tracker, and the scores of the corresponding ROIs are accumulated for robust FR.

Matching ROI patterns against the facial models of individuals is typically the bottleneck for an FR system. The fuzzy ARTMAP neural network classifier has been selected to implement face matching because it can perform supervised incremental learning from limited data for fast and efficient one-against-many matching.
With neural and statistical classifiers, the facial models are assumed to have been estimated a priori during enrolment by training the fuzzy ARTMAP weights with the reference ROIs of each individual enrolled to the system, where each individual is linked to an output class. During operations, matching is performed in a manner analogous to template matching with the Eigenfaces technique [4], but where the fuzzy ARTMAP classifier matches input ROI patterns to compact statistical class models.

The design of efficient systems for facial matching involves a trade-off between predictive accuracy and resource requirements (storage of facial models, power consumption, throughput, etc.). In video-based FR, fast classification is often required to process ROIs in near real time, as ROIs are often captured at 20-30 frames/second per person in the camera viewpoint. It is well known that state-of-the-art systems are confronted with complex environments that change during operations, and that their facial models are designed during a preliminary enrolment process, using limited data and knowledge of individuals. Facial models often become poor representatives of the biometric trait to be


recognized. The need to store more representative facial models (more user templates or a statistical representation) increases the resource requirements of the system. This issue is particularly relevant in mobile and distributed applications, where hardware resources are for the most part limited.

3. Implementation of the Fuzzy ARTMAP Neural Network Classifier

While templates or statistical non-parametric representations, as in Parzen window [15] or k-Nearest Neighbor (k-NN) classifiers [16], allow facial models to be represented for the matching process, they involve storing in memory all reference samples captured during enrolment. By contrast, neural and statistical classifiers, such as generative one-class classifiers based on Gaussian Mixture Models [17], or discriminative two-class classifiers based on support vector machines (SVMs) [18], provide some data compression for facial models and reduce the number of matching operations per ROI. In this paper, the fuzzy ARTMAP neural network classifier [7] is implemented for matching of ROIs against the facial models due to its ability to perform fast, stable, on-line, unsupervised or supervised and incremental learning from a limited amount of training data. It is known to provide good generalization accuracy from a relatively limited number of training samples [8]. Fast matching is achieved by exploiting an L1 norm (city block or "Manhattan" distance) and fuzzy AND and OR operators. The use of hyper-rectangles to represent facial models also contributes to reducing the time and memory complexity. Finally, its architecture allows the algorithm to be parallelized at different levels of granularity. The time complexity required by fuzzy ARTMAP to process one input pattern, O(MN), holds some advantages compared to multilayer perceptrons, O(M^2), and SVMs, O(LN^2).

3.1. Fuzzy ARTMAP Algorithm

The fuzzy ARTMAP neural network classifier consists of two fully connected layers of nodes: an input layer (F1) of M nodes, and a competitive layer (F2) of N nodes. A set of real-valued weights W = {w_ij ∈ [0,1] : i = 1, 2, ..., M; j = 1, 2, ..., N} is associated with the F1-to-F2 layer connections. Each F2 node j represents a recognition category that learns a prototype vector w_j = (w_1j, w_2j, ..., w_Mj). The F2 layer of fuzzy ARTMAP is connected, through learned associative links, to an L-node map field (F^ab), where L is the number of classes in the output space. A set of binary weights W^ab = {w^ab_jk ∈ {0,1} : j = 1, 2, ..., N; k = 1, 2, ..., L} is associated with the F2-to-F^ab connections. The vector w^ab_j = (w^ab_j1, w^ab_j2, ..., w^ab_jL)


links F2 node j to the L nodes of the output layer F^ab (classes).

During the training phase, fuzzy ARTMAP dynamics are governed by four hyper-parameters: the choice parameter α > 0, the learning rate parameter β ∈ [0,1], the baseline vigilance parameter ρ_0 ∈ [0,1], and the match tracking parameter ε = 0+. For the batch supervised learning of a finite data set, a training set pattern a = (a_1, a_2, ..., a_M) is presented to the network and the vigilance parameter ρ is set to its baseline value ρ_0. The original M-dimensional input pattern a is complement-coded to produce a 2M-dimensional network input pattern A = (a, a^c) = (a_1, a_2, ..., a_M, a^c_1, a^c_2, ..., a^c_M), where a^c_i = (1 - a_i) and a_i ∈ [0,1]. Each F2 node is activated according to the Weber law choice function:

$$T_j(A) = \frac{|A \wedge w_j|}{\alpha + |w_j|} \qquad (1)$$

where ∧ denotes the fuzzy AND (component-wise minimum) and |·| the L1 norm. The node with the strongest activation, J = argmax{T_j : j = 1, 2, ..., N}, is chosen. The algorithm then verifies whether w_J is similar enough to A using the vigilance test:

$$\frac{|A \wedge w_J|}{2M} \geq \rho \qquad (2)$$

If node J fails the vigilance test, it is deactivated and the network searches for the next best node on the F2 layer. If the vigilance test is passed, then the map field F^ab is activated through category J and fuzzy ARTMAP makes a class prediction K = k(J). If K is the correct prediction, category J is updated by adjusting its prototype vector:

$$w_J = \beta (A \wedge w_J) + (1 - \beta) w_J \qquad (3)$$

In the case of an incorrect class prediction, a match tracking signal is raised:

$$\rho = \frac{|A \wedge w_J|}{2M} + \varepsilon \qquad (4)$$

Node J is deactivated, and the search among F2 nodes begins anew. If none of the nodes can satisfy both conditions (vigilance test and correct prediction), then a new F2 node is initialized. A new association between F2 node J and F^ab node K is learned by setting w_J = A and w^ab_Jk = 1 for k = K, where K is the target class label for a, and 0 otherwise. Then, the next training pattern a is presented to the network for complement coding.

Batch supervised training ends in accordance with some learning strategy, following one or more epochs. (An epoch is defined as one complete presentation of all the patterns of a finite training data set.) Once the weights W and W^ab have been found through this process, the fuzzy ARTMAP classifier computes the choice function and predicts a class label for each input pattern. During testing, a pattern a that activates node J is predicted to belong to class K = k(J). Predictions are obtained without vigilance and match tests.
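
The elementary operations of this section (complement coding, the choice function of Eq. 1, the vigilance test of Eq. 2 and the prototype update of Eq. 3) can be sketched in C as follows. This is a minimal illustration, not the implementation evaluated in this chapter: the function names are hypothetical, vectors are plain arrays of doubles, and the normalizer of the vigilance test is passed as the argument `denom` so that the caller chooses it explicitly (the text writes 2M; note that |A| = M for a complement-coded input).

```c
#include <stddef.h>

/* Complement coding: A = (a, 1 - a). A must have room for 2*m values. */
void complement_code(const double *a, size_t m, double *A)
{
    for (size_t i = 0; i < m; ++i) {
        A[i] = a[i];
        A[m + i] = 1.0 - a[i];
    }
}

/* |x ^ w|: L1 norm of the fuzzy AND (component-wise minimum). */
double fuzzy_and_norm(const double *x, const double *w, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (x[i] < w[i]) ? x[i] : w[i];
    return sum;
}

/* |w|: L1 norm of a vector with non-negative components. */
double l1_norm(const double *w, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += w[i];
    return sum;
}

/* Eq. (1): Weber law choice function for one F2 node. */
double choice_weber(const double *A, const double *w, size_t n, double alpha)
{
    return fuzzy_and_norm(A, w, n) / (alpha + l1_norm(w, n));
}

/* Eq. (2): vigilance test; returns 1 if the node passes, 0 otherwise.
 * denom is the normalizer chosen by the caller (assumption, see lead-in). */
int passes_vigilance(const double *A, const double *w, size_t n,
                     double denom, double rho)
{
    return fuzzy_and_norm(A, w, n) / denom >= rho;
}

/* Eq. (3): move prototype w towards A ^ w with learning rate beta. */
void update_prototype(const double *A, double *w, size_t n, double beta)
{
    for (size_t i = 0; i < n; ++i) {
        double m = (A[i] < w[i]) ? A[i] : w[i];
        w[i] = beta * m + (1.0 - beta) * w[i];
    }
}
```

All four routines use only additions, comparisons and one division, which is consistent with the low per-pattern cost discussed in Section 3.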


3.2. MOPSO Training Strategy

A particle swarm optimization (PSO)-based training strategy has been proposed [8] to optimize fuzzy ARTMAP parameters, weights and architecture by using generalization error as its fitness function. With this population-based stochastic optimization technique, each PSO particle corresponds to a single solution in the optimization space, and the population of particles is called a swarm. When fuzzy ARTMAP is trained using the PSO strategy, it produces a significantly lower classification error than with standard parameters on synthetic and real-world problems [9]. However, the performance improvements are accompanied by a greater number of F2 category neurons, especially for data with complex decision boundaries, and time complexity increases with the number of F2 nodes [8]. A multi-objective PSO-based training strategy that seeks the optimal trade-off between classification rate and compression yields more cost-effective solutions for resource-limited implementations.

Multi-objective evolutionary algorithms (MOEAs) can be sorted into six categories [20]: decomposition, preference, indicator, hybrid, memetic and coevolution. PSO-based techniques are part of the hybrid category of MOEAs. MOEAs aim to generate and select a set of non-dominated solutions (belonging to a Pareto front), instead of a single solution as in global optimization. There are several different approaches to extend PSO for multi-objective optimization [20]. The MOPSO algorithm presented in [21] does so by replacing the concept of a swarm's global best solution with an archive of non-dominated solutions. To promote an effective exploration of the parameter space and prevent premature convergence of the solutions, the MOPSO technique also uses a mutation operator, where the probability of randomizing a particle's current position diminishes over time.
Because an archive of solutions replaces the PSO's global best solution, MOPSO also requires modifying the way a particle's velocity is calculated. In the PSO algorithm, a particle's velocity is obtained as a weighted sum of its old velocity, the difference between its current position and the position of its own best performance, and the difference between its current position and the position of the swarm's best performance. The MOPSO algorithm (Algorithm 1) replaces the swarm's best performance by the position of a randomly selected non-dominated solution from the archive.
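
As an illustration of this velocity rule, the following C sketch updates one particle using the weighted sum described above, with the archive-selected position `e` standing in for the swarm's best position. The function name, the argument layout and the per-dimension bounds `lo` and `hi` are assumptions made for the example; the reflection at the boundary follows the general idea of clamping the position and reversing the velocity.

```c
#include <stddef.h>

/* One MOPSO-style update of a single particle in d dimensions.
 * s: current position (updated in place)   v: velocity (updated in place)
 * p: particle's personal best position     e: position drawn from the archive
 * c0, c1, c2: weights; r1, r2: random factors supplied by the caller.
 * lo, hi: assumed bounds of the search interval in every dimension. */
void update_particle(double *s, double *v, const double *p, const double *e,
                     size_t d, double c0, double c1, double c2,
                     double r1, double r2, double lo, double hi)
{
    for (size_t i = 0; i < d; ++i) {
        v[i] = c0 * v[i] + c1 * r1 * (p[i] - s[i]) + c2 * r2 * (e[i] - s[i]);
        s[i] += v[i];
        if (s[i] < lo) {            /* left the search space: clamp, reverse */
            s[i] = lo;
            v[i] = -v[i];
        } else if (s[i] > hi) {
            s[i] = hi;
            v[i] = -v[i];
        }
    }
}
```

Drawing `e` at random from the archive, rather than using a single global best, is what distinguishes this update from the single-objective PSO rule.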


A. Initialization:
  • set the maximum number of iterations Q
  • set MOPSO parameters U, c0, c1, c2, r1 and r2
  • initialize particle positions at random such that s_u^0 ∈ [0,1]^d, for u = 1, 2, ..., U, and p_u = s_u^0
  • initialize particle velocities to v_u^0 = 0, for u = 1, 2, ..., U
  • set iteration counter q = 0 and archive counter g = 0

B. Iterations:
  while q ≤ Q do
    for u = 1, 2, ..., U do
      • train fuzzy ARTMAP using s_u^q
      • compute fitness value F(s_u^q)
      • if F(s_u^q) dominates F(p_u) then update the particle's personal best position: p_u = s_u^q
      • for h = 1, 2, ..., g do
          if F(s_u^q) dominates F(archive_h) then archive_h = Ø
      • if F(s_u^q) is not dominated by a particle in the archive then
          archive = archive ∪ {s_u^q}
          g = g + 1
        end
    end
    for u = 1, 2, ..., U do
      • select a hypercube randomly such that the probability of its selection among all populated hypercubes is inversely proportional to the number of particles it contains
      • randomly select a particle (position e_u) from the selected hypercube to evaluate the particle's velocity
      • compute velocity: v_u^{q+1} = c0 v_u^q + c1 r1 (p_u - s_u^q) + c2 r2 (e_u - s_u^q)
      • compute position: s_u^{q+1} = s_u^q + v_u^{q+1}
      • if s_u^{q+1} lies outside the boundary of the parameter space then s_u^{q+1} = boundary value and v_u^{q+1} = -v_u^{q+1}
      • if s_u^{q+1} mutates then set s_u^{q+1} at random such that s_u^{q+1} ∈ [0,1]^d
    end
    q = q + 1
  end

Algorithm 1: MOPSO-based strategy for supervised learning of fuzzy ARTMAP.
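
The archive maintenance step of Algorithm 1 relies on a Pareto dominance test. The C sketch below shows one possible form of that test and of the archive update, assuming two objectives that are both maximized (for example classification rate and compression). For brevity the archive stores only fitness vectors; a complete implementation would also store the corresponding particle positions. The names `dominates`, `archive_insert` and `NUM_OBJ` are hypothetical.

```c
#include <stddef.h>

#define NUM_OBJ 2   /* assumed: two objectives, both to be maximized */

/* Returns 1 if fitness fa dominates fb: no worse on every objective and
 * strictly better on at least one. */
int dominates(const double *fa, const double *fb)
{
    int strictly_better = 0;
    for (size_t k = 0; k < NUM_OBJ; ++k) {
        if (fa[k] < fb[k])
            return 0;
        if (fa[k] > fb[k])
            strictly_better = 1;
    }
    return strictly_better;
}

/* Tries to add fitness f to the archive and returns the new archive size.
 * f is rejected if an archived solution dominates it; archived solutions
 * dominated by f are removed. */
size_t archive_insert(double archive[][NUM_OBJ], size_t size, size_t capacity,
                      const double *f)
{
    for (size_t h = 0; h < size; ++h)
        if (dominates(archive[h], f))
            return size;                     /* f is dominated: no change */

    size_t kept = 0;
    for (size_t h = 0; h < size; ++h) {
        if (!dominates(f, archive[h])) {     /* keep non-dominated members */
            for (size_t k = 0; k < NUM_OBJ; ++k)
                archive[kept][k] = archive[h][k];
            ++kept;
        }
    }
    if (kept < capacity) {                   /* append f if there is room */
        for (size_t k = 0; k < NUM_OBJ; ++k)
            archive[kept][k] = f[k];
        ++kept;
    }
    return kept;
}
```

After the last iteration, the entries remaining in such an archive approximate the Pareto front from which a final network can be chosen.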

Algorithm 1 shows the pseudo-code of the MOPSO strategy applied to supervised learning of fuzzy ARTMAP (FAM) neural network classifiers [10]. It seeks to maximize the multi-objective fitness function F(s_u^q) in the 4-dimensional space of FAM


hyper-parameter values s_u^q = (α_u^q, β_u^q, ε_u^q, ρ_{0,u}^q). Each iteration q of this algorithm consists of evaluating the classification rate and compression of a trained FAM network on a validation subset. Therefore, the MOPSO strategy jointly optimizes the hyper-parameters, the network architecture and the synaptic weight values. Following the last iteration of Algorithm 1, the overall performance of the networks corresponding to the particle positions of the non-dominated solutions (stored in the archive) is evaluated by measuring the classification rate on an independent test dataset.

3.3. Modifications for Efficient Implementation

During the implementation of the fuzzy ARTMAP algorithm in the C programming language, some modifications were made to increase both performance and portability. The most noteworthy improvements are the following.

Choice function: Floating-point division is a complex and time-consuming operation for modern processors, and many specialized systems do not include it in their basic instruction set. Since such a division is at the core of the choice-by-ratio function used to calculate F2 category neuron activation (Eq. 1), it is replaced by a computationally simpler choice-by-difference function [22], which avoids floating-point division, improves the resolution of results, accelerates processing, and improves algorithm portability to more specialized hardware. The choice-by-difference function is defined as:

$$T_j(A) = |A \wedge w_j| + (1 - \alpha)(M - |w_j|) \qquad (5)$$

Negative match tracking: It has been shown that using a negative match tracking parameter (ε) improves classifier performance when it is subjected to overlapping classes during training, and limits the proliferation of F2 nodes [23]. By allowing the optimization process to use both positive and negative values for the match tracking parameter, ε ∈ [-1,1], the choice of positive or negative match tracking is left to the MOPSO training strategy.
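
A minimal C sketch of the choice-by-difference activation of Eq. (5) is given below. It assumes complement-coded inputs and prototypes of length 2M stored as arrays of doubles; the function name is hypothetical. The single pass over the vectors accumulates both |A ^ w_j| and |w_j|, and no division is needed.

```c
#include <stddef.h>

/* Eq. (5): T_j(A) = |A ^ w_j| + (1 - alpha) * (M - |w_j|).
 * A and w are complement-coded vectors of length 2*m. */
double choice_by_difference(const double *A, const double *w, size_t m,
                            double alpha)
{
    double and_norm = 0.0;   /* |A ^ w|: sum of component-wise minima */
    double w_norm = 0.0;     /* |w|: L1 norm of the prototype */
    for (size_t i = 0; i < 2 * m; ++i) {
        and_norm += (A[i] < w[i]) ? A[i] : w[i];
        w_norm += w[i];
    }
    return and_norm + (1.0 - alpha) * ((double)m - w_norm);
}
```

For an input that coincides with a stored prototype, the second term vanishes and the activation equals M, whereas a prototype of all ones is penalized by its large L1 norm.
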
Output weight encoding: To reduce network size and improve computational speed, the class number is


encoded in binary directly into the connection weights (W^ab) to the output layer (F^ab). By doing so, the number of neurons on the output layer (F^ab) is reduced from L to log2(L) neurons (rounded up).

Accumulation of responses: Until this point, the classification rate of the FR system corresponds to matching responses for single independent ROIs. By tracking the motion of individuals over the successive frames of a video stream, it is possible to improve the performance of FR by accumulating multiple responses for each individual. One way to implement evidence accumulation is to store classifier scores or predictions for consecutive ROIs of a person in a decision buffer. This allows a discrete probability density estimate to be built before a final prediction is made [24]. While this can greatly improve classification accuracy, it also delays the output predictions.

Parallel processing: Despite the reduction in parasitic capacitance, power consumption generally increases with the continuing miniaturization of transistors. A common technique to increase the computing power of processors while managing power consumption is the use of parallel processing. However, algorithms must be designed and/or modified to exploit parallel architectures. There are two main approaches for parallelizing the fuzzy ARTMAP algorithm: data partitioning and network partitioning [38]. In the data partitioning approach, incoming ROIs are divided among several independent instances of the same neural network, which execute the same core algorithm in parallel to increase the system's overall throughput. In the network partitioning approach, a neural network is partitioned into several smaller subnets.
The same pattern is then processed by each subnet in parallel before local predictions are transmitted to a master processor that produces a global final prediction. Data partitioning has been shown to outperform network partitioning for fuzzy ARTMAP training [38]. A key factor for efficient implementation is the size of the network with respect to the local memory of a processing core. If the local memory of each core can store an entire neural network, then identical networks can be replicated over multiple cores for parallel processing of multiple ROIs. Although this


allows for parallel processing of ROIs from multiple different individuals in one or more video sequences from several cameras, performance will not necessarily increase for ROIs from a single individual captured in a video sequence. In such a case, the processing speed will be dictated by the camera's acquisition rate. If the network size is larger than the cache memory of a processing core, the network must be partitioned into subnets that process the same input pattern and communicate their local predictions to each other (network partitioning). Local predictions may be exchanged between processing elements using different communication strategies (Figure 2). In the master-slave strategy, all processing elements communicate their predictions sequentially to a single master processing element, which then combines them to provide a global prediction. In the pipeline strategy, each processing element compares its local prediction to the local prediction received from the upstream processing element and sends the better of the two predictions to the downstream processing element, until the last processing element in the pipeline produces the final global prediction. The independent strategy is the test case where no communication takes place between processing elements.
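
The way local predictions are merged under network partitioning can be sketched in C as follows. The sketch assumes that each subnet reports the activation of its winning F2 node together with the associated class label, and that the "better" prediction is the one with the higher activation; the type and function names are hypothetical, and the inter-core communication itself is not modelled.

```c
#include <stddef.h>

/* Local result of one subnet (assumed content of the exchanged message). */
typedef struct {
    double activation;   /* choice-function value of the winning F2 node */
    int    class_label;  /* class linked to that node */
} local_prediction;

/* Pipeline strategy: one stage keeps the better of the upstream prediction
 * and its own local prediction, then forwards it downstream. */
local_prediction pipeline_stage(local_prediction upstream,
                                local_prediction local)
{
    return (local.activation > upstream.activation) ? local : upstream;
}

/* Master-slave strategy: the master scans all received local predictions
 * (count must be at least 1) and returns the global prediction. */
local_prediction master_combine(const local_prediction *preds, size_t count)
{
    local_prediction best = preds[0];
    for (size_t i = 1; i < count; ++i)
        best = pipeline_stage(best, preds[i]);
    return best;
}
```

Both strategies yield the same global prediction under these assumptions; they differ in where the comparisons are executed and in how many messages each processing element must handle.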

Figure 2: The independent, master/slave and pipelined communication strategies.
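
Returning to the accumulation of responses described earlier in this section, a decision buffer can be sketched in C as a small circular buffer of class predictions whose histogram gives the final decision. The buffer length, the upper bound on the number of classes and all names are assumptions chosen for the example; the text does not specify how the buffer is implemented.

```c
#include <stddef.h>

#define BUFFER_LEN   8    /* hypothetical number of ROIs accumulated */
#define MAX_CLASSES 32    /* hypothetical upper bound on the number of classes */

typedef struct {
    int    labels[BUFFER_LEN];  /* most recent class predictions */
    size_t count;               /* number of valid entries */
    size_t next;                /* slot overwritten by the next push */
} decision_buffer;

void buffer_init(decision_buffer *b)
{
    b->count = 0;
    b->next = 0;
}

/* Stores the prediction for the latest ROI, discarding the oldest one
 * once the buffer is full. */
void buffer_push(decision_buffer *b, int label)
{
    b->labels[b->next] = label;
    b->next = (b->next + 1) % BUFFER_LEN;
    if (b->count < BUFFER_LEN)
        ++b->count;
}

/* Returns the most frequent label in the buffer, or -1 if it is empty.
 * Ties are resolved in favour of the label that reaches the count first. */
int buffer_decide(const decision_buffer *b)
{
    size_t histogram[MAX_CLASSES] = {0};
    int best = -1;
    for (size_t i = 0; i < b->count; ++i) {
        int label = b->labels[i];
        if (label < 0 || label >= MAX_CLASSES)
            continue;                         /* ignore out-of-range labels */
        ++histogram[label];
        if (best < 0 || histogram[label] > histogram[best])
            best = label;
    }
    return best;
}
```

A longer buffer smooths out more single-frame errors but, as noted in the text, delays the output prediction accordingly.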

4. Processors for Software Implementation

The range of technologies available to implement dedicated systems is broader than ever. Given the fast pace at which technologies and related design tools are changing, the choice of one over the others is not always straightforward. Choosing a technology that can meet the performance, flexibility, reliability, supportability and size requirements within the development and unit cost budgets can be challenging. For instance, specialized high-end biometric systems for public


sector applications are often required in low to moderate numbers, making it hard to recover the initial investment made to develop the system. FR algorithms have been implemented on application-specific integrated circuits (ASICs) [25], field-programmable gate arrays (FPGAs) [26], digital signal processors (DSPs) [27,28], and embedded [29] or specialized video processors [30] to achieve the desired performance (processing rates, accuracy, etc.). Even though general-purpose processors (GPPs) cannot achieve the same performance as dedicated application-specific implementations, they may offer a cost-effective alternative for implementing and maintaining an FR system, along with a significant reduction of the time to market and greater ease of modifying and upgrading the system [31]. Software implementations on GPPs involve only two design steps (coding and simulation), resulting in a short design time and simple design flow. The cost of software tools (compilers and assemblers) is significantly less than that of ASIC and mixed hardware/software tools. In addition, the newer generation of low-power processors intended for mobile and distributed applications blurs the line between embedded processors and GPPs.

The fuzzy ARTMAP neural network algorithm has been successfully implemented on supercomputers [32], cluster computers [33], dedicated analog and digital VLSI circuits [34] as well as on optoelectronic hardware [35]. Reported implementations of fuzzy ARTMAP on commercially available multi-core GPPs are mainly guided by the need to accelerate the training process, with less regard for power efficiency during operation. Among all the commercially available hardware solutions, three were selected in this paper as being representative of different types of hardware currently available on the GPP market.
They are: (1) the Intel Atom N270, a single-core processor implementing the popular IA-32 instruction set using a relatively simple architecture and a moderate clock speed to achieve good power efficiency for mobile applications; (2) the Intel Core i3-530, a dual-core processor using Intel's mainstream architecture, aimed at lower-end workstations and striving for speed more than power efficiency; and (3) the Vocallo MGW processor from Octasic Inc., with 15 parallel processing cores leveraging an innovative embedded asynchronous technology for efficiency. The Vocallo MGW processor was optimized for video transcoding applications. Figure 3 shows the power efficiency (number of instructions per nanojoule) as a function of clock frequency (GHz) for 38 commercially available GPPs. In this figure, the Intel Atom N270, Intel Core i3-530 and Vocallo MGW are presented relative to several other commercially available GPP solutions. The Vocallo MGW processor provides considerably lower energy consumption per


operation than the Intel Atom N270 for roughly the same clock frequency (about 1.5 - 1.6 GHz). Note however that the asynchronous nature of Vocallo makes the ‘clock’ frequency somewhat variable. Although the Intel Core i3530 has a higher clock frequency and a dual processor, its energy consumption is among the highest.

Figure 3: Power efficiency versus clock frequency for selected GPPs. Figure A-I and Table A-I (Appendix A) provide additional details on each of the 38 commercially available processors, which correspond to the points on the graph.

The rest of this section provides additional details on the Intel Core i3-530, the Intel Atom N270 and the Octasic Vocallo MGW, with a focus on memory configurations and communication strategies. Indeed, the storage and communication of ROIs (input patterns), facial models (with fuzzy ARTMAP, the prototype vectors linked to a user class) and intermediate results have a significant impact on the speed and power consumption of software implementations.

4.1 The Intel Core i3-530:

Representing the type of processor commonly found in office workstations and home computers, the Intel Core i3-530 [11] uses the IA-32 instruction set. This instruction set is supported by most operating systems and has several development environments available both commercially and


through open-source licenses. The processor has been commercially available since the first quarter of 2010 and is the entry level of the first generation of the Core processor family for desktop workstations. Based on the 32 nm Nehalem microarchitecture [36], the processor contains an integrated dual-channel DDR3 memory controller, a PCI Express interface, a Direct Media Interface, Hyper-Threading Technology and QuickPath Technology. The dual-core processor runs at a clock speed of 2.93 GHz with a thermal design power (TDP) of 73 W. It has 32 kB of L1 cache per core, 256 kB of L2 cache per core and 4 MB of shared L3 cache [12] (Figure 4).

4.2 The Intel Atom N270:

Intended for mobile applications or computer systems with a small form factor, the Atom N270 [12] is a 32-bit processor from Intel. It is also supported by most of the major operating systems and development environments because it uses the IA-32 instruction set. It offers low-cost, energy-efficient computing that is compatible with most of the software developed for its more powerful relative. It comprises a CPU core, a memory controller and graphics in a single die built using a 45 nm process (Figure 4). This allows for a 1.6 GHz processor with 512 kB of L2 cache and a front-side bus speed of 533 MHz while maintaining a 2.5 W TDP [12].

Figure 4: Memory configuration of the Intel Core i3-530, Intel Atom N270 and Vocallo MGW.


4.3 The Octasic Vocallo MGW:

A low-power multi-core processor, Octasic's Vocallo MGW is marketed as a flexible DSP solution for IP-based audio and video applications [13]. To achieve a high level of performance while maintaining low power consumption, Vocallo’s Opus architecture functions asynchronously, which greatly reduces power consumption by eliminating the need for a global clock [37]. This leads to some variability in the instruction completion rate, but that feature is mostly transparent to users. Each processing core of the Vocallo MGW contains 32-bit registers as well as 16 asynchronous ALUs. The ALUs support a RISC instruction set complemented by specialized functions for signal processing. The Vocallo MGW Evaluation Board (EVB) consists of a Vocallo MGW processor connected to a bank of DDR RAM and a t041 communication interface (TCP/IP) through a DMA controller [13]. As illustrated in Figure 4, the 15 processing cores are arranged in a 5x3 grid where each core is surrounded by 96 kB of L1 cache. Each core is connected to the DMA controller and to the other cores through a 32-bit bus running at 1.5 GHz. Communication between the cores is handled through the DMA controller by specialized functions that allow cores to write directly into the L1 cache of other cores.

5. Experimental Methodology

Proof-of-concept experiments reported here were performed using real-world video data for FR with the fuzzy ARTMAP neural classifier applied to matching ROIs to facial models in a closed-set access control (identification) application. The software implementation consists of C language code compiled and optimized for the different selected processors.

5.1 Video Database:

The dataset used for experiments was collected by the Institute for Information Technology of the Canadian National Research Council (IIT-NRC) [3], and has been used in several other research efforts.
It was captured using a commercially available integrated camera, or webcam, of the kind commonly found in laptop computers and mobile devices. In the experiments, ROIs captured during operation are compared to the face models of individuals enrolled in a closed-set (1-against-N) identification system, as required for secured computer login applications.


The dataset is composed of 22 video sequences (see footnote b) captured from eleven individuals positioned in front of a computer. For each individual, two color MPEG1 video sequences of about 15 seconds are captured at a rate of 20 frames per second with a 160x120 resolution. Of these two video sequences, one is dedicated to training and the other to testing. They are captured using the same setup, under approximately the same illumination conditions and with a similar background. The face of the individual occupies between 1/4 and 1/8 of the total frame area. This dataset contains a variety of challenging operational conditions such as motion blur, out-of-focus frames, variations in facial pose and expression, and low resolution. The number of ROIs detected varies from person to person, ranging from 40 to 190 for each video sequence. Learning is performed with ROIs extracted from the first series of video sequences (1527 ROIs), while testing is done with ROIs extracted from the second series of video sequences (1585 ROIs).

Segmentation (or face detection) for each frame is performed using the Viola-Jones algorithm included in the OpenCV C/C++ computer vision library [39]. It produces ROIs that are converted to gray scale, and then normalized to a common 24x24 pixel facial region, where the eyes are aligned horizontally. This ROI resolution preserves a distance of 12 pixels between the eyes, and is known to be sufficient for real-time FR by humans [3]. The Multi-Block Local Binary Pattern (MBLBP) technique is used with different block sizes to extract additional features that are invariant to illumination. Principal Component Analysis (PCA) is then performed to extract and select a reduced number of features. The M features with the greatest eigenvalues are extracted and vectorized into an input ROI pattern, a = (a1, a2, ..., aM), where each feature ai is converted to a proportional value between 0 and 1 using min-max normalisation.
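The min-max normalisation of the input pattern a can be illustrated with a minimal sketch. The function names are hypothetical, and two choices are assumptions not stated in the text: the per-feature bounds are estimated on training patterns only, and values falling outside those bounds at test time are clipped to [0, 1].

```python
import numpy as np

def fit_min_max(train_patterns):
    """Per-feature minimum and maximum, estimated on training patterns (assumption)."""
    return train_patterns.min(axis=0), train_patterns.max(axis=0)

def apply_min_max(pattern, lo, hi):
    """Map each feature a_i to a proportional value in [0, 1]."""
    # Guard against constant features to avoid a division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    # Clipping out-of-range test values is an assumption of this sketch.
    return np.clip((pattern - lo) / span, 0.0, 1.0)

# Toy projected patterns with M = 2 features (hypothetical values).
train = np.array([[2.0, -1.0], [4.0, 3.0], [3.0, 1.0]])
lo, hi = fit_min_max(train)
a = apply_min_max(np.array([3.5, 5.0]), lo, hi)
```

Here the first feature maps to 0.75, and the second, which exceeds the training maximum, is clipped to 1.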
5.2 Evaluation Protocol:

Prior to computer simulations, the training dataset of each individual is divided into subsets in order to perform 10 independent experiments, or replications. To create those subsets, a 10-fold cross-validation process [16] was used to generate the training, validation and fitness evaluation datasets for each of the ten replications using the first video sequence. During each replication, the PCA mapping is determined with the training subset and then applied to the other (training, validation, fitness evaluation and testing) subsets.

Footnote b: Although the IIT-NRC data set is relatively limited in size, the authors consider that it is adequate to compare implementations according to processing time, power and memory consumption, and accuracy. These measures are mostly evaluated at the transaction (face matching) level.


The MOPSO training strategy was performed using 32 particles and a maximum of 25 iterations for each replication. This produced a set of up to 32 non-dominated solutions. The process was then repeated using a growing number of PCA features, in order from the greatest to the lowest eigenvalue. Each non-dominated solution was then tested on the selected processors, and average performance indicators were measured. For each number of features used, three solutions from the archive were selected to better illustrate the effect of parameters, along with a fourth reference solution that does not use PCA:
- Heavy solution: network with the best classification rate that can fit into the processor’s memory (cache memory for Intel and local memory for Vocallo);
- Medium solution: network with the best classification rate and a size under 225 kilobytes;
- Light solution: network with the best classification rate and a size under 75 kilobytes;
- Full solution: network with the best classification rate using all of the initial input features of the ROI (without PCA).

The effect of the buffer size used for evidence accumulation was measured by averaging the classification rate given by the decision buffer at each possible position in a video sequence. This process was repeated for different buffer sizes and on each video sequence in the test dataset. Results were then averaged to produce an average network performance for each decision buffer size. The communication on the Vocallo MGW processor was characterized using the Medium solution with 16 PCA features.

5.3 Performance Indicators:

To compare the different implementations of the video-based facial recognition system, the following aspects of performance are measured: the classification rate, compression, storage requirements, processing speed and energy consumption. The classification rate is the ratio of correct predictions to the total number of predictions made by the system. It describes the average prediction accuracy of the system.
The amount of resources required during training is measured by compression. It refers to the average number of training patterns per category prototype created in the F2 layer. The storage requirement is the amount of RAM needed to store the facial models learned by a fuzzy ARTMAP network during training. It is a function of M, the number of features of a prototype or input pattern and N, the number of F2 neurons in the trained classifier. The processing speed is the average amount of


time required by the system to process a single ROI. It is measured by dividing the time required to process all the ROIs in the testing dataset by the number of ROIs present in the testing dataset. Energy consumption was estimated by dividing the number of instructions executed during the testing phase by the number of instructions per joule (IPS/W). The number of instructions executed varies with the number of F2 layer neurons, N, in the selected fuzzy ARTMAP network and the number of features, M, in the input patterns.

6. Simulation Results

6.1 Communication Strategies with Vocallo MGW:

To characterize the inter-core communication on the Vocallo MGW processor, the Master/Slave, Pipeline and Independent communication strategies were compared. Communications were characterized using the Medium solution with 16 PCA features. Figure 5 shows the total processing time required by each communication method to process the entire testing dataset. The total processing time is broken down into the time actually spent processing data, communicating with external RAM, communicating with other cores, and waiting to communicate. As shown in Figure 5, the time spent computing the predictions is longer than the time spent communicating the intermediate results. As a consequence, the different communication strategies have little impact on the total processing time. This figure also shows that the time spent waiting to communicate results is slightly longer with the Master/Slave technique than with the Pipeline technique. This may be explained by the fact that, in the former, all slave cores must communicate sequentially with a single master core, whereas in the latter, each core can communicate with a different one. However, due to the single shared bus of the Vocallo processor, the Pipeline technique is also reduced to sequential communications. This bus also introduces delays when the number of ROIs in a sequence is smaller than the number of processors in the pipeline.

The time spent communicating data and waiting for communications increases when F2 neurons are distributed across all the processing cores instead of saturating a minimum number of cores. This is caused by the increased amount of data communicated to reach the final prediction. Since the processing time is closely tied to the number of F2 neurons in the subnet, a decrease of the total processing time is observed when the neurons are distributed across all the cores, because this reduces the number of neurons per subnet.
Although there is a speed-up when F2 neurons are distributed across all processing cores, it is lower


than the increase in processing power that is offered by using several independent instances of the same network and then distributing the ROIs amongst them.

Figure 5: Processing and communication time for the Vocallo MGW processor with the Master / Slave, Pipeline and Independent communication strategies.
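The idea of distributing F2 neurons across cores can be illustrated with a minimal sketch of the winner search. It assumes the standard fuzzy ART choice function, T_j = |A ∧ w_j| / (α + |w_j|), with complement-coded inputs [7]; vigilance testing and the map field are ignored, and the function names, the value of α and the random prototypes are hypothetical. Each "core" returns only its local best (value, index) pair, and the master keeps the overall best, which gives the same winner as a single undivided network.

```python
import numpy as np

def complement_code(a):
    """Complement coding A = (a, 1 - a), doubling the pattern length to 2M."""
    return np.concatenate([a, 1.0 - a])

def choice_values(A, W, alpha=0.001):
    """T_j = |A ^ w_j| / (alpha + |w_j|); ^ is the element-wise minimum, |.| the L1 norm."""
    return np.minimum(A, W).sum(axis=1) / (alpha + W.sum(axis=1))

def best_in_subnet(A, W_sub, offset):
    """Work done by one core: best local neuron, reported with its global index."""
    T = choice_values(A, W_sub)
    j = int(np.argmax(T))
    return float(T[j]), offset + j

def master_slave_winner(A, W, n_cores):
    """Split the N prototypes into contiguous subnets and reduce the local winners."""
    parts = np.array_split(np.arange(W.shape[0]), n_cores)
    candidates = [best_in_subnet(A, W[idx], int(idx[0])) for idx in parts if idx.size > 0]
    # Highest choice value wins; ties go to the lowest index, as with a plain argmax.
    return max(candidates, key=lambda c: (c[0], -c[1]))[1]

rng = np.random.default_rng(0)
W = rng.random((40, 32))            # N = 40 prototypes, 2M = 32 weights each
A = complement_code(rng.random(16)) # M = 16 features
winner = master_slave_winner(A, W, n_cores=15)
```

Only one (value, index) pair per core has to be communicated in this scheme, which is consistent with the small communication share reported in Figure 5, although the actual data exchanged by the authors' implementation is not detailed here.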

The low communication cost between the processing cores on the Vocallo MGW allows dividing a network that would not normally fit in the local cache memory of a single core into several smaller subnets that do. This allows increasing the classification accuracy of the system without a significant impact on the processing speed. Due to the negligible effect of communication strategies on the processing time, only results obtained with the Master/Slave strategy are shown in the rest of this section.

6.2 Feature Extraction and Selection:

The choice of the number of features, M, selected through PCA has a direct effect on the system's performance. If too few are selected, the classifier has insufficient information to discriminate between individuals; if too many are selected, redundant data adds little discrimination power, yet increases the storage requirements, processing time and energy consumption of the system. A comprehensive approach for exploring the impact of the number of selected features is to train fuzzy ARTMAP networks with the MOPSO training strategy over the range of possible features, yet this strategy is too costly. Figure 6 shows that it is possible to estimate the classification rate of the best network obtained from the MOPSO solution archive by using a k-NN classifier with the same number of features in the input patterns. Although there is some divergence between the


results, especially when a higher number of features is used, the classification accuracy of both fuzzy ARTMAP and k-NN classifiers follows the same trend.

Figure 6: Classification rate and number of F2 neurons in the Vocallo MGW versus the number of features.
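Using k-NN as an inexpensive proxy for the feature-count sweep can be sketched as follows. This is only an illustration on synthetic data: the value k = 1, the Euclidean distance, the function names and the toy patterns are all assumptions, and the columns are taken to be already ordered from the greatest to the lowest eigenvalue.

```python
import numpy as np

def knn_accuracy(train_x, train_y, test_x, test_y, k=1):
    """Classification rate of a k-NN classifier (majority vote, Euclidean distance)."""
    # Squared distances between every test pattern and every training pattern.
    d = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_y[nearest]
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float((pred == test_y).mean())

def sweep_feature_counts(train_x, train_y, test_x, test_y, counts, k=1):
    """Classification rate obtained when only the first m features are kept."""
    return {m: knn_accuracy(train_x[:, :m], train_y, test_x[:, :m], test_y, k)
            for m in counts}

# Synthetic stand-in for PCA-projected ROIs: 3 classes, 4 features,
# with the first feature carrying the class information.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
feats = rng.normal(0.0, 0.1, size=(60, 4))
feats[:, 0] += 10.0 * labels
train_x, test_x = feats[::2], feats[1::2]
train_y, test_y = labels[::2], labels[1::2]
rates = sweep_feature_counts(train_x, train_y, test_x, test_y, counts=[1, 2, 4])
```

Such a sweep needs no training phase, so it can be run for every candidate M before committing to the much costlier MOPSO runs for a few selected values.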

Figure 6 also shows the maximum number of neurons that can fit in the total cache memory of the Vocallo MGW processor. Because processing time, memory usage and energy consumption grow exponentially as the number of features increases, it is preferable to select the lowest possible number of features before there is a marked decline in the classification accuracy. In our case this point is situated around 16 components. However, networks using 32 and 64 components were also evaluated to better illustrate the effect of network size on the processing time and classification accuracy.

6.3 MOPSO Training Strategy:

Figure 7 shows the classification rate and compression of fuzzy ARTMAP networks trained with the MOPSO and PSO training strategies and with standard hyper-parameter values. Using the standard parameter values provides fuzzy ARTMAP networks with the lowest classification rate. The PSO training strategy, which optimizes only the classification accuracy, has a visible negative effect on compression. The MOPSO training strategy yields solutions that


provide the same improvement in classification rate but with a greatly improved compression.

Figure 7: Classification rate and compression of fuzzy ARTMAP networks trained with different training strategies on IIT-NRC data.

Figure 8 shows the solutions produced by each MOPSO optimization run using a different number of components in the ROI patterns. The classification rate of the trained classifier is presented with respect to its compression value. The compression values are shown on a logarithmic scale so that the non-dominated solutions appear smoothly distributed along the Pareto front. It is worth noticing in Figure 8 the effect of dimensionality reduction on the classification rate of the solutions. There is a marked decrease in the classification rate going from solutions obtained with the original input patterns to those obtained using PCA with any number of components. Given the variance in classification rates among the solutions with high compression, this decrease in classification accuracy becomes less obvious as the compression value increases. Since the compression value is tied to the number of neurons in the network and ignores the actual size of individual neurons, it is difficult to appreciate the difference in network size between the solutions using different numbers of features. By transforming the compression value into the actual network size (Figure 9), the non-dominated solutions using different numbers of components in the


input patterns can be compared on an equal footing. This allows easy selection of the solution in the archive that is the closest to the desired operating characteristics.

Figure 8: Pareto fronts of MOPSO optimization runs with growing number of PCA features.
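The notion of non-dominated solutions used to build these fronts can be sketched with a small filter. Both objectives (classification rate and compression) are maximized here; the function name and the sample values are hypothetical and are not taken from the MOPSO archive.

```python
def non_dominated(solutions):
    """Keep the (rate, compression) pairs that no other pair dominates.

    A pair dominates another when it is at least as good on both objectives
    and strictly better on at least one of them.
    """
    front = []
    for i, (rate_i, comp_i) in enumerate(solutions):
        dominated = any(
            rate_j >= rate_i and comp_j >= comp_i
            and (rate_j > rate_i or comp_j > comp_i)
            for j, (rate_j, comp_j) in enumerate(solutions)
            if j != i
        )
        if not dominated:
            front.append((rate_i, comp_i))
    return front

# Hypothetical (classification rate, compression) pairs.
candidates = [(0.80, 2.0), (0.75, 10.0), (0.70, 8.0), (0.60, 50.0), (0.78, 2.0)]
front = non_dominated(candidates)
```

In this toy example, (0.70, 8.0) is dominated by (0.75, 10.0) and (0.78, 2.0) by (0.80, 2.0), so three solutions remain on the front. Selecting an operating point then amounts to picking the front member closest to the desired trade-off.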

Tables 1 and 2 display the performance of selected solutions appearing on the Pareto fronts of Figures 7 and 8. These were chosen to illustrate the effect of network size and F2 neuron size on performance. Because the classification rate and compression are set during the training phase, the choice of hardware has a marginal effect on the overall results. The only exception is the “heavy” solutions on the Vocallo MGW, which are smaller than the ones used for the Atom N270 and Core i3-530 because of memory limitations. Comparing the solutions obtained using the entire ROI as input patterns to the heavy solutions using 64 components shows that using PCA as a dimensionality reduction tool allowed for a 7.8-fold reduction in network weight (from 2693.8 kB to 344.7 kB) while only decreasing the classification accuracy by about 6 percentage points. For both the classification rate and network size, the number of neurons in the network is much more influential than the size of individual neurons. It is important to note that the network weight increases exponentially when the classification rate increases linearly. This suggests the


possibility of a marked improvement in network weight with a relatively small decrease in the classification accuracy.

Table 1: Average classification rate (%). Standard deviation is shown between parentheses.

Solution         Vocallo MGW     Atom N270       Core i3
PCA 16 Light     66.9 (0.031)    66.9 (0.031)    66.9 (0.031)
PCA 16 Medium    69.4 (0.027)    69.4 (0.027)    69.4 (0.027)
PCA 16 Heavy     74.7 (0.017)    74.8 (0.017)    74.8 (0.017)
PCA 32 Light     62.1 (0.059)    62.1 (0.059)    62.1 (0.059)
PCA 32 Medium    65.0 (0.046)    65.0 (0.046)    65.0 (0.046)
PCA 32 Heavy     75.0 (0.019)    76.6 (0.019)    76.6 (0.019)
PCA 64 Light     65.0 (0.070)    65.0 (0.070)    65.0 (0.070)
PCA 64 Medium    63.6 (0.070)    63.6 (0.070)    63.6 (0.070)
PCA 64 Heavy     74.2 (0.042)    78.3 (0.024)    78.3 (0.024)
Full CNRC        N/A             84.4 (0.026)    84.4 (0.026)

Table 2: Average memory consumption in kilobytes (kB).

Solution         Vocallo MGW     Atom N270        Core i3
PCA 16 Light     2.962 (2.0)     2.962 (2.0)      2.962 (2.0)
PCA 16 Medium    16.93 (4.6)     16.93 (4.6)      16.93 (4.6)
PCA 16 Heavy     66.91 (26.9)    88.37 (44.1)     88.37 (44.1)
PCA 32 Light     3.635 (1.5)     3.635 (1.5)      3.635 (1.5)
PCA 32 Medium    14.67 (5.6)     14.67 (5.6)      14.67 (5.6)
PCA 32 Heavy     72.96 (33.0)    136.6 (65.4)     136.6 (65.4)
PCA 64 Light     5.688 (0.04)    5.688 (0.04)     5.688 (0.04)
PCA 64 Medium    9.953 (6.0)     9.953 (6.0)      9.953 (6.0)
PCA 64 Heavy     66.12 (22.3)    344.7 (167.8)    344.7 (167.8)
Full CNRC        N/A             2693.8 (1568)    2693.8 (1568)
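The size reduction obtained with PCA can be rechecked directly from the Atom N270 / Core i3 columns of Tables 1 and 2. The short sketch below uses only values from those tables; the variable names are illustrative.

```python
# Mean values from Table 2 (kB) and Table 1 (%), Atom N270 / Core i3 columns.
full_kb, heavy64_kb = 2693.8, 344.7      # Full CNRC vs. PCA 64 Heavy
full_rate, heavy64_rate = 84.4, 78.3

size_ratio = full_kb / heavy64_kb                       # how many times smaller
percent_decrease = 100.0 * (1.0 - heavy64_kb / full_kb) # relative size decrease
rate_drop = full_rate - heavy64_rate                    # in percentage points
```

The ratio between the two network sizes is about 7.8, which corresponds to a decrease in size of roughly 87%, for a loss of about 6 percentage points in classification rate.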

6.4 Processing Time and Power Consumption:

Tables 3 and 4 respectively show the average processing time for a single ROI and the average energy consumption estimate for each processor. Contrary to the accuracy and compression, the results in these tables are closely tied to the hardware on which the network is running. A factor influencing both the energy consumption and the processing time is the network size. The fact that both grow linearly with network size justifies the use of the MOPSO training strategy to find optimal trade-offs between classification rate and compression. In turn, this allows for the best trade-off between the classification rate and both the processing time and energy consumption. There is little difference between the execution time on the Vocallo MGW and the Atom N270, but there is close to a fivefold reduction in processing time when using the Core i3-530 processor. However, even using a heavy solution, the Vocallo MGW will use over 9 times less energy than the Core i3-530, and even 200 times less when using a light solution. By choosing a suitable hardware solution


and minimizing the size of the network, energy consumption can be reduced by several orders of magnitude. Note that the processors considered in this paper are able to process ROIs faster than the acquisition rate of a standard camera, at e.g., 30 fps. In fact the processing time for a heavy solution on the Vocallo MGW and Atom N270 can be up to 60 times faster and the light solution close to 2000 times faster. These values are even higher when using the Core i3-530 where the processing time can be over 6000 times faster than the camera acquisition speed when using a light solution.

Table 3: Average processing time for a ROI in microseconds (μs).

Solution         Vocallo MGW      Atom N270        Core i3-530
PCA 16 Light     16.9 (11.2)      19.7 (13.1)      3.943 (2.38)
PCA 16 Medium    102.4 (27.9)     110.9 (30.4)     18.71 (4.41)
PCA 16 Heavy     554.2 (192.7)    574.4 (284.0)    94.53 (55.8)
PCA 32 Light     21.1 (7.6)       23.1 (7.6)       4.432 (1.52)
PCA 32 Medium    84.4 (46.1)      96.7 (36.5)      17.00 (5.07)
PCA 32 Heavy     567.4 (192.7)    906.9 (405.7)    157.5 (71.0)
PCA 64 Light     33.3 (1.21)      39.4 (2.39)      7.886 (1.93)
PCA 64 Medium    62.2 (39.0)      66.1 (37.0)      12.00 (5.58)
PCA 64 Heavy     656.2 (210.1)    2265 (1065)      388.5 (187.6)
Full CNRC        N/A              17280 (9635)     2656 (1521)

Table 4: Average energy consumption given in joules (J).

Solution         Vocallo MGW        Atom N270          Core i3-530
PCA 16 Light     0.0052 (0.0036)    0.1217 (0.0850)    1.1232 (0.7849)
PCA 16 Medium    0.0294 (0.0082)    0.6886 (0.1922)    6.3578 (1.7751)
PCA 16 Heavy     0.1159 (0.0465)    3.5873 (1.7861)    33.122 (16.491)
PCA 32 Light     0.0065 (0.0025)    0.1526 (0.0580)    1.4086 (0.5355)
PCA 32 Medium    0.0259 (0.0099)    0.6071 (0.2331)    5.6052 (2.1525)
PCA 32 Heavy     0.1283 (0.0576)    5.6312 (2.5590)    51.994 (23.628)
PCA 64 Light     0.0103 (0.0006)    0.2416 (0.0144)    2.2309 (0.1330)
PCA 64 Medium    0.0179 (0.0107)    0.4187 (0.2506)    3.8659 (2.3138)
PCA 64 Heavy     0.1173 (0.0394)    14.315 (6.9473)    132.17 (64.146)
Full CNRC        N/A                112.67 (65.601)    1040.3 (605.71)
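The real-time margins and energy ratios discussed in this subsection can be rechecked from the PCA 16 rows of Tables 3 and 4. The sketch below is illustrative: the helper names are hypothetical, and `estimate_energy` simply restates the estimate of Section 5.3 (instructions executed divided by instructions per joule), for which no instruction counts are given in the text.

```python
FPS = 30.0
FRAME_PERIOD_US = 1e6 / FPS  # time available per frame at 30 fps, in microseconds

# Mean processing times from Table 3 (microseconds per ROI), PCA 16 solutions.
time_us = {
    "Vocallo MGW": {"light": 16.9, "heavy": 554.2},
    "Atom N270": {"light": 19.7, "heavy": 574.4},
    "Core i3-530": {"light": 3.943, "heavy": 94.53},
}
# Mean energy estimates from Table 4 (joules), PCA 16 solutions.
energy_j = {
    "Vocallo MGW": {"light": 0.0052, "heavy": 0.1159},
    "Core i3-530": {"light": 1.1232, "heavy": 33.122},
}

def realtime_margin(t_us):
    """How many ROIs could be processed within one frame period."""
    return FRAME_PERIOD_US / t_us

def estimate_energy(n_instructions, instructions_per_joule):
    """Energy estimate of Section 5.3: instructions divided by instructions per joule."""
    return n_instructions / instructions_per_joule

margins = {(cpu, size): realtime_margin(t)
           for cpu, sizes in time_us.items() for size, t in sizes.items()}
light_ratio = energy_j["Core i3-530"]["light"] / energy_j["Vocallo MGW"]["light"]
heavy_vs_light_ratio = energy_j["Core i3-530"]["light"] / energy_j["Vocallo MGW"]["heavy"]
```

With these values, the heavy solution on the Vocallo MGW leaves a margin of about 60 and the light one close to 2000, while the light solution on the Core i3-530 exceeds 6000. The light-to-light energy ratio is a little over 200, and a heavy solution on the Vocallo MGW still uses over 9 times less energy than a light solution on the Core i3-530, which is one reading consistent with the figures quoted above.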

6.5 Temporal Accumulation of Responses:

Figure 9 shows the effect of the decision buffer size on the average classification rate of a video sequence. As shown, the average classification rate increases with the size of the decision buffer, reaching a 100% classification rate. Test sequence #2 is an exception and was excluded from the average system performance. Compared to other classes, very little training and testing data is available for test subject #2. After further investigation, it was found that test subject #2 is the


only individual not of Caucasian ancestry. This highlights an intrinsic problem with the face detection. Indeed, early implementations of the Viola-Jones face detector [39] were originally trained using photographs collected during a web crawl. Because of the preponderance of occidental culture on the internet at the time, the trained face detector inherited a bias towards Caucasian faces. The improvement in classification rate provided by accumulating ROI predictions of each individual over time comes at the expense of added system latency. Even with an optimistic 100% face detection rate and a standard capture rate of 30 fps, the target individual must appear in the camera’s field of view for several seconds. Processors like the Vocallo MGW and Core i3-530 can process ROIs at a rate that surpasses the camera’s frame rate. Therefore, the use of a faster camera, smaller buffers, and optimized face detection implementations can curtail long delays.
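The accumulation of predictions in a decision buffer, and the latency it adds, can be sketched as follows. The fusion rule used here (majority vote over the last B predictions) is an assumption, since the exact rule is not spelled out in this section; the function names and the toy prediction sequence are hypothetical.

```python
from collections import Counter, deque

def buffered_predictions(frame_predictions, buffer_size):
    """Slide a decision buffer over per-ROI predictions; one decision per full buffer."""
    buf = deque(maxlen=buffer_size)
    decisions = []
    for p in frame_predictions:
        buf.append(p)
        if len(buf) == buffer_size:
            # Majority vote over the buffer content (assumed fusion rule).
            decisions.append(Counter(buf).most_common(1)[0][0])
    return decisions

def buffered_rate(frame_predictions, true_label, buffer_size):
    """Classification rate averaged over every possible buffer position."""
    decisions = buffered_predictions(frame_predictions, buffer_size)
    if not decisions:
        raise ValueError("sequence is shorter than the decision buffer")
    return sum(d == true_label for d in decisions) / len(decisions)

def buffer_latency_s(buffer_size, fps=30.0, detection_rate=1.0):
    """Time needed to fill the buffer, given the capture rate and face detection rate."""
    return buffer_size / (fps * detection_rate)

# Toy per-ROI predictions for an individual whose true label is 3.
preds = [3, 3, 5, 3, 3, 7, 3, 3]
rate_b1 = buffered_rate(preds, true_label=3, buffer_size=1)
rate_b3 = buffered_rate(preds, true_label=3, buffer_size=3)
```

In this toy sequence, isolated errors are voted out once the buffer holds three predictions, at the cost of a delay that grows linearly with the buffer size and shrinks with a faster camera or a better detection rate.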

Figure 9: Average classification rate of a typical network for a varying decision buffer size.

7. Conclusions

In this paper, the fuzzy ARTMAP neural network classifier has been successfully optimized and mapped to three commercially available processors (the Vocallo MGW, Atom N270 and Core i3-530) for efficient


video face matching in access control (closed-set identification) applications. A MOPSO training strategy has been employed to jointly optimize all fuzzy ARTMAP parameters such that both classification error rate and resource requirements are optimized, leading to fast and energy-efficient software implementations. In addition, by tracking facial regions over video streams, multiple fuzzy ARTMAP predictions have been accumulated over several successive ROIs of an individual, for improved spatio-temporal FR. Our results with real-world video data indicate that by implementing fuzzy ARTMAP classifiers that have been optimized using the MOPSO training strategy, and by using low-power parallel processors, it is possible to significantly reduce power consumption while sustaining a high data rate and accuracy. Furthermore, the multi-core architecture of processors such as the Vocallo MGW allows using the extra processing power to identify a greater number of individuals appearing in video streams, or to handle a faster acquisition rate, allowing for more reliable predictions accumulated over multiple ROIs. This would increase the system accuracy to the desired performance level without delaying the final prediction beyond the acquisition rate. While additional design efforts are required to map the fuzzy ARTMAP algorithm to low-power multi-core systems, the advantages outweigh the difficulties that may be encountered.

References

1. Boyer, K.W., V. Govindaraju, and N.K. Ratha, Introduction to the special issue on recent advances in biometric systems. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 2007. 37(5): p. 1091-1095.
2. Jain, A.K., A. Ross, and S. Pankanti, Biometrics: A tool for information security. Information Forensics and Security, IEEE Transactions on, 2006. 1(2): p. 125-143.
3. Gorodnichy, D.O., Editorial: Seeing faces in video by computers. Special issue on face processing in video sequences. Image Vision Comput., 2006. 24(6): p. 551-556.
4. Matta, F. and J.-L. Dugelay, Person recognition using facial video information: A state of the art. J. Vis. Lang. Comput., 2009. 20(3): p. 180-187.
5. Zhou, S.K., R. Chellappa, and W. Zhao, Unconstrained FR, Springer, 2006.
6. Rattani, A., et al., Adaptive Biometric System based on Template Update Procedures, 2010, PhD Thesis, Dept. of Electrical and Electronic Engineering, University of Cagliari.
7. Carpenter, G.A., S. Grossberg, and N. Markuzon, Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Trans. Neural Networks, 1992. 3(5): p. 698-713.
8. Granger, E., et al., Supervised learning of fuzzy ARTMAP neural networks through particle swarm optimisation. Journal of Pattern Recognition Research, 2007. 2(1): p. 27-60.
9. Lerner, B. and H. Guterman, Advanced developments and applications of the fuzzy ARTMAP neural network in pattern classification, in Computational Intelligence Paradigms, L. Jain, et al., Editors. 2008, Springer Berlin Heidelberg. p. 77-107.
10. Granger, E., D. Prieur, and J.-F. Connolly. Evolving ARTMAP neural networks using multiobjective particle swarm optimization, IEEE Congress on Evolutionary Computation 2010.


11. Intel. Intel Core i3-530 Processor Specification. November 2010. Available from: http://ark.intel.com/products/46472/Intel-Core-i3-530-Processor-(4M-Cache-2_93-GHz).
12. Intel. Intel Atom N270 Processor Specification. November 2010. Available from: http://www.intel.com/content/dam/doc/datasheet/mobile-atom-n270-single-core-datasheet-.pdf.
13. Octasic. Vocallo MGW: The Expendable Media Gateway Solution. December 2008. Available from: http://www.octasic.com/documents/en/products/vocallo/octvocpb2004.pdf.
14. Zhao, W., et al., FR: A literature survey. ACM Computer Survey, 2003. 35(4): p. 399-458.
15. Parzen, E., On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 1962. 33(3): p. 1065-1076.
16. Duda, R.O., P.E. Hart, and D.G. Stork, Pattern classification. 2nd ed.: Wiley, 2001.
17. Reynolds, D.A., T.F. Quatieri, and R.B. Dunn, Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, 2000. 10(1–3): p. 19-41.
18. Cortes, C. and V. Vapnik, Support-Vector Networks. Mach. Learn., 1995. 20(3): p. 273-297.
19. Rosenblatt, F., Principles of Neurodynamics: Perceptron and the Theory of Brain Mechanisms: Spartan Books, 1962.
20. Zhou, A., et al., Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm and Evolutionary Computation, 2011. 1(1): p. 32-49.
21. Coello, C.A.C., G.T. Pulido, and M.S. Lechuga, Handling multiple objectives with particle swarm optimization. Evolutionary Computation, IEEE Transactions on, 2004. 8(3): p. 256-279.
22. Carpenter, G.A. and M.N. Gjaja, Fuzzy ART choice functions, in Technical report CAS/CNS; 1993, Boston University, Center for Adaptive Systems and Dept. of Cognitive and Neural Systems: Boston, MA. p. 14.
23. Carpenter, G.A., B.L. Milenova, and B.W. Noeske, Distributed ARTMAP: a neural network for fast distributed supervised learning. IEEE Trans on Neural Networks, 1998. 11(5): p. 793-813.
24. Carpenter, G.A. and W.D. Ross, ART-EMAP: A neural network architecture for object recognition by evidence accumulation. Neural Networks, 1995. 6(4): p. 805-18.
25. Nagel, J.-L., et al. A Low-Power VLSI Architecture for Face Verification Using Elastic Graph Matching. in 11th European Signal Processing Conference. 2002. Toulouse, France.
26. Borgatti, M., et al., A reconfigurable system featuring dynamically extensible embedded microprocessor, FPGA, and customizable I/O. IEEE Journal of Solid-State Circuits, 2003. 38(3): p. 521-529.
27. Batur, A.U., B.E. Flinchbaugh, and M.H. Hayes, III. A DSP-based approach for the implementation of FR algorithms. in Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on. 2003.
28. Mao, W. and A. Bigdeli. Implementation of a real-time automated FR system for portable devices. in Communications and Information Technology, 2004. ISCIT 2004. IEEE International Symposium on. 2004.
29. Kondo, H., et al., Implementation of FR Processing using an Embedded Processor. J. Robotics & Mechatronics, 2005. 17(4).
30. Kleihorst, R., et al. Camera mote with a high-performance parallel processor for real-time frame-based video processing. in Advanced Video and Signal Based Surveillance, 2007. AVSS 2007. IEEE Conference on. 2007.
31. Reyneri, L.M., Implementation issues of neuro-fuzzy hardware: going toward HW/SW codesign. Neural Networks, IEEE Transactions on, 2003. 14(1): p. 176-194.
32. Malkani, A. and C.A. Vassiliadis, Parallel implementation of the fuzzy ARTMAP neural network paradigm on a hypercube. Expert Systems, 1995. 12(1): p. 39-53.
33. Castro, J., et al., Pipelining of Fuzzy ARTMAP without matchtracking: Correctness, performance bound, and Beowulf evaluation. Neural Networks, 2007. 20(1): p. 109-128.
34. Lubkin, J. and G. Cauwenberghs, VLSI implementation of fuzzy adaptive resonance and learning vector quantization, in Proc. of the 7th International Conf. on Microelectronics for Neural, Fuzzy and Bio-Inspired System, IEEE Computer Society. p. 147, 1999.

35. Blume, M. and S.C. Esener, An Efficient Mapping of Fuzzy ART onto a Neural Architecture. Neural Networks, 1997. 10(3): p. 409-411.
36. Intel Inc. First the Tick, Now the Tock: Next Generation Intel® Microarchitecture (Nehalem). Available from: http://www.intel.com/technology/architecture-silicon/next-gen/whitepaper.pdf.
37. Octasic Semiconductor, Asynchronous Processor Design Evolution, Octasic White Paper, 2010.
38. Castro, J., et al., Parallelization of Fuzzy ARTMAP to Improve its Convergence Speed: The Network Partitioning Approach and the Data Partitioning Approach. Nonlinear Analysis, 2004. 63: p. 877-889.
39. Viola, P. and M.J. Jones, Robust Real-Time Face Detection. Int'l Journal Computer Vision, 2004. 57(2): p. 137-154.
40. Connolly, J.F., E. Granger, and R. Sabourin, Dynamic multi-objective evolution of classifier ensembles for video face recognition. Applied Soft Computing, 2012. 13(6): p. 3149-66.

Appendix A

Table A-I: Some technical specifications for commercially-available processors shown in Figure 2.

Processor  Manufacturer  Series        Model     Clock Frequency (GHz)  Power (W)  Number of Cores
1          AMD           Athlon        LE-1640   2.7                    45         1
2          AMD           Athlon II x2  250       3.0                    65         2
3          AMD           Athlon II x3  710       2.6                    95         3
4          AMD           Athlon II x4  640       3.0                    95         4
5          AMD           Athlon x2     7850      2.8                    95         2
6          AMD           Athlon x2     5600      2.9                    65         2
7          AMD           Athlon x2     4850e     2.5                    45         2
8          AMD           Phenom x3     8850      2.5                    95         3
9          AMD           Phenom x3     8450e     2.1                    65         3
10         AMD           Phenom x4     9650      2.3                    95         4
11         AMD           Phenom x4     9850      2.5                    125        4
12         AMD           Phenom x4     9150e     1.8                    65         4
13         AMD           Sempron       150       2.9                    45         1
14         AMD           Sempron       140       2.7                    45         1
15         ARM           Cortex        A8        1.0                    0.3        1
16         ARM           Cortex        A9        1.0                    0.45       1
17         Intel         Atom          N270      1.6                    2.5        1
18         Intel         Atom          Z500      0.8                    0.65       1
19         Intel         Atom          N470      1.83                   6.5        1
20         Intel         Core 2        E8500     3.16                   65         2
21         Intel         Core 2        Q9550     2.83                   95         4
22         Intel         i3            i3-350M   2.26                   35         2
23         Intel         i3            i3-530    2.93                   73         2
24         Intel         i5            i5-520M   2.4                    35         2
25         Intel         i5            i5-560UM  1.33                   18         2
26         Intel         i5            i5-750    2.66                   95         2
27         Intel         i7            i7-720QM  1.6                    45         4
28         Intel         i7            i7-660UM  1.33                   18         2
29         Intel         i7            i7-970    3.2                    130        6
30         Intel         i7            i7-930    2.8                    130        4
31         Intel         Itanium       9320      1.33                   155        4
32         Intel         Xeon          L5630     2.13                   40         4
33         Intel         Xeon          X5670     3.33                   95         6
34         NVIDIA        Tesla         C2050     1.15                   238        448
35         NVIDIA        Tesla         C1060     1.30                   187.8      240
36         NVIDIA        Tesla         C870      1.35                   170.9      128
37         Octasic       Vocallo       MGW       1.5                    1.5        15
38         Xscale        StrongARM     -         0.8                    0.5        1

Figure A-I: Power Efficiency versus Clock Frequency for GPPs listed in Table A-I.
