IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 8, NO. 1, FEBRUARY 2012
Design and Implementation of a Pipelined Datapath for High-Speed Face Detection Using FPGA

Seunghun Jin, Dongkyun Kim, Thuy Tuong Nguyen, Daijin Kim, Munsang Kim, and Jae Wook Jeon, Member, IEEE
Abstract—This paper presents the design and implementation of a pipelined datapath for real-time face detection using cascades of boosted classifiers. We propose the following methods to achieve the desired processing speed and area efficiency: symmetric image downscaling, classifier sharing, and cascade merging. First, an image pyramid with 16 levels is generated from the input image to simultaneously detect faces at different scales. The downscaled images are then transferred to the first stage of the cascade, which is shared between the corresponding image pairs based on the pixel validity of the symmetric image pyramid. The last method exploits the different hit ratios of the cascade stages: we use a tree-structured cascade of classifiers, since most of the nonface elements are eliminated during the early stages of the classifier. Synthesis results confirm that the proposed design reduces resource utilization to approximately one-eighth of that of a fully parallelized implementation of the same algorithm, without accuracy loss. We implemented the proposed hardware architecture on a Xilinx Virtex-5 LX330 FPGA. The indicative throughput is 307 frames/s for standard VGA (640 × 480) images at an operating frequency of 125.59 MHz, irrespective of the number of faces in the scene. Using this fully pipelined datapath for tree-structured cascade classifiers, face detection results are generated at each clock cycle after the initial pipeline delay.

Index Terms—Computer vision, face detection, field-programmable gate arrays (FPGAs), integrated circuit design.
I. INTRODUCTION

Face detection is the task of searching for faces within a source image in which face locations and sizes can vary. Face detection plays an important role in a wide range of computer vision applications, such as intelligent robotic vision, content-based image retrieval, video surveillance, and human–computer interaction.
Manuscript received March 26, 2011; revised September 10, 2011; accepted October 08, 2011. Date of publication October 28, 2011; date of current version January 20, 2012. This work was supported by the Ministry of Knowledge Economy and the Korea Institute for Advancement in Technology through the Workforce Development Program in Strategic Technology, and by the Priority Research Centers Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (20110018397). Paper no. TII-11-128.
S. Jin was with Sungkyunkwan University, Suwon 440-746, Korea. He is now with the Samsung Advanced Institute of Technology, Yongin 446-712, Korea (e-mail: [email protected]).
D. Kim, T. T. Nguyen, and J. W. Jeon are with the School of Information and Communication Engineering, Sungkyunkwan University, Suwon 440-746, Korea (e-mail: [email protected]; [email protected]; [email protected]).
D. Kim is with the Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea (e-mail: [email protected]).
M. Kim is with the Center for Intelligent Robotics, Korea Institute of Science and Technology, Seoul 136-791, Korea (e-mail: [email protected]).
Digital Object Identifier 10.1109/TII.2011.2173943
Many different approaches have been developed in recent decades to achieve fast and reliable face detection [1]. Among them, statistical face detection methods based on AdaBoost and Haar-like features are widely used because they are robust and computationally efficient. A Haar-like feature has one or more scalar values that represent the average intensity difference between two rectangular regions. It can represent the intensity gradient at different locations and in different directions by adjusting the positions, sizes, or shapes of the rectangular regions. A Haar-like feature is a weak classifier because it only considers a few adjacent pixels; therefore, many Haar-like features are required to describe an object with satisfactory accuracy. Hence, a strong classifier is built as a linear combination of weak classifiers, and the strong classifiers are organized to form classifier cascades.

As demonstrated in earlier studies, software implementations of the AdaBoost face detector achieve near real-time operation and high detection rates [2], [3]. However, there are many Haar-like features in a single face image, so in every round, AdaBoost needs to search a large pool of candidate weak classifiers [4]. Moreover, to achieve size-invariant face detection, repeated downscaling and searching of the entire image are required until the downscaled image becomes smaller than the face-training sample size [5]. Since general-purpose processors are too slow to handle such extremely large datasets [6], [7], almost all of the computational power of a high-performance processor is required to achieve the desired detection speed. The constraints in embedded environments are even harder to meet due to performance, size, and power consumption limitations.

Several studies regarding the use of hardware parallelism for real-time face detection have recently been conducted. Due to the architectural flexibility of logical design and considerable parallelism extraction [8], [9], field-programmable gate array (FPGA)-based custom designs are the most obvious solutions for real-time face detection applications. Lai et al. [10] designed a parallel hardware architecture for AdaBoost-based face detection that can process video graphics array (VGA) images at 143 frames per second (fps) using a single scan window running at 126 MHz. However, the number of Haar-like features tested is too small for real-world applications, and the reported performance has only been theoretically verified, based on synthesis results. Irick et al. [11] proposed a neural network-based streaming architecture for face detection and gender classification that can process quarter VGA (QVGA, 320 × 240) images in real time. The proposed architecture was implemented on the Xilinx Virtex-4 FX12 FPGA for verification. Even though the experimental results show 52 and 175 fps in the worst case
and typical case, respectively, a pixel offset of 10 is unrealistically high. Ngo et al. [12] developed a cost-effective face detection system based on a low-cost FPGA prototyping board from Altera (DE2 board). They proposed an area-efficient modular architecture for the Viola–Jones face detector in QVGA video streams with a minimum processing rate of 30 fps. However, the reported results are based on simulations that were performed only for the first stage of a classifier cascade with a total of five stages. Farrugia et al. [13] presented an FPGA architecture for a convolutional face finder algorithm that is able to process 35 VGA images per second using a ring of 25 processor elements (PEs) running at 80 MHz. However, the reported performance was based on FPGA synthesis results; an architecture using only four PEs was implemented on the Xilinx Virtex-4 SX35 FPGA, and a speed of 29 fps was achieved using QVGA images. Hiromoto et al. [14] proposed a specialized processor that performs parallel and sequential processing for the early and subsequent cascade stages, respectively. The proposed architecture was implemented on a Virtex-5 LX330 FPGA, and a detection rate of 30 fps on VGA images was reported. Gao and Lu [15] proposed an FPGA accelerator for the Haar classifier-based face detection algorithm that can process 256 × 192 images at 98 fps using 16 parallel classifiers. Cho et al. [16] also proposed an AdaBoost-based hardware face detector with three parallel classifiers. The reported detection rate using a Virtex-5 Xilinx FPGA was 7 fps for VGA images, an almost 19-fold improvement over the corresponding software implementation. Considering the size of the input images, both Gao and Cho achieved near real-time face detection performance. He et al. [17] have recently proposed an FPGA-based system-on-a-chip architecture for fast face detection using artificial neural network classifiers on AdaBoost-trained Haar-like features. The reported detection speed was 625 fps when VGA images were processed at 73 MHz, roughly two orders of magnitude faster than the corresponding software. However, only three object window sizes were used (11 × 11, 19 × 19, and 27 × 27), causing poor detection accuracy for faces smaller than the window, as stated in their paper. Moreover, the number of Haar-like features was also relatively small compared to other hardware-based AdaBoost face detectors.

As these earlier studies show, the hardware-based face detection systems developed to date have sequential bottlenecks due to the cascaded architecture and only partially satisfy the accuracy, detection rate, and area-efficiency requirements. In this paper, we propose the design of a fully pipelined datapath that can perform high-speed face detection irrespective of the number of faces in an image. We targeted a face detection system that can provide: 1) pixel-clock synchronized detection; and 2) resource savings using the symmetric characteristics of the pyramid images and cascade stages. The sequential bottlenecks are eliminated using a pipelined datapath that generates a face detection result at each clock cycle after the initial pipeline latency. This allows us to achieve the desired speed performance irrespective of the number of faces in a scene. Furthermore, resource consumption is reduced to approximately one-eighth of that of the fully parallelized implementation of the same algorithm, without any loss of accuracy.

The remainder of this paper is organized as follows.
Section II describes the employed face detection method based on AdaBoost and Haar-like features.
Section III provides the proposed pipelined datapath and describes the detailed design of each stage. Section IV presents the FPGA implementation of the proposed hardware architecture, along with the synthesis and experimental results. Section V concludes this paper and discusses areas of future work.

II. FACE DETECTION USING FACE CERTAINTY MAP

This section introduces the face detection algorithm implemented in this paper, which is based on a robust real-time face detection method [5]. The algorithm consists of four separate stages: image pyramid generation, preprocessing, face detection, and postprocessing, as shown in Fig. 1.

The purpose of image pyramid generation is to detect faces of various sizes within an image, since the detector analyzes image patches based on a set of trained features of fixed size. Nearest neighbor interpolation, referring to adjacent pixels, is applied to increase the quality of the downscaled images. Thus, a certain number of successively downscaled images is generated.

After the image pyramid is generated, a local binary pattern (LBP) transform is applied to each downscaled image as a preprocessing step. At a specific pixel position $(x_c, y_c)$, the LBP is defined as the ordered set of binary comparisons of pixel intensities between the centered pixel and its eight surrounding pixels. The decimal form of the resulting 8-bit LBP can be expressed as follows:

$$\mathrm{LBP}(x_c, y_c) = \sum_{n=0}^{7} s(i_n - i_c)\,2^n \qquad (1)$$

where $i_c$ is the gray value of the centered pixel $(x_c, y_c)$; $i_n$ $(n = 0, \ldots, 7)$ are the gray values of the eight surrounding pixels; and $s(x)$ is a function defined as

$$s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0. \end{cases} \qquad (2)$$

After the LBP transform, each pixel value is converted to a specific pattern that describes the surroundings of the pixel itself. LBP is less sensitive to illumination changes, since it is invariant to monotonic grayscale transformations.

The resulting LBP values are transferred to a cascade of boosted classifiers for face detection. Fig. 2 shows examples of the classifier cascade, illustrating nonface rejection and face acceptance. If a nonface region cannot pass the first (the simplest) classifier, it is rejected immediately; the cascade no longer considers this region, so the processing time is significantly reduced. Similarly, if a nonface region is stopped at the fifth classifier, time is also saved because there is no further processing for this region. Only a true face region passes through all of the classifier cascades.

We use a total of 524 weak classifiers trained using LBP-transformed facial images. Five strong classifiers with 8, 36, 80, 150, and 250 weak classifiers are built to classify each window as either a face or a nonface, based on the linear combination of weak classifiers in (3). The function $f$ is used to determine the confidence value of the LBP-converted feature. In this paper, training of faces and nonfaces is done offline, and the training data are stored as a lookup table (LUT).
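As a concrete illustration of (1) and (2), the following Python sketch computes the LBP code of one pixel. The neighbor ordering is an illustrative assumption here; the hardware's exact bit ordering is described in Section III-B.

import numpy as np

def lbp_value(img: np.ndarray, x: int, y: int) -> int:
    """Compute the 8-bit LBP of the pixel at (x, y) per (1) and (2).

    Assumes 1 <= x < width-1 and 1 <= y < height-1 so that all eight
    neighbors exist. The neighbor ordering below (top-left, then
    clockwise) is an illustrative assumption.
    """
    center = int(img[y, x])
    # Offsets (dy, dx) of the eight surrounding pixels.
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                 (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for n, (dy, dx) in enumerate(neighbors):
        # s(i_n - i_c) = 1 when the neighbor is at least as bright.
        if int(img[y + dy, x + dx]) >= center:
            code |= 1 << n
    return code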
Fig. 1. Block diagram of FCM-based face detection algorithm used in the proposed system.
Fig. 2. Examples of Haar-like features.
Indeed, this LUT contains confidence values for different LBP feature locations. If an LBP feature location has high confidence, it is likely to belong to a face; otherwise, it belongs to a nonface. Hence, $f$ is the mapping function from an LBP value to the corresponding datum stored in the LUT [5].

Fig. 3 shows the detailed view of the proposed classifier cascade that classifies faces and nonfaces. Intuitively, in Fig. 3, $f$ is represented by the connector from the block "LBP value" to the block "LUT (confidence value)." Fig. 7 shows that there are five separate stages (or cascades); therefore, there are five confidence values $C_i(x, y)$ to determine, as in (3). At every rectangular block named Stage 1 (or 2, 3, 4, 5) in Fig. 3, the confidence value $C_i(x, y)$ is calculated by summing all $f$ values. If an image region has confidence values (five confidence values corresponding to the five stages) exceeding the thresholds defined by offline training (five thresholds corresponding to the five stages), this region is determined to be a face; otherwise, it is a nonface.

The classifiers are cascaded in the software implementation to increase the detection speed. In particular, the early cascade stages have a smaller number of features, while the late stages have a larger number of features. Unnecessary computations can be avoided by excluding nonface windows
in the early stages of the cascade, because all of the classification is performed sequentially throughout the entire image pyramid. An even greater computational advantage can be expected considering the number of classifications caused by the pyramid scaling. Table I shows the device usage and reference count of each cascade stage; the MIT+CMU face database was used to obtain these results. As shown in Table I, the early stages of the cascade occupy only a small amount of the routing resources while removing most of the nonface windows. Conversely, the late stages of the cascade classify the remaining face candidates with a large amount of routing resources. Therefore, the area efficiency of the early cascade stages is higher than that of the late stages.

Classification is performed by evaluating the confidence value of each scanning window centered at $(x, y)$, as shown in (3), where $i$ and $j$ represent the $i$th cascade and the $j$th feature location, respectively, and $S_i$ is the set of feature locations of the $i$th cascade. If the scanning window passes all cascades, the window is selected as a face:

$$C_i(x, y) = \sum_{j \in S_i} f_j\bigl(\mathrm{LBP}(x + u_j,\, y + v_j)\bigr) \qquad (3)$$

where $(u_j, v_j)$ denotes the $j$th feature location within the window.
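To make the stagewise evaluation of (3) concrete, a minimal Python sketch follows. The argument names, LUT layout, and early-exit structure are illustrative assumptions; this models the cascade behavior rather than the exact hardware organization.

def classify_window(lbp_win, stage_luts, stage_locs, stage_thresholds):
    """Evaluate the five-stage cascade of (3) on one candidate window.

    lbp_win          -- 2-D list of LBP codes for the candidate window
    stage_luts       -- per-stage lists of 256-entry confidence LUTs
    stage_locs       -- per-stage lists of (u, v) feature locations
    stage_thresholds -- per-stage thresholds from offline training
    All names and layouts are illustrative assumptions.
    """
    for luts, locs, theta in zip(stage_luts, stage_locs, stage_thresholds):
        # C_i = sum over the stage's feature locations of the LUT
        # confidence addressed by the LBP code at that location.
        c_i = sum(lut[lbp_win[v][u]] for lut, (u, v) in zip(luts, locs))
        if c_i <= theta:
            return False  # rejected early: nonface
    return True  # passed all five stages: face candidate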
Fig. 3. Detailed view of the proposed classifier cascade that can classify face and nonface patterns from one level of the image pyramid.

TABLE I
DEVICE UTILIZATION AND REFERENCE COUNT OF CASCADE STAGES
Postprocessing is applied to the detected faces to remove overlapped and falsely accepted faces. As in [5], a face certainty map (FCM) is constructed based on four parameters: $C_{\max}(x, y)$, $W(x, y)$, $H(x, y)$, and $C_{\mathrm{sum}}(x, y)$. $C_{\max}(x, y)$ is the maximum confidence value among the windows of the image pyramid with the same center location $(x, y)$. $W(x, y)$ and $H(x, y)$ are the width and height, respectively, of the detected face window that contains the maximum confidence value, and $C_{\mathrm{sum}}(x, y)$ is the cumulative confidence value over the image pyramid. The face region is determined using the constructed FCM: location $(x, y)$ is determined to be the center of a face when $C_{\mathrm{sum}}(x, y)$ is above its threshold, among all locations whose $C_{\max}(x, y)$ is above the threshold. A nonface region whose maximum confidence value is above the threshold is still not classified as a face region, since its $C_{\mathrm{sum}}$ is lower than the threshold.
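A rough Python sketch of the FCM bookkeeping and decision rule follows; the tuple format of detections and the threshold names are assumptions for illustration, following the description above rather than [5] verbatim.

def build_fcm(detections):
    """Accumulate the four FCM parameters per center location.

    detections -- iterable of (x, y, w, h, confidence) tuples, one per
                  window that passed the cascade, in shared image
                  coordinates (an assumed intermediate format).
    Returns {(x, y): (c_max, W, H, c_sum)}.
    """
    fcm = {}
    for x, y, w, h, conf in detections:
        c_max, W, H, c_sum = fcm.get((x, y), (float("-inf"), 0, 0, 0.0))
        if conf > c_max:
            c_max, W, H = conf, w, h  # size of the most confident window
        fcm[(x, y)] = (c_max, W, H, c_sum + conf)  # accumulate C_sum
    return fcm

def face_centers(fcm, t_max, t_sum):
    """Keep a center only if both C_max and C_sum clear their thresholds
    (threshold names t_max and t_sum are illustrative assumptions)."""
    return [(x, y, W, H)
            for (x, y), (c_max, W, H, c_sum) in fcm.items()
            if c_max > t_max and c_sum > t_sum]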
III. PROPOSED DATAPATH DESIGN

Fig. 4 overviews the proposed datapath architecture. It consists of four separate stages, corresponding to the four stages of the employed face detection algorithm: the image pyramid stage, LBP transform stage, tree-structured cascade stage, and postprocessing stage. The proposed system takes image signals from the external camera and generates a pixel stream including video interface signals, such as frame/line valid signals. The image pyramid stage maintains the local coordinates of subimage patches and generates downscaled images with the corresponding valid signals based on those local coordinates. The LBP transform stage converts the pixel values streaming from the image pyramid stage into LBP values, as described in the previous section. The converted LBP values are then transferred to the tree-structured cascade, and each candidate window is classified based on the pretrained face dataset. The postprocessing stage makes decisions regarding the face candidates based on the confidence values and the pyramid levels generated by the former stages. A detailed explanation of the architectural advantages of each stage is provided next.

A. Image Pyramid Stage

Image pyramid generation is required for size-invariant detection, since the sizes of the faces in the captured image vary with distance. In most reported works, the entire image is stored in a video buffer for random access, because an arbitrary pixel in the captured image can be referred to during the mapping; the image pyramid is then generated by reading the image back from the video buffer. Temporal asymmetry occurs because the amount of data to be read is not the same as the amount of data written. This increases the input/output latency
Fig. 4. Overall hardware architecture of the proposed system. The datapath consists of four stages: pyramid scaling stage, LBP transform stage, tree-structured cascade stage, and postprocessing stage.
and limits the operating frequency of the system; consequently, the overall throughput of the system is affected. The proposed image pyramid stage receives the VGA input image, which is conceptually divided into a set of 16 × 16 subimage patches. Since the employed image pyramid consists of 16 levels, each downscaled image samples from every subimage patch a number of pixels equal to its level, as in Fig. 5. For example, when generating the downscaled image at pyramid level 14, 14 pixels within each patch are sampled at predefined locations along the horizontal and vertical directions. Sixteen successively downscaled images, from 640 × 480 down to 40 × 30, are simultaneously generated after a certain amount of initial latency, as in Fig. 5. The advantage of the proposed image pyramid stage is that all of the downscaled images and their corresponding valid signals are derived simultaneously, with a small latency and without storing the entire image. Therefore, the temporal asymmetry between the image acquisition and the pyramid scaling is eliminated. Moreover, a single read/write clock is used that is synchronized with the pixel sampling clock of the camera, so the maximum throughput of the system becomes independent of the operating clock of the image pyramid stage.
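The sampling scheme can be sketched in a few lines of Python. The exact per-level sample positions within each 16 × 16 patch are fixed in the hardware but not listed in the paper, so the nearest-neighbor position rule below is an assumption:

import numpy as np

def valid_positions(level, patch=16):
    """In-patch indices sampled for a pyramid level: 'level' pixels are
    kept out of every 'patch' along each axis. The nearest-neighbor
    position rule here is an illustrative assumption."""
    return sorted({(j * patch) // level for j in range(level)})

def downscale(img, level, patch=16):
    """Keep only pixels whose in-patch coordinates are valid for the
    level, yielding that level's pyramid image (e.g., level 1 gives
    40 x 30 from VGA; level 16 passes the image through)."""
    img = np.asarray(img)
    keep = np.zeros(patch, dtype=bool)
    keep[valid_positions(level, patch)] = True
    row_mask = keep[np.arange(img.shape[0]) % patch]
    col_mask = keep[np.arange(img.shape[1]) % patch]
    return img[row_mask][:, col_mask]

For a 640 × 480 input, level L yields a 40L × 30L image. Note that the paper's actual positions are additionally chosen so that the valid positions of paired levels are mutually exclusive (Fig. 5), which this simple rule does not guarantee.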
B. LBP Transform Stage

As previously stated, we use LBP-transformed face images as the training dataset to achieve robustness to illumination changes. Thus, each pixel from the image pyramid stage must be converted to an LBP value before evaluation. Fig. 6 shows the proposed datapath design of the LBP transform stage. A set of shift registers and line buffers is used for parallel processing, since the LBP transform refers to the centered pixel and its surrounding pixels simultaneously. After a latency of two lines and three pixels, the first 3 × 3 pixels of the image are assigned to the shift registers in the window. The intensity of the centered pixel is compared to its neighboring pixels using eight comparators. The LBP value corresponding to the respective 3 × 3 neighborhood is obtained by concatenating the resulting eight binary values in counterclockwise order. The resulting LBP value is delivered to the first stage of the tree-structured cascade.
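The line-buffer dataflow of Fig. 6 can be modeled in software as follows. This is a behavioral sketch of the streaming structure (two line buffers plus a 3 × 3 shift-register window), not the RTL itself, and line-boundary effects are ignored for brevity.

from collections import deque

def stream_lbp(pixels, width):
    """Behavioral model of the LBP stage: one LBP code per pixel clock
    after a latency of two lines and three pixels.

    pixels -- flat iterable of gray values in raster order
    width  -- line width of the current pyramid image
    """
    line1 = deque(maxlen=width)        # previous line
    line2 = deque(maxlen=width)        # line before that
    win = [[0] * 3 for _ in range(3)]  # 3x3 shift-register window
    seen = 0
    for p in pixels:
        for r in range(3):             # shift the window left
            win[r][0], win[r][1] = win[r][1], win[r][2]
        if len(line2) == width:
            win[0][2] = line2[0]       # pixel two lines above
        if len(line1) == width:
            win[1][2] = line1[0]       # pixel one line above
        win[2][2] = p                  # current pixel
        if len(line1) == width:        # advance the delay lines
            line2.append(line1[0])
        line1.append(p)
        seen += 1
        if seen >= 2 * width + 3:      # stated pipeline latency
            c = win[1][1]
            # Tap ordering is illustrative; the hardware concatenates
            # the comparator outputs in counterclockwise order.
            taps = [win[0][0], win[0][1], win[0][2], win[1][2],
                    win[2][2], win[2][1], win[2][0], win[1][0]]
            yield sum(1 << n for n, t in enumerate(taps) if t >= c)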
Fig. 5. Different valid positions of the downscaled images (dark-gray pixels). The input image is conceptually divided into 40 × 30 patches of 16 × 16 subimage coordinates, and the image pyramid is generated by sampling the incoming pixel when the pixel is located at a valid position of each level. The generated image pyramid is symmetric, and the corresponding downscaled images are mutually exclusive.
Fig. 6. Hardware implementation of LBP transform stage.
C. Tree-Structured Cascade

In the tree-structured cascade module, all of the cascade stages are implemented with fixed pipelining, as in Fig. 6. Since a classification result is generated at each pixel clock after the initial pipeline latency, the advantage of cascading in reducing computation time diminishes. Moreover, the confidence value calculated in the late stages of the cascade is meaningless unless the face candidate passes the early stages. Thus, parallel processing of the entire cascade can cause redundancy in terms of frequency of use. Therefore, two different methods, classifier sharing and cascade stage merging, are considered to reduce resource consumption while maintaining the detection rate and accuracy. Since the image pyramid is built symmetrically, only one LBP value is valid in the corresponding image pair at a specific time. That
is, we can share the first stage of the cascade between the corresponding images without accuracy loss. In addition, we use the tree-structured cascade of classifiers because most nonface candidates are eliminated during the early stages of the cascade, and face candidates tend to be detected in adjacent levels of the image pyramid. The face candidate of a specific downscaled image advances to the next stage only if the resulting confidence value is higher than the predefined threshold. This procedure is performed repeatedly until the last stage of the cascade. However, face candidates identified during specific stages are likely to collide with others in subsequent stages, since the LUTs are shared among the pyramid images. In particular, there is a high probability of conflict among adjacent pyramid images because: 1) the detected face windows usually overlap each other; and 2) the pixel validity signal in one downscaled image is similar to those of adjacent downscaled images. Thus, each LUT is shared by two results of the cascade chosen so as to minimize conflict. In addition, downscaled images with low resolution have higher priority, since faces closer to the camera are more important than those further away. As with the LBP transform stage, a set of 20 × 20 shift registers and 19 line buffers is used for simultaneous access to the window pixels. The LUTs that represent the Haar-like features of each classifier are implemented in hardware using the block RAM (BRAM) components of the FPGA by mapping: 1) the LBP value of the candidate window to the BRAM address; and 2) the confidence value of the Haar-like feature to the BRAM data. As a result, the confidence value for each feature can be derived within a single clock cycle by referring to the BRAM using the LBP value as an address. To calculate the sum of all of the confidence values of the candidate window within the designated clock cycles, dedicated adder trees for each classifier are used, as shown in Fig. 7.
Fig. 7. Tree-structured cascade of classifiers. The results of former stages are compared to each other using the resulting confidence values. BRAMs are shared by any two results of the cascade in a way that minimizes conflicts. Five strong classifiers with 8, 36, 80, 150, and 250 weak classifiers are built to classify each window as either a face or a nonface.
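The BRAM lookup and adder-tree summation can be modeled as below; the 256-entry table layout and the pairwise reduction are sketches of the structure described above, not the exact RTL.

def stage_confidence(lbp_codes, brams):
    """Model of one cascade stage: each feature's BRAM is addressed by
    the LBP code at its feature location, and the per-feature
    confidences are summed by a pairwise (adder-tree) reduction.

    lbp_codes -- LBP codes, one per feature location of this stage
    brams     -- 256-entry confidence tables, one per feature
    """
    # One-cycle BRAM read per feature: address = LBP code.
    confs = [bram[code] for bram, code in zip(brams, lbp_codes)]
    # Pairwise reduction mirrors the log2(N)-level adder tree.
    while len(confs) > 1:
        reduced = [a + b for a, b in zip(confs[::2], confs[1::2])]
        if len(confs) % 2:
            reduced.append(confs[-1])  # odd element passes through
        confs = reduced
    return confs[0]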
TABLE II DEVICE UTILIZATION/TIMING SUMMARY REPORTED FROM XILINX SYNTHESIS TOOL XST 10.1
D. Postprocessing Stage

In the postprocessing stage, overlapped or falsely accepted faces are handled using the overlapping characteristics of the Haar-like classifier-based face detection algorithm. As shown in the previous sections, for real face regions, faces of different scales can be detected at the same center location, and faces at different locations with the same scale can be detected as well. Therefore, regions in which multiple detected face windows overlap can be determined to be face regions, while regions with no overlapping face windows are determined to be falsely accepted. The width and height of the window with the maximum confidence value for each pyramid image are calculated using the result of the classifier cascade and the corresponding scaling factor. The detected face window with the highest confidence value is selected as the final detected face.

IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS

The proposed hardware architecture for high-speed face detection was designed using the VHSIC Hardware Description Language (VHDL). We implemented the proposed design on a Xilinx Virtex-5 LX330 FPGA, since it offers sufficient speed and resources for evaluation and verification purposes. Table II shows the device utilization/timing summary of the implemented face detection system as reported by the Xilinx synthesis tool XST 10.1 [19]: 74 977 slice LUTs and 128 041 slice registers were used. The reported maximum allowed frequency of the proposed system is 125.59 MHz.
A. Precision and Recall

The robustness of the system was also evaluated via simulation using 934 images containing 1394 faces. These experimental images were taken from the face database in [20]; we selected only the first directory, named 2002–2007, for the experiment. The evaluation used five criteria. First, precision and recall were analyzed with respect to all faces; they were computed based on the number of correct detections (true positives), the number of missed detections (false negatives), and the number of wrong detections (false positives). Second, the results were analyzed when partial occlusion was not considered. Similarly, in the three remaining criteria, out-of-focus faces, rotation-in-plane (RIP), and rotation-out-of-plane (ROP) were not considered, respectively. Fig. 8 shows the experimental results for the five cases. It is notable that our implementation does not support ROP; therefore, most faces rotated out of plane were missed. Hence, when ROP is not considered, the experiments yield 93% detection accuracy. Fig. 9 shows sample image results where faces were successfully detected under different conditions.
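For reference, the standard definitions assumed here are

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.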
Fig. 8. Evaluation of five different cases: all faces, all except partially occluded faces, all except out-of-focus faces, all except RIP, and all except ROP.
Fig. 10. Face detection results of the proposed system in various scenes. Multiple faces are successfully detected in real time. (a) Experimental environment. Right monitor displays input image, and left monitor displays resulting image. (b) Face detection result before postprocessing. Faces are detected through scanline (unlimited). (c) Test video clip: Multiple faces with similar scales. (d) Test video clip: Multiple faces with different scales.
B. Experimental Results
Fig. 9. Sample results yielded from the face image database.
Two kinds of cameras with different frame rates were interfaced to test the performance of the proposed system. A VCC-8350CL Camera Link camera captures standard VGA images at 60 fps with a 24.58 MHz pixel clock, and an MV-D640 CMOS camera captures standard VGA images at 200 fps with an 81.92 MHz pixel clock. Both cameras, with their designated pixel clocks, performed face detection in our experiments without performance degradation. Because the frame rate increases linearly with the pixel clock, we can expect more than 307 fps processing of standard VGA images when running at the maximum reported clock frequency of 125.59 MHz. Further performance enhancement can be expected considering the underestimation characteristic of the synthesis tool [18]. Since the maximum frame rate of the cameras available for this research is limited to 200 fps, the performance of the proposed system at higher rates was verified by timing simulation; a Mentor Graphics ModelSim 6.1f simulation environment with test vectors that describe the actual behavior of the camera was used for verification.

Figs. 10 and 11 show face detection results of the proposed system. The results were obtained from the implemented system in real time using the VCC-8350CL camera at 60 fps. The experimental results yielded 93% detection accuracy. The proposed system successfully detects multiple faces in real time, even if the scene changes very rapidly, as shown in Fig. 10. Table III compares the performance of reported FPGA-based face detection systems. The proposed system can process 307 fps while running at the reported maximum frequency (125.59 MHz), while achieving a small pixel offset of 1. Considering the number of pyramid levels and Haar-like features, the proposed system shows the best speed performance among the reported FPGA-based face detection systems. In addition, we can expect the accuracy of the proposed system to compare favorably with the reported FPGA-based systems, since a larger number of Haar-like features and pyramid levels implies more accurate detection.
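The 307-fps figure follows from simple pixel-clock arithmetic. The camera figures above imply roughly 409 600 clock cycles per VGA frame including blanking (e.g., 81.92 MHz / 200 fps = 409 600); the exact blanking interval is an assumption carried over from those numbers:

$$\frac{125.59 \times 10^{6}\ \mathrm{Hz}}{409\,600\ \mathrm{cycles/frame}} \approx 307\ \mathrm{frames/s}.$$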
TABLE III PERFORMANCE SUMMARY OF REPORTED FPGA-BASED FACE DETECTION SYSTEMS
V. CONCLUSION
Fig. 11. Performance comparison between the proposed system and the market product. The proposed system can track multiple faces that oscillate from side to side at a speed of 3 cm/s, while the market product easily loses the faces. (a) Proposed system at frame t. (b) Market product at frame t. (c) Proposed system at frame t + 1. (d) Market product at frame t + 1. (e) Proposed system at frame t + 21. (f) Market product at frame t + 21.
In this paper, we proposed the design of a fully pipelined datapath for high-speed face detection. All of the procedures required for face detection were integrated within a single FPGA, including the image pyramid stage, LBP transform stage, tree-structured cascade stage, and postprocessing stage. The performance of the proposed system was evaluated in a common environment to analyze further applicability. To achieve the desired speed performance, we focused on the intensive use of pipelining during the design stage, synchronizing all of the functional elements with a single pixel clock. As shown in the experiments, the frame rate of the proposed system is independent of the number of faces in the scene and can be flexibly adjusted in proportion to the frame rate of the camera. However, the detection features are embedded into the design and, at the present stage, can only be configured at compile time. In addition, the size of detectable features depends on the size of the input image: different image sizes are allowed as input, but the number of pyramid levels is fixed.

For future work, we plan to extend the applicability of the proposed system to detect facial features such as the eyes and lips by making each module configurable. In addition, the use of our implementation as an intelligent sensor is also being considered for higher level vision applications, such as intelligent robots, surveillance, automotive systems, and human–computer interfaces. Additional areas of application for the proposed face detection system will be evaluated and explored.

REFERENCES

[1] M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 1, pp. 34–58, Jan. 2002.
[2] D. Zhang, S. Z. Li, and D. Gatica-Perez, "Real-time face detection using boosting in hierarchical feature spaces," in Proc. 17th Int. Conf. Pattern Recognit., Aug. 2004, pp. 411–414.
[3] P. Viola and M. Jones, "Fast and robust classification using asymmetric AdaBoost and a detector cascade," Adv. Neural Inf. Process. Syst., vol. 2, pp. 1311–1318, 2002.
[4] Y. Yan, Z. Guo, and J. Yang, "Multi-view face detection based on the enhanced AdaBoost using Walsh features," in Proc. 8th ACIS Int. Conf. Softw. Eng., Artif. Intell., Netw., Parallel/Distrib. Comput., Jul. 2007, vol. 1, pp. 200–205.
[5] B. Jun and D. Kim, "Robust real-time face detection using face certainty map," in Lecture Notes in Computer Science, vol. 4642. Berlin, Germany: Springer, 2007, pp. 29–38.
[6] M. Grajcar, "Strengths and weaknesses of genetic list scheduling for heterogeneous systems," in Proc. 2nd Int. Conf. Appl. Concurr. Syst. Design, Jun. 2001, pp. 123–123.
[7] Y. Wei, X. Bing, and C. Chareonsak, "FPGA implementation of AdaBoost algorithm for detection of face biometrics," in Proc. IEEE Int. Workshop Biomed. Circuits Syst., Dec. 2004, p. S1/6-17-20.
[8] S. Asano, T. Maruyama, and Y. Yamaguchi, "Performance comparison of FPGA, GPU and CPU in image processing," in Proc. Int. Conf. Field Programmable Logic Appl., Aug. 2009, pp. 126–131.
[9] B. Kisacanin, S. S. Bhattacharyya, and S. Chai, Embedded Computer Vision. New York: Springer, 2008.
[10] H.-C. Lai, M. Savvides, and T. Chen, "Proposed FPGA hardware architecture for high frame rate (> 100 fps) face detection using feature cascade classifiers," in Proc. IEEE Int. Conf. Biometr.: Theory, Appl., Syst., Sep. 2007, pp. 1–6.
[11] K. Irick, M. DeBole, V. Narayanan, R. Sharma, H. Moon, and S. Mummareddy, "A unified streaming architecture for real time face detection and gender classification," in Proc. Int. Conf. Field Programmable Logic Appl., Aug. 2007, pp. 267–272.
[12] H. Ngo, R. Tompkins, J. Foytik, and V. Asari, "An area efficient modular architecture for real-time detection of multiple faces in video stream," in Proc. 6th Int. Conf. Inf., Commun. Signal Process., 2007, pp. 1–5.
[13] N. Farrugia, F. Mamalet, S. Roux, F. Yang, and M. Paindvoine, "Fast and robust face detection on a parallel optimized architecture implemented on FPGA," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 4, pp. 597–602, Apr. 2009.
[14] M. Hiromoto, K. Nakahara, and H. Sugano, "A specialized processor suitable for AdaBoost-based detection with Haar-like features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Oct. 2007, pp. 1–8.
[15] C. Gao and S. Lu, "Novel FPGA based Haar classifier face detection algorithm acceleration," in Proc. Int. Conf. Field Programm. Logic Appl., Sep. 2008, pp. 373–378.
[16] J. Cho, S. Mirzaei, J. Oberg, and R. Kastner, "FPGA-based face detection system using Haar classifiers," in Proc. 17th ACM/SIGDA Int. Symp. Field-Programm. Gate Arrays, Dec. 2009, pp. 103–112.
[17] C. He, A. Papakonstantinou, and D. Chen, "A novel SoC architecture on FPGA for ultra fast face detection," in Proc. IEEE Int. Conf. Comput. Design, Oct. 2009, pp. 412–418.
[18] J. Diaz, E. Ros, F. Pelayo, E. M. Ortigosa, and S. Mota, "FPGA-based real-time optical-flow system," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 2, pp. 274–279, Feb. 2006.
[19] Xilinx, XST User Guide, 2008. [Online]. Available: http://www.xilinx.com
[20] V. Jain and E. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," Dept. Comput. Sci., Univ. Massachusetts, Amherst, Tech. Rep. UM-CS-2010-009, 2010. [Online]. Available: http://vis-www.cs.umass.edu/fddb
Seunghun Jin received the B.S., M.S., and Ph.D. degrees in electrical and computer engineering from Sungkyunkwan University, Suwon, Korea, in 2005, 2006, and 2009, respectively. In 2010, he joined the Samsung Advanced Institute of Technology, Yongin, Korea, as an R&D Staff Member. His research interests include reconfigurable architecture, image and speech signal processing, embedded systems, and real-time applications.
Dongkyun Kim received the B.S. and M.S. degrees in electrical and computer engineering from Sungkyunkwan University, Suwon, Korea, in 2007 and 2009, respectively, where he is currently working toward the Ph.D. degree at the School of Information and Communication Engineering. His research interests include image/speech signal processing, embedded systems, and real-time applications.
Thuy Tuong Nguyen received the B.S. (magna cum laude) degree in computer science from Nong Lam University, Ho Chi Minh City, Vietnam, in 2006, and the M.S. degree in electrical and computer engineering from Sungkyunkwan University, Suwon, Korea, in 2009, where he is currently working toward the Ph.D. degree at the School of Information and Communication Engineering. His research interests include computer vision, image processing, and graphics processing unit computing.
Daijin Kim received the B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1981, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1984, and the Ph.D. degree in electrical and computer engineering from Syracuse University, Syracuse, NY, in 1991. He is currently a Professor in the Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, Korea. His research interests include intelligent fuzzy systems, soft computing, genetic algorithms, and the hybridization of evolution and learning.
Munsang Kim received the B.S. and M.S. degrees in mechanical engineering from Seoul National University, Seoul, Korea, in 1980 and 1982, respectively, and the Dr.-Ing. degree in robotics from the Technical University of Berlin, Berlin, Germany, in 1987. Since 1987, he has been a Research Scientist at the Korea Institute of Science and Technology, Seoul, Korea. He led the Advanced Robotics Research Center in 2000 and became the director of the Intelligent Robot-The Frontier 21 Program in 2003. His current research interests include the design and control of novel mobile manipulation systems, haptic device design and control, and sensor applications for intelligent robots.
Jae Wook Jeon (S’82–M’84) received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, Korea, in 1984 and 1986, respectively, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, in 1990. From 1990 to 1994, he was a Senior Researcher at Samsung Electronics, Suwon, Korea. In 1994, he joined the School of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, as an Assistant Professor, where he is currently a Professor. His research interests include robotics, embedded systems, and factory automation.