Image Processing & Communications, vol. 21, no. 3, pp.55-68 DOI: 10.1515/ipc-2016-0016
FPGA IMPLEMENTATION OF MULTI-SCALE PEDESTRIAN DETECTION IN THERMAL IMAGES
Tomasz Kańka, Tomasz Kryjak, Marek Gorgon
AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
Abstract. In this paper an embedded vision system for human silhouette detection in thermal images is presented. As the computing platform a reprogrammable device (FPGA – Field Programmable Gate Array) is used. The detection algorithm is based on a sliding window approach, in which the window content is compared with a probabilistic template. Moreover, detection in four scales is supported. On the test database used, the proposed method obtained 97% accuracy, with on average one false detection per frame. Thanks to parallelization and pipelining, real-time processing of 720 × 480 @ 50 fps and 1280 × 720 @ 50 fps video streams was achieved. The system has been practically verified in a test setup with a thermal camera.

Key words. pedestrian detection, thermal images, FPGA, multi-scale image processing, embedded vision systems, real-time image processing

1 Introduction

Vision systems that allow automatic pedestrian detection at night-time or when the visibility is heavily hindered by fog or smoke have many practical applications. First, they are used in broadly understood security systems for surveillance of areas around important facilities like military bases, nuclear power plants, airports, ports and all kinds of warehouses. Second, due to the need to guarantee uninterrupted operation under different conditions, such solutions are applied in the military, mainly for potential target detection. Third, the fusion of visual and thermal information can greatly improve road traffic safety at night. Collisions with unprotected road users, i.e. pedestrians or cyclists moving along the edge of the road, account for a significant share of all accidents. According to the document entitled "General data about motorization and road accidents" regarding 2015 issued by the Polish Police, most incidents with tragic consequences were observed in the autumn and winter months. Early dusk and impaired visibility were given as the main reasons. The use of an automated pedestrian detection system could allow the driver to react earlier and brake or steer clear of a human on the road. It is worth noting that solutions of this type are currently available in selected cars from the so-called premium segment: Audi, BMW, Mercedes, Lexus, Toyota and Honda.

Image registration at night-time or in reduced visibility using daylight video cameras requires the use of additional lighting (visible or near infrared) or is even impossible (under heavy fog). Furthermore, a separate lighting system generates additional costs and its use is often undesirable. Thermal (infrared) cameras are able to register the heat emitted by an object. This is a major difference from visible-light cameras, where the light reflected from objects is captured. Good heat sources are the human body, vehicles (engine, tires) and essentially all objects that are warmer than their environment. Unfortunately, such solutions also have a number of drawbacks. First, thermal cameras are much more expensive (even 10 times) than classical ones. Second, the sensors used usually have a relatively low spatial resolution (typically 320 × 240) – in general, the higher the resolution, the more expensive the device. Third, the camera requires frequent calibration because the measurement results are heavily affected by the housing, as well as the sensor temperature. Fourth, in the context of pedestrian detection the two following borderline situations are problematic. If it is very cold and people wear quite thick clothes, then only exposed parts of the human body are visible in the thermal image – sometimes only the faces. Conversely, if it is very hot and the difference between the body and the surroundings is relatively small, the output thermal image is quite noisy and its analysis is quite difficult.

When designing an embedded vision system a proper computing platform should be used. Among the vast possibilities the following may be mentioned: general purpose processors CPU/GPP (Central Processing Unit, General Purpose Processor), ASIC (Application Specific Integrated Circuit) or reprogrammable FPGAs (Field Programmable Gate Array) and heterogeneous systems like Xilinx's Zynq (a combination of an ARM processor and reprogrammable logic in one housing). These platforms have their advantages and disadvantages. Their detailed discussion is beyond the scope of this article – some additional information can be found in [3]. The final solution should satisfy two, usually conflicting, requirements – real-time video stream processing capability and energy efficiency.

In the vision system described in this paper an FPGA device was used as the main computing platform. As will be demonstrated, it made it possible to satisfy both mentioned requirements, as well as to provide great upgrade flexibility. The main contributions of this paper are:

• a pedestrian detection system for thermal images, verified in hardware,
• proposals of some optimizations, which made multi-scale object detection possible,
• an improvement to a method described in the literature, which increased the detection accuracy.

The remainder of this paper is organized as follows. The issue of pedestrian detection in thermal images, in general and in FPGA devices, is presented in Section 2. The proposed vision system, i.e. the algorithm used, its evaluation and FPGA implementation, is described in Section 3. The article ends with a short summary and a discussion of future research.

2 Pedestrian detection in thermal images

Pedestrian detection is an issue of great practical importance. It is used both in Advanced Driver Assistance Systems (ADAS) and Advanced Automated Surveillance Systems (AVSS). Over the years a number of different algorithms have been proposed. A comprehensive review can be found in the papers [6] and [2].
Among the most important solutions two representative ones are definitely worth mentioning. First, the Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) approach proposed in 2005 by N. Dalal and B. Triggs [4]. It is an example of a "classical" object detection algorithm, with a clear division between feature extraction (HOG – local distribution of edge orientations) and classification (SVM). In the literature several similar solutions can be found – using other features like Haar wavelets, LBP (Local Binary Patterns), HOF (Histogram of Optical Flow), as well as various improvements to the HOG method.

Second, recently deep convolutional neural networks (DCNN) have become extremely popular. An example of this approach is described in the work [5]. At a very general level the concept is similar to HOG+SVM. The DCNN architecture consists of two parts: convolutional/pooling layers (feature extraction) and a fully connected neural network (classification). However, both elements are trained and therefore the feature selection process is fully automated (learned, not designed). DCNNs obtain very good classification results in various tasks and usually outperform the earlier approaches. The drawbacks of this solution are its long training process (measured in days for large networks and requiring many training images), as well as the relatively long classification time for a single sample. To accelerate the computations, powerful GPU (Graphics Processing Unit) accelerators are usually used.

Fig. 1: Sample human silhouettes registered by a thermal camera

Both of the above mentioned methods have been developed and tested on images registered by a standard camera. Thermal images have a certain specificity and therefore slightly modified algorithms for pedestrian detection should usually be used. The most important and noticeable difference is that a thermal image of a human does not contain much texture or strong edges (example in Fig. 1). In the review papers [7, 10] and article [8] a number of different approaches to this issue are discussed:

• detection of symmetrical blobs of certain size and shape, vertical histogram shape analysis, correlation with a pre-defined model,
• contour detection and AdaBoost based classification,
• the use of SURF (Speeded-Up Robust Features) feature points and a codebook approach,
• the use of methods designed for visual images: HOG or edgelets features and SVM or AdaBoost classifier,
• stereovision approach to determine the region of interest (ROI),
• the use of multi threshold binarization.

Additionally, in the recent work [8] a fairly complex pedestrian detection algorithm was proposed. In the first step the regions of interest were determined. This involved head detection (as bright objects having a specific shape). Then, vertical histograms were used to determine the size of the pedestrian (as small, medium or large). Finally, a region growing procedure was applied and then the bounding boxes were determined. In the second step, the detected candidate areas were subjected to validation. A standard solution involving feature extraction (curvelet transform) and classification (SVM) was used. The authors emphasize the high accuracy of the method (comparable to DCNN) with only moderate computational complexity.

2.1 FPGA implementations

FPGAs (Field Programmable Gate Arrays) are a proven platform for implementing various embedded vision systems [1]. They consist of a number of simple logic elements (look-up tables, flip-flops, multipliers, block memories etc.). This architecture greatly supports all kinds of parallelism – from fine to coarse grain. Additionally, if the pipeline pixel processing paradigm is used, very high performance and real-time processing are possible, while keeping the energy consumption at a relatively low level. What is more, FPGAs are very flexible and can be reconfigured also in the target system. However, this solution also has some drawbacks. First, the process of logic design is much more difficult and time-consuming than writing a software application for a general purpose processor (GPP), especially when complex vision systems are considered (for simple applications it is possible to effectively use HLS (High Level Synthesis) tools). Second, the pipeline paradigm in conjunction with the real-time processing requirement imposes certain restrictions on the algorithms used. For example, fully random access to the image pixels is not feasible. Therefore, it is impossible to use, for example, region growing segmentation – like in [8].

In the literature several FPGA implementations of pedestrian detection in thermal images were described. In the work [12] a system based on foreground object segmentation was proposed. First, the background model was stored in the internal FPGA block memory (BRAM) resources. In the next step the object mask was obtained and analysed. It was assumed that a frame can contain only one object (one human). The computed bounding box parameters were sent via Bluetooth to a PC, where a tracking procedure was performed. The system was tested for two cameras: Thermo Vision Micron Infrared Camera (164 × 128 pixels resolution) and Flir Systems Thermacam PM595 (320 × 240 pixels resolution). The XUP Virtex II Pro Development System (XC2VP30) was used as the computing platform. Real-time image processing for a 640 × 480 @ 25 fps video stream was obtained (the thermal image was up-scaled in the frame-grabber used). It should be noted that the whole embedded system had been reduced to only a rather simple foreground object segmentation followed by single object analysis.

In the paper [13] a more advanced version of the above mentioned solution was presented. It used an adaptive foreground object segmentation algorithm (so-called single Gaussian), followed by morphological opening (erosion and dilation) and connected component labelling (CCL). Additionally, tracking was implemented on the PowerPC processor available in the Virtex II Pro FPGA device. Therefore, the whole system could be described as a hardware-software solution. It was possible to detect up to 127 objects for 320 × 240 @ 25 MHz input video stream. It is worth noting that the false detection analysis was possible only on the basis of simple geometrical features obtained during CCL, i.e. bounding box, area or centroid. A more comprehensive description of the tests conducted by the first author of these two papers can be found in his Ph.D. thesis [11].

In the work [14] hardware implementations of two solutions were described: for rigid and non-rigid objects (like pedestrians). In the first case, a method called Shape Constrain (SC) was used. It consisted of two stages:

• local binarization with a 32 × 32 sliding window approach,
• detection based on comparing the content of the sliding window to a template database (56 templates in two scales were used for the 3 considered objects – in total 672 detectors). However, in the design the authors did not use the entire database, but only the
most representative (discriminatory) templates.

In the second case, a method based on the Naive Bayes classifier was used. In the first stage, the input image was also subjected to local binarization. Then detection using a sliding window was applied. It was based on comparing the current sample with a pattern (probability map) obtained during a training phase. The authors implemented processing in three scales – 10 × 15, 12 × 8 and 9 × 6 pixel silhouettes could be detected. In order to improve the reliability of the system, the analysis was limited only to the upper part of the body. To make the hardware implementation feasible, the authors used the logarithm of the probability map. This made it possible to replace the costly multiplication operations with much simpler additions. The described vision system was implemented on an evaluation board with the Cyclone III FPGA device from Altera (EP3C40F484). Real-time image processing of a 360 × 288 @ 25 fps video stream was reported.

Fig. 2: The used probabilistic template in different scales
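The log-domain trick reported for [14] above can be illustrated with a short Python sketch. It is not the implementation from [14]: the Bernoulli form of the per-pixel likelihood, the function names, the clamping constant `eps` and the toy 3 × 2 data are all assumptions made for illustration. The sketch only shows why storing the logarithm of the probability map turns a per-pixel product into a sum of look-ups, which is the property that makes the approach attractive in FPGA logic.

```python
import math

def likelihood_product(window, prob_map):
    """Direct Naive Bayes likelihood: one multiplication per pixel."""
    result = 1.0
    for w_row, p_row in zip(window, prob_map):
        for bit, p in zip(w_row, p_row):
            # Assumed Bernoulli model: p if the binarized pixel is 1, else 1 - p.
            result *= p if bit else (1.0 - p)
    return result

def build_log_tables(prob_map, eps=1e-6):
    """Precompute log(p) and log(1 - p) once, offline.

    Probabilities are clamped to [eps, 1 - eps] so the logarithm stays finite
    (eps is a hypothetical choice, not a value from the paper).
    """
    log_on, log_off = [], []
    for p_row in prob_map:
        clamped = [min(max(p, eps), 1.0 - eps) for p in p_row]
        log_on.append([math.log(p) for p in clamped])
        log_off.append([math.log(1.0 - p) for p in clamped])
    return log_on, log_off

def log_likelihood_sum(window, log_on, log_off):
    """Log-domain score: only table look-ups and additions per pixel."""
    total = 0.0
    for w_row, on_row, off_row in zip(window, log_on, log_off):
        for bit, l_on, l_off in zip(w_row, on_row, off_row):
            total += l_on if bit else l_off
    return total

# Toy 3 x 2 example; the values are illustrative only.
prob_map = [[0.9, 0.8], [0.7, 0.2], [0.6, 0.1]]
window = [[1, 1], [1, 0], [0, 0]]

log_on, log_off = build_log_tables(prob_map)
direct = likelihood_product(window, prob_map)
log_score = log_likelihood_sum(window, log_on, log_off)
print(direct, math.exp(log_score))
```

Because the logarithm is monotonic, comparing `log_score` against a threshold gives the same decision as comparing the product against the exponential of that threshold. In hardware the two log tables would presumably be stored as fixed-point constants, which is a further assumption not covered by this sketch.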
3 The proposed solution

The main aim of this work was the realization of an embedded vision system able to perform pedestrian detection in thermal images in real time. The Spartan 6 FPGA device from Xilinx was chosen as the computing platform. It made it possible to achieve relatively high processing performance with low power consumption. The algorithm used was based on the solutions described in papers [9] and [14]. Its structure allows relatively easy parallelization and detection of objects of different sizes. In the next subsections the algorithm, its software implementation and evaluation, as well as the designed embedded vision system are described.

3.1 The used algorithm

The pedestrian detection algorithm was based on a probabilistic template. It was created on the basis of a training set, which included rectangular images containing a human silhouette. A collection of 63 images was used – an example is shown in Fig. 1. They were acquired during winter (temperature −10 °C) with a portable thermal camera (VIGOcam v50) at a resolution of 320 × 240 pixels. Then, 86 human silhouettes were manually selected and rescaled to the size of 96 × 240 pixels. In order to increase the size of the training set, images mirrored along the vertical axis were also used.

Then, for each location in the 96 × 240 window the sum of pixel values was computed – acc(i, j) = acc(i, j) + p(i, j), where: acc – accumulator, (i, j) – pixel coordinates. The obtained sums were divided by the number of samples in the training set and then normalized to the range [0, 1]. In this way, a probability measure which describes whether a pixel belongs to a human silhouette was obtained.

It was also assumed that the designed system must support multi-scale pedestrian detection. There are two common approaches to achieve this goal. The first involves scaling the input image and the second using multiple instances of the detector. For the considered algorithm, as will be explained later, the second solution was
formula:

$$P(i, j) = \sum_{x=1}^{M} \sum_{y=1}^{N} \bigl(I(x, y) - 127\bigr) \cdot \bigl(p(x, y) - 0.5\bigr) \tag{1}$$
where: I – considered window (grayscale), p – probabilistic template, M, N – the width and height of the window, P – detection probability. Pixels with values greater than 127 that comply with the pattern (p > 0.5) increase the value of the final sum. In case of non-compliance, the contribution is negative. A similar reasoning can be applied to dark pixels (