Image Processing Theory, Tools and Applications
Embedded System for Real-Time Human Motion Detection
Mohammad Mayya, Nizar Zarka and Mhd. Soubhi Alkadi
Higher Institute for Applied Sciences and Technology, Syria
e-mail: [email protected], [email protected], [email protected]
Abstract— This paper describes an embedded system for real-time human motion detection using a fixed camera. A modified version of the Codebook algorithm is developed to detect moving objects. This algorithm provides fast background modeling and subtraction with small storage memory requirements. The system then detects humans using a simplified Skeletonization algorithm, which uses the individual human shape and does not need a model comparison. Functional and timing simulations are carried out on a PC using MATLAB and Visual Studio. Finally, the system is installed on an ALTERA Cyclone™ II DSP development board and implemented using the Nios II processor and some hardware accelerators.
Keywords— real-time system, Codebook, background subtraction, Skeletonization, human motion, FPGAs, Nios II processor.
I. INTRODUCTION
Human motion detection and tracking in a complex environment is a hard task. It requires a robust system that copes with different motions without being affected by occlusions and changes in environment features. To handle changes in the monitored environment, we have to design a robust background model that deals with both slow illumination changes, like light changes between day and night, and fast illumination changes (clouds blocking the sun).
The simplest background model assumes that the intensity values of a pixel can be modeled by a single unimodal distribution [1]. However, a single-mode model cannot handle multiple backgrounds, like waving trees, and it requires a learning set without moving objects. The generalized mixture of Gaussians (MOG) [2] has been used to model complex and non-static backgrounds. Unfortunately, backgrounds with fast variations are not modeled accurately with just a few Gaussians, since this requires a long learning set. In addition, MOG faces a trade-off in the learning rate used to adapt to background changes. With a low learning rate, it produces a wide model that has difficulty detecting a sudden change in the background; if the model adapts too quickly, slowly moving foreground pixels are absorbed into the background model, resulting in a high false negative rate.
The Codebook algorithm [3] samples values over a long time under limited memory, without making parametric assumptions. Mixed backgrounds can be modeled by multiple codewords, and training is unconstrained: moving foreground objects may be present in the scene during the initial training period. We will present a modified version of the Codebook algorithm, based on the YUV color space instead of the RGB one, which leads to fewer false detection alarms and fewer undetected moving areas.
There are several vision systems for detecting and tracking people, such as Pfinder [4] and W4 [5]. Such systems use human features like the head or body shape, leg symmetry analysis and statistical models, which restrict them to human figures. They also need a large number of pixels on target, due to the shape-based nature of the model, which leads to misidentification of small targets. These drawbacks are alleviated by the star Skeleton model [6, 7]. The main idea is that a simple form of Skeletonization, which only extracts the broad internal motion features of the target, can be employed to analyze its
motion. This method requires neither an a priori human model nor a large number of pixels on target. We will present a simplified version of the Skeletonization algorithm. It consists of extracting two object features: the centroid point and the extreme points (head and feet). This modification reduces the computational complexity and is better suited to real-time video applications.
Embedded systems such as Systems-On-Chip (SOC) have recently become widespread in many fields. SOCs provide powerful processors capable of running software, together with external memory chips and various external peripherals. FPGA boards provide powerful hardware environments for many signal and image processing applications. Similar systems for human motion detection implemented on FPGAs are proposed in [8, 9, 10]. In fact, [9] proposes an FPGA-based system for detecting people in a video sequence. That system is designed to use JPEG-compressed frames from a network camera and uses a machine-learning-based approach to train an accurate detector, instead of using background subtraction and motion detection; it can detect people accurately at a rate of about 2.5 frames per second. [8] proposes a system for fast human motion recognition. It uses standard processor cores to simplify system design, but yields a reduction in unit processing performance compared with an optimized application-specific IP core. [10] describes the application of FPGAs and distributed RAM to image object detection and compares the performance with DSP-processor-based alternatives. [11] presents implementations of robust motion detection algorithms on three architectures: a general-purpose RISC processor, a parallel artificial retina dedicated to low-level image processing, and the Associative Mesh, a specialized architecture based on associative nets.
This paper presents an embedded system for human motion detection using a modified version of the Codebook algorithm for background modeling and a simplified version of the Skeletonization algorithm for human detection. The system is implemented as an embedded system on the ALTERA Cyclone™ II DSP development board and can detect people accurately at a rate of about 4 frames per second.
The paper is organized as follows. Section II presents the Codebook algorithm for moving objects detection; the algorithm is composed of background modeling, filtering, subtraction and update. Section III presents how human beings are identified among all moving objects by the Skeletonization algorithm. Finally, Section IV presents the installation of the system on the ALTERA Cyclone™ II DSP development board using the Nios II processor, together with the hardware system description.
II. MOVING OBJECTS DETECTION
Moving objects detection is done in four steps: background modeling, background timing filtering, background subtraction and background update. For this we apply the Codebook method [3], which provides fast and robust performance. Each pixel of the background is represented by a "codebook" which consists of "codewords". A codeword is represented in the RGB color space, with brightness and color boundaries, as shown in Fig. 1. Ĭ and Î are the minimum and the maximum brightness, respectively, of all pixel values assigned to the codeword. vm is the RGB vector of the codeword and xt is the input pixel value. Ilow and Ihigh are the brightness boundaries derived from Ĭ and Î; they allow the pixel brightness to vary. δ is the color distortion normalized to the brightness, which leads to better detection in dark areas. The color boundary ε is the threshold used to decide whether an input value belongs to the corresponding codeword or not.
Fig. 1. Codeword representation: the decision boundary of a codeword vm in RGB space (origin O, axes R, G, B), with the brightness range ⟨Ilow, Ihigh⟩ along the codeword direction, the color boundary ε and the color distortion δ of an input value xt.
A codeword also contains the following timing parameters:
→ f, the frequency with which the codeword has occurred.
→ λ, the maximum negative run-length (MNRL), defined as the longest interval during the training period in which the codeword has NOT recurred.
→ p and q, the first and last access times, respectively, at which the codeword occurred.
These timing parameters will be used to update and filter the codewords. We now present the four steps of moving objects detection.
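Before detailing the steps, the following C++ sketch makes the codeword representation and the two matching conditions concrete. The structure layout, the threshold names (eps, alpha, beta) and the helper functions are our illustrative choices, following [3], not the authors' implementation:

    #include <cmath>
    #include <algorithm>

    // One codeword: color vector, brightness range and timing parameters.
    struct Codeword {
        float v[3];          // mean color vector (R, G, B)
        float Imin, Imax;    // minimum/maximum brightness seen (Ĭ, Î)
        int   f;             // frequency of occurrence
        int   lambda;        // maximum negative run-length (MNRL)
        int   p, q;          // first and last access times
    };

    // Color distortion: distance of x from the axis spanned by the codeword vector.
    static float colordist(const float x[3], const Codeword& cw) {
        float xx = x[0]*x[0] + x[1]*x[1] + x[2]*x[2];
        float vv = cw.v[0]*cw.v[0] + cw.v[1]*cw.v[1] + cw.v[2]*cw.v[2];
        float xv = x[0]*cw.v[0] + x[1]*cw.v[1] + x[2]*cw.v[2];
        float p2 = (vv > 0.0f) ? (xv * xv) / vv : 0.0f;
        return std::sqrt(std::max(0.0f, xx - p2));
    }

    // Brightness test: does the input brightness fall inside [Ilow, Ihigh]?
    // alpha < 1 and beta > 1 are tuning constants as in [3].
    static bool brightnessOk(float I, const Codeword& cw, float alpha, float beta) {
        float Ilow  = alpha * cw.Imax;
        float Ihigh = std::min(beta * cw.Imax, cw.Imin / alpha);
        return Ilow <= I && I <= Ihigh;
    }

    // A codeword matches an input value when both conditions hold.
    static bool matches(const float x[3], float I, const Codeword& cw,
                        float eps, float alpha, float beta) {
        return colordist(x, cw) <= eps && brightnessOk(I, cw, alpha, beta);
    }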
A. Background modeling algorithm
First, let $\mathcal{C}_{i,j}$ be the codebook which represents the $(i,j)$-th pixel, $L_{i,j}$ the number of codewords contained in $\mathcal{C}_{i,j}$, $c_k$ the $k$-th codeword and $t$ the sampling time, where $1 \le t \le N$ and $N$ is the length of the learning period. The background modeling algorithm takes the following steps.
Codebook initialization: $L_{i,j} \leftarrow 0$, $\mathcal{C}_{i,j} \leftarrow \emptyset$.
Then, for each sampling time $t$ and for each $(i,j)$ pixel $x_t = (R, G, B)$ of the current frame of the learning set:
I. Compute the brightness $I = \sqrt{R^2 + G^2 + B^2}$.
II. Look for a codeword $c_m$ within $\mathcal{C}_{i,j}$ which matches the following two conditions:
$$\mathrm{colordist}(x_t, v_m) \le \varepsilon, \qquad \mathrm{brightness}(I, \langle \check{I}_m, \hat{I}_m \rangle) = \mathrm{true},$$
where
$$\mathrm{colordist}(x_t, v_m) = \delta = \sqrt{\lVert x_t \rVert^2 - \frac{\langle x_t, v_m \rangle^2}{\lVert v_m \rVert^2}},$$
and the brightness condition holds when $I_{\mathrm{low}} \le I \le I_{\mathrm{high}}$, with $I_{\mathrm{low}} = \alpha \hat{I}_m$ and $I_{\mathrm{high}} = \min\{\beta \hat{I}_m, \check{I}_m / \alpha\}$ ($\alpha < 1$, $\beta > 1$).
III. If there is no match or $\mathcal{C}_{i,j} = \emptyset$: $L_{i,j} \leftarrow L_{i,j} + 1$ and create a new codeword $c_{L_{i,j}}$ with
$$v_{L_{i,j}} \leftarrow (R, G, B), \qquad \mathit{aux}_{L_{i,j}} \leftarrow \langle I, I, 1, t-1, t, t \rangle.$$
IV. If there is a match at codeword $c_m$, with $v_m = (\bar{R}_m, \bar{G}_m, \bar{B}_m)$ and $\mathit{aux}_m = \langle \check{I}_m, \hat{I}_m, f_m, \lambda_m, p_m, q_m \rangle$, update it:
$$v_m \leftarrow \Big(\tfrac{f_m \bar{R}_m + R}{f_m + 1}, \tfrac{f_m \bar{G}_m + G}{f_m + 1}, \tfrac{f_m \bar{B}_m + B}{f_m + 1}\Big),$$
$$\mathit{aux}_m \leftarrow \langle \min(I, \check{I}_m), \max(I, \hat{I}_m), f_m + 1, \max(\lambda_m, t - q_m), p_m, t \rangle.$$
V. After the learning period, wrap around every codeword: $\lambda_k \leftarrow \max(\lambda_k, N - q_k + p_k - 1)$.
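A minimal sketch of the per-pixel training loop, reusing the Codeword type and matches() helper from the previous listing (again an illustration, not the deployed code):

    #include <vector>
    #include <cmath>

    using Codebook = std::vector<Codeword>;

    // Steps I-IV for one pixel of one training frame (time t, 1 <= t <= N).
    void trainPixel(Codebook& cb, const float x[3], int t,
                    float eps, float alpha, float beta) {
        float I = std::sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]);
        for (Codeword& cw : cb) {
            if (matches(x, I, cw, eps, alpha, beta)) {
                // Step IV: running mean of the color vector, update aux parameters.
                for (int c = 0; c < 3; ++c)
                    cw.v[c] = (cw.f * cw.v[c] + x[c]) / (cw.f + 1);
                cw.Imin   = std::min(I, cw.Imin);
                cw.Imax   = std::max(I, cw.Imax);
                cw.lambda = std::max(cw.lambda, t - cw.q);
                cw.f     += 1;
                cw.q      = t;
                return;
            }
        }
        // Step III: no match, create a new codeword <I, I, 1, t-1, t, t>.
        cb.push_back(Codeword{{x[0], x[1], x[2]}, I, I, 1, t - 1, t, t});
    }

    // Step V, once per pixel after the N training frames: wrap around lambda.
    void wrapAround(Codebook& cb, int N) {
        for (Codeword& cw : cb)
            cw.lambda = std::max(cw.lambda, N - cw.q + cw.p - 1);
    }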
B. Background timing filtering
Up to this stage, the background model still contains codewords that represent moving objects (let us call them stale codewords). To remove those stales we use the value of λ (the MNRL), dropping all codewords whose λ is higher than a threshold $T_M$, usually taken as half the training period, $T_M = N/2$. The new codebook is
$$\mathcal{M}_{i,j} = \{\, c_k \mid c_k \in \mathcal{C}_{i,j} \ \wedge\ \lambda_k \le T_M \,\}.$$
Thus, the background layers are still kept, and the moving objects that occurred during the learning period have no effect.
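With the types above, the timing filter reduces to a single pass per pixel; a brief sketch:

    #include <algorithm>

    // Drop stale codewords: keep only those whose MNRL does not exceed TM = N/2.
    void temporalFilter(Codebook& cb, int N) {
        const int TM = N / 2;
        cb.erase(std::remove_if(cb.begin(), cb.end(),
                                [TM](const Codeword& cw) { return cw.lambda > TM; }),
                 cb.end());
    }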
C. Background subtraction algorithm
Keeping the previous symbol definitions, for each sampling time $t$ and for each $(i,j)$ pixel $x_t = (R, G, B)$ of the current incoming frame, we apply the following steps:
I. Compute the brightness $I = \sqrt{R^2 + G^2 + B^2}$.
II. Look for a codeword $c_m$ within $\mathcal{M}_{i,j}$ which matches the following two conditions:
$$\mathrm{colordist}(x_t, v_m) \le \varepsilon, \qquad \mathrm{brightness}(I, \langle \check{I}_m, \hat{I}_m \rangle) = \mathrm{true}.$$
III. If there is a match at codeword $c_m$, $x_t$ is a background value and $c_m$ is updated as in step IV of the modeling algorithm:
$$\mathit{aux}_m \leftarrow \langle \min(I, \check{I}_m), \max(I, \hat{I}_m), f_m + 1, \max(\lambda_m, t - q_m), p_m, t \rangle.$$
IV. If there is no match, $x_t$ is a moving (foreground) value.
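Per pixel, subtraction is then a match test against the filtered codebook; a hedged sketch reusing the helpers above (the slightly relaxed threshold eps2 is a common choice in [3], not a value from this paper):

    #include <cmath>

    // Returns true if x(i,j) is a foreground value (no codeword in M matches it).
    bool isForeground(Codebook& M, const float x[3],
                      float eps2, float alpha, float beta, int t) {
        float I = std::sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]);
        for (Codeword& cw : M) {
            if (matches(x, I, cw, eps2, alpha, beta)) {
                // Background: refresh the matched codeword as in modeling step IV.
                for (int c = 0; c < 3; ++c)
                    cw.v[c] = (cw.f * cw.v[c] + x[c]) / (cw.f + 1);
                cw.Imin   = std::min(I, cw.Imin);
                cw.Imax   = std::max(I, cw.Imax);
                cw.lambda = std::max(cw.lambda, t - cw.q);
                cw.f     += 1;
                cw.q      = t;
                return false;
            }
        }
        return true;  // moving (foreground) value
    }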
D. Background update using the Cache model
The background model must be able to adapt to permanent changes, such as those caused by an object entering or leaving the scene. To do so we create a new model, called the "Cache", through which we manipulate these objects. As shown in Fig. 2, a new codeword is created in the cache model to represent each moving value. If a codeword in the cache recurs for long enough (a duration Tadd), it is moved to the background model (which we call the permanent model). Thus, any object that enters the scene and stays for a time Tadd is considered background instead of a moving object. On the other hand, all permanent codewords that have not occurred for a specific time Tdelete (i.e., the object has exited the scene) are removed from the permanent model. The cache model is also filtered of stale codewords, as was done for the permanent model.
Fig. 2. Background update using the Cache model: foreground values from the camera create or update codewords in the cache model; cache codewords that have occurred for long enough are moved to the permanent model; background values update the corresponding permanent codeword; and permanent codewords that did not occur for long enough are deleted.
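A hedged sketch of the bookkeeping in Fig. 2, reusing the types and trainPixel() above; the promotion and deletion tests (against Tadd and Tdelete) are our reading of the scheme, not the deployed code:

    #include <algorithm>
    #include <vector>

    // Per-pixel state: the permanent model and the cache model.
    struct PixelModel {
        Codebook permanent;  // filtered background codebook M
        Codebook cache;      // candidate codewords for new background layers
    };

    // Called once per frame for each foreground pixel value x at time t.
    void updateCache(PixelModel& pm, const float x[3], int t,
                     float eps, float alpha, float beta,
                     int Tadd, int Tdelete) {
        // Match-or-create in the cache (same logic as the training step).
        trainPixel(pm.cache, x, t, eps, alpha, beta);

        // Promote cache codewords that stayed long enough to the permanent model.
        for (auto it = pm.cache.begin(); it != pm.cache.end(); ) {
            if (t - it->p > Tadd) {
                pm.permanent.push_back(*it);
                it = pm.cache.erase(it);
            } else {
                ++it;
            }
        }
        // Remove permanent codewords not accessed for Tdelete frames (object left).
        pm.permanent.erase(
            std::remove_if(pm.permanent.begin(), pm.permanent.end(),
                           [t, Tdelete](const Codeword& cw) { return t - cw.q > Tdelete; }),
            pm.permanent.end());
    }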
E. Codebook algorithm experimental results
Functional simulation of the Codebook algorithm was done in MATLAB, followed by timing simulation in Visual Studio with the OpenCV library [12]. We applied the Codebook algorithm to different videos. The average speed obtained (with a 1.99 GHz Core 2 Duo processor and a 320x240 video resolution) is 30 fps for background modeling and 29 fps for background subtraction. Fig. 3 shows the result of applying this algorithm to a video of a passenger and a moving car, with two cars parked in the background. The learning set is 300 frames and the result frame is represented in binary.
Fig. 3. Codebook algorithm result in binary.
Unlike the algorithm proposed in [3], we define the Codebook in the YUV color space instead of the RGB one. Fig. 4 shows a comparison between the output of the algorithm proposed in [3] and that of the modified one we propose. The picture on the left is the original Codebook algorithm output, while the picture on the right is the modified Codebook algorithm output. Note that the modification gives more accurate detection than the original.
Fig. 4. Example of comparison between the original and the modified Codebook outputs.
Fig. 5 shows another comparison between the two algorithms. An outdoor video of moving trees was processed by both (1000-frame training set). The top picture of Fig. 5 shows frame number 1001 of the original video, followed by the original Codebook output on the left and the modified Codebook output on the right. The outputs in Fig. 5 show that our modification reduces false alarm detection compared with the original algorithm.
Fig. 5. Another example of comparison between the original and the modified Codebook outputs.
Many tests applied to the original Codebook and the modified one confirmed these results. In fact, the RGB to YUV color space conversion smooths the highly contrasted areas of the image and reduces the noisy details that lead to false detection alarms or reduced detection sensitivity.

III. HUMAN DETECTION USING SKELETONIZATION ALGORITHM
After the moving objects are detected, we determine which of them are humans. To do so we use the Skeletonization method [6], but first we apply pre-processing morphological operations to refine the Codebook output. First, the output frame is filtered with a median filter to remove any residual noise. Then, to remove small holes in the object, the detected object is dilated twice and eroded once. The double dilation prevents the loss of thin arms and legs. Fig. 6 shows the result before and after the pre-processing operation.
Fig. 6. Pre-processing refinement.
Note that some noise is also extended, but this does not affect the detection process, since small objects are not taken into consideration. After refining the Codebook output, we calculate the contours of all moving objects. Small contours are considered noise and are omitted. Unlike the algorithm proposed in [6], which requires calculating the coordinates of five points (which is not always possible), we only calculate three points: the centroid point and the extreme points (head and feet).
We also propose a new criterion based on the horizontal distance between the head and the centroid, scaled to the contour; this prevents moving vehicles from being detected.
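Assuming the same OpenCV environment as the timing simulation [12], the refinement and contour selection just described might look like the following sketch (modern C++ API; the kernel size and the minimum contour area are illustrative choices, not values from the paper):

    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Refine the binary Codebook output, then keep only large moving blobs.
    std::vector<std::vector<cv::Point>> extractMovingObjects(const cv::Mat& fgMask) {
        cv::Mat refined;
        cv::medianBlur(fgMask, refined, 3);              // remove salt-and-pepper noise
        cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
        cv::dilate(refined, refined, kernel, cv::Point(-1, -1), 2);  // dilate twice
        cv::erode(refined, refined, kernel);             // erode once

        std::vector<std::vector<cv::Point>> contours, kept;
        cv::findContours(refined, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
        for (const auto& c : contours)
            if (cv::contourArea(c) > 200.0)              // drop small contours as noise
                kept.push_back(c);
        return kept;
    }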
Fig. 7. Feature extraction of a human: the centroid C, the head distance d (D3) and the foot distances a (D1) and b (D2).
Fig. 7 shows the extraction process for a human object. First, the centroid 'C' is calculated as the average point of the contour points. Second, the distance between the centroid 'C' and every other contour point is calculated; the resulting distance signal is smoothed using a low-pass filter, as shown in Fig. 8.
Fig. 8. Extreme points' extraction: the centroid-to-contour distance signal is smoothed with a low-pass filter (LPF); its extrema give the distances D1, D2 and D3.
We now have the three distances between the centroid and the head (d) and the two feet (a and b): D3, D1 and D2, respectively (Fig. 7). We also calculate the horizontal distance between the head and the centroid, scaled to the contour value (Dx). The scaling operation is necessary to detect persons passing at different distances from the camera; thus, there is no need to determine the scene depth. The features are then tested against empirical bounds α1, ..., α7, chosen as follows: α1 = 0.9, α2 = 1.5, α3 = 0.9, α4 = 1.5, α5 = 0.8, α6 = 1.2, α7 = 0.1 (a hedged sketch of the test is given after Fig. 9). The first term specifies the ratio between leg length and torso length. The second term requires the two legs to be of almost the same length, and the third term allows the human to bend by a limited angle while running or walking. The results in Fig. 9 are taken from the same video mentioned previously. Note that the moving car is not detected, only the walking man; the centroid, head and feet points are also marked.
Fig. 9. Results of human identification.
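The inequalities themselves did not survive in this copy of the paper. A plausible form, consistent with the stated meaning of the three terms and the α values above, is sketched below; the exact inequalities, and especially the operand of the third (bending) term, are our assumptions:

    #include <cmath>

    // D1, D2: centroid-to-feet distances; D3: centroid-to-head distance (Fig. 7);
    // Dx: horizontal head-centroid offset scaled to the contour size;
    // bend: the quantity bounded by [a5, a6] in the third term -- its exact
    // definition could not be recovered from this copy, so it is a parameter here.
    bool looksHuman(double D1, double D2, double D3, double Dx, double bend) {
        const double a1 = 0.9, a2 = 1.5;  // leg length vs. torso length
        const double a3 = 0.9, a4 = 1.5;  // both legs of almost the same length
        const double a5 = 0.8, a6 = 1.2;  // allowed bending while walking/running
        const double a7 = 0.1;            // head nearly above centroid: rejects vehicles

        const double legs = 0.5 * (D1 + D2);                      // mean leg length
        const bool term1  = a1 <= legs / D3 && legs / D3 <= a2;   // assumed form
        const bool term2  = a3 <= D1 / D2 && D1 / D2 <= a4;       // assumed form
        const bool term3  = a5 <= bend && bend <= a6;
        const bool headOk = std::fabs(Dx) <= a7;                  // proposed Dx criterion
        return term1 && term2 && term3 && headOk;
    }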
IV. SYSTEM INSTALLATION ON THE FPGA BOARD
To increase the system's performance and robustness, and to make it embedded, the system is installed on an FPGA board: the Cyclone™ II DSP development board from Terasic. The hardware on which the system is installed consists of an FPGA chip, Flash, SDRAM and SSRAM devices, and a VGA decoder.
Fig. 10. System block diagram: the Flash, SDRAM and SSRAM devices and the VGA decoder around the FPGA chip, which hosts the Nios II processor, on-chip RAM, Scatter-Gather DMA and display controller, each in its clock domain.
Fig. 10 shows how the system devices are connected and shows the data flow:
→ The video is stored in the Flash device.
→ The raw frames are moved frame-by-frame to the on-chip RAM through a DMA.
→ The Nios II fetches the instructions stored in the SDRAM device, processes the raw frame, stores the background model in the SDRAM device and stores the processed frame in the SSRAM device.
→ The processed frame is moved serially to the VGA decoder through a display controller to be displayed.
An FPGA solution is preferred for implementing the suggested design. By providing the possibility of building the whole system on a single chip, an FPGA helps ease the design procedure, reduce the occupied size, increase timing performance and lessen power consumption. In addition, a soft processor can be built and integrated within the FPGA, so some of the non-critical functions and complex decision-making stages can be implemented in software.
A. The Nios II processor
The Nios II is a soft-core, 32-bit, fixed-point processor. It is used to realize complex, non-time-critical functions. The fast version of the Nios II is implemented in order to obtain the high performance needed to manage the high throughput of our application. This version provides data and instruction caching using on-chip RAM, dynamic branch prediction, barrel shifting, and the use of embedded multipliers and hardware dividers. External memories and other board peripherals are interfaced to the Nios II through built-in controllers, as shown in Fig. 11, where the system components are connected to each other through the Avalon switch fabric network.
Fig. 11. System diagram: the Nios II processor, timer, on-chip memory, SG-DMA and the Flash, SSRAM and SDRAM controllers with their devices, plus the video sync generator and VGA decoder, interconnected (with arbitration and multiplexing) through the Avalon switch fabric network.
Note that there are two types of network interfaces: Avalon-MM (Memory-Mapped) and Avalon-ST (Streaming). The first is used between multiple masters and multiple slaves, with arbitration and multiplexing. The second allows point-to-point data transfer, with data bursting enabled to provide a high transfer speed.
B. Frame acquisition
The video is stored in the Flash device. Raw frames are moved one by one to the on-chip RAM through a Scatter-Gather DMA, configured as a memory-to-memory transmitter: it reads the data located in the Flash device and writes it to the on-chip RAM. Unlike the normal DMA, the Scatter-Gather DMA can follow a linked descriptor list (each descriptor defines one transfer). The whole list is transferred without CPU intervention, because switching between descriptors is handled in hardware, with only one clock cycle per switch. Thus, the Scatter-Gather DMA controller core provides a significant performance enhancement over the normal DMA controller core, which can only queue one transfer at a time. Note that the CPU does not perform the transmission of the raw frames; it only sends commands to the Scatter-Gather DMA to start the transmission and receives its interrupts. As soon as a frame has been transmitted, the Scatter-Gather DMA interrupts the CPU to start processing the frame stored in the on-chip RAM. A sketch of such a descriptor chain follows.
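As an illustration, a memory-to-memory descriptor chain built with Altera's SG-DMA HAL driver might look like the sketch below. The device name, buffer addresses and lengths are placeholders, and the driver signatures are quoted from memory of the altera_avalon_sgdma API; they should be checked against the handbook [13] for the tool version in use:

    #include "altera_avalon_sgdma.h"              // Altera HAL SG-DMA driver [13]
    #include "altera_avalon_sgdma_descriptor.h"

    // Descriptor chain in DMA-visible memory; the last entry only terminates
    // the chain, following the usual driver convention (verify per version).
    static alt_sgdma_descriptor desc[3] __attribute__((aligned(0x20)));

    void copy_frame(alt_u32* flash_src, alt_u32* onchip_dst, alt_u16 half_len) {
        alt_sgdma_dev* dma = alt_avalon_sgdma_open("/dev/sgdma_0");  // placeholder name

        // Each descriptor defines one transfer; linking them lets the hardware
        // switch descriptors without interrupting the CPU.
        alt_avalon_sgdma_construct_mem_to_mem_desc(
            &desc[0], &desc[1], flash_src, onchip_dst, half_len, 0, 0);
        alt_avalon_sgdma_construct_mem_to_mem_desc(
            &desc[1], &desc[2], flash_src + half_len / 4,
            onchip_dst + half_len / 4, half_len, 0, 0);

        // Start the whole chain; the CPU is free until the completion interrupt.
        alt_avalon_sgdma_do_sync_transfer(dma, &desc[0]);
    }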
C. Processing
In order to run our algorithm on the Nios II, the code must be written using fixed-point variables. The conversion from floating-point to fixed-point relies on the barrel shifter, which needs only one clock cycle to perform an n-bit shift. Furthermore, since the raw frame is stored in the on-chip RAM, the processed frame in the SSRAM device and the background and cache models in the SDRAM device, variable allocation is taken into consideration. Using a high-resolution timer, we estimated the time consumption of the code sections as follows:
→ About 28% of the processing time is consumed by calculating and assigning the codeword parameters.
→ About 62% of the processing time is consumed by the two functions "colordist" and "bounding" (which need access to the codeword parameters).
→ The remaining 10% is consumed by the rest of the code body.
We accelerate processing by building hardware objects for these two functions, so that they execute in a few clock cycles. These objects are combinations of instructions called "custom instructions". Every custom instruction is built from FPGA logic elements, takes only two arguments and needs only one clock cycle to execute. For example, the bounding hardware object needs three custom instructions (two comparators and one AND gate).
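For illustration, a generic Q16.16 fixed-point helper of the kind implied here might look as follows; the format and the α = 0.7 constant are our assumptions, not the deployed code:

    #include <cstdint>

    // Q16.16 fixed point: 16 integer bits, 16 fractional bits.
    using q16_16 = std::int32_t;
    constexpr int Q = 16;
    constexpr q16_16 toFix(double x) { return static_cast<q16_16>(x * (1 << Q)); }

    // Multiply two Q16.16 values; the final >> Q rescale is exactly the kind
    // of shift the Nios II barrel shifter performs in a single clock cycle.
    inline q16_16 fixMul(q16_16 a, q16_16 b) {
        return static_cast<q16_16>((static_cast<std::int64_t>(a) * b) >> Q);
    }

    // Example: the brightness bound Ilow = alpha * Imax with an assumed alpha.
    inline q16_16 ilow(q16_16 imax) { return fixMul(toFix(0.7), imax); }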
D. Display
Results are displayed on a VGA monitor through a VGA output connector and a VGA decoder. The VGA display system imposes a 60 fps display rate, which differs from the processing rate. Consequently, each frame must be re-displayed more than once in order to keep the display rate fixed. For this, a Ping-Pong strategy is applied.
Fig. 12. Ping-Pong block diagram.
Fig. 12 shows how the SSRAM is divided into two buffers (Ping and Pong). The CPU processes the raw frame and stores the processed frame in one of the two buffers (for instance, the Pong buffer), while the Display Controller displays the frame stored in the other buffer (the Ping buffer). Next, the operation is flipped, so the Display Controller displays the frame in the Pong buffer and the CPU stores the new processed frame in the Ping buffer. This strategy requires the following rules to be respected (a minimal flip-logic sketch is given at the end of this subsection):
→ A frame cannot be displayed while it is being stored in one of the two buffers.
→ A frame cannot be stored while it is being displayed.
→ A buffer cannot be used to store a frame if its current frame has not been displayed yet.
The Display Controller [13] in Fig. 13 consists of four units: Scatter-Gather DMA, Dual Clock FIFO, Pixel Converter and Video Sync Generator. The Scatter-Gather DMA plays a very important role here because, as in frame acquisition, the CPU does not have to transmit the processed frames; it only sends commands to the Scatter-Gather DMA and receives its interrupts. Note that here the Scatter-Gather DMA is configured as a memory-to-stream transmitter: it reads the data located in the SSRAM and writes it to the Dual Clock FIFO streaming sink. The Dual Clock FIFO is used to cross from the system clock domain to the display clock domain (i.e., 100 MHz → 25 MHz). One pixel is represented by four bytes; the Pixel Converter unit removes the unused byte while preserving the other synchronization signals. Finally, the Video Sync Generator receives the data from the Pixel Converter and resends it, accompanied by the proper synchronization signals, to the VGA decoder.
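As an illustration only, the flip logic under those three rules might be captured by the following minimal C++ sketch; the bookkeeping flags and names are our own, not the deployed code:

    #include <cstdint>

    // Two frame buffers carved out of the SSRAM.
    struct PingPong {
        std::uint32_t* buf[2];       // buf[0] = Ping, buf[1] = Pong
        volatile int   display_idx;  // buffer currently being displayed
        volatile bool  frame_ready;  // CPU finished writing the other buffer
    };

    // CPU side: always process into the buffer that is NOT being displayed.
    std::uint32_t* writeBuffer(PingPong& pp) { return pp.buf[1 - pp.display_idx]; }

    // Display side, at the end of each displayed frame: flip only when a new
    // processed frame is ready; otherwise re-display the same buffer, which
    // keeps the 60 fps display rate fixed while the processing rate is lower.
    void onFrameEnd(PingPong& pp) {
        if (pp.frame_ready) {
            pp.display_idx = 1 - pp.display_idx;
            pp.frame_ready = false;
        }
    }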
Fig. 13. Display Controller block diagram
V. EMBEDDED SYSTEM EXPERIMENTAL RESULTS
Fig. 14 shows our embedded system using the ALTERA Cyclone™ II DSP development board. The monitor shows the results of applying our algorithm on the described hardware.
Fig. 14. Human motion detection system with FPGA board and monitor.
Due to the excessive and frequent accesses made by the Codebook algorithm to the background model stored in the SDRAM, the overall speed is limited by the access speed of this memory. Experimental results show that background modeling, foreground subtraction and human detection together run at 4 fps. These results can be improved simply by using another board that can drive the SDRAM with a higher clock rate and a wider bus. In fact, the SDRAM used (a DDR II memory supporting a 200 MHz clock frequency and a 64-bit data bus) could not be operated at more than a third of its nominal performance, for the following reasons:
→ The memory could not be operated with more than a 32-bit data bus, due to a limitation of its controller (1/2 performance reduction).
→ The memory could not be operated at more than a 133 MHz clock rate, due to a limitation of its controller (1/3 performance reduction).
→ Row-to-row switching reduces the memory performance by 2/3.

VI. CONCLUSION AND FUTURE WORK
This paper describes an embedded system for real-time human motion detection. The system applies a modified version of the Codebook algorithm to detect moving objects, and a simplified version of the Skeletonization algorithm to detect human motion. The system is installed on an ALTERA Cyclone™ II DSP development board and implemented using the Nios II processor. Enhancements could be made to this system to achieve greater performance and a better architecture:
→ A camera could be connected to the system, to process live video sequences.
→ The hardware could be improved by replacing the board with one having larger and faster memories.

ACKNOWLEDGMENT
The authors would like to thank Dr. Chadi Albitar from the electromechanical department at HIAST for his valuable help.

REFERENCES
[1] T. Horprasert, D. Harwood and L. S. Davis. A statistical approach for real-time robust background subtraction and shadow detection. IEEE Frame-Rate Applications Workshop, Kerkyra, Greece, 1999.
[2] A. W. Moore. Clustering with Gaussian Mixtures. Lecture slides, Carnegie Mellon University, Nov. 10th, 2001.
[3] K. Kim, T. H. Chalidabhongse, D. Harwood and L. Davis. Real-time foreground-background segmentation using Codebook model. Real-Time Imaging, Special Issue on Video Object Processing, Volume 11, Issue 3, 172-185, June 2005.
[4] C. Wren, A. Azarbayejani, T. Darrell and A. P. Pentland. Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.
[5] I. Haritaoglu, D. Harwood and L. S. Davis. A Real Time System for Detecting and Tracking People. Computer Vision Laboratory, University of Maryland College Park, 2004.
[6] H. Fujiyoshi, A. J. Lipton and T. Kanade. Real-Time Human Motion Analysis by Image Skeletonization. IEICE Transactions, 2004.
[7] Z. Halah, R. Deeb and N. Zarka. Real-Time Human Motion Detection and Tracking. ICTTA'08, 391-392, Damascus, Syria, 2008.
[8] H. Meng, M. Freeman, N. Pears and C. Bailey. Real-time human action recognition on an embedded, reconfigurable video processing architecture. Journal of Real-Time Image Processing, Springer Berlin Heidelberg, Volume 3, Number 3, 163-176, September 2008.
[9] V. Nair, P.-O. Laprise and J. Clark. An FPGA-Based People Detection System. EURASIP Journal on Applied Signal Processing, Hindawi Publishing Corporation, 1047-1061, 2005.
[10] P. Mc Curry, F. Morgan and L. Kilmartin. Xilinx FPGA Implementation of a Pixel Processor for Object Detection Applications. Communications and Signal Processing Research Unit, Department of Electronic Engineering, National University of Ireland, Galway, 2001.
[11] L. Lacassagne, A. Manzanera, J. Denoulet and A. Mérigot. High performance motion detection: some trends toward new embedded architectures for vision systems. Journal of Real-Time Image Processing 4, 127-146, 2009.
[12] G. Bradski and A. Kaehler. Learning OpenCV, 1st edition. O'Reilly, September 2008.
[13] ALTERA. Quartus II Handbook, Volume 5, Version 9.0, March 2009.