Proceeding of the IEEE International Conference on Information and Automation Shenyang, China, June 2012
Robust Human Detecting and Tracking Using Varying Scale Template Matching Songmin Jia, Shuang Wang, Lijia Wang, Xiuzhi Li College of Electronic Information & Control Engineering Beijing University of Technology Beijing, China
[email protected] being tracked, and the controlling parameters are the center position and the height of the torso in the images, which can be affected by the variation of light. In this paper, the system detects and tracks human using the disparity images which are attained from stereo cameras. A novel method of template matching which is based on the feature of head-shoulder is proposed in human detecting algorithm. In addition, robust human tracking is performed using the EKF. The EKF is used to locate the position of the target, and it is flexible and effective in the practical environment. When the human is occluded by other objects the EKF is used in predicting the trend of human movement. This paper introduces the architecture of the proposed method and presents some experimental results. The paper consists of 5 sections. Section 2 describes the hardware platform of the proposed system. Section 3 presents the software structure and the proposed methods of human detection and tracking. The experimental results are given in Section 4. Section 5 concludes the paper.
Abstract – This paper employs the methods of human detecting and tracking based on stereo vision in the indoor environment. A novel method of template matching based on head-shoulder model is proposed to detect human. The presented method is achieved by attaining the disparity images from the stereo cameras, and extracting the head-shoulder model of human. Robust human tracking is performed using the EKF. The EKF is used to locate the position of the target and it is flexible and effective in the practical environment. When the human is occluded by other objects, the EKF is used in predicting the trend of human movement. This paper introduces the architecture of the proposed method and presents some experimental results. Index Terms – Human detecting. Human tracking. Stereo vision. Template matching. EKF.
I. INTRODUCTION Recently, the human detecting and tracking technologies are widely used in the field of indoor robot which can help people in the hospitals, offices and so on. Static cameras are often utilized in human detecting and tracking, and in this case the background subtraction is the most popular approach. Recently, many methods have been proposed to model the background, such as Gaussian mixture model [1]. When the platform is mobile, however, as in the case of a mobile robot, none of the background subtraction methods can be applied. Most solutions of human detection rely on different features of human. The method of detecting face [2] [3] is widely used in the system. However, the method is only available when the person faces towards the robot, and the robot can hardly follow behind or even walk next to the person. Besides, the color of clothes is popular in the system of human detection and tracking. Hiroshi Takemura [4] illustrates the method which is based on stereo vision and a laser range sensor using distance information and color. But the robot cannot distinguish human and other objects when the clothes color is not different from other objects. Nowadays, with the development of algorithms for human tracking, important advances have been achieved in the field of human tracking, but there is still space for further improvements, especially in the case of the robots. Chunhua Hu [5] presents a method that detecting and tracking human combines the clothes color features of the person’s upper body and the contour of the head-shoulder. However, the method of human tracking only depends on the pixels changes of objects
978-1-4673-2237-9/12/$31.00 ©2012 IEEE
II. HARDWARE PLATFORM

The experimental hardware platform consists of American Mobile Robots Inc.'s Pioneer3-DX mobile robot and FLEA2 FL2-08S2M cameras made by Point Grey, as depicted in Fig. 1; the stereo cameras are mounted on the mobile robot.

Fig. 1 The experimental platform.
A. Mobile Robot
The P3-DX adopts a two-wheel differential, reversible drive system with a rear caster for balance. A Hitachi H8S microcontroller serves as the interface to the low-level hardware of the P3-DX. The Pioneer 3-DX is available with an embedded PC (EBX PC104+) which gives it a number of interface and connection options: an 8-bit external I/O bus with connections for up to 16 devices, connection of up to 3 PC104 I/O boards, 4 RS-232 connectors, 2 USB connectors, 1 wireless Ethernet station adapter and access point, and up to 16 sonar inputs.
B. Stereo camera
Our stereo system consists of two compact Point Grey Flea2 cameras. The focal length of each camera is about 3.5 mm, and the horizontal field of view is about 75°. The two cameras both look forward and are fixed with nearly parallel optical axes; the baseline is about 140 mm. The grabbed images are fed to the computer over a 1394 cable. The resolution of the images is 800×600.

III. THE SOFTWARE ARCHITECTURE

The software architecture mainly includes three parts: human detection, human tracking and motion control.

A. Human detection
The flowchart of human detection is shown in Fig. 2.

Fig. 2 The flowchart of human detection.

1) Disparity Image
In order to detect and track the target, the presence of a human in the video streams must be identified. However, this task becomes more complicated in the presence of variations in brightness, lighting, contrast levels, poses, and backgrounds. In stereo vision, the disparity image has a strong anti-interference characteristic, and the disparity depends on the precision of matching and on the camera's intrinsic and extrinsic parameters. The robustness of the disparity image to illumination changes is better than that of the common color image and the grey image. Moreover, the distance information contained in the disparity image will be used in tracking. As discussed above, the disparity image is therefore adopted as the input image to realize accurate detection in our research. As shown in Fig. 3(a) and Fig. 3(b), the original images are extracted from the stereo cameras and rectified using the calibration parameters. Then a stereo matching algorithm based on block matching is implemented with OpenCV to attain the disparity image in Fig. 3(c).

Fig. 3 The process of the disparity image: (a) left image, (b) right image, (c) disparity image.

2) Background Elimination
According to the stereo vision principle, there are three formulas relating the camera coordinates and the image coordinates:

xc = B · Xleft / Disparity,   (1)
yc = B · Y / Disparity,   (2)
zc = B · f / Disparity,   (3)

where [xc, yc, zc] is a spatial point's coordinates in the camera coordinate system and [Xleft, Y] is the image coordinates in the image coordinate system. Using camera calibration we can obtain the camera baseline B and the camera focal length f. Transforming formula (3) gives:

Disparity = B · f / zc.   (4)

According to formula (4), we can set a disparity range to attain the relevant part of the disparity image. In this paper, zc ranges from 1 meter to 3 meters in order to eliminate the background.

3) Contour Filter
As is known, the input image contains the target, the background and noise. Using background elimination, some candidate blobs are extracted from the disparity image. In this paper, a contour filter is used to eliminate noise from the candidate blobs. Firstly, we transform the disparity image to the binary image in Fig. 4(b) using OpenCV. Secondly, the contour is extracted from the binary image using the Canny operator with an adaptive threshold, as in Fig. 4(c). The system then filters out blobs by the following rules: (a) the blob is wider than it is high; (b) the blob's height is less than three times its width; (c) the blob's height is more than five times its width; (d) the blob's area is smaller than the threshold. The whole process is shown in Fig. 4.
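As a concrete illustration of the two steps above, the following Python sketch converts the 1–3 m depth band into a disparity band via formula (4) and applies the four blob rules. The focal length in pixels (F_PX) and the area threshold (MIN_AREA) are assumed placeholder values, not parameters reported in the paper.

```python
import numpy as np

# Assumed calibration values: B = 140 mm baseline (from the paper);
# F_PX is a placeholder focal length in pixels, MIN_AREA an assumed threshold.
B_MM = 140.0
F_PX = 700.0
MIN_AREA = 500.0

def disparity_band(z_near_mm, z_far_mm):
    """Invert formula (4), Disparity = B*f/zc: nearer objects have larger disparity."""
    return B_MM * F_PX / z_far_mm, B_MM * F_PX / z_near_mm

def eliminate_background(disparity):
    """Zero out pixels whose depth lies outside the 1 m .. 3 m band."""
    d_min, d_max = disparity_band(1000.0, 3000.0)
    return np.where((disparity >= d_min) & (disparity <= d_max), disparity, 0.0)

def keep_blob(width, height, area):
    """Reject a candidate blob when any of rules (a)-(d) holds."""
    if width > height:            # (a) wider than tall
        return False
    if height < 3.0 * width:      # (b) too squat for a standing person
        return False
    if height > 5.0 * width:      # (c) too elongated
        return False
    if area < MIN_AREA:           # (d) too small
        return False
    return True

disp = np.array([[98.0, 10.0], [40.0, 200.0]])
print(eliminate_background(disp))   # only the disparities in the 1-3 m band survive
print(keep_blob(40, 160, 4000))     # plausible person silhouette -> True
```

In a full pipeline the same rules would be applied to the bounding boxes returned by the contour extraction step.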
Fig. 4 The processing of the contour filter: (a) disparity image, (b) binary image, (c) Canny image, (d) contour filtered image.
4) Head-shoulder model extraction
It is inevitable that the detected human is sometimes occluded by other objects. However, the part from the head to the shoulders can be identified easily and is rarely sheltered, so it can be employed as an ideal feature. The human's rough contour has been attained through the above process; we then preprocess the contour before extracting the head-shoulder model. The chain code table [6] and line code table are applied to fill the contour. In our research, the vertical projection of the object is used first. As shown in Fig. 5(b), point A is the vertex of the vertical projection and point B marks the head width. According to morphology, point C marks the shoulder width, which is the widest point for a common human. The distance H between point A and point C is the height from head to shoulders. Using this method, we gain the identified model from the head to the shoulders of the human. Then the horizontal projection of the identified model is used, as shown in Fig. 5(d). H is the height of the identified model, and W is its width.
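A minimal sketch of the projection step, assuming a filled binary silhouette as input: the per-row pixel count serves as the profile, the first nonzero row stands in for point A, and the widest row is taken as the shoulder line C. The toy silhouette is illustrative only.

```python
import numpy as np

def head_shoulder_profile(silhouette):
    """From a filled binary silhouette, locate the head top (point A) and
    the widest row (point C), and return the head-to-shoulder height H
    and the shoulder width W."""
    widths = silhouette.sum(axis=1)        # per-row pixel count
    rows = np.nonzero(widths)[0]
    top = int(rows[0])                     # point A: top of the head
    shoulder_row = int(np.argmax(widths))  # point C: widest row
    return top, shoulder_row, shoulder_row - top, int(widths[shoulder_row])

# Toy silhouette: a narrow "head" (rows 1-3, width 2) above wider
# "shoulders" (rows 4-6, width 6).
s = np.zeros((8, 8), dtype=int)
s[1:4, 3:5] = 1
s[4:7, 1:7] = 1
print(head_shoulder_profile(s))   # (1, 4, 3, 6)
```

On a real contour image the silhouette would come from the chain-code filling step described above rather than from a hand-built array.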
Fig. 5 The extraction of the identified model: (a) the filled image, (b) the vertical projection, (c) the identified model, (d) the horizontal projection image.

5) Template Matching
Template matching is the most important stage in human detection. In this paper, a novel method of template matching based on the head-shoulder model is proposed. Because the scale of the template can be varied to match the different scales of the head-shoulder model, the method is called Varying Scale Template Matching (VSTM). A standard head-shoulder model is used as the template. The standard template is scaled by comparing its length and width with the current model extracted from the video streams. The height of the template is Ht, while the height of the current head-shoulder model is H; the template is scaled by H/Ht in Fig. 6(b) so that the height of the template equals that of the head-shoulder model. The areas of the template and the current model are defined as St and S respectively. Each line width is then scaled by 1/St for the template and by 1/S for the current model in Fig. 6(c) and Fig. 6(f). In this way, the template and the current head-shoulder model are normalized. The Hausdorff distance (HD) is adopted to match the template with the current head-shoulder model; it measures the degree of mismatch between two sets. In this paper, the modified Hausdorff distance [7] is adopted to find the best matching contour and is defined as:

H(A, B) = (1 / NA) Σa∈A minb∈B ||a − b||,   (5)

where set A = {a1, ..., aH} contains the normalized line widths of the template and set B = {b1, ..., bH} contains the normalized line widths of the current model. || · || is some underlying norm on the point sets A and B; in our research, the Euclidean distance between set A and set B is used.

Fig. 6 The normalization process.
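The normalization and matching steps above can be sketched as follows: each contour is reduced to a vector of per-row widths, normalized, and scored with the modified Hausdorff distance of equation (5). The normalization here divides by the vector's own sum (standing in for the area normalization of Fig. 6), and the acceptance threshold is an assumed tuning value, not one reported in the paper.

```python
import numpy as np

def modified_hausdorff(a, b):
    """Equation (5): the mean, over points of A, of the distance to the
    nearest point of B."""
    diffs = np.abs(a[:, None] - b[None, :])   # |a_i - b_j| for all pairs
    return float(diffs.min(axis=1).mean())

def matches_template(template_widths, model_widths, threshold=0.05):
    """Normalize both width vectors and compare with equation (5)."""
    t = np.asarray(template_widths, float)
    m = np.asarray(model_widths, float)
    return modified_hausdorff(t / t.sum(), m / m.sum()) < threshold

print(modified_hausdorff(np.array([0.0, 1.0]), np.array([0.0, 3.0])))  # 0.5
print(matches_template([2, 2, 6, 6], [1, 1, 3, 3]))  # same shape at 2x scale -> True
```

Note that, as in equation (5), the comparison treats the widths as a set, so the normalization step is what makes the measure scale-invariant.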
The experimental results of human detection are shown in Fig. 7.

Fig. 7 The experimental result of human detection: (a) Frame 15, (b) Frame 25, (c) Frame 35, (d) Frame 55.

B. Human tracking
Tracking a walking person is a very challenging task due to environmental noise and the randomness of human movement. The Extended Kalman Filter (EKF) is a set of mathematical equations that provides an efficient computational means of estimation [8]. Its posterior information is more accurate than the raw sensor measurement. Because the human body is non-rigid and the detected human is easily occluded by other objects, the EKF is used to estimate the human's moving trend in this paper. Human tracking mainly includes three parts:

1) Coordinate transformation
In order to control the robot's motion, the 3D position of the human is needed. Transforming the image coordinates to the robot coordinates is an essential step, because human detection yields only 2D image coordinates. As mentioned above, using the calibration results, the image coordinates can be transformed into camera coordinates. A spatial point's coordinates in the left and right image coordinate systems are Pleft = (Xleft, Yleft) and Pright = (Xright, Yright) respectively, and the disparity is Xleft − Xright. The coordinate systems are defined as shown in Fig. 8.

Fig. 8 The coordinate system definition.

The relation of the coordinate systems is given by:

xc = B · Xleft / Disparity,
yc = B · Y / Disparity,
zc = B · f / Disparity,
Zc [x y 1]^T = A (R T) [Xr Yr Zr 1]^T,   (6)

where A is the camera intrinsic matrix and (R T) the extrinsic rotation and translation.

2) EKF estimation
(1) State equation
The state variables used in the robot coordinate system are:

xr = [Xr Yr Zr Ẋr Ẏr]^T,   (7)

where [Xr Yr Zr] is the 3D position and [Ẋr, Ẏr] are the velocities in the horizontal plane. According to the robot's motion model, the relationship between the position and velocity of a person from time t to time t+1 is as follows:

Xr^(t+1) = (Xr^t − ΔXr) cos Δθ + (Yr^t − ΔYr) sin Δθ,
Yr^(t+1) = −(Xr^t − ΔXr) sin Δθ + (Yr^t − ΔYr) cos Δθ,
Ẋr^(t+1) = Ẋr^t cos Δθ + Ẏr^t sin Δθ − v,
Ẏr^(t+1) = −Ẋr^t sin Δθ + Ẏr^t cos Δθ.   (8)

The distances traveled along the X axis and Y axis from the robot position at time t are defined as ΔXr and ΔYr respectively, and the turned angle is expressed as Δθ. The control variable is control_t = [vl, vr]. Thus the state equation in the robot coordinate system can be expressed as:

xr^(t+1) = f_t(xr^t, control_t) + R_t w_t,   (9)

where w_t is the process noise and

Q_t = Cov(w_t) = E[w_t w_t^T] = σ² [1 0; 0 1].

(2) Observation equation
The human's 3D position observed by the camera is defined as yr^t, so the observation equation can be described as follows:

yr^t = H_t xr^t + p_t,   (10)

where p_t is the observation noise and

R_t = Cov(p_t) = E[p_t p_t^T] = σp² [1 0 0; 0 1 0; 0 0 1].

(3) Robust tracking with EKF
It is dangerous for the robot to move randomly when the target is lost. To avoid this situation, the target's position at the next time t+1 can be predicted using the EKF. The prior estimate at time t+1 is used as the observation value of the EKF. When the observation value is obtained, the state
variables can be updated. After EKF filtering, the position of the target is sent to the robot controller to realize robust tracking.

C. Motion control
An appropriate motion rule is used to drive the robot from its current position to the target position. The robot is driven by the left wheel and the right wheel. The wheel velocity is kept slow when the system cannot detect the target. Considering the motion model of the robot, we use the following control method [9][10]:

vl = v(1 − 2KdYr / (Xr² + Yr²)),
vr = v(1 + 2KdYr / (Xr² + Yr²)),   (11)

where 2d is the distance between the left wheel and the right wheel, and K is a motion parameter that controls how smoothly the robot turns.
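One step of the tracking-control loop above can be sketched as follows, combining the motion model of equation (8) (the Z coordinate is omitted for brevity) with the wheel-speed law of equation (11). The odometry increments and the values of v, K and d are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def predict_state(state, dx, dy, dtheta, v):
    """Motion model of equation (8): re-express the target's position and
    velocity in the robot frame after the robot moves (dx, dy) and turns
    dtheta at forward speed v."""
    X, Y, VX, VY = state
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([
        (X - dx) * c + (Y - dy) * s,
        -(X - dx) * s + (Y - dy) * c,
        VX * c + VY * s - v,
        -VX * s + VY * c,
    ])

def wheel_speeds(X, Y, v=120.0, K=0.5, d=165.0):
    """Differential-drive law of equation (11); d is half the wheel
    separation and K the smoothing gain (illustrative values)."""
    turn = 2.0 * K * d * Y / (X * X + Y * Y)
    return v * (1.0 - turn), v * (1.0 + turn)

# A target straight ahead (Y = 0) yields equal wheel speeds, i.e. no turn.
print(wheel_speeds(1000.0, 0.0))   # (120.0, 120.0)
```

A lateral offset Y makes the inner wheel slow down and the outer wheel speed up by the same amount, which is what steers the robot smoothly toward the target.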
IV. EXPERIMENT RESULTS

A. Human detection
As is known, the Hu moment is a common template matching method based on the contour of the target. In this paper, we compare Varying Scale Template Matching with the Hu moment method in terms of delay and accuracy. The results are shown in Fig. 9, Fig. 10 and Table I. As shown in Fig. 9 and Fig. 10, wrong targets are frequently detected by the Hu moment method, while the system rarely detects a wrong target using VSTM. Analysis of the experimental results shows that the running times of the two methods are approximately equal, but VSTM is more accurate, as shown in Table I.

Fig. 9 Human detecting based on Hu moment: (a) Frame 8, (b) Frame 23, (c) Frame 40, (d) Frame 55.

Fig. 10 Human detecting based on varying scale template matching: (a) Frame 8, (b) Frame 23, (c) Frame 40, (d) Frame 55.

Table I COMPARE TWO METHODS OF TEMPLATE MATCHING

Method                          | Time   | Accurate rate
Hu Moment                       | 15.4ms | 95%~97%
Varying Scale Template Matching | 16ms   | 96%~98%

B. Human detection and tracking
Some experiments on human detecting and tracking have been performed in our laboratory using the proposed method. The initial speed of the robot is set to 120 mm/s. Some experimental results are shown in Fig. 11. From the images, it can be seen that the robot can correctly detect the human and track him at different positions. As shown in Fig. 11, when the human turns, the robot also turns in the direction of the human and realizes robust tracking. The experimental results prove that the proposed detection and tracking methods are effective. Fig. 11(a) to Fig. 11(e) are captured from the environment view, and Fig. 11(f) to Fig. 11(j) from the robot's view.

Fig. 11 The experimental results of the human tracking: (a)-(e) environment view, (f)-(j) robot's view.

V. CONCLUSION
In this paper, a robust human detecting and tracking system for indoor environments was proposed. In the human detection phase, Varying Scale Template Matching (VSTM) was proposed to detect the human from the disparity image. In the human tracking phase, the EKF was employed to locate and predict the position of the target. The experimental results verified the effectiveness of the developed system. However, when the human was occluded for a long time, the predicted position was inaccurate and the target was lost. For future work, human recognition technology will be applied in our system to solve this problem. In addition, we will further improve the VSTM algorithm to enhance the accuracy of human detection, and more image analysis and processing methods will be used to reduce the impact of lighting on the disparity image.

ACKNOWLEDGMENT
The research work is financially supported by the Key Program of Beijing Natural Science Foundation (B), National Natural Science Foundation (61175087), National Natural Science Foundation (61105033), and the Scientific Research Starting Foundation for the Returned Overseas Chinese Scholars, Ministry of Education of China.
REFERENCES
[1] Duan-Yu Chen, Kevin Cannons, Hsiao-Rong Tyan, Sheng-Wen Shih, and Hong-Yuan Mark Liao, "Spatiotemporal Motion Analysis for the Detection and Classification of Moving Targets," IEEE Transactions on Multimedia, vol. 10, no. 8, December 2008.
[2] M. Bennewitz, F. Faber, D. Joho, M. Schreiber, S. Behnke, "Towards a humanoid museum guide robot that interacts with multiple persons," 5th IEEE-RAS International Conference on Humanoid Robots, 2005, pp. 418-423.
[3] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, G. Sagerer, "Providing the basis for human-robot-interaction: a multi-modal attention system for a mobile robot," International Conference on Multimodal Interfaces, 2003, pp. 28-35.
[4] Hiroshi Takemura, Zentaro Nemoto, and Hiroshi Mizoguchi, "Development of Vision Based Person Following Module for Mobile Robots In/Out Door Environment," International Conference on Robotics and Biomimetics, Guilin, China, December 19-23, 2009.
[5] Chunhua Hu, Xudong Ma, Xianzhong Dai, "A Robust Person Tracking and Following Approach for Mobile Robot," International Conference on Mechatronics and Automation, Harbin, China, August 5-8, 2007.
[6] Kai Song, Ji, "Application of chain codes table and line segment table in computer image processing," Journal of Liaoning Technical University, Apr. 2007.
[7] D. P. Huttenlocher, W. J. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 850-863, 1993.
[8] D. V. Thombre, J. H. Nirmal, "Human Detection and Tracking using Image Segmentation and Kalman Filter," K. J. Somaiya College of Engineering, Mumbai.
[9] Junji Satake, Jun Miura, "Robust stereo-based person detection and tracking for a person following robot," Proc. ICRA 2009 Workshop on Person Detection and Tracking, Kobe, Japan, May 2009.
[10] Songmin Jia, Liang Zhao, Xiuzhi Li, Wei Cui, "Autonomous Robot Human Detecting and Tracking Based on Stereo Vision," International Conference on Mechatronics and Automation, Beijing, China, August 7-10.