Human Target Detection and Tracking Under Parallel Binocular Cameras

XING Yufeng, ZHU Qiuyu, YUAN Sai and ZHAO Baozhu
School of Communication & Information Engineering, Shanghai University, Shanghai, China

Abstract—This paper proposes a human body target tracking algorithm based on depth information under parallel binocular cameras. The key issue of pedestrian target tracking in a parallel binocular camera system is how to achieve the correspondence between the cameras. We obtain the intrinsic and extrinsic parameters by camera calibration. Then, the foreground target region is obtained by Gaussian background differencing of the current and background images, and it is used to obtain the parallax image of the human body target through region matching. Finally, the locations and depth information of the human bodies are obtained by projection of the parallax information and are used to track the human targets with a Kalman filter. The experimental results show that the proposed algorithm has high accuracy and is feasible in the real world.

Keywords—depth information, camera calibration, parallax, target tracking

I. INTRODUCTION

Video surveillance is widely used in our life. How to predict human behavior effectively is one of the major concerns of researchers. Monocular, binocular and multiple cameras are all used to detect human targets with video image processing technology. The disadvantage of a monocular camera is that it tends to produce false detections or lose targets when a crowd is seriously occluded. In multi-camera monitoring, the viewing angles between cameras differ greatly because of the wide baseline, so it is hard to find matching feature points. Binocular cameras suffer far less from these problems, and a parallel binocular setup additionally has a small computational cost and makes feature matching easy to implement. Liang Zhao [1] proposed a three-step algorithm based on stereo vision to detect pedestrians in city streets; it can adapt to different postures, lighting, backgrounds, etc., but pedestrian segmentation is error-prone when traffic is heavy. Stephen J. Krotosky [2] put forward an algorithm that combines size, shape and parallax to detect head positions, but only for vehicle and single-person detection. Ya-Li Hou and Grantham K. H. Pang [3] proposed a cue-based crowd segmentation method in stereo vision that uses shape characteristics to mark pedestrians with rectangular boxes; pedestrians are then confirmed according to their depth information. Li Jian [4] proposed a fast moving-object detection algorithm that combines an improved frame-difference method with an improved optical-flow method: the improved frame-difference method yields precise moving-target areas, and the improved optical-flow method extracts the optical-flow feature points of the moving-object region accurately even under discontinuous brightness conditions; the optical flow vectors are thresholded, and the moving targets are then detected.

This work was supported by the Development Foundation of Shanghai Municipal Commission of Science and Technology (13dz1202404).


This paper discusses the detection and tracking of human bodies under a parallel binocular camera environment, which is the foundation of subsequent processing such as trajectory analysis and behavior recognition. The system includes camera calibration, target detection, image matching, target positioning and target tracking.

II. BINOCULAR VISUAL MODEL AND CAMERA CALIBRATION

Camera calibration models the process of optical imaging. Four coordinate systems are used to describe this process: the image pixel coordinate system, the imaging plane coordinate system, the camera coordinate system and the world coordinate system, as shown in Fig. 1.

Fig. 1. Linear model calibration process.

The imaging plane coordinate system is perpendicular to the Zc axis of the camera coordinate system, and the distance from the optical center of the camera to the imaging plane is the focal length f. The image pixel coordinate system measures positions in pixels. The correspondence between a spatial point and a pixel in the monocular visual system is

$$\begin{cases} \dfrac{1}{Z_c} X_c = \dfrac{1}{f}(u - u_0)\,dx \\[4pt] \dfrac{1}{Z_c} Y_c = \dfrac{1}{f}(v - v_0)\,dy \end{cases} \quad (1)$$

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ 0_3^T & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = M_1 \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \quad (2)$$

In equations (1) and (2), dx and dy are the physical size factors of a pixel in the image coordinate system, and u0, v0 are the principal point coordinates in the image coordinate system; R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector. The parameters u0, v0, f, dx, dy are related only to the internal structure of the camera, so they are called the intrinsic parameters. The matrix M1 is determined by the position and orientation of the camera relative to the world coordinate system, so it is called the extrinsic parameters. All these parameters must be calibrated first.
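As an illustration of equations (1) and (2), the following Python sketch projects a world point into pixel coordinates; all parameter values are placeholders, not the calibrated values reported in Section V.

import numpy as np

def world_to_pixel(P_w, R, T, f, dx, dy, u0, v0):
    """Project a 3D world point to pixel coordinates using eqs. (1)-(2)."""
    # Eq. (2): world -> camera coordinates, [Xc Yc Zc]^T = R [Xw Yw Zw]^T + T
    Xc, Yc, Zc = R @ P_w + T
    # Eq. (1) rearranged: u = u0 + f*Xc/(Zc*dx), v = v0 + f*Yc/(Zc*dy)
    u = u0 + f * Xc / (Zc * dx)
    v = v0 + f * Yc / (Zc * dy)
    return u, v

# Example with placeholder intrinsics/extrinsics (assumed, not the paper's)
R, T = np.eye(3), np.array([0.0, 0.0, 0.0])
print(world_to_pixel(np.array([100.0, 50.0, 2300.0]), R, T,
                     f=282.0, dx=1.0, dy=1.0, u0=206.0, v0=118.0))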

The binocular vision model in which the optical axes of the two cameras are parallel is one of the simplest stereo vision models, as shown in Fig. 2.

Fig. 2. Parallel-optical-axis binocular transverse imaging model: the spatial point P(Xc, Yc, Zc) is imaged at pl(xl, yl) and pr(xr, yr) on the left and right image planes Il and Ir, with optical centers Cl, Cr, image-plane centers Ol, Or and baseline B.

Fig. 3 is the top view of Fig. 2 and illustrates the triangle ranging principle of the parallel binocular camera system. Cl and Cr are the optical centers of the left and right cameras respectively, and Ol and Or are the centers of the left and right image planes. ll and lr are the displacements of the imaging positions Pl and Pr of the spatial point P relative to Ol and Or. f is the focal length and b is the length of the baseline.

Fig. 3. The range finding principle of the parallel binocular camera system

d is the distance between the spatial point P and the imaging plane, and it is proportional to Zc in equation (1). From the triangular geometry we obtain

$$d = \frac{bf}{l_r - l_l} \quad (3)$$

In equation (3), lr − ll is the parallax of the spatial point P between the left and right image planes. The distance from P to the imaging plane is inversely proportional to the parallax: the larger the parallax, the closer the object is to the cameras, and vice versa.

The main point difference is the difference between the principal point coordinates of the left and right images, and it is crucial for accurate disparity computation. As the precision of the parameters obtained from the calibration board is not high, we determine the main point differences manually for higher precision:

$$\begin{cases} x_l = x_r + s\,l_x + s\,d \\ y_l = y_r + s\,l_y \end{cases} \quad (4)$$

In equation (4), xl and xr are the x coordinates of a point of interest in the left and right images respectively, and yl and yr are the corresponding y coordinates. lx and ly are the main point differences in the x and y directions. The value of s is ±1, depending on whether the left or the right image is taken as the reference, and d is the parallax. We mount the cameras at a specific height, 2.3 m in our experiment, and photograph the ground. We select prominent corresponding points of the ground in the left and right images to obtain the pixel coordinate differences xl − xr and yl − yr. With the known height of 2.3 m, equation (3) gives the corresponding parallax d, and finally the main point differences lx and ly are obtained from equation (4).
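A minimal Python sketch of equations (3) and (4), assuming the left image is the reference (s = +1); the baseline, focal length and main point difference values below are placeholders, not the calibrated ones.

def depth_from_parallax(parallax, b, f):
    # Eq. (3): distance of spatial point P from the cameras, d = b*f / (lr - ll)
    return b * f / parallax

def parallax_from_match(x_left, x_right, lx, s=1):
    # Eq. (4) solved for d with the left image as reference:
    # xl = xr + s*lx + s*d  ->  d = (xl - xr - s*lx) / s
    return (x_left - x_right - s * lx) / s

# Placeholder values for illustration only
d = parallax_from_match(x_left=162.0, x_right=131.0, lx=9.2)
print(depth_from_parallax(d, b=120.0, f=282.0))   # depth in the baseline's unit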

III. DETECTION AND DISPARITY COMPUTATION

A. Foreground Detection

We adopt a Gaussian mixture background model to detect the human targets, since the scene changes constantly in real time. Considering that background pixels follow a multi-peak distribution, several Gaussian models are used to describe the state of each pixel value over a period of time.

For the detected foreground targets of the left image, the disparity of every pixel is computed and its 3D position is obtained, which is then used for object projection and tracking.
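The paper's Gaussian mixture background model is not given as code; as a stand-in, the sketch below uses OpenCV's BackgroundSubtractorMOG2, which models each pixel with several Gaussians in the same spirit. The video file name and the parameter values are assumptions.

import cv2

# Gaussian mixture background model (OpenCV implementation as a stand-in)
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16,
                                                detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))

cap = cv2.VideoCapture("left_camera.avi")   # hypothetical left-camera stream
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)       # per-pixel foreground mask
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)  # remove speckle
cap.release()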

B. Disparity Computation

For the disparity computation, once the main point differences are known, we subtract the main point difference in the y direction between the left and right images, so the two-dimensional matching problem is reduced to a one-dimensional search in the x direction. Finding corresponding points, i.e. stereo matching, is the core problem in stereo vision research. The regional matching algorithm takes an m × n window centered on the pixel to be matched in the reference image as the matching primitive; the window is characterized by its grey-level distribution, and the best matching pixel is then searched for in the other image. Fig. 4 is a diagram of the stereo matching process; the window size shown there is 5 × 3.


Fig. 4. The principle of region matching algorithm

Regional matching is mainly concerned with the similarity measurement function and the size of the window. The similarity measurement function, also called the cost function, measures the degree of similarity between pixels. At present, the most common cost functions are Normalized Cross Correlation (NCC), Sum of Absolute Differences (SAD) and Sum of Squared Differences (SSD). We mainly use NCC:

$$NCC(x,y,d) = \frac{\sum_i \sum_j \left[ I_l(x+i, y+j) - \bar{I}_l(x,y) \right]\left[ I_r(x+i+d, y+j) - \bar{I}_r(x,y) \right]}{\sqrt{\sum_i \sum_j \left[ I_l(x+i, y+j) - \bar{I}_l(x,y) \right]^2 \sum_i \sum_j \left[ I_r(x+i+d, y+j) - \bar{I}_r(x,y) \right]^2}} \quad (5)$$

where (x, y) are the current pixel coordinates, d is the coordinate difference in the x direction between the left and right images Il, Ir, and the barred terms are the mean grey values over the window. The maximum of NCC gives the best matching pixel, and polynomial curve fitting can be used to reach sub-pixel accuracy. The simplest fitting method is quadratic parabola fitting: the coefficients of the parabola are determined by the correlation coefficients in the neighbourhood of the most relevant pixel, and the extremum of the quadratic curve gives the best matching position, which realizes sub-pixel matching. The NCC function largely eliminates the influence of brightness differences between the left and right images. In our experiment, the image size is 368 × 240 and the best matching length is 4% of the image width, so the window size is 15 × 3. The parallax of every point is the difference between the best-matched coordinate difference and the main point difference in the x direction; the depth of the corresponding point is then obtained from equation (3).
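A NumPy sketch of the NCC search of equation (5) along one row, assuming the y main point difference has already been removed so that the search is one-dimensional; the window size and search range are illustrative and the left image is the reference. A quadratic (parabola) refinement of the best score is included, as described above.

import numpy as np

def ncc_disparity(left, right, x, y, half_w=7, half_h=1, max_d=40):
    # Zero-mean NCC of eq. (5) for pixel (x, y); window is (2*half_h+1) x (2*half_w+1)
    wl = left[y-half_h:y+half_h+1, x-half_w:x+half_w+1].astype(float)
    wl -= wl.mean()
    scores = np.full(max_d, -1.0)
    for d in range(max_d):
        if x + half_w + d >= right.shape[1]:
            break
        wr = right[y-half_h:y+half_h+1, x-half_w+d:x+half_w+1+d].astype(float)
        wr -= wr.mean()
        denom = np.sqrt((wl**2).sum() * (wr**2).sum())
        if denom > 0:
            scores[d] = (wl * wr).sum() / denom
    d0 = int(scores.argmax())
    # Sub-pixel refinement: parabola through the best score and its neighbours
    if 0 < d0 < max_d - 1 and scores[d0-1] > -1 and scores[d0+1] > -1:
        curv = scores[d0-1] - 2*scores[d0] + scores[d0+1]
        if curv != 0:
            return d0 + 0.5 * (scores[d0-1] - scores[d0+1]) / curv
    return float(d0)

The sign convention of d (whether the match lies towards larger or smaller x in the other image) depends on the camera arrangement; here the search runs towards increasing x, matching the form of equation (5).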

IV. PROJECTION AND TRACKING

A. Projection

The foreground targets in the parallax image are projected onto the ground plane so that individual objects can be tracked in the ground plane.

Since Zc, which is proportional to the depth, is known, together with the intrinsic parameters u0, v0, f, dx, dy, the coordinates Xc, Yc can be obtained from equation (1). The extrinsic parameters R, T are also known, so Xw, Yw, Zw can be obtained from equation (2) by computing the inverse of the matrix M1. Zw represents the height of the target in the world coordinate system. Before tracking, the centroid of each target has to be determined. Since the grey values at target positions are larger than those of the background, whose grey value is zero, the target contours are easy to obtain, and the centroid of each target contour can then be computed.
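A sketch of the back-projection described above, assuming Zc has already been recovered from the parallax via equation (3); the parameter values are placeholders, not the paper's calibration.

import numpy as np

def pixel_to_world(u, v, Zc, f, dx, dy, u0, v0, R, T):
    # Eq. (1): recover camera coordinates from the pixel and its known depth
    Xc = Zc * (u - u0) * dx / f
    Yc = Zc * (v - v0) * dy / f
    P_c = np.array([Xc, Yc, Zc])
    # Eq. (2) inverted (top three rows of M1): world = R^T (camera - T)
    return R.T @ (P_c - T)

# Placeholder parameters for illustration only
R, T = np.eye(3), np.zeros(3)
Xw, Yw, Zw = pixel_to_world(180.0, 130.0, Zc=2300.0, f=282.0, dx=1.0, dy=1.0,
                            u0=206.0, v0=118.0, R=R, T=T)
print(Xw, Yw, Zw)   # Zw is the height of the point in the world frame

The ground-plane position of a target is then the (Xw, Yw) centroid of its projected foreground pixels.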

B. Tracking

The Kalman filter is an optimal recursive data processing algorithm. When the tracking system is relatively stable, the Kalman filter predicts the position very well after the transitional state of initial tracking, because the prediction error is small, unbiased and stable. The Kalman filter is a linear recursive filter in which the system is described by dynamic state and observation equations. Any point can be taken as the starting point of observation, and the recursive filtering method computes the optimal estimate of the next state. The Kalman filter state and observation equations are, respectively,

$$X_k = \varphi_{k,k-1} X_{k-1} + W_{k-1} \quad (6)$$

$$Y_k = H_k X_k + V_k \quad (7)$$

where $X_k$ is the 3 × 1 state vector $(x_k, y_k, v_k)^T$, representing the position and velocity of the target, and $Y_k$ is the 2 × 1 observation vector $(x_k, y_k)^T$, representing the observed position of the target. $\varphi_{k,k-1}$ is the 3 × 3 state transition matrix

$$\varphi_{k,k-1} = \begin{bmatrix} 1 & 0 & T \\ 0 & 1 & T \\ 0 & 0 & 1 \end{bmatrix}$$

where T is the time interval between $t_k$ and $t_{k-1}$. $H_k$ is the 2 × 3 observation matrix

$$H_k = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

at time $t_k$. $W_{k-1}$ is the 3 × 1 white-noise vector at time $t_{k-1}$, and $V_k$ is the 2 × 1 observation noise vector at time $t_k$. With the Kalman filter we update the current system state $X_k$ and predict the future state $X_{k+1}$. Combined with the depth information of the targets, the tracking error rate is much lower.
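A compact NumPy sketch of the constant-velocity Kalman filter defined by equations (6) and (7), using the transition and observation matrices given above; the noise covariances are assumed values, not taken from the paper.

import numpy as np

class CentroidKalman:
    # Tracks a target centroid with state (x, y, v), per eqs. (6)-(7)
    def __init__(self, x0, y0, dt=1.0, q=1e-2, r=1.0):
        self.F = np.array([[1.0, 0.0, dt],
                           [0.0, 1.0, dt],
                           [0.0, 0.0, 1.0]])        # phi_{k,k-1}
        self.H = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])        # observation matrix H_k
        self.Q = q * np.eye(3)                      # assumed process noise
        self.Rn = r * np.eye(2)                     # assumed observation noise
        self.x = np.array([x0, y0, 0.0])
        self.P = np.eye(3)

    def predict(self):
        # Eq. (6) without the noise term: propagate state and covariance
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        # Eq. (7): correct the prediction with the measured centroid (zx, zy)
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.Rn
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(3) - K @ self.H) @ self.P
        return self.x[:2]

For each frame, predict() gives the expected centroid position and update() corrects it with the centroid measured from the projected parallax image.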

V. EXPERIMENTS AND RESULTS ANALYSIS

We adopt an 8 × 8 calibration board, and the camera parameters are calibrated mainly with Zhang Zhengyou's method [7]. The plane calibration steps are as follows:
• Print a calibration board and attach it to a plane that is smooth, flat and will not deform.
• Take several template photos from different angles, each angle being different; the calibration board should occupy a large part of the image.
• Detect the corners of the calibration board in each photo.
• Compute the intrinsic and extrinsic parameters of the camera and obtain the optimal solution.

We use the Camera Calibration Toolbox for Matlab [6] with the 8 × 8 calibration board to calibrate the cameras. The extracted corners are shown in Fig. 5; they are precise, which makes the calibration more accurate. We took 40 photos with our system, 20 from the left camera and 20 from the right camera, captured by the two cameras at the same time; the left camera provides the reference image.

Fig. 5. Extracting grid corners
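The paper calibrates with the Matlab toolbox; as an illustration, the sketch below runs an equivalent procedure with OpenCV's calibrateCamera, under assumed file names, square size and board geometry (an 8 × 8 board has 7 × 7 inner corners).

import glob
import cv2
import numpy as np

pattern = (7, 7)                                   # inner corners (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 30.0  # 30 mm squares (assumed)

obj_points, img_points, size = [], [], None
for path in glob.glob("left_*.jpg"):               # hypothetical image names
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(gray, corners, (5, 5), (-1, -1),
                                   (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, size, None, None)
print("reprojection error:", rms)
print("intrinsic matrix:\n", K)                    # contains f/dx, f/dy, u0, v0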

The focal length of the cameras:

f_c = [282.36413, 282.02598] ± [2.65209, 2.63251]

We use lenses of the same specification, so the focal length of the right camera is approximately equal to that of the left camera.

The principal point of the right camera:

[u_0, v_0] = [215.18496, 135.63030] ± [7.07482, 7.05549]

The principal point of the left camera:

[u_0, v_0] = [205.98367, 118.33623] ± [1.79617, 1.93070]

Then we calibrate the extrinsic parameters. The origin of the world coordinate system is the lower right corner of a calibration image, as shown in Fig. 6.

Fig. 6. The world coordinate system

Because the cameras are parallel, the average of the rotation vectors of the left and right cameras is closer to the accurate value. The translation vector is that of the left camera.

The rotation matrix:

$$R = \begin{bmatrix} -0.027539 & -0.998629 & -0.044510 \\ -0.724839 & 0.050612 & -0.687057 \\ 0.688368 & 0.013342 & -0.725239 \end{bmatrix}$$

The translation vector:

$$T = \begin{bmatrix} 29.804562 & 727.454185 & 2374.545849 \end{bmatrix}$$

The background and current images taken by the left camera are shown in Fig. 7.

Fig. 7. The background and current images of the left camera

We obtain the foreground target regions by Gaussian background differencing of the background and current images, as shown in Fig. 8.

Fig. 8. The foreground target region

The parallax image of the foreground target region is obtained by disparity computation between the left and right images, as shown in Fig. 9.

Fig. 9. The parallax image of the foreground with one target

The projection onto the ground is obtained with the calibration parameters, and the centroid position of the target is shown in Fig. 10.

Fig. 10. The projection result and the centroid position with one target

Finally, we track the target with the Kalman filter. The result is shown in Fig. 11, which shows the tracking effect over 30 frames with one target.

Fig. 11. The tracking effect of 30 frames with one target

The tracking effect with two targets is shown in Figs. 12-14.

Fig. 12. The parallax image of the foreground with two targets

Fig. 13. The projection result and the centroid positions with two targets

Fig. 14. The tracking effect of 15 frames with two human body targets

The results show that the tracking paths are almost identical to the actual paths, whether there is one target or several targets in the video stream.

VI. CONCLUSIONS

This paper describes a method for extracting human targets from a real-time parallel binocular video stream and then tracking them robustly. The key point that makes the system robust is the tracking method based on a combination of Gaussian background differencing, correlation matching and Kalman filtering. Targets are robustly identified in spite of partial occlusions and ambiguous poses, and background clutter is effectively rejected. As the intrinsic and extrinsic parameters of the cameras are obtained by offline calibration, the system is not very flexible. Future work involves online calibration and multiple-target recognition and tracking.

REFERENCES

[1] L. Zhao, C. E. Thorpe, "Stereo- and Neural Network-Based Pedestrian Detection," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 3, 2000, pp. 148-154.
[2] S. J. Krotosky, S. Y. Cheng, M. M. Trivedi, "Real-time stereo-based head detection using size, shape and disparity constraints," IEEE International Symposium on Intelligent Vehicles, 2005, pp. 1-7.
[3] Y. L. Hou, G. K. H. Pang, "Multi-cue-based crowd segmentation in stereo vision," Computer Analysis of Images and Patterns, Springer Berlin Heidelberg, 2011, pp. 93-101.
[4] Li Jian, Lan Jinhui, Li Jie, "A new type of fast moving object detection algorithm," Journal of Central South University (Natural Sciences), 2013, 44(3).
[5] T. Wang, H. Y. Li, S. R. Xie, "Quick Self-calibration Method for Binocular Vision Sensor," Jisuanji Gongcheng (Computer Engineering), 2012, 38(12).
[6] Camera Calibration Toolbox for Matlab, http://www.vision.caltech.edu/bouguetj/calib_doc/index.html, 2013.
[7] Zhang Zhengyou, "Flexible Camera Calibration by Viewing a Plane from Unknown Orientations," Proc. of the 7th International Conference on Computer Vision, Corfu, Greece, 1999, pp. 666-673.
[8] A. J. Lipton, H. Fujiyoshi, R. S. Patil, "Moving Target Classification and Tracking from Real-time Video," http://www.cs.cmu.edu/~vsam, 1998.
[9] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool, "Coupled detection and tracking from static cameras and moving vehicles," IEEE TPAMI, 30(10):1683-1698, 2008.
[10] M. Andriluka, S. Roth, and B. Schiele, "Monocular 3D pose estimation and tracking by detection," in CVPR, 2010.
[11] C. Stauffer, W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. PAMI, vol. 22, no. 8, pp. 747-757, Aug. 2000.
[12] M. Y. Kim, S. Yang, and D. Kim, "Head-Mounted Binocular Gaze Detection for Selective Visual Recognition Systems," Sensors and Actuators A: Physical, vol. 187, pp. 29-36, Nov. 2012.
[13] D. Li, "Starburst: A hybrid algorithm for video-based eye tracking combining feature-based and model-based approaches," Proc. of 2005 IEEE Computer Vision and Pattern Recognition, p. 79, 2005.
[14] C. Huang and R. Nevatia, "High performance object detection by collaborative learning of joint ranking of granule features," in CVPR, 2010.
[15] B. Leibe, K. Schindler, and L. V. Gool, "Coupled detection and trajectory estimation for multi-object tracking," in ICCV, 2007.
[16] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. V. Gool, "Robust tracking-by-detection using a detector confidence particle filter," in ICCV, 2009.
[17] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in CVPR, 1999.