Object detection and measurement using stereo images
Christian Kollmitzer
Electronic Engineering, University of Applied Sciences Technikum Wien
Höchstädtplatz 4, A-1200 Vienna, Austria
[email protected]
Abstract: This paper presents an improved method for detecting objects in stereo images and for calculating the distance, size and speed of these objects in real time. This is achieved by applying a standard background subtraction method to the left and right image; subsequently, a method known as subtraction stereo calculates the disparity of the detected objects. This calculation is supported by several additional parameters such as the object center, the color distribution and the object size. The disparity is used to verify the plausibility of detected objects and to calculate their distance and position. From position and distance the size of an object can be derived; additionally, the speed of objects can be calculated when they are tracked over several frames. A dense disparity map produced during the learning phase serves as an additional means to improve detection accuracy and reliability.
Keywords: Computer Vision, Stereo Vision, Foreground Segmentation, Disparity Map, Subtraction Stereo
1 Introduction
In surveillance systems and autonomous mobile robots, cameras are common sensors for detecting objects. A standard one-camera system is able to detect objects by differentiating them from the background by motion, brightness or color. This leads to misinterpretations and unclear situations, especially when the object is similar to the background, is occluded, or when the background changes rapidly, e.g. under changing illumination. In a setup where a static camera observes a scene, objects have to be distinguished from the background. The evaluated system learns over time to decide when background or foreground is present. Improving the robustness of this detection requires information beyond that of a normal color camera. This additional information can be provided by a second camera, which yields depth information and allows calculating the position of objects in the scene. All computational processes are evaluated on a computer system with an i5 processor running at 2.4 GHz. The target is a detection frame rate of 20 fps at a resolution of 640x480 pixels. All algorithms use the OpenCV libraries [7] and run under Windows 7.
2 Camera rig
The camera set consists of two identical cameras mounted on a stable bar with a base distance of 413 mm (Fig. 1).

Base distance: The resolution of the distance (Z) measurement is based on the horizontal distance (T) between the left and right camera:

$$Z = \frac{f \cdot T}{\mathit{offset}} \qquad (1)$$

offset = horizontal offset between the positions of a point in the left and right image in pixels (the disparity); f = focal length in pixels; T = base distance in mm; Z = distance in mm. The pixel unit appears in both the numerator and the denominator; therefore the resulting unit is mm.
$$\Delta Z = \frac{Z^2}{f \cdot T} \cdot \Delta d \qquad (2)$$

The resolution of the distance measurement can be improved by increasing the base distance (T) or by increasing the focal length (f; tele lens); Δd = minimal disparity step = 1 pixel. In this setting the base distance was chosen such that distances up to 20 m can be measured. The focal length is 530 pixels, which gives a field of view suitable for the surveillance of indoor and outdoor areas without camera movement.

Camera resolution: The resolution is a tradeoff between exact object detection and computational cost. 640x480 was chosen as the standard resolution, with the option of using 1280x1024 for higher accuracy.
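For illustration, using the values of this rig: with T = 413 mm and f = 530 px, an object at Z = 10 m produces a disparity of $d = f \cdot T / Z \approx 21.9$ px by (1); one pixel of disparity then corresponds to a resolution of $\Delta Z = Z^2/(f \cdot T) \approx 0.46$ m by (2).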
Fig. 1. Camera rig
3 Image calibration and rectification
The cameras have to be calibrated and the acquired images have to be rectified to obtain congruent images, which allow the disparity calculation. For cameras that cover a larger distance range, the calibration is better performed in two steps. First, each camera is calibrated separately by means of a chessboard, which is presented several times; this yields the intrinsic parameters, which cover the misalignment of the camera chip, the focal length and the distortion of the lens. In a second step the rig is calibrated by presenting the chessboard in several positions. This yields the extrinsic parameters, i.e. the distance and rotation of the cameras relative to each other. With these calibration parameters both images are rectified, which leads to horizontally aligned images and eases the disparity calculation (Fig. 2). In this case the search for identical points in both images is reduced to a horizontal search. After calibration the focus setting of the cameras must not change; therefore all automatic focus adjustments of the cameras have to be turned off.
Fig. 2. Rectified images
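A minimal sketch of this two-step procedure with the OpenCV C++ API (the function and variable names are illustrative; the per-camera intrinsics K1, D1, K2, D2 are assumed to come from a preceding cv::calibrateCamera run on the chessboard views):

```cpp
// Illustrative sketch: rig calibration and rectification with OpenCV.
// Assumes intrinsics (K1, D1, K2, D2) from a prior per-camera calibration.
#include <opencv2/opencv.hpp>
#include <vector>

void calibrateAndRectify(const std::vector<std::vector<cv::Point3f>>& chessboard3d,
                         const std::vector<std::vector<cv::Point2f>>& cornersLeft,
                         const std::vector<std::vector<cv::Point2f>>& cornersRight,
                         cv::Mat K1, cv::Mat D1, cv::Mat K2, cv::Mat D2,
                         cv::Size imageSize, const cv::Mat& left, const cv::Mat& right,
                         cv::Mat& leftRect, cv::Mat& rightRect)
{
    // Step 2 of the calibration: extrinsics of the rig (rotation R, translation T).
    cv::Mat R, T, E, F;
    cv::stereoCalibrate(chessboard3d, cornersLeft, cornersRight,
                        K1, D1, K2, D2, imageSize, R, T, E, F,
                        cv::CALIB_FIX_INTRINSIC);

    // Rectification transforms: afterwards epipolar lines are horizontal.
    cv::Mat R1, R2, P1, P2, Q;
    cv::stereoRectify(K1, D1, K2, D2, imageSize, R, T, R1, R2, P1, P2, Q);

    cv::Mat m1x, m1y, m2x, m2y;
    cv::initUndistortRectifyMap(K1, D1, R1, P1, imageSize, CV_32FC1, m1x, m1y);
    cv::initUndistortRectifyMap(K2, D2, R2, P2, imageSize, CV_32FC1, m2x, m2y);

    // Warp both images; identical points now lie on the same image row.
    cv::remap(left,  leftRect,  m1x, m1y, cv::INTER_LINEAR);
    cv::remap(right, rightRect, m2x, m2y, cv::INTER_LINEAR);
}
```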
4 Disparity calculation
Disparity calculation is a matching problem: the positions of identical points in the left and the right image have to be detected. Due to the rectification the search can be limited to horizontal lines. Several algorithms have already been presented and evaluated; the computational effort (matching cost) of different methods and algorithms is compared in [1]. Better algorithms find more correct corresponding points, but typically their matching cost is higher. In this evaluation the disparity algorithm "semi-global block matching" is used [5]. The implementation of this algorithm has a high calculation cost, which depends on the resolution of the images.
Disparity calculation time with "semi-global block matching" per frame:

Resolution 320x240:   39 ms
Resolution 640x480:   210 ms
Resolution 1280x1024: 920 ms
For a real-time application the calculation time should be below 50 ms to achieve a frame rate of 20 fps. In the further evaluation this method is therefore used only during the learning phase of 100 frames, where it produces a dense disparity map of the background at a resolution of 320x240 pixels, which serves as the reference background. An even denser disparity map can be achieved by averaging all images during the learning phase, forming a stable background (Fig. 3).
Fig. 3. Left image and corresponding background disparity image
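A minimal sketch of such a dense disparity computation with OpenCV's SGBM implementation (the parameter values are illustrative assumptions, not the configuration of the evaluation; rectified 8-bit grayscale inputs are assumed):

```cpp
// Illustrative sketch: dense disparity with OpenCV's semi-global block matching.
// Averaging the maps of the learning phase yields the stable background map.
#include <opencv2/opencv.hpp>

cv::Mat denseDisparity(const cv::Mat& leftRect, const cv::Mat& rightRect)
{
    // minDisparity = 0; numDisparities must be a multiple of 16; odd block size.
    cv::Ptr<cv::StereoSGBM> sgbm = cv::StereoSGBM::create(0, 64, 9);

    cv::Mat disp16;
    sgbm->compute(leftRect, rightRect, disp16);   // fixed point, scaled by 16

    cv::Mat disp;
    disp16.convertTo(disp, CV_32F, 1.0 / 16.0);   // disparity in pixels
    return disp;
}
```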
5 Background registration
During this evaluation different algorithms have been used, starting with a method called "modified codebook" [4], [5]. The quality of detecting objects and separating them from the background is good, but the computational effort is very high. Codebook background registration time per frame:

Resolution 320x240:   49 ms
Resolution 640x480:   230 ms
Resolution 1280x1024: 1700 ms
The combination of the "semi-global block matching" disparity algorithm and background registration with the "modified codebook" results in a calculation time of 440 ms per frame at the intended resolution of 640x480 pixels. With these algorithms the average frame rate is about 2 fps, which is not acceptable for surveillance purposes. This problem has been solved by evaluating a different type of background registration and disparity calculation. For the background registration an adaptive median background subtraction is used [6]. During a learning phase of 40 images the median of each pixel's history is calculated and used as background. During the detection phase this background is subtracted from the current image; a threshold function is used to distinguish between foreground and background.
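A minimal sketch of this adaptive median scheme in the spirit of [6] (illustrative code; 8-bit grayscale frames and the threshold value are assumptions):

```cpp
// Illustrative sketch: adaptive median background subtraction. The background
// estimate moves one grey level toward each new frame and thus converges to
// the temporal median of every pixel.
#include <opencv2/opencv.hpp>

void adaptiveMedianStep(const cv::Mat& gray, cv::Mat& background,
                        cv::Mat& foreground, double thresh = 30.0)
{
    if (background.empty()) { gray.copyTo(background); return; }

    cv::Mat up   = gray > background;   // background must rise here
    cv::Mat down = gray < background;   // background must fall here
    cv::add(background, cv::Scalar(1), background, up);
    cv::subtract(background, cv::Scalar(1), background, down);

    // Foreground = pixels deviating from the background beyond the threshold.
    cv::Mat diff;
    cv::absdiff(gray, background, diff);
    cv::threshold(diff, foreground, thresh, 255, cv::THRESH_BINARY);
}
```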
6 Subtraction Stereo
The matching cost of the disparity calculation can be reduced by a method known as "subtraction stereo" [2]. In this case the disparity calculation is not applied to the whole left and right image but to the stereo images after background registration; thus only areas which have changed and differ from the background enter the disparity calculation. This can be done in several ways. One is proposed in a paper by K. Umeda [2], calculating the horizontal distance of detected foreground objects. Evaluating this method results in noisy position estimates, caused by varying object sizes due to the background registration. In this evaluation the method is modified such that only the left image undergoes background registration. After smoothing, the foreground pixels are clustered with a procedure known as "connected components" [7] and objects are identified. The bounding box of an identified object is used to cut a template out of the rectified left image. The next step is to search the right image for this template. The search can be limited to a rectangle which lies to the left of the object position in the left image, on the same horizontal lines. The search process uses normed correlation:
$$R(x,y) = \frac{\sum_{x',y'} T(x',y') \cdot I(x+x',\, y+y')}{\sqrt{\sum_{x',y'} T(x',y')^2 \cdot \sum_{x',y'} I(x+x',\, y+y')^2}} \qquad (3)$$

where T denotes the template cut from the left image and I the search area in the right image.
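A sketch of this search with OpenCV's template matching, whose cv::TM_CCORR_NORMED mode implements equation (3) (the helper name and the maxDisparity parameter are illustrative; bbox is the bounding box delivered by the connected-components step):

```cpp
// Illustrative sketch: subtraction-stereo disparity of one detected object.
#include <opencv2/opencv.hpp>
#include <algorithm>

double objectDisparity(const cv::Mat& leftRect, const cv::Mat& rightRect,
                       const cv::Rect& bbox, int maxDisparity = 64)
{
    cv::Mat templ = leftRect(bbox);   // template cut out of the left image

    // Search only to the left of the object position, on the same rows.
    int x0 = std::max(0, bbox.x - maxDisparity);
    cv::Rect searchArea(x0, bbox.y, bbox.x + bbox.width - x0, bbox.height);

    cv::Mat result;                   // normed correlation, equation (3)
    cv::matchTemplate(rightRect(searchArea), templ, result, cv::TM_CCORR_NORMED);

    double maxVal; cv::Point maxLoc;  // the maximum marks the correspondence
    cv::minMaxLoc(result, nullptr, &maxVal, nullptr, &maxLoc);

    // Disparity = horizontal offset between the left and right object position.
    return static_cast<double>(bbox.x - (searchArea.x + maxLoc.x));
}
```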
The best correspondence is found by searching for the maximum within the result area. This position marks the center of the object in the right image, and the distance between the object centers in the right and left image represents the disparity. From the disparity and the position of the object in the image, the 3D position of the object can be calculated. If the center of the object is tracked over several images, an x/y diagram of the projected path can be drawn (Fig. 7). The object center position is calculated as follows (d = disparity; T = base distance of the cameras; f = focal length of the cameras in pixels; x, y = image coordinates of the object center relative to the image center; X, Y, Z = position of the object center in 3D space):
$$Z = \frac{f \cdot T}{d} \qquad (4)$$

$$X = \frac{x \cdot T}{d} \qquad (5)$$

$$Y = \frac{y \cdot T}{d} \qquad (6)$$

The calculated position is inserted into the display (Fig. 4). Additionally, certain points can be selected by the user in the right and left image, thus determining the coordinates of these points. This allows measuring distances and positions of reference points (Fig. 5) within the field of view.
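Equations (4)-(6) translate directly into code; in this sketch, x and y are the pixel coordinates of the object center relative to the image center, and all outputs are in mm:

```cpp
// Illustrative sketch of equations (4)-(6): 3D position from disparity.
struct Point3D { double X, Y, Z; };

Point3D objectPosition(double x, double y, double d, double T, double f)
{
    Point3D p;
    p.Z = f * T / d;   // (4) depth from disparity
    p.X = x * T / d;   // (5) lateral position
    p.Y = y * T / d;   // (6) vertical position
    return p;
}
```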
Fig. 4. Left and right image with object detection and position
Fig. 5. Distance measurement
7 Object tracking
The center position of each registered object (Fig. 6) is tracked, which allows drawing the path of tracked objects in an x/y diagram (Fig. 7).
Fig. 6. Frame of object tracking video with position of center
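A possible sketch of such a path plot (canvas size and scaling are arbitrary display choices, not values from the evaluation):

```cpp
// Illustrative sketch: draw the projected path (x/y diagram in dm, Fig. 7)
// of the tracked 3D object centers, given in mm, into an image.
#include <opencv2/opencv.hpp>
#include <vector>

void drawPath(const std::vector<cv::Point3d>& centersMm, cv::Mat& canvas)
{
    if (canvas.empty()) canvas = cv::Mat::zeros(400, 400, CV_8UC3);
    for (size_t i = 1; i < centersMm.size(); ++i) {
        // 1 dm = 100 mm; shift x so the optical axis sits in the canvas middle.
        cv::Point a(int(centersMm[i - 1].x / 100) + 200, int(centersMm[i - 1].z / 100));
        cv::Point b(int(centersMm[i].x / 100) + 200, int(centersMm[i].z / 100));
        cv::line(canvas, a, b, cv::Scalar(0, 0, 255), 2);
    }
}
```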
This diagram (Fig. 7) shows that the person first walked away from the camera in z direction (blue) and then walked towards the camera along a wiggly line (red). There are still positions which are not correctly recognized; this has to be improved by filtering. The resolution changes with distance; at 20 m the resolution is 75 cm.
Fig. 7. Object center tracking projection (x/y in dm)
8 Object measurement
From the disparity and the image coordinates of the object center, its 3D position can be calculated. The speed calculation uses these center positions, determines the distance between the centers of neighboring frames, and measures the time between the frames (Fig. 8). Height and width of objects can be calculated either as an estimation or, with more computational effort, in detail. The estimation assumes that the object is far away, so that all points of the object have nearly the same distance to the camera. If the calculation has to be more detailed, all individual measuring points have to undergo disparity detection, similar to the center disparity detection.
Fig. 8. Detected object with distance, height, width and speed
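A sketch of these estimated measurements (helper names are illustrative): under the far-object assumption a pixel extent scales with T/d, and the speed follows from the displacement of the 3D center between neighboring frames:

```cpp
// Illustrative sketch: estimated object size and speed.
#include <cmath>

// Width or height in mm from a pixel extent, disparity d [px], base T [mm];
// valid under the assumption that the whole object lies at the same depth.
double extentMm(double extentPx, double d, double T) { return extentPx * T / d; }

// Speed in m/s from two 3D centers [mm] and the time between the frames [s].
double speedMps(double x1, double y1, double z1,
                double x2, double y2, double z2, double dt)
{
    double dist = std::sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) +
                            (z2 - z1) * (z2 - z1));   // displacement in mm
    return dist / 1000.0 / dt;
}
```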
The error of the distance measurement has been verified against a laser measuring device and lies below 10%.
9 Algorithm
In the complete algorithm, left and right images are acquired, and during a learning phase of 100 images a dense disparity map of the background is produced with a running-average process and the "semi-global block matching" method. Moving objects are detected in both images with the "adaptive median background subtraction" method; the detected pixels are assigned to objects with the "connected components" method. This produces two images holding the detected objects. A second disparity map is produced from these two images; it is calculated by evaluating the horizontal difference of the object centers in the left and right image.
A third disparity map is calculated using the method described in Section 6 (subtraction stereo). The first disparity map represents the background and can be used to measure the distance between the objects and the background. The second disparity map decides whether an object is visible in both images and whether a detailed detection is reasonable. The third disparity map is used to calculate the object properties such as center, size and speed (Fig. 9).
Fig. 9. Object detection algorithm
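A minimal sketch of the "connected components" step named above, using OpenCV's connectedComponentsWithStats (available since OpenCV 3; the minimum-area filter is an illustrative assumption for noise suppression):

```cpp
// Illustrative sketch: cluster foreground pixels into objects and return one
// bounding box per detected object.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> extractObjects(const cv::Mat& foregroundMask, int minArea = 200)
{
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(foregroundMask, labels, stats, centroids);

    std::vector<cv::Rect> boxes;
    for (int i = 1; i < n; ++i) {                    // label 0 is the background
        if (stats.at<int>(i, cv::CC_STAT_AREA) < minArea) continue;  // drop noise
        boxes.emplace_back(stats.at<int>(i, cv::CC_STAT_LEFT),
                           stats.at<int>(i, cv::CC_STAT_TOP),
                           stats.at<int>(i, cv::CC_STAT_WIDTH),
                           stats.at<int>(i, cv::CC_STAT_HEIGHT));
    }
    return boxes;
}
```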
The computational cost of this algorithm for up to three objects is 50 ms per frame at a resolution of 640x480 pixels, which is sufficient for real-time use. The calculation time depends on the object size and on the number of objects.
10 Discussion
These methods allow a stable localisation of objects. Based on this work, improvements such as better shadow removal [3], multi-object tracking and the identification of individual objects within a crowd are under consideration. To achieve a better distance resolution, the cameras should be mounted with a greater base distance, which however complicates calibration. As calibration is crucial for calculation accuracy, methods for automatic calibration and recalibration should be developed.
11 Acknowledgments
The work reported in this article has been done within the framework of the European FP7-SEC project INDECT (http://www.indect-project.eu).
12 References
1. Hirschmüller, H., Scharstein, D.: Evaluation of Stereo Matching Costs on Images with Radiometric Differences. IEEE Transactions on Pattern Analysis and Machine Intelligence (2008)
2. Umeda, K., et al.: Subtraction Stereo - A Stereo Camera System that Focuses on Moving Regions. Proceedings of SPIE-IS&T Electronic Imaging, 7239 Three-Dimensional Imaging Metrology (2009)
3. Terabayashi, K., et al.: Improvement of Human Tracking in Stereoscopic Environment Using Subtraction Stereo with Shadow Detection. International Journal of Automation Technology, Vol. 5, No. 6, pp. 924-931 (2011)
4. Kollmitzer, C., Weichselbaum, J., Hager, C.: Background Modeling by Combining Codebook Method and Disparity Maps. Proceedings of MCSS (2010)
5. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.S.: Real-Time Foreground-Background Segmentation Using Codebook Model. Real-Time Imaging (2005)
6. Lo, B., Velastin, S.: Automatic Congestion Detection System for Underground Platforms. Proceedings of the 2001 International Symposium on Intelligent Multimedia, Video, and Speech Processing, pp. 158-161, Hong Kong (2001)
7. Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Sebastopol, CA (2008)