Robot Detection and Localization Based on Deep Learning

Luo Sha, Huimin Lu, Junhao Xiao, Qinghua Yu, Zhiqiang Zheng
College of Mechatronic Engineering and Automation, National University of Defense Technology
[email protected], [email protected], junhao.xiao@hotmail, [email protected], [email protected]
Abstract—Real-time and accurate robot detection and localization are important for RoboCup Middle Size League (MSL) soccer robots. In the robot detection methods currently used by most teams, black-color-based information is used to distinguish robots from the environment, which is not robust when a robot changes its colors, as the current rule allows. Considering the good performance of deep learning on feature extraction and object detection, in this paper we propose a novel approach for robot detection and localization based on Convolutional Neural Networks (CNNs) for RoboCup MSL soccer robots. The approach is composed of two stages: robot detection using the RGB image, and robot localization using the depth point cloud. The high accuracy and mean average precision (mAP) verify that the proposed method is suitable for robot detection during MSL competitions, which will benefit the subsequent strategy design and obstacle avoidance procedures. The proposed approach can easily be extended to deal with different objects and adapted for use in other RoboCup leagues. The acquired dataset is made available to the community.

Keywords—Robot Detection and Localization; Deep Learning; RoboCup MSL; Convolutional Neural Networks (CNNs)
I. INTRODUCTION

The Robot Soccer World Cup (RoboCup¹) is a worldwide competition and academic event for promoting the research and development of artificial intelligence (AI) and robotics by providing a challenging and public testing platform. The Middle Size League (MSL) is one of the most important events of RoboCup. For a completely autonomous soccer robot, accurate and real-time detection of other robots plays an important role in the robot's entire system, helping the robot perceive the environment as a basis for autonomous capabilities such as motion planning and decision-making.

¹ RoboCup Homepage: http://www.robocup.org/
This research was supported by the National Natural Science Foundation of China (61403409, 61503401) and the China Postdoctoral Science Foundation (2014M562648).

In the past decade, most MSL teams have used color-based methods for robot detection [1, 2], because the main color of the robots is black, which is distinctive and easy to recognize. According to the new RoboCup MSL 2017 rule, robots are allowed to wear clothes of different colors [3], which introduces a new challenge for those methods. Recently, deep learning has attracted increasing attention for object detection and recognition, as it allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection and so on [4]. Since the early 1990s, CNNs have been applied with great success to the detection, segmentation and recognition of objects and regions in images. In 2012, Krizhevsky et al. [5] applied deep CNNs to over a million images and achieved high image classification accuracy in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6, 7], which made CNNs well known among computer vision researchers. Usually, increased detection speed comes at the cost of decreased detection accuracy. Ribeiro et al. [12] used deep learning in the MSL for object detection, where classifiers implemented as artificial neural networks are trained on images obtained from an omnidirectional vision sensor. Different objects can be detected despite the imaging distortions caused by the omnidirectional vision sensor. A deep learning method for NAO robot detection was proposed in the RoboCup Standard Platform League [13], which performs very well in terms of accuracy.
However, its processing time is too long to meet the real-time requirement.
There are some state-of-the-art object detection systems such as SSD [8], Faster R-CNN [9] and YOLO v2 [10, 11]. YOLO v2 applies a single neural network to the full image: the network divides the image into regions and predicts bounding boxes and probabilities for each region, and the bounding boxes are weighted by the predicted probabilities. According to Table 3 in [11], YOLO v2 is faster and more accurate than previous detection methods. It can also run at different resolutions for an easy trade-off between speed and accuracy (67 FPS with 76.8% mAP on the VOC2007 test set). In this paper, we use the Kinect v2² as the vision sensor for robot detection and localization; it is an active 3D depth sensor that uses infrared illumination for depth measurement, which makes it accurate for object localization. However, the huge amount of data from the Kinect would increase the burden on the onboard industrial computer's CPU, which has to run the motion control, omnidirectional vision, decision control and communication modules at the same time. Considering the weight and real-time requirements, we decided to use the NVIDIA Jetson TX2³ embedded development board to process the data from the Kinect for robot detection and localization and to transmit the localization results to the onboard industrial computer for later decision-making and obstacle avoidance. The detection and localization system is mounted on our NuBot soccer robot (Fig. 1(a)), and the communication between the TX2 and the robot's onboard industrial computer is handled by ROS master messages, as illustrated in Fig. 1(b). To exploit the accuracy and speed of the YOLO detection system, we adjusted the framework for fast and accurate robot detection during the detection stage. We also built a novel dataset for robot detection, which contains fully annotated images acquired from MSL competitions on a regular field. The dataset is publicly available at: https://github.com/Abbyls/robocup-MSL-dataset. After the robot detection step, we obtain the robots' 2D positions from the RGB image. Then we register the RGB image to the depth image and produce the depth point cloud to obtain the robots' 3D positions.

The structure of the paper is organized as follows. Section 2 introduces the proposed approach for robot detection and localization in detail. Section 3 presents the experiments conducted on our NuBot soccer robot and analyzes the accuracy and efficiency of the approach. Section 4 concludes the paper and discusses future work.

Fig. 1 Illustration of the NuBot implementation: (a) NuBot robot; (b) connection architecture of the Kinect, Jetson TX2 and the robot's onboard industrial computer.

² Kinect Homepage: https://developer.microsoft.com/en-us/windows/kinect
³ Jetson TX2 introduction: http://elinux.org/Jetson_TX2

II. PROPOSED APPROACH

The pipeline of our proposed approach is shown in Fig. 2, which includes two stages: robot detection and 3D position localization.
A. Dataset building

To train the network model for robot detection, a large amount of data from real scenarios is needed, together with accurate ground-truth annotations. Considering the lack of open-source datasets for the RoboCup MSL, we decided to collect a set of images taken under varying conditions, with different vision sensors, and from different points of view. We have made the dataset publicly available and encourage similar research in the robotics/vision community.
Fig. 2 The pipeline of the proposed approach: a detection stage (dataset creation, image pre-processing acceleration, robot detection) followed by a localization stage (image registration, noise elimination, depth point cloud production, 3D position estimation).
A typical image from our dataset is shown in Fig. 3, where several robots with different shapes and colors stand on the green soccer field. These robots share some common features, such as a black base, an upper body that is thinner than the lower body, and a similar background. We need to detect the robots and annotate their positions relative to the image. All the annotations are provided in an XML file named "annotations.xml" that contains each robot's Class, Width, Height, Xmin, Xmax, Ymin, Ymax and so on. The origin of the image coordinate system is placed in the upper-left corner. Since usually more than one robot is present in an image, bounding boxes may overlap. There are 1456 images in total in our dataset, without any pre-processing, and the images have different dimensions. We selected 1000 images for training and 456 images for testing.
Fig. 3 One of the annotated images from our dataset. Each bounding box is described by its upper-left corner (Xmin, Ymin), its lower-right corner (Xmax, Ymax), and its width and height in the image coordinate system.
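For illustration, the following Python sketch reads bounding boxes from such an annotation file. Only the field names (Class, Xmin, Ymin, Xmax, Ymax) come from the description above; the per-object XML layout and tag nesting are assumptions about the released file, not its documented format.

```python
import xml.etree.ElementTree as ET

def load_annotations(xml_path):
    """Parse robot bounding boxes from an annotation file.

    Assumes one <object> element per robot holding the fields named in
    the paper (Class, Xmin, Ymin, Xmax, Ymax); the actual layout of
    annotations.xml in the released dataset may differ.
    """
    boxes = []
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):
        cls = obj.findtext("Class")
        xmin = int(obj.findtext("Xmin"))
        ymin = int(obj.findtext("Ymin"))
        xmax = int(obj.findtext("Xmax"))
        ymax = int(obj.findtext("Ymax"))
        boxes.append((cls, xmin, ymin, xmax, ymax))
    return boxes

if __name__ == "__main__":
    for cls, xmin, ymin, xmax, ymax in load_annotations("annotations.xml"):
        print(f"{cls}: ({xmin},{ymin})-({xmax},{ymax}), "
              f"width={xmax - xmin}, height={ymax - ymin}")
```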
B. Image pre-processing acceleration

As mentioned above, the YOLO detection system is extremely fast and accurate in comparison with other detection systems when it runs on the NVIDIA Titan X⁴, one of the most powerful graphics processing units at the time. The embedded Jetson TX2 board we use is clearly weaker than the NVIDIA Titan X, which slows down the robot detection process. For fast and convenient prediction, we transform and resize the input image (preserving its aspect ratio) before prediction, and then pad it into a box of fixed width and height that serves as the input to the prediction process. The image pre-processing step and the prediction step take about 58 ms and 60 ms respectively, failing to meet the real-time requirements of a real competition.
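A minimal sketch of this resize-and-pad pre-processing is given below, using OpenCV and NumPy. The 416 × 416 network input size and the grey padding value are common YOLO conventions assumed here, not values reported in the paper.

```python
import cv2
import numpy as np

def letterbox(image, net_w=416, net_h=416):
    """Resize an RGB image while preserving its aspect ratio, then pad it
    into a fixed net_w x net_h canvas, as described above."""
    h, w = image.shape[:2]
    scale = min(net_w / w, net_h / h)
    new_w, new_h = int(w * scale), int(h * scale)
    resized = cv2.resize(image, (new_w, new_h))
    canvas = np.full((net_h, net_w, 3), 127, dtype=image.dtype)  # grey padding
    dx, dy = (net_w - new_w) // 2, (net_h - new_h) // 2
    canvas[dy:dy + new_h, dx:dx + new_w] = resized
    return canvas, scale, dx, dy   # scale and offsets map boxes back later
```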
The pipeline of the image pre-processing algorithm offers pixel-level data parallelism, which can easily be exploited on the CUDA (Compute Unified Device Architecture) platform. To accelerate the detection process, we decided to parallelize the data pre-processing step using CUDA. Since the GPU consists of multiple cores, it allows independent thread scheduling and execution and is well suited to computations on independent pixels. Therefore, in the image pre-processing step we use m × n threads for an image of dimension m × n, organized in blocks of appropriate size running on multiple cores. The pipeline of the data transmission between the CPU and GPU is shown in Fig. 4.

C. Robot Detection

After the image pre-processing, we feed the processed box images into the test network to predict the robots' box positions and the corresponding probabilities. Considering the real-time requirement of the robot detection process, we use a tiny version of the model, which has 9 convolutional layers and 1 region detection layer. The first 6 convolutional layers of the network extract features from the image, the pooling layers perform subsampling and enhance invariance to small translations, and the region detection layer predicts the boxes' coordinates and class probabilities. The network architecture is shown in Fig. 5.
Fig. 5 The network architecture of the convolutional neural network used for robot detection: six convolution + pooling blocks, followed by three further convolutional layers and a region detection layer.
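The following PyTorch-style sketch illustrates a network with this layer pattern (six convolution + pooling blocks, three further convolutional layers, and a 1 × 1 region prediction layer). The filter counts, activation functions and anchor settings are illustrative assumptions, not the authors' darknet configuration.

```python
import torch
import torch.nn as nn

def tiny_robot_detector(num_anchors=5, num_classes=1):
    """Sketch of a tiny YOLO-like network: 9 convolutional layers plus a
    region prediction head, matching the layer pattern described above."""
    def conv_pool(c_in, c_out):
        return [nn.Conv2d(c_in, c_out, 3, padding=1), nn.LeakyReLU(0.1),
                nn.MaxPool2d(2, 2)]
    layers = []
    channels = [3, 16, 32, 64, 128, 256, 512]
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += conv_pool(c_in, c_out)              # six conv + pool blocks
    layers += [nn.Conv2d(512, 1024, 3, padding=1), nn.LeakyReLU(0.1),
               nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1)]
    # region layer: per grid cell, each anchor predicts (x, y, w, h, objectness) plus class scores
    layers += [nn.Conv2d(1024, num_anchors * (5 + num_classes), 1)]
    return nn.Sequential(*layers)

net = tiny_robot_detector()
print(net(torch.zeros(1, 3, 416, 416)).shape)   # torch.Size([1, 30, 6, 6])
```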
D. Image Registration and Depth Point Cloud Production

We register the RGB image to the depth image and produce the depth point cloud, then look up the depth value at each box's center position in the registered image. From that we obtain the 3D position in the depth point cloud. The open-source driver libfreenect2 [14, 15] is used here, but it does not provide parallel registration of the color and depth images. The Jetson TX2 has a relatively weak CPU compared with mainstream CPU processors. To improve the real-time performance of the algorithm, and for the convenience of the robot detection described above, we parallelized the image registration and depth point cloud production on the GPU.
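As a sketch of how a registered depth pixel can be turned into a 3D position, the following function back-projects a pixel through a pinhole camera model. The explicit intrinsics (fx, fy, cx, cy) are an assumption about how the registered data are used; libfreenect2's Registration class offers comparable helpers (e.g. getPointXYZ).

```python
import numpy as np

def pixel_to_3d(depth_mm, u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) of the registered depth image into a 3D
    point in the camera frame (millimetres), assuming a pinhole model with
    the colour-camera intrinsics fx, fy, cx, cy."""
    z = float(depth_mm[v, u])
    if z <= 0:                    # no depth measured at this pixel
        return None
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```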
Fig. 4 Data transmission between the CPU and GPU for the image pre-processing: the CPU transmits the RGB image to the YOLO framework, the GPU performs the image data processing, and the boxed image is copied back to the CPU.
⁴ NVIDIA Titan X Homepage: https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/
Fig. 6 Image registration and depth point cloud production: (a) RGB image; (b) depth image; (c) registered image; (d) depth point cloud.
The bottlenecks of implementing this step on the GPU are that the amount of data to process is huge, the global memory bandwidth is limited, and the amounts of shared and constant memory are small. To work around these bottlenecks, the texture memory provided by the GPU architecture is utilized. It is cached on chip and provides more effective bandwidth by reducing memory requests to off-chip DRAM. We use the texture memory to store the depth-to-color mapping data. The computation speed of each thread is greatly improved since memory access time is significantly reduced. The registered image and depth point cloud are shown in Fig. 6.

E. Noise Elimination

Sometimes we obtain false positive results because of blurred features and noise in the image. To avoid these situations, a validation process is employed to decide whether a box contains a real robot. The robot's size is limited to a fixed range, and all of the robots have similar sizes, so we can model the relationship between a box's size and the distance from the box's center to the Kinect. Using this knowledge, we estimate the number of pixels that a robot box should contain according to its distance to the Kinect. The relation is given by equation (1), in which x represents the distance and f(x) represents the box's size; the curve of f(x) is plotted in Fig. 7. We then eliminate any box whose size deviates too much from the curve.
f(x) = 3.67e+09 * x^(-1.972)    (1)
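A hedged sketch of this validation rule is given below: a detection is kept only if its pixel area is close to the size expected at its measured distance. The coefficients are taken from equation (1); the tolerance value is an assumption.

```python
def plausible_robot_box(box_w, box_h, distance_mm, tol=0.5):
    """Reject a detection whose pixel area deviates too much from the size
    expected at its measured distance (equation (1)). The relative
    tolerance tol is an illustrative choice."""
    expected = 3.67e+09 * distance_mm ** -1.972   # expected box area in pixels
    area = box_w * box_h
    return abs(area - expected) <= tol * expected
```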
Fig. 7 The relation between the robot box's size (in pixels) and its distance to the Kinect sensor (in mm): measured box sizes and the fitted curve.
F. Get the 3D Positions

After validating the targets, we obtain the most probable candidate robot regions, and we then look up each box's center position in the depth point cloud to get the 3D position of each detected robot. If the box's center has no depth information, we adjust it by adding a small offset to the center. Finally, we obtain the robots' 3D positions in the depth point cloud, as shown in Fig. 8(b), where the grey ball represents the detected robot's center position.
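The following sketch illustrates this centre lookup with the small-offset fallback, assuming an organized H × W × 3 point cloud aligned with the registered image; the search radius is an assumption.

```python
import numpy as np

def robot_3d_position(point_cloud, cx_pix, cy_pix, max_offset=5):
    """Return the 3D point at a detected box centre; if that pixel has no
    valid depth, probe nearby pixels within a small offset, mirroring the
    fallback described above."""
    h, w, _ = point_cloud.shape
    for r in range(max_offset + 1):                  # grow the search window
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                x, y = cx_pix + dx, cy_pix + dy
                if 0 <= x < w and 0 <= y < h:
                    p = point_cloud[y, x]
                    if np.isfinite(p).all() and p[2] > 0:   # valid depth
                        return p
    return None
```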
Fig. 8 (a) Detected robots in the registered image; (b) the robots' 3D positions in the depth point cloud.

III. EXPERIMENTAL RESULTS

This section presents quantitative experimental results obtained with our approach. In order to test both the real-time performance and the measurement accuracy of robot detection and localization, we used our NuBot soccer robots as the experimental platform, equipped with a Kinect v2 sensor and a Jetson TX2 as the processor. JetPack 3.0 was used as the programming interface, with CUDA 8.0 for parallel computing and cuDNN 5.0 for CNN acceleration.

A. Detection precision and mAP

As there is no other public dataset about MSL robots available to measure performance against, we tested on the 400 testing images contained in our dataset. The testing results are shown in TABLE I. From the table we can see that the precision is approximately 90%, which meets the detection accuracy requirements of the RoboCup competition. The mAP results indicate the good performance on the MSL robot dataset. We encourage others in the robot vision community to use our dataset and compare algorithm performance on it. Furthermore, we tested the detection performance on images acquired from different viewpoints at the RoboCup 2017 competition in Nagoya; the detection results are shown in Fig. 9. From the results we can conclude that visible robots in the MSL competition can be detected using our proposed approach.

Fig. 9 Detection examples from the RoboCup 2017 competition

TABLE I. DATASET PERFORMANCE
IoU       Recall    Precision   mAP
72.76%    94.79%    89.09%      70.65%
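For reference, a minimal sketch of how IoU-based precision and recall of this kind can be computed from detections and ground-truth boxes is given below; the 0.5 IoU matching threshold is an assumption, not a value stated in the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def precision_recall(detections, ground_truth, iou_thresh=0.5):
    """Greedy matching: a detection is a true positive if it overlaps an
    unmatched ground-truth box with IoU above the (assumed) threshold."""
    matched, tp = set(), 0
    for det in detections:
        best, best_iou = None, iou_thresh
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(det, gt) >= best_iou:
                best, best_iou = i, iou(det, gt)
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(detections) if detections else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall
```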
B. Real-time performance after the parallel computing

The detection process includes two steps: image pre-processing and prediction. The prediction process already employs GPU parallel computing, which leaves little room for further acceleration because of the hardware restrictions. The image pre-processing step has a large impact on the speed: the computation time to process each image frame is approximately 58 ms. We accelerated this step by parallelizing it with CUDA. We tested the processing time of the parallelized image pre-processing under different scenes on the robot soccer field; the improved performance is shown in TABLE II. With this approach and the YOLO v2 architecture, we can realize real-time robot detection in the RoboCup MSL competition. The statistics of the computing time of the accelerated image pre-processing and of the prediction process are shown in Fig. 10, where the red lines represent the median computing time, the grey lines represent the maximal and minimal computing times, and the blue boxes are the box-and-whisker plots produced by Matlab.
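A simple timing harness of the kind used to collect such statistics might look as follows; the stage function and frame source are placeholders, and the median/min/max summary mirrors what Fig. 10 reports.

```python
import time
import statistics

def time_stage(stage_fn, frames):
    """Measure the per-frame processing time (ms) of one pipeline stage
    (e.g. the accelerated pre-processing) over a sequence of captured
    frames, and report median, min and max."""
    samples = []
    for frame in frames:
        start = time.perf_counter()
        stage_fn(frame)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), min(samples), max(samples)
```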
Fig. 11 Robot localization results in our soccer field (X and Z axes in mm): detected positions of robot 1 and robot 2, the robots' ground-truth positions, and the NuBot observer position.

TABLE III. LOCALIZATION ERROR (mm)
Robot position (mm)   mean     min     max
(-1660, 1800)         24.140   0.568   77.967
(1750, 3280)          36.994   0.832   181.658
Fig. 10 Statistics of processing time.

TABLE II. PROCESSING TIME AFTER ACCELERATION
                       TX2         After acceleration
Pre-processing (ms)    51-60       0-10
Prediction (ms)        54-60       54-60
FPS                    8.3-9.2     14.2-18.5
C. Localization error

After the accurate robot detection in the registered RGB image, we need to localize the robots on the soccer field for the decision-making and obstacle avoidance procedures. The accuracy of the localization has a great impact on the robot's performance during a game. As mentioned above, we combine the depth information with the RGB information to obtain each detected robot's 3D position on the field. To measure the localization error, we set up an experiment in which two soccer robots were placed at positions (-1660, 1800) and (1750, 3280) in our soccer field, and their positions were measured from position (0, 0) by a robot equipped with the proposed detection and localization system. The localization results are shown in Fig. 11 (red points: detected robots' positions; blue points: robots' ground-truth positions; maroon point: position of the NuBot carrying the Kinect sensor for robot detection and localization), and the average Euclidean distance errors are shown in TABLE III. Considering the physical dimensions of the robot (approximately 52 cm × 52 cm) and the error of placing the robot at a specific position, the localization error (30.567 mm on average) of our proposed system is acceptable.
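A short sketch of how the per-robot error statistics in TABLE III can be computed from repeated measurements is shown below; the array shapes are assumptions.

```python
import numpy as np

def localization_error(estimates, ground_truth):
    """Euclidean distance error (mm) between estimated and true robot
    positions on the field, reported as mean / min / max as in TABLE III.
    `estimates` is an Nx2 array of measured (X, Z) positions for one robot
    and `ground_truth` is its placed position."""
    errors = np.linalg.norm(np.asarray(estimates) - np.asarray(ground_truth), axis=1)
    return errors.mean(), errors.min(), errors.max()

# Example with one of the ground-truth positions used in the experiment:
# mean_err, min_err, max_err = localization_error(measured_positions, (-1660, 1800))
```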
IV. CONCLUSION AND FUTURE WORK

This paper proposed a novel approach for robot detection and localization in the RoboCup MSL competition, using the Kinect v2 and Jetson TX2 as the hardware platform. We built a dataset of MSL robots, as there was no public dataset for the research of MSL robot detection, and we shared it on GitHub for the benefit of the robot vision community. We accelerated the image pre-processing step of the YOLO detection system to make it meet the real-time requirements of the RoboCup MSL competition. More importantly, we combined the detected robots' 2D positions in the RGB image with the depth point cloud to obtain their 3D positions, which improved the accuracy of robot localization. We expect the real-time performance to improve further with a more powerful GPU. In the future, we will maintain and improve our dataset of MSL robots and continue the research on accurate and real-time detection and localization of robots, the ball and the referee.

REFERENCES
[1] Neves, A.J., Pinho, A.J., Martins, D.A., and Cunha, B., "An efficient omnidirectional vision system for soccer robots: From calibration to object detection," Mechatronics, 21(2), 2011, pp. 399-410.
[2] Lu, H., Yang, S., Zhang, H., and Zheng, Z., "A robust omnidirectional vision sensor for soccer robots," Mechatronics, 21(2), 2011, pp. 373-389.
[3] MSL Technical Committee, "Middle Size Robot League Rules and Regulations for 2017," 2017.
[4] LeCun, Y., Bengio, Y., and Hinton, G., "Deep learning," Nature, 521(7553), 2015, pp. 436-444.
[5] Krizhevsky, A., Sutskever, I., and Hinton, G., "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[6] Deng, J., Berg, A., Satheesh, S., Su, H., Khosla, A., and Fei-Fei, L., "ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012)," 2012. http://www.image-net.org/challenges/LSVRC/2012/
[7] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[8] Liu, W., et al., "SSD: Single shot multibox detector," in European Conference on Computer Vision, Springer, Cham, 2016.
[9] Ren, S., et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015.
[10] Redmon, J., et al., "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[11] Redmon, J., and Farhadi, A., "YOLO9000: Better, faster, stronger," arXiv preprint arXiv:1612.08242, 2016.
[12] de Almeida Ribeiro, P.R., Lopes, G., and Ribeiro, F., "Neural network in computer vision for RoboCup Middle Size League," Journal of Software Engineering and Applications, 9(07), 319, 2016.
[13] Albani, D., et al., "A deep learning approach for object recognition with NAO soccer robots," RoboCup Symposium, poster presentation, 2016.
[14] Lawin, F.J., Forssén, P.-E., and Ovrén, H., "Efficient multi-frequency phase unwrapping using kernel density estimation," in European Conference on Computer Vision, pp. 170-185, Oct. 2016.
[15] Xiang, L., et al., "libfreenect2: Release 0.2 [Data set]," Zenodo, 2016. http://doi.org/10.5281/zenodo.50641