Leaving Flatland: Realtime 3D Stereo Semantic Reconstruction

Radu Bogdan Rusu1,2, Aravind Sundaresan1, Benoit Morisset1, Motilal Agrawal1, and Michael Beetz2

1 Artificial Intelligence Center, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
2 Intelligent Autonomous Systems, Technische Universität München, Boltzmannstr. 3, D-85748 Garching b. München, Germany
Abstract. We report our first experiences with Leaving Flatland, an exploratory project which studies the key challenges in closing the loop on autonomous perception and action in challenging terrain. A primary objective of the project is to demonstrate the acquisition and processing of robust 3D geometric model maps from stereo data and Visual Odometry techniques. The 3D geometric model is used to infer different terrain types and construct a 3D semantic model which can be used for path planning or teleoperation. This paper presents the set of methods and techniques used for building such a model, and provides insight into the mathematical optimizations used to obtain realtime processing. To validate our approach, we show results obtained on multiple datasets and compare against other similar initiatives.
1 Introduction
A major area of research for ground-based robots is the development of perception, planning, and control systems that will enable UGVs (Unmanned Ground Vehicles) to autonomously navigate in the same environments, and at the same speeds, as their human counterparts. The capability of navigating over virtually any type of environment is a necessary prerequisite for the realization of UGVs as tactical multipliers. The goal of our "Leaving Flatland" project is to study the key challenges in closing the loop on autonomous perception and action in challenging terrain.

Most navigation systems today rely on 2D representations of the world (e.g. occupancy maps), which are used solely to plan the robot's motion. With such a flat representation, the geometry of the world is lost, which means the robot can only go around possible obstacles and not over them. To increase the mobility of the robot, our first challenge is to build a three-dimensional model of the environment that preserves the geometric structure of the scene and, more importantly, the obstacles in it. The model is built and updated online as the robot moves, at a frequency compatible with the robot's speed. Periodic updates are sent to a motion planner module which can generate 3D trajectories using predefined motion primitives, such as "fast straight walk", "slow straight walk", "climbing steps" or "irregular obstacle climbing".
The quality of the model must therefore be as high as possible and compatible with the requirements of the motion planner. In order to help the planner select the most appropriate motion primitive, we propose the addition of semantic labels to the 3D model, by associating each part of the model with a predefined class of environment such as "stairs", "flat ground", "elevated flat ground", "ramp" or "irregular obstacle".

The literature on 3D mapping and reconstruction from range data is abundant, and due to space constraints we can only reference a few representative works (see [1–4]). Most of these maps are created using mobile robots equipped either with laser range finders or with stereo cameras. However, they all have one thing in common: their representation of the world includes only geometry and texture information and lacks any additional semantics. Furthermore, they are unsuitable for realtime processing. One special category of maps are hybrid 2D-3D maps such as [5–8], where 2D laser sensors are used to create a map for navigation and additional semantics are acquired through the use of vision. For example, in [5] places are semantically labeled as corridors, rooms and doorways. A similar decomposition of maps into regions and gateways is presented in [7]. The advantage of these hybrid representations is straightforward: they keep computational costs low and base their localization and pose estimation on the well-known 2D SLAM (Simultaneous Localization and Mapping) problem, while the place labeling problem is solved using feature descriptors and machine learning. However, by reducing the dimensionality of the mapping to 2D, all the world geometry is lost. Also, the label classes need to be learned a priori through supervised learning, which makes it unclear whether these representations scale well. Semantic mapping interpretations in 3D are used in [9, 10], where either generic architectural elements (floor, ceiling, walls, doors) are identified based on the relationships between geometric features (parallel, orthogonal, above, under, equal height) [9], or segmentation methods based on 3D point statistics (eigenvalue decomposition) are applied [10]. The classes obtained from outdoor ladar data in [10] are clutter (for grass and trees), linear (for wires and branches), and surface (for ground terrain and rocks). However, these mapping interpretations cannot be applied to data coming from stereo cameras, where noise levels are much higher and the available computation time is lower.

The remainder of this paper is organized as follows. Section 2 describes the overall system architecture. We present our Visual Odometry system, followed by the core Geometrical Reasoning module, in Section 3; once a complete polygonal model of the world is acquired, we perform a simple analysis based on the orientation of regions of polygons to obtain a 3D semantic model. We describe experimental results in Section 4, and discuss our conclusions and future work in Section 5.
2 System Overview
Our system builds a 3D model of the world using images obtained from a stereo camera attached to the robot. Each pair of images (left and right) from the stereo camera is used to compute a disparity image together with the associated point cloud data (PCD) in the camera coordinate frame [11].
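The reprojection from a disparity image to a point cloud follows the standard rectified-stereo pinhole model. The following is a minimal sketch of this step (the function name and parameters are ours, not from the implementation in [11]), assuming a focal length f in pixels, a stereo baseline b in meters, and a principal point (cx, cy):

```python
import numpy as np

def disparity_to_pcd(disparity, f, b, cx, cy, min_disp=1.0):
    """Reproject a disparity image to 3D points in the camera frame.

    Standard rectified-stereo model: Z = f * b / d, X = (u - cx) * Z / f,
    Y = (v - cy) * Z / f. Pixels with disparity below min_disp are dropped.
    """
    v, u = np.nonzero(disparity >= min_disp)   # valid pixel coordinates
    d = disparity[v, u]
    Z = f * b / d
    X = (u - cx) * Z / f
    Y = (v - cy) * Z / f
    return np.column_stack((X, Y, Z))          # N x 3 point cloud
```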
Fig. 1. The overall system architecture (blocks: RHex mobile robot with stereo camera and inertial measurement unit; disparity images and local pose into Visual Odometry; global pose and Point Cloud Data (PCD) into registration and polygonal modeling; local model; polygonal growing into the global model; refinement and labeling into the final semantic model).
A global 3D map is built incrementally by transforming each local PCD to a global coordinate frame. The accuracy of the global map depends directly on the precision with which we estimate the position of each separate PCD. This process is called registration, and its accuracy is crucial for the quality of the final model. Instead of employing a classical registration framework (e.g. one based on Iterative Closest Point), our solution employs a Visual Odometry module which takes as input monocular gray-scale images together with disparity images and computes the camera motion (3D rotation and translation) between two consecutive views. This motion is estimated by tracking characteristic features from one image to another [12]. The robot pose is directly derived from the estimate of the camera motion. Given the pose of the robot in an arbitrary global coordinate system, each PCD computed in the robot's camera frame can be registered to our global model.
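Registering a PCD with a given camera pose is a single rigid transform. A minimal sketch (names ours), assuming the pose is given as a rotation matrix R and translation vector t:

```python
import numpy as np

def register_pcd(points, R, t):
    """Transform an N x 3 point cloud from the camera coordinate frame
    to the global coordinate frame: x_global = R @ x_camera + t."""
    return points @ R.T + t
```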
The registered PCD associated with a frame (stereo pair of images) at time $t$ is represented by $P_t = \{x_1, x_2, \cdots\}$. For each frame, the PCD is passed through a pipeline in order to update our 3D model. The complete system architecture, including our processing pipeline, is illustrated in Fig. 1. The pipeline consists of the following steps: i) compute the local model $M^L_t$ from $P_t$; ii) merge the local model $M^L_t$ with the global model $M^G$; and iii) refine the global model and add semantic labels to compute the final model $M^F_t$. The output of our processing pipeline is a complete 3D polygonal model with annotated semantic labels. Each of the steps listed above is described in detail in the following sections.
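The per-frame control flow can be summarized with the following skeleton (a sketch with our own names; the individual steps are detailed in Sections 3.2-3.4):

```python
class ReconstructionPipeline:
    """Per-frame processing skeleton.

    Steps per registered point cloud P_t:
      (i)   build the local model M_t^L,
      (ii)  merge it into the global model M^G,
      (iii) refine and label to obtain the final model M_t^F.
    """

    def __init__(self):
        # Global model: octree cell key -> list of polygon patches.
        self.global_model = {}

    def build_local_model(self, points):
        """Step (i): local octree decomposition and per-leaf plane fitting."""
        raise NotImplementedError  # see Sections 3.2 and 3.3

    def merge(self, local_model):
        """Step (ii): merge local cells into the grid-aligned global octree."""
        raise NotImplementedError  # see Section 3.3

    def refine_and_label(self):
        """Step (iii): duplicate removal, polygon merging, semantic labels."""
        raise NotImplementedError  # see Section 3.4

    def process_frame(self, registered_pcd):
        local_model = self.build_local_model(registered_pcd)
        self.merge(local_model)
        self.refine_and_label()
        return self.global_model
```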
3 3D Model Construction
We first briefly describe the Visual Odometry algorithm and its role in the registration process in Section 3.1. We then describe the spatial data structure we use to facilitate the efficient merging of the local model with the global model in Section 3.2. The computation of the local model and the merging with the global model are described in Section 3.3. We introduce techniques for model simplification and refinement and discuss semantic labeling in Section 3.4.

3.1 Visual Odometry and PCD registration
Our Visual Odometry (VO) system [12] uses feature tracks to estimate the relative incremental motion between two frames that are close in time. Interest points (Harris corners) are detected in the left image of each stereo pair and tracked across consecutive frames. These interest points are then triangulated at each frame based on stereo correspondences. Three of these points are used to estimate the motion using absolute orientation, and each motion hypothesis is scored using the pixel reprojection errors in both cameras. We use the disparity space homography [13] to evaluate the inliers for the motion. In the end, the hypothesis with the best score (maximum number of inliers) is used as the starting point for a nonlinear minimization that reduces the pixel reprojection errors in both cameras simultaneously, resulting in a relative motion estimate between the two frames.

The VO computes a new camera pose at about 10 Hz. The variation between two consecutive poses is typically small at this frequency, and it would be highly inefficient to update the 3D model at the same rate. Therefore, we only process the new PCD information if the change in camera pose since the last update is greater than a threshold (we use 0.1 m for the translation and 0.1 rad for the rotation). Using the pose of the camera at frame $t$, we can compute the transformation $(R_t, t_t)$ from the camera coordinate frame, where the PCD resides, to the global coordinate frame. This transformation is applied to the PCD to obtain the registered PCD $P_t$ (see Figure 2).
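A minimal sketch of this "useful frame" gating, assuming poses are given as rotation matrices and translation vectors (the helper names are ours):

```python
import numpy as np

def pose_change(R_prev, t_prev, R_cur, t_cur):
    """Translation and rotation magnitude between two camera poses."""
    dt = np.linalg.norm(t_cur - t_prev)
    R_rel = R_prev.T @ R_cur                      # relative rotation
    # Rotation angle recovered from the trace of the relative rotation.
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return dt, np.arccos(cos_angle)

def is_useful_frame(R_prev, t_prev, R_cur, t_cur,
                    trans_thresh=0.1, rot_thresh=0.1):
    """Process a new PCD only if the pose changed enough (0.1 m / 0.1 rad)."""
    dt, dr = pose_change(R_prev, t_prev, R_cur, t_cur)
    return dt > trans_thresh or dr > rot_thresh
```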
Fig. 2. Registration of PCD data using global poses from Visual Odometry. The two images on the left show data obtained from two different viewpoints (poses). The image on the right illustrates the two point cloud datasets transformed and aligned in the global coordinate frame.
3.2 Spatial decomposition using Octrees
We decompose the volume occupied by the PCD in each frame ($P_t$) using a local octree structure, whose bounds are determined by the bounds of $P_t$. Figure 3 illustrates the decomposition of two datasets using an octree. The size of the cell at the leaves of the octree is fixed ($l_{SIZE}$), and a model (see Section 3.3) is computed for each occupied cell in the octree using the points that belong to that cell. We employ a dynamic octree structure for the global model ($M^G$), whose size grows to accommodate the data from each frame as it is processed. To facilitate the merging of the local and global models, we fix the cells in both octrees to be of the same size and perfectly aligned, by snapping them to a common grid. Thus, there is a one-to-one correspondence between cells in the global and local octrees. The use of octrees enables us to efficiently decompose the volume of interest, quickly access specific locations, and obtain lists of occupied cells.
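The grid snapping can be illustrated with integer cell keys; a minimal sketch, using a dictionary of cells in place of a real octree (names and the leaf size value are illustrative):

```python
import numpy as np

LEAF_SIZE = 0.1   # l_SIZE, in meters (illustrative value)

def cell_key(point, leaf_size=LEAF_SIZE):
    """Integer grid coordinates of the leaf cell containing a point.

    Because keys are computed on a grid anchored at the world origin,
    a cell in the local octree maps to exactly one cell in the global
    octree: the one-to-one correspondence used for merging."""
    return tuple(np.floor(np.asarray(point) / leaf_size).astype(int))

def decompose(points, leaf_size=LEAF_SIZE):
    """Group an N x 3 point cloud by occupied leaf cell."""
    cells = {}
    for p in points:
        cells.setdefault(cell_key(p, leaf_size), []).append(p)
    return cells
```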
Fig. 3. Spatial decomposition using dynamic octrees for two point cloud datasets. Only the occupied cells are drawn.
Fig. 4. Creating polygonal representations out of Point Cloud Data. From left to right: overlapping point cloud views, the superimposed PCD model with the resulting polygonal model for the entire scene, and the resulting polygonal model.
3.3 Local and Global Model computation
In this module, we first compute the local model $M^L_t$ using the point cloud $P_t$ from each frame and then merge the local model with the global model $M^G$. Point-based representations are unsuitable for direct use in a number of perception applications because they do not render a continuous surface. They also contain high redundancy, due to the density of the data as well as to points obtained from overlapping frames. Searching for shape models directly in large point cloud datasets is a costly process, as its efficiency depends on the number of points supporting a shape candidate. We therefore choose a polygonal surface representation, which is less expensive to store and manipulate.

While stereo-based PCD are cheaper and faster to acquire than PCD obtained from laser scans, they contain higher noise levels, and classical non-linear shape fitting techniques are prone to failure. Furthermore, modelling non-linear geometric shapes, while raising the computational complexity of the reconstruction considerably, does not bring additional semantic information for our application. Consequently, scenes and objects are approximated by sets of convex planar patches; our model for a scene thus consists of a set of convex polygons (planar patches) $M$.

The individual steps which lead to the creation of the local polygonal model are: i) the creation of the local octree; ii) the search for planar models in each leaf of the local octree; and iii) the creation of 2D polygonal hulls which approximate the input point cloud data in each leaf. In more detail, once the local octree is created, we iterate over every leaf. For the point cloud subset in each leaf, a RMSAC (Randomized M-Estimator SAmple Consensus) method is used to estimate the best planar models present in it. Each planar model is further refined by projecting the inliers onto the plane and then replacing the points with a planar convex polygon (we also keep track of the support of the polygon, i.e., the number of inliers). In our experiments, we observed that convex polygons are not only good approximations of the underlying polygonal structure of the scene, but they also provide a big advantage over triangulation methods or concave hulls (alpha shapes): they are fast to compute, geometrically stable, and do not require any thresholds or parameters. Figure 4 presents the polygonal fitting results for a given scene in the stairs dataset.

We next address the merging of the local model with the global model, which essentially involves dealing with the cells present in both the local and the global octrees. In each intersecting cell, we consider the similarity of the polygons from the global and the local models (in terms of the angle and distance between them). If the polygons are similar, we construct a new polygon which is the "average" of the two polygons weighted by their support. If the polygons are dissimilar, we replace the model with the polygon that has the larger support.
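To make the per-leaf plane search concrete, the following is a minimal sketch using plain RANSAC in place of the RMSAC estimator described above (the distance threshold, iteration count, and names are illustrative choices, not values from our implementation):

```python
import numpy as np

def fit_plane_ransac(points, n_iters=100, dist_thresh=0.02, rng=None):
    """Fit a plane to the points of one octree leaf.

    Sample 3 points, build a plane hypothesis, count inliers by
    point-to-plane distance, and keep the best-supported hypothesis.
    Returns (unit normal n, offset d, inlier indices) with n . x + d = 0.
    """
    rng = rng or np.random.default_rng()
    best = (None, None, np.array([], dtype=int))
    if len(points) < 3:
        return best
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n @ p0
        dist = np.abs(points @ n + d)
        inliers = np.nonzero(dist < dist_thresh)[0]
        if len(inliers) > len(best[2]):
            best = (n, d, inliers)
    return best
```

The inliers would then be projected onto the estimated plane and replaced by their 2D convex hull (e.g. using scipy.spatial.ConvexHull), with the inlier count kept as the polygon's support for the merging step.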
3.4 Model refinement and semantic labeling
In this section, we describe techniques to refine the global model by weeding out outliers and duplicate (or ghost) polygons, expanding polygons where possible, and finally performing semantic labeling. We minimize the processing time by considering only cells that have been updated in the most recent frame.

We perform simple outlier removal by deleting lone polygons, i.e., polygons that do not have any neighbors. Often, surfaces have duplicate polygons in adjacent cells of the global model. We identify such duplicates by searching for parallel polygons in the neighborhood of each polygon; if such a polygon exists, we remove the polygon with the smaller support from the model. If polygons in a 2 × 2 array of cells are coplanar, we merge them into a single (compound) polygon, which results in a simplified model. Extending this merging recursively is a future extension that we target. Figure 5 presents the effects of duplicate removal and polygon merging on the model.

Fig. 5. Reducing the model complexity through merging and duplicate removal: before (left) and after (right).

The chosen semantic labels are motivated by our application, which is to plan a trajectory over navigable regions. Figure 6 shows a polygonal representation before (left) and after (right) merging and semantic labeling. Each polygon in the global model is annotated with one of the following semantic labels (see the sketch after this list):

– Ground – polygons that are within 5 cm of the ground plane and form an angle of less than 15° with it;
– Level – polygons that are more than 5 cm from the ground plane and form an angle of less than 15° with it;
– Vertical – polygons that form an angle larger than 75° with the ground plane;
– Steps – groups of adjacent polygons that form a step, i.e., belong to planes that are parallel or perpendicular to the ground plane;
– Unknown – everything else.

After refining and labeling the global model, we expand "Ground" polygons. We begin with a random set of polygons labeled "Ground". All neighboring polygons labeled "Ground", "Level" or "Ramp" which are within 15° of a ground polygon are relabeled as "Ground". This step is performed iteratively until no more neighbors qualify. This ground extension is particularly helpful for 3D path planning, especially in scenes with multi-level ground planes connected by a ramp.
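A minimal sketch of the per-polygon labeling rules above, assuming the ground plane normal is +Z and that each polygon carries its unit normal and its elevation above the ground plane (names ours; the Steps label needs neighborhood information and is handled separately):

```python
import numpy as np

# Thresholds from the rules above.
HEIGHT_THRESH = 0.05            # 5 cm
FLAT_ANGLE = np.radians(15.0)   # 15 degrees
VERT_ANGLE = np.radians(75.0)   # 75 degrees

def label_polygon(normal, height):
    """Assign a semantic label to one polygon.

    The angle between the polygon's plane and the ground plane equals
    the angle between their normals; with a +Z ground normal it is
    arccos(|n_z|) for a unit normal n.
    """
    angle = np.arccos(np.clip(abs(normal[2]), 0.0, 1.0))
    if angle < FLAT_ANGLE:
        return "Ground" if height <= HEIGHT_THRESH else "Level"
    if angle > VERT_ANGLE:
        return "Vertical"
    return "Unknown"
```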
Fig. 6. Annotating the polygonal models with class labels leads to the creation of the 3D semantic model. Left: the global polygonal model $M^G$; right: the final model $M^F$.
4 Experimental Results
We have applied the system to several data sets containing different terrain types to test the validity and the overall performance of our approach. In our experiments, along with the overall modeling accuracy, two main characteristics of the system's behavior were monitored and analyzed: memory usage and computational requirements (i.e. processing time per frame). Each data set contained between 2,000 and 3,000 stereo frames, recorded at a resolution of 640×480 at 15 fps. An acquired stereo frame is considered useful for model building only if the pose of the camera changed sufficiently since the previous useful frame (see Section 3.1). This means that in a static environment, a still camera will generate only one useful stereo frame. While the acquired data sets are preliminary and thus not too large, they are fairly representative and useful in determining the system's behavior under various conditions. We plan to perform tests with sequences covering large-scale areas acquired over extended periods of time in the near future.

In order to analyze the computational requirements of each individual system component, we selected two data sets consisting of 2,097 (stairs) and 3,236 (dumpster) frames respectively. In Figure 7, the plot for each component represents the amount of time spent per useful stereo frame. The components are:

– Stereo + VO – obtains the disparity image from the stereo system and passes it to the Visual Odometry module, which computes the global pose estimate;
– PCD→LocalOctree – obtains the 3D point cloud data from the disparity image and creates/updates the local octree;
– GlobalOctree growing – activated only when the local octree's bounds exceed the global octree and the latter needs to be expanded;
– LocalOctree→Polygon→GlobalOctree – geometrical modeling methods which convert the point cloud data in the local octree into a polygonal model and merge it with the global octree;
– GlobalOctree→GlobalModel – model refinement (e.g. polygonal merging, polygonal growing, outlier removal);
– Functional Labeling – assigns class labels to each polygon in the scene, and performs region segmentation and growing.
Fig. 7. Computational time requirements per useful stereo frame for the dumpster and stairs data sets (per-component time in seconds vs. stereo frame number; components as listed above, plus the Total curve).

Fig. 8. Memory usage for the dumpster and stairs data sets (total memory used at each stereo frame, with the upper cap after allocation and the lower cap after deallocation).
The last component of the plots in Figure 7 (Total) represents the sum of all the above time plots. One important observation is that the majority of the geometry-based methods have low computational requirements; in fact, almost consistently, the stereo processing together with the Visual Odometry needs the largest share of resources. Even taken together, however, the total execution time per useful stereo frame stayed below half a second (≈ 0.44 s for dumpster and ≈ 0.35 s for stairs), which renders our approach suitable for fast, online processing.

Figure 8 illustrates the system's memory usage during execution for the same two data sets. The plots represent the amount of memory used by the entire system at each frame, together with the overall upper and lower memory caps. Not surprisingly, the total occupied memory at the end of each iteration grows almost linearly, with an average of 50 MB occupied after 3,000 frames. By extrapolation, roughly 60,000 frames could be stored in 1 GB of RAM, with virtually no penalty in processing time. The processing time does not increase with the number of frames because we process only the incremental updates to the model; however, further tests still need to be performed to verify this assumption.
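The memory extrapolation above is simple proportionality, under the assumption that growth stays linear:

$$\frac{1\,\mathrm{GB}}{50\,\mathrm{MB} / 3000\ \mathrm{frames}} = \frac{1024 \times 3000}{50}\ \mathrm{frames} \approx 61{,}000\ \mathrm{frames},$$

i.e., roughly 60,000 frames.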
Fig. 9. The point cloud data and semantic models for the stairs data set.
Figures 9 and 10 present the 3D semantic models obtained at the end of the stairs and dumpster sequences. The class labels (see Figure 6) are: the ground plane (green), vertical surfaces (red), horizontal or level surfaces (yellow), stairs (purple), and everything else as unknown or unlabeled (gray).

One important aspect that we have not addressed thus far is the accuracy of the final model, which can be assessed along two separate dimensions: i) the accuracy of the complete point cloud data set acquired using VO poses; and ii) the accuracy of the polygonal representation given a certain point cloud data set. While the latter clearly depends on factors such as stereo processing and camera resolution, the accuracy of the complete PCD model is directly determined by the performance of the Visual Odometry module. Because our point cloud views contain on average over 150,000 point samples, employing standard loop closure techniques in real time is not trivial: our pipeline attempts to process new frames as fast as possible and store the results in a compact polygonal format (see Figure 8). Since our model is incremental, there may be situations where the robot comes back to a previously visited point and the newly acquired data does not fully correspond to (i.e. is not perfectly aligned with) the current model. In this case, the robot will prioritize and use the newer model over the older one. However, this may be undesirable in certain applications, and our future plans are to address the loop closure problem in order to propagate pose errors and apply corrections throughout the model. Figure 11 presents a concrete example of VO drift in a larger data set (ground).
Fig. 10. The point cloud data and semantic models for the dumpster data set.
Fig. 11. Vertical model misalignment on loop closure caused by VO pose drift in the ground data set.
5 Conclusions and Future Work
We have presented a comprehensive system for real-time 3D mapping using point cloud data from stereo. The locally acquired point cloud views are registered in a consistent global frame using visual odometry techniques and transformed into an accurate polygonal representation. Based on the geometric properties of each polygon in the world, informative semantic annotations are added, leading to the creation of a 3D semantic polygonal model. The computational properties of our system, namely a processing time of under 0.5 s per relevant stereo frame and low memory requirements, make it well suited for fast, online 3D mapping. To validate these properties, we have analyzed its behavior under a wide range of situations using different datasets. Our project is under continuous development, and we are currently investigating the reinforcement of visual motion registration with loop closure techniques and more complex representation models. As future work, we plan to build accurate 3D maps from stereo over large-scale environments, as well as annotate the polygonal model with additional label classes.
Acknowledgments

The authors thank Edward Van Reuth and DARPA for support on the "Leaving Flatland" project (contract #FA8650-04-C-7136). This work was partially supported by the CoTeSys (Cognition for Technical Systems) cluster of excellence at the Technische Universität München.
References

1. Liu, Y., Emery, R., Chakrabarti, D., Burgard, W., Thrun, S.: Using EM to Learn 3D Models of Indoor Environments with Mobile Robots. In: ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning. (2001) 329–336
2. Biber, P., Andreasson, H., Duckett, T., Schilling, A.: 3D modeling of indoor environments by a mobile robot with a laser scanner and panoramic camera. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. (2004)
3. Weingarten, J., Gruener, G., Siegwart, R.: A Fast and Robust 3D Feature Extraction Algorithm for Structured Environment Reconstruction. In: International Conference on Advanced Robotics. (2003)
4. Diebel, J., Thrun, S.: An application of Markov random fields to range sensing. In Weiss, Y., Schölkopf, B., Platt, J., eds.: Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA (2006) 291–298
5. Mozos, O.M., Triebel, R., Jensfelt, P., Rottmann, A., Burgard, W.: Supervised semantic labeling of places using information extracted from laser and vision sensor data. Robotics and Autonomous Systems Journal 55(5) (May 2007) 391–402
6. Vasudevan, S., Gächter, S., Nguyen, V., Siegwart, R.: Cognitive maps for mobile robots: an object based approach. Robotics and Autonomous Systems 55(5) (2007)
7. Schröter, D., Weber, T., Beetz, M., Radig, B.: Detection and Classification of Gateways for the Acquisition of Structured Robot Maps. In: Proc. of 26th Pattern Recognition Symposium (DAGM), Tübingen, Germany. (2004)
8. Iocchi, L., Pellegrini, S.: Building 3D maps with semantic elements integrating 2D laser, stereo vision and INS on a mobile robot. In: 2nd ISPRS International Workshop 3D-ARCH. (2007)
9. Nuechter, A., Surmann, H., Hertzberg, J.: Automatic Model Refinement for 3D Reconstruction with Mobile Robots. (2003)
10. Vandapel, N., Huber, D., Kapuria, A., Hebert, M.: Natural terrain classification using 3-D ladar data. In: IEEE International Conference on Robotics and Automation. Volume 5. (April 2004) 5117–5122
11. Konolige, K.: Small vision systems: hardware and implementation. In: International Symposium on Robotics Research. (1997) 111–116
12. Agrawal, M., Konolige, K.: Real-time localization in outdoor environments using stereo vision and inexpensive GPS. In: International Conference on Pattern Recognition. (2006)
13. Konolige, K., Agrawal, M., Iocchi, L.: Real-time detection of independent motion using stereo. In: IEEE Workshop on Motion (WACV/MOTION). (2005)