Using Extended EM to Segment Planar Structures in 3D

Rolf Lakaemper, Longin Jan Latecki
Temple University, Philadelphia, PA, USA
[email protected], [email protected]
Abstract The proposed algorithm segments planar structures out of data gained from 3D laser range scanners, typically used in robotics. The approach first fits planar patches to the dataset using a new, extended Expectation Maximization (EM) algorithm. This algorithm addresses the classical EM problem of insufficient initialization by iteratively determining the number and positions of patches in a split and merge framework. Based on the fitting quality of the obtained patches, the approach then segments planar surfaces from the 3D environment. The result is a set of 2D objects, which can be used as input for classical computer vision applications, in particular for object recognition. Our approach makes it possible to apply classical tools of 2D image processing to solve problems of 3D robot mapping, e.g. landmark recognition. Index Terms – Segmentation, 3D Robot Mapping, EM
I. Introduction This paper deals with the segmentation of 3D range data. 3D range data results from sensors like laser range scanners, radar or ultrasonic sensors, which mainly offer a 3D mapping in the spatial domain rather than reflection intensity. Laser range scanners belong to the main sensors used in mobile robots. In particular, the breakthrough in robot mapping is due to the usage of laser range scanners that produce 2D as well as 3D point clouds representing the surfaces of perceived objects. Though the advantages of data gained by range scanners are obvious, the nature of the data leads to a new problem: the raw data consists of a huge number of single reflection points, i.e. the environment is described on a very low geometrical level. Although the scan order induces a topology when a single scan is used, even this basic structural information about the interrelation of the points is lost in most practical applications, since the overlay of multiple scans erases any structural correspondence and reduces the information in each data point to its pure location. Embedding the raw data into a superimposed grid representation re-establishes an intuitive topology, but is not helpful in many cases. First, in most environments (e.g. outdoors) the memory requirements make grid approaches infeasible. More importantly, a grid does not directly induce a higher geometrical representation of the data (we still have single grid cells), which is needed for higher-level processing, e.g. for mapping, segmentation or object recognition. A grid-based approach to robot mapping can be found in [10].
Another approach is to directly fit a model consisting of higher geometrical elements to the data, these elements being line segments or polylines in 2D and planar patches in 3D. The main techniques for this approach are the Hough Transform [12] and Expectation Maximization (EM) [5],[8]. These approaches yield a representation of the environment that is more useful for higher-level processing. Other interesting approaches to 3D robot mapping, based on statistics, can be found in [6], [7]. An overview of approaches to obtain polygonal 2D maps from laser range data can be found in [9]. General segmentation of arbitrary objects is currently not in reach, yet many applications only require the filtering of arbitrary planar objects (planar, but arbitrarily shaped in 2D) augmented by post-processing in 2D. An example is robot mapping and localization, especially the task of landmark detection. The landmarks would be planar patches, e.g. walls, distinguished by their specific properties (area, shape, etc.) in 2D, using existing techniques of computer vision (CV). In order to segment planar structures out of 3D laser range data, we introduce a new approach to fit planar patches to 3D datasets by an extended EM algorithm, called SMEMPF (Split and Merge Expectation Maximization Patch Fitting). SMEMPF iteratively adjusts the number and positions of planar patches by analyzing the quality of fit of each patch to its corresponding points. Observe that a 'good' fit can only be achieved for surfaces in the original data that are 'sufficiently' planar. We select only the 'sufficiently planar' patches for further processing and compute the orthogonal projection of the corresponding points onto the planar patches to obtain 2D frontal views of (nearly) planar objects. The obtained frontal views can then be processed using simple image processing techniques. We use morphological filtering followed by connected components and boundary extraction (see fig. 4, 5). The following steps outline the algorithm (the paper follows this outline):
1. Fitting of planar patches to the dataset using SMEMPF (section II)
2. Segmentation of (nearly) planar patches by fitting quality (section III A)
3. Projection of data points onto the corresponding planes to yield a 2D representation (section III B)
4. Processing in 2D using classical CV methods (section III C)
II. The Extended EM Framework We begin with a short overview of EM applied to plane fitting. The core of the procedure is simple least squares fitting, which is dimension independent; hence plane fitting is the 3D version of line fitting, presented in detail in [1]. 3D EM fits (infinite) planes to the given data. Taking a set of data points in 3D space and an initial set of planes as input, the algorithm alternates the following two steps until it converges (the algorithm is guaranteed to converge to some local optimum).
E-Step (Expectation Step): Given the current set of planes, for each point the probability of its correspondence to each plane is estimated based on its distances to the planes. M-step (Maximization Step): Given the probabilities computed in the E-step, the new positions of the planes are computed using a regression weighted with these probabilities.
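To make the alternation concrete, the following is a minimal sketch (not the authors' implementation) of EM plane fitting in Python/NumPy. Planes are represented as (centroid, unit normal) pairs; the function names, the fixed sigma and the iteration count are assumptions.

```python
import numpy as np

def e_step(points, planes, sigma):
    # soft assignment: p_ij ~ exp(-d(a_i, e_j)^2 / (2 sigma^2)), rows normalized to 1
    d = np.stack([np.abs((points - c) @ n) for c, n in planes], axis=1)
    p = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    return p / p.sum(axis=1, keepdims=True)

def m_step(points, probs):
    # weighted orthogonal regression: column j of probs weights the fit of plane j
    planes = []
    for j in range(probs.shape[1]):
        w = probs[:, j][:, None]
        c = (w * points).sum(axis=0) / w.sum()        # weighted centroid
        cov = (w * (points - c)).T @ (points - c)     # weighted scatter matrix
        n = np.linalg.eigh(cov)[1][:, 0]              # normal = smallest-eigenvalue direction
        planes.append((c, n))
    return planes

def em_planes(points, planes, sigma=0.1, iters=20):
    # alternate E- and M-step; converges to a local optimum
    for _ in range(iters):
        planes = m_step(points, e_step(points, planes, sigma))
    return planes
```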
A. Expectation Maximization Patch Fitting (EMPF) The data structure we utilize to fit the data is a set of planar rectangles, called patches, which are subsets of the planes computed by the general EM. To be more versatile, each patch is subdivided into a grid of tiles. We distinguish between supported and unsupported tiles by the number of data points close to the respective tile (a more detailed explanation follows below). All computations are processed solely on the set of supported tiles, e.g. the distance of a point to a patch is the distance of the point to the closest supported tile. The grid size G of a patch defines the length of the edges of its tiles. To conclude the data structure setup: the final result of the modified EM is a set of patches identifying planar macro structures (e.g. walls) that consist of a collection of (coplanar) supported tiles. The tiles identify the shape of these structures at a granularity or resolution determined by the patch's grid size. The proposed approach requires a minor extension of the general EM plane fitting to work with patches, which we call Expectation Maximization Patch Fitting (EMPF). The only difference of EMPF in comparison to EM plane fitting is that the input/output consist of patches instead of infinite planes. The input additionally consists of the set of data points to be fitted. EMPF is composed of the following three steps:
1. E-step with patches (the EM probabilities are computed based on the point distances to sets of supported tiles)
2. M-step with the probabilities computed in the E-step, resulting in infinite planes containing the new patches.
3. Trimming planes to patches and determining supported tiles.
Step (3) is composed of two substeps:
1. Assignment of supporting data points to planes.
2. Cutting the planes to patches and tiles by distance projection.
Detailed description of each step:
E-step: First we need to recall the computation of the EM probabilities. Let a1, ..., am be a set of data points in 3D space, and let s1, ..., sn be a set of patches. Usually m is significantly larger than n. For each point ai, the probability pij that ai corresponds to patch sj is computed for j = 1, ..., n. Analogously to EM plane fitting, this probability is computed based on the distance d(ai, sj) from point ai to patch sj, i.e. the distance to the closest supported tile in patch sj:

pij ∝ exp(−d(ai, sj)² / (2S²))    (1)

and normalized so that ∑j=1..n pij = 1 for each i. S is the assumed standard deviation of the dataset. After every E-step we obtain a matrix (pij), with each row i representing the patch affiliation probabilities for point ai, also called the support of point ai for the patches sj. Each column j can be seen as a set of weights representing the influence of each point on the computation of the new patch position in the M-step.

M-step: The output of the M-step, which performs an orthogonal regression weighted with (pij), is a set of (untrimmed) planes e1, ..., en corresponding to the input patches s1, ..., sn.

Trimming: In order to trim the planes to patches with supported tiles, we first need to assign supporting data points to planes (step 3.1). This assignment is based on the probabilities computed in the E-step. The support set S(ej) for a given plane ej is defined as the set of points whose probability of supporting plane ej is the largest (in comparison to the other planes), i.e., S(ej) = {ai : pij = max(pi1, ..., pin)}. A point ai supports a plane ej if ai ∈ S(ej). Trimming the planes to patches (step 3.2) is now a simple step, using the support sets. For each j, we define the set SP(ej) as the set S(ej) projected orthogonally onto the plane ej. The trimmed patch sj is the minimum bounding rectangle of SP(ej), with one edge direction defined by the principal axis of SP(ej). We obtain a set of patches s1, ..., sn with sj ⊂ ej. The support set of each patch is simply the support set of the corresponding plane, i.e., S(sj) = S(ej) for j = 1, ..., n; a point ai supports a patch sj if ai ∈ S(sj).

Determination of supported tiles: A patch sj is decomposed into equal tiles of edge length G (the grid size). For each tile tk of sj we define its support support(tk) as the number of data points supporting sj that lie in the cube C(tk), a cube of edge length G placed around tk. We call the union, over all tiles tk of a patch sj, of the supporting points lying in the cubes C(tk) the reduced support set of sj, Sr(sj) ⊂ S(ej).
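A sketch of the patch-based E-step and of the hard support-set assignment used for trimming might look as follows; representing each patch by the centers of its supported tiles and approximating d(ai, sj) by the distance to the closest tile center are simplifying assumptions, as are the names.

```python
import numpy as np

def point_patch_distance(a, tile_centers):
    # d(a_i, s_j): distance to the closest supported tile (approximated by tile centers)
    return np.min(np.linalg.norm(tile_centers - a, axis=1))

def e_step_patches(points, patches, sigma):
    # eq. (1): p_ij ~ exp(-d^2 / (2 S^2)), normalized per point
    d = np.array([[point_patch_distance(a, tiles) for tiles in patches] for a in points])
    p = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    return p / p.sum(axis=1, keepdims=True)

def support_sets(probs):
    # S(e_j) = { a_i : p_ij is maximal over j }, expressed as point indices
    best = probs.argmax(axis=1)
    return [np.flatnonzero(best == j) for j in range(probs.shape[1])]
```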
In each iteration a support threshold T is computed from the statistics of the support(tk) values over all tiles of a patch. Tiles tk with support(tk) > T are marked as supported tiles. If a patch does not contain any supported tiles, it is removed. The parameters G and T are computed dynamically in each trimming step. G is computed as 3 std(d), where std(d) is the standard deviation of the point-patch distances computed for all data points to the patches they support. T is computed as mean(c) − 2 std(c), where c is the number of points in the reduced support set Sr.
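The dynamic parameter update can be sketched as below; interpreting c as the collection of per-tile support counts is one reading of the text and therefore an assumption, as are the helper names.

```python
import numpy as np

def grid_size(point_patch_dists):
    # G = 3 * std(d), d = distances of all points to the patches they support
    return 3.0 * np.std(point_patch_dists)

def support_threshold(tile_counts):
    # T = mean(c) - 2 * std(c); tiles with support(t_k) > T become supported tiles
    c = np.asarray(tile_counts, dtype=float)
    return np.mean(c) - 2.0 * np.std(c)
```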
Fig. 1: Fitting data without and with splitting. A) Fitting the dataset with one patch (result of classical EM with an initial model of 1 patch). B) After splitting into 4 patches using SMEMPF.
B. Split and Merge EM Patch Fitting The following introduces the outline of the proposed algorithm. The proposed split and merge EM patch fitting (SMEMPF) algorithm iterates the following steps (described in detail below):
1) EMPF (Expectation Maximization Patch Fitting)
2) PS (Patch Split): data support evaluation of the patches obtained by EMPF (Section II.C)
3) EMPF
4) Patch Merge (Section II.D)
PS splits patches based on the distribution of supported tiles. If a patch contains a large number/area of unsupported tiles, it is split into multiple coplanar patches to allow a better fit of the supported tiles to the data in the next EMPF step (3). Hence PS creates a higher number of patches in order to optimally fit the input data points; the number of patches is not constant, as it is in classical EM. This way we overcome the problem of locally optimal solutions appearing in classical EM due to a wrongly chosen (too low) number of fitting patches, since such a solution will not have good global support in the data points. In addition to the starting number of patches, the initial positions of the patches also do not matter, since the following EMPF will reposition the split patches to better fit the data. In the merging step (4), pairs of similar patches are merged into single ones. Due to this process, the number of patches cannot grow to infinity. A few iterations of the proposed algorithm are illustrated in fig. 4. The proposed algorithm converges, since EM converges and the PS procedure (Section II.C) stops splitting once a certain goodness-of-fit criterion is met. Usually just a few iterations of steps (1)-(4) are required, e.g. 8 iterations in the experiment (Stanford Campus, fig. 4). Our stop condition is the stability of the distances of the data points to their closest patches. It should be mentioned that between certain iterations (e.g. every 5 iterations) new patches are added if there is a high number of points not supporting any patch. This situation can occur if the initial patch setup is placed far away from certain data points.
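The iteration order above can be summarized by the following driver skeleton; the four step functions and the stop test stand for the procedures of Sections II.A-D and are passed in as placeholders, so this is a structural sketch rather than the authors' code.

```python
def smempf(points, patches, empf, patch_split, patch_merge, distances_stable, max_iters=20):
    # one run of the proposed loop; the callables implement Sections II.A-D
    for _ in range(max_iters):
        patches = empf(points, patches)           # 1) EMPF
        patches = patch_split(points, patches)    # 2) PS: split poorly supported patches
        patches = empf(points, patches)           # 3) EMPF repositions the split patches
        patches = patch_merge(points, patches)    # 4) merge similar (nearly coplanar) patches
        if distances_stable(points, patches):     # stop: point-to-patch distances are stable
            break
    return patches
```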
C. PS: Patch Split A classical case of the EM local optimum problem is illustrated in fig. 1.
Clearly, the problem in fig. 1 A) is that only one patch is used, while 4 patches are needed. B) shows the result using SMEMPF, which automatically obtains the required number of patches. Fig. 2 illustrates the split operation described in this section. Splitting is processed recursively along axes of the patch having insufficient support of data points, until each resulting patch has a sufficiently homogeneous distribution of supporting points and therefore is not split further.
Fig. 2: Split. Left: the original patch; right: the split patches. The three new patches are slightly displaced compared to their origin, due to the new patch directions determined by the principal axes of the projected support points. Yet, by definition they are still coplanar with the original patch.
The grid size G indirectly steers the splitting process. A smaller G enables reaction to more local density differences, and the split resolution is higher (due to the smaller bin size). The resulting new patches are coplanar with the original patch sj. The split determines a new number of patches, a number that leads to a better fit to the data due to the repositioning of the newly created patches in the follow-up EMPF step (3).
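One possible way to realize the split decision is sketched below: supporting points are examined along one patch axis and the patch is cut at gaps of width at least G. The exact criterion in the paper is based on the distribution of supported tiles, so this is only an illustrative approximation with assumed names.

```python
import numpy as np

def split_along_axis(coords, G):
    # coords: 1D coordinates of supporting points along one patch axis
    # returns groups of point indices separated by unsupported gaps of width >= G
    if len(coords) == 0:
        return []
    order = np.argsort(coords)
    groups, current = [], [order[0]]
    for prev, nxt in zip(order[:-1], order[1:]):
        if coords[nxt] - coords[prev] >= G:       # gap with no support: cut the patch here
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups
```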
D. Merging Motivation: iterated EMPF followed by PS only, without merging, could exceed a reasonable number of patches. Therefore, merging 'similar' patches is necessary. Merging is also responsible for the accuracy of the statistical model; without it, the model may end up fitting the noise. The main idea is that if a given patch is properly split, then EMPF will reposition the resulting patches to better fit the data points in a way that turns them away from each other. If a patch is unnecessarily split, the patches remain very similar after an EMPF iteration, where similar means that they are nearly coplanar and close to each other. This suggests that merging should combine two perceptually similar patches into a single one, and leave perceptually dissimilar patches unchanged.
Fig. 3: Merging of two patches (left) to a single one (right). The merged patch results from fitting a single plane to the union of supporting points of the two original patches, followed by trimming.
Merging is derived from 2D line merging algorithms based on principles of perceptual grouping [11]. Intuitively speaking, the underlying similarity measure takes into account the closeness, coplanarity and angle between the normals of two patches. Note that the similarity is therefore based on the model, not on the dataset to be fitted. Merging of a single pair sa, sb of patches is processed in two steps:
1. determine similarity
2. if sufficiently similar: merge by creating a single patch using least squares fitting and trimming
Step 1: similarity determination. The patches have to be sufficiently close. This is checked by testing for overlap of the bounding cubes of the patches, expanded by G (the grid size) in each direction. If this condition is met, the patches are tested for being sufficiently coplanar and parallel. To determine this, the points ai ∈ Sr(sa) supporting sa are projected onto the plane eb containing sb, and vice versa (the points bj ∈ Sr(sb) are projected onto ea). The mean distance of the points to their projections is computed for both sets, resulting in two values da, db. If min(da, db) < G the patches are merged. Note that again the parameter G plays an important role.
Step 2: merge. A new plane is fit to the set union Sr(sa) ∪ Sr(sb) with a classical regression and trimmed to a patch. Merging is applied iteratively until all possible pairs of patches are sufficiently dissimilar. A simple merge example using only 2 patches is shown in fig. 3.
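The two similarity tests can be sketched as follows, with each patch reduced to its reduced support points and its carrier plane given as (centroid, unit normal); the argument names and the use of the support points' bounding boxes in place of the patch bounding cubes are assumptions.

```python
import numpy as np

def boxes_overlap(pts_a, pts_b, G):
    # closeness test: axis-aligned bounding boxes, each expanded by G, must overlap
    lo_a, hi_a = pts_a.min(axis=0) - G, pts_a.max(axis=0) + G
    lo_b, hi_b = pts_b.min(axis=0) - G, pts_b.max(axis=0) + G
    return bool(np.all(lo_a <= hi_b) and np.all(lo_b <= hi_a))

def mean_plane_distance(pts, centroid, normal):
    # mean distance of pts to the plane given by (centroid, unit normal)
    return float(np.mean(np.abs((pts - centroid) @ normal)))

def should_merge(pts_a, plane_a, pts_b, plane_b, G):
    if not boxes_overlap(pts_a, pts_b, G):
        return False
    da = mean_plane_distance(pts_a, *plane_b)      # Sr(sa) against plane eb
    db = mean_plane_distance(pts_b, *plane_a)      # Sr(sb) against plane ea
    return min(da, db) < G                         # coplanarity / parallelism test
```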
III. Segmentation A. Segmentation of planar patches by fitting quality The fitting quality of a patch with respect to its support points is given by the standard deviation STDi of the distances of the support points in the (non-reduced) support set S(si) to the patch si. A nearly planar object in the environment yields a low value of STDi. A non-planar, e.g. round, structure yields a high value of STDi, reflecting the insufficient modeling of its shape by a planar patch. An example can be seen in fig. 5(A): the tree tops form a point cloud of large radius which would be misrepresented by any single planar patch (hence no patch is shown in 5A), whereas the supporting points of the patches representing the walls are all relatively close. To distinguish between a good and a bad fit, the standard deviation STD of the distances of all points ∪iS(si) to their corresponding planes is computed. A patch si is considered to be of sufficient fitting quality, or well fitting, if STDi < k·STD, with a constant k determined by experiment. See fig. 5(A) for an example of well fitting patches.
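A sketch of this test, assuming the per-patch distance arrays are already available (names are illustrative, and k = 0.5 is the value reported in Section IV):

```python
import numpy as np

def well_fitting_mask(per_patch_dists, k=0.5):
    # per_patch_dists[i]: distances of the support points S(s_i) to patch s_i
    std_i = np.array([np.std(d) for d in per_patch_dists])   # STD_i per patch
    std_all = np.std(np.concatenate(per_patch_dists))        # global STD over all points
    return std_i < k * std_all                                # True for well fitting patches
```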
B. Projection of data points onto the corresponding planes to yield a 2D representation To obtain a frontal view of the object represented by a well fitting patch si, the set of supporting points S(si) is orthogonally projected onto the plane ei ⊃ si (ei is the untrimmed plane containing si as defined in section II). Fig. 5(C) shows the projected points of the main wall of fig. 5(A). The projection of the points is the transitional step from 3D to 2D, which enables further processing of the 3D elements in 2 dimensions.
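A minimal sketch of the projection step: the points are projected onto the plane ei and expressed in an in-plane coordinate frame (u, v). Here the frame is built from an arbitrary vector orthogonal to the normal; the paper aligns one axis with the principal axis of the projected points, so this is a simplification.

```python
import numpy as np

def project_to_plane_2d(points, centroid, normal):
    # orthogonal projection of 3D points onto the plane (centroid, normal),
    # returned as 2D coordinates in an in-plane frame (u, v)
    n = normal / np.linalg.norm(normal)
    seed = np.array([0.0, 1.0, 0.0]) if abs(n[0]) > 0.9 else np.array([1.0, 0.0, 0.0])
    u = np.cross(n, seed)
    u /= np.linalg.norm(u)
    v = np.cross(n, u)
    rel = points - centroid
    return np.stack([rel @ u, rel @ v], axis=1)    # 2D frontal view of the support points
```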
C. Processing in 2D using classical CV methods Step B yields a 2D point cloud that allows for straightforward post-processing using classical CV methods [3]. The obtained 2D point clouds can be segmented using alpha shapes [2], or the points can be embedded into a grid and then treated as image pixels, e.g. using morphological processing. For an example see fig. 5(C), top and center, the center being the result of morphological closing.
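The grid embedding and morphological step could, for instance, be done with SciPy as sketched below; the cell size and the structuring element are assumptions, not values from the paper.

```python
import numpy as np
from scipy import ndimage

def rasterize_and_close(pts2d, cell=0.05):
    # embed the projected 2D points into a binary grid, apply morphological
    # closing and label the connected components
    ij = np.floor((pts2d - pts2d.min(axis=0)) / cell).astype(int)
    img = np.zeros(tuple(ij.max(axis=0) + 1), dtype=bool)
    img[ij[:, 0], ij[:, 1]] = True
    closed = ndimage.binary_closing(img, structure=np.ones((3, 3)))
    labels, n_objects = ndimage.label(closed)
    return closed, labels, n_objects
```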
Fig. 4: Performance of SMEMPF on the 3D data set of Stanford Campus: A) data set and initial patches (10 patches, randomly selected), B) after iteration 1, C) final result: iteration 8, 61 patches, D) the reduced support sets Sr(ej) orthogonally projected onto their corresponding planes ej. Different colors mark different sets.
Fig. 5: Segmentation of planar patches from the dataset. A) A subset of the patches of fig. 4C): shown are patches with a low standard deviation in the distances of the supporting point sets to the supported patch; only planar environmental structures are left. B) Marking of the projected reduced support sets representing the segmented planar objects, using the same technique as described in fig. 4D. C) Further processing of the resulting 2D point set of the main structure of B) using classical CV techniques.
IV. Experimental results a) SMEMPF The dataset, taken by a real mobile robot equipped with a 3D laser scanner on Stanford Campus, consists of about 100,000 data points. The points, resulting from multiple scans, were aligned using the technique explained in [10]. Due to the mechanical properties of the robot the dataset is extremely noisy, e.g. the points corresponding to the (planar) side wall of the archway (see fig. 4, 5) form a point cloud of about 50 cm in width. The initialization for the experiment is a set of 10 randomly placed patches. Fig. 4 shows: (A) the initialization, (B), (C) patches and supported tiles after iterations 1 and 8 (the final iteration). The result contains 61 patches. The direction of a patch's bounding edges is determined by the principal axis of the supporting dataset and therefore does not always correspond to the visual expectation. This has no influence on the final result, shown in fig. 4(D), which consists of the reduced support sets Sr(ej) orthogonally projected onto their corresponding planes ej. Different colors mark different sets. The final result decomposes the original point set into subsets of points belonging together (i.e. supporting one patch), the points of each subset being relocated to be coplanar. The result models the data using 61 patches. Hence the initial number of 10 randomly placed patches was automatically increased, and the resulting 61 patches fit the data, yielding a visually intuitive solution.
b) Segmentation Fig. 5(A) shows a subset of the patches and points from fig. 4(C). It can clearly be seen that the tree tops are not sufficiently represented by the planar model; they form a point cloud of large radius. Fig. 5(A) shows all patches having a value STDi < k·STD, the well fitting patches, with a value of k = 0.5. Planar structures are kept, namely walls and parts of tree trunks (which appear as flattened half spheres in the data set), while more voluminous objects are dropped. Fig. 5(B) shows the projected supporting points forming planar point sets. Again, different colors mark different objects, the result of the segmentation.
c) Further Processing Fig. 5(C) is the 2D projection of the most prominent wall in fig. 5(B). To illustrate the possibilities of further 2D processing, the point set was embedded into a grid and processed by morphological closing to obtain planar objects, fig. 5(C), center. Fig. 5(C), bottom, shows the boundary of the objects after simplification using the Discrete Curve Evolution (DCE) process described in [4]. The resulting polygonal curves clearly express planar shape features, which can be analyzed using existing shape similarity approaches, e.g. [4], in order to recognize or classify landmarks. This approach can significantly improve landmark recognition.
V. Conclusions SMEMPF, a combination of Expectation Maximization and a new split and merge framework, is a powerful tool to model 3D data sets by planar patches in a visually natural way. Its robustness with respect to the initialization, especially the dynamic adjustment of the number of patches, solves the main problem of classical EM approaches. It is the foundation for the extraction of (nearly) planar objects from 3D datasets, using the quality of fit as the segmentation criterion. Extracting these objects from the entire 3D data set and projecting them to 2 dimensions simplifies 3D object recognition by mapping it to object recognition in 2D. The presented experiments show excellent performance of SMEMPF on ground truth and real world data sets and a successful segmentation of planar objects.
VI. References
1. A. Blake and B. Freeman, "Learning and vision: Generative methods," Lecture at ICCV, 2003.
2. H. Edelsbrunner, "Weighted alpha shapes," Report UIUCDCS-R-92-1760, Dept. Comput. Sci., Univ. Illinois, Urbana, Illinois, 1992.
3. D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach, Prentice Hall, 2002.
4. L. J. Latecki and R. Lakämper, "Convexity Rule for Shape Decomposition Based on Discrete Contour Evolution," Computer Vision and Image Understanding (CVIU) 73, 441-454, 1999.
5. S. Thrun, C. Martin, Y. Liu, D. Hähnel, R. Emery-Montemerlo, D. Chakrabarti, and W. Burgard, "A real-time expectation maximization algorithm for acquiring multiplanar maps of indoor environments with mobile robots," IEEE Transactions on Robotics and Automation, 2003.
6. S. Thrun, D. Hähnel, D. Ferguson, M. Montemerlo, R. Triebel, W. Burgard, C. Baker, Z. Omohundro, S. Thayer, and W. Whittaker, "A system for volumetric robotic mapping of underground mines," in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), Taipei, Taiwan, 2003.
7. S. Thrun, W. Burgard, and D. Fox, "A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping," in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), San Francisco, CA, 2000.
8. R. Triebel, W. Burgard, and F. Dellaert, "Using Hierarchical EM to Extract Planes from 3D Range Scans," in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2005.
9. M. Veeck and W. Burgard, "Learning polyline maps from range scan data acquired with mobile robots," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2004.
10. J. Weingarten and R. Siegwart, "EKF-based 3D SLAM for Structured Environment Reconstruction," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Edmonton, Canada, 2005.
11. M. Wertheimer, "Untersuchungen zur Lehre von der Gestalt, II," Psychologische Forschung, vol. 4, pp. 301-350, 1923.
12. D. F. Wolf, A. Howard, and G. S. Sukhatme, "Towards Geometric 3D Mapping of Outdoor Environments Using Mobile Robots," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1258-1263, 2005.