Simultaneous Localization and Mapping for event-based Vision Systems David Weikersdorfer, Raoul Hoffmann, and J¨org Conradt Neuroscientific System Theory Technische Universit¨ at M¨ unchen
[email protected]
[email protected]
[email protected]
Abstract. We propose a novel method for vision based simultaneous localization and mapping (vSLAM) using a biologically inspired vision sensor that mimics the human retina. The sensor consists of a 128x128 array of asynchronously operating pixels, which independently emit events upon a temporal illumination change. Such a representation generates small amounts of data with high temporal precision; however, most classic computer vision algorithms need to be reworked as they require full RGB(-D) images at fixed frame rates. Our presented vSLAM algorithm operates on individual pixel events and generates high-quality 2D environmental maps with precise robot localizations. We evaluate our method with a state-of-the-art marker-based external tracking system and demonstrate real-time performance on standard computing hardware.
1
Introduction
Estimating your position within an environment is a crucial task for any autonomous artificial mobile system. In mobile robotics, intense research and development efforts have been devoted to systems that localize themselves on a map while creating and maintaining such an a-priori unknown map (simultaneously localization and mapping, SLAM, [1]). Typical engineered systems rely on a combination of sensors, such as ego-motion estimation (e.g. odometry or inertial measurements), distance sensors (e.g. laser range finders or RGB-D cameras), or on vision based feature tracking to create precise environmental representations for self-localization. An example of a visual SLAM method is KinectFusion [2], a dense surface mapping and tracking algorithm which requires highly optimized code and GPU hardware to run in real-time for a quite small spatial volume. This is typical for many state-of-the-art vision SLAM systems which perform extremely well with the drawback of high computational requirements: typical algorithms by far exceed the computing power provided by small autonomous robots and often perform offline, time-delayed data interpretation. In this paper we present a novel approach to visual SLAM in unknown environments for a biologically inspired event-based vision sensor that operates in real time. The integrated Dynamic Vision Sensor (DVS) [3] is a neurobiologically
Fig. 1. Left: The event-based embedded Dynamic Vision Sensor (eDVS). Right: An example data reading. Each image shows the 128x128 sensor with pixel events integrated over 0.1 seconds. Light pixels indicate change of illumination from dark to light - dark pixels a change from light to dark.
inspired pixel-wise asynchronous vision sensor that generates elementary events (”pieces of information”) only for every temporal illumination change perceived by any pixel (see fig. 1). Such a visual representation generates small amounts of data with high temporal precision, typically faster than 30 micro-seconds. Each individual event contains very little reliable information, as the underlying cause for this event might be any one of a large collection of possible state changes, or even pure noise. Frame based visual SLAM methods which operate on sequences of traditional camera images have nearly full localization information in each new image. With event-based sensors we only marginally update the systems belief about the true underlying state with every new event. This allows a more efficient system design, but most classic computer vision algorithms need to be redesigned as they require full images at fixed frame rates. In [4] an event-based particle filter algorithm for visual, event-based selflocalization using the eDVS sensor has been proposed. In that work a fixed and pre-build map stitched from photos has been used for navigation. This is a severe restriction for real world applications as the map should be generated directly by the acting robotic system without need for manual intervention or additional sensors. In this paper we extend this event-based tracking algorithm to a complete event-based simultaneous localization and mapping method which generates a map automatically during self-localization (see fig. 2). The rest of the paper is outlined as follows: In §2 we describe the concept behind our event-based simultaneous localization and mapping method. Details about our main contribution - the mapping - will be presented in §3 in general and specialized to the case of 2D SLAM. We evaluate our method on a large dataset in §4 and show that we can achieve excellent tracking and mapping results.
2
Event-based SLAM
In [4] a novel event-based tracking algorithm which adapts the particle filter Condensation algorithm [5] to the characteristics of an event-based sensor like
Ceiling Events from sensor
Events at edges
State estimate (particles)
eDVS sensor Event-based Localization
Tracked path
Event-based Mapping
Map
Floor
Fig. 2. Schematic overview of our tracking scenario for event-based tracking (left) and corresponding program flow (right).
the eDVS is proposed. An event-based sensor provides a continuous stream of events e(k) , e(k+1) , . . . ∈ R. R denotes the range of possible pixel positions on the sensor, i.e. R = [0, 127] 2 , and upper subscription is used to indicate event index. Due to the fact that the number of events created by the event-based sensor depends on movement and rotation speed, the event index does not correspond to system time. For each event e(k) ∈ R, the event-based tracking algorithm in [4] uses a particle set P (k) ⊂ Ω ×R+ for a multivariate probabilistic estimate of the current system state. Each particle p = (x, s) ∈ P (k) consists of the corresponding state estimate x ∈ Ω (position and orientation of the sensor) and a score value s ∈ [0, 1] representing the likelihood of the state estimate. State likelihoods sum up to 1 for a given particle set. Towards the goal of simultaneous localization and mapping, we introduce a dynamic map over the map domain Γ (Γ = R2 for a two-dimensional map), which our event-based SLAM method will continuously update during localization. This is a major extension of the event-based tracking method in [4] as it is no longer required to manually provide an a priori map. For each location u ∈ Γ on the map we represent the likelihood that the event-based sensor will generate an event when one of it pixels observes this location while the sensor is moving. Formally we model this likelihood as M(u) =
# of occurred events for u O(u) =: # of possible obervations for u Z(u)
(1)
It is important to note the difference between the event-based sensor and a classic image sensor: While a classic image sensor reports observations at fixed time intervals which can also be easily repeated by measuring multiple times, the event-based sensor generates events only if one of its pixels detects a change in illumination. This happens either when objects move in the scene or, as in our case, when the sensor moves within a static environment and one of its
Fig. 3. Example for occurrence map O (left), normalization map Z (middle) and final likelihood map M (right).
pixels traverses an edge in the perceived brightness of the scenery. Therefore map likelihood M measures the number of events generated at a map location with respect to the number of sensor pixel crossings. As locations on the map can be observed several times with varying frequency, the occurrence map O needs to be normalized using a normalization map Z in order to get stable and comparable map likelihoods. Our event-based localization and mapping method is outlined as follows: For every event, we alternately update the state estimates using the latest map and update the map using the new state estimate (see fig. 2). First, for event-based localization, we update the particle set P (k) using the current event e(k) and the likelihood map M(k−1) from the previous step provided by event-based mapping. State estimates xi are propagated using a diffusion model specific to the eventbased sensor and state probabilities si are computed using the likelihood map: (k)
si
(k−1)
= si
(k)
+ α M(k−1) (µ−1 (xi |e(k) ))
(2)
(see [4] for details). The projection function µ : Γ × R → Ω and its inverse are discussed in more detail in §3. Second, the map M(k) is updated using the current event e(k) and the current state estimate P (k) . The second step, eventbased visual mapping, is the major contributions of this paper and explained in more detail in section §3.
3
Event-based Visual Mapping
For the computation of the likelihood map M, we iteratively compute three maps over the map domain Γ during the mapping phase of our event-based SLAM method (see fig. 3): First, the occurrence map O : Γ → R+ where observation probabilities for events are summed up as events occur. Second, the normalization map Z : Γ → R+ which records the possibility of observations depending on the magnitude of movement of the sensor. Finally occurrence and normalization map are combined to the likelihood map M : Γ → R+ with eq. 1.
3.1
Computation of the occurrence map O
For the computation of the occurrence map, we project events from the sensor position back onto the map domain and integrate them using a Gaussian sensor model. The current event e(k) and the current state estimate P (k) = (k) (k) (k) (k) (k) {p1 , . . . , pn }, pi = (xi , si ), are used to compute corresponding observed −1 (k) (k) map locations µ (e |xi ). Here we use the inverse of a projection function µ : Γ ×Ω → R which projects a map location onto the sensor given a fixed state. We assume that µ−1 has a unique solution - for a discussion see §3.3. Thus the occurrence map is computed iteratively as O
(k)
(u) = O
(k−1)
(u) +
n X
(k)
si
(k) N u µ−1 (e(k) |xi ), σ , O(0) = 0 .
(3)
i=1
The Gaussian normal distribution N is used to represent measurement and discretization uncertainties. The standard deviation σ depends on camera parameters and the distance of the sensor to event location. As generally done in SLAM algorithms, it can be beneficial to not use all particles for updating the occurrence map, but only a fraction with highest probabilities. 3.2
Computation of the normalization map Z
The number of possible observations for the normalization map Z can be computed by considering the special properties of the event-based sensor: Assuming a strong edge in the perceived light intensity, the sensor will generate one event for every pixel which passes over the edge. Thus the fractional number of possible generated events for a map location u ∈ Γ is proportional to the length of its path on the sensor in pixel coordinates. Given a state estimate x ∈ Ω, we compute the corresponding fractional pixel position on the sensor using the projection function µ. Note that this position does not necessarily lie inside the sensor boundaries as not all areas of the map are visible by the sensor at all times. If a map point is not visible by the sensor under the current or previous state estimate the normalization map is not updated at this map location. Otherwise the normalization map is computed as (k)
(k−1)
Z (k) (u) = Z (k−1) (u) + kµ(u|x∗ ) − µ(u|x∗
)k , Z (0) = 0 .
(4)
(k)
x∗ denotes the expected state which is computed as the weighted mean of the whole particle set P (k) . Due to noise in the expected state and the high rate at which events are generated by the sensor, it is sensible to update the normalization map only periodically and not for every event. 3.3
Implementation example: Event-based SLAM for 2D scenarios
In this section we briefly explain how the general formulation of our event-based SLAM algorithm can be specialized for a 2D localization and mapping scenario
as presented in the introduction (see fig. 2). In this case, the state domain is the position of the robot on the floor and its rotation, i.e. Ω = R2 × SO(2). The map is constructed for a flat ceiling parallel to the floor on which the robot is moving, i. e. Γ = R2 . For this setting the projection function µ is realized using the pinhole camera model: µ : R2 × R2 × SO(2) → R, (u, x, θ) 7→ proj Rz (θ)−1 (u, D)T − (x, 0)T (5) with proj : R3 → R, v 7→
f
vy vx − cx , f − cy vz vz
(6)
where f and c are the usual camera model parameters and Rz (θ) is rotation about the z-axis by an angle θ. D is the constant height of the ceiling over the floor. The standard deviation σ in eq. 3 is chosen equal to half the size of a sensor pixel projected onto the ceiling. This represents the fact that the size of the Gaussian on the occurrence map for an individual pixel event should correspond to the size of a sensor pixel projected onto the map space. The inverse of the projection function µ−1 , which indicates where the occurrence map is updated in eq. 4, has a unique solution when the distance D is constant. For the more general case where the distance of the source of an event to the sensor is not known this is no longer the case. There are several possible strategies to solve this problem which are not further investigated in this paper: The depth information could be provided by additional sensors like Primesens depth-sensing sensors [6], or the depth information could be estimated by using multiple event-based sensors [7,8]. Another approach would be to use the full trace of the back projection µ−1 throughout a volumetric map using a cone-like probability distribution instead of a punctual Gaussian distribution in eq. 3.
4
Evaluation
In order to test our method we equipped a mobile robot with an upwards facing event-based dynamic vision sensor (eDVS) as depicted in fig. 2. The robot was remotely driven on varying paths through an indoor environment. For ground truth comparison, the robot was tracked externally by the marker-based motion capture system OptiTrack V100:R2. The event streams from the sensor were recorded and afterwards processed offline by our SLAM algorithm, resulting in estimated trajectories and constructed maps of the ceiling. Our tracking system did not use any additional information such as user driving commands or wheel rotation measurements. Our implementation uses pixel gridmaps for occurrence map O, normalization map Z and likelihood map M with a pixel size of 1 cm. To evaluate the quality of trajectories resulting from our method, we compare them against the ground truth paths from the external tracking system by calculating the Root-mean-square error (RMSE) of the position error and the
Fig. 4. Left: Photo of the ceiling. Middle: Resulting map from our method. Darker spots indicating a higher likelihood of events. The green scale bar indicates the size of the field of view of the sensor on the ceiling (ca. 2 meters). Right: Overlay of our map (magenta) and the edge map of the ceiling photo (blue).
error in orientation angle. As the world coordinate system of the external tracking system differs from the coordinate system chosen by our SLAM method we can only compare relative coordinates and thus have to first align our path to the ground truth path using a simple optimization over global position, global orientation, time offset and sensor rotation offset. We evaluated our method on a dataset with 40 different randomly selected paths. Figure 6 shows three typical examples and depicts the paths and maps created by our system, a comparison against ground truth as tracked by the overhead system and the corresponding error in position and orientation over time. An overview over the mean RMSE in position and rotation, the number of successfully tracked scenarios 1 and the processed events per second is evaluated on the whole dataset against a varying number of particles (table 1). Reported mean errors of 6.0 cm and 5.5 degrees are expected regarding the low sensor resolution of 128x128 pixels and the height of the ceiling of about three meters. The reported number of processed events per second demonstrate the efficiency of our method as in normal operating mode the sensor typical generates only around 25000 events per second. Fig. 4 shows that maps created by our method are clearly resembling reality by comparing against a photo of the ceiling. As our map captures illumination changes we show an overlay of our map and an edge map of the photo created 1
Scenarios are marked as not successful if a manual inspection shows a severe deviation of path and map from the the ground truth.
Table 1. Positional and rotational root-mean-square error (RMSE), failure rate and processing speed for a varying number of particles. Particles 5 10 25 75
RMSE pos. 35.4 5.9 6.0 6.0
cm cm cm cm
RMSE rot.
Failure rate
Events/s
◦
18/40 5/40 4/40 3/40
87800 80700 65600 38800
51.2 5.5◦ 5.5◦ 5.4◦
Fig. 5. Left: Photo of an indoor ceiling in a common office. Middle: Map and path created by our method. The green bar indicates the size of the field of view of the sensor on the ceiling (ca. 3 meters). Right: Overlay of our map (magenta) and the edge map of the ceiling photo (blue).
with the Sobel edge detector. As shown on with the scale bar in the figure, the mapped region is multiple times larger than the field of view of the sensor. It is observable that the likelihood M is higher at points of high local contrast in the photo. The ceiling of the room where the external tracking system is installed has various natural and artificially added distinctive features and edges. To show that our method is also suitable with ceilings with fewer features, e.g. only overhead lamps, we tested it in a common office room. A photo of the ceiling, the estimated robot path and the map created by our method and a map comparison is shown in fig. 5. This demonstrates that even though our method relies on variations of brightness in the scenery, i.e. the ceiling, and works best with high feature density, it also produces correct results for common office ceilings with a low feature density.
5
Conclusion
We presented a novel method for simultaneous localization and mapping for systems equipped with an event-based vision sensor (DVS). This sensor asynchronously reports individual events for perceived changes of pixel illumination. Due to the sparse nature of data reported by such a sensor, we achieved a significant reduction in required computing resources compared to current state-ofthe-art visual SLAM algorithms [1], allowing faster than real-time localization and map generation (table 1) on a single core desktop computer. We are currently investigating a time-optimized implementation that runs in real time on standalone microcontroller boards for possible commercial applications. For this paper we applied our method to a 2D visual SLAM scenario where a mobile robot moves on the ground and continuously localizes itself using features on the ceiling. Evaluation on a large recorded dataset demonstrates that our method works reliable with high accuracy already for a small number of particles.
1.5
0.5 0.4 0.3 0.2 0.1 0.00 0.5 0.4 0.3 0.2 0.1 0.00
2000 4000 6000 8000 10000 12000 14000 events
1.5
0.5 0.4 0.3 0.2 0.1 0.00 0.5 0.4 0.3 0.2 0.1 0.00
2000 4000 6000 8000 10000 12000 14000 events
1.5
0.5 0.4 0.3 0.2 0.1 0.00 0.5 0.4 0.3 0.2 0.1 0.00
1.5
m
1.0
m
0.5 0.0
1.0 1.51.5
4000
6000 8000 10000 12000 events
2000
4000
6000 8000 10000 12000 events
rad
0.5
2000
1.0
0.5
0.0 m
0.5
1.0
1.5
m
1.0
m
0.5 0.0
rad
0.5 1.0 1.51.5
1.0
0.5
0.0 m
0.5
1.0
1.5
m
1.0
2000 4000 6000 8000 10000 12000 14000 events
m
0.5 0.0
rad
0.5 1.0 1.51.5
1.0
0.5
0.0 m
0.5
1.0
2000 4000 6000 8000 10000 12000 14000 events
Fig. 6. Top to bottom: Three typical examples out of a total of 40 from the dataset. Left: Map and path as created by our method. Middle: Trajectories resulting from our method (red) and the external tracking system (blue). The trajectory starting point is marked with a X. Right: Positional and rotational error over number of resamples.
Acknowledgements The authors would like to thank Nicolai Waniek (NST, TUM) and Jan Funke, Florian Jug, Michael Pfeiffer and Matthew Cook from the Institute of Neuroinformatics (ETH and University Zurich) for valuable discussions.
References 1. Neira, J.P., Davison, A.J., Leonard, J.J.: Special issue on visual slam. 24 (2008) 2. Newcombe, R.A., Molyneaux, D., Kim, D., Davison, A.J., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Real-time dense surface mapping and tracking. In: IEEE International Symposium on Mixed and Augmented Reality. (2011)
3. Lichtsteiner, P., Posch, C., Delbruck, T.: A 128x128 120db 15us latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid State Circuits (2007) 4. Weikersdorfer, D., Conradt, J.: Event-based particle filtering for robot selflocalization. In: IEEE International Conference on Robotics and Biomimetics. (2012) 5. Isard, M., Blake, A.: Condensation conditional density propagation for visual tracking. International journal of computer vision (1998) 6. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. IEEE Conference on Computer Vision and Pattern Recognition (2011) 7. Schraml, S., Belbachir, A.N., Milosevic, N., Sch¨ on, P.: Dynamic stereo vision system for real-time tracking. In: IEEE International Symposium on Circuits and Systems. (2010) 8. Rogister, P., Benosman, R., Ieng, S.H., Lichtsteiner, P., Delbruck, T.: Asynchronous event-based binocular stereo matching. IEEE Transactions on Neural Networks and Learning Systems (2012)