Research Article
Event-based stereo matching using semiglobal matching
International Journal of Advanced Robotic Systems
January-February 2018: 1–11
© The Author(s) 2018
DOI: 10.1177/1729881417752759
journals.sagepub.com/home/arx
Zhen Xie1, Jianhua Zhang1 and Pengfei Wang2
Abstract
In this article, we focus on the problem of depth estimation from a stereo pair of event-based sensors. These sensors asynchronously capture pixel-level brightness changes (events) instead of standard intensity images at a fixed frame rate. They therefore provide sparse data at low latency and high temporal resolution over a wide intrascene dynamic range. However, new asynchronous, event-based processing algorithms are required to process the event streams. We propose a fully event-based stereo three-dimensional depth estimation algorithm inspired by semiglobal matching. Our algorithm considers the smoothness constraints between nearby events to remove the ambiguous and wrong matches that arise when only the properties of a single event or local features are used. Experimental validation and comparison with several state-of-the-art, event-based stereo matching methods are provided on five different scenes of event-based stereo data sets. The results show that our method can operate well in an event-driven way and has higher estimation accuracy.

Keywords
Event-based camera, stereo matching, semiglobal matching, event-driven, disparity map

Date received: 23 August 2017; accepted: 5 November 2017

Topic: Special Issue—3-D Vision for Robot Perception
Topic Editor: Antonio Fernandez-Caballero
Associate Editor: Shengyong Chen
Introduction

Three-dimensional (3-D) vision aims to recover the 3-D structure of the real world, which is a challenging problem in computer vision. Within 3-D vision, depth perception is the key sensing issue for robots and other autonomous applications. Depth perception technologies can generally be categorized into two groups: active and passive methods. Active methods emit light and use triangulation or time-of-flight measurements for analysis; laser scanners, the Kinect, and time-of-flight cameras are active sensors. These sensors produce remarkable results but usually suffer from limited range and from outdoor sunlight, since the emitted light falls off with distance and is easily swamped by ambient light. Passive methods process pairs of images and use stereo or structure-from-motion methods to reconstruct depth, as is done by the ZED and the VI-Sensor. These methods usually have relatively long detection range and high resolution.
Thanks to publicly available benchmarks such as Middlebury1 and KITTI,2 which provide good test images for both indoor and automotive environments, stereo vision continues to mature steadily. However, frame-based stereo algorithms are not sufficient for agile motion such as fast flight and immediate obstacle avoidance. One important reason is that frame-based sensing pipelines do not meet the real-time performance requirements.
1 College of Computer Science, Zhejiang University of Technology, Hangzhou, People's Republic of China
2 Temasek Laboratories, National University of Singapore, Singapore, Singapore

Corresponding author:
Jianhua Zhang, College of Computer Science, Zhejiang University of Technology, No. 288, Liuhe Road, Hangzhou, Zhejiang 310014, People's Republic of China. Email: [email protected]
Figure 1. The principle of operation of a single DVS pixel and a single snapshot of the event stream in 20 ms.6 DVS: dynamic vision sensor.
The latency of current pipelines is typically on the order of 50–200 ms, including the time of sensor data capture and data processing. A brute-force solution is to use high-frame-rate cameras and a high-performance computer. However, this brute-force approach is not suitable for autonomous flight or mobile robots because of the limited payload and power consumption. The bioinspired vision sensor, also called event-based camera, mimics the retina by asynchronously generating responses to relative light intensity variations rather than the actual image intensity.3 Event-based cameras are data-driven sensors that have the advantages of low redundancy, high temporal resolution (on the order of microseconds), low latency, and high dynamic range (130 dB compared with 60 dB of standard cameras). These properties make them ideal devices for use on a mobile robot, but traditional frame-based algorithms are not well suited to operate on event-based data.

In this article, we present a fully event-based stereo matching algorithm for reliable 3-D depth estimation using semiglobal matching (SGM). For that, we use two dynamic and active-pixel vision sensors (DAVISs) to build an event-based stereo setup ("Event-based stereo camera setup" section) as input and calculate the matching cost from the spatiotemporal attributes of events ("Event-based matching cost calculation" section). Cost aggregation is performed as an event-driven approximation of a global cost function by path-wise optimizations from all directions through the image ("Event-driven semiglobal cost aggregation" section). Disparity computation is done by a winner-takes-all mechanism ("Event-driven disparity computation" section). The "Experiments and results" section presents the comparison with state-of-the-art, event-based stereo matching methods on five real scene data sets. The results demonstrate that our method has higher estimation accuracy and is robust in different scenes.
Related work

Event-based vision sensor

Before moving directly to event-based stereo, we first introduce the event-based vision sensor. Unlike the conventional frame-based camera, the event-based camera is not
driven by a fixed-frequency clock signal. Each pixel independently records an event (a structure representing the pixel's activity at a particular time) when it detects an illuminance change larger than a threshold. Today's event-based cameras are similar to Mahowald and Mead's silicon retina,4 and the data are encoded with the address event representation.5 The principle and a visualized output are shown in Figure 1. In this work, a DAVIS is used, which is an extension of the dynamic vision sensor (DVS)7 with higher resolution (240 × 180) and an additional frame-based intensity readout (not used in this work).8 Each event is represented by a quadruplet e(t, x, y, p); t is the time stamp, (x, y) are the image coordinates, and p is the polarity (ON/OFF), which indicates a luminance increase (ON) or decrease (OFF). The output of the DAVIS consists of a stream of asynchronous high-rate (up to 1 MHz) events together with a stream of synchronous gray-scale frames at a low frame rate (on demand and up to 24 Hz).
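As a minimal illustration of this event representation, the sketch below models the quadruplet e(t, x, y, p) as a small Python structure; the field names and units are our own assumptions, not part of any DAVIS driver API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """A single DVS/DAVIS event e(t, x, y, p)."""
    t: float   # time stamp in microseconds
    x: int     # pixel column (0..239 for a DAVIS240)
    y: int     # pixel row (0..179 for a DAVIS240)
    p: int     # polarity: +1 (ON, brightness increase) or -1 (OFF, decrease)

# Example: an ON event at pixel (120, 64), 1532 microseconds into the recording.
e = Event(t=1532.0, x=120, y=64, p=+1)
```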
Event-based stereo vision

The event-based stereo matching task is to find corresponding events from two different views and estimate the disparity. However, events do not encode information about absolute intensity, so the general features that are widely used in frame-based stereo matching cannot be constructed. Mahowald and Delbruck9 first explored event-based stereo vision by developing a chip for one-dimensional (1-D) image matching. The result was quite promising, but this chip was not flexible enough for two-dimensional (2-D) matching and a variable range of disparity. In recent years, more researchers have tried to explore event-based matching criteria using the spatiotemporal information of the DVS. Kogler et al.10 proposed a more flexible approach using temporal and polarity correlation with DVS sensors and achieved promising initial results. Rogister et al.25 found that using temporal and polarity criteria alone is prone to errors because the latency of events varies (jitter), so they added epipolar geometry, event ordering, and temporal activity constraints. Camuñas-Mesa et al. noted that previous methods using temporal and geometrical constraints alone still achieve relatively low reconstruction accuracy because ambiguities
cannot be uniquely solved.11 So they tried to explore more invariant motion constraints, such as orientation and time surfaces, for event-based stereo matching.11,12 Piatkowska et al. and Firouzi and Conradt13 both proposed modifications of the cooperative network to make use of the event stream. These methods used the idea of Marr's cooperative computing approach14 but made it dynamic to take into account the temporal aspects of the stereo events. Our previous work6 also uses a Markov random field (MRF) to consider the depth continuity between nearby events. The results show that the estimation rate of these methods considerably outperforms previous works. Recently, Eibensteiner et al.15 implemented an event-based stereo matching algorithm in hardware on a field-programmable gate array. Osswald et al.16 proposed a spiking-based cooperative neural network that can be directly implemented with neuromorphic engineering devices. The methods considering local temporal and geometrical constraints work remarkably well on some simple artificial data sets, but the estimation rate (ratio of depth estimates to input events) and the matching rate (ratio of correct depth estimates to total depth estimates) are still low. Recent cooperative network13 and belief propagation6 based methods show their advantages using disparity uniqueness and depth continuity constraints between nearby events. However, the neighborhood radius is only 1 or 2, because a larger radius increases the computational cost, and such a small neighborhood may lead to local optima and cause mismatches.

Event-based stereo camera setup

Figure 2. The stereo camera setup.6

Our stereo setup is built using two DAVISs mounted side-by-side and a ZED sensor under the event cameras, as shown in Figure 2. The hardware setup and software driver are the same as in our previous work.6 The baseline of the event-based cameras is 12 cm. The two DAVISs are synchronized by hardware to output event streams. The ZED sensor records depth frames as depth ground truth. Although the ZED works at 100 Hz, depth information is still not available between frames at the very moment the events are generated. We therefore use an approach similar to that in the study by Weikersdorfer et al.,17 which assigns to each event the smallest depth value from the latest frame in a one-pixel neighborhood; this suppresses noise and missing values. Later, the ground truth depth is used to evaluate the event-based matching method.

Algorithm

The main idea of our algorithm is borrowed from SGM, which offers a good trade-off among accuracy, robustness, and runtime compared to other global stereo matching algorithms. In this section, we first give a brief review of the SGM algorithm. Then, the three main steps of our event-driven stereo matching are introduced: event-based matching cost calculation, event-driven semiglobal cost aggregation, and event-driven disparity computation.

Semiglobal matching

Stereo matching can be defined as a labeling problem in which each label corresponds to a disparity. Typically, the quality of a labeling is formulated as a cost function, and finding the labels that minimize the cost function is the key problem. However, the labeling problem is a nondeterministic polynomial time (NP)-complete problem. Graph cuts and belief propagation18 are good approximate solutions to this problem, but their main drawback is the high computational cost. SGM22 has become a popular choice in real-time depth perception applications because it successfully incorporates the advantages of global and local stereo methods. Some extensions and modifications of the original SGM have been proposed to improve the runtime and the robustness to different environments.19–21 Typically, the framework of SGM can be broken down into three steps: matching cost estimation, cost aggregation, and disparity computation.22 The general cost function E(D) (defined in equation (1)) assigns a cost to a disparity image D by summing the matching cost and the penalties for the smoothness constraints

E(D) = \sum_p \Big( C(p, D_p) + \sum_{q \in N_p} P_1 \, T[\,|D_p - D_q| = 1\,] + \sum_{q \in N_p} P_2 \, T[\,|D_p - D_q| > 1\,] \Big)    (1)

where N_p denotes the neighborhood of p and T[·] is an indicator function that is 1 if its argument is true and 0 otherwise. The matching cost C(p, D_p) measures the difference between a pixel p(x, y) in the left image and the pixel (x − D_p, y) in the right image (assuming the left image is the base image). The difference can be defined in different forms, such as absolute gray-scale difference, mutual information,22 or the census transform.23 The penalties depend on the difference to the neighborhood disparities.
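For illustration only, the following sketch (not from the original paper) evaluates E(D) of equation (1) for a given integer disparity image and a precomputed matching-cost volume; the 4-connected neighborhood and the array layout are our assumptions.

```python
import numpy as np

def sgm_energy(D, C, P1, P2):
    """Evaluate the global cost E(D) of equation (1).

    D : (H, W) integer disparity image.
    C : (H, W, dmax) matching-cost volume, C[y, x, d] = C(p, d).
    P1, P2 : small / large smoothness penalties.
    """
    H, W = D.shape
    ys, xs = np.mgrid[0:H, 0:W]
    energy = C[ys, xs, D].sum()                  # data term: sum_p C(p, D_p)

    # Smoothness term over a 4-connected neighborhood N_p.
    for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        shifted = np.roll(np.roll(D, dy, axis=0), dx, axis=1)
        diff = np.abs(D - shifted)
        # Exclude neighbor pairs that wrap around the image border.
        valid = np.ones_like(D, dtype=bool)
        if dy == 1:  valid[0, :] = False
        if dy == -1: valid[-1, :] = False
        if dx == 1:  valid[:, 0] = False
        if dx == -1: valid[:, -1] = False
        energy += P1 * np.sum(valid & (diff == 1))   # small penalty for 1-pixel jumps
        energy += P2 * np.sum(valid & (diff > 1))    # large penalty for bigger jumps
    return energy
```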
In general global stereo matching, the penalties are defined by Potts, linear, or quadratic models. In SGM, if the disparity difference between neighboring pixels is 1, a small constant penalty P_1 is added; if the difference is larger than 1, a larger penalty P_2 is added, which preserves disparity discontinuities while favoring smooth surfaces.

Event-based matching cost calculation

For each incoming event e_l from the left camera, a set of possible corresponding candidate events e_r from the right camera is constructed within the disparity range and a recent time window (equation (2)). The matching cost of a candidate pair combines a temporal term and a geometric term (equations (3) and (4)). C_t is the cost in time and is normalized by the scalar e_t of the time difference. C_g is the cost of epipolar geometry, which measures the distance of the candidate point to the epipolar line, and e_g is a normalizing scalar of the geometry difference. D_o is a saturation term used to limit the maximum value of the matching cost. Because the events of both cameras are rectified, the form of C_g(e_l, e_r) simplifies to the absolute difference in the y coordinates. As a result, the matching cost of each incoming event and its candidates is a vector of length d_max, and this matching cost is used in the event-driven cost aggregation.
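A rough sketch of this matching-cost step is given below. It assumes the combined cost is the sum of the temporal and geometric terms capped at D_o; the candidate search over a small row band and all names (right_last_time, t_window, y_tol) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def matching_cost_vector(ev, right_last_time, d_max, t_window, e_t, e_g, y_tol, D_o):
    """Cost vector between a left event `ev` and its right-camera candidates.

    right_last_time : (H, W) array holding the time stamp of the most recent
                      right-camera event of the same polarity at each pixel.
    Returns a length-d_max vector; entry d is the cost of disparity hypothesis d.
    """
    H, W = right_last_time.shape
    cost = np.full(d_max, D_o, dtype=float)          # initialize at the saturation value
    for d in range(d_max):
        xr = ev.x - d                                # epipolar candidate column
        if xr < 0:
            continue
        best = D_o
        for yr in range(max(0, ev.y - y_tol), min(H, ev.y + y_tol + 1)):
            dt = ev.t - right_last_time[yr, xr]
            if dt < 0 or dt > t_window:              # no sufficiently recent candidate event
                continue
            C_t = dt / e_t                           # temporal cost (assumed form of equation (3))
            C_g = abs(ev.y - yr) / e_g               # epipolar geometry cost (assumed form of equation (4))
            best = min(best, C_t + C_g)
        cost[d] = min(best, D_o)                     # cap at the saturation term D_o
    return cost
```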
Event-driven semiglobal cost aggregation

Traditionally, the SGM algorithm accumulates the cost along all paths at every pixel of the image. Our modified event-driven SGM does not update the entire image simultaneously but updates whenever a new event arrives from the stereo matching step. We redefine the path cost L_r(p, d) of pixel p at disparity d along a path direction r in the original SGM22 with a temporal correlation kernel as
L_r(p, d) = H(t_m - \Delta t) \Big[ D(p, d) + \min\big( L_r(p - r, d),\ L_r(p - r, d - 1) + P_1,\ L_r(p - r, d + 1) + P_1,\ \min_i L_r(p - r, i) + P_2 \big) - \min_k L_r(p - r, k) \Big]    (5)

where H(·) is a Heaviside step function and Δt is the time since the last update of the node. The first term D uses the event-based matching cost computed in the previous section; the matching cost of nonactive pixels is set back to the initial large constant value. The second term redefines the penalty with the minimal path costs of the previous pixel p − r in a certain direction. The last term is the minimum path cost of the previous pixel over all disparities, which prevents constantly increasing path costs. The temporal correlation kernel ensures that only recently updated nodes are active. The kernel could be defined as inverse linear, Gaussian, or quadratic; in our algorithm, we use a Heaviside step function, which ensures that only active pixels are used to update the current incoming event. Meanwhile, considering the edge-like sparsity of the event data, each path does not cross the whole image but only the neighborhood N(p) (from p − n_r to p + n_r), which is a good trade-off between accuracy and efficiency. All L_r are initialized to zero. After the paths in all directions are calculated, the cost for each coming event and each disparity is obtained as

S(p, d) = \sum_r L_r(p, d)    (6)
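The path-cost recursion of equation (5) along a single direction r can be sketched as follows; the scan order, the bounds handling, and the dictionary of active matching costs are simplifying assumptions. Summing the result over the four directions used in Algorithm 1 then gives S(p, d) of equation (6).

```python
import numpy as np

def path_cost(D_cost, last_time, p, r, t_now, t_m, n_r, d_max, P1, P2, big=1e6):
    """Aggregate the 1-D path cost of equation (5) toward pixel p along direction r.

    D_cost    : dict mapping (y, x) -> length-d_max matching-cost vector D(p, d)
                for recently active pixels.
    last_time : (H, W) array with the last event time at each pixel.
    Returns the length-d_max vector L_r(p, d).
    """
    y, x = p
    dy, dx = r
    H, W = last_time.shape
    L_prev = np.zeros(d_max)                         # all L_r initialized to zero
    # Walk from the farthest neighbor p - n_r * r forward to p.
    for step in range(n_r, -1, -1):
        qy, qx = y - step * dy, x - step * dx
        if not (0 <= qy < H and 0 <= qx < W):
            continue                                 # outside the image: skip
        active = (t_now - last_time[qy, qx]) <= t_m  # Heaviside temporal gate
        D = D_cost.get((qy, qx)) if active else None
        if D is None:
            D = np.full(d_max, big)                  # nonactive pixel: large constant cost
        m = L_prev.min()
        L = np.empty(d_max)
        for d in range(d_max):
            best = min(L_prev[d], m + P2)
            if d > 0:
                best = min(best, L_prev[d - 1] + P1)
            if d < d_max - 1:
                best = min(best, L_prev[d + 1] + P1)
            L[d] = D[d] + best - m                   # subtract m to bound path costs
        L_prev = L
    return L_prev

# S(p, d) of equation (6) is then the sum of path_cost over the chosen directions,
# for example r in {(0, 1), (0, -1), (1, 0), (-1, 0)}.
```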
Event-driven disparity computation

Finally, the disparity that minimizes the cost for each pixel is selected as the output disparity. Stereo matching is a labeling problem, and we select the label d_p that minimizes the aggregated cost S(p, d_p) as the best disparity for the pixel. If the cost of the best disparity is greater than an outlier threshold t_o, no disparity output is generated for the node. This mechanism makes our disparity computation event driven: whenever a new event arrives, the most likely disparity at the location of the observation can be output.
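A minimal sketch of this winner-takes-all selection with the outlier threshold (names are illustrative):

```python
import numpy as np

def select_disparity(S_p, t_o):
    """Return the disparity minimizing the aggregated cost S(p, d) of equation (6),
    or None if even the best cost exceeds the outlier threshold t_o (no output)."""
    d_best = int(np.argmin(S_p))
    return d_best if S_p[d_best] <= t_o else None
```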
The general workflow of the algorithm is depicted in Algorithm 1.

Algorithm 1: Event-driven SGM
Initialize the last spike time map, the matching cost, the cost of the 1-D paths, and the parameters
for each incoming event e = (c, t, x, y, p) do
  Update the last spike time map
  Construct the set of possible corresponding candidates using equation (2)
  for each candidate matching pair C_e do
    Calculate the temporal and geometrical differences as the data cost term using equations (3) and (4)
  end for
  Aggregate the matching cost in 1-D from four directions equally; the cost along a path in direction r of pixel p at disparity d is defined by equation (5)
  Compute the cost at (x, y) using equation (6)
  Select the disparity that minimizes the cost (if less than the outlier threshold t_o)
  Store the time at which each node was updated for future use in equation (5)
end for

Experiments and results

Experiment setup

The experiments compared our algorithm with other event-based stereo matching algorithms.11,13,25 Previous studies typically use simple objects such as pens,25 rings, and cubes11 as stimuli and show the detected disparity, but the accuracy of the algorithms is not quantitatively analyzed, and there is no ground truth depth for each event to precisely analyze the results. Firouzi and Conradt used more complex stimuli such as hands shaking at different depths; however, the ground truth was estimated by manually measuring the distance between the camera and the object, and all triggered events were assumed to lie at the same disparity.13 Recently, some data sets for event-based simultaneous localization and mapping26,27 have become available, but none of them was created for event-based stereo matching, and the previous research mentioned above does not release its test data sets. So, we have tried to replicate these algorithms and use our own data sets, which include not only simple rigid objects such as boxes but also flexible objects such as walking people, with depth ground truth.6 Five different scenes of event stereo data sets are recorded for comparison, listed as follows:

1. one box moving sidewise (one box),
2. two boxes at different depths moving sidewise (two boxes),
3. one person walking sidewise (one person),
4. two people at different depths walking sidewise (two people), and
5. one person walking from near to far (one person at different depths).

The parameters used for the comparison are manually tuned on the one box moving sidewise recording to achieve the best result; the same parameters are then used for the other four recordings. The main parameters of our algorithm are set as follows: P_1 = 15, P_2 = 200, t_t = 20 ms, t_m = 10 ms, d_max = 50, e_t = 1 ms, e_g = 1, t_d = 2, D_t = 32, and t_o = 190.
Figure 3. The result of one box scene. The upper row is a color-coded depth map generated by accumulating 40 ms of depth estimates. The ground truth depth of the box is 2 m and the corresponding disparity is 15. The lower row shows the event disparity histograms over a period of 3 s. From the left to right, the result is extracted by ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
Figure 4. The result of the one person scene. The upper row is a color-coded depth map generated by accumulating 40 ms of disparity estimates. The ground truth depth of the person is 3 m and the corresponding disparity is 10. The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
The depth map and the disparity histogram are used to evaluate the performance of each algorithm. In order to visualize the events with depth, the depth maps are accumulated over 40 ms of events. Each pixel represents an event, and the jet color map is used to present the depth from red to blue (red means close and blue means far away). The disparity histograms are created to show the number of events at each disparity.
In order to quantitatively evaluate the results, we used four measures: estimation rate, estimation accuracy, average depth error, and depth accuracy. The estimation rate is the number of stereo matches detected divided by the number of input events from the left camera. The estimation accuracy is the percentage of estimated disparities that are within one pixel of the ground truth disparity (obtained from the ZED).10
Figure 5. The result of the two boxes scene. The upper row is a color-coded depth map of a 40-ms long stream of events for two moving boxes (one is at 1.5 m and another at 3 m). The lower row shows the event disparity histograms over a period of 5 s. From the left to right, the result is extracted by ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
Figure 6. The result of the two people scene. The upper row is a color-coded disparity map of a 40-ms long stream of events for two walking people (one at 1.5 m and another at 3 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
The estimation accuracy is a good indicator of the performance of the algorithm, and it is also used in the evaluation of previous work. The depth error is determined not only by the disparity error but also by the true depth. So, we also use the average depth error and the depth accuracy. The average depth error is the mean error between the estimated depth and the ground truth (obtained from the ZED). The depth accuracy is measured as a percentage, defined as
z_{acc}(Y) = \frac{1}{N} \sum_{i=1}^{N} H\Big( Y - \frac{|z'_i - z_i|}{z_i} \Big)    (7)
where z indicates depth (the z-direction from the camera), z'_i and z_i are the estimated and ground truth depths, respectively, Y is the error tolerance percentage, H(·) is the Heaviside step function, and N is the number of depth estimates generated for the sequence.
Figure 7. The result of the one person at different depths scene. The upper row is a color-coded disparity map of a 40-ms long stream of events for one walking person (from 1 to 5 m). The lower row shows the event disparity histograms over a period of 5 s. From left to right, the results are extracted by the ST, STS, Cop-Net, and our ESGM methods. SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM.
z_acc(Y) gives the percentage of estimates for which the error is below the threshold of Y%. In the experiments, ST denotes Rogister's method, which enforces space (epipolar) and time constraints for stereo matching. STS denotes matching based on spatiotemporal surfaces.12 Cop-Net denotes Firouzi's cooperative network approach, and ESGM denotes our event-based SGM approach. All the algorithms are implemented in MATLAB 2016b (64-bit), running on an Intel Xeon 3.2 GHz processor with 12 GB RAM.
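As a small illustration, the depth accuracy of equation (7) can be computed as follows; the helper name and example values are hypothetical.

```python
import numpy as np

def depth_accuracy(z_est, z_gt, Y):
    """Fraction of depth estimates whose relative error |z'_i - z_i| / z_i
    is below the tolerance Y, as in equation (7)."""
    z_est = np.asarray(z_est, dtype=float)
    z_gt = np.asarray(z_gt, dtype=float)
    rel_err = np.abs(z_est - z_gt) / z_gt
    return np.mean(rel_err < Y)          # Heaviside step: counts 1 where error < Y

# Example: three estimates against ground truth, 10% tolerance -> 2/3 accepted.
print(depth_accuracy([2.05, 1.80, 3.10], [2.0, 2.0, 3.0], Y=0.10))
```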
Result and discussion

In the scenes of one box moving sidewise and one person walking sidewise, we test the algorithms with a single moving object of different complexity. The depth maps and the disparity histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown, respectively, in Figures 3 and 4. In the depth maps, STS has fewer mismatches than ST. Typically, the mismatches appear in complex areas (where events are triggered almost at the same time and in nearby rows). The result using Cop-Net is denser than the others and removes some mismatches, but mismatches or blank areas remain in the hand and leg parts of the walking person. Our method clearly has fewer wrong matches (red or dark blue pixels). In the disparity histograms, the results using Cop-Net and our algorithm are obviously sharper around the ground truth, and our algorithm performs better at eliminating mismatches. To evaluate the performance in the case of temporally overlapping objects, the scenes with two objects (two boxes and two walking people) are used.
Table 1. The choice of options.

Data set                       Method    Est-rate (%)   Est-Acc (%)   Average error (m)
One box                        ST        40.74          68.16         0.2074
                               STS       49.16          73.33         0.1396
                               Cop-Net   71.84          75.29         0.1803
                               ESGM      39.22          76.47         0.0875
One person                     ST        45.43          52.33         0.3418
                               STS       47.87          56.53         0.1824
                               Cop-Net   50.77          74.10         0.1663
                               ESGM      34.10          86.94         0.0999
Two boxes                      ST        34.98          54.25         0.4452
                               STS       34.34          62.61         0.3325
                               Cop-Net   61.23          75.29         0.3729
                               ESGM      44.13          89.10         0.0707
Two people                     ST        43.06          42.59         0.3618
                               STS       40.89          47.28         0.2702
                               Cop-Net   49.73          67.08         0.4780
                               ESGM      42.78          75.38         0.1219
One person different depth     ST        37.86          41.33         0.4572
                               STS       35.42          46.08         0.3448
                               Cop-Net   53.84          40.78         0.4890
                               ESGM      22.24          60.84         0.1219

SGM: semiglobal matching; ST: space and time constraint; STS: spatiotemporal surface; Cop-Net: cooperative network; ESGM: event-based SGM. The bold values indicate the quantitative results of the estimation rate, the estimation accuracy, and depth accuracy.
The depth maps and the disparity histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown, respectively, in Figures 5 and 6. The performance of each method follows the same pattern as in the one-object scenes.
Figure 8. The relationship between depth accuracy z_acc(Y) and error tolerance Y calculated using equation (7). (a) One box. (b) Two boxes. (c) One person. (d) Two people. (e) One person at different depths.
To evaluate the performance in the case of one flexible object moving at different depths, the scene in which one person walks from near to far is tested. The disparity maps and histograms using the ST, STS, Cop-Net, and our ESGM methods (from left to right) are shown in Figure 7. This data set is more difficult because the disparity of the object varies in time and different parts of the body trigger events across common epipolar lines. None of the methods performs as well as on the previous data sets, but the proposed algorithm performs better than the previous algorithms when compared to the ground truth. The quantitative results for the estimation rate, the estimation accuracy, and the depth accuracy of each method are shown in Table 1. The estimation rate of Cop-Net is higher
than that of the other methods. The average estimation accuracy of the proposed method is 10–20% better than that of the others, and the mean depth error is much smaller. Figure 8 plots the depth accuracy z_acc(Y) (vertical axis) against the error tolerance Y (horizontal axis), as described in equation (7), to visualize the depth accuracy of each algorithm. The larger the area under the curve, the better the depth accuracy. It should be mentioned that there are some black grid lines in the disparity maps, which become more obvious when the events are dense. This is caused by the rectification. For each
coming event, we need to find the corresponding event in rectified camera coordinates, and the coordinates usually need to be rounded to integers. For all the recordings tested in this article, the ESGM algorithm performs better in estimation accuracy and shows robustness across the different data sets. The ST and STS algorithms try to use properties or local features of the event to find the corresponding match. Although the criteria used are well defined, these methods may still suffer from noise and occlusion. The cooperative network algorithm creates a network to store the "state" of recently detected events; the disparity of each coming event then relies not only on the matching result but also on its spatiotemporal neighborhood, which suppresses mismatches and increases the estimation rate. However, a small neighborhood is likely to be affected by noise and previous mismatches, and updating a large neighborhood is time-consuming and not suitable for the edge-like event streams. One good example can be found in the first row of Figure 4. Focusing on the front leg of the walking person, the ST and STS methods both show mismatches, because the spatial, temporal, and local feature criteria are similar in the leg area. The cooperative network method may resolve this situation, but its parameters were not tuned on the walking data set, and a high outlier threshold may remove all the events as mismatches. The advantage of our method is the use of more information from paths in all directions to decide the influence between a pixel and its neighborhood. The 1-D regularized cost can be calculated efficiently, so the range of each path is much larger than the neighborhood of the cooperative network, which suppresses local mismatches and makes the algorithm more robust on different data sets with the same parameters.
Conclusion

We propose a fully event-based 3-D depth perception algorithm using a message passing method. Different from previous algorithms that only consider the constraints of a single event, our algorithm considers the uniqueness constraints and the disparity continuity constraints between adjacent events. Based on the traditional idea of SGM, we propose a novel event-driven SGM framework. Compared to several state-of-the-art, event-based stereo matching methods, the results show that our method has higher estimation accuracy. Further work will focus on the situation of motion and on dense depth reconstruction.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (U1509207, 61325019).
ORCID iD Pengfei Wang
http://orcid.org/0000-0002-0567-8406
References
1. Scharstein D and Szeliski R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vis 2002; 47(1): 7–42.
2. Menze M and Geiger A. Object scene flow for autonomous vehicles. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, June 2015, pp. 3061–3070. Los Alamitos: IEEE Computer Society Press.
3. Posch C, Matolin D, and Wohlgenannt R. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE J Solid-State Circ 2011; 46(1): 259–275.
4. Mahowald MA and Mead C. The silicon retina. Sci Am 1991; 264(5): 76–82.
5. Mahowald MA. VLSI analogs of neural visual processing: a synthesis of form and function. PhD Thesis, California Institute of Technology, Pasadena, CA, USA, 1992.
6. Xie Z, Chen SY, and Orchard G. Event-based stereo depth estimation using belief propagation. Front Neurosci 2017; 11: 535.
7. Lichtsteiner P, Posch C, and Delbruck T. A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE J Solid-State Circ 2008; 43(2): 566–576.
8. Brandli C, Berner R, Yang MH, et al. A 240 × 180 130 dB 3 μs latency global shutter spatiotemporal vision sensor. IEEE J Solid-State Circ 2014; 49(10): 2333–2341.
9. Mahowald MA and Delbruck T. Cooperative stereo matching using static and dynamic image features. Analog VLSI Implementation of Neural Systems 1989; 8: 213–238.
10. Kogler J, Humenberger M, and Sulzbachner C. Event-based stereo matching approaches for frameless address event stereo data. In: Advances in visual computing: 7th international symposium (ISVC), Las Vegas, NV, USA, 26–28 September 2011, Proceedings, Part I, pp. 674–685.
11. Camuñas-Mesa LA, Serrano-Gotarredona T, Ieng SH, et al. On the use of orientation filters for 3D reconstruction in event-driven stereo vision. Front Neurosci 2014; 8: 48.
12. Lagorce X, Orchard G, Galluppi F, et al. HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Trans Pattern Anal Mach Intell 2016; PP(99): 1–1.
13. Firouzi M and Conradt J. Asynchronous event-based cooperative stereo matching using neuromorphic silicon retinas. Neural Process Lett 2016; 43(2): 311–326.
14. Marr D. Vision: a computational investigation into the human representation and processing of visual information. Quar Rev Biol 1983; 8.
15. Eibensteiner F, Brachtendorf HG, and Scharinger J. Event-driven stereo vision algorithm based on silicon retina sensors. In: 2017 27th international conference Radioelektronika (RADIOELEKTRONIKA), Brno, Czech Republic, 19–20 April 2017, pp. 1–6. Los Alamitos: IEEE.
16. Osswald M, Ieng SH, Benosman R, et al. A spiking neural network model of 3D perception for event-based neuromorphic stereo vision systems. Sci Rep 2017; 7: 40703.
17. Weikersdorfer D, Adrian DB, Cremers D, et al. Event-based 3D SLAM with a depth-augmented dynamic vision sensor. In: IEEE international conference on robotics and automation, Hong Kong, China, 31 May–7 June 2014, pp. 359–364. Los Alamitos: IEEE Computer Society Press.
18. Felzenszwalb PF and Huttenlocher DP. Efficient belief propagation for early vision. In: Proceedings of the 2004 IEEE computer society conference on computer vision and pattern recognition (CVPR), Washington, DC, USA, 27 June–2 July 2004, Vol. 1, pp. I-261–I-268. Los Alamitos: IEEE Computer Society Press.
19. Hermann S and Klette R. Iterative semi-global matching for robust driver assistance systems. Lect Notes Comput Sci 2012; 7726: 465–478.
20. Michael M, Salmen J, Stallkamp J, et al. Real-time stereo vision: optimizing semi-global matching. In: Intelligent vehicles symposium, Gold Coast, QLD, Australia, 23–26 June 2013, pp. 1197–1202. Los Alamitos: IEEE Computer Society Press.
21. Spangenberg R, Langner T, Adfeldt S, et al. Large scale semi-global matching on the CPU. In: Intelligent vehicles symposium proceedings, Dearborn, MI, USA, 8–11 June 2014, pp. 195–201. Los Alamitos: IEEE Computer Society Press.
22. Hirschmuller H. Stereo processing by semiglobal matching and mutual information. IEEE Trans Pattern Anal Mach Intell 2008; 30(2): 328–341.
23. Cremers D, Kohlberger T, and Schnörr C. Nonlinear shape statistics in Mumford–Shah based segmentation. In: European conference on computer vision, Copenhagen, Denmark, 28–31 May 2002, pp. 93–108. Berlin, Heidelberg: Springer.
24. Czech D and Orchard G. Evaluating noise filtering for event-based asynchronous change detection image sensors. In: IEEE international conference on biomedical robotics and biomechatronics, Singapore, 26–29 June 2016. Los Alamitos: IEEE Computer Society Press.
25. Rogister P, Benosman R, Ieng SH, et al. Asynchronous event-based binocular stereo matching. IEEE Trans Neural Netw Learn Syst 2012; 23(2): 347–353.
26. Serrano-Gotarredona T, Park J, Linares-Barranco A, et al. Improved contrast sensitivity DVS and its application to event-driven stereo vision. In: 2013 IEEE international symposium on circuits and systems (ISCAS), Beijing, China, 19–23 May 2013, pp. 2420–2423. Los Alamitos: IEEE Computer Society Press.
27. Kogler J, Eibensteiner F, Humenberger M, et al. Ground truth evaluation for event-based silicon retina stereo data. In: IEEE conference on computer vision and pattern recognition workshops, Portland, OR, USA, 23–28 June 2013, pp. 649–656. Los Alamitos: IEEE Computer Society Press.