Tuning Stereo Image Matching with Stereo Video Sequence Processing Andrew Speers and Michael Jenkin Department of Computer Science and Engineering, York University 4700 Keele Street Toronto, Ontario

{speers, jenkin}@cse.yorku.ca

ABSTRACT Algorithms for stereo video image processing typically assume that the various tasks (calibration, static stereo matching, and egomotion estimation) are independent black boxes. In particular, the task of computing disparity estimates is normally performed independently of ongoing egomotion and environmental recovery processes. Can information from these processes be exploited in the notoriously hard problem of disparity field estimation? Here we explore the use of feedback from the environmental model being constructed to the static stereopsis task. A prior estimate of the disparity field is used to seed the stereo matching process within a probabilistic framework. Experimental results on simulated and real data demonstrate the potential of the approach.

Categories and Subject Descriptors I.2.10 [Vision and Scene Understanding]: 3D/stereo scene analysis; I.4.8 [Scene Analysis]: Stereo

General Terms Algorithms, Design

Keywords stereo vision, environmental reconstruction, single hypothesis stereo SLAM

1. INTRODUCTION The generation of three-dimensional models or maps is of interest to many disciplines and has been an active area of research for many years. Intuitively, being able to preserve the three-dimensional structure of the world for later viewing within a virtual environment is the natural extension of the present-day photograph. Large-scale 3D environmental reconstructions find applications in a wide range of areas including autonomous robotics (e.g. [29]), forensics (e.g. [28]), mining (e.g. [20, 21]), geology (e.g. [15, 31]), archaeology (e.g. [26]), real estate (e.g. [10, 32]), and virtual

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HCCE ’12, March 8–13, 2012, Aizu-Wakamatsu, Fukushima, Japan. Copyright 2012 ACM 978-1-4503-1191-5 ...$10.00.

reality (e.g. [17]). Systems in these application domains attempt to capture the three-dimensional world we live in and then visualize and manipulate the resulting model. Current systems typically utilize some sort of RGB-depth sensor coupled with an egomotion estimation process to merge measurements into a large-scale model. Various laser- and structured/unstructured-light systems have been deployed (see [8] and [25] for representative examples). Although such systems can be effective, they require the emission of energy into the environment, which limits their application in certain environments and for certain surface types. An alternative approach to generating realistic 3D models of large-scale environments is to use a passive multi-view camera system to capture video within the environment and then use a simultaneous localization and mapping (SLAM) technique to merge the data into a coherent model. This is also known as stereo video reconstruction. The rise of consumer-grade 3D displays and capture devices makes this a perfect time to work in multi-view scene reconstruction, as the increased accessibility of 3D data and devices is expected to dramatically expand the uses for this type of technology, potentially in previously untapped fields. A key component of a stereo video algorithm is the stereo disparity estimation module. Traditionally, research in stereo vision has concentrated on binocular static (two-frame) disparity recovery (see [27] for a recent review of this field). Although static stereopsis has been exploited with some success in stereo video processing, other alternatives exist within a stereo mapping framework. Here we explore the potential of seeding the disparity estimation process using prior estimates of disparity. Such an approach is not applicable in static stereopsis algorithms, but can be used to focus the disparity search on likely disparities in the context of stereo video processing.

2. BACKGROUND Simultaneous localization and mapping algorithms have found wide application in autonomous systems. For long-range motions such systems typically utilize a multi-hypothesis / probabilistic SLAM framework. Such an approach provides a useful framework within which loop closing and uncertainty in matching can be placed. A simpler approach that is more suitable for shorter missions is a single-hypothesis framework. The single-hypothesis model is more computationally tractable, and results from the single-hypothesis model can be easily transferred to a multi-hypothesis approach. This single-hypothesis stereo

Figure 1: Basic single-hypothesis stereo SLAM approach.

video approach is taken by several groups [28, 18, 1, 12, 11] and at its most basic level the system is as depicted in Figure 1. At each time step t the left and right images are captured and a disparity map is produced. This disparity map is converted into 3D coordinates in some local coordinate system. An egomotion process is then used to transform the local 3D coordinates into a global reference frame. Data is then merged over time to construct an environmental model of a static scene. Different algorithms can be distinguished primarily through their use of different stereo matching and egomotion estimation strategies. The stereo matching module is a key component of any stereo video reconstruction algorithm. The vast majority of systems leverage a static (two frame) dense stereo algorithm. Static stereo algorithms typically fall into one of two broadly defined categories: local and global methods. Local (window-based) methods compute the disparity at a given point in the image based only on the intensity values of pixels within a finite window centered about that point. These approaches typically have built-in smoothness constraints due to the fact that they aggregate support over these windows. Typical matching cost / aggregation methods over these windows include sum-of-absolute-differences, sum-of-squared-differences (SSD) and normalized cross-correlation. Global methods instead formulate a global cost function, making explicit smoothness assumptions, and then solve for the disparity map as a global optimization problem. Continuation [3], simulated annealing [14, 22, 2], graph cuts [4], highest confidence first [5], belief propagation [30, 9] and mean-field annealing [13] are typical methods found in the literature. Given the embedded nature of the stereo algorithm within space-time processing, information from previous stereo matches can be used to seed the stereo matching task at the current time step.
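As a concrete illustration of the local (window-based) approach discussed above, a brute-force SSD block matcher might look like the following minimal sketch. This is our own illustration, not the authors' implementation; the function name and parameters are ours.

```python
import numpy as np

def ssd_disparity(left, right, max_disp, win=2):
    """Brute-force SSD block matching on rectified grayscale images.

    For each pixel, every disparity in [0, max_disp] is tested and the
    one whose (2*win+1)^2 window has the smallest sum of squared
    differences is kept. Returns an integer disparity map."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    best = np.full((h, w), np.inf)
    pad = win
    L = np.pad(left.astype(np.float64), pad, mode='edge')
    for d in range(max_disp + 1):
        # Shift the right image by d pixels so that a scene point with
        # disparity d lines up with its left-image position.
        R = np.pad(np.roll(right, d, axis=1).astype(np.float64),
                   pad, mode='edge')
        sq = (L - R) ** 2
        # Aggregate the squared differences over the matching window.
        cost = np.zeros((h, w))
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                cost += sq[pad + dy: pad + dy + h, pad + dx: pad + dx + w]
        better = cost < best
        disp[better] = d
        best[better] = cost[better]
    return disp
```

Note the built-in smoothing mentioned in the text: the window aggregation makes neighbouring pixels share most of their evidence, which implicitly favours locally smooth disparity maps.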
An early example of this type of approach [24] limited the search space for point correspondences using the results of the previous time step in order to speed up processing. Such an approach benefits from the assumption that the stereo frames are spaced closely enough in time that the relative motion between frames is very small (thus limiting the necessary search space to a small area if a good match is available at the previous time step). This paper generalizes the approach considered in [24] and examines how a prior estimate of the disparity field can be used to seed the disparity estimation process.

3. METHOD In order to seed the stereo algorithm, the general stereo SLAM framework seen in Figure 1 must be adapted to allow for feedback. Several options for providing a prior estimate of the disparity field exist in this SLAM model (see Figure 2). For example, the 3D point cloud generated at the previous time step could be used as feedback to the correspondence module (the bottom dashed feedback path in Figure 2), or the entire world model could be used as input to the correspondence module (the top dashed feedback path). Both methods require that the motion of the camera be estimated and the 3D world model be reprojected into the estimated camera view to generate a "predicted" disparity map for the current time step. To help predict the motion of the camera, the rotation and translation components of the camera's motion from the previous time step are used as feedback (the middle dashed feedback path). When incorporating temporal data into the correspondence problem there are two main issues to resolve: how to account for motion in the temporal propagation of data, and how to make use of the propagated data when determining correspondences. Early attempts at seeded stereo algorithms made little or no attempt to track the motion of the camera or scene when propagating disparity information through time. Such an approach quickly leads to errors should any significant motion be present [24, 7]. Several attempts have been made to integrate motion estimates into the process, including 2D optical flow [6, 19] and 3D disparity flow maps [16], both with some success. Systems that have access to other sensing modalities (such as an IMU or encoder feedback on a mobile robot) will have other methods for predicting camera motion as well.
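One simple way to predict camera motion between frames, under the static-scene assumption, is a constant-velocity model on SE(3): assume the camera repeats its last inter-frame motion. The sketch below is our own illustration (names and pose convention are ours), not the paper's implementation.

```python
import numpy as np

def predict_pose(T_prev, T_curr):
    """Constant-velocity pose prediction on SE(3).

    T_prev and T_curr are 4x4 homogeneous camera poses at the previous
    and current time steps. The last relative motion (from T_prev to
    T_curr) is applied once more to extrapolate the next pose."""
    delta = np.linalg.inv(T_prev) @ T_curr  # last inter-frame motion
    return T_curr @ delta                   # repeat it for the next frame
```

The predicted pose can then be used to reproject the prior 3D model into the expected camera view, yielding the "predicted" disparity map described above; a Kalman filter over the pose and its derivatives would play the same role with explicit uncertainty handling.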
Given the explicit representation of camera motion within a stereo vision SLAM algorithm, at the very least estimates of the camera's position, orientation and their first derivatives are available to the system. Such estimates can be used to predict camera motion over the current time step, and standard estimation processes (e.g., Kalman filtering [23]) can be used to predict the disparity field assuming a static environment and Newtonian motion. Several options are available for making use of the propagated disparity information to seed the stereo matching process.

Figure 2: Proposed stereo SLAM approach. L(t) and R(t): left and right images at time t; 3D(t): depth map in local 3D co-ordinates; A(t), T(t): rotation and translation components of the egomotion estimate.

One approach would be to only search in a small region around the expected correspondence position. Limiting stereo matches only to regions "near" the expected disparity given the world and plant models is likely to lead to simple failures of the system when neither model is appropriate. A more robust approach involves using the prior information to seed the disparity search process while still admitting disparities over a wide range. Lacking any prior information, the process of searching the potential disparity space is uninformed and every disparity (over some range defined by the viewing geometry and sensor limits) is equally likely. An alternative way of thinking of this is that the utility of sampling each potential disparity is equal. Let d be a potential disparity value that must be searched (tested) by the stereo matching algorithm. Lacking any prior disparity information, p(d) = c, and given limited computational resources, such a stereo algorithm must treat the problem of which disparities to test as one of sampling the space uniformly; most algorithms implement this by choosing samples on some sampling grid. This sampling can be informed by the known (or assumed) bandwidth of the disparity signal, and it is common for stereo algorithms to seek sub-sample accuracy through an interpolation process operating on the local samples near the assumed true disparity value. An alternative approach to sampling the disparity surface would be to draw k samples randomly from the set of potential disparities according to a prior disparity probability function.
Again, this sampling process can be informed by the assumed bandwidth of the disparity signal, and it permits the use of interpolative mechanisms to refine initial disparity estimates. Within a probabilistic sampling framework, prior disparity estimates can be used to shape the sampling pdf so that it incorporates both the need to sample over the potential disparity range and the desire to sample where we predict structure to exist. Given a prior pdf for the true disparity value (µ) and a constant pdf that represents the need to sample all of the disparity space, how can these two requirements be combined into a single sampling pdf? Ideal characteristics of this pdf are that it has high values around the predicted value, to bias for the predicted surface disparity, and that all disparity values have some (non-zero) probability of being checked. Many such combination processes are possible. Perhaps the most straightforward is one based on a mixture of two simple distributions, one capturing the need to sample everywhere and the other representing the predicted disparity given the estimated system state. Such a representation has the advantage of an easily represented cumulative distribution function (cdf), which will prove useful when choosing samples from the pdf.

$$
P(\mathrm{disparity} = d) =
\underbrace{k_1\,\frac{1}{\sqrt{2\pi\sigma_1^2}}\; e^{-\frac{(d-\mu)^2}{2\sigma_1^2}}}_{\text{``Predicted Value'' component}}
\;+\;
\underbrace{k_2\,\frac{1}{\sqrt{2\pi\sigma_2^2}}\; e^{-\frac{d^2}{2\sigma_2^2}}}_{\text{``Baseline Probability'' component}}
$$

The pdf is the sum of two Gaussian distributions (σi > 0, ki > 0, k1 + k2 = 1): a "predicted value" component, which biases the pdf toward the disparity predicted as a result of the propagated disparity map, and a "baseline probability" component, which dictates the (nearly) constant probability of checking any given disparity value and allows some chance of recovering from a poorly predicted value. For the predicted value component, µ identifies the predicted disparity and σ1 captures the uncertainty in this estimate. For the baseline component, a wide-tailed Gaussian (σ2 ≫ 0) is used to model the constant component. Although this function is only approximately constant, it allows us to add a baseline probability across the entire disparity search range while maintaining the identity of a probability density function f, namely ∫_{-∞}^{+∞} f(x) dx = 1. Over a finite disparity domain, this Gaussian is well approximated by a non-zero constant. The two constants k1 and k2 maintain the unit area of the pdf while allowing its two distinct components (a biasing function centred around the predicted value and a near-constant component giving every disparity a small probability of being sampled) to be weighted against each other. Various possibilities exist for estimating the parameters (µ, σ1, and k1), but


Figure 3: The ground truth disparity (a) as well as generated disparity maps using both a two-frame SSD block matching algorithm (b) and a seeded version of the same algorithm (σ1 = 5, k1 = 0.9) (c) are compared.

one straightforward approach is to set k1 proportional to the previous match strength, µ to the estimated disparity value, and σ1 to the estimated disparity variance. Given this form of the pdf, sampling in disparity space is then achieved using an inverse cumulative distribution function (cdf) method.
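The two-component sampling pdf and the inverse-cdf draw can be sketched as follows. This is a minimal illustration under our own assumptions (a quarter-pixel disparity grid, a numerically inverted cdf, and default parameter values of our choosing); it is not the authors' code.

```python
import numpy as np

def mixture_pdf(d, mu, sigma1, sigma2, k1):
    """Two-Gaussian sampling pdf: a 'predicted value' component centred
    on the predicted disparity mu, plus a wide 'baseline' component
    (sigma2 >> sigma1) that keeps every disparity reachable."""
    g1 = np.exp(-(d - mu) ** 2 / (2 * sigma1 ** 2)) / np.sqrt(2 * np.pi * sigma1 ** 2)
    g2 = np.exp(-d ** 2 / (2 * sigma2 ** 2)) / np.sqrt(2 * np.pi * sigma2 ** 2)
    return k1 * g1 + (1 - k1) * g2

def sample_disparities(mu, n, d_max, sigma1=5.0, sigma2=50.0, k1=0.9, rng=None):
    """Draw n disparity samples by the inverse-cdf method on a grid
    covering [0, d_max] at quarter-pixel resolution."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.arange(0.0, d_max + 0.25, 0.25)
    p = mixture_pdf(grid, mu, sigma1, sigma2, k1)
    cdf = np.cumsum(p)
    cdf /= cdf[-1]                        # renormalize over the finite range
    u = rng.random(n)
    return grid[np.searchsorted(cdf, u)]  # invert the cdf numerically
```

One caveat of this sketch: truncating and renormalizing over the finite range slightly changes the effective component weights, so sigma2 should be chosen wide relative to the disparity range but not so wide that the baseline mass vanishes.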

3.1 Basic algorithm We assume that the ongoing process of stereo-video estimation can produce, for each image coordinate in the current frame, an estimate of the disparity (µ) along with its certainty, and that disparities must lie in the range (0, w). For each image location, n samples are then drawn from the predicted cdf. Samples are drawn without duplicates, with two samples identified as duplicates if their disparity sampling ranges overlap. The chosen dense stereo matching algorithm is then run at these selected disparities, with the peak response taken as the recovered disparity for this image location. (Note that interpolation of the resulting disparity surface is possible should further refinement of the disparity signal be desired.) For the preliminary results presented here, subpixel disparities are limited to 1/4 of a pixel and duplicate samples were identified only if the samples were identical.
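The basic algorithm above can be sketched end to end as follows. This is our own self-contained illustration under simplifying assumptions (integer disparities only, a discrete sampling pdf built from the predicted-value Gaussian plus a small constant baseline, and a per-pixel windowed SSD cost); it is not the authors' implementation.

```python
import numpy as np

def seeded_match(left, right, prior_mu, n=10, d_max=24,
                 sigma1=5.0, k1=0.9, win=2, rng=None):
    """Seeded stereo matching sketch: at each pixel, draw n candidate
    disparities from a pdf centred on the predicted disparity
    prior_mu[y, x], then keep the candidate with the lowest windowed
    SSD cost."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = left.shape
    cand = np.arange(d_max + 1)
    disp = np.zeros((h, w), dtype=np.int32)
    pad = win
    L = np.pad(left.astype(float), pad, mode='edge')
    R = np.pad(right.astype(float), pad, mode='edge')
    for y in range(h):
        for x in range(w):
            # Sampling pdf: predicted-value Gaussian plus a constant
            # baseline so every disparity keeps non-zero probability.
            p = k1 * np.exp(-(cand - prior_mu[y, x]) ** 2 / (2 * sigma1 ** 2)) \
                + (1 - k1) / len(cand)
            p /= p.sum()
            tested = np.unique(rng.choice(cand, size=n, p=p))  # no duplicates
            best_d, best_c = 0, np.inf
            for d in tested:
                xs = max(x - d, 0) + pad  # clamp at the left image border
                patchL = L[y: y + 2 * pad + 1, x + pad - win: x + pad + win + 1]
                patchR = R[y: y + 2 * pad + 1, xs - win: xs + win + 1]
                c = np.sum((patchL - patchR) ** 2)
                if c < best_c:
                    best_d, best_c = d, c
            disp[y, x] = best_d
    return disp
```

Because only n of the d_max + 1 disparities are evaluated per pixel, the cost of matching drops roughly in proportion to n / (d_max + 1) when the prior is good, which is the efficiency argument made in the text.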

4. RESULTS 4.1 Wedding cake stereogram Multi-layered "wedding cake" random dot stereograms are common test stimuli for static stereopsis algorithms. The wedding cake random dot stereogram used here has depth planes at 6, 12 and 24 pixels of disparity, shown in Figure 3(a). The basic SSD algorithm was required to search over the disparity range from 0 to 24 pixels. As expected, the SSD algorithm performs well, although a number of false matches are detected due to the range of valid disparities (Figure 3(b)). Figure 3(c) shows the results from the seeded algorithm. The algorithm was seeded with the results from the basic SSD algorithm, with µ given by the results of the SSD algorithm, k1 = 0.9 and σ1 = 5. As anticipated, the seeded algorithm is more tuned to the underlying structure and thus fewer outliers are reported. Figure 5 shows the set of disparity sample points used by the algorithm, which illustrates

Figure 4: Error distribution for the wedding cake example, showing the number of pixels with an error greater than or equal to a given error distance (in pixels).

the clear tuning of the disparity samples to the structure. Figure 4 plots the error distribution for the two approaches. The tuned nature of the sampling leads to a reduction in outliers in the recovered disparities.
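A stimulus like the one used above can be synthesized along the following lines. This is our own illustrative generator (function name, layer geometry, and the naive occlusion handling are our assumptions), producing nested square layers at the disparities reported in the experiment.

```python
import numpy as np

def wedding_cake_pair(size=128, disparities=(6, 12, 24), rng=None):
    """Generate a 'wedding cake' random dot stereogram: nested square
    layers at 6, 12 and 24 pixels of disparity over a zero-disparity
    background. Returns (left, right, ground_truth_disparity)."""
    rng = np.random.default_rng() if rng is None else rng
    left = rng.integers(0, 2, (size, size)).astype(float)
    gt = np.zeros((size, size), dtype=int)
    for i, d in enumerate(disparities, start=1):
        m = size * i // 8                  # each nested layer is smaller
        gt[m:size - m, m:size - m] = d
    right = np.zeros_like(left)
    for y in range(size):
        for x in range(size):
            # A point with disparity d at column x in the left image
            # appears at column x - d in the right image. Occluded or
            # unprojected pixels are naively left at zero here.
            xr = x - gt[y, x]
            if 0 <= xr < size:
                right[y, xr] = left[y, x]
    return left, right, gt
```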

4.2 Underwater stereo sequence Figure 6 shows the results of underwater stereo reconstruction using both the SSD and the sampling-based approach. Here the sampling-based approach was seeded with results from the previous time step under a zero-motion model. As with the previous experiment, k1 = 0.9 and σ1 = 5. Figure 7 shows the reconstructed 3D point cloud computed from the disparity map. Figures 7(a) and (b) show the 3D reconstruction with 60 disparity samples, while Figures 7(c) and (d) show results with 10 disparity samples. The seeding process leads to a denser disparity map in the 60-sample case and a visibly superior result in the 10-sample case. The ability to tune the disparity search process to regions "near" known disparity values can be particularly useful when the disparity field is more coarsely sampled.


Figure 5: Sampling methods depicted in 3D based on uniform (a) and seeded (b) sampling methods on the wedding cake stereogram example seen in Figure 3. These figures depict sampled disparities (height) for each (u, v) image coordinate.


Figure 6: Disparity maps of the coral scene. Figure (a) depicts the left rectified camera view of the scene. Figures (b) and (c) depict the unseeded and seeded results, respectively.


Figure 7: 3D reconstruction of the coral scene. Figures (a) and (c) are unseeded results from an SSD algorithm, while (b) and (d) are seeded from the unseeded results of the previous frame. Figures (a) and (b) show results using 60 samples, while (c) and (d) use only 10 samples.

5. SUMMARY AND FUTURE WORK Stereo video image processing is inherently a feedback process. Current systems typically treat each of the sub-tasks within the process as black boxes and limit the feedback to egomotion estimation. Here we have looked at how information from this feedback process can be used by a (traditionally) static task: stereo disparity estimation. The approach tunes the stereo search process to likely disparity locations while still sampling over the full range of valid disparities. Experimental results illustrate the efficacy of the approach. Ongoing work includes embedding the seeded approach within a full stereo video reconstruction algorithm. This system is currently being evaluated on underwater stereo video streams.

6. ACKNOWLEDGEMENTS The financial support of NSERC Canada is gratefully acknowledged.

7. REFERENCES

[1] A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, H. Towles, D. Nister, and M. Pollefeys. Towards Urban 3D Reconstruction from Video. In Third International Symposium on 3D Data Processing, Visualization, and Transmission, pages 1–8, 2006.
[2] S. T. Barnard. Stochastic stereo matching over scale. Int. J. Computer Vision, 3:17–32, May 1989.
[3] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, Cambridge, MA, USA, 1987.
[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE PAMI, 23:1222–1239, 2001.
[5] P. B. Chou and C. M. Brown. The theory and practice of Bayesian image labeling. Int. J. Computer Vision, 4:185–210, 1990.
[6] S. Crossley, A. J. Lacey, N. A. Thacker, and N. L. Seed. Robust Stereo via Temporal Consistency. In Proceedings of the British Machine Vision Conference (BMVC), pages 659–668, Essex, UK, 1997.
[7] J. Davis, D. Nehab, R. Ramamoorthi, and S. Rusinkiewicz. Spacetime stereo: a unifying framework for depth from triangulation. IEEE PAMI, 27:296–302, 2005.
[8] S. El-Hakim, J.-A. Beraldin, M. Picard, and G. Godin. Detailed 3D reconstruction of large-scale heritage sites with integrated techniques. IEEE Computer Graphics and Applications, 24:21–29, May–June 2004.
[9] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 261–268, Washington, DC, 2004.
[10] W. Förstner. 3D-City Models: Automatic and Semiautomatic Acquisition Methods. In Photogrammetric Week '99, pages 291–303, Stuttgart, Germany, 1999.
[11] P. Furgale, T. D. Barfoot, N. Ghafoor, K. Williams, and G. Osinski. Field Testing of an Integrated Surface/Subsurface Modeling Technique for Planetary Exploration. Int. J. Robotics Research, 29:1529–1549, 2010.
[12] M. Garcia and A. Solanas. 3D simultaneous localization and modeling from stereo vision. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 847–853, New Orleans, Jan. 2004.
[13] D. Geiger and F. Girosi. Parallel and deterministic algorithms from MRFs: Surface reconstruction and integration. In European Conference on Computer Vision (ECCV), pages 89–98, Antibes, France, 1990.
[14] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. J. of Applied Statistics, 20:25–62, 1993.
[15] J. Gong, P. Cheng, and Y. Wang. Three-dimensional modeling and application in geological exploration engineering. Computers & Geosciences, 30:391–404, 2004.
[16] M. Gong. Enforcing temporal consistency in real-time stereo estimation. In European Conference on Computer Vision (ECCV), pages 564–577, Graz, Austria, 2006.
[17] A. Hogue, S. Gill, and M. Jenkin. Automated avatar creation for 3D games. In Proceedings of the 2007 Conference on Future Play, pages 174–180, Toronto, Canada, 2007.
[18] P. Jasiobedzki, S. Se, M. Bondy, and R. Jakola. Underwater 3D mapping and pose estimation for ROV operations. In Conference on Oceans, Poles and Climate: Technological Challenges (OCEANS), pages 1–6, Quebec City, Quebec, 2008.
[19] E. S. Larsen, P. Mordohai, M. Pollefeys, and H. Fuchs. Temporally consistent reconstruction from multiple video streams using enhanced belief propagation. In IEEE International Conference on Computer Vision (ICCV), pages 1–8, Rio de Janeiro, Brazil, 2007.
[20] M. Magnusson, R. Elsrud, L.-E. Sakagerlund, and T. Duckett. 3D modelling for underground mining vehicles. In Proceedings of the Conference on Modeling and Simulation for Public Safety (SimSafe '05), Linköping, Sweden, 2005.
[21] M. Magnusson, A. J. Lilienthal, and T. Duckett. Scan Registration for Autonomous Mining Vehicles Using 3D-NDT. J. Field Robotics, 24:803–827, 2007.
[22] J. Marroquin, S. Mitter, and T. Poggio. Probabilistic Solution of Ill-Posed Problems in Computational Vision. J. of the American Statistical Association, 82:76–89, Mar. 1987.
[23] L. Matthies, T. Kanade, and R. Szeliski. Kalman filter-based algorithms for estimating depth from image sequences. Int. J. Computer Vision, 3:209–238, 1989.
[24] R. Nevatia. Depth measurement by motion stereo. Computer Graphics and Image Processing, 5:203–214, 1976.
[25] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 127–136, 2011.
[26] M. Pollefeys, L. van Gool, I. Akkermans, D. de Becker, and K. Demuynck. A Guided Tour to Virtual Sagalassos. In 2001 Conference on Virtual Reality, Archeology, and Cultural Heritage, pages 213–218, Glyfada, Greece, 2001.
[27] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Computer Vision, 47:7–42, 2002.
[28] S. Se and P. Jasiobedzki. Instant Scene Modeler for Crime Scene Reconstruction. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 123–131, Washington, DC, USA, 2005.
[29] S. Se, H.-K. Ng, P. Jasiobedzki, and T.-J. Moyung. Vision based modeling and localization for planetary exploration rovers. In Proc. of the 55th International Astronautical Congress, 2004.
[30] J. Sun, N. Zheng, and H. Shum. Stereo matching using belief propagation. IEEE PAMI, 25:787–800, 2003.
[31] Q. Wu, H. Xu, and X. Zou. An effective method for 3D geological modeling with multi-source data integration. Computers & Geosciences, 31:35–43, 2005.
[32] Y. Zhang, Z. Zhang, J. Zhang, and J. Wu. 3D Building Modelling with Digital Map, Lidar Data and Video Image Sequences. The Photogrammetric Record, 20:285–302, 2005.