Pushing the Limits of Stereo Using Variational Stereo Estimation

René Ranftl¹, Stefan Gehrig², Thomas Pock¹ and Horst Bischof¹

¹ Graz University of Technology, {ranftl,pock,bischof}@icg.tugraz.at
² Daimler Research, Boeblingen, [email protected]

Abstract— We examine high accuracy stereo estimation for binocular sequences that were obtained from a mobile platform. The ultimate goal is to improve the range of stereo systems without altering the setup. Based on a well-known variational optical flow model, we introduce a novel stereo model that features a second-order regularization, which allows for both sub-pixel accurate solutions and piecewise planar disparity maps. The model incorporates a robust fidelity term to account for adverse illumination conditions that frequently arise in real-world scenes. Using several sequences that were taken from a mobile platform, we show the robustness and accuracy of the proposed model.

I. INTRODUCTION

The estimation and measurement of distances is a fundamental task for environmental perception. Binocular camera setups offer such estimates in a dense and accurate way using techniques from Computer Vision. In this paper we propose a robust global stereo matching model. The model, motivated by optical flow models, features a continuous disparity range, intrinsically leading to sub-pixel accurate solutions. Moreover, we employ a robust similarity measure, which leads to an extremely robust stereo estimation procedure under the adverse conditions that frequently arise in automotive applications of vision systems.

The stereo matching problem can be stated as follows: Given a rectified stereo image pair $I_L, I_R : \Omega \subset \mathbb{R}^2 \to \mathbb{R}$, a stereo matching algorithm aims to determine the dense disparity map $u : \Omega \subset \mathbb{R}^2 \to \mathbb{R}$, such that $I_R(x + u, y) = I_L(x, y)$. Note that the images as well as the disparity map are defined on a spatially continuous domain. The treatment of images as continuous intensity functions forms the basis of the variational approach to Computer Vision tasks. This formulation postpones the discretization of the model (and images) to a relatively late stage in the algorithm design and avoids metrication errors. Its most important property with respect to the stereo matching problem is the natural inclusion of sub-pixel accurate solutions, as opposed to classical discrete formulations, where sub-pixel accuracy is typically achieved using an ad-hoc modification of an existing algorithm [1]. These methods often suffer from intrinsic inaccuracies due to the pixel-locking effect [2].

In the context of driving assistance systems, sub-pixel accuracy is a highly desirable property, because it allows for long-range stereo estimation, which is especially relevant in high-speed scenarios. The maximal depth accuracy of any binocular stereo vision system is determined by three factors: the focal length of the cameras, the baseline between the cameras, and the sub-pixel accuracy of the stereo estimation procedure [3]. Since the physical setup of the employed stereo system is regarded as fixed (i.e. neither the baseline nor the focal length of the system is changed), the only possible improvement for accurate far-field depth estimation lies in improving the sub-pixel accuracy of the employed stereo matching algorithm. Fig. 1 shows a preliminary result of the proposed algorithm.

Fig. 1: An exemplary result from the proposed algorithm. (a) Reference image. (b) Calculated disparity map. Note that the algorithm is able to reconstruct smooth surfaces like the road.

The remainder of this paper is organized as follows: In Section II we formulate stereo matching as a variational problem and introduce a novel anisotropic regularizer that is particularly well suited for the task of stereo estimation. We further introduce a robust fidelity term based on the Census transform and show how the proposed model can be optimized using an efficient algorithm. In Section III experiments on real-world data sets are presented and compared to other popular stereo matching algorithms. Finally, Section IV summarizes our findings.

II. VARIATIONAL STEREO MATCHING

Starting with the seminal work by Rudin, Osher and Fatemi [4], variational models have enjoyed tremendous success in a variety of image processing tasks like denoising, deblurring and segmentation. A notable example is the estimation of optical flow [5, 6], which is closely related to the stereo matching problem. Stereo matching can be formulated as the following general variational problem:

$$\min_u \; R(u) + G(u; I_L, I_R). \quad (1)$$

The function $G(u; I_L, I_R)$ measures data fidelity, i.e. the difference of the warped image $I_L$ to the image $I_R$, given the disparity map $u$, under some metric. Popular examples for the fidelity term are the quadratic difference and the absolute value of the difference of the two images. Stereo matching is an ill-posed problem, so the fidelity term alone does not suffice to reliably infer a disparity map. Therefore the energy is regularized using an additional term $R(u)$, which captures prior knowledge of the problem at hand. One of the most popular regularizers is the Total Variation semi-norm $\mathrm{TV}(u) = \int_\Omega |\nabla u| \, dx$. Both terms strongly influence the quality of the estimated disparity maps. While the data term should be robust with respect to common effects in real-world imagery, like outliers from occlusions, illumination differences and blur, the regularization term constrains the form of admissible solutions of the model. It is therefore crucial to model both terms carefully.

A. Total Generalized Variation

Total Generalized Variation (TGV) [6] is a generalization of the popular Total Variation (TV) regularizer. Models that rely on TV favor piecewise constant solutions, leading to the formation of fronto-parallel artifacts in the estimated disparity maps. In contrast, solutions of models that are regularized using TGV can be composed of piecewise polynomials of arbitrary order (only constrained by the predefined order of the regularizer). Generally, the TGV regularizer of order $k$ can be defined as follows:

$$\mathrm{TGV}_\alpha^k(u) = \sup \Big\{ \int_\Omega u \, \mathrm{div}^k v \, dx \;\Big|\; v \in C_c^k(\Omega, \mathbb{R}^d),\ \|\mathrm{div}^l v\|_\infty \le \alpha_l,\ l = 0, \dots, k-1 \Big\}, \quad (2)$$

where $C_c^k(\Omega, \mathbb{R}^d)$ denotes the space of tensors of order $k$ and $\alpha_l$ are user-defined scalars. Note that the definition of TGV resembles the dual definition of TV. In fact, $\mathrm{TGV}_\alpha^1(u)$ directly reduces to the dual definition of the Total Variation norm:

$$\mathrm{TGV}_\alpha^1(u) = \sup \Big\{ \int_\Omega u \, \mathrm{div}\, v \, dx \;\Big|\; \|v\|_\infty \le \alpha \Big\} = \alpha \, \mathrm{TV}(u).$$

Generally, a model that is regularized with TGV of order $k$ will favor solutions that are piecewise composed of polynomials of order $k - 1$. As the focus lies on the stereo reconstruction of street scenes, it is convenient to employ TGV of order $k = 2$, yielding piecewise affine disparity maps:

$$\mathrm{TGV}_\alpha^2(u) = \sup \Big\{ \int_\Omega u \, \mathrm{div}^2 v \, dx \;\Big|\; v \in C_c^2(\Omega, \mathbb{R}^2),\ \|\mathrm{div}\, v\|_\infty \le \alpha_1,\ \|v\|_\infty \le \alpha_0 \Big\}. \quad (3)$$

Note that (3) itself poses an optimization problem, featuring a complex feasible set, and can not be computed efficiently. The corresponding primal definition, however, is unconstrained and can be solved efficiently:

$$\mathrm{TGV}_\alpha^2(u) = \min_{w} \Big\{ \alpha_1 \int_\Omega |\nabla u - w| \, dx + \alpha_0 \int_\Omega |\nabla w| \, dx \Big\}.$$
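To make the primal definition concrete, the following NumPy sketch (our own simplified discretization with forward differences, not the exact discretization used in our solver) evaluates the discrete TGV² energy and illustrates that an affine disparity ramp incurs essentially no cost:

```python
import numpy as np

def tgv2_energy(u, w, alpha0=5.0, alpha1=1.0):
    """Discrete TGV^2 energy of a disparity map u given an auxiliary field w.

    u: (H, W) disparity map; w: (H, W, 2) auxiliary vector field.
    Forward differences with replicated boundary (a simplification).
    """
    ux = np.diff(u, axis=1, append=u[:, -1:])
    uy = np.diff(u, axis=0, append=u[-1:, :])
    grad_u = np.stack([ux, uy], axis=-1)

    # first term: alpha1 * sum of point-wise |grad(u) - w|
    term1 = alpha1 * np.linalg.norm(grad_u - w, axis=-1).sum()

    # second term: alpha0 * sum of point-wise |grad(w)| over all four derivatives
    wx = np.diff(w, axis=1, append=w[:, -1:, :])
    wy = np.diff(w, axis=0, append=w[-1:, :, :])
    grad_w = np.concatenate([wx, wy], axis=-1)
    term2 = alpha0 * np.linalg.norm(grad_w, axis=-1).sum()
    return term1 + term2

# An affine ramp: choosing w = grad(u) zeroes the first term, so only
# boundary effects of the finite differences remain.
H, W = 32, 32
u = np.tile(np.linspace(0.0, 10.0, W), (H, 1))
ux = np.diff(u, axis=1, append=u[:, -1:]); uy = np.diff(u, axis=0, append=u[-1:, :])
w = np.stack([ux, uy], axis=-1)
print(tgv2_energy(u, w))  # small; nonzero only from the boundary handling
```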





From the primal definition we can intuitively deduce how TGV regularization acts in a model: Before the variation of the image $u$ is measured, a vector field $w$, which itself is forced to have low variation, is subtracted from the gradient. Affine functions lead to a constant field $w$, and hence to a combined cost of zero under TGV, which explains the tendency to produce piecewise affine solutions in models employing this type of regularization.

TGV in its standard form is completely oblivious to the actual image content. The prior enforces piecewise affine disparity maps; the structural contents of those disparity maps (i.e. depth discontinuities), however, are completely determined by the fidelity term. We propose a novel modification of the TGV regularizer, such that the prior also has knowledge about the image content. Depth discontinuities are more likely to arise at edges in the input image. We can take this prior knowledge into account to construct a better regularization, which enforces a low degree of smoothness around edges and more smoothness in homogeneous regions. Such a behavior can be enforced by incorporating an anisotropic weighting of the image gradients. Based on the well-known Nagel-Enkelmann operator [7], we define the image-driven TGV regularization (ITGV) by introducing the diffusion tensor $D^{\frac{1}{2}}$:

$$\mathrm{ITGV}_\alpha^2(u) = \min_{w} \Big\{ \alpha_1 \int_\Omega |D^{\frac{1}{2}} \nabla u - w| \, dx + \alpha_0 \int_\Omega |\nabla w| \, dx \Big\}. \quad (4)$$

Werlberger et al. [8] proposed a similar extension to a Huber-regularized optical flow model. We adopt their choice of the anisotropic diffusion tensor, which is given by

$$D^{\frac{1}{2}} = \exp(-\gamma |\nabla I_L|^{\beta})\, n n^T + n^{\perp} (n^{\perp})^T, \quad (5)$$

where $n = \frac{\nabla I_L}{|\nabla I_L|}$ is the direction of the image gradient, $n^{\perp}$ is a vector perpendicular to the image gradient, and $\gamma, \beta$ are user-chosen parameters.
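For illustration, the tensor in (5) can be assembled per pixel from the image gradient. The following sketch is a straightforward NumPy transcription under our own naming; the small constant that guards the normalization in flat regions is our addition:

```python
import numpy as np

def diffusion_tensor(img, gamma=5.0, beta=1.0, eps=1e-8):
    """Per-pixel anisotropic diffusion tensor D^(1/2) from Eq. (5).

    img: (H, W) gray-scale image. Returns a (H, W, 2, 2) tensor field.
    """
    gy, gx = np.gradient(img)                    # image gradient
    mag = np.sqrt(gx**2 + gy**2)
    nx, ny = gx / (mag + eps), gy / (mag + eps)  # gradient direction n
    px, py = -ny, nx                             # perpendicular direction n_perp

    weight = np.exp(-gamma * mag**beta)          # edge weight exp(-gamma*|grad I|^beta)

    D = np.empty(img.shape + (2, 2))
    # weight * n n^T + n_perp n_perp^T, written out component-wise
    D[..., 0, 0] = weight * nx * nx + px * px
    D[..., 0, 1] = weight * nx * ny + px * py
    D[..., 1, 0] = D[..., 0, 1]
    D[..., 1, 1] = weight * ny * ny + py * py
    return D
```

Along strong edges the weight shrinks the component of $\nabla u$ across the edge, so the regularizer penalizes disparity variation across edges less than variation in homogeneous regions.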

B. Fidelity Term

Popular fidelity terms are the squared Euclidean norm and the L1 norm between the images. Their popularity mainly stems from the relative simplicity of the resulting optimization procedure. The squared Euclidean norm is differentiable everywhere, which allows standard principles from optimization to be applied, but it is not robust to outliers that arise from occlusions. With the relatively recent advent of powerful non-smooth optimization algorithms, energies which incorporate an L1 fidelity term can also be optimized efficiently. While such a fidelity term provides some robustness against outliers from occlusions, it is not able to handle varying illumination or radiometric differences between cameras. It therefore performs poorly in many real-world scenarios. Possible remedies for this problem include appropriate preprocessing of the input images [9] or explicitly correcting the fidelity term for illumination changes [10]. We instead use a fidelity term which is itself invariant with respect to illumination.

The Census transform [11] is well-known for its illumination-invariant properties and is thus frequently used in local stereo algorithms. For global continuous models, Müller et al. [12] recently showed its usefulness in the context of optical flow estimation. We adopt a similar approach here: The data fidelity is measured by first computing the Ternary Census transform of the input images and subsequently comparing them using the pixel-wise Hamming distance. Let us define

$$s(I; p, q) = \begin{cases} 0 & \text{if } I(q) - I(p) < -\epsilon \\ 1 & \text{if } |I(q) - I(p)| \le \epsilon \\ 2 & \text{if } I(q) - I(p) > \epsilon \end{cases} \quad (6)$$

then the Ternary Census signature of a pixel $x$ is given by $C(I; x) = \bigotimes_{y \in N(x),\, y \ne x} \{ s(I; x, y) \}$. Note that $\bigotimes$ denotes concatenation and $N(x)$ is some neighborhood around $x$ (typically a rectangular window). By applying the Census transform to an image, one obtains a ternary string of length $\dim(N) - 1$ for each pixel that encodes the structure of its local neighborhood in an illumination-invariant way. To measure similarity between two Census-transformed images, the Hamming distance [13] is commonly used. The Hamming distance between two strings $p$ and $q$ is given by the number of differing symbols:

$$\Delta(p, q) = \sum_{p_i \ne q_i} 1. \quad (7)$$
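For concreteness, the ternary Census signature (6) and the Hamming cost (7) can be sketched as follows; the window handling and function names are our own simplifications:

```python
import numpy as np

def ternary_census(img, radius=2, eps=0.0045):
    """Ternary Census signature (Eq. 6) for each pixel.

    Returns an (H, W, K) array of symbols in {0, 1, 2}, where K is the
    number of neighbors in the (2*radius+1)^2 window minus the center.
    """
    H, W = img.shape
    pad = np.pad(img, radius, mode='edge')
    sigs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # skip the center pixel
            nbr = pad[radius + dy: radius + dy + H, radius + dx: radius + dx + W]
            d = nbr - img  # I(q) - I(p)
            sigs.append(np.where(d < -eps, 0, np.where(d > eps, 2, 1)))
    return np.stack(sigs, axis=-1)

def hamming_cost(census_a, census_b):
    """Pixel-wise Hamming distance (Eq. 7) between two census images."""
    return (census_a != census_b).sum(axis=-1)
```

With radius=2 this corresponds to the 5x5 Census window used in our .enpeda. experiments.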

Using this metric, the difference between two pixels can be written as

$$\rho(u, x) = \Delta\big(C(I_L; x + u), C(I_R; x)\big). \quad (8)$$

(8) is non-convex in the argument $u$, which renders the optimization involving this term intractable. To obtain a tractable metric, we linearize (8) around some initial displacement $u_0$:

$$\rho(u, x) \approx \hat\rho(u, x) = \rho(u_0, x) + (u - u_0)\, \frac{\partial \rho(u_0, x)}{\partial x}. \quad (9)$$

This linearization is only valid for small displacements around $u_0$. The complete optimization therefore needs to be embedded into a coarse-to-fine warping framework [14]. Based on ITGV regularization, we propose to solve the following model for stereo estimation:

$$\min_{u, w} \Big\{ \alpha_1 \int_\Omega |D^{\frac{1}{2}} \nabla u - w| \, dx + \alpha_0 \int_\Omega |\nabla w| \, dx + \lambda \int_\Omega |\hat\rho(u, x)| \, dx \Big\}. \quad (10)$$
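In a discrete setting, the quantities needed for the linearization (9) can be obtained, for example, by evaluating the Census cost at $u_0$ and approximating its derivative with a central difference along the disparity direction. The sketch below reuses the illustrative helpers from above; note that, for simplicity, it warps the image before computing the signatures, whereas one may equally warp precomputed signatures:

```python
import numpy as np

def linearized_cost_terms(IL, IR, u0, eps=0.0045):
    """Return rho(u0, x) and its derivative rho_x for the linearization (9).

    IL is warped towards IR using the current disparity estimate u0;
    rho_x is approximated by a central difference over +/- 1 disparity.
    """
    H, W = IL.shape
    xs = np.tile(np.arange(W, dtype=np.float64), (H, 1))
    cR = ternary_census(IR, eps=eps)

    def cost_at(offset):
        # warp IL by u0 + offset (nearest-neighbor for simplicity)
        x_src = np.clip(np.rint(xs + u0 + offset).astype(int), 0, W - 1)
        warped = IL[np.arange(H)[:, None], x_src]
        return hamming_cost(ternary_census(warped, eps=eps), cR)

    rho0 = cost_at(0.0)
    rho_x = (cost_at(1.0) - cost_at(-1.0)) / 2.0  # central difference
    return rho0, rho_x
```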



Before we proceed to the derivation of an optimization algorithm to solve this model, let us briefly summarize its properties: The model is convex, which allows us to employ efficient local methods for its optimization. Since the model is only valid around some small displacement $u_0$, we need to embed the estimation procedure within a coarse-to-fine warping framework. Moreover, from a qualitative viewpoint, one can expect piecewise affine solutions, sharp discontinuities, and robustness against outliers and illumination changes. The model does not explicitly incorporate occlusions; we therefore use a left-right consistency check to detect occlusions in the disparity maps.

C. Optimization

The model (10) is convex, but entirely non-smooth. While this non-smoothness largely accounts for the edge-preserving and robust qualities of the model, it also makes the model hard to optimize. Recent advances in non-smooth convex optimization, however, allow us to perform the optimization efficiently. We use the primal-dual algorithm proposed in [15]. In order to apply this algorithm, we first have to recast the minimization problem (10) as an equivalent convex-concave saddle-point problem. This can be achieved by applying the Legendre-Fenchel transform [16] to the individual absolute values in the model, together with a discretization of the images on a Cartesian grid of size $M \times N$ (i.e. the usual pixel grid), which results in

$$\min_{u \in \mathbb{R}^{MN},\, w \in \mathbb{R}^{2MN}} \; \max_{p \in P,\, q \in Q} \; \Big\{ \alpha_1 \big\langle D^{\frac{1}{2}} \nabla u - w,\, p \big\rangle + \alpha_0 \big\langle \nabla w,\, q \big\rangle + G(u) \Big\}, \quad (11)$$

where $G(u) = \lambda \|\hat\rho(u, x)\|_1$ and the sets $P$ and $Q$ are given by

$$P = \{ p \in \mathbb{R}^{2MN} \mid \|p\|_\infty \le 1 \}, \qquad Q = \{ q \in \mathbb{R}^{4MN} \mid \|q\|_\infty \le 1 \}.$$

Using this formulation we can directly apply the algorithm from [15] to our problem. The algorithm is an iterative procedure that acts on the individual pixels in each iteration. The basic iterations are a gradient ascent in $p$ and $q$, followed by a gradient descent in $u$ and $w$. Moreover, the algorithm uses an extrapolation step based on the previous iteration in order to ensure convergence of the procedure. Let us formalize these steps: Choose step sizes $\sigma_p > 0$, $\sigma_q > 0$, $\tau_u > 0$, $\tau_w > 0$ and iterate

$$\begin{aligned} p_{n+1} &= P_P\big(p_n + \sigma_p \alpha_1 (D^{\frac{1}{2}} \nabla \bar u_n - \bar w_n)\big) \\ q_{n+1} &= P_Q\big(q_n + \sigma_q \alpha_0 \nabla \bar w_n\big) \\ u_{n+1} &= (I + \tau_u \partial G)^{-1}\big(u_n + \tau_u \, \mathrm{div}(D^{\frac{1}{2}} p_{n+1})\big) \\ w_{n+1} &= w_n + \tau_w (\mathrm{div}\, q_{n+1} + p_{n+1}) \\ \bar u_{n+1} &= 2 u_{n+1} - u_n \\ \bar w_{n+1} &= 2 w_{n+1} - w_n \end{aligned}$$

until a stopping criterion is reached. The operators $P_P$ and $P_Q$ are point-wise Euclidean projection operators onto the sets $P$ and $Q$, respectively:

$$(P_P(\hat p))_{i,j} = \frac{\hat p_{i,j}}{\max\{1, \|\hat p_{i,j}\|\}}, \qquad (P_Q(\hat q))_{i,j} = \frac{\hat q_{i,j}}{\max\{1, \|\hat q_{i,j}\|\}}.$$

Furthermore, the algorithm needs to evaluate the so-called resolvent operator $(I + \tau_u \partial G)^{-1}(\hat u)$ in each iteration. This operator depends on the current value of $\hat u = u_n + \tau_u \, \mathrm{div}(D^{\frac{1}{2}} p_{n+1})$ and on the fidelity term $G(u)$, and in general poses an optimization problem itself. For the given fidelity term, however, the operator admits a closed-form solution, given by the following soft-thresholding scheme:

$$(I + \tau_u \partial G)^{-1}(\hat u) = \hat u + \begin{cases} \tau_u \lambda \hat\rho_x & \text{if } \hat\rho < -\tau_u \lambda \hat\rho_x^2 \\ -\tau_u \lambda \hat\rho_x & \text{if } \hat\rho > \tau_u \lambda \hat\rho_x^2 \\ -\hat\rho \, \hat\rho_x^{-1} & \text{if } |\hat\rho| \le \tau_u \lambda \hat\rho_x^2 \end{cases}$$

where $\hat\rho_x = \frac{\partial \hat\rho(u_0, x)}{\partial x}$ and $\hat\rho$ is evaluated at $\hat u$.

In practice we use preconditioning [17] to determine the step sizes and choose a fixed number of iterations as the stopping criterion. The divergence and gradient operators in the optimization are approximated using standard finite differences.
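For illustration, the complete primal-dual scheme can be written down compactly in NumPy. The sketch below is our own simplified transcription: it assumes the linearized data term $\hat\rho$ and a per-pixel diffusion tensor are precomputed, uses fixed scalar step sizes in place of the diagonal preconditioning of [17], and uses simplified boundary handling in the finite-difference operators:

```python
import numpy as np

def grad(f):
    """Forward-difference gradient; f: (H, W) -> (H, W, 2)."""
    fx = np.diff(f, axis=1, append=f[:, -1:])
    fy = np.diff(f, axis=0, append=f[-1:, :])
    return np.stack([fx, fy], axis=-1)

def div(p):
    """Backward-difference divergence, approximate negative adjoint of grad."""
    px, py = p[..., 0], p[..., 1]
    dx = px - np.roll(px, 1, axis=1); dx[:, 0] = px[:, 0]
    dy = py - np.roll(py, 1, axis=0); dy[0, :] = py[0, :]
    return dx + dy

def project(v):
    """Point-wise Euclidean projection onto the unit ball (last axis)."""
    return v / np.maximum(1.0, np.linalg.norm(v, axis=-1, keepdims=True))

def itgv_stereo_iterations(rho0, rho_x, D, u0, lam=5.0, a0=5.0, a1=1.0,
                           n_iters=100, sigma=0.25, tau=0.25):
    """Illustrative primal-dual iterations for model (10)/(11).

    rho0, rho_x: linearized data term and its derivative at u0, both (H, W).
    D: per-pixel 2x2 tensor field (H, W, 2, 2). Fixed steps stand in for
    the preconditioning of [17], so this is a sketch, not a tuned solver.
    """
    H, W = u0.shape
    u = u0.copy(); w = np.zeros((H, W, 2))
    ubar, wbar = u.copy(), w.copy()
    p = np.zeros((H, W, 2)); q = np.zeros((H, W, 2, 2))
    for _ in range(n_iters):
        # dual ascent in p and q, followed by projection onto P and Q
        Dgu = np.einsum('hwij,hwj->hwi', D, grad(ubar))
        p = project(p + sigma * a1 * (Dgu - wbar))
        gw = np.stack([grad(wbar[..., 0]), grad(wbar[..., 1])], axis=-2)
        q = project((q + sigma * a0 * gw).reshape(H, W, 4)).reshape(H, W, 2, 2)
        # primal descent in u (resolvent = soft thresholding) and w
        Dp = np.einsum('hwij,hwj->hwi', D, p)
        u_hat = u + tau * div(Dp)
        rho = rho0 + (u_hat - u0) * rho_x
        step = np.where(rho < -tau * lam * rho_x**2, tau * lam * rho_x,
               np.where(rho > tau * lam * rho_x**2, -tau * lam * rho_x,
                        -rho * rho_x / np.maximum(rho_x**2, 1e-12)))
        u_new = u_hat + step
        div_q = np.stack([div(q[..., 0, :]), div(q[..., 1, :])], axis=-1)
        w_new = w + tau * (div_q + p)
        # extrapolation step
        ubar = 2 * u_new - u; wbar = 2 * w_new - w
        u, w = u_new, w_new
    return u
```

In the coarse-to-fine framework this inner loop is re-run after each warp, with rho0 and rho_x recomputed at the current disparity estimate.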

III. EXPERIMENTS

We evaluated the proposed model on two real-world data sets. The first one is provided by the .enpeda. project (http://www.mi.auckland.ac.nz). This data set consists of eight sequences of 400 images each (except for the sequence People, which only includes 235 images), taken from an automotive platform. The sequences are trinocular, i.e. in addition to the two stereo cameras a third view is included, which can be used for an evaluation based on novel-view prediction. The sequences include challenging scenarios, like severe illumination artefacts, a night sequence and occluding windscreen wipers. [18] presents an evaluation of several popular stereo algorithms (with different cost functions) on this data set, which we will use as a baseline for the evaluation of the ITGV model. For the second evaluation we use the KITTI Stereo Benchmark Suite [19]. This benchmark consists of 195 test images that were taken from an automotive platform. The disparity maps are evaluated using semi-dense groundtruth, which was obtained from a laser scanner.

A. .enpeda.-Dataset

For this data set, we closely follow the evaluation methodology introduced in [20]: The reference view is given by the center camera, the matching view is given by the right camera. Finally, the frame from the left camera is used as control view. We use the normalized cross-correlation (NCC) between the control image $I_C$ and the predicted image $I_P$ as evaluation score:

$$N(I_C, I_P) = \frac{1}{|\Omega|} \sum_{x,y \in \Omega} \frac{(I_C(x,y) - \mu_C)(I_P(x,y) - \mu_P)}{\sigma_C \sigma_P},$$

where $\Omega$ is the evaluation domain, $\mu_C$ and $\mu_P$ denote the mean gray values, and $\sigma_C$ and $\sigma_P$ denote the standard deviations of the control image and the predicted image, respectively. The proposed model assigns a disparity value to every pixel in the image; it is therefore possible to set the evaluation domain $\Omega$ to the full image, only excluding obvious occlusions that arise due to non-visibility at the left border of the predicted image. This approach is referred to as "Full Analysis" in [18].
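In code, the NCC score is a direct transcription of the formula; the mask argument below is our addition and covers both the full and the masked analysis by selecting the evaluation domain Ω:

```python
import numpy as np

def ncc_score(I_C, I_P, mask=None):
    """Normalized cross-correlation N(I_C, I_P) over an evaluation domain.

    mask: optional boolean array selecting Omega, e.g. excluding the
    occluded left border (full analysis) or low-texture regions (masked).
    """
    if mask is None:
        mask = np.ones(I_C.shape, dtype=bool)
    c = I_C[mask].astype(np.float64)
    p = I_P[mask].astype(np.float64)
    mu_c, mu_p = c.mean(), p.mean()
    sigma_c, sigma_p = c.std(), p.std()
    return ((c - mu_c) * (p - mu_p)).mean() / (sigma_c * sigma_p)
```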

TABLE I: Parameters for the evaluation of the .enpeda.-Dataset. WS = window size of the Census transform, PF = subsampling factor between pyramid levels, W = # of warps per level in the coarse-to-fine framework, I/W = # of iterations per warp.

WS  | λ | α0 | α1 | γ | β | ε      | PF  | W  | I/W
5x5 | 5 | 5  | 1  | 5 | 1 | 0.0045 | 0.9 | 10 | 40

In the full analysis approach the NCC score can not capture errors in the disparity maps in regions of low texture. To overcome this problem we perform a second evaluation only in regions with sufficient texture (please refer to [18] for details of the detection procedure). We will refer to this approach as "Masked Analysis". Fig. 2 shows an example pair of a control image and a predicted image. For full analysis the NCC score is evaluated everywhere, except in the black regions at the left border of the predicted image. In the masked analysis approach, large homogeneous regions like the sky or the road surface are excluded from the analysis. Table I shows the parameter settings that were used throughout this experiment.

Fig. 3 shows the NCC score for each frame of the eight evaluated sequences, for full and masked analysis respectively. The proposed algorithm reaches high scores on all of the sequences. Some sequences occasionally show a strong drop in the NCC score for single frames. This can be attributed to severe adverse conditions (e.g. the wiper visible in Wiper, large saturated areas in Dusk, lightsaber artefacts in Night) or to synchronization problems during recording. The algorithm, however, is able to handle these problems rather gracefully. Small performance drops in the other sequences are largely caused by thin structures like street signs; the coarse-to-fine strategy sometimes causes faulty disparities at such small objects. The NCC scores for the People sequence show another interesting behavior: the sequence shows many people simultaneously crossing the street directly in front of the car. Such a scenario introduces large occluded areas and also large displacements between the reference and the matching image, which is challenging for any stereo algorithm, even under otherwise good conditions. We can therefore see a strong drop during the most crowded frames of the sequence (approximately between frames 50 and 110).



Fig. 2: Example of novel-view prediction. (a) Control image. (b) Predicted image.

Fig. 3: NCC scores for the .enpeda.-Dataset. "Full" refers to the full analysis approach, "Masked" to the masked analysis approach. µ denotes the mean correlation score over the whole sequence. (Best viewed in color.)

TABLE II: Mean NCC scores (in %) for full analysis. Numbers for GCM, BPM and SGM taken from [18]. All values rounded to the nearest integer.

Sequence | GCM | BPM | SGM | Ours
Barriers | 59  | 66  | 76  | 88
Dusk     | 87  | 50  | 95  | 97
Harbour  | 62  | 70  | 81  | 93
Midday   | 93  | 91  | 96  | 98
Night    | 41  | 64  | 90  | 94
People   | 66  | 68  | 79  | 81
Queen    | 82  | 80  | 89  | 94
Wiper    | 89  | 87  | 94  | 97

The difference between full and masked analysis becomes most apparent in sequences with large homogeneous regions like Dusk, Midday and Wiper. However, we see a relatively small difference (≈ 6%) between the two types of analysis here. The algorithm sometimes fails to handle objects which are too thin, but otherwise provides sufficient detail in the disparity maps. This is most evident in the sequences Barriers and Harbour Bridge, where many thin bars are visible and yet both scores are very similar. Note that we see the strongest difference in sequences where the sky is prominently visible: the full analysis may show a high score in such areas even though the disparity map includes wrong disparities.

We also compared the proposed model to the evaluation shown in [18]. They compare three popular stereo matching approaches, each with three different cost functions: Belief Propagation Matching (BPM), Graph Cut Matching (GCM) and Semi-Global Matching (SGM). Tables II and III show our results in terms of the mean NCC score in percent over each sequence, next to the highest scoring variants reported in [18]. Our approach outperforms all of the compared methods. For full analysis, a significant performance gain can be seen when compared to the global methods GCM and BPM.

The performance gain is smaller when compared to SGM, but still on the order of a few percent. Note, however, that SGM, unlike the other approaches, does not generate completely dense disparity maps; the NCC score in this case was only evaluated on pixels with valid disparity (approximately 60% on average, according to [18]). The better performance becomes even more apparent in the masked analysis. While the NCC scores drop significantly for all other methods, our algorithm maintains a consistently high score on all sequences. The very large differences to the other algorithms on the Barriers and Harbour Bridge sequences again show that our algorithm is not only able to infer smooth disparity maps, but also preserves rich detail in scenes with many thin objects.

Finally, Fig. 4 shows one example disparity map for each sequence. It is evident that the algorithm performs very well on the road surface and other large smooth structures that are typically encountered in urban scenes. The model shows a strong inpainting effect that, on the one hand, makes the approach very robust: Fig. 4h shows a result where the wiper is prominently visible in the reference image (from lower right to upper left), and the inpainting effect can be observed on the tree on the left side and, to some extent, in the lower right corner. On the other hand, this effect may result in oversmoothing, as visible in Fig. 4f. This effect, however, can be controlled via the regularization parameter λ. Fig. 4b was also taken under adverse conditions: the input images for this example were strongly saturated in the upper left corner, due to the sun being close to the horizon. In general this scenario does not severely degrade performance; saturated regions are simply detected as occlusions, while the rest of the disparity map is still correctly inferred.

TABLE III: Mean NCC scores (in %) for masked analysis. Numbers for GCM, BPM and SGM taken from [18]. All values rounded to the nearest integer.

Sequence | GCM | BPM | SGM | Ours
Barriers | 54  | 61  | 34  | 87
Dusk     | 76  | 75  | 33  | 92
Harbour  | 54  | 64  | 25  | 92
Midday   | 82  | 79  | 38  | 95
Night    | 43  | 64  | 31  | 94
People   | 66  | 66  | 53  | 80
Queen    | 78  | 75  | 53  | 92
Wiper    | 77  | 73  | 22  | 92

Fig. 4: Results for the .enpeda. data set: (a) Barriers, (b) Dusk, (c) Harbour Bridge, (d) Midday, (e) Night, (f) People, (g) Queen Street, (h) Wiper. Note that the examples include pairs featuring adverse conditions.

Finally, Fig. 4e shows an example from the Night sequence. The input images are of very low contrast and do not contain much information. Due to the interpolating effect described above, large parts of the disparity map can be reconstructed nonetheless.

In our last experiment we compared the sub-pixel accuracy of our approach on a fronto-parallel surface to a sub-pixel accurate version of SGM [21]. The surface used for the evaluation is marked with a green rectangle in Fig. 4g. We compared the algorithms in terms of the standard deviation of the estimates in this area: SGM yields a standard deviation of 0.31 px at a mean disparity of 29.75 px, while our approach has a standard deviation of 0.22 px at the same mean disparity. This shows that our approach indeed reaches the intended goal of excellent sub-pixel accuracy.
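For reference, this sub-pixel comparison reduces to the mean and standard deviation of the disparity estimates inside the marked rectangle; a minimal check of this kind looks as follows (the region coordinates are placeholders):

```python
import numpy as np

def planar_region_stats(disp, y0, y1, x0, x1):
    """Mean and standard deviation of disparities inside a rectangle,
    used to gauge sub-pixel noise on a fronto-parallel surface."""
    patch = disp[y0:y1, x0:x1]
    return float(patch.mean()), float(patch.std())
```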

B. KITTI-Dataset

The KITTI Benchmark Suite was recently released with the goal of benchmarking stereo and optical flow algorithms in the context of automotive applications [19]. The data set consists of 194 pairs of training images with corresponding groundtruth and 195 test pairs without groundtruth. The training images were used to adapt the parameters of the model to the modalities of this data set. Due to the overall simpler illumination conditions, we are able to put stronger emphasis on the diffusion tensor, which also influences the magnitude of the regularization parameter λ. Table V shows the parameters that were used in this experiment.

To assess the performance of the algorithm we count the percentage of disparity values that have an error larger than 3 pixels. Note that this data set is not suited to assess sub-pixel accuracy because of uncertainties in the laser measurements. We can, however, measure robustness and performance relative to other state-of-the-art algorithms.
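This error metric is straightforward to compute; a minimal sketch, assuming invalid ground-truth pixels are encoded as non-positive values (a common KITTI convention, not specified here):

```python
import numpy as np

def kitti_bad_pixel_rate(disp_est, disp_gt, threshold=3.0):
    """Percentage of disparities whose error exceeds `threshold` pixels.

    disp_gt: semi-dense ground truth; pixels without a laser measurement
    are assumed to be marked <= 0 and are excluded from the evaluation.
    """
    valid = disp_gt > 0
    err = np.abs(disp_est[valid] - disp_gt[valid])
    return 100.0 * (err > threshold).mean()
```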

TABLE V: Parameters for the evaluation of the KITTI-Dataset.

WS  | λ   | α0 | α1 | γ   | β | ε      | PF  | W  | I/W
7x7 | 0.5 | 5  | 1  | 0.2 | 4 | 0.0045 | 0.9 | 20 | 40


Fig. 5: An example result for the KITTI-Dataset. (a) Reference image. (b) Disparity map.

Fig. 5 shows an exemplary input image together with the result of the proposed algorithm. The qualitative results for this data set are similar to the previous evaluation: we are able to robustly reconstruct smooth surfaces, which leads to low error rates on the street and on other slanted surfaces. The algorithm handles specular and semi-transparent surfaces rather gracefully. Moreover, we observe that the resulting disparity maps are sufficiently detailed (see for example the side mirror of the parked car in Fig. 5).

The quantitative evaluation supports these observations: our algorithm ranks second out of ten evaluated algorithms¹. Table IV shows the average error on the test set for the five top-ranked algorithms. Note that the only algorithm with a smaller average error needs 5 minutes to process a single stereo pair of size 1240 × 376, while the proposed approach only takes 7 seconds (using a GPU-based implementation on an Nvidia Quadro 6000 GPU).

TABLE IV: Average number of bad pixels (i.e. disparity error larger than 3 pixels) on the KITTI-Dataset.

Rank | Method   | Error   | Density  | Runtime
1    | PCBP     | 6.16 %  | 100.00 % | 5 min
2    | Ours     | 7.40 %  | 100.00 % | 7 s
3    | OCV-SGBM | 9.13 %  | 86.50 %  | 1.1 s
4    | ELAS     | 9.95 %  | 94.55 %  | 0.3 s
5    | SDM      | 12.19 % | 63.58 %  | 1 min

¹ Further results and descriptions of the algorithms can be found at http://www.cvlibs.net/datasets/kitti

IV. CONCLUSION

We presented a novel stereo estimation algorithm based on a variational framework. The model incorporates an image-adaptive regularization term and allows for piecewise affine and dense disparity maps. A fidelity term based on the Census transform provides robustness against many adverse conditions that are typically encountered in an automotive context. The variational formulation naturally supports sub-pixel accurate solutions, which, together with the regularization, results in smooth and accurate disparity maps. Using two challenging real-world data sets, we have shown that the proposed model outperforms many algorithms that are considered state-of-the-art. We have further shown that the algorithm performs particularly well in scenes with many man-made structures, because they perfectly fit the implicit model assumption of a piecewise planar world. Future work will be directed towards a speed-up of the optimization procedure, in order to enable real-time performance.

REFERENCES

[1] S. Gehrig and U. Franke, "Improving stereo sub-pixel accuracy for long range stereo," in International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[2] M. Shimizu and M. Okutomi, "Precise sub-pixel estimation on area-based matching," in International Conference on Computer Vision (ICCV), vol. 1, 2001, pp. 90–97.
[3] D. Gallup, J.-M. Frahm, P. Mordohai, and M. Pollefeys, "Variable baseline/resolution stereo," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[4] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, pp. 259–268, 1992.
[5] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185–203, 1981.
[6] K. Bredies, K. Kunisch, and T. Pock, "Total generalized variation," SIAM Journal on Imaging Sciences, vol. 3, no. 3, pp. 492–526, 2010.
[7] H.-H. Nagel and W. Enkelmann, "An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 5, pp. 565–593, 1986.
[8] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof, "Anisotropic Huber-L1 optical flow," in Proceedings of the British Machine Vision Conference (BMVC), 2009.
[9] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, "An improved algorithm for TV-L1 optical flow," 2009, pp. 23–45.
[10] D. Shulman and J.-Y. Herve, "Regularization of discontinuous flow fields," in Workshop on Visual Motion, 1989, pp. 81–86.
[11] R. Zabih and J. Woodfill, "Non-parametric local transforms for computing visual correspondence," in European Conference on Computer Vision (ECCV), 1994, pp. 151–158.
[12] T. Müller, C. Rabe, J. Rannacher, U. Franke, and R. Mester, "Illumination-robust dense optical flow using census signatures," in German Association for Pattern Recognition (DAGM), 2011, pp. 236–245.
[13] R. W. Hamming, "Error detecting and error correcting codes," Bell System Technical Journal, vol. 29, no. 2, pp. 147–160, 1950.
[14] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in European Conference on Computer Vision (ECCV), 2004, vol. 3024, pp. 25–36.
[15] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, pp. 120–145, 2011.
[16] R. T. Rockafellar, Convex Analysis, ser. Princeton Landmarks in Mathematics. Princeton, NJ: Princeton University Press, 1997, reprint of the 1970 original.
[17] T. Pock and A. Chambolle, "Diagonal preconditioning for first order primal-dual algorithms in convex optimization," in International Conference on Computer Vision (ICCV), 2011.
[18] S. Morales, S. Hermann, and R. Klette, "Real-world stereo-analysis evaluation," .enpeda. Project, The University of Auckland, Tech. Rep., 2010.
[19] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012.
[20] S. Morales and R. Klette, "A third eye for performance evaluation in stereo sequence analysis," in International Conference on Computer Analysis of Images and Patterns (CAIP), 2009, vol. 5702, pp. 1078–1086.
[21] S. K. Gehrig, F. Eberli, and T. Meyer, "A real-time low-power stereo vision engine using semi-global matching," in International Conference on Computer Vision Systems (ICVS), 2009, pp. 134–143.