Scalable Full Flow with Learned Binary Descriptors

arXiv:1707.06427v1 [cs.CV] 20 Jul 2017

Gottfried Munda¹  Alexander Shekhovtsov²  Patrick Knöbelreiter¹  Thomas Pock¹,³

¹ Institute of Computer Graphics and Vision, Graz University of Technology, Austria
² Czech Technical University in Prague
³ Center for Vision, Automation and Control, Austrian Institute of Technology

Abstract. We propose a method for large displacement optical flow in which local matching costs are learned by a convolutional neural network (CNN) and a smoothness prior is imposed by a conditional random field (CRF). We tackle the computation- and memory-intensive operations on the 4D cost volume by a min-projection, which reduces memory complexity from quadratic to linear, and by binary descriptors for efficient matching. This enables evaluation of the cost on the fly and allows us to perform learning and CRF inference on high resolution images without ever storing the 4D cost volume. To address the problem of learning binary descriptors we propose a new hybrid learning scheme. In contrast to current state-of-the-art approaches for learning binary CNNs, we can compute the exact non-zero gradient within our model. We compare several methods for training binary descriptors and show results on publicly available benchmarks.

1 Introduction

Optical flow can be seen as an instance of the dense image matching problem, where the goal is to find for each pixel its corresponding match in the other image. One fundamental question in dense matching is how to choose good descriptors or features. Convolutional neural networks (CNNs) have recently shown excellent results for learning task-specific image features, outperforming previous methods based on hand-crafted descriptors. One of the major difficulties in learning features for optical flow is the high dimensionality of the cost function: whereas in stereo the full cost function can be represented as a 3D volume, the matching cost in optical flow is a 4D volume. Especially at high image resolutions, operations on the flow matching cost are expensive both in terms of memory requirements and computation time. Our method avoids explicit storage of the full cost volume, both in the learning phase and during inference. This is achieved by a splitting (or min-projection) of the 4D cost into two quasi-independent 3D volumes, corresponding to the u and v components of the flow. We then formulate CNN learning and CRF inference in this reduced setting. This achieves a space complexity linear in the size of the search range, similar to recent stereo methods, which is a significant reduction compared to the quadratic complexity of the full 4D cost function. Nevertheless, we still have to compute all entries of the 4D cost. This computational bottleneck can be optimized by using binary descriptors, which give a theoretical speed-up factor of 32. In practice, even larger speed-up factors are attained, since binary descriptors need less memory bandwidth and also yield better cache efficiency. Consequently, we aim to incorporate a binarization step into the learning. We propose a novel hybrid learning scheme which circumvents the problem of hard nonlinearities having zero gradient. We show that our hybrid learning performs almost as well as a network without hard nonlinearities, and much better than the previous state of the art in learning binary CNNs.

2 Related Work

In the past, hand-crafted descriptors like SIFT, NCC and FAST were used extensively with very good results, but recently CNN-based approaches [23,13] marked a paradigm shift in the field of image matching. To date, all top performing methods in the major stereo benchmarks rely heavily on features learned by CNNs. For optical flow, many recent works still use engineered features [5,1], presumably due to the difficulties the high-dimensional optical flow cost function poses for learning. Only very recently do we see a shift towards CNNs for learning descriptors [9,10,22]. Our work is most related to [22], who construct the full 4D cost volume and run an adapted version of SGM on it. They perform learning and cost volume optimization at 1/3 of the original resolution and compress the cost function in order to cope with the high memory consumption. Our method is memory-efficient thanks to the dimensionality reduction by the min-projection, and we outperform the reported runtime of [22] by a factor of 10.

Full flow with CRF [5] is a related inference method using TRW-S [12] with an efficient distance transform [8]. Its iterations have quadratic time and space complexity. In practice, this takes 20 GB of memory (estimated for a cost volume of size 341×145×160×160 based on numbers in [5], corresponding to 1/3 resolution of Sintel images) and 10-30 seconds per iteration with a parallel CPU implementation. We use the decomposed model [19], which has a better memory complexity, together with a faster parallel inference scheme based on [18].

Hand-crafted binary descriptors like Census have been shown to work well in a number of applications, including image matching for stereo and flow [14,15,20,4]. However, direct learning of binary descriptors is a difficult task, since the hard thresholding function, sign(x), has gradient zero almost everywhere. In the context of binary CNNs there are several approaches to train networks with binary activations [2] and even binary weights [7,16]. This is known to give a considerable compression and speed-up at the price of a tolerable loss of accuracy. To circumvent the problem of sign(x) having zero gradient a.e., surrogate gradients are used. The simplest method, called the straight-through estimator [2], is to assume the derivative of sign(x) is 1, i.e., to simply omit the sign function in the gradient computation. This approach can be considered the state of the art, as it gives the best results in [2,7,16]. We show that in the context of learning binary descriptors for the purpose of matching, alternative strategies are possible which give better results.
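To make the straight-through estimator concrete, the following is a minimal PyTorch sketch (our illustration, not code from [2,7,16]): the forward pass applies sign(x), while the backward pass simply copies the incoming gradient unchanged.

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign(x) with the straight-through gradient estimator [2]."""

    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The derivative of sign is zero a.e.; the straight-through
        # estimator pretends it is 1 and passes the gradient unchanged.
        return grad_output

# usage: x_bin = SignSTE.apply(x)
```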

3 Method

We define two models for optical flow: a local model, known as Winner-Takes-All (WTA), and a joint model, which uses CRF inference. Both models use CNN descriptors, learned in § 3.1. The joint model has only a few extra parameters that are fit separately, and its inference is solved with a parallel method, see § 3.2. For CNN learning, we optimize the performance of the local model. While learning by optimizing the performance of the joint model is possible [11], the resulting procedures are significantly more difficult.

We assume color images I^1, I^2 : Ω → ℝ^3, where Ω = {1, …, H} × {1, …, W} is the set of pixels. Let W = S × S be a window of discrete 2D displacements, with S = {−D/2, −D/2 + 1, …, D/2 − 1} given by the search window size D, an even number. The flow x : Ω → W associates a displacement to each pixel i ∈ Ω, so that the displaced position of i is given by i + x_i ∈ ℤ². For convenience, we write x = (u, v), where u and v are mappings Ω → S, the components of the flow in the horizontal and vertical directions, respectively. The per-pixel descriptors φ(I; θ) : Ω → ℝ^m are computed by a CNN with parameters θ. Let φ^1, φ^2 be the descriptors of images I^1, I^2, respectively. The local matching cost for a pixel i ∈ Ω and displacement x_i ∈ W is given by

    c_i(x_i) = { d(φ^1_i, φ^2_{i+x_i})   if i + x_i ∈ Ω,
               { c_outside               otherwise,                                (1)

where d : ℝ^m × ℝ^m → ℝ is a distance function in ℝ^m. "Distance" is used in a loose sense here; we consider the negative scalar product d(φ^1, φ^2) = −⟨φ^1, φ^2⟩, negative since we want to pose matching as a minimization problem. We call

    x̂_i ∈ argmin_{x_i ∈ W} c_i(x_i)                                               (2)

the local optical flow model, which finds independently for each pixel i a displacement x_i that optimizes the local matching cost. The joint optical flow model finds the full flow field x optimizing the coupled CRF energy:

    x̂ ∈ argmin_{u,v : Ω→S} [ Σ_{i∈Ω} c_i(u_i, v_i) + Σ_{i∼j} w_ij (ρ(u_i − u_j) + ρ(v_i − v_j)) ],   (3)

where i ∼ j denotes a 4-connected pixel neighborhood, w_ij are contrast-sensitive weights, given by w_ij = exp(−(α/3) Σ_{c∈{R,G,B}} |I^1_{i,c} − I^1_{j,c}|), and ρ : ℝ → ℝ is a robust penalty function shown in Fig. 2(a).
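For illustration, here is a short NumPy sketch (ours) of the contrast-sensitive weights for horizontal neighborhood edges; the value of α follows the setting reported in § 4:

```python
import numpy as np

def contrast_weights_horizontal(I, alpha=8.5):
    """w_ij = exp(-(alpha/3) * sum_{c in {R,G,B}} |I1_{i,c} - I1_{j,c}|)
    for horizontal 4-neighborhood edges j = i + (0, 1).
    I: (H, W, 3) reference image I^1; returns (H, W-1) weights."""
    diff = np.abs(np.diff(I.astype(np.float32), axis=1)).sum(axis=-1)
    return np.exp(-(alpha / 3.0) * diff)
```

The vertical-edge weights follow symmetrically with np.diff along axis 0.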

3.1 Learning Descriptors

A common difficulty of the models (2) and (3) is that they need to process the 4D cost (1), which involves computing a distance in ℝ^m per entry. Storing such a cost volume takes O(|Ω|D²) space, and evaluating it takes O(|Ω|D²m) time. We can reduce the space complexity to O(|Ω|D) by avoiding explicit storage of the 4D cost function. This facilitates memory-efficient end-to-end training on high resolution images, without a patch sampling step [22,13]. Towards this end, we write the local optical flow model (2) in the following way:

    û_i ∈ argmin_{u_i} c^u_i(u_i),  where  c^u_i(u_i) = min_{v_i} c_i(u_i, v_i);   (4a)
    v̂_i ∈ argmin_{v_i} c^v_i(v_i),  where  c^v_i(v_i) = min_{u_i} c_i(u_i, v_i).   (4b)

The inner step in (4a) and (4b), called min-projection, minimizes out one component of the flow vector. This can be interpreted as a decoupling of the full 4D flow problem into two simpler quasi-independent 3D problems on the reduced cost volumes c^u, c^v. Assuming the minimizer of (2) is unique, (4a) and (4b) find the same solution as the original problem (2). Using this representation, CNN learning can be implemented within existing frameworks. We point out that this approach has the same space complexity as recent methods for learning stereo matching, since we only need to store the 3D cost volumes c^u and c^v. As an illustrative example, consider an image of size 1024 × 436 and a search range of 256. In this setting the full 4D cost function takes roughly 108 GB, whereas our splitting consumes only 0.8 GB.

Network  Fig. 1 shows the network diagram of the local flow model, Eq. (2). The structure is similar to recent methods proposed for learning stereo matching [13,23,6,11]. It is a siamese network consisting of two convolutional branches with shared parameters, followed by a correlation layer. The filter size of the convolutions is 3 × 3 for the first layer and 2 × 2 for all other layers. The tanh nonlinearity keeps feature values in a defined range, which works well with the scalar product as distance function. We do not use striding or pooling. The last convolutional layer uses 64 filter channels; all other layers have 96 channels. This fixes the dimensionality of the distance space to m = 64.

Loss  Given the ground-truth flow field (u*, v*), we pose the learning objective as follows: we define a probabilistic softmax model of the local prediction u_i (resp. v_i) as p(u_i) ∝ exp(−c^u_i(u_i)), then we consider the naive model p(u, v) = Π_i p(u_i) p(v_i) and apply the maximum likelihood criterion. The negative log-likelihood is given by

    L(u, v) = − Σ_{i∈Ω} [ log p(u*_i) + log p(v*_i) ].                             (5)

This is equivalent to a cross-entropy loss with the target distribution concentrated at the single point (u*_i, v*_i) for each i. Variants of the cross-entropy loss where the target distribution is spread around the ground truth point (u*_i, v*_i) are also used in the literature [13] and can be easily incorporated.
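For concreteness, the following NumPy sketch (our illustration of Eq. (4), assuming the negative-scalar-product distance of Eq. (1) and a constant c_outside) computes both min-projections without ever materializing the (H, W, D, D) volume:

```python
import numpy as np

def min_projections(phi1, phi2, D, c_outside=1e3):
    """Min-projections c_u, c_v of Eq. (4) for the cost of Eq. (1) with
    d(a, b) = -<a, b>. Time O(|Omega| D^2 m), memory O(|Omega| D): the
    4D volume is evaluated one displacement slice at a time."""
    H, W, m = phi1.shape
    S = range(-D // 2, D // 2)                       # search window S
    c_u = np.full((H, W, D), np.inf, np.float32)
    c_v = np.full((H, W, D), np.inf, np.float32)
    for j, v in enumerate(S):
        for i, u in enumerate(S):
            c = np.full((H, W), c_outside, np.float32)
            y0, y1 = max(0, -v), min(H, H - v)       # rows with i + x_i inside
            x0, x1 = max(0, -u), min(W, W - u)       # columns likewise
            if y0 < y1 and x0 < x1:
                c[y0:y1, x0:x1] = -np.einsum(
                    'yxm,yxm->yx',
                    phi1[y0:y1, x0:x1],
                    phi2[y0 + v:y1 + v, x0 + u:x1 + u])
            c_u[:, :, i] = np.minimum(c_u[:, :, i], c)   # min over v, Eq. (4a)
            c_v[:, :, j] = np.minimum(c_v[:, :, j], c)   # min over u, Eq. (4b)
    return c_u, c_v
```

The WTA solution (2) is then obtained by an argmin over the last axis of c_u and c_v (as offsets into S), and −c_u, −c_v can serve directly as the logits of the softmax model in the loss (5).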


Fig. 1. Network architecture: a number of convolutional layers with shared parameters computes feature vectors φ^1, φ^2 for every pixel. These feature vectors are cross-fed into a correlation layer that computes local matching costs in the u and v directions by minimizing out the other direction. The result is two quasi-independent cost volumes for the u and v components of the flow.
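A minimal PyTorch sketch of one siamese branch as described above (our reading of the text; the tanh after the final convolution and the absence of padding are assumptions):

```python
import torch.nn as nn

def descriptor_branch(num_conv_layers=5, m=64):
    """One siamese branch: 3x3 conv first, 2x2 convs afterwards, tanh after
    every layer, 96 channels except the last layer with m = 64 channels,
    no striding or pooling (cf. Table 1 for the 5/7/9-layer variants)."""
    layers = [nn.Conv2d(3, 96, kernel_size=3), nn.Tanh()]
    for _ in range(num_conv_layers - 2):
        layers += [nn.Conv2d(96, 96, kernel_size=2), nn.Tanh()]
    layers += [nn.Conv2d(96, m, kernel_size=2), nn.Tanh()]
    return nn.Sequential(*layers)
```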

Learning Quantized Descriptors  The computational bottleneck in scheme (4) is computing the min-projections, with time complexity O(|Ω|D²m). This operation arises during learning as well as in the CRF inference step, where it corresponds to the message exchange in the dual decomposition. It is therefore desirable to accelerate this step. We achieve a significant speed-up by quantizing the descriptors and evaluating the Hamming distance of binary descriptors. Let us define the quantization: we call φ̄ = sign(φ) the quantized descriptor field. The distance between quantized descriptors is given by d(φ̄^1, φ̄^2) = −⟨φ̄^1, φ̄^2⟩ = 2H(φ̄^1, φ̄^2) − m, equivalent to the Hamming distance H(·,·) up to a scaling and an offset. Let the quantized cost function, defined similarly to (1), be denoted c̄_i(x_i). We can then compute quantized min-projections c̄^u, c̄^v. However, learning the model (2) with quantized descriptors is difficult due to the gradient of the sign function being zero almost everywhere. We introduce a new technique specific to the matching problem and compare it to the baseline method that uses the straight-through estimator of the gradient [2]. Consider the following variants of the model (4a):

    û_i ∈ argmin_{u_i} c_i(u_i, v̂_i(u_i)),  where  v̂_i(u_i) ∈ argmin_{v_i} c̄_i(u_i, v_i);   (FQ)
    û_i ∈ argmin_{u_i} c̄_i(u_i, v̂_i(u_i)),  where  v̂_i(u_i) ∈ argmin_{v_i} c̄_i(u_i, v_i).   (QQ)

The respective variants of (4b) are symmetric. The second letter in the naming scheme indicates whether the inner problem, i.e., the min-projection step, is performed on the (Q)uantized or (F)ull cost, whereas the first letter refers to the outer problem on the smaller 3D cost volume. The initial model (4a) is thus also denoted the FF model. While the models FF and QQ correspond, up to non-uniqueness of solutions, to the joint minimum in (u_i, v_i) of the costs c and c̄, respectively, the model FQ is a mixed one. This hybrid model is interesting because the minimization in v_i can be computed efficiently on the binarized cost with the Hamming distance, while the minimization in u_i has a non-zero gradient in c^u. We thus consider the model FQ an efficient variant of the local optical flow model (2). In addition, it is a good learning proxy for the model QQ: let û_i = argmin_{u_i} c_i(u_i, v̂_i(u_i)) be a minimizer of the outer problem FQ. Then the derivative of FQ is defined by the indicator of the pair (û_i, v̂_i(û_i)). This is the same as the derivative of FF, except that v̂_i(u_i) is computed differently. Learning the model QQ involves a hard quantization step, and we apply the straight-through estimator to compute a gradient. Note that the exact gradient for the model FQ can be computed at approximately the same reduced computational cost as the straight-through gradient in the model QQ.
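As a small illustration of the quantized matching cost, the following NumPy sketch (ours) packs the sign bits and evaluates the Hamming distance; a production implementation would use hardware popcount on packed 64-bit words, which is the source of the theoretical factor-32 speed-up mentioned in § 1:

```python
import numpy as np

def pack_signs(phi):
    """phi_bar = sign(phi), packed 8 channels per byte for bitwise matching.
    phi: (H, W, m) with m a multiple of 8 -> (H, W, m // 8) uint8."""
    return np.packbits((phi > 0).astype(np.uint8), axis=-1)

def hamming_cost(b1, b2):
    """Hamming distance H(phi1_bar, phi2_bar) between packed descriptors;
    the quantized cost -<phi1_bar, phi2_bar> equals 2*H - m."""
    xor = np.bitwise_xor(b1, b2)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)  # popcount per pixel
```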

3.2 CRF


The baseline model, which we call the product model, has |Ω| variables x_i with the state space S × S. It has been observed in [8] that max-product message passing in the CRF (3) can be computed in time O(D²) per variable for separable interactions using a fast distance transform. However, storing the messages for a 4-connected graph requires O(|Ω|D²) memory. Although such an approach was shown to be feasible even for large displacement optical flow [5], we argue that the more compact decomposed model [19] gives comparable results and is much faster in practice. The decomposed model is constructed by observing that the regularization in (3) is separable over u and v. The energy (3) can then be represented as a CRF with 2|Ω| variables u_i, v_i and the following pairwise terms: the in-plane terms w_ij ρ(u_i − u_j) and the cross-plane terms c_i(u_i, v_i), forming the graph shown in Fig. 2(b). In this formulation there are no unary terms, since the costs c_i are interpreted as pairwise terms. The resulting linear programming (LP) dual is more economical, because it has only O(|Ω|D) variables. The message passing for edges inside planes and across planes has complexity O(|Ω|D) and O(|Ω|D²), respectively.

Fig. 2. Building blocks of the CRF. (a) Robust pairwise function ρ. (b) Decomposition of the pairwise CRF into 5 subproblems. (c) Lagrange multipliers in the dual corresponding to equality constraints between the subproblems. They act as offsets of unary costs between subproblems, increasing on one side of the arrow and decreasing on the other.

We apply the parallel inference method [18] to the dual of the decomposed model [19] (see Fig. 2(b)). Although different dual decompositions reach different objective values in a fixed number of iterations, it is known that all decompositions with trees covering the graph are equivalent in the optimal value [21]. The decomposition in Fig. 2(b) is into horizontal and vertical chains in each of the u- and v-planes, plus a subproblem containing all cross-plane edges. We introduce Lagrange multipliers λ = (λ^k ∈ ℝ^{Ω×S} | k = 1, 2, 3, 4) enforcing equality constraints between the subproblems, as shown in Fig. 2(c). The Lagrange multipliers λ^k are identified with modular functions λ^k : S^Ω → ℝ : u ↦ Σ_i λ^k_i(u_i). Let us also introduce a shorthand for the sum of pairwise terms over horizontal chains, f^h : S^Ω → ℝ : u ↦ Σ_{ij∈E^h} w_ij ρ(u_i − u_j), and a symmetric definition f^v for the sum over the vertical chains. The lower bound Ψ(λ) corresponding to the decomposition in Fig. 2(c) is given by:

    Ψ(λ) = Ψ^1(λ) + Ψ^2(λ) + Ψ^3(λ),  where                                       (6a)
    Ψ^1(λ) = min_u [(λ^1 + λ^3)(u) + f^h(u)] + min_u [−λ^1(u) + f^v(u)];           (6b)
    Ψ^2(λ) = min_v [(λ^2 + λ^4)(v) + f^h(v)] + min_v [−λ^2(v) + f^v(v)];           (6c)
    Ψ^3(λ) = Σ_i min_{u_i,v_i} [c_i(u_i, v_i) − λ^3_i(u_i) − λ^4_i(v_i)].          (6d)

Our Lagrangian dual to (3) is to maximize Ψ(λ) over λ, which enforces consistency between minimizers of the subproblems. The general theory [21] applies; in particular, when the minimizers of all subproblems are consistent, they form a global minimizer. In (6b), there is a sum of horizontal and vertical chain subproblems in the u-plane. When λ^3 is fixed, Ψ^1(λ) is the lower bound corresponding to the relaxation of the energy in u with the unary terms given by λ^3. It can be interpreted as a stereo-like problem with 1D labels u. Similarly, Ψ^2(λ) is a lower bound for the v-plane with unary terms λ^4. Subproblem Ψ^3(λ) is simple: it contains both variables u, v, but the minimization decouples over individual pairs (u_i, v_i). It connects the two stereo-like problems through the 4D cost volume c.

Updating messages inside planes can be done at a different rate than across planes. The optimal rate for fast convergence depends on the time complexity of the message updates. For the TRW-S solver [12], [19] reported an optimal rate of updating in-plane messages 5 times as often. The decomposition (6a) facilitates this kind of strategy and allows us to use the implementation [18] designed for stereo-like problems. We therefore use the dual solver of [18], denoted Dual Minorize-Maximize (DMM), to perform the in-plane updates. When applied to the problem of maximizing Ψ^1(λ) in λ^1, it has the following properties: a) the bound Ψ^1(λ) does not decrease, and b) it computes a modular minorant s such that s(u) ≤ λ^3(u) + f^h(u) + f^v(u) for all u and Ψ^1(λ) = Σ_i min_{u_i} s_i(u_i). The modular minorant s is an excess of costs, called slacks, which can be subtracted from λ^3 while keeping Ψ^1(λ) non-negative. The associated update of the u-plane can be denoted as

    (λ^1, s) := DMM(λ^1, λ^3, f^h, f^v),                                           (7a)
    λ^3 := λ^3 − s.                                                                (7b)


Algorithm 1: Flow CRF Optimization
  Input: cost volume c;  Output: dual point λ optimizing Ψ(λ);
  1  Initialize λ := 0;
  2  for t = 1, …, it_outer do
  3    Perform the following updates:
  4      v → u: pass slacks to the u-plane by (9), changes λ^3;
  5      u-plane: DMM with it_inner iterations for the u-plane (7a), changes λ^1, λ^3;
  6      u → v: pass slacks to the v-plane by (8), changes λ^4;
  7      v-plane: DMM with it_inner iterations for the v-plane, changes λ^2, λ^4;

The slack s is then passed to the v-plane by the following update, i.e., message passing u → v:

    λ^4_i(v_i) := λ^4_i(v_i) + min_{u_i} [ c_i(u_i, v_i) − λ^3_i(u_i) ].           (8)

The minimization in (8) has time complexity O(|Ω|D²), assuming the 4D costs c_i are available in memory. As discussed above, we can compute the costs c_i efficiently on the fly and avoid O(|Ω|D²) storage. The update v → u is symmetric to (8):

    λ^3_i(u_i) := λ^3_i(u_i) + min_{v_i} [ c_i(u_i, v_i) − λ^4_i(v_i) ].           (9)

The complete method is summarized in Algorithm 1. It starts by collecting the slacks in the u-plane. When initialized with λ = 0, the update (9) simplifies to λ^3_i(u_i) = min_{v_i} c_i(u_i, v_i), i.e., it matches exactly the min-projection c^u in (4a). The problem solved with DMM in Line 5 in the first iteration is hence a stereo-like problem with cost c^u. The dual solution redistributes the costs and determines which values of u are worse than others, expressing this cost offset in λ^3 as specified in (7a). The optimization of the v-plane then continues with some information about good solutions for u propagated via the cost offsets using (8).
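The two cross-plane updates vectorize compactly; below is a NumPy sketch (ours) of Eqs. (8) and (9) for a block of N pixels whose 4D cost slice c_i(u_i, v_i) has been evaluated on the fly:

```python
import numpy as np

def pass_slacks_u_to_v(cost, lam3, lam4):
    """Eq. (8): lam4_i(v) += min_u [ c_i(u, v) - lam3_i(u) ].
    cost: (N, D, D) block of c_i(u, v); lam3, lam4: (N, D)."""
    return lam4 + (cost - lam3[:, :, None]).min(axis=1)

def pass_slacks_v_to_u(cost, lam4, lam3):
    """Eq. (9), the symmetric update:
    lam3_i(u) += min_v [ c_i(u, v) - lam4_i(v) ]."""
    return lam3 + (cost - lam4[:, None, :]).min(axis=2)
```

In Algorithm 1, these two functions correspond to Lines 4 and 6, with the in-plane DMM updates of [18] (not reproduced here) in between. With λ = 0, the v → u update reduces to the min-projection c^u of (4a), as noted above.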

4 Evaluation

We compare different variants of our model on the Sintel optical flow dataset [3]. In total, the benchmark consists of 1064 training images and 564 test images. For CNN learning we use a subset of 20% of the training images, sampled evenly from all available scenes. For evaluation, we use a subset of 40% of the training images.

Comparison of our models  To investigate the performance of our model, we conduct the following experiments: first, we investigate the influence of the size of the CNN, and second, we investigate the effect of quantizing the learned features. Additionally, we evaluate both the WTA solution (2) and the CRF model (3). To assess the effect of quantization, we evaluate the local flow model a) as it was trained, and b) as QQ, i.e., with quantized descriptors both in the min-projection step and in the outer problem on c^u, c^v, respectively. In CRF inference, the updates (8) and (9) amount to solving a min-projection step with additional cost offsets; the columns F and Q indicate how this min-projection step is computed. The CRF parameters are fixed at α = 8.5 in (3) and τ₁ = 0.25, τ₂ = 25 (Fig. 2(a)) for all experiments, and we run 8 inner and 5 outer iterations.

Table 1 summarizes the comparison of the different variants of our model. We see that the WTA solution of model FQ performs similarly to FF, while being much faster to train and evaluate. In particular, model FQ performs better than QQ, which was trained with the straight-through estimator of the gradient. If we switch to QQ for evaluation, we see a drop in performance for models FF and FQ. This is to be expected, because we now evaluate costs differently than during training. Interestingly, our joint model yields similar performance regardless of whether we use F or Q for computing the costs.

Table 1. Comparison of our models on a representative validation set at scale 1/2. We report the end-point error (EPE) for non-occluded (noc) and all pixels on Sintel clean.

                       Local Flow Model (WTA)                      CRF
Train  #Layers   as trained noc (all)   QQ noc (all)      F noc (all)   Q noc (all)
FF     5         5.25 (10.38)           10.45 (15.67)     1.58 (4.48)   1.64 (4.87)
FF     7         4.72 (10.04)           9.43 (14.93)      1.53 (4.32)   1.61 (4.70)
FF     9         –¹                     –¹                –¹            –¹
FQ     5         6.15 (11.36)           11.43 (16.78)     –²            1.63 (4.62)
FQ     7         5.62 (10.98)           10.15 (15.70)     –²            1.65 (4.62)
FQ     9         5.62 (11.13)           9.87 (15.52)      –²            1.64 (4.69)
QQ     5         same as QQ             9.63 (14.80)      –²            1.72 (4.91)
QQ     7         same as QQ             9.75 (15.23)      –²            1.66 (4.78)
QQ     9         same as QQ             9.72 (15.31)      –²            1.72 (4.85)

¹ Omitted due to very long training time.   ² Not applicable.

Runtime  The main reason for quantizing the descriptors is speed. In CRF inference, we need to compute the min-projection on the 4D cost function twice per outer iteration, see Algorithm 1. We show an exact breakdown of the timings for D = 128 on full resolution images in Table 2, computed on an Intel i7 6700K and an Nvidia Titan X. The column WTA refers to computing the solution of the local model on the cost volumes c^u, c^v, see Eq. (4). Full Model is the CRF inference, see § 3.2. We see that we reach a significant speed-up by using binary descriptors and the Hamming distance for the computationally intensive calculations. For comparison, we also report the runtime of [22], who, at the time of writing, report the fastest execution time on Sintel. We point out that our CRF inference on full resolution images takes about the same time as their method, which

constructs and optimizes the cost function at 1/3 resolution.

Table 2. Timings of the building blocks (seconds).

Method           Feature Extraction   WTA     Full Model
FF               0.04 – 0.08          4.25    24.8
FQ               0.04 – 0.08          1.82    –
QQ               0.04 – 0.08          0.07    3.2
[22] (1/3 res.)  0.02                 0.06    3.4
QQ (1/3 res.)    0.004 – 0.008        0.007   0.32

Table 3. Comparison on the Sintel clean test set.

Method           noc     all
EpicFlow [17]    1.360   4.115
FullFlow [5]     1.296   3.601
FlowFields [1]   1.056   3.748
DCFlow [22]      1.103   3.537
Ours QQ          2.470   8.972

Test performance  We compare our method on the Sintel clean images. In contrast to the other methods, we do not use a sophisticated post-processing pipeline, because the main focus of this work is to show that learning and inference on high resolution images are feasible. Therefore we cannot compete with the highly tuned methods. Fig. 3 shows that we are able to recover fine details, but since we do not employ a forward-backward check and local planar inpainting, we make large errors in occluded regions.

Fig. 3. Sample output of our method. Left figure, top row: WTA solutions of a 7-layer network for FF, FQ and QQ training. Bottom row: results of the same networks with CRF inference. The right part shows the highlighted region enlarged.

5 Conclusion

We showed that both learning and CRF inference of the optical flow cost function on high resolution images are tractable. We circumvent the excessive memory requirements of the full 4D cost volume by a min-projection, which reduces the space complexity from quadratic to linear in the search range. To compute the cost function efficiently, we learn binary descriptors with a new hybrid learning scheme that outperforms the previous state-of-the-art straight-through estimator of the gradient.

Acknowledgements  We acknowledge grant support from Toyota Motor Europe HS, the ERC starting grant HOMOVIS No. 640156 and the research initiative Intelligent Vision Austria with funding from the AIT and the Austrian Federal Ministry of Science, Research and Economy HRSM programme (BGBl. II Nr. 292/2012).


References

1. Bailer, C., Taetz, B., Stricker, D.: Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In: International Conference on Computer Vision (ICCV) (2015)
2. Bengio, Y., Léonard, N., Courville, A.C.: Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432 (2013), http://arxiv.org/abs/1308.3432
3. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European Conference on Computer Vision (ECCV) (2012)
4. Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: Binary robust independent elementary features. In: European Conference on Computer Vision (ECCV) (2010)
5. Chen, Q., Koltun, V.: Full flow: Optical flow estimation by global optimization over regular grids. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
6. Chen, Z., Sun, X., Wang, L., Yu, Y., Huang, C.: A deep visual correspondence embedding model for stereo matching costs. In: International Conference on Computer Vision (ICCV) (2015)
7. Courbariaux, M., Bengio, Y.: BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830 (2016), http://arxiv.org/abs/1602.02830
8. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient belief propagation for early vision. International Journal of Computer Vision 70(1), 41–54 (2006)
9. Gadot, D., Wolf, L.: PatchBatch: A batch augmented loss for optical flow. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
10. Güney, F., Geiger, A.: Deep discrete flow. In: Asian Conference on Computer Vision (ACCV) (2016)
11. Knöbelreiter, P., Reinbacher, C., Shekhovtsov, A., Pock, T.: End-to-end training of hybrid CNN-CRF models for stereo. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017), http://arxiv.org/abs/1611.10229
12. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. Transactions on Pattern Analysis and Machine Intelligence 28(10) (October 2006)
13. Luo, W., Schwing, A., Urtasun, R.: Efficient deep learning for stereo matching. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
14. Ranftl, R., Bredies, K., Pock, T.: Non-local total generalized variation for optical flow estimation. In: European Conference on Computer Vision (ECCV) (2014)
15. Ranftl, R., Gehrig, S., Pock, T., Bischof, H.: Pushing the limits of stereo using variational stereo estimation. In: IEEE Intelligent Vehicles Symposium (IV) (2012)
16. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision (ECCV) (2016)
17. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
18. Shekhovtsov, A., Reinbacher, C., Graber, G., Pock, T.: Solving dense image matching in real-time using discrete-continuous optimization. ArXiv e-prints (Jan 2016)

19. Shekhovtsov, A., Kovtun, I., Hlaváč, V.: Efficient MRF deformation model for non-rigid image matching. Computer Vision and Image Understanding 112 (2008)
20. Trzcinski, T., Christoudias, M., Fua, P., Lepetit, V.: Boosting binary keypoint descriptors. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
21. Wainwright, M., Jaakkola, T., Willsky, A.: MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory 51(11) (November 2005)
22. Xu, J., Ranftl, R., Koltun, V.: Accurate optical flow via direct cost volume processing. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
23. Žbontar, J., LeCun, Y.: Computing the stereo matching cost with a convolutional neural network. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
