Real-time self-adaptive deep stereo
arXiv:1810.05424v1 [cs.CV] 12 Oct 2018
Alessio Tonioni, Fabio Tosi, Matteo Poggi, Stefano Mattoccia and Luigi Di Stefano
Department of Computer Science and Engineering (DISI), University of Bologna, Bologna, Italy
{name.surname}@unibo.it
Abstract
Deep convolutional neural networks trained end-to-end are the undisputed state-of-the-art methods to regress dense disparity maps directly from stereo pairs. However, such methods suffer from notable accuracy drops when exposed to scenarios significantly different from those seen in the training phase (e.g., real vs. synthetic images, indoor vs. outdoor, etc.). As it is unlikely to be able to gather enough samples to achieve effective training/tuning in any target domain, we propose to perform unsupervised and continuous online adaptation of a deep stereo network in order to preserve its accuracy independently of the sensed environment. However, such a strategy can be extremely demanding in terms of computational resources, thus preventing real-time performance. Therefore, we address this side effect by introducing a new lightweight, yet effective, deep stereo architecture, Modularly ADaptive Network (MADNet), and by developing Modular ADaptation (MAD), an algorithm to train independently only sub-portions of our model. By deploying MADNet together with MAD we propose the first-ever real-time self-adaptive deep stereo system.
1. Introduction
Many key tasks related to prominent application areas such as autonomous vehicles and mobile robotics leverage the availability of a dense and reliable 3D reconstruction of the sensed environment. Among depth sensing technologies, passive stereo has proven particularly amenable to the above tasks, being potentially able to fulfill such requirements with low latency, at an affordable cost, in indoor and outdoor environments. Following the groundbreaking work by Mayer et al. [19] that introduced DispNet, current state-of-the-art stereo methods rely on learnable deep convolutional neural networks (CNNs) that take as input a pair of left-right frames and directly regress a dense disparity map. In challenging real-world scenarios, like the popular KITTI benchmarks [7, 21], this strategy turns out to be faster and
more effective than traditional (i.e., not learning-based) algorithms. However, as recently highlighted in [36, 23], deep stereo networks experience a huge performance drop when tested on completely unseen scenarios due to the domain shift between training data (often synthetic [19]) and testing environment. This domain shift can be partially solved by fine-tuning on few annotated samples from the target domain. Yet, obtaining reliable groundtruth data is a challenging task per se, which requires costly active sensors such as LIDARs followed by noise removal through post-processing [38] or manual intervention. Although recent works have proposed to overcome the need for groundtruth [36, 23, 42, 9, 40], all of them assume the availability of stereo pairs from the target domain beforehand, which inherently limits the adaptation ability to the available samples. We agree with [36, 23] about the importance of addressing the domain shift issue. Nonetheless, we cast adaptation as a continuous learning process whereby the stereo network evolves online based on the images gathered by the camera during its real deployment. We believe that the ability to continually adapt itself in real-time is key to any deep learning machinery meant to work in relevant practical scenarios, like autonomous driving, where gathering training samples beforehand from all possible surroundings turns out to be unfeasible. We will show how continuous online adaptation can be achieved by deploying one of the unsupervised losses proposed in [5, 9, 36, 40], computing the loss on the current frames, updating the whole network by back-propagation (from now on shortened as back-prop) and moving to the next pair of input frames. Adapting, however, comes with the side effect of reducing inference speed to roughly 1/3 at best, as we will show experimentally, a price far too high to pay for most modern state-of-the-art deep stereo systems. To keep a high enough frame rate we have developed a novel Modularly ADaptive Network (MADNet), designed to be lightweight, fast and modular. Our architecture exhibits accuracy comparable to DispNetC [19] yet uses 1/10 of the parameters, runs at ∼40 FPS for disparity inference and performs online adaptation of the whole network at ∼15 FPS.
Figure 1. Disparity maps predicted by MADNet on a KITTI sequence [6] at the 0th, 150th and 300th frame. Left images (a), no adaptation (b), online adaptation of the whole network (c), online adaptation by MAD (d).
Moreover, to allow an even higher frame rate during adaptation, at the cost of slightly lower accuracy, we have developed Modular ADaptation (MAD), an algorithm that leverages the modular architecture of MADNet to train sub-portions of the whole network independently. Using MADNet together with MAD we are able to adapt our system without supervision to unseen environments at ∼25 FPS. Figure 1 shows the disparity maps predicted by MADNet on three successive frames of a video from the KITTI dataset [6], either without undergoing any adaptation (b), or by adapting online the whole network (c), or by our computationally efficient MAD approach (d). Thus, (c) and (d) show how online adaptation can improve the quality of the predicted disparity maps significantly in as few as 150 frames, i.e., with a latency of about 10 and 6 seconds for full adaptation and MAD, respectively. Extensive experimental results on the KITTI raw dataset [6] highlight that: i) our network is more accurate than models with similar complexity [16]; ii) it can dramatically boost its accuracy in a few seconds, reaching accuracy comparable to offline fine-tuning on groundtruth. To the best of our knowledge, MADNet and MAD realize the first-ever real-time, self-adapting, deep stereo system.
2. Related Work
We review here the recent literature on deep stereo models, alongside methods concerned with image reconstruction losses for unsupervised learning.
Machine learning for stereo. Early attempts to leverage machine learning for stereo matching concerned the estimation of confidence measures [28], by random forest classifiers [10, 34, 24, 26] and – later – by CNNs [27, 32, 37], typically plugged into conventional pipelines to improve accuracy. CNN-based matching cost functions [39, 4, 18] achieved
state-of-the-art results on both KITTI and Middlebury v3 by replacing conventional cost functions [12] within the SGM pipeline [11]. Eventually, Shaked and Wolf [33] proposed to rely on deep learning both for matching cost computation and disparity selection, while Gidaris and Komodakis [8] did so for refinement. Mayer et al. [19] proposed the first end-to-end stereo architecture. Although not achieving state-of-the-art accuracy, this seminal work turned out to be quite disruptive compared to the traditional stereo paradigm outlined in [31], highlighting the potential for a totally new approach. Thereby, [19] ignited the spread of end-to-end stereo architectures [15, 22, 17, 3, 14] that quickly outmatched any other technique on the KITTI benchmarks by leveraging a peculiar training protocol. In particular, the deep network is initially trained on a large amount of synthetic data with groundtruth labels [19] and then fine-tuned on the target domain (e.g., KITTI) using stereo pairs with groundtruth. All these contributions focused on accuracy; only recently Khamis et al. [16] proposed a deep stereo model with a high enough frame rate for online usage, at the cost of sacrificing accuracy. We will show how in MADNet this tradeoff is far less severe. Unfortunately, all those models are particularly data dependent and their performance decays dramatically when running on environments different from those observed at training time, as shown in [36]. Batsos et al. [2] partially tackled this weakness by designing a novel matching cost obtained by combining traditional functions and confidence measures [13, 28] within a random forest framework, attaining better generalization compared to a CNN-based method [39].
Image reconstruction for unsupervised learning. A recent trend to train deep depth estimation networks in an unsupervised manner relies on image reconstruction losses. In particular, for monocular depth estimation this is achieved by warping different views, coming from stereo
pairs or image sequences, and minimizing the reconstruction error [5, 43, 9, 40, 25, 29]. This principle has also been used for optical flow [20] and stereo [42]. For the latter task, alternative unsupervised learning approaches consist in deploying traditional stereo algorithms and confidence measures [36] or in combining, through iterative optimization, predictions obtained at multiple resolutions [23]. However, we point out that all these works have addressed offline training only, while we propose to solve the very same problem by casting it as online (thus fast) adaptation to unseen environments. Concurrently with our work, Zhong et al. [41] propose to directly train from scratch a (deeper) stereo model on video sequences without supervision; however, their proposal does not focus on real-time performance, requiring 0.8–1.6 seconds for inference on a single frame and several minutes of adaptation to achieve good results. In contrast, we maintain the offline training phase on synthetic data and limit the online training to addressing domain adaptation in the least time possible, making it better suited to adapt in case of sudden context changes.
3. Online Domain Adaptation
Modern machine learning models can suffer from a massive drop in accuracy when tested on data too different from the training one, an issue commonly referred to by the machine learning community as the domain shift problem. A lot of research has been devoted to reducing this loss in accuracy and obtaining more general algorithms. However, the most effective practice still relies on additional offline training on samples from the target environments, so as to obtain a new system configuration specifically adapted to that scenario. As pointed out in [36], the domain shift curse is inherently present in a deep stereo network due to the training procedure being mostly performed on synthetic images quite different from real ones. However, adaptation can be effectively achieved without the need for annotated data by fine-tuning offline with unsupervised losses [5, 9, 36, 40]. We go one step further and argue that the adaptation can be effectively performed online as soon as new frames are available, thus obtaining a deep stereo system able to adapt itself dynamically. Our online adaptation strategy, to be universally deployable, cannot rely on the availability of groundtruth annotations. Thus, we limit ourselves to using one of the proposed unsupervised losses and perform a single training iteration (forward and backward pass) for each incoming stereo pair. Our model is always in training mode and is thus continuously fine-tuned to the actually sensed environment. In Section 6 we will show how, by performing online adaptation, we can achieve results comparable to offline fine-tuning on similar data in as few as ∼100 training iterations, vouching for the effectiveness of our formulation.
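As a concrete illustration, a minimal PyTorch-style sketch of this continuous adaptation loop is shown below; stereo_net, unsupervised_loss and camera_stream are hypothetical placeholders rather than the authors' API, and the loss can be any of the unsupervised objectives in [5, 9, 36, 40] (e.g., the reprojection loss sketched in Section 6.1).

```python
import torch

def online_adaptation(stereo_net, unsupervised_loss, camera_stream, lr=1e-4):
    """Continuously adapt a stereo network, one training iteration per stereo pair.

    stereo_net        -- any differentiable disparity-regression model
    unsupervised_loss -- e.g. a photometric reprojection loss (no groundtruth needed)
    camera_stream     -- iterator yielding (left, right) image tensors
    """
    optimizer = torch.optim.Adam(stereo_net.parameters(), lr=lr)
    stereo_net.train()  # the model is permanently kept in training mode

    for left, right in camera_stream:
        disparity = stereo_net(left, right)            # forward pass on the current frames
        loss = unsupervised_loss(left, right, disparity)

        optimizer.zero_grad()
        loss.backward()                                # single backward pass per frame
        optimizer.step()                               # the model is now slightly adapted

        yield disparity.detach()                       # prediction made before this update
```

In the full back-prop setting every parameter is updated at every frame, which is what cuts throughput to roughly one third; MAD (Section 5) limits each update to a single network portion.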
4. MADNet - Modularly ADaptive Network
We believe that one of the main factors preventing the exploration of online adaptation is the computational cost of performing a full training iteration for each incoming frame. Indeed, we will show experimentally how it roughly corresponds to reducing the inference rate of the system to ∼1/3, a price far too high to pay with most modern architectures. To address specifically this issue, we have developed Modularly ADaptive Network (MADNet), a novel lightweight model for depth estimation inspired by very recent fast, yet accurate, architectures for optical flow [30, 35]. We deploy a pyramidal strategy for dense disparity regression for two key purposes: i) maximizing speed and ii) obtaining a modular architecture, as depicted in Figure 2. The pyramidal towers extract features from the left and right frames through a cascade of independent modules sharing the same weights. Each module consists of convolutional blocks aimed at reducing the input resolution by deploying two 3 × 3 convolutional layers, respectively with stride 2 and 1, followed by Leaky ReLU non-linearities. According to Figure 2, we count 6 blocks providing us with feature representations F from half resolution down to 1/64, namely F1 to F6, respectively. These blocks extract 16, 32, 64, 96, 128 and 192 features. At the lowest resolution (i.e., level 6), we forward the features from the left and right images to a correlation layer [19]. Then, from the raw matching costs, we deploy a disparity decoder D6 consisting of 5 additional 3 × 3 convolutional layers, with 128, 128, 96, 64 and 1 output channels. Again, each layer is followed by Leaky ReLU, except the last one, which provides the disparity map at the lowest resolution. The disparity predicted by D6 is then up-sampled to level 5 by bilinear interpolation and used both for warping the right features towards the left ones before computing correlations and as an input to D5. From this level onwards, the aim of the disparity decoders Dk is to refine and correct the up-sampled disparity coming from the lower resolution, using as guidance the correlation scores obtained by matching convolutional features. All the correlations inside our network are computed along a [-2, 2] range of possible shifts. This process is repeated up to quarter resolution (i.e., level 2), where we add a further refinement module consisting of 3 × 3 dilated convolutions [35], with respectively 128, 128, 128, 96, 64, 32 and 1 output channels and 1, 2, 4, 8, 16, 1 and 1 dilation factors, before bilinearly upsampling to full resolution. Additional details on the MADNet architecture are available in the supplementary material. MADNet has a smaller memory footprint and delivers disparity maps much more rapidly than other end-to-end networks such as [19, 15]. Moreover, if properly trained or adapted, it turns out more accurate than DispNetC. Concerning efficiency, working at decimated resolutions allows for computing correlations on a small horizontal window [35], while warping features and forwarding disparity predictions across the different resolutions makes it possible to maintain a small search range and look for residual displacements only.
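To make the per-level processing concrete, the following is a minimal PyTorch-style sketch of one disparity decoder Dk. It is not the authors' code: the 1D correlation is a deliberately naive stand-in for the DispNetC correlation layer, feature warping is omitted for brevity, and the set of inputs concatenated at each level (costs, left features and the up-sampled disparity) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_1d(left_feat, right_feat, max_disp=2):
    """Naive horizontal correlation over a [-max_disp, max_disp] shift range."""
    costs = []
    for d in range(-max_disp, max_disp + 1):
        shifted = torch.roll(right_feat, shifts=d, dims=3)  # crude border handling
        costs.append((left_feat * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)                          # (B, 2*max_disp+1, H, W)

class DisparityDecoder(nn.Module):
    """One decoder D_k: five 3x3 convs (128, 128, 96, 64, 1), Leaky ReLU except the last."""
    def __init__(self, in_channels):
        super().__init__()
        # in_channels must match the concatenation built in forward() (it differs at level 6)
        layers, prev = [], in_channels
        for i, c in enumerate((128, 128, 96, 64, 1)):
            layers.append(nn.Conv2d(prev, c, 3, padding=1))
            if i < 4:
                layers.append(nn.LeakyReLU(0.2))
            prev = c
        self.net = nn.Sequential(*layers)

    def forward(self, left_feat, right_feat, coarser_disp=None):
        if coarser_disp is not None:
            # up-sample the coarser disparity (doubling its values, assuming pixel units at the
            # current scale); the real network also warps right_feat by it before correlating
            up = 2.0 * F.interpolate(coarser_disp, scale_factor=2,
                                     mode='bilinear', align_corners=False)
            cost = correlation_1d(left_feat, right_feat)
            x = torch.cat([cost, left_feat, up], dim=1)
        else:                                               # level 6: no prior disparity
            x = torch.cat([correlation_1d(left_feat, right_feat), left_feat], dim=1)
        return self.net(x)
```

The same decoder structure is instantiated at every pyramid level, which is what enables the per-resolution grouping exploited by MAD in Section 5.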
Figure 2. Schematic description of the proposed MADNet architecture (a); each circle between an Fk and the corresponding Dk represents a warp and correlation layer (b).
For instance, a search window of ±2 at the lowest resolution corresponds to a displacement of 128 at full resolution, scaling up to a maximum possible range of 248 by summing the contributions from level 6 to level 2, which ends up being a broader search range than DispNetC's (i.e., 40 at quarter resolution, for a maximum range of 160) at a much lower computational cost. With a 1080 Ti GPU, the proposed lightweight architecture runs at about 40 FPS for inference and at 15 FPS when deploying online adaptation with full back-prop at the full KITTI resolution.
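A quick sanity check of the search-range figures quoted above (the ±2 window and the level-6-to-2 structure come from the text):

```python
# Full-resolution disparity range accumulated from level 6 (1/64) down to level 2 (1/4),
# assuming a +/-2 correlation window at every level, as stated above.
window = 2
scales = [64, 32, 16, 8, 4]                # downsampling factors at levels 6, 5, 4, 3, 2
per_level = [window * s for s in scales]   # contribution of each level in full-res pixels
print(per_level, sum(per_level))           # [128, 64, 32, 16, 8] 248
```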
5. MAD - Modular ADaptation
As we will show, MADNet is remarkably accurate while performing the online adaptation of the whole network at 15 FPS. However, for some applications it might be desirable to have a higher frame rate without losing the ability to tune the network automatically to unseen environments. Most of the time needed to perform online adaptation is spent executing back-prop and weight updates across all the network layers. A naive way to speed up the process would be to freeze the initial part of the network and fine-tune only a subset of k final layers, thus realizing a shorter back-prop that would yield a higher frame rate. However, there is no guarantee that these last k layers are indeed those that would benefit most from online fine-tuning. One might reasonably guess that other portions of the network should be adapted as well, in particular the initial layers, which directly interact with the images from a new, unseen domain. Indeed, we will show in subsection 6.4 that training only the final layers is not enough for handling the drastic domain changes that typically occur in practical applications. Thus, following the key intuition that to keep up with fast inference we should pursue a partial, though effective, online adaptation, we developed Modular ADaptation
(MAD), an approximate online adaptation algorithm tailored to MADNet but extensible to any multi-scale inference network. The key intuition behind our method is to take a network N and subdivide it into p non-overlapping portions, each referred to as n, such that N = [n1, n2, ..., np]. Each portion ni consists of k connected layers, with the last (deepest) one able to output a disparity estimation yi. Given an input pair of stereo frames x, N outputs p + 1 disparity estimates [y, y1, ..., yp] = forward(N, x), with y being the final prediction of the model at full resolution. Due to its modular structure, MADNet naturally allows such a subdivision by assigning to the same ni the layers working at the same resolution, both in the convolutional towers and among the disparity estimators (e.g., referring to Figure 2, ni = [Fi, Di]). Each ni can be independently trained by computing a loss directly on the disparity yi and back-propagating only across the layers belonging to ni, thereby skipping the costly backward pass through the lower-resolution layers in between. This process can be seen as an approximation of the standard full back-prop: by alternating the portion ni being trained, all layers are updated over time. Such a strategy drastically reduces the time needed to compute and apply gradients without sacrificing too much accuracy, as confirmed by our experimental results. Offline, we statically subdivide the network into different portions, which with MADNet correspond to the different levels of the pyramid. Online, for each incoming stereo pair, we first perform a forward pass to obtain all the disparity predictions at multiple scales [y, y1, ..., yp]; then we choose a portion θ ∈ [1, ..., p] of the network to train according to some heuristic and update the weights of the layers belonging to nθ according to a loss computed on yθ. We consider a valid heuristic any function that outputs a probability distribution over the p trainable portions of N from which we can sample. Among different functions, we obtained effective results using a simple reward/punishment
mechanism detailed in Algorithm 1.

Algorithm 1 - Independent online training of MADNet
1: Require: N = [n1, ..., np]
2: H = [h1, ..., hp] ← 0
3: while not stop do
4:   x ← readFrames()
5:   [y, y1, ..., yp] ← forward(N, x)
6:   Lt ← loss(x, y)
7:   θ ← sample(softmax(H))
8:   Lt^θ ← loss(x, yθ)
9:   updateWeights(Lt^θ, nθ)
10:  if firstFrame then
11:    Lt−2 ← Lt, Lt−1 ← Lt, θt−1 ← θ
12:  end if
13:  Lexp ← 2 · Lt−1 − Lt−2
14:  γ ← Lexp − Lt
15:  H ← 0.99 · H
16:  H[θt−1] ← H[θt−1] + 0.01 · γ
17:  θt−1 ← θ, Lt−2 ← Lt−1, Lt−1 ← Lt
18: end while

We start by creating a histogram H with p bins, all initialized to 0. For each stereo pair we perform a forward pass to get the disparity predictions (line 5) and measure the performance of the model by computing the loss Lt using the full-resolution disparity y and, potentially, the input frames x, e.g., the reprojection error between left and right frames [9] (line 6). Then we sample the portion to train, θ ∈ [1, ..., p], according to the distribution softmax(H) (line 7) and compute one optimization step for the layers of nθ with respect to the loss Lt^θ computed on the lower-scale prediction yθ (lines 8-9). We have now partially adapted the network to the current environment. Next, we update H, increasing the probability of being sampled for the portions ni that have proven effective. To do so, we compute a noisy expected value for Lt by linear extrapolation of the losses at the previous two time steps: Lexp = 2 · Lt−1 − Lt−2 (line 13). By comparing it with the measured Lt we can assess the impact of the network portion sampled at the previous step (θt−1) as γ = Lexp − Lt, and then increase or decrease its sampling probability accordingly (i.e., if the adaptation was effective, Lexp > Lt and thus γ > 0). We found that adding a temporal decay to H helps increase the stability of the system, so the final update rule at each step is: H ← 0.99 · H, H[θt−1] += 0.01 · γ (lines 15 and 16). We will show how, with the proposed MAD, we can adapt MADNet online at ∼25 FPS without losing much accuracy.
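A compact Python rendition of Algorithm 1 follows. It is a sketch under our own naming (portions, forward_fn, loss_fn and camera_stream are assumptions, not the authors' API) and, unlike the real implementation, it only restricts the parameter update to the sampled portion rather than truncating the backward pass at its boundary.

```python
import math
import random
import torch

def mad_adaptation(portions, forward_fn, loss_fn, camera_stream, lr=1e-4,
                   decay=0.99, gain=0.01):
    """Sketch of Algorithm 1: adapt one network portion per incoming stereo pair.

    portions   -- list of parameter groups n_1..n_p (one per resolution)
    forward_fn -- returns [y, y_1, ..., y_p]: full-resolution and per-portion disparities
    loss_fn    -- unsupervised loss, e.g. photometric reprojection
    """
    optimizers = [torch.optim.Adam(p, lr=lr) for p in portions]
    H = [0.0] * len(portions)                        # reward histogram
    L_t1 = L_t2 = None                               # losses at t-1 and t-2
    prev = None                                      # portion trained at t-1

    for left, right in camera_stream:
        y_all = forward_fn(left, right)              # [y, y_1, ..., y_p]
        L_t = loss_fn(left, right, y_all[0]).item()  # full-resolution loss (monitoring)

        # sample a portion according to softmax(H)
        exps = [math.exp(h - max(H)) for h in H]
        probs = [e / sum(exps) for e in exps]
        theta = random.choices(range(len(portions)), weights=probs)[0]

        # one optimization step restricted to portion theta, on its own disparity
        loss_theta = loss_fn(left, right, y_all[theta + 1])
        for opt in optimizers:
            opt.zero_grad()                          # avoid accumulating stale gradients
        loss_theta.backward()
        optimizers[theta].step()

        if L_t1 is None:                             # first frame: bootstrap the history
            L_t1 = L_t2 = L_t
            prev = theta
        L_exp = 2 * L_t1 - L_t2                      # linear extrapolation of the loss
        gamma = L_exp - L_t                          # > 0 if the previous update helped
        H = [decay * h for h in H]
        H[prev] += gain * gamma
        prev, L_t2, L_t1 = theta, L_t1, L_t

    return H
```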
6. Experimental Results
6.1. Evaluation Protocol and Implementation
To properly address practical deployment scenarios in which there are no groundtruth data available for fine-tuning
in the actual testing environments, we train our stereo network using only synthetic data [19]. More details regarding the training procedure are available in the supplementary material. To test online adaptation we use those weights as a common initialization and carry out an extensive evaluation on the large and heterogeneous KITTI multi-view raw dataset [6], with depth labels [38] converted into disparities using the known camera parameters. Overall, we assess the effectiveness of our proposal on 43k images. Specifically, according to the KITTI classification, we evaluate our framework in four heterogeneous environments, namely Road, Residential, Campus and City, obtained by concatenating the available video sequences, resulting in 5674, 28067, 1149 and 8027 frames respectively. Although these sequences are all concerned with driving scenarios, each has peculiar traits that would lead a deep stereo model to gross errors without suitable fine-tuning in the target domain. For example, City and Residential often depict roads surrounded by buildings, while Road mostly concerns highways and country roads, where the most common objects are cars and vegetation. By processing stereo pairs within sequences, we can measure how well the network adapts, either with full back-prop or with MAD, to the target domain compared to an offline-trained model. For all experiments, we analyze both the average End Point Error (EPE) and the percentage of pixels with disparity error larger than 3 (D1-all). Since the image format differs across sequences, we extract a central crop of size 320 × 1216 from each frame, which suits the downsampling factor of our architecture and allows for validating almost all pixels with available groundtruth disparities. Finally, we highlight that, for both full back-prop and MAD, we compute the error rate on each frame before applying the model adaptation step. That is, we measure the performance achieved by the current model on the stereo frame at time t and then adapt it according to the current prediction. Therefore, the model update carried out at time t will affect the predictions only from frame t + 1 onwards. As unsupervised loss for online adaptation we rely on the photometric consistency between the left frame and the right one reprojected according to the predicted disparity. Following [9], to compute the reprojection error between the two images we combine the Structural Similarity Measure (SSIM) and the L1 distance, weighted by 0.85 and 0.15, respectively. We selected this unsupervised loss function as it is the fastest to compute among those proposed in the literature [36, 23, 42] and does not require additional information besides a pair of stereo frames. Further details are available in the supplementary material.
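The adaptation loss can be sketched as follows; this is a minimal PyTorch-style implementation of the SSIM + L1 reprojection error in the spirit of [9], not the authors' code, and the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    """Reproject the right image onto the left view using the predicted disparity (pixels)."""
    b, _, h, w = right.shape
    xs = torch.linspace(-1, 1, w, device=right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=right.device).view(1, h, 1).expand(b, h, w)
    # convert the horizontal shift from pixels to normalized [-1, 1] coordinates
    xs = xs - 2.0 * disparity.squeeze(1) / (w - 1)
    grid = torch.stack([xs, ys], dim=3)
    return F.grid_sample(right, grid, mode='bilinear', padding_mode='border',
                         align_corners=True)

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM dissimilarity on 3x3 patches, returned as (1 - SSIM) / 2 in [0, 1]."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(left, right, disparity, alpha=0.85):
    """Reprojection loss: 0.85 * SSIM term + 0.15 * L1 term, as described above."""
    warped = warp_right_to_left(right, disparity)
    return (alpha * ssim(left, warped) + (1 - alpha) * (left - warped).abs()).mean()
```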
6.2. MADNet performance
Before assessing the performance obtainable through online adaptation, we test the effectiveness of MADNet by following the commonly used two-phase training on synthetic [19] and real data.
Figure 3. Qualitative comparison between disparity maps from different architectures. From left to right, reference image from the KITTI 2015 online benchmark and disparity maps by DispNetC [19], StereoNet [16] and MADNet, with per-image D1-all scores of 4.70, 3.75 and 2.70, respectively.
Model            D1-all   Runtime (s)
DispNetC [19]    4.34     0.06
StereoNet [16]   4.83     0.02
MADNet (Ours)    4.66     0.02
Table 1. Comparison between fast disparity regression architectures on the KITTI 2015 test set without online adaptation.
Starting from a network trained on synthetic data according to the protocol mentioned above, we perform fine-tuning on the training sets of KITTI 2012 and KITTI 2015 and submit to the KITTI 2015 online benchmark. Additional details on the fine-tuning protocol are provided in the supplementary material. In Table 1 we report our result compared to other (published) fast inference architectures on the leaderboard. At the time of writing, our method ranks 78th. Despite the mid-rank achieved in terms of absolute accuracy, MADNet compares favorably to StereoNet [16], ranked 80th, the only other high-frame-rate proposal on the KITTI leaderboard. Moreover, we get close to the performance of the original DispNetC [19] while using 1/10 of the parameters and running at more than twice the speed. Figure 3 also reports a qualitative comparison between the outputs of the three architectures, showing how MADNet better preserves thin structures compared to StereoNet.
6.3. Online Adaptation
We now show how our online adaptation scheme is as effective as offline fine-tuning in recovering good performance in unseen environments. In Table 2, we compare the performance yielded by MADNet when deploying models trained on synthetic data only [19] or adapted online, either by full back-prop (which cuts speed down) or by MAD (aimed at efficient adaptation), on the different KITTI environments. The three setups are marked in the Adapt. column of Table 2 as No, Full and MAD, respectively. Since Algorithm 1 has a non-deterministic sampling step, we ran the MAD tests 5 times each and report the average performance; we refer the reader to subsection 6.4 for an analysis of the standard deviation across runs. Although we primarily address scenarios in which no groundtruth data are available beforehand, we also report, as a baseline, the results attained by the model fine-tuned offline on groundtruth labels as described in subsection 6.2 (those results are tagged with the suffix -GT after the model name). To assess the effectiveness of online adaptation for deep stereo models more broadly,
we also report the performance obtained by our implementation of the widely used DispNetC [19] architecture, trained on synthetic data following the authors' guidelines. As for DispNetC, focusing on the D1-all metric, we can notice how running full back-prop online to adapt the network decimates the number of outliers in all scenarios compared to the model trained on the synthetic dataset only. In particular, this approach consistently halves D1-all on Campus, Residential and City and nearly reduces it to one third on Road. Likewise, the average EPE drops significantly across the four considered environments, with a relative improvement as high as nearly 40% on the Road sequences. These massive gains in accuracy, though, come at the price of slowing the network down to about one third of the original inference rate, i.e., from nearly 16 to 5.22 FPS. Due to the much higher errors yielded by the model trained on synthetic data only, full online adaptation turns out even more beneficial for MADNet, leading to a model which is more accurate than DispNetC with Full adaptation on all sequences but Campus and which runs nearly three times faster (i.e., at 14.26 FPS compared to the 5.22 FPS of DispNetC-Full). Moreover, MADNet employing MAD for adaptation allows for effectively and efficiently improving the model online by updating only a portion of the whole network at each inference step. Indeed, MAD provides a significant improvement in all the performance figures reported in the table compared to the corresponding models trained on synthetic data only. Using MAD, MADNet can be adapted while paying a relatively small computational overhead, which results in a remarkably fast inference rate of about 25 FPS. Overall, these results highlight how, whenever one has no access to training data from the target domain beforehand, online adaptation is feasible and definitely worthwhile. Moreover, if frame rate is a concern, MADNet combined with MAD provides quite a favorable trade-off between accuracy and speed. As mentioned above, Table 2 also reports the performance of the models fine-tuned offline on the 400 stereo pairs with groundtruth disparities from the KITTI 2012 and 2015 training sets [21, 7]. It is worth pointing out how, with both architectures, online unsupervised adaptation by full back-prop turns out to be competitive with respect to offline fine-tuning on groundtruth, and even more accurate in the Residential environment. This fact may hint at unsupervised training on a larger amount of data possibly delivering better models than supervision on fewer data.
Model         Adapt.   City             Residential      Campus           Road             FPS
                       D1-all   EPE     D1-all   EPE     D1-all   EPE     D1-all   EPE
DispNetC      No       8.31     1.49    8.72     1.55    15.63    2.14    10.76    1.75    15.85
DispNetC      Full     4.34     1.16    3.60     1.04    8.66     1.53    3.83     1.08    5.22
DispNetC-GT   No       3.78     1.19    4.71     1.23    8.42     1.62    3.25     1.07    15.85
MADNet        No       37.42    9.96    37.41    11.34   51.98    11.94   47.45    15.71   39.48
MADNet        Full     2.63     1.03    2.44     0.96    8.91     1.76    2.33     1.03    14.26
MADNet        MAD      5.82     1.51    3.96     1.31    23.40    4.89    7.02     2.03    25.43
MADNet-GT     No       2.21     0.80    2.80     0.91    6.77     1.32    1.75     0.83    39.48
Table 2. Performance on the City, Residential, Campus and Road sequences from KITTI [6], for both DispNetC and MADNet with and without online adaptation. D1-all in %, EPE in pixels.
These results also highlight the inherent effectiveness of the proposed MADNet. Indeed, as vouched by the rows dealing with MADNet-GT and DispNetC-GT, using our implementations of both and training them following the same standard procedure in the field, MADNet yields better accuracy than DispNetC while running about 2.5 times faster. Table 2 also shows how adapted models perform significantly worse on Campus with respect to the other sequences. We ascribe this mainly to Campus featuring much fewer frames (1149) compared to the other sequences (5674, 28067, 8027), which implies a correspondingly lower number of adaptation steps being executed online. Indeed, a key trait of online adaptation is the capability to improve performance as more and more frames are sensed from the environment. This favourable behaviour, not captured by the average error metrics reported in Table 2, is highlighted in Figure 4, which plots the D1-all error rate over time for MADNet models in the four modalities. While without adaptation the error remains consistently large, models adapted online clearly improve over time such that, after a certain delay, they become as accurate as the model that could have been obtained by offline fine-tuning had groundtruth disparities been available. In particular, full online adaptation achieves performance comparable to fine-tuning on groundtruth after about 900 frames (i.e., about 1 minute), while MAD takes about 1600 frames (i.e., 64 seconds) to reach an almost equivalent performance level while providing a substantially higher inference rate (∼25 vs. ∼15 FPS). As Figure 4 hints at online adaptation delivering better performance when processing a higher number of frames, in Table 3 we report additional results obtained by concatenating the four sequences without network resets, so as to simulate a stereo camera traveling across different environments for ∼43000 frames. Firstly, Table 3 shows how both DispNetC and MADNet models adapted online by full back-prop yield much smaller average errors than in Table 2, small enough, indeed, to outperform the corresponding models fine-tuned offline on groundtruth labels. Hence, performing online adaptation through long enough sequences, even across different environments, can lead to more accurate models than offline fine-tuning on a few samples with groundtruth, which further highlights the great potential of our proposed continuous learning formulation.
Model         Adapt.   D1-all(%)   EPE     FPS
DispNetC      No       9.09        1.58    15.85
DispNetC      Full     3.45        1.04    5.22
DispNetC-GT   No       4.40        1.21    15.85
MADNet        No       38.84       11.65   39.48
MADNet        Full     2.17        0.91    14.26
MADNet        MAD      3.37        1.11    25.43
MADNet-GT     No       2.67        0.89    39.48
Table 3. Results on the KITTI raw dataset [6] (Campus → City → Residential → Road).
Adaptation Mode   D1-all(%)      EPE            FPS
No                38.84          11.65          39.48
Final-1           38.33          11.45          38.25
Final-7           31.89          6.55           29.82
Final-13          18.84          2.87           25.85
MAD-SEQ           3.62           1.15           25.74
MAD-RAND          3.56 (±0.05)   1.13 (±0.01)   25.77
MAD-FULL          3.37 (±0.1)    1.11 (±0.01)   25.43
Table 4. Results on the KITTI raw dataset [6] using MADNet trained on synthetic data and different fast adaptation strategies.
Moreover, when leveraging MAD for the sake of run-time efficiency, MADNet attains larger accuracy gains through continuous learning than before (Table 3 vs. Table 2), shrinking the performance gap between MAD and full back-prop. We believe this observation confirms the results plotted in Figure 4: MAD needs more frames to adapt the network to a new environment but, given long enough sequences, it can successfully approximate the results obtainable by the costly online adaptation with full back-propagation (i.e., 0.20 EPE and 1.2 D1-all difference between the two adaptation modalities in Table 3). Besides Figure 1, we report qualitative results in the supplementary material as two video sequences regarding outdoor [6] and indoor [1] environments.
Figure 4. MADNet: D1-all error across frames in the 2011_09_30_drive_0028_sync sequence (KITTI dataset, Residential environment), comparing No Adaptation, GT Tuned, Full Adaptation and MAD over the first 2000 frames.
6.4. Different online adaptation strategies
To assess the effectiveness of MAD for fast online adaptation, we carried out additional tests on the whole KITTI raw dataset [6] and compared the performance obtainable by deploying different fast adaptation strategies for MADNet. Results are reported in Table 4, together with those concerning a network that does not perform any adaptation. First, we compare MAD against keeping the weights of the initial portion of the network frozen and training only the final K layers (Final-K). Using the notation of Figure 2, Final-1 corresponds to training only the last prediction layer, Final-7 to fine-tuning the last Refinement block, and Final-13 to training Refinement and D2. Then, since MAD consists in splitting the network into independent portions and choosing which one to train using Algorithm 1, we compare our full proposal (MAD-FULL) against keeping the split into independent portions but choosing which one to train at each step either randomly (MAD-RAND) or with a round-robin schedule (MAD-SEQ). Since MAD-FULL and MAD-RAND feature non-deterministic sampling steps, we report their average performance across 5 independent runs on the whole dataset, with the corresponding standard deviations between brackets. By comparing the Final entries with those featuring MAD we can clearly see how training only the final layers is not enough to successfully perform online adaptation. Even training as many as the 13 last layers (at a computational cost comparable to our proposed MAD), we are at most able to halve the initial error rate, with performance still far from optimal. The three variants of MAD, by training the whole network, successfully reduce D1-all to 1/10 of the original value. Among the three options, our proposed layer selection heuristic provides the best overall performance, even taking into account the slightly higher standard deviation caused by our sampling strategy. Moreover, the computational cost of deploying our heuristic is negligible, losing only 0.3 FPS compared to the other two options.
6.5. Deployment on embedded platforms
All the tests reported so far have been executed on a desktop PC equipped with an Nvidia 1080 Ti GPU. Unfortunately, for many applications like robotics or autonomous vehicles it is unrealistic to rely on such high-end and power-hungry hardware. One of the key benefits of MADNet, however, is its lightweight architecture, which allows an easy deployment on lower-power platforms. Thus, we benchmarked the same implementation of MADNet used for the desktop tests by deploying it on an Nvidia Jetson TX2 board and measuring the execution time for producing disparity estimations on a stereo pair at the full KITTI resolution. Using this setup, we measured an average execution time for MADNet of 0.26s. For comparison, we have re-implemented StereoNet [16] in the same framework and measured its execution time deploying just 1 refinement module, which results in 0.76s, or 3 of them, which results in 0.96s. Indeed, these results show that MADNet can run faster and more accurately than competing methods, making it an appealing alternative even for embedded applications.
7. Conclusion and Future Work
We have shown how the proposed online unsupervised fine-tuning can successfully tackle domain adaptation for deep end-to-end disparity regression networks. We believe this to be key to the practical deployment of these potentially ground-breaking deep learning systems in many relevant scenarios. For applications in which inference time is a critical issue we have developed MADNet, a new network architecture, and MAD, a strategy to adapt it online very efficiently. We have shown how MADNet together with MAD can adapt to new environments while keeping a high prediction frame rate (i.e., ∼25 FPS) and yielding better accuracy than popular alternatives like DispNetC. As the main topic for future work, we plan to test and eventually extend MAD to any end-to-end stereo system. We would also like to investigate alternative methods to select the portion of the network to be updated online at each step.
References
[1] H. Alismail, B. Browning, and M. B. Dias. Evaluating pose estimation methods for stereo visual odometry on robots. In the 11th International Conference on Intelligent Autonomous Systems (IAS-11), 2011.
[2] K. Batsos, C. Cai, and P. Mordohai. CBMV: A coalesced bidirectional matching volume for disparity estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[4] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang. A deep visual correspondence embedding model for stereo matching costs. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[5] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision (ECCV), pages 740-756. Springer, 2016.
[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), pages 3354-3361. IEEE, 2012.
[8] S. Gidaris and N. Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[9] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[10] R. Haeusler, R. Nair, and D. Kondermann. Ensemble learning for confidence measures in stereo vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 305-312, 2013.
[11] H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition (CVPR), volume 2, pages 807-814. IEEE, 2005.
[12] H. Hirschmuller and D. Scharstein. Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:1582-1599, 2008.
[13] X. Hu and P. Mordohai. A quantitative evaluation of confidence measures for stereo vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 2121-2133, 2012.
[14] Z. Jie, P. Wang, Y. Ling, B. Zhao, Y. Wei, J. Feng, and W. Liu. Left-right comparative recurrent model for stereo matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[15] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[16] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin, and S. Izadi. StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction. In 15th European Conference on Computer Vision (ECCV), 2018.
[17] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang. Learning for disparity estimation through feature constancy. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[18] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695-5703, 2016.
[19] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[20] S. Meister, J. Hur, and S. Roth. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In AAAI, New Orleans, Louisiana, February 2018.
[21] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[22] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[23] J. Pang, W. Sun, C. Yang, J. Ren, R. Xiao, J. Zeng, and L. Lin. Zoom and learn: Generalizing deep stereo matching to novel domains. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[24] M. G. Park and K. J. Yoon. Leveraging stereo matching with learning-based confidence measures. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[25] M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia. Towards real-time unsupervised monocular depth estimation on CPU. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
[26] M. Poggi and S. Mattoccia. Learning a general-purpose confidence measure based on O(1) features and a smarter aggregation strategy for semi global matching. In Proceedings of the 4th International Conference on 3D Vision (3DV), 2016.
[27] M. Poggi and S. Mattoccia. Learning from scratch a confidence measure. In Proceedings of the 27th British Machine Vision Conference (BMVC), 2016.
[28] M. Poggi, F. Tosi, and S. Mattoccia. Quantitative evaluation of confidence measures in a machine learning world. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[29] M. Poggi, F. Tosi, and S. Mattoccia. Learning monocular depth estimation with unsupervised trinocular assumptions. In 6th International Conference on 3D Vision (3DV), 2018.
[30] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[31] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7-42, 2002.
[32] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In British Machine Vision Conference (BMVC), 2016.
[33] A. Shaked and L. Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[34] A. Spyropoulos, N. Komodakis, and P. Mordohai. Learning to detect ground control points for improving the accuracy of stereo matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1621-1628. IEEE, 2014.
[35] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[36] A. Tonioni, M. Poggi, S. Mattoccia, and L. Di Stefano. Unsupervised adaptation for deep stereo. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[37] F. Tosi, M. Poggi, A. Benincasa, and S. Mattoccia. Beyond local reasoning for stereo confidence estimation with deep learning. In 15th European Conference on Computer Vision (ECCV), September 2018.
[38] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. In International Conference on 3D Vision (3DV), 2017.
[39] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17:1-32, 2016.
[40] Y. Zhang, S. Khamis, C. Rhemann, J. Valentin, A. Kowdle, V. Tankovich, M. Schoenberg, S. Izadi, T. Funkhouser, and S. Fanello. ActiveStereoNet: End-to-end self-supervised learning for active stereo systems. In 15th European Conference on Computer Vision (ECCV), 2018.
[41] Y. Zhong, H. Li, and Y. Dai. Open-world stereo video matching with deep RNN. In 15th European Conference on Computer Vision (ECCV), 2018.
[42] C. Zhou, H. Zhang, X. Shen, and J. Jia. Unsupervised learning of stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1567-1575, 2017.
[43] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Supplementary material for ”Real-time self-adaptive deep stereo”
1. Detailed MADNet structure
We report here a detailed description of the three modules that compose MADNet. Using the notation introduced in Section 4 of the main paper, we start from Figure 1 by describing in detail our pyramidal convolutional feature extractor. MADNet deploys two of them with shared parameters to extract features independently from the left and right frames (green and red pyramids in Figure 2a of the main paper). For each of the 6 resolutions considered in MADNet we build one disparity estimation module, as described in Figure 2, starting from the corresponding extracted features. At the lowest resolution in our network (D6) we skip the warping and upsampling steps (since there is no D7) and directly create a cost volume between F6^l and F6^r. Finally, at the highest resolution considered by our model (D2) we deploy a residual refinement module, depicted in Figure 3, that uses atrous convolutions and residual connections to obtain the final disparity estimation at the same resolution as its input. To recover full resolution, we upsample the output of the refinement module using bilinear interpolation.
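For reference, a minimal PyTorch-style sketch of the feature pyramid described here and in Section 4 of the main paper is given below; it follows the kernel sizes, strides and channel counts of Figure 1, but it is not the authors' implementation and the names and defaults (e.g., the Leaky ReLU slope) are ours.

```python
import torch
import torch.nn as nn

class FeaturePyramid(nn.Module):
    """Six-level feature extractor: each block halves resolution with a stride-2 3x3 conv
    followed by a stride-1 3x3 conv, producing F1..F6 at 1/2 .. 1/64 of the input size."""
    def __init__(self, channels=(16, 32, 64, 96, 128, 192)):
        super().__init__()
        blocks, prev = [], 3                       # RGB input
        for c in channels:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(c, c, 3, stride=1, padding=1), nn.LeakyReLU(0.2)))
            prev = c
        self.blocks = nn.ModuleList(blocks)

    def forward(self, image):
        features, x = [], image
        for block in self.blocks:                  # the same shared tower runs on both frames
            x = block(x)
            features.append(x)
        return features                            # [F1, ..., F6]
```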
Figure 1. Detailed structure of our convolutional feature extractor: starting from the H × W × 3 input image, six blocks of paired 3 × 3 convolutions (stride 2 followed by stride 1) produce F1 to F6 with 16, 32, 64, 96, 128 and 192 channels, at resolutions from H/2 × W/2 down to H/64 × W/64. For each convolutional layer we report the kernel dimensions and the stride as s followed by the stride value.

2. Implementation and training details for MADNet and MAD
We resume here the implementation details regarding how we initially trained MADNet on synthetic data, how we split the network into independent portions and how we compute the self-supervised loss to train them online. Regarding the initial training of MADNet, we perform 1,200,000 training iterations on the FlyingThings3D dataset using the Adam optimizer and a learning rate of 0.0001, halved after 400k steps and then every further 200k steps until convergence. As loss function, we compute the sum of per-pixel absolute errors between the disparity maps estimated at each resolution and the downsampled groundtruth labels, summing up the contribution of each resolution according to different weights, respectively 0.005, 0.01, 0.02, 0.08 and 0.32 from F1 to F6, following [5]. For the additional fine-tuning on the KITTI 2012 and 2015 training sets used in Section 6.2 of the submitted paper, we performed 500k optimization steps, computing the loss only on the full-resolution disparity map with a weight of 0.001. All the other hyperparameters are kept as in the synthetic training.
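As a reference, the multi-scale supervised loss used for this synthetic pre-training could look as follows; this is only a sketch under our assumptions (in particular, how the groundtruth is downsampled and whether its values are rescaled per level is our choice, not a detail stated above), with hypothetical names.

```python
import torch
import torch.nn.functional as F

def multiscale_l1_loss(predictions, gt_disparity,
                       weights=(0.005, 0.01, 0.02, 0.08, 0.32)):
    """Supervised pre-training loss: per-pixel L1 at each pyramid scale, weighted and summed.

    predictions  -- list of predicted disparity maps, one per scale, matched to `weights`
    gt_disparity -- full-resolution groundtruth disparity, shape (B, 1, H, W)
    """
    total = 0.0
    for pred, w in zip(predictions, weights):
        # downsample the groundtruth to the prediction's resolution; here we also rescale
        # its values to the current scale (an assumption, not stated in the text above)
        scale = pred.shape[-1] / gt_disparity.shape[-1]
        gt = F.interpolate(gt_disparity, size=pred.shape[-2:], mode='nearest') * scale
        total = total + w * (pred - gt).abs().mean()
    return total
```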
As reported in the submitted paper, we have grouped the layers (either from the feature extractor or from the disparity estimators) according to the resolution at which they operate. We made this choice because, as clearly shown in Figure 2 of the submitted paper, layers working at the same resolution are directly connected through skip connections and thus allow approximate back-propagation by flowing the gradients only through those connections, without traversing the lower-resolution layers in between. For example, considering the structure of the disparity estimation module depicted in Figure 2, we would back-prop only through Fn^l and Fn^r and not through Dn+1.
By following this strategy, we obtain 5 independently trainable portions by grouping the layers that produce [Dk, Fk] for each of the resolutions from 6 to 3. For the higher-resolution portions (1-2) we collapse together the layers working at half and quarter resolution, since our network stops the disparity estimation at D2. Once the different portions of the network are defined, we can train them independently, since each one produces a disparity estimation amenable to loss computation. For the actual loss computation, we use photometric consistency between the left RGB frame of the stereo pair and the right one reprojected according to the predicted disparity. Following [2], we perform the image reprojection using a fully differentiable bilinear sampler, then compare the two images using a linear combination of SSIM [6], computed on 3 × 3 patches, and L1 distance, weighted 0.85 and 0.15 respectively. We did not use the left-right consistency check proposed in [2], as it would require processing each stereo pair twice, drastically reducing the frame rate of the system (even considering batched execution) without substantially improving the results. Finally, we ran some experiments adding the smoothness term proposed in [2], but we did not obtain a noticeable improvement in performance, so we decided to omit it to keep a simpler formulation. For all our tests, we compute every loss function at full resolution by first upsampling the small-scale predictions using bilinear sampling. All the code used for our experiments will be made available on GitHub once the review process is completed.

Figure 2. Detailed structure of our stereo estimation network at resolution n. We denote with Fn the convolutional features extracted at resolution n, with superscript l for those obtained from the left frame and r for those referring to the right one. For the upsampling block we use standard bilinear upsampling, while for the cost volume creation we use the correlation layer introduced in DispNetC [4] with max displacement 2.

Figure 3. Detailed structure of our residual refinement network. The network takes as input an initial disparity estimation Dn* and the corresponding left convolutional features at the same resolution Fn^l, elaborates them using atrous convolutions (we report the dilation rate as r followed by the rate value) and finally outputs a residual pixel-wise correction.

Figure 4. Sampling frequencies for the different independent portions of MADNet using MAD for fast online adaptation (the plotted frequencies range from 15.67% to 27.13% across the modules working at 1/64 to 1/4 resolution).
3. MAD sampling policy
We are interested in investigating which portions of the network are trained more often by MAD according to the reward/punishment mechanism we designed. To get some insights, we ran MADNet performing adaptation with MAD on the full KITTI raw dataset 5 times and kept track of the number of steps at which each portion was sampled for training. In Figure 4 we report on the y-axis the average sampling frequency for each portion, while on the x-axis we identify each portion by the scale at which it operates. Surprisingly, we can see how the most sampled portion (i.e., the one that according to our heuristic grants the greatest improvement once adapted) is not the last one, working at 1/4 of the input resolution and producing the final predictions, but the middle portion of the network working at 1/16. As pointed out by [3], a good coarse disparity map can be easily upsampled and refined up to a still accurate full-resolution output. This is further confirmed by our analysis, showing how MAD favors the fine-tuning of lower-resolution (i.e., 1/16) modules.
4. Qualitative Results
As further supplementary material, we have uploaded a video showing the effectiveness of our online adaptation formulation. For the first half of the video we have selected a sequence from the KITTI raw dataset belonging to the Residential environment and report the disparity predicted by MADNet in 3 different configurations. The upper right corner shows the predictions when performing only inference at 40 FPS, the lower left fast adaptation using MAD at 25 FPS, and the lower right adaptation of the whole network using full back-prop at 15 FPS. For better visualization, we have synchronized the output of the three networks, so the video does not show real execution times. To give a quantitative measure of the improvement, we have superimposed in the top left corner of each image the D1-all and EPE metrics. For the two adapting networks we also report, between brackets, the gain compared to the non-adapted network. Looking at the video, we can appreciate how our proposed online adaptation of the whole network by full back-prop solves most of the biggest mistakes, halving the initial error in as few as ∼150 frames (i.e., 10s of execution time at 15 FPS). The fast adaptation using MAD requires slightly more frames to achieve comparable improvements (i.e., about ∼400 frames or 16s at 25 FPS), but after that time it benefits from the higher frame rate. Finally, we can see how, going towards the end of the sequence, the gap between the adaptation of the whole network by full back-prop and MAD gets smaller, converging to comparable performance in the final ∼500 frames. The second half of the video concerns performance achieved in
an indoor scenario. For this qualitative evaluation we have taken a sequence from the Wean Hall dataset [1] and show (with the same layout described before) the predictions yielded by MADNet trained only on synthetic data, adapted using MAD, or performing full online adaptation. We can see how both adaptation strategies drastically improve over the non-adapted network configuration. As in the outdoor scenario, full adaptation requires fewer frames to achieve good performance, while MAD needs slightly more. By the end of the video, both networks produce similarly smooth predictions.
References
[1] H. Alismail, B. Browning, and M. B. Dias. Evaluating pose estimation methods for stereo visual odometry on robots. In the 11th International Conference on Intelligent Autonomous Systems (IAS-11), 2011.
[2] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[3] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin, and S. Izadi. StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction. In 15th European Conference on Computer Vision (ECCV), 2018.
[4] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[5] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[6] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, April 2004.