Im2Depth: Scalable Exemplar Based Depth Transfer

Mohammad Haris Baig+, Vignesh Jagadeesh∗, Robinson Piramuthu∗, Anurag Bhardwaj∗, Wei Di∗, Neel Sundaresan∗
∗ eBay Research Labs, San Jose, CA 95125; + Dartmouth College, Hanover, NH 03755
Abstract

The rapid increase in the number of high quality mobile cameras has opened up an array of new problems in mobile vision. Mobile cameras are predominantly monocular and are devoid of any sense of depth, making them heavily reliant on 2D image processing. Understanding the 3D structure of the scenes being imaged can greatly improve the performance of existing vision/graphics techniques. In this regard, the recent availability of large scale RGB-D datasets calls for more effective data driven strategies that leverage the scale of the data. We propose a depth recovery mechanism, "im2depth", that is lightweight enough to run on mobile platforms while leveraging the large scale nature of modern RGB-D datasets. Our key observation is to form a basis (dictionary) over the RGB and depth spaces, and to represent depth maps by a sparse linear combination of weights over dictionary elements. Subsequently, a prediction function is estimated between weight vectors in the RGB and depth spaces to recover depth maps from query images. A final superpixel post processor aligns depth maps with occlusion boundaries, creating physically plausible results. We conclude with thorough experimentation against four state of the art depth recovery algorithms, and observe an improvement of over 6.5 percent in shape recovery and a reduction of over 10 cm in average L1 error.
Figure 1. Our Results: (Top Left) Input Image, (Top Right) Ground Truth Depth Map, (Bottom Left) Depth Map Recovered, (Bottom Right) Depth Map after plane fitting and local smoothing

1. Introduction

Inferring 3D structure from a single 2D image is a challenging problem that is receiving increased attention due to its wide applicability in augmented reality and scene understanding. Humans have the remarkable ability to infer scene structure with one eye closed. This is attributed to knowledge acquired over time about the perceptual organization of objects and scenes, as reported by extensive research in psychology [2, 11]. The capability of machines to replicate this effect would open new avenues and enhance the capabilities of existing computer vision systems. For instance, depth estimation would enable real time measurements/photogrammetry from 2D images without explicit calibration. In contrast, a carefully calibrated camera setup or the acquisition of multiple image snapshots for estimating 3D structure would require users to possess a certain level of expertise. With the ever increasing use of smartphones, the capability of monocular mobile cameras to perceive depth would have significant commercial implications.

In the field of object recognition and scene understanding, side information on depth has been shown to offer a significant performance and reliability boost to existing systems. A contemporary example is the reliable identification of body parts from depth images provided by the Kinect sensor [15], and the recognition of gestures from RGB-D video feeds. The performance of pose estimators and gesture recognizers that use only RGB channels is far below that obtained by including the depth modality. While the number of RGB-D images being acquired is constantly on the increase, the acquisition of RGB images is progressing faster by several orders of magnitude. This leads to an obvious motivating question: is it possible to reliably transfer depth from a large database of RGB-D images to a single query RGB image? Despite the promise offered by monocular depth estimation, there have been serious bottlenecks in the availability of data and in the scalability and accuracy of these techniques, which we now discuss.
Figure 2. Intuition behind Method: Visualization of the 2D manifold of depth images using t-SNE [19]. Each of the smaller tiles making up the visualization is color coded, with blue pixels referring to lower depths and red pixels to higher depths. As can be observed, images with similar global depth profiles are clustered together in 2D space: both RGB pairwise features (left) and sparse positive descriptors on depth (right) are effective at grouping images with similar depth profiles together. Observe that the red and blue boxes in the RGB (left) and depth (right) spaces contain similar clusters of depth profiles. The intuition of our method is to estimate a transformation T that maps points from one space to the other. This also naturally leads to the idea of depth transfer among neighboring images on the manifold, since they share depth profiles that are similar to one another.
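A layout like the one in Figure 2 can be produced with off-the-shelf tools. The following is a minimal, illustrative sketch and not the paper's exact pipeline: it assumes the depth maps are available as NumPy arrays and uses downsampled raw depth values as a stand-in descriptor before running t-SNE.

# Illustrative sketch of a Figure 2 style embedding (descriptor choice is a stand-in).
import numpy as np
from sklearn.manifold import TSNE

def embed_depth_maps(depth_maps, side=16, random_state=0):
    # depth_maps: list of (M, N) arrays of metric depth.
    feats = []
    for d in depth_maps:
        rows = np.linspace(0, d.shape[0] - 1, side).astype(int)
        cols = np.linspace(0, d.shape[1] - 1, side).astype(int)
        feats.append(d[np.ix_(rows, cols)].ravel())  # crude side x side descriptor
    # t-SNE places maps with similar global depth profiles close together in 2D.
    return TSNE(n_components=2, random_state=random_state).fit_transform(np.asarray(feats))

Rendering each depth map as a small color-coded tile at its returned 2D coordinate yields the kind of visualization shown in the figure.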
Issues with Depth Estimation: A challenging problem confronting the depth estimation line of research has been the acquisition of ground truth depth maps. This has been done with range sensors [14] that are fairly expensive and are not deployed on a large scale. Computationally, depth estimation is a very high dimensional structured regression problem from an RGB image comprising N pixels to a depth image comprising N pixels. Solving this problem with an unconstrained functional mapping would lead to serious issues with the stability of the estimated solutions. Fortunately, prior information on the organization of images, such as smoothness, occlusions, and projective transforms, offers meaningful constraints on the mapping.

Case for Data Driven Transfer: Recent advances in the large scale deployment of depth sensing technology have transformed the way we think about the above two problems. Large scale RGB-D datasets have only recently become available [16, 17], facilitated by the use of structured light and time of flight depth sensors. A major advantage of these datasets is that they are ideally suited to data driven techniques, prompting a rethink of the need to learn sophisticated parametric models.

The primary contributions of our work (see Figure 1) are:
• A scalable framework for depth recovery by linearly combining exemplar depth maps
• A novel cross domain mapping between the visual and depth spaces for estimating the weights of this linear combination
• Identification of reliable visual descriptors for depth transfer, and structural post processing for aligning estimated depth maps with image boundaries

2. Related Work
Understanding 3D scene structure has a rich history in computer vision. Traditional techniques have predominantly worked with multiple images to make the problem of 3D recovery well posed. These techniques capture images of the same object from multiple views, leading to N-view reconstruction systems [5, 6]. Another way of making the problem tractable is to utilize frames from a video sequence that capture different viewpoints of an object over time, leading to successful structure from motion and SLAM systems. Further, techniques such as shape from shading [1, 21] are attractive alternatives for reconstructing 3D structure, but are unreliable due to strong modeling assumptions, as pointed out in [14]. Specifically, [1, 21] have not been shown to work on arbitrary scenes, in addition to requiring large compute times and parametric prior assumptions on natural scenes. Active research in the area of single image depth transfer is more recent, and is usually motivated by the type of scene considered (indoor/outdoor) or the type of prediction desired (absolute/relative). We now summarize the key papers in single image depth estimation.

Automatic Photo Pop Up (APP): Hoiem et al. [7, 8] utilize the notion of geometric context to reason about scene layout, dividing the scene into top/left/center/right/bottom regions and thus gathering strong cues for labeling scene depth. Their primary focus was on outdoor scenes, where the sky, ground, and buildings are commonly recurring concepts that correspond to top/bottom/side layout labels.

Make3D: Saxena et al. [14] proposed the Make3D system, which casts the problem of monocular depth estimation as a supervised learning problem. A Markov Random Field is learnt for the mapping between the RGB and depth spaces. The random field model also utilizes priors on the 3D properties of surfaces to refine solutions that may not adhere to prior knowledge of how natural scenes are organized.

Non-Parametric Sampling (NPS): Karsch et al. [9] approach the problem of depth estimation with a framework that retrieves visually similar images from a training dataset and utilizes the retrieved depth maps for the final depth estimate. Retrieval is performed using the GIST descriptor, followed by SIFT flow warping of the retrieved neighbors. Finally, a depth optimization step comprising spatial smoothness and global priors is used to recover depth from the retrieved images.

BU/Google (BG): This is the most recent work [10] in data driven depth transfer. In comparison with NPS, it was shown to achieve higher performance on indoor scenes and lower performance on outdoor scenes. The crux of their method is similar to that of NPS in that they also retrieve visually similar images for depth transfer.
However, instead of utilizing GIST, they utilize HOG descriptors for performing the image retrieval. Secondly, instead of performing a sophisticated depth optimization as in NPS, a simple median operation is performed to fuse the retrieved depth maps. Finally, a cross bilateral filtering step smooths the estimated depth map so that it aligns with image edges, producing a visually pleasing result.

The technique proposed in this work is most closely related to the non-parametric technique of BU/Google. In contrast, our technique retains the essence of a data driven non-parametric approach while compressing the dataset into a compact dictionary. In addition, we utilize a parametric transformation between the RGB and depth dictionaries. The intuitions behind the key components of our approach, which enhance existing techniques, are:
• Richer Representation of the RGB and depth feature spaces in the form of compressed dictionaries, with each data point being represented in terms of dictionary elements.
• Weight Estimation on the data points (exemplar dictionary elements) to be fused when recovering a depth map, as opposed to giving equal weights to all points in a retrieval set.
A visual and intuitive explanation of the method using a low-dimensional embedding of the RGB and depth feature spaces is presented in Figure 2. The structure evident in both feature spaces naturally leads to the idea of richer representations using compressed dictionaries, and to subsequent weight estimation on dictionary elements.

3. Proposed Formulation

Figure 3. Proposed Workflow: In the training stage (left), the RGB and depth components are separately used to create an RGB dictionary Wr and a depth dictionary Wd via clustering. Once the dictionaries have been estimated, the RGB and depth features are projected onto their respective dictionaries Wr and Wd to extract reconstruction coefficients γr and δd. Finally, the transformation T between γr and δd is learnt, concluding the training phase. In the testing phase (right), an RGB image serves as the query. The features extracted from the RGB image are projected onto the RGB dictionary to derive coefficients γr. The transformation T learnt during training predicts a new set of depth reconstruction coefficients δd = T γr. These coefficients, when combined with the depth dictionary, yield the reconstructed depth map for the query RGB image.
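To make the workflow in Figure 3 concrete, the sketch below walks through both stages under simplifying assumptions: global RGB descriptors and flattened depth maps are given as matrices, k-means provides the clustering, least-squares projection yields the coefficients, and ridge regression stands in for the mapping T. These are illustrative choices, not necessarily the exact ones used in our implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def train_im2depth(R, D, k_rgb=64, k_depth=64, alpha=1.0):
    # R: (L, d_r) global RGB descriptors; D: (L, M*N) flattened depth maps.
    Wr = KMeans(n_clusters=k_rgb, n_init=10, random_state=0).fit(R).cluster_centers_
    Wd = KMeans(n_clusters=k_depth, n_init=10, random_state=0).fit(D).cluster_centers_
    gamma_r = R @ np.linalg.pinv(Wr)   # projection: coefficients with gamma_r @ Wr ~ R
    delta_d = D @ np.linalg.pinv(Wd)   # projection: coefficients with delta_d @ Wd ~ D
    T = Ridge(alpha=alpha).fit(gamma_r, delta_d)  # cross-domain map gamma_r -> delta_d
    return Wr, Wd, T

def predict_depth(r_query, Wr, Wd, T):
    # r_query: (d_r,) global descriptor of the query RGB image.
    gamma_r = r_query @ np.linalg.pinv(Wr)   # project onto the RGB dictionary
    delta_d = T.predict(gamma_r[None, :])    # predicted depth coefficients
    return (delta_d @ Wd).ravel()            # linear combination of depth dictionary atoms

The returned vector would be reshaped to M x N; the plane fitting and local smoothing shown in Figure 1 would then be applied as a separate post-processing step.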
Notations and Preliminaries: Let us assume that we are given a set of L RGB images and their corresponding depth maps, denoted by I_train = {R_i ∈ {0, 1, ..., 255}^{M×N×3}, D_i ∈ [0, 10]^{M×N}}_{i=1}^{L}, along with their respective global image descriptors {r_i ∈