Multi Scale CRF Based RGB-D Image Segmentation Using Inter Frames Potentials

Taha Hamedani
Robot Perception Laboratory
Ferdowsi University of Mashhad
Mashhad, Iran
[email protected]

Ahad Harati
Robot Perception Laboratory
Ferdowsi University of Mashhad
Mashhad, Iran
[email protected]

Abstract—This paper proposes a novel multi-scale approach to the energy minimization problem that arises in indoor scene labeling. The principal idea of all multi-scale algorithms is to solve the coarser problems first and use their results to efficiently initialize the finer ones. We propose the use of both color and depth information captured by the Microsoft Kinect sensor. Our energy function is built in the Conditional Random Field (CRF) framework; we add a geometrical constraint to the pairwise potential through a region extraction method based on the edges of both the RGB and range images, and we define cliques between two consecutive frames. We evaluate our method on the challenging NYU v1 dataset. Experimental results show that the proposed method reaches a Hausdorff distance of 2.35 and reduces segmentation time.

Keywords: Microsoft Kinect sensor; RGB-D image segmentation; multi-scale CRF; normal vector.
I. INTRODUCTION

Understanding the geometric structure and semantic labeling of an indoor scene is a fundamental problem in various research fields, including robotics, machine vision, and autonomous navigation. For example, consider a robot in an indoor scene such as Fig. 1(b), whose objective is to interact with the environment in order to move the visible objects to other desired positions. To perform these tasks correctly, the robot needs accurate knowledge of the positions of the objects and of itself; moreover, in order to navigate safely, the robot should perceive the geometrical structure of the indoor scene, such as the free spaces and the supporting regions of visible objects [1]. Hence, we need an efficient image segmentation and scene labeling method to recognize the main structures and segments of the indoor scene, such as walls, desks, furniture, and ceilings. At the same time, the method should be able to recognize small visible objects, such as the coffee mug, book, and box seen on the left side of Fig. 1(b). Motivated by mobile robot scenarios, we introduce a novel multi-scale method that speeds up image segmentation while maintaining acceptable performance for autonomous navigation applications.

Fig. 1. (a) Eddy robot with Kinect sensor. (b) Our indoor environment containing small objects. The robot's goal in this setting is to recognize the objects and move them to new positions.

On the other hand, semantic segmentation of 3D point clouds captured from indoor scenes has proven to be a challenging task [2]. The difficulty stems from the large variety of object and indoor scene types and their varying illumination. Current methods for 3D scene labeling usually take more than one minute per image [5, 15]. In this paper, we use a modified multi-scale model to deal with this time complexity; in addition, we seek to improve performance by exploiting information from consecutive frames of a specific scene [12].

The rest of this paper is organized as follows: Section II is devoted to related work and reviews important papers in the field of 3D scene segmentation and labeling. In Section III, we briefly introduce the CRF inference approach used to assign an optimal label to each pixel. We describe our multi-scale CRF model for RGB-D image segmentation in Section IV. In Sections V and VI, experimental results and a conclusion are presented. Ideas for future work are given in Section VII.

II. RELATED WORK

The problem of dense semantic labeling of 3D point clouds has gained increasing interest in recent years, largely due to new datasets that contain many RGB and range images together with their corresponding ground-truth annotations. Three popular datasets in the field of indoor scene labeling are NYU Depth v1 [2], NYU v2 [3], and RGB-D Washington [4, 27]. In this section we give an overview of research related to indoor semantic segmentation: first, recent segmentation work using these new datasets, which draws heavily on information from range images; second, related work on labeling 3D points instead of range images; and finally, methods that use statistical models such as Markov Random Fields (MRF), Conditional Random Fields (CRF), and their multi-scale variants to label the scene.

Recently, several studies have focused on scene labeling using RGB-D data. Silberman [2] used a feed-forward neural network classifier with SIFT feature vectors computed on both depth and color to segment the scene, and showed that depth data can significantly improve segmentation results. Ren et al. [5] extract features for patches of the scene using kernel descriptors on the RGB and depth information, and then use a segmentation tree and an MRF to find an efficient labeling; one linear SVM classifier is trained per layer of the segmentation tree to score its segments, and these scores serve as unary potentials while gPb scores [6] between adjacent patches serve as pairwise potentials in the underlying MRF model. Silberman et al. [1] first determine the 3D structure of the environment and distinguish the different objects in it, then use this information to infer the support relations of the visible objects; the novelties of that paper are an alignment to three principal hypotheses scored by normal vectors and a four-category classification. The method further merges small regions using minimal boundary strength to obtain more homogeneous regions, and estimates label prior probabilities from the distances of adjacent regions to the floor. The method in [8, 28] is similar to our work in that it starts from superpixels obtained with SLIC [9] and then uses a minimum spanning tree to build a superpixel graph; features such as the distance of points from planes extracted by RANSAC [10], the mean and variance of depth within a superpixel, the depth and height of the superpixel's center point, and the angles of normal vectors with the horizontal plane are used to construct the unary and pairwise potentials. Couprie et al. [11] used a three-stage hierarchical convolutional network to automatically learn features from depth and color images, concatenating the outputs of the three stages for classifier prediction; these predictions serve as features for superpixels obtained from unsupervised segmentation and are then used to infer coherent segments, and for video sequences a temporal consistency constraint exploits information from different frames. Hermans et al. [12] used a 2D-to-3D label transfer method based on Bayesian updates [13] and a dense pairwise 3D CRF [14]; their method runs a fast 2D semantic segmentation based on a randomized decision forest while reconstructing the 3D scene, and fuses the segmentation into the current point states.
Gupta et al. [15] incorporated both depth and appearance features to create superpixels, and used these features to train an SVM for hierarchical segmentation and contour detection. Anand et al. [16] also studied 3D scenes obtained from Kinect RGB-D data for mobile robotic applications, using temporally consecutive frames to reconstruct the 3D scene. In [17], voxel-based SLAM on RGB-D data is used to reconstruct the scene, and regions are labeled by Bayesian updates. Lin et al. [18] extract cuboid boxes of objects in the 3D scene with CPMC [19] and use geometrical features such as area, distance to the wall, volume, and angle with the wall normal.

Several research efforts address labeling of 3D point clouds directly. Shao et al. [20] use a geometrical model extracted from planes, together with SIFT appearance features, to initialize regions and segment them interactively. The method in [21] uses an iterative approach to gradually fuse and classify patches extracted from the 3D point cloud. Using patches to extract features from 3D data is very common in semantic labeling: Xiong et al. [22] use patches from laser-scanned 3D data and context information, together with a generalized neighborhood system in a CRF graph, and Koppula et al. [23] segment 3D data into patches according to their normal vectors and planes.

The standard CRF model suffers from high time complexity, and several works speed up the traditional model by solving a smaller-scale version of the problem as initialization for the finer scale. Recently, hierarchical inference models have been applied to assign labels to pixels efficiently [5, 11]. Kohli et al. [24] is similar to our proposed method in solving the energy minimization at multiple scales, though they use the coarse scale of the image to identify uncertain pixels, which are passed to a higher level for labeling. In another work [25], Kim et al. reconstruct and segment the 3D scene simultaneously, capturing the geometry and semantic features of objects by defining a CRF model on 3D elements (voxels).

III. CONDITIONAL RANDOM FIELD (CRF)

Since the 1980s, probabilistic inference has been a powerful tool in image processing, and much research in image segmentation builds on this framework. In particular, maximum a posteriori (MAP) estimation in a conditional random field has recently been applied to semantic segmentation. Our goal is to improve the CRF approach by using depth data alongside RGB data in order to increase the accuracy of region recognition in the image.

The CRF formulation of the 2D image segmentation problem is as follows. A given image can be represented as a 1 x S array of feature vectors, where S is the number of image pixels. The algorithm assigns a unique class label L_i to each pixel, and the labeling L = (L_1, L_2, ..., L_S) is obtained by maximizing the posterior probability p(L | f), where f is the feature vector extracted from the image pixels and L is a labeling from the possible label set Ω. Bayes' theorem gives p(L | f) = p(f | L) p(L) / p(f). Since p(f) does not depend on the labeling L, it can be ignored. In the 2D image segmentation problem, we assume conditional independence, p(f | L) = \prod_{s \in S} p(f_s | L_s). The Hammersley-Clifford theorem tells us that if p(L) is to have the Markov property, it must follow a Gibbs distribution:

p(L) = \frac{1}{Z} \exp(-U(L)) = \frac{1}{Z} \exp\Big(-\sum_{c \in C} V_c(L)\Big)    (1)

Z = \sum_{L} \exp(-U(L))    (2)
where Z is the normalization constant, U is the energy function of the labeling L, and V_c is the potential function of clique c. Hence, the energy of a specific labeling is the sum of its clique potentials, as defined in (1). Moreover, as (1) shows, maximizing the posterior probability is equivalent to minimizing the energy function, as in (3):

L^{*} = \arg\max_{L} p(L \mid f) = \arg\min_{L} U(L) = \arg\min_{L} \sum_{c \in C} V_c(L)    (3)
The potential functions of cliques of order one (unary potentials) directly reflect the probability of assigning a specific label to a specific pixel, and can be used to label the image independently. We assume this probability is Gaussian, with mean μ_{L_s} and standard deviation σ_{L_s} for label L_s. The energy function is then the sum of the potential functions of cliques of order one and cliques of order two, as in (4):

L^{*} = \arg\min_{L} \sum_{s \in S} \Big( \log(\sqrt{2\pi}\,\sigma_{L_s}) + \frac{(f_s - \mu_{L_s})^2}{2\sigma_{L_s}^2} \Big) + \sum_{c \in C_2} V_2(c)    (4)

where V_2(c) is the potential function of cliques of order two (pairwise potentials), which models the relation between two adjacent labels in the image and is defined by the Potts model as in (5):

V_2(c) = V_{\{r,s\}}(L_r, L_s) = \beta\,\delta(L_r, L_s)    (5)
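To make the objective concrete, the following minimal Python sketch evaluates the energy of (4)-(5) for a given labeling on a 4-connected grid. It is our own illustration, not the paper's code, and it adopts the common Potts convention of charging β whenever two neighbors disagree:

```python
import numpy as np

def energy(labels, feats, mu, sigma, beta):
    """Energy of a labeling under (4)-(5): Gaussian unary + Potts pairwise.

    labels    : (H, W) integer array of class indices
    feats     : (H, W) float array of per-pixel features
    mu, sigma : (K,) per-class Gaussian mean / standard deviation
    beta      : Potts parameter for the doubleton cliques C2
    """
    m, s = mu[labels], sigma[labels]
    # Unary term: negative log of the Gaussian likelihood p(f_s | L_s).
    unary = np.log(np.sqrt(2.0 * np.pi) * s) + (feats - m) ** 2 / (2.0 * s ** 2)
    # Pairwise Potts term over horizontal and vertical doubleton cliques.
    pairwise = beta * np.count_nonzero(labels[:, 1:] != labels[:, :-1])
    pairwise += beta * np.count_nonzero(labels[1:, :] != labels[:-1, :])
    return unary.sum() + pairwise
```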
3.1. Multi-scale CRF

To generate a multi-scale CRF model, we derive several low-resolution versions of the input image by applying the Haar wavelet transform [7] repeatedly. This yields a pyramid (Fig. 2) in which each level i contains a coarse image S^i of the input image (the image at level zero), which is isomorphic to the scale B^i. The isomorphism Φ^i at level i projects the coarse image S^i to an image of the size of the input. The main benefit of this decomposition is that the potentials at coarser scales can be derived by simple computation from the potentials at the finest scale [3]. The basic idea of [3] is to find a better means of communication between the levels than the plain initialization used in the multi-scale model; our approach follows this idea by introducing new interactions between two neighboring scales of the pyramid.

Fig. 2. Pyramid from the fine to the coarse level, with the isomorphism Φ^i mapping S^i to B^i at level i [3].
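For illustration, one way to build such a pyramid is repeated 2x2 Haar averaging (the approximation band of the Haar transform); this sketch is ours and keeps only the low-pass coefficients:

```python
import numpy as np

def haar_pyramid(image, levels):
    """Fine-to-coarse pyramid by repeated 2x2 Haar averaging.

    pyramid[0] is the input image (level zero); each further level
    halves the resolution, so pyramid[-1] is the coarsest image.
    """
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        img = pyramid[-1]
        h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
        img = img[:h, :w]
        # Haar approximation coefficients: the mean of each 2x2 block.
        coarse = (img[0::2, 0::2] + img[0::2, 1::2]
                  + img[1::2, 0::2] + img[1::2, 1::2]) / 4.0
        pyramid.append(coarse)
    return pyramid
```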

This scheme also permits parallelization of the relaxation algorithm over the whole pyramid. We therefore define a new neighborhood system that allows cliques between two adjacent scales. As seen in Fig. 3, besides the singleton (C1) and doubleton (C2) cliques at each level of the pyramid, we have a new type of doubleton clique (C3), composed of a pixel at level i paired either with the corresponding pixel at the coarser level i-1 or with one of the four pixels that form it at the finer level i+1.

Fig. 3. New neighborhood system defined on the pyramid, with the new clique sets [3].

The potential function of these new doubleton cliques is also defined by a Potts model, with parameter β_2 as in (6), reflecting the constraint that pixels at two adjacent levels should carry the same label. The energy function of this new neighborhood system is defined in (7):

V_3(c) = V_{\{r,s\}}(L_r, L_s) = \beta_2\,\delta(L_r, L_s)    (6)

U(L) = \sum_{s \in S} \Big( \log(\sqrt{2\pi}\,\sigma_{L_s}) + \frac{(f_s - \mu_{L_s})^2}{2\sigma_{L_s}^2} \Big) + \sum_{c \in C_2} V_2(c) + \sum_{c \in C_3} V_3(c)    (7)
The multi-scale algorithm essentially follows a top-down strategy (Fig. 4): first the highest layer of the pyramid is solved (relaxation), and then the next level is initialized with the result of the previous, coarser level (projection).

Fig. 4. Relaxation-projection approach used in multi-scale CRF image segmentation [3].
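The relaxation-projection loop of Fig. 4 can be sketched as below; `relax` stands for any single-level CRF solver and `project` for the label upsampling step, both hypothetical placeholders for the components described above:

```python
def multiscale_segment(pyramid, relax, project):
    """Top-down relaxation-projection over a fine-to-coarse pyramid.

    pyramid : list of images, pyramid[0] finest, pyramid[-1] coarsest
    relax   : callable(image, init_labels) -> labels (single-level solver)
    project : callable(coarse_labels, fine_shape) -> init_labels
    """
    labels = relax(pyramid[-1], None)        # solve the coarsest level first
    for image in reversed(pyramid[:-1]):     # then walk down to the finest
        init = project(labels, image.shape)  # projection step
        labels = relax(image, init)          # relaxation step at this level
    return labels
```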

IV. PROPOSED METHOD

In this section, we present our multi-scale CRF segmentation method together with its new pairwise potential function. To encode the geometry of the scene more efficiently, we use regions of the scene obtained from the edges of surface normals and of Lab color-space images. For the edge-based region extraction, a simple Canny edge detector is applied to the color images, while for the depth data we use the average cosine of the angles between the normal vectors of neighboring pixels, normalized to values between zero and two. Simple Otsu thresholding then gives the binary edge map of the scene. It is worth mentioning that regions obtained from both surface-normal edges and color edges are more robust to the noise and occlusion of images captured by the Kinect. Fig. 5 shows a simple three-sided wall scene, the edges described above, and the regions inside the closed edges; this simple scene clearly illustrates how our method obtains regions from the edges of both the surface normals and the color image. These regions are used in our new pairwise potential function (8), which encourages cliques at the same level whose pixels fall in the same extracted region to take the same label:

V_2(c) = V_{\{r,s\}}(L_r, L_s) = \beta\,\delta(R_r, R_s)\,D_{LBP}    (8)
where D_LBP measures the difference between the two adjacent pixels, within the same region, based on their LBP [26] responses, and R_r, R_s are the region labels of pixels r and s. Compared with (5), if two adjacent pixels share a region label but are not sufficiently similar in the LBP measurement, their potential decreases; in (5), by contrast, the potential depends only on the labeling.
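The paper does not give a closed form for D_LBP, so the sketch below is one plausible realization of (8): the smoothing penalty within a region decays as the LBP responses of the two pixels grow apart. Only horizontal cliques are shown; vertical cliques are handled analogously.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def region_pairwise(gray, labels, regions, beta):
    """Region-aware pairwise term (8) over horizontal doubleton cliques."""
    # LBP response per pixel (uniform patterns, 8 neighbors, radius 1).
    lbp = local_binary_pattern(gray, P=8, R=1.0, method="uniform")
    same_region = regions[:, 1:] == regions[:, :-1]
    diff_label = labels[:, 1:] != labels[:, :-1]
    # Assumed realization of D_LBP: penalty decays with the LBP difference,
    # so texture boundaries inside a region are smoothed less aggressively.
    d_lbp = np.exp(-np.abs(lbp[:, 1:] - lbp[:, :-1]))
    return beta * ((same_region & diff_label) * d_lbp).sum()
```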

As shown in Fig. 5 (top-left), two regions of the scene may have similar intensity in the RGB image yet belong to different sides of the wall. In this case, the direction of the surface normal at each pixel helps: as depicted in Fig. 5 (bottom-left), the edges extracted from the directions of neighboring surface normals enable our method to recover the regions of the scene.

Fig. 5. (top-left) RGB image; (top-right) range image; (bottom-left) edges from the average cosine of the angle between neighboring surface normals combined with Canny edges of the RGB image; (bottom-right) regions inside the closed edges.
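A minimal sketch of this edge-based region extraction follows. It assumes OpenCV and SciPy, that unit surface normals have already been estimated from the range image, and illustrative Canny thresholds, since the paper does not specify these details:

```python
import cv2
import numpy as np
from scipy import ndimage

def extract_regions(rgb, normals):
    """Regions bounded by combined color edges and surface-normal edges.

    rgb     : (H, W, 3) uint8 color image
    normals : (H, W, 3) float array of unit surface normals per pixel
    """
    # Color edges: Canny on the grayscale image (thresholds illustrative).
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    color_edges = cv2.Canny(gray, 50, 150) > 0

    # Normal edges: average cosine of the angle between each pixel's normal
    # and its right/down neighbors (borders padded with cosine 1).
    cos_r = np.ones(gray.shape, np.float32)
    cos_d = np.ones(gray.shape, np.float32)
    cos_r[:, :-1] = (normals[:, :-1] * normals[:, 1:]).sum(-1)
    cos_d[:-1, :] = (normals[:-1, :] * normals[1:, :]).sum(-1)
    strength = 1.0 - (cos_r + cos_d) / 2.0  # in [0, 2], as in the text

    # Otsu threshold on the 8-bit rescaled strength gives binary normal edges.
    s8 = np.uint8(255.0 * strength / 2.0)
    _, normal_edges = cv2.threshold(s8, 0, 255,
                                    cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Regions are the connected components enclosed by the combined edges.
    edges = color_edges | (normal_edges > 0)
    regions, num_regions = ndimage.label(~edges)
    return regions, num_regions
```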

We found it useful to exploit the segmentation of the previous frame in the video stream captured by the Kinect sensor from the indoor scene. To this end, we use the labeling of the previous frame to form new cliques, in addition to the two previous clique types. As Fig. 6 shows, a pixel in the next frame is likely to keep the label it has in the current frame.

Fig. 6. Doubleton cliques between two consecutive frames.

These inter-frame doubleton cliques push the energy minimization toward states in which the labeling of the current frame is similar to the labeling of the previous frame. We also use singleton cliques for the surface-normal feature, analogous to the RGB singleton cliques of the traditional CRF framework: these potentials likewise model the probability that a specific pixel belongs to each label, according to features extracted from the range image such as the surface normal. Finally, we propose our new energy function:

U(L) = \sum_{s \in S} \Big( \log(\sqrt{2\pi}\,\sigma_{L_s}) + \frac{(f_s - \mu_{L_s})^2}{2\sigma_{L_s}^2} \Big) + \sum_{s \in S} \Big( \log(\sqrt{2\pi}\,\sigma_{N_s}) + \frac{(N_s - \mu_{N_s})^2}{2\sigma_{N_s}^2} \Big) + \sum_{c \in C_2} V_2(c) + \sum_{c \in C_3} V_3(c) + \sum_{c \in C_4} V_4(c)    (9)
V_4(c) = V_{\{r,s\}}(L_r, L_s) = \beta_3\,\delta(L_r, L_s)    (10)
where N_s is the surface-normal datum of the s-th site, with per-label mean μ_{N_s} and standard deviation σ_{N_s}. In (9), C_4 is the set of doubleton cliques defined between two consecutive frames, and the potential function of these cliques is defined in (10) with parameter β_3.

Fig. 7. Segmentation result for one image of the NYU v1 depth dataset. The first column shows, from top to bottom, the RGB image, the range image, the ground-truth labeling, and the edges extracted by the Canny detector. The second column shows, from top to bottom, the surface normals, the average cosine of the angle between adjacent normals, the regions inside the closed edges (combining Canny and normal edges), and the final segmentation obtained by our multi-scale CRF.
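Assuming pixel correspondence between consecutive frames is taken by image position (the paper does not state the correspondence explicitly), the C_4 term of (9) reduces to a simple count, as in this sketch of ours:

```python
import numpy as np

def interframe_term(labels, prev_labels, beta3):
    """Inter-frame doubleton cliques (C4): each pixel is linked to the pixel
    at the same position in the previous frame and pays beta3 whenever the
    two labels disagree, following the Potts potential (10)."""
    return beta3 * np.count_nonzero(labels != prev_labels)
```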

V. EXPERIMENTAL RESULTS

We examine our method on the NYU v1 RGB-D dataset [2], which provides several Kinect frames of indoor environments; we briefly describe this dataset in the following subsection.

5.1. NYU Depth version 1

The NYU Depth dataset version 1 (NYU-V1) [2] contains 2347 images from 64 different indoor environments, with pixel-wise annotations for 13 classes. It is among the most challenging RGB-D datasets, covering varied indoor environments with high detail, and is available at http://cs.nyu.edu/~silberman/datasets/nyu_depth_v1.html. It is worth mentioning that the available NYU ground truth does not assign separate labels to structures such as walls with different orientations (Fig. 7). These annotations can skew the results of methods that try to encode the geometrical structure of the 3D scene, so it was necessary to re-label the data for this specific purpose. We evaluate our segmentation algorithm on 30 scenes with three different structures from NYU v1, using the Hausdorff criterion.
We use the Hausdorff criterion because a distance is a better error indicator than simple overlap statistics: the Hausdorff distance is a mathematical construction that measures the closeness of two sets of points in a metric space.
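For reference, a symmetric Hausdorff distance between two boundary point sets can be computed with SciPy as below; this is a generic sketch, not the authors' evaluation code:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two (N, 2) point sets, e.g.
    boundary pixel coordinates of a predicted and a ground-truth segment."""
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

# Usage: distance between two small example boundaries.
pred = np.array([[0, 0], [0, 1], [1, 0]])
gt   = np.array([[0, 0], [0, 2], [2, 0]])
print(hausdorff(pred, gt))  # 1.0
```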

In Table I, we compare our proposed method with other popular methods: the standard CRF alone; the multi-scale CRF alone [3]; the method that defines the pairwise potential from convexity measurements and differences of surface normals [27]; the method of [28], which builds its doubleton potential from the normals of planes and pixels and the intensity gradient of the RGB image; our method with only the consecutive-frame cliques (Only-FrameCliques); and our method with only the doubleton clique potential of (8) (Only-RegionPotential).

TABLE I. COMPARISON OF THE PROPOSED METHOD WITH OTHER POPULAR METHODS USING THE CRF FRAMEWORK, BASED ON HAUSDORFF DISTANCE

Method                  Hausdorff distance
CRF                     5.31
Multi-scale CRF [3]     4.12
Kevin Lai [27]          3.74
Li Guan [28]            5.12
Only-FrameCliques       2.40
Only-RegionPotential    2.55
Proposed Method         2.35
As Table I shows, our proposed method reaches a better Hausdorff distance than the other popular methods in this field. This improvement originates from our new definition of cliques between consecutive frames, the use of the multi-scale approach, and the new region extraction method based on both Canny edges and the angles between surface normals. We implemented the method in Matlab R2012a on a 2.2 GHz CPU running Windows 8. We achieved the best result with parameters β = 0.5, β2 = 1, and β3 = 3 for an energy function with five main segments. Current methods achieve good results but typically need more than one minute per image, whereas our proposed method usually segments the principal structures of the scene in less than a minute.

VI. CONCLUSION

In this paper, we enhance the quality of indoor scene segmentation by using a region-based pairwise potential obtained from the edges of both the surface normals and the RGB image. We also define new cliques between the labelings of two consecutive sensor frames, in addition to the cliques between adjacent pixels in a single layer and between two adjacent layers of the multi-scale CRF model. Our experiments show that the new cliques between consecutive frames constrain the CRF toward the previous labeling while still allowing it to adapt the labeling to the current observation of the Kinect sensor mounted on the mobile robot.

VII. FUTURE WORK

There are several ways to extend our work toward more accurate RGB-D image segmentation. First, in this paper, the previous frame's segmentation was used only to form cliques alongside the inter-level and intra-level cliques; one improvement would be to use this information to assign each pixel a prior probability of belonging to each class. In addition, the surface normals of the range images captured by the Kinect sensor are noisy. We could update them in each iteration of our minimization method: each iteration yields a better labeling of the image pixels, and this improved labeling could in turn be used to refine the surface normals. Refining the surface normals changes the average angle between neighboring normals and can gradually influence the final segmentation.

REFERENCES

[1] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," ECCV, pp. 746-760, 2012.
[2] N. Silberman and R. Fergus, "Indoor scene segmentation using a structured light sensor," ICCV Workshop on 3D Representation and Recognition, pp. 601-608, 2011.
[3] Z. Kato, M. Berthod, and J. Zerubia, "A hierarchical Markov random field model and multitemperature annealing for parallel image classification," CVGIP: Graphical Models and Image Processing, vol. 58, no. 1, pp. 18-37, 1996.
[4] K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," ICRA, May 2011.
[5] X. Ren, L. Bo, and D. Fox, "RGB-(D) scene labeling: Features and algorithms," CVPR, pp. 2759-2766, 2012.
[6] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Trans. PAMI, 2010.
[7] Z. R. Struzik and A. Siebes, "The Haar wavelet transform in the time series similarity paradigm," PKDD, pp. 12-22, 1999.
[8] C. Cadena and J. Kosecka, "Semantic parsing for priming object detection in RGB-D scenes," Workshop on Semantic Perception, Mapping and Exploration (in conjunction with IEEE ICRA), Karlsruhe, 2013.
[9] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. PAMI, vol. 34, no. 11, pp. 2274-2282, 2012.
[10] M. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, pp. 381-395, 1981.
[11] C. Couprie, C. Farabet, L. Najman, and Y. LeCun, "Indoor semantic segmentation using depth information," CoRR, abs/1301.3572, 2013.
[12] A. Hermans, G. Floros, and B. Leibe, "Dense 3D semantic mapping of indoor scenes from RGB-D images," ICRA, 2014.
[13] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.
[14] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," NIPS, 2011.
[15] S. Gupta, P. Arbelaez, and J. Malik, "Perceptual organization and recognition of indoor scenes from RGB-D images," CVPR, 2013.
[16] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena, "Contextually guided semantic labeling and search for 3D point clouds," IJRR, vol. 32, no. 1, pp. 19-34, 2013.
[17] J. Stückler, N. Biresev, and S. Behnke, "Semantic mapping using object-class segmentation of RGB-D images," IROS, 2012.
[18] D. Lin, S. Fidler, and R. Urtasun, "Holistic scene understanding for 3D object detection with RGBD cameras," ICCV, pp. 1417-1424, 2013.
[19] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Trans. PAMI, 2012.
[20] T. Shao, W. Xu, K. Zhou, J. Wang, D. Li, and B. Guo, "An interactive approach to semantic modeling of indoor scenes with an RGBD camera," ACM Transactions on Graphics, vol. 31, no. 6, pp. 136:1-136:11, 2012.
[21] L. Nan, K. Xie, and A. Sharf, "A search-classify approach for cluttered indoor scene understanding," ACM Transactions on Graphics, vol. 31, no. 6, pp. 137:1-137:10, 2012.
[22] X. Xiong and D. Huber, "Using context to create semantic 3D models of indoor environments," BMVC, 2010.
[23] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena, "Semantic labeling of 3D point clouds for indoor scenes," NIPS, pp. 244-252, 2011.
[24] P. Kohli, V. S. Lempitsky, and C. Rother, "Uncertainty driven multi-scale optimization," DAGM Symposium, pp. 242-251, 2010.
[25] B. Kim, P. Kohli, and S. Savarese, "3D scene understanding by Voxel-CRF," ICCV, pp. 1425-1432, 2013.
[26] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. PAMI, vol. 24, no. 7, 2002.
[27] K. Lai, L. Bo, X. Ren, and D. Fox, "Detection-based object labeling in 3D scenes," ICRA, pp. 1330-1337, 2012.
[28] L. Guan, T. Yu, P. Tu, and S. Lim, "Simultaneous image segmentation and 3D plane fitting for RGB-D sensors: An iterative framework," CVPRW, pp. 49-56, June 2012.
