arXiv:1606.01178v1 [cs.CV] 3 Jun 2016

Reinforcement Learning for Semantic Segmentation in Indoor Scenes

Md. Alimoor Reza
Department of Computer Science, George Mason University, Fairfax, VA, USA
Email: [email protected]

Jana Kosecka
Department of Computer Science, George Mason University, Fairfax, VA, USA
Email: [email protected]

Abstract—Future advancements in robot autonomy and the sophistication of robotic tasks rest on robust, efficient, and task-dependent semantic understanding of the environment. Semantic segmentation is the problem of simultaneous segmentation and categorization of a partition of sensory data. The majority of current approaches tackle this using multi-class segmentation and labeling in a Conditional Random Field (CRF) framework [19] or by generating multiple object hypotheses and combining them sequentially [2]. In practical settings, the subset of semantic labels that is needed depends on the task and the particular scene, and labelling every single pixel is not always necessary. We pursue these observations in developing a more modular and flexible approach to multi-class parsing of RGB-D data, based on learning strategies for combining independent binary object-vs-background segmentations in place of the usual monolithic multi-label CRF approach. Parameters for the independent binary segmentation models can be learned very efficiently, and the combination strategy, learned using reinforcement learning, can be set independently and can vary over different tasks and environments. Accuracy is comparable to state-of-the-art methods on a subset of the NYU-V2 dataset of indoor scenes [24], while providing additional flexibility and modularity.

I. INTRODUCTION

The problem of semantic understanding of indoor environments is central to many service robotics applications where objects need to be sought and recognized. The two most common strategies for tackling object detection/search are to either train an object category or instance-specific detector, or to approach the problem as one of semantic segmentation, where a partition of the sensory data is labeled with semantic categories. The latter approach can be more powerful, as it enables the exploitation of various contextual relationships between objects and background or region categories, as well as between the objects themselves. This problem has been studied extensively by the computer vision community and evaluated on a variety of benchmark datasets [6, 12, 24]. Most current semantic segmentation approaches either use a multi-label Conditional Random Field (CRF), e.g. [19], or generate multiple object hypotheses and combine them sequentially, e.g. [2, 8, 9]. Building on the expressive power of these approaches and the availability of large amounts of labeled data, there have been notable recent improvements in accuracy [9]. One appealing aspect of these approaches is that, given enough training data, a system can be trained

to perform well on a specific benchmark dataset for semantic segmentation.

Fig. 1. From left to right, a sample RGB image, corresponding ground truth image, and the semantic segmentation output from our learned policy in a bathroom scene.

The motivation of this paper is to revisit the problem setting from a methodological standpoint. To this end we make two observations. First, depending on the current task or scene, considering all possible labels may not be necessary, while focusing only on task-relevant labels may improve the accuracy of the semantic segmentation that matters for the task. Second, as robotics pursues wider goals including life-long learning and adaptation, it is less reasonable to assume that a known, fixed set of labels will be used for parsing. Instead there is a need to represent and acquire new semantic concepts in a modular and incremental manner while linking them together with additional concepts, e.g. as represented in a large graph by [20]. Both of these observations touch on one of the core advantages and potential pain points of approaches that use a multi-label CRF to combine models of local image content, based on a variety of features, with measures of consistency between nearby labels. These approaches must optimize the trade-off between these factors with respect to all of the classes jointly. This works well when the set of labels is fixed and the task is semantic segmentation with all those labels, but it cannot be easily adapted to consider only a subset of categories relevant for a specific task or environment, or to accommodate a new category. We propose an alternative approach for semantic understanding of indoor environments, where segmentation and categorization are carried out by learning a sequence of actions (objects to be recognized and segmented). Central to this approach is that, instead of training a large multi-label Conditional Random Field (CRF), we propose learning a

number of independent binary object/background segmentations and learning to combine them sequentially, providing flexible modularity as well as some simplification of the optimization in training and inference. We then formulate the problem of sequentially combining the binary object/background segmentations as a Markov Decision Process (MDP). An additional contribution of the proposed method is the use of variegated superpixels, where large superpixel regions are formed by robust geometric fitting of planar supporting surfaces and small compact superpixels are formed using appearance cues. Contributions include:

• reformulation of multi-label CRF based semantic segmentation into modular binary object/background segmentations combined using a learned policy.
• simplified optimization for learning and inference.
• variation in superpixel formation to accommodate both large planar surfaces and smaller objects.
• competitive evaluation on a subset of the NYUD V2 dataset.

II. RELATED WORK

The presented work is related to different strategies for multi-class semantic segmentation, object/background segmentation, and context modeling for both RGB and RGB-D images. Here we focus on the methodologies used and discuss the suitability of different choices for building modular, adaptive, life-long learning robotic perceptual systems. Baseline approaches for semantic segmentation using RGB-D data were proposed in [23], where multiple alternatives were considered for the unary and pairwise terms inside a pixel-level Conditional Random Field (CRF) model. In RGB-only settings, the most successful approaches typically combine local appearance information with a smoothness prior that favors the same label for neighbouring regions (pixels or superpixels) [7, 14]. More recently, holistic scene representations were proposed using a graphical model framework [16]. In [19, 12] the authors used superpixel hierarchies endowed with features extracted from the entire path of the segmentation tree. In later work, excellent results were achieved by considering bottom-up unsupervised segmentation and boundary detection to generate and classify candidate regions [8]. More recent approaches achieved superior results by using deep convolutional networks for feature extraction, trained jointly or independently with CRF models that capture the context and spatial relationships between objects [14, 22]. The above-mentioned approaches exhaustively consider all semantic labels and either train a multi-label Conditional Random Field (CRF) or multiple one-vs-all classifiers evaluated on the regions. In [5], the authors also bypass the complex feature computation and segmentation stage and use convolutional networks for semantic segmentation. Hypotheses for the indoor scene labelling task are then evaluated using a superpixel-based over-segmentation.

The problem of binary object/background classification has also been tackled with the commonly used sliding-window processing pipelines, with window pruning strategies and features adapted to RGB-D data; commonly used approaches include [18, 28]. The bounding boxes generated by sliding-window approaches, while efficient, provide only a coarse segmentation and are mainly suitable for objects whose shape can be well approximated by a bounding box. Alternative strategies for object detection and semantic segmentation proceed with mechanisms for generating object proposals from elementary segments using low-level grouping cues of color and texture. This approach has been pursued in the context of generic object detection followed by recognition, in the absence of depth data, by [26]. Strategies for generating and ranking multiple object proposals have also been explored by [4]. Discriminatively-trained part-based models, which incorporate some context information, have been introduced recently [17]. A body of work has explored reinforcement learning techniques for robotics and image understanding problems [11, 13]. Karayev and colleagues [11] learn a policy that optimizes object detections in a time-restricted manner. Kwok et al. [13] introduced an augmented MDP formulation for the task of picking the best sensing strategies in a robotics experiment and utilize least-squares policy iteration to solve for the optimal policy.

III. PROPOSED APPROACH

We start by learning binary object/background segmentation models for each object category. While these do use a CRF to integrate appearance and label-consistency cues, they are not multi-label CRFs and only consider a single category and the background. As the resulting object/background segmentations may overlap, they need to be combined into a single pixel-level labeling. The simplest strategy of taking the most confident prediction for the overlapping regions is not ideal, as it requires calibrating the categories against each other, something we explicitly avoid, and often favors large objects of interest, which can be detected with higher confidence than smaller objects. Instead, we adopt a sequential strategy for combining the binary segmentations, where the strategy is found using reinforcement learning. This approach is favorable compared to a fixed ordering, which naturally gives priority to large or frequent categories due to their contribution to the overall pixel-level accuracy measure, and it can adjust dynamically to each image. In the following sections we describe the ingredients of our approach, starting with learning and inference for generating binary object/background segmentations, and then the sequential combination strategy and learning the optimal sequencing of object/background segmentations in a reinforcement learning framework.
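To make the overall flow concrete, the following Python sketch outlines the pipeline under simplifying assumptions; the interfaces are hypothetical and not part of our implementation (binary_models is assumed to map category names to objects exposing a segment method, and policy is assumed to expose a select_sequence method).

import numpy as np

def sequential_semantic_segmentation(image, binary_models, policy,
                                     background_order=("wall", "floor", "ceiling")):
    # Sketch: run one independent binary object/background model per category,
    # then paste the masks into a single labeling, background classes first,
    # remaining categories in the order chosen by the learned policy.
    masks = {name: model.segment(image) for name, model in binary_models.items()}
    label_ids = {name: k for k, name in enumerate(binary_models)}
    labeling = np.full(image.shape[:2], -1, dtype=int)      # -1 = unlabeled

    order = [c for c in background_order if c in masks]
    order += policy.select_sequence(image, exclude=set(order))

    for name in order:
        # The full system resolves overlaps with the segment-size-ratio
        # heuristic of Section III-D; this sketch simply keeps earlier labels.
        free = masks[name] & (labeling == -1)
        labeling[free] = label_ids[name]
    return labeling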

A. Object/background segmentation

Motivated by the observation that indoor scenes contain many planar structures, we segment an image into regions of planar and non-planar surfaces. Our framework starts by identifying the dominant planar surfaces using a RANSAC-based approach that samples 3D point triplets from the depth image. The remaining regions not explained by the planar surfaces are represented by compact SLIC superpixels [1] generated from the RGB image. Figure 2 shows our superpixels, where the region boundaries are marked in red.

a) Conditional Random Fields: We formulate the binary object/background segmentation in a Conditional Random Field (CRF) framework. The conditional distribution p(x|z) is a log-linear combination of feature functions or potentials, as shown in Equation 1. By maximizing the conditional likelihood, we can learn the weights w_i from a set of labeled training examples. The final prediction for the hidden random variables is found as the maximum a-posteriori (MAP) solution x for which the conditional probability p(x|z) is maximum.

p(x|z) = \frac{1}{Z(z)} \exp\left( w_1 \sum_{i \in V} \theta_d(x_i, z) + w_2 \sum_{(i,j) \in E} \theta_{pc}(x_i, x_j, z) + w_3 \sum_{(i,j) \in E} \theta_{px}(x_i, x_j, z) \right)    (1)

The hidden random variables x = {x_1, x_2, ..., x_{|V|}} correspond to the nodes in the CRF graph, z refers to the vector of observation variables, and Z(z) is the partition function. Each hidden random variable can take one of two discrete values, x_i ∈ {object, background}, where object ∈ {bed, sofa, ..., wall} in the indoor scenes experiment. Note that the object here simply refers to one of the semantic labels available in the dataset. The unary feature function associated with each node x_i comes from the output probability p(x_i|z) of an AdaBoost classifier:

\theta_d(x_i, z) = -\log p(x_i|z)    (2)

where the observations z are computed for each superpixel i using a subset of the features described in Section III-B. The pairwise functions are computed for every edge (i, j) ∈ E in the CRF graph. We penalize the labelings of two adjacent superpixels according to their differences in one of two aspects, color or spatial: the color term penalizes color differences, while the spatial term penalizes differences between the mean depths associated with the superpixels, as detailed in [3]. These pairwise functions ensure smoothness of the final superpixel labeling. We follow a similar CRF graph construction method, learning, and inference technique as detailed in [3].
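As a rough illustration of how these terms fit together (a sketch, not our implementation; the exponential forms of the pairwise penalties stand in for the exact terms of [3], and all variable names are ours), the potentials of Equation 1 could be assembled over the superpixel graph as follows.

import numpy as np

def binary_crf_potentials(p_obj, colors, depths, edges, w1=1.0, w2=1.0, w3=1.0):
    # p_obj:  (N,) AdaBoost probability that each superpixel belongs to the object
    # colors: (N, 3) mean color per superpixel, depths: (N,) mean depth
    # edges:  list of (i, j) superpixel adjacencies in the CRF graph
    eps = 1e-6
    # Unary term of Eq. (2): negative log-probability for object vs. background.
    unary = w1 * np.stack([-np.log(p_obj + eps),
                           -np.log(1.0 - p_obj + eps)], axis=1)
    # Contrast-sensitive pairwise penalties: adjacent superpixels that are
    # similar in color or in mean depth pay more for taking different labels.
    pairwise = {}
    for i, j in edges:
        color_term = np.exp(-np.linalg.norm(colors[i] - colors[j]))
        spatial_term = np.exp(-abs(depths[i] - depths[j]))
        pairwise[(i, j)] = w2 * color_term + w3 * spatial_term
    return unary, pairwise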

Fig. 2. Our superpixel image (bounded by red lines) and the depth image from NYUD V2. Notice how our superpixels capture the large planar surfaces.

B. Features

We use a set of rich and discriminative features to characterize the superpixels, capturing the local appearance and geometry as well as global statistics of the scene. We exploit a subset of the geometric and generic features proposed in [8], computed over superpixels. Appearance is encoded using the traditional representation of color and texture features introduced in [25] and [10]: the color distribution of a superpixel is represented by a fixed histogram of colors in HSV color space, and the texture histogram is formed from a set of oriented Gaussian derivatives at a fixed set of orientations. The geometric features computed over the 3D points associated with each superpixel are described in detail in [3]. We also adopt a set of perspective features computed from the statistics of straight lines, their intersections, and other vanishing-point statistics, as described in [10].

C. Unary Terms

We learn an AdaBoost classifier in a one-vs-all fashion for each particular object such as bed, table, sofa, and others. We utilize the logistic regression version of AdaBoost introduced in [10] and apply a sigmoid function to its predicted output, a confidence measure, to obtain the probability that a superpixel carries the label of the object under consideration. We maintain approximately equal proportions of positive and negative instances during training to prevent over-fitting, and we use a negative mining strategy that utilizes co-occurrence statistics to select important instances from a large pool of negative instances. The important negative instances are sampled according to the distribution of the object's co-occurrence with other objects in the training images: we select negative samples from a particular object class in proportion to that class's co-occurrence value in the distribution. For example, during training for books, we sample more negative instances from the bookshelf class than from infrequently co-occurring objects, e.g., toilet. A model trained with this negative mining strategy shows improved overall accuracy compared to random selection of negative instances.
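A minimal sketch of this co-occurrence driven negative mining is given below; the data structures (a per-class co-occurrence count table and per-class instance pools) and the function name are assumptions made for illustration.

import numpy as np

def sample_negatives(target_class, cooccurrence, instances_by_class, n_neg, rng=None):
    # cooccurrence: dict class -> co-occurrence count with target_class
    # instances_by_class: dict class -> list of superpixel feature vectors
    rng = rng or np.random.default_rng(0)
    classes = [c for c in cooccurrence if c != target_class]
    weights = np.array([cooccurrence[c] for c in classes], dtype=float)
    if weights.sum() == 0:
        weights = np.ones_like(weights)          # fall back to uniform sampling
    probs = weights / weights.sum()              # e.g. 'bookshelf' gets a large share
                                                 # when training the 'books' classifier
    negatives = []
    for c, k in zip(classes, rng.multinomial(n_neg, probs)):
        pool = instances_by_class[c]
        idx = rng.choice(len(pool), size=min(k, len(pool)), replace=False)
        negatives.extend(pool[i] for i in idx)
    return negatives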


D. Sequential combination of binary segmentations

Traditional approaches for semantic segmentation look for all possible labels in the scene [27, 8]: they either train a multi-class CRF or one-vs-all classifiers for all semantic labels and compute confidences for every label. We next describe our sequential object selection mechanism and the learning of the object selection policy using reinforcement learning. The resulting policy dynamically selects a sequence of objects based entirely on the image content.

Fig. 3. a) Conflict resolution for the intersected region between two competing object segments, as discussed in Section III-D. b) Learned weight vector w_π corresponding to the learned policy for the living room in the 3-fold cross-validation experiment. Yellowish entries denote positive values, whereas bluish entries denote negative values. The y-axis shows the actions, and each row shows the weights of one action block.

We generate the binary object/background segmentations for the different objects in each scene and sequentially combine their predicted segmentation masks. To resolve the conflict in an overlapping region, we use a heuristic that considers the overlap ratios of the two competing segments. For example, in the sequential combination process for a bedroom scene, the object bed is considered first, followed by the object pillow. Let C be the overlapping region between segments A and B, which carry different labels, as shown in Figure 3. The regions decompose as A = C + A' and B = C + B', where A' and B' are the non-intersected portions of segments A and B respectively. Prior to considering the pillow object in the sequential process, the overlapping part C had been assigned the label bed, as had A'. The non-overlapping part of the pillow segment, B', has the label pillow. We use a segment-size-ratio heuristic, which prefers small segments that can lie on top of larger segments, e.g., a pillow can lie on top of a bed: we compute the ratios C/A and C/B and assign C the label of the segment with the larger ratio. For an image, we first combine the binary segmentations of the three background classes in the order wall, floor, and ceiling, irrespective of the scene, and then follow with a sequential selection of objects given by a policy learned using reinforcement learning.
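The following sketch illustrates the segment-size-ratio heuristic; the code and naming (a label map with -1 for still-unlabeled pixels) are our own illustrative choices.

import numpy as np

def resolve_overlap(label_map, new_mask, new_label):
    # Assign the incoming segment's pixels, resolving overlaps with already
    # labeled segments via the ratios C/A and C/B from Section III-D.
    out = label_map.copy()
    area_new = new_mask.sum()                      # |B| for the incoming segment
    for old_label in np.unique(label_map[new_mask]):
        if old_label < 0:                          # unlabeled pixels handled below
            continue
        overlap = new_mask & (label_map == old_label)     # the region C
        area_old = (label_map == old_label).sum()  # |A| for the existing segment
        # The overlap goes to the segment it covers a larger fraction of.
        if overlap.sum() / area_new > overlap.sum() / area_old:
            out[overlap] = new_label
        # else: the overlap keeps its existing label
    out[new_mask & (label_map < 0)] = new_label    # unlabeled pixels always taken
    return out

Note that comparing C/A with C/B reduces to preferring the smaller of the two overlapping segments, which is exactly the on-top-of preference described above (pillow over bed).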

1) Markov Decision Processes: In this section we review the reinforcement learning framework based on least-squares policy iteration introduced in [15]. The general framework for learning control policies (sequences of actions) from experience and rewards is that of Markov Decision Processes (MDPs). An MDP models the agent acting in the environment as a tuple {S, A, T, R}, where S is a set of states, A is a finite set of actions, T(s, a, s') is the transition model describing the probability of ending up in state s' after executing action a in state s, and R(s, a, s') is the reward obtained in state s when the agent executes action a and transitions to s'. The goal of solving an MDP is to find a policy π : S → A for choosing actions in states such that the cumulative future reward is maximized. If the parameters T and R are known, the optimal control policy can be computed efficiently using value iteration. When the parameters of the MDP are not known, reinforcement learning can be used to learn the optimal policy. One set of RL techniques aims at learning the model parameters T and R and then proceeds with standard value iteration; other, model-free techniques learn the policy directly. For our problem we adopt the model-free approach and apply least-squares policy iteration, introduced in [15] and also used in [13] and [11] in the context of robotics and image understanding problems.

2) Model: In our case the state s ∈ S consists of our current estimate of the distribution of the objects in the scene, P(B) = {P(B_1), P(B_2), ..., P(B_K)}, where P(B_i) is the probability that an object of class i is present in the image. Additionally, we keep track of the objects that have already been taken into consideration in an observation variable O. Each action a_i ∈ A corresponds to the selection of the object/background segmentation for object i in the sequential semantic segmentation. Our reward function R is defined in terms of the pixel-wise frequency-weighted Jaccard Index computed over the set of actions taken at any stage of an episode. The Jaccard Index JI_i is the intersection-over-union ratio between object i's predicted segmentation and the ground-truth segmentation. In the training stage we consider each image as an episode; the length of an episode is the maximum number of actions in our action set A. Let us assume that in the current episode, starting with action a_1, we have reached position k of the episode by taking a_k. The MDP state maintains the actions taken so far in the observation variable O. We compute the current reward R(s, a_k, s') in state s as follows:

R(s, a_k, s') = w_1 JI_{a_1} + w_2 JI_{a_2} + \ldots + w_k JI_{a_k} + r_{bg},
r_{bg} = w_{wall} JI_{wall} + w_{floor} JI_{floor} + w_{ceiling} JI_{ceiling}    (3)

Here w_i is the ratio of the number of pixels predicted for the segmented object i to the total number of pixels in the image, and r_{bg} is the pixel-wise frequency-weighted Jaccard Index score of the three fixed background classes. Our dynamic policy relies on the state-action value function Q^π(s, a), which assigns a real-valued score to each state-action pair, as defined in Equation 4.

Q^{\pi}(s, a) = \mathbb{E}\left[ R(s, a, s') + \gamma Q^{\pi}(s', \pi(s')) \right]    (4)

The policy is defined on Q^π(s, a) as given by Equation 5. Our state space is continuous and intractably large, hence we approximate the state-action value function with a linear function of features ψ(s, a) and a weight vector w_π:

\pi(s) = \arg\max_a Q^{\pi}(s, a) = \arg\max_a w_{\pi}^T \psi(s, a)    (5)
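For concreteness, a small sketch of the frequency-weighted Jaccard reward of Equation 3 is shown below; the containers and names are ours, and background classes are assumed to be passed in alongside the selected objects.

import numpy as np

def jaccard(pred_mask, gt_mask):
    # Intersection-over-union of a predicted and a ground-truth binary mask.
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred_mask, gt_mask).sum() / union

def episode_reward(pred_masks, gt_masks):
    # pred_masks / gt_masks: dict class name -> boolean pixel mask.  Every
    # selected class (including the fixed background classes that make up
    # r_bg) contributes w_i * JI_i, where w_i is the fraction of image pixels
    # predicted as class i, matching Eq. (3).
    total_pixels = next(iter(pred_masks.values())).size
    reward = 0.0
    for name, pred in pred_masks.items():
        w_i = pred.sum() / total_pixels
        reward += w_i * jaccard(pred, gt_masks[name])
    return reward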

3) State Featurization of the MDP: Our MDP is not fully observable, hence we follow the Augmented MDP approach [11, 13], which includes uncertainty variables in the state featurization. ψ(s, a) is defined by a block-featurization approach, where features specific to each action are inserted into a specific block. All blocks are concatenated

one after another to give a fixed-length feature vector. Each block in ψ(s, a) has a fixed order of entries:

\psi_a(s, a) = \{ P(B^a_{prior}), P(B_1|O), \ldots, P(B_K|O), U(B_1|O), \ldots, U(B_K|O) \}    (6)

Here P(B^a_{prior}) is the prior probability of the object a in the image, P(B_i|O) are the object probabilities conditioned on the current set of observations, and U(B_i|O) are the uncertainties conditioned on the observations, computed as the Shannon entropy of P(B_i|O). Each block is of size 1 + 2K, where K is the number of object classes. If the size of the action set is |A|, then our featurization function ψ(s, a) has size |A|(1 + 2K). All blocks except the one for the selected action are zero in this featurization.

4) Modeling the Featurization Components: The block featurization of action a is computed as follows. The prior term P(B^a_{prior}) is computed from the MAP probability of the binary segmentation for object a. The observation terms update the belief state each time an action is performed: the selection of an object changes our belief about the presence or absence of that object in the image. To capture this information, we update the observation term P(B_i|O) for the i-th object, if action a_i is in the observation vector O, by the following equation:

P(B_i|O) = \max\{ P(B_i|CC^1), \ldots, P(B_i|CC^M) \}    (7)

Each CC^j term is a connected component from the segmentation mask of object i. Each P(B_i|CC^j) is computed from the output of a classifier referred to as RUSBoost [21]. We train the classifier on the ground-truth segmentation masks for each object in the training images, using the same features as described in Section III-B. P(B_i|O) retains the value P(B_i) if action a_i is not in the observation vector O.

5) Reinforcement Learning for the Policy: The goal is to learn a policy that maximizes the state-action value function Q^π(s, a) defined in Equation 4. Q^π(s, a) is defined recursively through the expected cumulative future rewards; the discount factor γ controls how much future rewards are taken into consideration. We learn our policy using Least-Squares Policy Iteration (LSPI), a model-free reinforcement learning approach. Instead of directly estimating Q^π(s, a) as defined in Equation 4, LSPI learns a linear function approximation of Q^π(s, a):

Q^{\pi}(s, a) \approx w_{\pi}^T \psi(s, a)

The weight vector w_π is estimated from the training episodes by solving a linear system A w_π = b, where the matrix A and the vector b are accumulated from the observed state, action, and reward samples, following the least-squares policy iteration procedure of [15].
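To indicate how the weights can be obtained, the sketch below shows one LSTDQ evaluation step of standard LSPI [15] for the linear model above; it is a generic illustration under an assumed sample format, not the exact training loop used in the experiments.

import numpy as np

def lstdq(samples, featurize, w, n_actions, gamma=0.9, ridge=1e-3):
    # samples: list of (s, a, r, s_next) transitions collected from episodes
    # featurize(s, a): returns the block feature vector psi(s, a) as a 1-D array
    dim = featurize(samples[0][0], samples[0][1]).shape[0]
    A = ridge * np.eye(dim)                 # small ridge term keeps A invertible
    b = np.zeros(dim)
    for s, a, r, s_next in samples:
        phi = featurize(s, a)
        # Greedy next action under the current weights, as in Eq. (5).
        a_next = max(range(n_actions), key=lambda act: featurize(s_next, act) @ w)
        A += np.outer(phi, phi - gamma * featurize(s_next, a_next))
        b += r * phi
    return np.linalg.solve(A, b)            # updated weight vector w_pi

In full LSPI, this evaluation step alternates with the greedy policy it induces until the weights stop changing.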
