Exploiting Large Image Sets for Road Scene Parsing

Jose M. Alvarez, Mathieu Salzmann, Nick Barnes

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council (ARC) through the ICT Center of Excellence Program. This research was supported by the ARC through its Special Research Initiative (SRI) in Bionic Vision Science and Technology grant to Bionic Vision Australia (BVA). The authors are with the Computer Vision Research Lab at NICTA, Canberra, ACT, Australia.
Abstract—There is increasing interest in exploiting multiple images for scene understanding, with great progress in areas such as co-segmentation and video segmentation. Jointly analyzing the images in a large set offers the opportunity to exploit a greater source of information than when considering a single image on its own. However, this also yields challenges, since, to effectively exploit all the available information, the resulting methods need to consider not just local connections, but to efficiently analyze the similarity between all pairs of pixels within and across all the images. In this paper, we propose to model an image set as a fully-connected pairwise Conditional Random Field (CRF) defined over the image pixels, or superpixels, with Gaussian edge potentials. We show that this lets us co-label the images of a large set efficiently, yielding increased accuracy at no additional computational cost compared to sequential labeling of the images. Furthermore, we extend our framework to incorporate temporal dependencies, thus effectively encompassing video segmentation as a special case of our approach, and to model label dependencies over larger image regions. Our experimental evaluation demonstrates that our framework lets us handle over ten thousand images in a matter of seconds.

Index Terms—Co-segmentation, image parsing, large scale.
I. INTRODUCTION

Huge numbers of images are being acquired and made accessible daily. Such large collections of images clearly contain more information than an individual image. It therefore seems natural, and crucial, to design computer vision algorithms that exploit large image sets. In particular, for scene parsing, or semantic segmentation, appearance similarities across multiple images may provide a more reliable source of information than solely looking at local regions in an individual image. However, exploiting such rich information efficiently remains an open challenge.

Video segmentation [1], [2], [3], [4], [5], [6] can be thought of as a solution to the image set labeling problem. However, it explicitly relies on the temporal nature of the data, and existing methods rarely model long-range connections between image pixels. In contrast, co-segmentation [7] was introduced as a solution to improve segmentation by relying on the appearance similarity of a foreground object observed in multiple non-sequential images. Recent advances [8], [9], [10], [11], [12], [13] handle larger image sets and multiple foreground objects. However, the computational cost of even the most efficient
methods remains a limiting factor in exploiting the huge image repositories that arise in, for example, continent-scale mapping.

In this paper, we propose to leverage large image sets for scene parsing by modeling a large collection of images as a fully-connected pairwise Conditional Random Field (CRF). This lets us account for the relationships between all the pixels in all the images of the set. Given suitable pairwise potential functions, inference can be performed efficiently using the method of [14], which employs a mean-field approximation in conjunction with Gaussian filtering. As evidenced by our experiments, our framework allows us to handle 10,000 images in less than a minute, and 100,000 images in just over 10 minutes.

We first focus on a general formulation that allows us to efficiently jointly label multiple independent images. We then introduce an approach to modeling temporal dependencies across images, thus specializing our approach to the problem of video segmentation. Finally, we show how label dependencies over larger image regions can be incorporated within our framework, thus in essence encoding higher-order connections while retaining a pairwise CRF.

Our experimental evaluation demonstrates the benefits of our formulation in the presence of both sets of independent images and video sequences. To this end, we make use of the MSRC-21 dataset, the CamVid dataset and our own dataset of 100,000 road images. Our approach yields a significant accuracy gain compared to sequential processing of the individual images, as well as benefits over existing methods in terms of scalability and speed.

This paper extends our previous work [15] in several ways. In particular, (i) we introduce the idea of region-based label consistency in our framework; (ii) we provide a more thorough experimental evaluation of our different algorithms; (iii) we conduct additional experiments to validate the method and the newly introduced label consistency extension.

II. RELATED WORK

Analyzing the appearance structure of pixels in image sets appears in several recent threads of computer vision research. In particular, our discussion addresses the two approaches that have the strongest connections to our work: video segmentation and cosegmentation. Video segmentation seeks to use spatio-temporal constraints to produce better segmentation. While many techniques based on interest points have been proposed (e.g., [16], [17], [18]), we focus on recent appearance-based methods, which are more closely related to this work.
Fig. 1. Existing semantic segmentation algorithms focus on pixel neighborhoods within an image. In contrast, we propose to model an image set as a fully-connected pairwise Conditional Random Field (CRF) defined over the image pixels with Gaussian edge potentials. In addition, we introduce the idea of region-based label consistency in the same framework.
A major current approach to appearance-based video segmentation consists of tracking regions, for instance with a Kalman filter [19], by employing mean-shift [1], or with a circular dynamic time warping approach [2]. In [20], tracking ideas were combined with frame-by-frame graph-based segmentation. The use of graph-based techniques has often been advocated for video segmentation (e.g., [21], [22]). In particular, in [3], the algorithm of [23] for static images was extended to video data by considering temporal neighbors in a hierarchical graph of superpixels. To improve efficiency, this hierarchy was broken up, thus allowing out-of-core processing of a 40s video in 20 minutes. This processing time was sped up to around 0.25 fps in [5] via an approximation framework for streaming hierarchical video segmentation. In [4], long-range motion cues from past and future frames were incorporated in the framework of [3]. While often ignored, such long-range cues can be crucial to overcome occlusions and corrupted frames, which can break the temporal prior. The use of object-like appearance statistics in graph-based methods was introduced in [24], thus reducing the tendency of some methods to oversegment. In contrast with most methods, which only treat short sequences, processing web-scale video with a MapReduce framework was proposed in [25]. Although this allowed processing 20,000 videos in 30 hours, it requires access to a huge cluster, i.e., 5,000 nodes in [25].

While effective in the right context, the assumption of smooth spatio-temporal evolution of objects makes video segmentation approaches unsuitable to handle collections of non-sequential images. This limitation has been addressed by cosegmentation methods, which perform segmentation on unordered sets of data. Cosegmentation was introduced in [7] to segment the foreground and background regions in an image pair using iterated graph cuts with a histogram matching cost. While many approaches have focused on the foreground-background segmentation case (e.g., [8], [9]), recent years have seen the development of several multi-class approaches. In particular, the method of [11] combines spectral and discriminative clustering terms to cosegment multiple images. In [10], some steps were taken towards achieving efficient cosegmentation of
larger image sets. The problem is addressed via iterative graph cuts and subspace estimation, which yields a processing time linear in the number of images, at 25-100 seconds per image. To the best of our knowledge, the most efficient approaches to cosegmentation were introduced in [12], [13]. In [12], cosegmentation is cast as an anisotropic heat diffusion problem with a heat source for each object to be segmented. This method takes a few seconds per image, and is highly parallelizable and thus scalable to large image sets. In [13], a method designed to cosegment a subset of K foregrounds from multiple images was introduced, referred to as multiple foreground cosegmentation. Both supervised and unsupervised solutions were proposed. The resulting algorithm takes about one second per frame and was demonstrated on at most 1,000 images. In comparison, our approach takes less than 10 ms per frame, and was applied to jointly segment over 100,000 images.

III. FULLY-CONNECTED CRFS FOR LARGE IMAGE SETS
In this section, we introduce our fully-connected CRF model of an image set. We then show how this model can be used for efficient semantic co-labeling of large sets of images, as well as for video segmentation.

Our goal is to account for all the information available in a large set of images. Therefore, we seek to encode the connection between any pair of pixels, or superpixels, within and across images. More specifically, let I = {I_1, ..., I_F} be a set of F images, and X_i = {x_1^i, ..., x_{N_i}^i} be the set of random variables associated with image I_i, i.e., each x_j^i corresponds to either a pixel or a superpixel. Each random variable x_j^i can be assigned a label in the set L = {l_1, ..., l_L}, typically describing the object that (super)pixel x_j^i belongs to.

Our fully-connected pairwise CRF models the joint distribution of the random variables {X_i} given the images. This distribution can be expressed as

P(X_1, \ldots, X_F \mid I) = \frac{1}{Z} \exp\bigl( -E(X_1, \ldots, X_F \mid I) \bigr),   (1)

where Z is the partition function and E(\cdot) is the energy function corresponding to the problem.
In particular, this energy takes the form

E(X_1, \ldots, X_F \mid I) = \sum_{i=1}^{F} \sum_{j=1}^{N_i} \psi_u(x_j^i \mid I_i) + \sum_{i,i'=1}^{F} \sum_{j,j'=1}^{N_i, N_{i'}} \psi_p(x_j^i, x_{j'}^{i'} \mid I_i, I_{i'}),   (2)

where ψ_u(·) and ψ_p(·,·) denote the unary and pairwise potentials, respectively. The unary potential encodes the cost of assigning a specific label to a variable, and the pairwise potential measures the cost of the different possible class assignments for any pair of variables. Given this model, inference is achieved by trying to find a labeling that (approximately) maximizes the distribution of Eq. 1.

In our framework, the unary potentials can take any form. In practice, we rely on a set of supervised images to train a classifier in a learning phase, and use this classifier at test time to predict the probability of each label being assigned to each (super)pixel. The unary potentials are then computed as the negative logarithm of these probabilities. By contrast, to perform efficient inference, we restrict the pairwise potential function to a mixture of Gaussians, as discussed below.

A. Semantic Co-Labeling of Large Image Sets

Let us for now set aside the case of superpixels and consider the scenario where the nodes in our CRF represent the pixels in the images of the set. To leverage the similarities across all pixels in all images, as mentioned above, we make use of a fully-connected pairwise CRF. Recently, efficient inference methods for fully-connected CRFs have been proposed [14], [26], [27]. Here, we rely on the approach of [14], which combines a mean-field approximation with efficient Gaussian filtering. More specifically, to be able to perform efficient inference, the pairwise potential function in Eq. 2 is constrained to take the form
\psi_p(x_j^i, x_{j'}^{i'} \mid I_i, I_{i'}) = \mu(x_j^i, x_{j'}^{i'}) \sum_{m=1}^{M} w_m \, k_m(v_j^i, v_{j'}^{i'}),   (3)

where µ(·,·) is a label compatibility function, which, in practice, encodes a Potts model, i.e., µ(x_j^i, x_{j'}^{i'}) = 1[x_j^i ≠ x_{j'}^{i'}], with 1[·] the indicator function. The kernel k_m(·,·) is a Gaussian kernel computed over the feature vector v_j^i that describes pixel j in image i. In particular, we utilize spatial smoothness and appearance similarity kernels written as

k_1(v_j^i, v_{j'}^{i'}) = \exp\left( -\frac{\| p_j^i - p_{j'}^{i'} \|^2}{\sigma_p^2} - \frac{\| i - i' \|^2}{\sigma_f^2} \right),   (4)

k_2(v_j^i, v_{j'}^{i'}) = \exp\left( -\frac{\| p_j^i - p_{j'}^{i'} \|^2}{\sigma_c^2} - \frac{\| I_j^i - I_{j'}^{i'} \|^2}{\sigma_l^2} \right),   (5)

where p_j^i and I_j^i encode the image location and color vector of pixel j in image i. Note that, to discourage spatial smoothness across two different images, k_1 depends on the image index. Setting σ_f to a very low value will make this kernel vanish for any two random variables that do not belong to the same image.
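To make these kernels concrete, the following sketch (our illustration, not the authors' code; the function name and σ values are placeholders) builds the bandwidth-scaled feature vectors that k_1 and k_2 operate on for every pixel of every image in the set. With features divided by their bandwidths, each kernel reduces to exp(−||f − f'||²), which is the form typically consumed by Gaussian-filtering-based inference.

```python
import numpy as np

def crf_features(images, sigma_p=3.0, sigma_f=1e-3, sigma_c=60.0, sigma_l=10.0):
    """images: list of F HxWx3 RGB arrays.
    Returns the stacked, bandwidth-scaled features for k1 and k2 (Eqs. 4-5).
    All parameter values are illustrative placeholders."""
    f1, f2 = [], []
    for i, img in enumerate(images):
        H, W, _ = img.shape
        ys, xs = np.mgrid[0:H, 0:W]
        pos = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)
        col = img.reshape(-1, 3).astype(np.float64)
        idx = np.full((pos.shape[0], 1), float(i))
        # k1: spatial smoothness plus the image index; a tiny sigma_f makes
        # the kernel vanish for pixels of different images.
        f1.append(np.hstack([pos / sigma_p, idx / sigma_f]))
        # k2: appearance similarity, shared across *all* images of the set.
        f2.append(np.hstack([pos / sigma_c, col / sigma_l]))
    return np.vstack(f1), np.vstack(f2)
```

Stacking the per-image features into single arrays is what turns the whole set into one fully-connected CRF over all (super)pixels.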
In this setting, finding the maximum a posteriori (MAP) assignment of the distribution P(X_1, ..., X_F | I) can be approximately achieved using a mean-field method. To this end, an alternative distribution Q(X_1, ..., X_F | I) is introduced, with the assumption that Q factorizes into a product of distributions Q_j^i over the individual random variables, i.e., Q = \prod_{i,j} Q_j^i(x_j^i). Minimizing the KL-divergence between Q and P then yields an iterative algorithm, where the factors of the distribution Q are updated as

Q_j^i(l) = \frac{1}{Z_j^i} \exp\Bigl( -\psi_u(x_j^i = l) - \sum_{l' \in L} \mu(l, l') \sum_{m=1}^{M} w_m \sum_{(i',j') \neq (i,j)} k_m(v_j^i, v_{j'}^{i'}) \, Q_{j'}^{i'}(l') \Bigr).   (6)

After convergence, or a fixed number of iterations, the assignment for each variable is taken as x_j^i = arg max_l Q_j^i(l).

As can be seen from Eq. 6, for each variable, each iteration of the algorithm requires summing over all other random variables. A naive implementation of this approach would thus have a computational cost of O((FN)^2 M L^2) per iteration, which quickly becomes intractable. However, the updates can be performed much more efficiently by observing that, with a Gaussian kernel, the expensive summation boils down to performing high-dimensional Gaussian filtering [14]. Many efficient, approximate Gaussian filtering methods have been proposed in the past [28], [27]. In particular, here, we employ the permutohedral lattice formulation of [28]. This formulation relies on three steps, which can be summarized as follows:
1) Splatting: The d-dimensional feature for each variable is mapped to a lattice and represented in terms of barycentric coordinates with respect to the (d + 1) nearest vertices of the lattice. The value at each vertex of the lattice is computed based on the values, i.e., Q_j^i, of the variables attached to this vertex and their corresponding barycentric coordinates.
2) Blurring: The values at the vertices of the lattice are blurred locally with a (truncated) Gaussian.
3) Slicing: The filtered values, i.e., the updated Q_j^i s, at the variables are then obtained based on the blurred lattice values and the barycentric coordinates.
The overall computational cost of this procedure can be shown to be O(FNML^2). Importantly, this implies that the cost of performing inference in our CRF, which makes use of all the images in the set simultaneously, is the same as the cost of performing inference sequentially on the individual images in the set. However, as will be evidenced by our experiments, labeling the images jointly yields higher accuracy than labeling them sequentially. In other words, given a set of images, our approach yields better results than sequential processing at no additional cost.

Note that, while we have described our algorithm in terms of pixel labels, it easily extends to superpixels, thanks to the full connectedness of the graph. In practice, we then use the centroid location and the mean color over the pixels as features in the Gaussian kernels. This allows our method to scale up to large image sets, i.e., tens of thousands of images.
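For intuition only, the sketch below implements the mean-field update of Eq. 6 by brute force, with the quadratic cost discussed above; the method of [14], [28] obtains the same update by replacing the dense kernel products with permutohedral-lattice filtering. This is a minimal illustration under the Potts compatibility assumption; the names are ours and the code is not the authors' implementation.

```python
import numpy as np

def gaussian_kernel_matrix(feats):
    """feats: (P, d) bandwidth-scaled features. Returns K[j, k] = exp(-||f_j - f_k||^2)."""
    sq = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq)

def naive_mean_field(unary, feats_per_kernel, weights, n_iters=5):
    """unary: (P, L) unary potentials (negative log label probabilities).
    feats_per_kernel: list of (P, d_m) feature arrays, one per Gaussian kernel.
    Brute-force O(P^2) message passing -- only feasible for tiny problems."""
    Ks = [gaussian_kernel_matrix(f) for f in feats_per_kernel]
    Q = np.exp(-unary)
    Q /= Q.sum(1, keepdims=True)                      # initialise from the unaries
    for _ in range(n_iters):
        msg = np.zeros_like(Q)
        for w, K in zip(weights, Ks):
            # sum_m w_m sum_{k != j} k_m(v_j, v_k) Q_k(l); subtract the self term
            msg += w * (K @ Q - np.diag(K)[:, None] * Q)
        # Potts compatibility mu(l, l') = 1[l != l']: penalise mass on other labels
        compat = msg.sum(1, keepdims=True) - msg
        Q = np.exp(-unary - compat)
        Q /= Q.sum(1, keepdims=True)                  # the 1/Z normalisation
    return Q.argmax(1)                                # per-variable assignment
```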
B. Semantic Video Labeling

The approach described in the previous section can easily be specialized to video segmentation. Indeed, a video is in essence a set of images with an additional temporal ordering. Further accounting for this ordering can be achieved by modifying the kernels in Eqs. 4 and 5. The smoothing kernel retains the same form, but we increase its variance σ_f^2 to allow spatial smoothness across neighboring images. To model the fact that the appearances of pixels in neighboring images are more strongly correlated, we re-write the appearance kernel as

k_2'(v_j^i, v_{j'}^{i'}) = \exp\left( -\frac{\| p_j^i - p_{j'}^{i'} \|^2}{\sigma_c^2} - \frac{\| I_j^i - I_{j'}^{i'} \|^2}{\sigma_l^2} - \frac{\| i - i' \|^2}{\sigma_t^2} \right),

where σ_t is typically larger than σ_f to account for longer-range dependencies. The rest of our formulation remains unchanged, which lets us efficiently segment videos of several thousands of frames. Our approach also naturally extends to co-labeling multiple videos by adding a video index to k_1, so as to only allow spatial smoothing between neighboring frames of the same video.

C. Semantic Co-Labeling with Region-based Consistency

Although fully connected, the models described in the previous sections are limited to pairwise connections between the CRF nodes. In the context of single-image semantic labeling, it has, however, been shown that accounting for higher-order dependencies can be beneficial in terms of accuracy, albeit at significantly greater computational cost [27]. Here, we introduce an approach to encoding label consistency within larger image regions, while retaining the pairwise nature of our model. When the nodes of our model represent image pixels, superpixels come as a natural choice for regions. Note, however, that this approach still applies to the scenario where the nodes of our CRF are superpixels, for example by making use of superpixel hierarchies [29].

More specifically, to favor label consistency within larger image regions, we incorporate an additional kernel k_3 into the pairwise potential ψ_p(x_j^i, x_{j'}^{i'} | I_i, I_{i'}) defined in Eq. 3. This new kernel encodes the intuition that nodes belonging to the same region should favor taking the same label, and can thus be expressed as

k_3(v_j^i, v_{j'}^{i'}) = \exp\left( -\frac{\| r_j^i - r_{j'}^{i'} \|^2}{\sigma_r^2} \right),   (7)

where r_j^i represents the index of the region containing node j in the i-th image. Setting σ_r to a very low value will make this kernel vanish for any two random variables that do not belong to the same region. This additional term therefore lets us encourage (super)pixels belonging to the same region to share the same label, while retaining the same pairwise CRF framework as before.

This formulation bears similarity with the common post-processing step of forcing the pixels belonging to the same superpixel to take the same label via a majority voting scheme [30]. However, our formulation treats region consistency as a soft constraint, thus allowing us to take into account the mistakes of the superpixelization techniques. Furthermore, this term comes at virtually no cost in our efficient inference procedure, rather than involving an additional post-processing step. Importantly, as evidenced by our experiments, our approach yields better results than this common post-processing strategy.
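Viewed through the filtering formulation above, both the temporal term of k_2' (Sec. III-B) and the region kernel k_3 of Eq. 7 simply add one more bandwidth-scaled dimension to the node features. The snippet below illustrates this reading; the function names and σ values are our own placeholders, not the paper's settings.

```python
import numpy as np

def temporal_feature(frame_idx, sigma_t=5.0):
    """frame_idx: (P,) frame number of each node. Extra feature dimension that
    turns the appearance kernel k2 into the temporal kernel k2'."""
    return frame_idx.astype(np.float64).reshape(-1, 1) / sigma_t

def region_feature(region_ids, sigma_r=1e-3):
    """region_ids: (P,) region index of each node (e.g., its superpixel),
    assumed unique across images. With a very small sigma_r, the resulting
    Gaussian kernel k3 is ~1 within a region and ~0 across regions."""
    return region_ids.astype(np.float64).reshape(-1, 1) / sigma_r
```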
IV. EXPERIMENTS

We evaluate our approach on two standard benchmark datasets for multi-class segmentation: the MSRC-21 dataset [31] and the Cambridge-driving Labeled Video dataset (CamVid) [32]. Quantitative evaluation is performed using pixel-wise comparisons of the obtained segmentations with the ground truth. We report global and per-class average accuracies [31]. The former is the ratio of correctly classified pixels to the total number of pixels in the test set. The latter is computed as the average, over all classes, of the ratio of correctly classified pixels in a class to the total number of pixels in that class. All the experiments were conducted using single-threaded code on a standard desktop with an Intel Core i7-2600 CPU at 3.40 GHz.

For each dataset, we employed the training set to learn pixel-level classifiers from the same image features as in [37], using the publicly available DARWIN framework of [38]. These classifiers were used to generate unary potentials on the remaining test (and, for MSRC-21, validation) images. For MSRC-21, we also conducted an additional experiment using the more accurate TextonBoost unary potentials of [14]. The parameters of our algorithm were set by cross-validation on the MSRC-21 validation set and kept fixed for all our experiments, including those on the CamVid dataset. When working at superpixel level, we used SLIC [39] to compute an oversegmentation of the images, which takes approximately 55 milliseconds to obtain 3000 superpixels in one image. These superpixels were also used as regions for the label consistency term in our pixel-level CRF.

In our experiments, we compare the accuracy of our semantic co-labeling approach with the results obtained from the unary potentials only, as well as by sequentially performing inference in a fully-connected CRF over each individual image [14]. For this baseline, the parameters of the CRF were fixed to the values reported in [14].

A. MSRC-21

The MSRC-21 dataset consists of 591 color images of size 320 × 213 with corresponding ground-truth labelings of 21 object classes. In particular, we make use of the accurate ground truth of [14]. Due to its relatively small size, our experiments on this dataset were performed at pixel level.

To jointly segment images that contain similar classes, we consider the groups of images defined by their main category as image sets. Note, however, that these images contain up to 9 different classes and that not all classes are present in all the images. Furthermore, since our unary potentials were obtained from classifiers trained on all 21 classes, they may give a high probability to a class that never even appears in the image set. In MSRC-21, there are 20 such groups, with 13 images per group on average.
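As a rough illustration of the superpixel-level setup described above, the sketch below oversegments an image with SLIC (here via scikit-image) and extracts the centroid location and mean color of each superpixel; these serve both as node features for the Gaussian kernels and as the regions of the label-consistency term. The parameter values are illustrative and not the ones used in the paper.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_nodes(image, n_segments=3000, compactness=10.0):
    """image: HxWx3 RGB array. Returns the SLIC label map together with the
    centroid (x, y) and mean color of every superpixel."""
    labels = slic(image, n_segments=n_segments, compactness=compactness)
    ids = np.unique(labels)
    centroids = np.zeros((len(ids), 2))
    mean_colors = np.zeros((len(ids), 3))
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    for k, r in enumerate(ids):
        mask = labels == r
        centroids[k] = [xs[mask].mean(), ys[mask].mean()]
        mean_colors[k] = image[mask].mean(axis=0)
    return labels, centroids, mean_colors
```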
TABLE I — MSRC-21: Comparison of our results using DARWIN and TextonBoost [14] unaries, computed over groups of images of the same main category, with baselines exploiting individual images. The number in parentheses after each group name indicates the number of classes present in that group, although not in all of its images. Note that our approach outperforms the baselines on most groups. For each method, per-group average / global accuracies (%) are listed in the group order given below.
Groups (num. classes): Building (7), Grass (5), Tree (5), Cow (3), Sheep (3), Airplane (8), Water (7), Face (8), Car (6), Bicycle (5), Flower (1), Sign (4), Bird (5), Book (1), Chair (4), Road (9), Cat (2), Dog (9), Body (9), Boat (6), Overall (21).
DARWIN Unary (Avg./Global): 73.6/76.1, 85.6/93.1, 81.9/90.2, 86.5/84.2, 78.5/88.9, 61.4/86.6, 53.9/66.1, 49.6/74.4, 63.0/68.6, 65.4/77.8, 71.5/71.5, 58.0/48.2, 66.4/70.5, 76.9/76.9, 64.0/61.6, 51.4/83.6, 73.5/75.0, 45.6/59.4, 60.1/72.8, 43.5/78.9, 65.3/76.4
DARWIN Pairwise (Avg./Global): 88.0/95.5, 91.3/95.5, 83.4/92.2, 90.8/89.2, 85.0/92.7, 62.7/88.6, 54.0/69.1, 50.2/76.3, 65.3/73.6, 68.5/82.5, 79.7/79.7, 52.3/68.3, 72.5/77.0, 81.6/81.6, 66.2/69.7, 53.0/87.6, 79.2/80.7, 51.1/65.9, 62.1/76.3, 43.9/80.4, 69.2/80.6
DARWIN Ours (Avg./Global): 72.1/89.4, 92.2/96.2, 91.8/93.2, 93.8/93.1, 85.4/94.1, 61.1/88.7, 50.8/71.3, 47.5/75.6, 64.4/75.4, 68.9/84.9, 86.3/86.3, 50.9/70.9, 74.6/78.9, 96.9/96.9, 71.6/68.3, 51.4/87.6, 81.0/82.9, 51.3/70.5, 60.3/75.3, 45.2/82.4, 71.2/82.7
TextonBoost Unary (Avg./Global): 86.7/86.8, 90.3/95.2, 80.6/94.7, 91.8/92.5, 73.7/91.4, 85.3/91.4, 57.9/58.2, 68.7/78.5, 63.1/84.5, 90.7/87.5, 92.6/92.6, 82.5/74.7, 79.3/77.6, 97.0/97.0, 81.2/68.3, 89.8/91.8, 82.9/82.3, 66.1/66.0, 73.8/80.7, 56.7/73.4, 76.2/83.6
TextonBoost Pairwise (Avg./Global): 90.1/89.5, 91.2/92.2, 79.1/94.7, 93.1/94.0, 71.3/91.9, 84.5/92.8, 58.0/60.9, 71.1/80.2, 60.4/87.9, 91.8/89.6, 96.7/96.7, 86.4/79.5, 82.3/79.3, 98.6/98.6, 84.8/71.1, 92.7/93.5, 87.1/86.3, 63.5/66.8, 71.6/82.6, 56.6/70.8, 78.2/85.5
TextonBoost Ours (Avg./Global): 86.7/90.2, 87.9/94.7, 77.9/94.7, 91.3/93.7, 68.6/91.7, 80.6/92.3, 59.8/64.0, 63.7/78.3, 60.1/87.8, 91.4/89.8, 97.8/97.8, 88.5/83.7, 83.5/80.7, 99.4/99.4, 88.1/82.3, 84.6/92.2, 89.2/88.5, 65.6/73.1, 75.7/85.0, 55.8/74.3, 80.9/86.8
Ours-R (Avg./Global): 85.5/90.4, 85.4/94.4, 78.3/95.1, 91.8/94.1, 69.2/92.0, 81.6/93.0, 59.8/64.9, 63.2/78.4, 60.2/88.0, 91.1/89.9, 97.6/97.6, 88.2/84.4, 83.5/81.1, 99.3/99.3, 88.1/83.3, 84.5/92.5, 89.6/89.0, 62.8/74.0, 75.2/84.8, 55.7/74.7, 81.6/87.4
TABLE II — MSRC-21: Comparison of our approach with the state of the art. For each class, per-class accuracies (%) are listed in the order: Shotton et al. [33] / Jiang and Tu [34] / HCRF+Coocc. [35] / Yao et al. [36] / Dense CRF [14] / Ours / Ours-R.
Build.: 49 53 74 71 75 75 76
Grass: 88 97 98 98 99 99 99
Tree: 79 83 90 90 91 90 89
Cow: 97 70 75 79 84 85 85
Sheep: 97 71 86 86 82 82 81
Sky: 78 98 99 93 95 95 95
Aeropl.: 82 75 81 88 82 83 87
Water: 54 64 84 86 71 71 72
Face: 87 74 90 90 89 90 90
Car: 74 64 83 84 90 90 88
Bicycle: 72 88 91 94 94 94 93
Flower: 74 67 98 98 95 95 96
Sign: 36 46 75 76 77 80 83
Bird: 24 32 49 53 48 48 54
Book: 93 92 95 97 96 96 96
Chair: 51 61 63 71 61 68 78
Road: 78 89 91 89 90 91 90
Cat: 75 59 71 83 78 89 83
Dog: 35 66 49 55 48 53 63
Body: 66 64 72 68 80 81 87
Boat: 18 13 18 17 22 23 23
Avg.: 67 68 77.8 79.3 78.3 79.3 81.4
Global: 72 78 86.5 86.2 86.0 86.4 87.2
TABLE III — MSRC-21: Analysis of the effect of larger regions. Note that, while using larger regions in a post-processing step degrades the performance of our approach, directly incorporating them as a new kernel in our pairwise term yields better accuracies. For each class, accuracies (%) are listed in the order: Unary / DenseCRF / Ours / Ours-PP / Ours-R.
Building: 71.9 74.5 75.3 74.8 75.3
Grass: 98.2 98.8 98.8 98.6 98.8
Tree: 89.7 91.4 90.5 89.6 90.5
Cow: 84.3 84.7 84.8 81.8 84.5
Sheep: 80.6 81.6 81.8 79.6 81.3
Sky: 93.3 95.0 95.3 94.8 94.9
Airplane: 82.4 81.9 82.9 80.5 87.2
Water: 67.5 70.5 70.9 71.0 72.4
Face: 88.2 89.6 89.8 88.9 90.2
Car: 84.2 89.9 89.9 89.2 88.1
Bicycle: 91.1 93.2 93.8 92.9 92.9
Flower: 90.7 94.7 95.2 95.3 95.6
Sign: 70.0 76.5 79.5 79.5 83.1
Bird: 47.6 47.1 47.7 46.4 53.7
Book: 94.1 95.7 96.1 95.2 96.0
Chair: 59.3 62.7 67.6 66.1 78.2
Road: 88.8 90.2 90.5 90.3 90.1
Cat: 75.7 78.0 78.9 78.3 82.6
Dog: 46.0 47.3 53.1 52.6 63.4
Body: 79.9 81.2 80.9 79.5 86.7
Boat: 25.2 23.5 22.6 22.8 22.7
Avg.: 76.6 78.5 79.3 78.5 81.4
Global: 84.0 85.9 86.4 85.8 87.2
In Table I, we show semantic co-labeling results over the 20 image groups, as well as our overall accuracies. Ours denotes our basic model defined in Section III-A, and Ours-R stands for the model incorporating the larger regions of Section III-C. Note that our approach, both with and without larger regions, outperforms the baselines in most categories. Furthermore, when computing the accuracy over all the images, our approach outperforms the fully-connected CRF on individual images by about 2%. While this may not seem like a large improvement, recall that it comes at no additional computational cost. Note also that the same pattern of improvement can be observed when using the unary
potentials from DARWIN (Table I, left side) or TextonBoost (Table I, right side). Importantly, as shown in Table II, our results obtained using the TextonBoost unaries provide the best global and average accuracies among the current state-of-the-art. Sample results on the MSRC-21 dataset are shown in Fig. 2.

We now analyze more closely the effect of adding the region-based label consistency term to our model. As mentioned in Section III-C, region-based consistency can be achieved in two different manners.
Fig. 2. Sample results from the MSRC-21 dataset (rows: Input, Ground Truth, Unary, DenseCRF, Ours, Ours-PP, Ours-R). Note the improvement of our co-labeling approach over the results obtained by performing inference sequentially on individual images.
The most common strategy involves a post-processing step that forces the pixels belonging to the same region to take the same label based on a majority vote. We denote by Ours-PP the method that applies such a post-processing step after our co-segmentation algorithm. Alternatively, and as proposed in Section III-C, label consistency can be encoded as a soft constraint, which, as mentioned earlier, we denote by Ours-R. In Table III, we compare the corresponding per-class and overall accuracies of these techniques. Note that accounting for regions in a post-processing step degrades the performance of our co-labeling approach. By contrast, the addition of our region-based label consistency kernel improves the performance of our basic approach. Interestingly, the results in Table III indicate that the effect of our region-based label consistency kernel is more pronounced for relatively small classes, such as chair, sign, bird, cat and dog. No such trend can be observed for the post-processing approach, which slightly degrades the performance of our basic approach for almost all classes.
Furthermore, in contrast to the post-processing step, our new kernel entails virtually no additional computational cost.

Finally, we analyze the sensitivity of the algorithm to the weight of the region-based label consistency kernel and to the number of regions in the image. Different numbers of regions were obtained by varying the parameters of the SLIC algorithm, which gives some control over the number of resulting superpixels. More precisely, we set four different values for the initial region size of the algorithm (i.e., 5, 8, 15, and 25). The regularizer that trades off appearance against spatial regularity during clustering was fixed to 0.01, leading to approximately 2700, 1100, 330 and 110 regions per image. Fig. 3 summarizes the results of these two experiments. The error bars represent the standard deviation with respect to the SLIC parameters.
Fig. 3. Sensitivity of our model with region-based label consistency to the kernel weight w3 on the MSRC-21 dataset. We plot the global (a) and average per-class (b) accuracies as a function of w3. The error bars represent the standard deviation with respect to variations in the number of regions (superpixels). As shown, our approach outperforms the baseline (regions as post-processing) and, more importantly, is more stable with respect to variations in the number and size of the regions.
Note that Ours-PP consistently yields a lower accuracy than Ours-R for any value of w3, including w3 = 0. The influence of our region-based kernel increases with w3 until it essentially stabilizes for a large enough value (i.e., around w3 = 13). Furthermore, we can observe that the influence of our region-based kernel is much more stable with respect to the number of superpixels in the images than the post-processing approach. This can be attributed to our use of soft constraints, which are less prone to over-smoothing than the hard decisions made by a post-processing step. These results let us conclude that incorporating higher-order knowledge within our model is typically beneficial for semantic segmentation. We may therefore expect further improvements by making use of more complex potentials, such as robust statistics or class co-occurrence [27], which we leave as a topic for future research.

B. CamVid

CamVid consists of four image sequences with ground-truth labels at 1 fps that associate each pixel with one of 32 semantic classes. Following the experimental setup of [40], [41], the images are down-scaled to 320 × 240 and the semantic classes are grouped into 11 categories. Furthermore, the dataset is divided into two subsets, Day and Dusk, with non-overlapping training and test sets. The details of this dataset are summarized in Table IV.

For our first experiment with this dataset, we trained two separate classifiers from the Day and Dusk training sets, and performed semantic co-labeling of their respective test sets at pixel level.
Fig. 4. Sample results from the CamVid-Dusk dataset.

TABLE IV — Summary of the CamVid dataset [32].
Seq 1 — Type: Day Test; #imgs: 6580; #labs: 171
Seq 2 — Type: Day Train; #imgs: –; #labs: 204
Seq 3 — Type: Day Train; #imgs: –; #labs: 101
Seq 4 — Type: Dusk Test / Dusk Train; #imgs: 3750; #labs: 62 / 62
Note that, in general, each of the 171 Day and 62 Dusk test images only contains a subset of the 11 classes, and that some of these classes (e.g., column-pole, fence or sign-symbol) only appear in a few images (see [32]). Table V shows the accuracy of our approach and of the baselines for the Dusk sequence, and Table VI reports the global and average accuracies over the entire dataset for our approach, the baselines and the most relevant state-of-the-art methods. In global accuracy, our approach outperforms our pairwise baseline and the baselines that rely on similar information as ours, i.e., the appearance model of [40] and the pairwise model of [41]. Note that, as opposed to [41], we did not learn our model parameters for CamVid, but re-used the MSRC ones. As a result, our approach provides a lower per-class average accuracy. This reflects the statistics of the dataset and the fact that the simple pairwise edge connections that we use tend to over-smooth the classes that rarely appear in the images. Note, however, that the effect of over-smoothing is reduced with our co-labeling approach compared to considering the images individually. This suggests that our approach finds connections between the small classes that appear in distant images and benefits from them (see, e.g., pedestrian and bicyclist). The other baselines rely either on additional cues (e.g., [42] and [40]), or on higher-order potentials (e.g., [41], [35] and [43]). Note that our model achieves a global accuracy that is close to these more complex models. Our approach could be extended to higher-order terms following [27], which we leave for future work. Sample results from the Dusk test sequence are shown in Fig. 4.

As a second experiment, we make use of CamVid to recover the geometric layout of road scenes [44].
TABLE V — CamVid: Comparison of our approach with the baseline methods on the 11 classes of the CamVid-Dusk dataset. Note that, due to the presence of small classes that only appear in few images, our per-class average accuracy is lower than the one obtained with unary potentials only. Note, however, that it is higher than the accuracy of the corresponding pairwise models. For each class, accuracies (%) are listed in the order: Unary / Pairwise / Ours / Ours-R.
Build.: 72.3 76.0 75.3 76.1
Tree: 81.8 83.5 82.5 84.3
Sky: 95.1 97.1 97.4 97.2
Car: 59.5 60.4 57.8 57.8
Sign-Sym.: 32.5 17.8 15.8 11.0
Road: 95.3 97.3 97.6 98.0
Pedest.: 37.0 28.9 26.2 19.6
Fence: 1.3 0.3 0.3 0.1
Col.-Pole: 14.9 6.1 6.5 3.3
Sidewalk: 75.6 69.7 68.2 67.9
Bicyclist: 51.1 50.9 47.1 45.9
Avg.: 56.0 53.5 52.2 51.0
Global: 80.5 81.5 81.1 81.6
TABLE VI — CamVid: Comparison of our approach with current state-of-the-art methods on the 11 classes of the CamVid dataset (Avg. / Global accuracy, %).
Unary: 53.3 / 80.5
Pairwise: 47.9 / 81.1
Ours: 46.7 / 81.4
Ours-R: 44.8 / 80.1
Unary [41]: 59.8 / 76.4
Appearance [40]: 52.3 / 66.5
Unary/learned pairwise [41]: 59.9 / 79.8
Higher order [41]: 59.2 / 83.8
Using detectors [35]: 62.5 / 83.8
Top down [43]: 59.6 / 83.2
Depth cues [42]: 55.4 / 82.1
SfM [40]: 43.6 / 61.8

TABLE VII — CamVid Scene Layout: Evaluation of our semantic co-labeling and video labeling approaches on the 3-class scenario (Horizontal / Vertical / Sky / Average / Global accuracy, %).
Unary: 89.1 / 67.1 / 55.4 / 70.7 / 75.6
Pairwise: 93.8 / 89.6 / 13.6 / 65.3 / 75.0
Ours: 89.4 / 71.7 / 55.6 / 72.3 / 77.3
Ours (Spix): 89.5 / 71.5 / 55.4 / 72.5 / 77.1
Ours (Video): 94.6 / 79.8 / 53.5 / 76.3 / 82.1
To this end, we grouped the 11 semantic labels into three classes: sky, horizontal surfaces (i.e., roads and sidewalks) and vertical surfaces. In this scenario, almost all the images contain the same number of classes. To simulate the use of noisy pixel classifiers learned from training data dissimilar to the test examples, we generate the unary potentials of the Day images with the classifiers trained on the Dusk data, and vice versa. We then evaluate our approach by simultaneously segmenting the Day test images and the Dusk images, which yields 295 images. Note that we can use the Dusk training images as test images, since their unary potentials were generated with the Day classifiers. Inference for these 295 images is performed at both pixel and superpixel levels. Additionally, we evaluate our video semantic labeling approach at superpixel level, concatenating the full Day test and Dusk sequences to yield a set of 10330 images.

Our results in this scenario are summarized in Table VII. Note that, here, the use of a pairwise term on individual images decreases the accuracy of the unary potentials on the sky class. This can be explained by the fact that the unary potentials obtained on the Dusk images with the Day classifier perform poorly on sky regions.
Fig. 5. Sample results on the CamVid scene layout estimation problem: Our approach allows us to nicely recover the scene layout from noisy unary potentials. The improvement is even larger when segmenting a full video.
When used in a single image, the pairwise term therefore tends to make entire sky regions disappear. As illustrated in Fig. 5, this issue is overcome by our co-labeling approach, which can make use of the presence of sky regions in other images. As a consequence, our semantic co-labeling approach has higher accuracy than the baselines. Accuracy increases further when making use of video data. Note that using superpixels instead of pixels yields virtually no loss in accuracy.
TABLE VIII — CamVid Scene Layout: Evaluation of semantic co-labeling using region-based label consistency on the 3-class scenario for the first 210 images in the Dusk sequence (Horizontal / Vertical / Sky / Average / Global accuracy, %).
Unary: 87.4 / 54.9 / 98.2 / 80.2 / 75.3
Pairwise: 90.5 / 58.1 / 97.7 / 82.1 / 77.9
Ours-Vid: 92.0 / 58.8 / 97.6 / 82.8 / 78.8
Ours-Vid-PP: 92.1 / 58.6 / 97.5 / 82.8 / 78.8
Ours-Vid-R: 94.3 / 59.3 / 95.6 / 83.0 / 79.7
TABLE IX — Runtimes for our CamVid scene layout experiments.
Pixel level, 171 images: 45.4 s
Superpixel level, 171 images: 1.6 ms
Superpixel level, 10K images: 51.5 s
Superpixel level, 100K images: 13.3 min
We now conduct a third experiment to test the benefits of adding the region-based label consistency term to our co-labeling approach. To this end, we select a subset of 270 consecutive images (the limit, memory-wise, for our pixel-level approach) from the Dusk CamVid sequence and compare the results of three instances of our video-based approach: co-labeling multiple images at pixel level (Ours-Vid), co-labeling multiple images at pixel level using superpixels for region post-processing (Ours-Vid-PP), and co-labeling multiple images at pixel level with the additional region-based label consistency term (Ours-Vid-R). The accuracies of these different approaches are shown in Table VIII and compared against the usual baselines. As for MSRC-21, while using larger regions in a post-processing step does not improve accuracy, our additional region-based kernel yields better performance.

Finally, to demonstrate the scalability of our approach, we performed semantic labeling of a video of over 100,000 frames obtained by concatenating three sequences of road images of 22K, 48K and 42K images, respectively. Working at superpixel level, this resulted in a total of 255.1 million nodes in our CRF (i.e., approximately 2278 superpixels per image). The sequences were acquired in three different regions of Australia using a camera mounted on the windshield of a vehicle. The original color images of size 1156 × 1512 were down-sampled to 289 × 378 pixels, and unaries were extracted using a multiscale algorithm based on the pre-trained convolutional neural network of [45]. The outline of the algorithm used to compute the unaries is shown in Fig. 6. There are two main differences with respect to the original approach: First, we apply a preprocessing step that modifies each test image to become more similar to the images in the training set, using the color transfer technique of [46]. Second, we improve the robustness of the algorithm by applying the same network to multiple scales of the input image. In Fig. 7, we demonstrate the benefits of our approach in such a very large scale scenario by showing some randomly selected images from the sequences (additional results are available at http://www.rsu.forge.nicta.com.au). The runtimes of our scene layout experiments are given in Table IX. Note that, on 100,000 images, inference in our model only takes 13 minutes.
Fig. 6. Algorithm used to compute unaries over the different Australian road sequences. The algorithm operates at multiple scales and is based on the pre-trained convolutional neural network presented in [45].
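As a rough sketch of the color-transfer preprocessing mentioned above, the snippet below matches the per-channel mean and standard deviation of a test image to reference statistics computed on the training set, in the spirit of Reinhard et al. [46]. We use CIELAB rather than the lαβ space of [46], so this is only an approximation, and all names and values are illustrative.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def color_transfer(test_rgb, train_mean, train_std):
    """test_rgb: HxWx3 float image in [0, 1].
    train_mean, train_std: per-channel Lab statistics of the training images."""
    lab = rgb2lab(test_rgb)
    mean = lab.reshape(-1, 3).mean(axis=0)
    std = lab.reshape(-1, 3).std(axis=0) + 1e-6
    lab = (lab - mean) / std * train_std + train_mean   # match first two moments
    return np.clip(lab2rgb(lab), 0.0, 1.0)
```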
Fig. 7. Sample results from our 100,000-image dataset: a) input image; b) Unary; c) Pairwise; and d) Ours. Note the improvement obtained by co-labeling the full video simultaneously.
V. CONCLUSION

In this paper, we have developed an efficient inference technique for fully-connected CRFs over the pixels of image sets. Our approach enables analysis of structure over huge numbers of pixels across large image sets. We have demonstrated the benefits of our semantic co-labeling approach on image and video segmentation using pixels and superpixels, and showed its scalability to over 100,000 frames, taking only a few minutes on a single thread on a standard desktop. Such efficient computation opens the way for new approaches that explore appearance structure over very large datasets. In the future, we intend to study the generalization of our framework to incorporating higher-order potentials, as well as to working in a semi-supervised setting.

REFERENCES
[1] S. Paris, “Edge-Preserving Smoothing and Mean-Shift Segmentation of Video Streams,” in ECCV, 2008. [2] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in ICCV, 2009. [3] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient Hierarchical Graph-Based Video Segmentation,” in CVPR, 2010. [4] J. Lezama, K. Alahari, J. Sivic, and I. Laptev, “Track to the Future: Spatio-temporal Video Segmentation with Long-range Motion Cues,” in CVPR, 2011.
[5] C. Xu, X. Xiong, and J. Corso, “Streaming hierarchical video segmentation,” in ECCV, 2012. [6] J. M. Alvarez, T. Gevers, and A. M. Lopez, “Learning photometric invariance for object detection,” IJCV, pp. 45 – 61, 2010. [7] C. Rother, V. Kolmogorov, T. Minka, and A. Blake, “Cosegmentation of Image Pairs by Histogram Matching - Incorporating a Global Constraint into MRFs,” in CVPR, 2006. [8] D. Batra, A. Kowdle, D. Parikh, J. Luo, and C. T, “iCoseg: Interactive Co-segmentation with Intelligent Scribble Guidance,” in CVPR, 2010. [9] S. Vicente, C. Rother, and V. Kolmogorov, “Object Cosegmentation,” in CVPR, 2011. [10] L. Mukherjee, V. Singh, J. Xu, and M. D. Collins, “Analyzing the Subspace Structure of Related Images: Concurrent Segmentation of Image Sets,” in ECCV, 2012. [11] A. Joulin, F. Bach, and P. J, “Multi-Class Cosegmentation,” in CVPR, 2012. [12] G. Kim, E. P. Xing, L. Fei-Fei, and T. Kanade, “Distributed Cosegmentation via Submodular Optimization on Anisotropic Diffusion,” in ICCV, 2011. [13] G. Kim and E. P. Xing, “On Multiple Foreground Cosegmentation,” in CVPR, 2012. [14] P. Kraehenbuehl and V. Koltun, “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potential,” in NIPS, 2011. [15] J. M. Alvarez, M. Salzmann, and N. Barnes, “Large-Scale Semantic Co-Labeling of Image Sets,” in WACV, 2014. [16] M. Brand and V. Kettnaker, “Discovery and segmentation of activities in video,” PAMI, vol. 22, no. 8, pp. 844–851, 2000. [17] E. Elhamifar and R. Vidal, “Sparse Subspace Clustering,” in CVPR, 2009. [18] P. Ochs and T. Brox, “Object Segmentation in Video: A Hierarchical Variational Approach for Turning Point Trajectories into Dense Regions,” in ICCV, 2011. [19] J. Kim and J. W. Woods, “Spatiotemporal adaptive 3-d kalman filter for video,” PAMI, vol. 6, no. 3, pp. 414–424, 1997. [20] X. Ren and J. Malik, “Tracking as repeated figure/ground segmentation,” in CVPR, 2007. [21] Y. Huang, Q. Liu, and D. Metaxas, “Video Object Segmentation by Hypergraph Cut,” in CVPR, 2009. [22] A. Vazquez-Reina, S. Avidan, H.-P. Pfister, and E. Miller, “Multiple hypothesis video segmentation from superpixel flows,” in ECCV, 2010. [23] P. Felzenszwalb and D. Huttenlocher, “Efficient graph-based image segmentation,” IJCV, vol. 59, no. 2, pp. 167–181, 2004. [24] Y. J. Lee, J. Kim, and K. Grauman, “Key-Segments for Video Object Segmentation,” in ICCV, 2011. [25] G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra, O. Madani, S. Vijayanarasimhan, I. Essa, J. Rehg, and R. Sukthankar, “Weakly Supervised Learning of Object Segmentations from Web-Scale Video,” in ECCV, 2012. [26] Y. Zhang and T. Chen, “Efficient Inference for Fully-Connected CRFs with Stationarity,” in CVPR, 2012. [27] V. Vineet, J. Warrell, and P. H. S. Torr, “Filter-based Mean-Field Inference for Random Fields with Higher-Order Terms and Product Label-Spaces,” in ECCV, 2012. [28] A. Adams, J. Baek, and A. Davis, “Fast High-Dimensional Filtering Using the Permutohedral Lattice,” in Eurographics, 2010. [29] J. Reynolds and K. Murphy, “Figure-ground segmentation using a hierarchical conditional random field,” in Proceedings of the Fourth Canadian Conference on Computer and Robot Vision, ser. CRV, 2007, pp. 175–182. [30] A. Blake, P. Kohli, and C. Rother, Markov Random Fields for Vision and Image Processing. The MIT Press, 2011. [31] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” IJCV, vol. 
81, no. 1, pp. 2–23, 2009. [32] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” PRL, 2008. [33] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in CVPR, 2008. [34] J. Jiang and Z. Tu, “Efficient scale space auto-context for image segmentation and labeling,” in CVPR, 2009. [35] L. Ladický, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr, “What, where and how many? Combining object detectors and CRFs,” in ECCV, 2010. [36] J. Yao, S. Fidler, and R. Urtasun, “Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation,” in CVPR, 2012.
[37] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009. [38] S. Gould, “DARWIN: A framework for machine learning and computer vision research and development,” JMLR, vol. 13, pp. 3533–3537, Dec 2012. [39] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Suesstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” PAMI, vol. 34, no. 11, pp. 2274–2282, 2012. [40] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in ECCV, 2008. [41] P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr, “Combining appearance and structure from motion features for road scene understanding,” in BMVC, 2009. [42] C. Zhang, L. Wang, and R. Yang, “Semantic segmentation of urban scenes using dense depth maps,” in ECCV, 2010. [43] G. Floros, K. Rematas, and B. Leibe, “Multi-class image labeling with top-down segmentation and generalized robust pn potentials,” in BMVC, 2011. [44] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layout from an image,” IJCV, vol. 75, no. 1, pp. 151–172, 2007. [45] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, “Road scene segmentation from a single image,” in ECCV’12, vol. 7578, 2012, pp. 376–389. [46] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE CGA special issue on Applied Perception, vol. 21, no. 5, pp. 34–41, 2001.
José M. Alvarez is currently a researcher at NICTA and a research fellow at the Australian National University. Previously, he was a postdoctoral researcher at the Computational and Biological Learning Group at New York University. During his Ph.D. he was a visiting researcher at the University of Amsterdam and Volkswagen Research. His main research interests include road detection, color, photometric invariance, machine learning, and fusion of classifiers. He is a member of the IEEE.
Mathieu Salzmann is a Research Associate at EPFL. Previously, he was a Senior Researcher and Research Leader at NICTA and an Adjunct Research Fellow at ANU. Prior to this, in 2010-2012, he was a Research Assistant Professor at TTI-Chicago, and, in 2009-2010, a postdoctoral fellow at ICSI and EECS, UC Berkeley. He obtained his PhD from EPFL in 2009. His main research interests lie at the interface of geometry and machine learning for computer vision. He is an IEEE member.
Nick Barnes received the B.Sc. degree with honors and the Ph.D. degree in computer vision for robot guidance from the University of Melbourne, Melbourne, Australia, in 1992 and 1999, respectively. He worked for an IT consulting firm from 1992 to 1994. In 1999, he was a Visiting Research Fellow with the LIRA Laboratory, University of Genova, Genova, Italy, supported by an Achiever Award from the Queen's Trust for Young Australians. From 2000 to 2003, he was a Lecturer with the Department of Computer Science and Software Engineering, University of Melbourne. Since 2003, he has been with National ICT Australia's Canberra Research Laboratory, Canberra, Australia, where he is currently a Principal Researcher and Research Group Manager in computer vision. His current research interests include visual dynamic scene analysis, computational models of biological vision, feature detection, vision for vehicle guidance, and medical image analysis.