Real-time Stereo and Flow-based Video Segmentation with Superpixels

Michael Van den Bergh¹
¹ ETH Zurich, Zurich, Switzerland

Luc Van Gool¹,²
² KU Leuven, Leuven, Belgium

[email protected]

[email protected]

Abstract

The use of depth is becoming increasingly popular in real-time computer vision applications. However, when real-time stereo is used to obtain depth, the quality of the disparity image is usually insufficient for reliable segmentation. The aim of this paper is to obtain a more accurate and at the same time faster segmentation by incorporating color, depth and optical flow. A novel real-time superpixel segmentation algorithm is presented which uses real-time stereo and real-time optical flow. The presented system provides superpixels which represent suggested object boundaries based on color, depth and motion. Each output superpixel has a 3D location and a motion vector, and thus allows for straightforward segmentation of objects by 3D position and by motion direction. In particular, it enables reliable segmentation of persons, and of moving hands or arms. We show that our method is competitive with the state of the art while approaching real-time performance.

Figure 1. The goal in this paper: efficiently segmenting foreground objects/persons and moving parts.

Figure 2. Examples of color-based superpixel segmentation failing.

1. Introduction

Segmentation is usually slow and unreliable. Object classes have such a variety of appearance that image-based segmentation requires complicated and slow features, yet struggles to deliver consistent results. Especially in human-machine interaction there is a need for fast and reliable segmentation of persons and body parts. This goal is illustrated in Figure 1, where the person and the moving hand are segmented. Figure 2 shows examples of standard color-based superpixel segmentation failing: shadowed parts get separated from highlighted parts, and similar colors get confused with the background.

One solution is to use infrared (IR)-based depth sensors such as Time-of-Flight (ToF) cameras, or structured light devices such as the PrimeSense sensor. These sensors provide reliable point clouds which can be used to segment objects in 3D space. However, there are several reasons not to rely on IR. IR-based systems are limited to a strict working volume defined by the sensor manufacturer, and to indoor use. The presence of sunlight saturates the IR sensor, resulting in an over-exposed IR image. Stereo camera setups have a wider range of application, as they can work in sunlight, and the working volume can be chosen by using different cameras, different lenses, or a different baseline. Furthermore, stereo camera pairs are easier to integrate into mobile devices like phones and laptops. On most robotics platforms, stereo and optical flow are available.

The main contribution of this paper is the combination of color (RGB), depth, and motion information (optical flow) for superpixel segmentation. We introduce a color-spatio-temporal distance measure that incorporates color, spatial and temporal information. Furthermore, we introduce temporal simple linear iterative clustering (temporal SLIC), which exploits the temporal aspect of video in order to minimize the number of iterations per frame and achieve real-time clustering. We present a real-time stereo system that provides object boundaries based on color, depth and motion, and thus makes it possible to detect foreground objects and moving objects at the same time.

2. Background

In the field of object segmentation, many approaches make use of a superpixel-based low-level partitioning of the image [6, 8, 5, 7, 12, 3]. This allows groups of pixels (superpixels) to be classified, and thus an accurate segmentation of the whole image to be achieved. This partitioning is usually based on clustering pixels by color or local texture features. Achanta et al. [1] present a fast approach for color-based clustering of the image into superpixels, based on simple linear iterative clustering (SLIC). This approach achieves state-of-the-art superpixel clustering in O(N) complexity, where N is the number of pixels in the image; previous approaches only achieved O(N²) or O(N log N) complexity. The SLIC approach paves the way for real-time superpixel segmentation. However, the approach is iterative, and the clustering must be run several times on each input frame.

Real-time depth information can be acquired using IR-based sensors such as Time-of-Flight cameras (e.g. SwissRanger) or structured light cameras (e.g. PrimeSense). As these sensors provide relatively accurate depth information, we will show that their use can result in dramatic improvements in segmentation accuracy. However, we are more interested in the performance based on real-time stereo. Stereo has a broader application range, and the noisy depth from stereo is also an ideal candidate to be improved by incorporating additional cues (color, optical flow) into the segmentation. Bleyer et al. [2] present a real-time stereo algorithm based on GPU-trees. This approach is fast, but the output is not ideal for segmentation, as many oscillations appear in the disparity image. For segmentation it is important that each object surface is represented as smoothly as possible. Geiger et al. [9] present a CPU-based real-time stereo algorithm that provides disparity images that are more useful for our application. The approach first looks for a selection of good feature points in both the left and right camera images, and creates a triangular mesh based on those feature points. The remaining pixels are expected to lie close to this triangular mesh, so that a match only has to be searched for in a local region, which reduces the computation time. This also results in smooth surfaces, which are beneficial for our object segmentation.

Besides depth, motion is also an important cue for object detection in videos. Typical objects of interest are moving cars or persons, and for interaction we are interested in moving arms or hands. There are several ways to approach the detection of motion. Basic approaches include foreground-background segmentation [13] and pixel-level differences between frames. These approaches can be improved with smoothing techniques such as conditional random fields (CRF) [10], median filtering and HMMs [11]. Even though these approaches have a low computational cost, they have a number of significant disadvantages: they lack a motion direction or magnitude, they cannot distinguish between different moving objects, and they cannot deal with camera movement or moving elements in the background. Therefore it is interesting to look into the computationally more expensive optical flow. Brox and Malik [4] show that very high accuracy can be obtained by segmenting objects based on optical flow. However, the largest bottleneck is the computational cost of the optical flow itself. Werlberger et al. [15, 14] present a GPU-based dense real-time optical flow algorithm, which lends itself perfectly to our application.

The remainder of this paper is structured as follows: in Section 3 we give an overview of the system and describe the real-time stereo and real-time optical flow algorithms used in this paper. The superpixel clustering is explained in Section 4, where we introduce the color-spatio-temporal distance measure and temporal simple linear iterative clustering (temporal SLIC). Then, in Section 5, we evaluate the proposed system. First we evaluate it against the Berkeley motion segmentation dataset. Then we evaluate the benefit of using depth and of using optical flow, and compare the performance of an IR-based sensor with that of real-time stereo. Finally, we show an experiment where the system is used outdoors on a mobile autonomous urban robot.

3. Depth and Motion Calculation

3.1. System Overview

An overview of the system presented in this paper is shown in Figure 3. A stereo camera set-up is used. A real-time stereo component is implemented based on LibElas, and a real-time optical flow component is implemented based on LibFlow. The stereo algorithm runs on the CPU, while the optical flow algorithm runs on the GPU. From the results a 6-channel image is produced: 3 color channels, a depth channel, and a horizontal and a vertical flow channel. The superpixel segmentation is then run on this 6D image. It uses a color-spatio-temporal distance measure, and temporal SLIC, in order to produce meaningful object boundaries.

Figure 3. System overview: the left and right camera images feed the Stereo and Flow components, whose outputs are combined by the Superpixel Segmentation component into superpixels.
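To make the 6-channel representation concrete, the following minimal sketch (an illustration, not the paper's implementation) assembles the per-frame feature image with NumPy, assuming the depth and flow maps have already been computed at the resolution of the RGB image:

import numpy as np

def build_six_channel_image(rgb, depth, flow):
    # rgb: (H, W, 3) color image, depth: (H, W) depth map,
    # flow: (H, W, 2) horizontal and vertical flow components.
    # np.dstack promotes the 2D depth map to (H, W, 1), giving (H, W, 6).
    return np.dstack([rgb.astype(np.float32),
                      depth.astype(np.float32),
                      flow.astype(np.float32)])

# six = build_six_channel_image(rgb, depth, flow)  # shape (H, W, 6)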

The resulting superpixels aim to fit object boundaries, especially towards objects or persons in the foreground of the scene, and towards moving objects, persons or body parts. The system produces a labeled image, and for each superpixel an average depth and an average motion vector. To give an example, in combination with a face detector, this labeled image can be used to segment a person in the foreground, and moving body parts such as a waving hand or a gesturing arm. In such a setting the system lends itself well to real-time hand gesture interaction.



3.2. Real-time Stereo

Depth is incorporated into the system in order to distinguish foreground objects (or persons) from objects behind them or from the background. IR-based depth sensors such as the PrimeSense sensor and the SwissRanger Time-of-Flight camera provide reliable depth data in real time. However, in order to deal with outdoor scenarios we make use of a stereo set-up and of the real-time stereo algorithm presented by Geiger et al. [9]. The benefit of using stereo is that the system can work in sunlight, and any combination of lenses and baselines can be chosen in order to accommodate different working volumes. However, the resulting disparity images are inaccurate and noisy, as shown in Figure 4, and too noisy to be used as the sole input, as is possible with, for example, the PrimeSense sensor.

Figure 4. The difference in quality between disparity images from an IR-based sensor (a, PrimeSense) and the output from a real-time stereo algorithm (b).


Even though the depth image is noisy, it is a good cue for the segmentation, and we will show that combined with the color and motion information, a cleaner and more accurate depth image can be produced based on the resulting superpixels.
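As a point of reference for how the depth channel is obtained from the stereo output, a disparity value converts to metric depth through the standard relation Z = f · B / d. A minimal sketch follows; the focal length and baseline below are placeholder values, not the calibration used in this paper:

import numpy as np

def disparity_to_depth(disparity, focal_px=500.0, baseline_m=0.12):
    # Z = f * B / d; invalid (zero or negative) disparities map to depth 0
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth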


3.3. Real-time Optical Flow

The motivation for using optical flow is that we want to detect and distinguish moving objects, persons or body parts. Especially in traffic scenes and in gesture interaction, we are interested in the moving objects or body parts. In this paper we use the optical flow algorithm presented by Werlberger et al. [15, 14]. This approach deals well with poorly textured regions and small-scale image structures, and a GPU-accelerated version of the algorithm is available. For more details about this optical flow algorithm we refer to [15, 14]. We scale the input image (640 × 480 pixels) down by a factor of 3 in order to speed up the optical flow computation. The algorithm produces a flow vector (ẋ, ẏ) for each pixel.
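The GPU flow of [15, 14] has no standard Python binding; as a rough stand-in for experimenting with the same downscale-then-flow step, OpenCV's Farnebäck flow can be used. This is an assumption for illustration only, not the algorithm used in this paper:

import cv2

def flow_downscaled(prev_gray, cur_gray, factor=3):
    # prev_gray, cur_gray: consecutive grayscale frames (uint8, 2D);
    # downscale by the given factor to speed up the flow computation
    small_prev = cv2.resize(prev_gray, None, fx=1.0/factor, fy=1.0/factor)
    small_cur = cv2.resize(cur_gray, None, fx=1.0/factor, fy=1.0/factor)
    flow = cv2.calcOpticalFlowFarneback(small_prev, small_cur, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow  # (h, w, 2): per-pixel (dx, dy) at the reduced resolution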

4. Superpixel Clustering

Superpixels are a useful primitive for image segmentation. Given good superpixels, object segmentation reduces to connecting the superpixels that belong together, based on simple criteria or features. The superpixel clustering algorithm presented in this paper is an extension of the work presented by Achanta et al. [1]. Our method differs by introducing a new distance measure and a temporal iteration method. In this section we first introduce a new distance measure which incorporates the color, the position, the depth and the motion of each pixel. Furthermore, a temporal iterative clustering approach is introduced, which allows for a significant speed-up in the clustering.

The superpixel algorithm takes as input the desired number of superpixels K. At the beginning of the algorithm, K cluster centers are chosen at regular grid intervals S (step size). In [1] the pixels are clustered in a 5D space: L, a, b, x and y, where (L, a, b) is the color in the CIELAB color space and (x, y) is the 2D position of the pixel. Rather than using the Euclidean distance, a distance measure is presented which weighs the color distance against the spatial distance of each pixel. This approach is good for small distances (small superpixels); for larger distances, however, the color similarity within one object is not guaranteed. Therefore we introduce a new distance measure based on color, 3D position, and motion.

4.1. Color-Spatio-Temporal Distance Measure

We introduce a distance measure D_s defined as follows:

    D_s = d_lab + m · d_xyz + n · d_flow                                (1)

where m is a parameter that controls the spatial compactness of the superpixels, n is a parameter that controls the motion compactness of the superpixels, and where

    d_lab  = (1/C) · sqrt( (a_k − a_i)² + (b_k − b_i)² + w_l · (l_k − l_i)² )
    d_xyz  = (1/S) · sqrt( (x_k − x_i)² + (y_k − y_i)² + w_z · (z_k − z_i)² )
    d_flow = (1/(T·S)) · sqrt( (ẋ_k − ẋ_i)² + (ẏ_k − ẏ_i)² )

where C is the depth of each color channel (C = 256 in our case), S is the step size between superpixels, T is the time delta between the previous and the current frame, w_l is a weight for the intensity, w_z is a weight for the depth, and (ẋ, ẏ) is the optical flow vector. The color distance and the spatial distance are normalized with C and S respectively, while the motion distance is normalized with the step size S and the time T between the two frames used for optical flow. We introduce the weight w_l for the color intensity component in order to lower the influence of shadows and highlights within one object; an object should not be cut in half because of a shadow, for example. We also introduce the weight w_z for the depth component. In our experiments we have empirically chosen w_l = 0.5 and w_z = 10. This means we lower the influence of direct light vs. shadows, while we amplify the influence of the depth registration. The parameters m and n control the spatial and motion compactness. We have empirically found m = 1 and n = 10 to work well. This corresponds to a neutral spatial compactness and an amplified motion compactness, as we are particularly interested in detecting moving objects.
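Eq. (1) translates directly into code. The sketch below uses the weights chosen above; the step size S and the frame interval T are placeholder values:

import numpy as np

def color_spatio_temporal_distance(f_k, f_i, C=256.0, S=20.0, T=1/30.0,
                                   w_l=0.5, w_z=10.0, m=1.0, n=10.0):
    # f_k, f_i: feature vectors (l, a, b, x, y, z, dx, dy) for the
    # cluster center k and the candidate pixel i
    lk, ak, bk, xk, yk, zk, uk, vk = f_k
    li, ai, bi, xi, yi, zi, ui, vi = f_i
    d_lab = np.sqrt((ak - ai)**2 + (bk - bi)**2 + w_l * (lk - li)**2) / C
    d_xyz = np.sqrt((xk - xi)**2 + (yk - yi)**2 + w_z * (zk - zi)**2) / S
    d_flow = np.sqrt((uk - ui)**2 + (vk - vi)**2) / (T * S)
    return d_lab + m * d_xyz + n * d_flow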

4.2. Temporal SLIC

The superpixel centers are initialized at regular grid steps S. Then, according to the distance measure D_s, the best matching pixels from a 2S × 2S neighborhood around the superpixel center are assigned to the superpixel. This is repeated for each superpixel. The new cluster centers are computed, and the process is iterated until convergence or until a fixed number of iterations has passed. This process is based on the SLIC algorithm described in [1]. However, instead of iterating the linear clustering several times on each frame, the superpixel positions from the previous frame are taken over, and the clustering iterates only once on each new frame. This temporal approach yields superpixels similar to those of the iterative approach, as illustrated in Figure 5.

Algorithm 1 Temporal SLIC
1: Initialize cluster centers at grid interval S.
2: repeat
3:    Read next video frame.
4:    for each cluster center do
5:       Assign best matching pixels from 2S × 2S neighborhood.
6:       Compute new cluster centers.
7:    end for
8: until end of video

The temporal clustering makes the assumption that during each video sequence we observe a fixed set of objects, for example a user interacting. If the set of objects changes, a reinitialization of the superpixels would be required, or a more intelligent handling of the superpixels (for example, inserting new superpixels or removing superpixels online). However, this is beyond the scope of this paper, and we assume a reinitialization when the scene changes significantly.

Figure 5. Iterative vs. temporal clustering: (a) shows the result on a still image after 10 iterations; (b) shows the result after 10 video frames with one iteration per frame.
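A minimal, unvectorized sketch of the per-frame step of Algorithm 1 is given below. Here feat is assumed to be a feature image whose channels include color, position/depth and flow, and dist_fn is a distance function such as the color_spatio_temporal_distance sketch above; carrying centers over between frames is what allows a single iteration per frame:

import numpy as np

def temporal_slic_step(feat, centers, S, dist_fn):
    # feat: (H, W, C) feature image; centers: list of (row, col) positions
    # carried over from the previous frame
    H, W = feat.shape[:2]
    labels = np.full((H, W), -1, dtype=np.int32)
    best = np.full((H, W), np.inf)
    for k, (cy, cx) in enumerate(centers):
        y0, y1 = max(cy - S, 0), min(cy + S, H)  # 2S x 2S search window
        x0, x1 = max(cx - S, 0), min(cx + S, W)
        for y in range(y0, y1):
            for x in range(x0, x1):
                d = dist_fn(feat[cy, cx], feat[y, x])
                if d < best[y, x]:
                    best[y, x] = d
                    labels[y, x] = k
    # move each center to the mean position of its assigned pixels
    new_centers = []
    for k, c in enumerate(centers):
        ys, xs = np.nonzero(labels == k)
        new_centers.append((int(ys.mean()), int(xs.mean())) if ys.size else c)
    return labels, new_centers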

4.3. Computation time

We have measured the computation time of the presented system on a Core i7 system with a GeForce GTX 260 GPU. The running times of the different components (per frame) are shown in Table 1. Keep in mind that the optical flow and stereo algorithms run in parallel, one on the GPU and one on the CPU. The resulting system is able to run at approximately 2 frames per second on our modest test system.

component     processor   resolution   computation time
Flow          GPU         213 × 160    300 ms
Stereo        CPU         640 × 480    200 ms
Superpixels   CPU         640 × 480    270 ms
System                                 570 ms

Table 1. Computation times.
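Because the stereo and flow components are independent per frame, the 570 ms total corresponds to running them concurrently and then clustering on the result. A sketch of that scheduling follows, where stereo(), flow() and segment_superpixels() are hypothetical wrappers around the three components:

from concurrent.futures import ThreadPoolExecutor

def process_frame(pool, left, right, prev_left):
    # stereo occupies the CPU while the flow computation occupies the GPU,
    # so together they cost roughly max(200 ms, 300 ms) rather than the sum
    depth_f = pool.submit(stereo, left, right)      # hypothetical wrapper
    flow_f = pool.submit(flow, prev_left, left)     # hypothetical wrapper
    depth, flw = depth_f.result(), flow_f.result()
    return segment_superpixels(left, depth, flw)    # ~270 ms on the CPU

# usage: pool = ThreadPoolExecutor(max_workers=2)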

5. Evaluation

First, we evaluate the general performance of the presented system by running it on the Berkeley motion segmentation dataset and comparing the results with a state-of-the-art optical-flow-based segmentation approach. Then, we illustrate the usefulness of incorporating depth and motion into the segmentation with some examples. Subsequently, we compare the performance of the system using ‘perfect’ depth information (from a PrimeSense sensor) to the performance using real-time stereo depth information. We show examples to illustrate that the stereo-based approach performs competitively despite the noisy depth data. Finally, we show some examples of the system running on an outdoor robot in traffic and interaction scenarios, illustrating the benefit of using a stereo setup (IR-based sensors do not work outside in sunlight).


5.1. Evaluation on the Berkeley Motion Segmentation Dataset

We provide an evaluation of our method by running it on the Berkeley motion segmentation dataset provided by Brox and Malik [4]. In this dataset no stereo or depth information is available, so we evaluate based on RGB and motion data only. The dataset provides 26 annotated sequences. We compare our method based on the overall clustering error: the number of bad labels over the total number of labels on a per-pixel basis. As in [4], multiple clusters (superpixels) can be assigned to the same region, to avoid high penalties for over-segmentation that actually makes sense. For instance, the arms of a person move independently even though only the whole person is annotated in the dataset. For the segmentation of a frame, we only use the current and the previous frame; we do not use the 10 previous frames. The superpixel segmentation is based on the RGB image itself and the output from the optical flow algorithm.

The results in Table 2 show that the presented method slightly outperforms the segmentation in [4]. This could be explained by the fact that the superpixel method also takes the RGB input into account, and not just the optical flow. The outlier for the marple9 sequence is due to the bodies of the subjects not moving during the entire sequence. However, as shown in the 5th row of Table 2, the method does segment the heads correctly, which do move in the sequence. The marple10 sequence also performs worse than average because, besides a person, one of the walls is annotated as an object in the ground truth; this wall is hard to detect as a separate object because the other walls move with a similar motion.

sequence    overall error
cars1       4.01%
cars2       4.79%
cars3       3.93%
cars4       0.42%
cars5       1.22%
cars6       0.47%
cars7       0.62%
cars8       2.42%
cars9       1.77%
cars10      6.84%
marple1     3.36%
marple2     2.18%
marple3     2.96%
marple4     2.65%
marple5     2.05%
marple6     7.63%
marple7     5.12%
marple8     3.47%
marple9     60.16%
marple10    16.59%
marple11    1.99%
marple12    4.46%
marple13    2.88%
people1     0.99%
people2     1.06%
tennis      3.40%
average     5.67%

Table 2. Left: evaluation results. Right (in the original layout): segmentation examples from the Berkeley motion segmentation dataset, shown as input, ground truth and segmentation.
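The overall clustering error reduces to a few lines of code; a minimal sketch, assuming each superpixel has already been mapped to the annotated region it overlaps most (as permitted in [4]):

import numpy as np

def overall_clustering_error(pred_regions, ground_truth):
    # pred_regions, ground_truth: (H, W) integer label images;
    # pixels labeled < 0 in the ground truth are treated as unannotated
    valid = ground_truth >= 0
    bad = (pred_regions != ground_truth) & valid
    return bad.sum() / valid.sum()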

5.2. Benefit of using Depth and Motion

We compare the superpixel segmentation for four cases: (1) based on color alone; (2) based on color and depth; (3) based on color and motion; and (4) based on color, depth and motion. The examples in Figure 6b show that the segmentation is improved by adding depth: color alone is often not enough to distinguish between different objects. Figure 6c shows that by taking motion into account, we can segment moving objects or object parts into separate superpixels. By using both depth and motion, as shown in Figure 6d, superpixels are obtained that segment the person and the moving parts correctly. This experiment shows the usefulness of incorporating depth in order to correctly segment static foreground objects, and of incorporating optical flow in order to segment moving objects.

5.3. PrimeSense vs. Stereo

We also compare the performance based on an IR-based sensor to the performance based on real-time stereo. As shown in Figure 4, the amount of discontinuities in the disparity image from IR-based sensors is sufficiently low that one could segment based on a threshold alone. However, in the case of real-time stereo, the disparity image is very noisy. Some example results from a PrimeSense sensor are shown in Figure 7, and results from a real-time stereo setup are shown in Figure 8. For this experiment the PrimeSense sensor and the stereo camera pair were mounted on the same tripod, and the same sequences were recorded simultaneously (note that the PrimeSense sensor has a wider angle of view). The results show that in the case of real-time stereo, despite the noisy disparity image, a good segmentation can still be obtained by incorporating color and motion.

5.4. Outdoors

One of the main reasons for choosing real-time stereo over IR-based sensors is outdoor functionality. Therefore we show some examples recorded outdoors on an autonomous urban robot. These results are shown in Figure 9, with examples of traffic and interaction scenarios. Notice that the monotonous appearance of the street makes the stereo disparity image even noisier. Nevertheless, the system is able to segment moving cars at crossings, persons in front of the robot, and moving (waving) arms.

6. Conclusion

In this paper, a novel superpixel segmentation technique was presented that takes noisy depth and motion data and produces a useful object segmentation. A color-spatio-temporal distance measure was introduced that incorporates color (RGB), spatial (xyz) and temporal (optical flow) information. Furthermore, we introduced temporal simple linear iterative clustering (temporal SLIC), which exploits the temporal aspect of video in order to minimize the number of iterations required per frame. We presented a real-time stereo system that provides object boundaries based on color, depth and motion, and thus makes it possible to detect foreground objects and moving objects at the same time.

The superpixel segmentation is not the end of the pipeline. The segmentation can be improved by subsequent steps that process the superpixels. These can be based on simple features such as color, depth and motion, but also on additional, more complicated features to further identify the content of the superpixels. However, this post-processing is outside the scope of this paper.

The presented system will be applied to detecting waving and pointing hand gestures on a mobile robot platform, which will be used outdoors. The method allows for the detection of persons in the foreground, of moving body parts, and of moving objects in the background. In these circumstances we cannot fall back on IR-based sensors, and the segmentation that combines depth and motion is useful.

It can be noted that the quality of the stereo/depth input is not always the same, and the real-time stereo can sometimes fail miserably. This influences the superpixel segmentation, as we use a fixed weight parameter (w_z) to control the importance of the depth input. This could be resolved by using a dynamic weight, which changes depending on the confidence of the depth values in that frame or pixel. There is also the option to extend the optical flow to 3D: based on the depth data, it should be possible to calculate a 3D flow vector. One could also look into incorporating additional input values, especially in the outdoor case, where one could incorporate the input from laser scanners or other sensors. Furthermore, it will be worthwhile to look into a tighter integration of the stereo, optical flow and superpixel clustering. These processes are closely related, and a smarter integration should be possible in order to further reduce processing time while improving the quality of the segmentation.


Acknowledgments. This work is carried out in the context of the Seventh Framework Programme of the European Commission, EU Project FP7 ICT 248314 Interactive Urban Robot (IURO), and the SNF project Vision-supported Speech-based Human Machine Interaction (200021-130224).

References

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels. EPFL Technical Report 149300, June 2010.
[2] M. Bleyer and M. Gelautz. Simple but effective tree structures for dynamic programming-based stereo matching. In International Conference on Computer Vision Theory and Applications (VISAPP), pages 415–422, 2008.
[3] X. Boix, J. M. Gonfaus, J. van de Weijer, A. D. Bagdanov, J. Serrat, and J. Gonzalez. Harmony potentials: Fusing global and local scale for semantic image segmentation.
[4] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In Proceedings of the European Conference on Computer Vision, pages 282–295, Greece, 2010.
[5] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[6] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[7] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[8] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In IEEE International Conference on Computer Vision (ICCV), 2009.
[9] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In Proceedings of the Asian Conference on Computer Vision, Queenstown, New Zealand, November 2010.
[10] A. Griesser, S. De Roeck, A. Neubeck, and L. Van Gool. GPU-based foreground-background segmentation using an extended colinearity criterion. In Proceedings of Vision, Modeling, and Visualization (VMV), pages 319–326, November 2005.
[11] B. Jedynak, H. Zheng, and M. Daoudi. Skin detection using pairwise models. Image and Vision Computing, September 2005.
[12] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
[13] R. Mester, T. Aach, and L. Dümbgen. Illumination-invariant change detection using a statistical colinearity criterion. In Pattern Recognition: Proceedings of the 23rd DAGM Symposium, pages 170–177, 2001.
[14] M. Werlberger, T. Pock, and H. Bischof. Motion estimation with non-local total variation regularization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, June 2010.
[15] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof. Anisotropic Huber-L1 optical flow. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, September 2009.


Figure 6. Superpixels depending on different input types (the image in the background is faded to make the superpixels more visible): (a) superpixels based on RGB values; (b) superpixels based on RGB and depth values; (c) superpixels based on RGB and optical flow values; (d) superpixels based on RGB, depth and optical flow values. By adding depth, the persons are segmented more cleanly. By adding motion, the moving arms are segmented as a separate object. By using both, we get cleanly segmented persons and moving body parts.

Figure 7. Experiments with RGB and depth data taken from a PrimeSense sensor. The clean and pixel-accurate depth registration allows the system to produce a clean segmentation.


Figure 8. Experiments with RGB images taken from two Point Grey Grasshopper cameras and depth provided by real-time stereo.

Figure 9. The superpixel segmentation on some outdoor scenes recorded from a robot. From left to right: RGB input image, stereo disparity, optical flow, resulting superpixels, resulting segmentation.

