Background Subtraction with KINECT Data: An Efficient Combination of RGB and Depth

Van-Toi Nguyen 1,2, Hai Vu 1, and Thanh-Hai Tran 1

1 International Research Institute MICA, HUST - CNRS/UMI-2954 - Grenoble INP and Hanoi University of Science & Technology, Vietnam
2 University of Information and Communication Technology, Thai Nguyen University

Abstract. This paper describes an efficient combination of KINECT data (RGB and depth) for background subtraction. To this end, we utilize a statistical model of the background pixels, a Gaussian Mixture Model, for both the color and the depth features. Beyond the segmentation results obtained from each modality separately, our combination strategy takes into account, at each pixel, which of the depth and color features is the more reliable one. The strategy is that within the valid range of the depth measurement, the foreground segmentation obtained from the depth feature is preferred, whereas for pixels outside this range, the foreground segmentation obtained from the color feature is used. Following this combination scheme, depth pixels are filtered through a proposed depth-noise model and validated against the range of the depth measurements. The proposed method is evaluated on a public dataset that exhibits common problems of background subtraction such as shadows, reflections, and camouflage. The experiments show segmentation results comparable with recent reports. Furthermore, the proposed method succeeds on a challenging task, extracting a human fall-down event from an RGB-D image sequence, and the resulting foreground segmentation is usable for recognition tasks.

Keywords: Microsoft KINECT, Background Subtraction, Color Segmentation, Depth in Use, RGB-D Combination

1 Introduction

Background/foreground segmentation (or background subtraction, hereafter BGS) aims to separate moving/dynamic objects from a static background scene. It is a critical preprocessing task in many applications such as object detection, tracking, and recognition. One of the most common approaches to BGS is to model the statistics of background pixels with a Gaussian Mixture Model (GMM) [6, 7]. Good reviews of background/foreground subtraction can be found in [11, 9]. When only the color image is used, typical problems of BGS techniques include foreground objects sharing colors with the background (the camouflage problem), and shadows or other varying lighting conditions (which cause background elements to be included in the computed foreground).

Recently, depth/range data provided by time-of-flight cameras, and particularly the consumer device Microsoft KINECT [1], has become very attractive for background/foreground segmentation in indoor environments. A major advantage of depth data is that it does not suffer from the issues of color-based algorithms (camouflage, shadows, illumination changes). However, relying solely on depth data still presents problems [4]: depth data are noisy at object boundaries; depth measurements are not available for all pixels of the image; and measurement noise depends on the measured distance. These aspects have to be accounted for in the combination strategy in order to reduce detection errors due to the problems of color as well as the noise issues of depth. Some earlier methods [3] combine depth and color only in a simple way, or use the depth channel alone, and therefore do not effectively exploit the advantages of both depth and RGB information. Other works [10, 4] combine the two more effectively and are able to address color-segmentation issues such as shadows, reflections, and camouflage; however, they are rather complex or still do not fully exploit the advantages of both depth and color.

To tackle these issues, we propose an efficient combination of the depth and color segmentation results. Our combination scheme is mainly based on the valid range of the depth measurement: wherever the depth measurement indicates that a pixel is within the valid range, color matching is unimportant because the depth information alone is sufficient for correct segmentation. In contrast, if a pixel is out of range of the depth sensor, the foreground segmentation from the color feature is used. In our implementation, we first build a statistical model of the background pixels, a Gaussian Mixture Model, for both the color and the depth features. Valid depth pixels are then filtered using a proposed depth-noise model. The noise model is learnt by statistical analysis of the depth data at each pixel over a period of time, and it allows us to filter out noisy pixels in the depth image. A disjunction of the segmentation results from the depth and color features, implemented according to the combination conditions above, produces the final result.

The remainder of this paper is organized as follows. Section 2 presents our depth-noise model and the proposed method to filter depth noise and to identify the depth in use. Section 3 presents our combination scheme using color and depth features. Section 4 gives experimental results that demonstrate the effectiveness of the proposed method, and Section 5 concludes the paper.

2 Removing noise in the depth data

2.1 Building a model of noise of the depth data

To build the noise model of the depth data, we take into account the depth of a static (background) scene over a duration T. Assume that the depth maps of the background scene form a stack S = [M × N × T], where <M, N> are the width and height of the depth image (usually 640 × 480 pixels). The noise model aims to find the positions of noisy pixels in the observed signal S, together with the statistical parameters used to filter noise from captured depth images.


Fig. 1. An example of noisy data captured by the KINECT depth sensor. (a) RGB image for reference. (b) The corresponding depth image. (c) Zoom-in on a region around the chessboard. Noise is especially high at object boundaries.

Observing the depth signal s at a pixel <i, j> over the duration T allows us to evaluate the stability of the depth data. Intuitively, a noisy pixel makes its signal s(i, j) unstable. To measure the stability of each background pixel (i, j), we evaluate the standard deviation (std) of s(i, j). A pixel at location (i, j) is labelled as noise as follows:

    Noise(i, j) = 1   if std(s(i, j)) ≥ Threshold,
                  0   if std(s(i, j)) < Threshold.        (1)

The Threshold is predetermined by heuristic selection. However, our empirical study shows that the Threshold value does not need to be selected strictly: a stable s(i, j) is always associated with a low std value. Figure 2 shows the noise pixels detected in the background scene observed in Fig. 1 above. The noise signal s of the pixel at coordinate (251, 182), marked in Fig. 2(a), is extracted over the duration T. The original depth data of s(251, 182) is plotted as the red line in Fig. 2(b); this pixel is a noise pixel according to (1). The image of all detected noise pixels is shown in Fig. 2(c). As expected, noise pixels appear with high density around the chessboard.
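For concreteness, the stability test of Eq. (1) can be sketched in a few lines of Python/NumPy. The array layout, the function name, and the concrete threshold value below are our assumptions; the paper only states that the threshold is chosen heuristically.

import numpy as np

def detect_noise_pixels(depth_stack, threshold=20.0):
    """Per-pixel noise test of Eq. (1).

    depth_stack: array of shape (T, M, N) holding T depth frames of the
    static background scene. threshold plays the role of Threshold in
    Eq. (1); 20.0 depth units is only a placeholder value.
    """
    std_map = depth_stack.std(axis=0)                      # std of s(i, j) over T
    noise_mask = (std_map >= threshold).astype(np.uint8)   # Noise(i, j)
    return noise_mask, std_map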

2.2 Noise reduction using the proposed noise model

The noise model provides an effective algorithm for filtering noisy pixels in the depth image. Given an original depth frame such as the one in Fig. 3(a), the model tells us whether each pixel is noise or not. For such pixels, we generate a new depth value based on the low band of the data obtained from a K-means (K = 2) clustering of the pixel's history: a random value from this low band is used to fill in the depth pixel. Fig. 3(b) presents a noisy depth frame after applying this filtering procedure. The few pixels that remain noisy can be removed with a simple median filter on the current frame; Fig. 3(c) shows the result after median filtering with a kernel size of 3 × 3 pixels.
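One possible implementation of the fill-in procedure described above is sketched below. The function and variable names are hypothetical, and the use of OpenCV's k-means and median filter is an assumption about tooling rather than the authors' actual implementation.

import numpy as np
import cv2

def filter_noise_depth(depth, noise_mask, depth_stack, rng=np.random.default_rng()):
    """Replace noise pixels using the low band of a 2-cluster split of their
    depth history, then apply a 3x3 median filter (cf. Sec. 2.2)."""
    out = depth.astype(np.float32).copy()
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    for i, j in zip(*np.nonzero(noise_mask)):
        samples = depth_stack[:, i, j].astype(np.float32).reshape(-1, 1)
        # K-means with K = 2 splits the pixel's history into a low and a high band
        _, labels, centers = cv2.kmeans(samples, 2, None, criteria, 3,
                                        cv2.KMEANS_RANDOM_CENTERS)
        low_band = samples[labels.ravel() == int(np.argmin(centers))].ravel()
        # Fill the noise pixel with a random value drawn from the low band
        out[i, j] = rng.choice(low_band)
    # A 3x3 median filter removes the remaining isolated noise pixels
    return cv2.medianBlur(out, 3)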

Fig. 2. Results of learning the noise model over a duration T = 5 sec. (a) The examined pixel at position (251, 182). (b) The corresponding signal s over T plotted in red; the filtered signal sf plotted in blue. (c) Noise pixels detected over the whole image. As expected, a high density of noise appears in the regions of the chessboard and the boundary of the bookcases.


Fig. 3. Results of noise filtering on the background scene. (a) An original depth frame. (b) The noise-filtered depth frame. (c) Average of the noise-filtered frames over the duration T.

3 Background/Foreground segmentation

3.1 The prototype of background subtraction

We define the background as the stationary portion of a scene. If pure background frames are available, pixel-wise statistics in color and depth can be computed directly. The more difficult case is computing the background model from sequences that always contain foreground elements. We model each pixel as an independent statistical process: a Gaussian Mixture Model is fitted to each pixel over a sequence of frames. For ease of computation, we assume equal covariance for the three color channels [RGB]. At each pixel, a mixture of three Gaussians is estimated according to the procedure proposed in [6]. Once we have an estimate of the background in terms of color and range, we can use this model to segment foreground from background in a subsequent image of the same scene. Ideally, a pixel belongs to the foreground F when its current value is far from the mode of the background model relative to the standard deviation:

    F ≡ |Pi − Pm| > kσ        (2)

where Pi is the pixel value at frame i (in color or range space), Pm is the mode of the background model at the same pixel, σ is the standard deviation of the model at that pixel, and k is a threshold parameter. In our implementation, this background-subtraction prototype is applied to both the depth and the color features: the foreground segmented from depth is denoted Fd and the foreground segmented from color is denoted Fc, while the background models of depth and color are denoted Pmd and Pmc, respectively. We build a separate model for each channel [R, G, B, and D]. An example of the R-channel model for the Fall sequence (see details in Section 4) is shown in Fig. 4.
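A minimal sketch of the per-pixel test of Eq. (2) is given below. It assumes the background mode and standard deviation maps have already been estimated (e.g. from the GMM of [6]); the per-channel "any" rule for color and the value k = 2.5 are our assumptions, since the paper leaves k as a free parameter.

import numpy as np

def foreground_mask(frame, bg_mode, bg_std, k=2.5):
    """Pixel-wise foreground test of Eq. (2): |P_i - P_m| > k * sigma.

    frame, bg_mode and bg_std have the same shape; they can be an RGB frame
    (H x W x 3) or a single-channel depth frame (H x W).
    """
    diff = np.abs(frame.astype(np.float32) - bg_mode.astype(np.float32))
    fg = diff > k * bg_std
    if fg.ndim == 3:          # for color: foreground if any channel deviates
        fg = fg.any(axis=2)
    return fg.astype(np.uint8)

Applied once to the color frame and once to the noise-filtered depth frame, this test yields Fc and Fd, respectively.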


Fig. 4. GMM of the R channel for the Fall sequence. The mean of the first, second, and third Gaussian is visualized in (a), (b), and (c), respectively.

3.2 Background subtraction using the depth feature

Given a depth image as shown in Fig. 5(b) (see Fig. 5(a) for reference) and the background model of depth shown in Fig. 5(c), we compute the difference between the given frame and the background model (Fig. 5(d)). According to (2), a predetermined threshold is applied to obtain a binary image containing the foreground regions. Further processing yields a refined foreground result (Fig. 5(f)).
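The further processing mentioned above can be as simple as discarding small connected components. The sketch below is one way to do this; the minimum-area value is a hypothetical parameter that the paper does not specify.

import numpy as np
import cv2

def remove_small_blobs(fg_mask, min_area=200):
    """Keep only connected foreground components above a minimum area."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(
        fg_mask.astype(np.uint8), connectivity=8)
    cleaned = np.zeros_like(fg_mask, dtype=np.uint8)
    for lbl in range(1, n):                      # label 0 is the background
        if stats[lbl, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == lbl] = 1
    return cleaned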

3.3 Background subtraction using the color feature

Similar to BGS using the depth feature, our segmentation prototype is applied to the color feature. The original color frame is shown in Fig. 6(a). For the background model given in Fig. 4, the difference between the given frame and the background model is shown in Fig. 6(b). Using a predetermined threshold in (2), we obtain the foreground regions shown in Fig. 6(c). However, selecting a suitable threshold for BGS with the color feature is more difficult than with the depth feature.

3.4 Combination of Depth and Color

Our combination takes a disjunction of the foregrounds detected by the depth and color features. The final segmentation result is therefore defined by:

    F ≡ Fd ∪ Fc        (3)


Fig. 5. BGS using the depth feature. (a)-(b) Original color and depth images, respectively. (c) Background model of depth. (d) Difference between the given frame and the background model. (e) Fd segmentation. (f) Result after removing small blobs.

The strategy for the combination is that, wherever depth indicates that a pixel is within the valid range of the depth measurement, color matching is unimportant since the depth information alone is sufficient for correct segmentation. Therefore, a valid-depth condition is used to obtain the foreground from depth:

    Valid(Fd) ≡ Depth < MaxVal

where MaxVal is the depth value assigned to measurements that are out of range of the depth sensor. Given a depth image F, whose noise has been filtered using the noise model proposed in Sec. 2, the foreground regions Fd can be estimated by (2). However, we must handle the presence of low-confidence depth values, which we have been referring to as


Fig. 6. (a) Original color image. (b) Difference from the background model. (c) Fc segmentation result.


Fig. 7. First row: (a) Original depth image; (b) Valid depth measurements from the background model; (c) Valid depth combined with the foreground image; (d) Original color image. Second row: (e) Difference of color from the background model without the depth association of (c); (f) Difference of color from the background model with the depth association of (c); (h)-(i) foreground segmentations from (e)-(f), respectively; (k) Final result: the disjunction of (g) and (i).

invalid. The procedure to eliminate the invalid pixels is:

    B = Valid(Pmd) ∩ (1 − Fd)        (4)
    Valid(Fd) = Valid(F) − B

The effectiveness of the combination scheme is illustrated in Fig. 7. Following the proposed scheme, the valid depth in use is identified first. Fig. 7(b) shows the valid depth pixels of the background model, i.e., the pixels within range of the depth sensor. The valid depth is then restricted to the foreground image in Fig. 7(c); this subfigure shows the pixels where depth is more reliable than the color feature. Without the depth feature, the foreground segmentation includes many shadow pixels around the box, as shown in Fig. 7(e); using the depth information, most of the shadow is removed in Fig. 7(f). The final result is the disjunction of depth and color, shown in Fig. 7(k). On the other hand, this example also shows the effectiveness of the color feature: for pixels out of the depth range (invalid depth pixels), such as the border regions of the image, the foreground segmentation from the color feature is used. Therefore, in the final result, the hand of the person holding the box is included.
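To make the combination scheme concrete, the sketch below gives one possible reading of Eqs. (3)-(4). The way Valid(Fd) restricts the depth detector in the final disjunction, the variable names, and the MaxVal placeholder are our interpretation, not the authors' exact implementation.

import numpy as np

def combine_depth_color(fg_depth, fg_color, depth, bg_depth_mode, max_val=2047):
    """One possible reading of the combination of Eqs. (3)-(4).

    fg_depth, fg_color: binary foreground masks from the depth and color models
    depth:              current (noise-filtered) depth frame
    bg_depth_mode:      per-pixel mode of the depth background model (P_md)
    max_val:            value marking out-of-range depth; 2047 is only a
                        placeholder for the sensor-specific constant MaxVal
    """
    fg_depth = fg_depth.astype(bool)
    fg_color = fg_color.astype(bool)
    valid_bg = bg_depth_mode < max_val      # Valid(P_md)
    valid_f = depth < max_val               # Valid(F)
    B = valid_bg & ~fg_depth                # Eq. (4): confidently background pixels
    valid_fd = valid_f & ~B                 # Valid(F_d)
    # Eq. (3): disjunction, with the depth detector limited to its valid support
    return ((fg_depth & valid_fd) | fg_color).astype(np.uint8)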

4 Experiments

The aim of our experiments is to demonstrate the performance of the proposed method. We evaluate it in two respects: (1) comparing the proposed method with a state-of-the-art combination method on the public dataset provided in [4]; (2) confirming that the proposed method successfully segments an image sequence containing human fall and fall-like actions (the MICA-FALL sequences).

4.1 Segmentation results using the public dataset in [4]

The dataset provided in [4] includes several indoor sequences acquired with a Microsoft KINECT [1]. Each sequence contains a specific challenge for BGS, such as shadow or camouflage, and the dataset is designed to test the performance of BGS algorithms. A detailed description of this dataset can be found in [4].

Fig. 8. Frame 1014 of the GenSeq sequence: depth data (A), FG detected based on depth (B), color data (C), FG detected based on color (D), FG detected based on the combination of color and depth (E), and the result of [4] (F), which shows some false detections in front of the feet of the person.

Figure 8 shows the segmentation result on frame #1014 of the GenSeq sequence. The challenge of GenSeq is the illumination change while the man moves into the room. Our segmentation result is shown in Figure 8-E. As shown, the FG detected by the proposed method is better than that obtained using only depth data (Figure 8-B) or only color (Figure 8-D): the combination of depth and color removes almost all false detections of the two single-feature methods. Qualitatively, this result is also better than the result of [4] (Figure 8-F), which includes some false detections, mainly because the shadow of the person on the floor negatively affects the segmentation of [4].

Fig. 9. The result on an example sub-sequence of the MICA-FALL dataset.

4.2 Segmentation results on the MICA-FALL sequences

The MICA-FALL sequences are intended for detecting human fall and fall-like actions. The main purpose is to automatically detect abnormal activities of a person (a patient in hospital, an elderly person) in order to alert assistance as soon as possible. The sequences are captured with a KINECT device in an indoor environment and are more challenging for foreground segmentation: there are large shadows on the wall when the patient moves in the room, and reflections of the patient's body on the floor when he falls down. Moreover, the field of view in the treatment room is quite large, and the patient often goes out of range of the depth sensor. Figure 9 shows the result on an image sequence of a fall-down action. These segmentation results are clearly usable for subsequent recognition work.

4.3 Quantitative evaluation

We use a measure based on the binary Jaccard index, as described in [12]:

    JI = |FG ∩ GT| / |FG ∪ GT|        (5)

where FG is the detected foreground and GT is the ground truth. Table 1 shows the results on the six sequences of our experiment. They indicate that our method works effectively in most cases. The result on the GenSeq sequence shows a disadvantage of the proposed method: it makes some errors when the object is close to the background while within the valid range of the depth measurement, because in this case the depth measurements of foreground and background are nearly equal.

Table 1. Evaluation using the measure based on the Jaccard index

Sequence:  ColCam    GenSeq    DcamSeq    ShSeq     stereoSeq    FallSeq    average
JI:        94.77%    86.87%    60.71%     90.76%    79.86%       77.20%     81%
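For reference, the measure of Eq. (5) can be computed directly from the binary masks, as in the short sketch below; the handling of the empty-union case is a convention we add.

import numpy as np

def jaccard_index(fg, gt):
    """Binary Jaccard index of Eq. (5): |FG ∩ GT| / |FG ∪ GT|."""
    fg = fg.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(fg, gt).sum()
    if union == 0:
        return 1.0    # both masks empty: treat as perfect agreement
    return float(np.logical_and(fg, gt).sum()) / float(union)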

5 Conclusions

This paper proposed an efficient method for background subtraction with KINECT data. We took into account the noise of the depth feature; the proposed noise model proved effective at eliminating depth noise and is attractive for identifying the valid depth pixels. Our combination scheme is based on the complementary advantages of valid depth and the full view of color, which together yield fine segmentation results. The proposed method was evaluated on several image sequences and was also successful at segmenting human activities in our challenging task, namely recognizing human fall-down actions. The method still has limitations where the shadow of the moving subject falls outside the range of the depth measurement; these limitations suggest directions for further research.

References

1. Microsoft Kinect, http://www.xbox.com/en-US/xbox360/accessories/kinect, 2013.
2. Baf, F., Bouwmans, T., Vachon, B.: Type-2 Fuzzy Mixture of Gaussians Model: Application to Background Modeling. Advances in Visual Computing, pp. 772-781 (2008).
3. Gordon, G., Darrell, T., Harville, M., Woodfill, J.: Background estimation and removal based on range and color. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1-6 (Jun. 1999).
4. Camplani, M., Salgado, L.: Background foreground segmentation with RGB-D Kinect data: an efficient combination of classifiers. Journal of Visual Communication and Image Representation, Elsevier, in press.
5. Camplani, M., Salgado, L.: Efficient spatio-temporal hole filling strategy for Kinect depth maps. In: Three-Dimensional Image Processing (3DIP) and Applications II, vol. 8290, SPIE, p. 82900E (2012).
6. Grimson, W.E.L., Stauffer, C., Romano, R., Lee, L.: Using adaptive tracking to classify and monitor activities in a site. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Santa Barbara (Jun. 1998).
7. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 246-252 (Aug. 1999).
8. Sigari, M.H., Mozayani, N., Pourreza, H.R.: Fuzzy Running Average and Fuzzy Background Subtraction: Concepts and Application. International Journal of Computer Science and Network Security 8, 138-143 (2008).
9. Brutzer, S., Hoferlin, B., Heidemann, G.: Evaluation of background subtraction techniques for video surveillance. In: CVPR 2011, pp. 1937-1944 (2011).
10. Schiller, I., Koch, R.: Improved Video Segmentation by Adaptive Combination of Depth Keying and Mixture-of-Gaussians. Lecture Notes in Computer Science (Image Analysis) 6688, 59-68 (2011).
11. Bouwmans, T.: Recent Advanced Statistical Background Modeling for Foreground Detection - A Systematic Survey. RPCS, 147-176 (2011).
12. McGuinness, K., O'Connor, N.E.: A comparative evaluation of interactive segmentation algorithms. Pattern Recognition 43, 434-444 (2010).
