Improved video segmentation with color and depth using a stereo camera

Simona Ottonelli, Paolo Spagnolo, Pier Luigi Mazzeo
National Research Council of Italy, Institute of Intelligent Systems for Automation,
Via Amendola 122 D/O, 70126 Bari, Italy
Email: [email protected]

Marco Leo
National Research Council of Italy, Institute of Optics (CNR-INO),
Via della Libertà 3, 73010 Arnesano (Lecce), Italy
Abstract—The detection of moving objects is a crucial step in many application contexts such as people detection, action recognition, and visual surveillance for safety and security. Recent advances in depth camera technology have suggested the possibility of exploiting multi-sensor information (color and depth) in order to achieve better results in video segmentation. In this paper, we present a technique that combines depth and color image information, and we demonstrate its effectiveness through experiments performed on real image sequences recorded by means of a stereo camera.
I. INTRODUCTION

The accurate detection and tracking of moving objects in video sequences represents a relevant problem in computer vision and image processing. Possible applications are manifold, including surveillance, human motion capture, traffic monitoring, 3D immersive videoconferencing, immersive games, human-computer interaction and assisted rehabilitation.

The most critical step in video processing is background subtraction, which allows the scene to be segmented into static (background) and moving (foreground) regions. More specifically, background subtraction basically consists of comparing each new frame with a suitable model of the scene background: significant differences between the current frame and the model usually correspond to foreground objects. Traditionally, foreground/background detection has been approached by exploiting only color information through a large variety of mathematical algorithms [1] based on graph cuts [2], mean-shift clustering [3], Gaussian mixture models [4], or non-parametric kernel density estimation [5], among others. Each of these methods works successfully in controlled laboratory settings, but often fails in uncontrolled real-world conditions due to dynamic changes. Some of the well-known causes of failure are global and local light changes, moving background, camouflage and bootstrapping [6].

Recently, a promising segmentation algorithm has been proposed by Barnich et al. [7], [8], [9]. This method, named ViBe, provides a sample-based estimation of the background via a random selection policy of the background values. This approach has proven effective in real-world experiments, showing robustness toward noisy acquisitions,
fast processing speed, versatility toward several experimental conditions, and great accuracy on silhouette contours. In spite of these relevant benefits, the ViBe algorithm still suffers from the limitations described hereinafter. Despite the great interest of both the academic and the industrial world in this topic, as demonstrated by the huge amount of research activity reported in the literature, the definition of a real-time background/foreground model showing robustness, reliability and responsiveness at the same time still remains an open and complex problem.

In order to overcome these limitations, several authors have recently shown increasing interest in the use of depth information together with the color content of a video scene [10], [11], [12], [13], [14], [15], [16], [17], [18]. This interest has also been stimulated by the recent rapid advances in hardware and software for the real-time computation of depth maps, i.e. images in which each pixel contains a measurement of the distance between the camera and the object visible in that pixel. Depth data are intrinsically able to compensate for noise and drawbacks in color data: since the shapes of scene objects are not affected by illumination, shadows, and inter-reflections, depth information provides much robustness against such phenomena. On the other hand, depth data are usually noisy and unreliable, especially in scene locations with little visual texture or containing foreground objects in close proximity to the background. Hence, background subtraction methods based on depth alone often produce invalid results [19], [20]. On the basis of the above considerations, the optimal solution consists in fusing color and depth segmentation, so that the intrinsic limitations of each method can be counterbalanced and much improved segmentation results can be achieved.

In this paper, we describe an innovative algorithm for foreground object segmentation based on both color and depth information. Starting from the well-established ViBE algorithm for background subtraction, we demonstrate how depth information can helpfully contribute to enhancing the standard foreground extraction based on color images. The use of ViBE on both color and depth images is not new, since in [10], [11] it has already been applied for segmenting moving objects (further details are given in Section II-B). However, in those papers the vision system consists of a 3D
ToF (time-of-flight) camera with an additional RGB camera, thus producing an experimental arrangement that is expensive, bulky, and extremely difficult to align and calibrate. Instead, in this paper we focus our attention on a simpler setup consisting only of a standard stereo camera, which typically exhibits a better resolution and a lower cost with respect to ToF cameras, and is able to produce intrinsically calibrated RGB and depth frames.

This paper is organized as follows. A brief review of related work is presented in Section II. Section III presents a detailed description of the proposed method for foreground segmentation. Section IV describes basic principles, benefits and drawbacks of depth camera technology, and then presents a description and discussion of the experimental results. Finally, Section V concludes the paper.

II. RELATED WORK

A. Previous work

While the literature about foreground extraction based only on color information is extremely wide, the number of papers focused on scene segmentation by means of color and depth information is still rather limited. Basically, there are two main approaches: the first one performs two independent segmentations, one on the color image and one on the depth data, and then joins the two results; the second one fuses the RGB/depth data before performing a joint segmentation.

One of the earliest works in this field is that of Gordon et al. [12], where depth data are inferred by a stereo camera and the background estimation is based on a multidimensional (range and color) clustering at each image pixel in the mixture-of-Gaussians approximation. Segmentation of the foreground in a given frame is then performed via comparison with background statistics in both range and normalized color space, using the depth information to make the color segmentation criteria dynamically adaptive. However, the proposed method lacks time adaptivity because it uses a relatively simple statistical model of the background. Improvements are described in [6] through a modulation of the background model learning rate based on scene activity, still maintaining the linkage between the depth and color information (4-vector YUVD) by means of a threshold that reciprocally depends on the color and depth images.

Bleiweiss and Werman [13] also present a joint depth and color segmentation criterion which exploits a 6-dimensional vector XYL*U*V*D (XY 2D lattice coordinates, color components converted to the L*U*V* color space, and D depth coordinate), extracted by a 3D time-of-flight camera. After the 6D vectors are computed, they are processed iteratively through mean-shift clustering until the desired convergence level is reached. A similar approach is followed in [14], except for the conversion of the RGB image to the CIELab color space, the use of the k-means clustering method and the addition of a final refinement step for noise removal.

A different approach is described in [15], which exploits three kinds of information: motion, color and depth. Motion is used at the beginning in order to get a prior about the
position of the "Region Of Interest" within the scene; the foreground segmentation area is then obtained by applying a region-growing algorithm to the depth map, whereas color information is only used to provide feedback about the segmentation confidence, thus producing a better refined output. Similarly, the method presented in [16] extracts foreground objects in real time by means of a user-defined depth threshold, while the color image is used only to support the small fraction (12%) of pixels which are not resolved by the depth threshold, named uncertain pixels. In this last case two likelihood functions, one built on the basis of joint depth and color information and the other one on the basis of color data only, are evaluated in order to assign uncertain samples to the background or to the foreground. Another solution is based on graph cuts, where both the data and smoothness terms are computed by a fusion of depth and color information [17], [18]: in [17] the data term merges the color and depth channels by using the maximum distance between the experimental pixel values and the corresponding Gaussian Mixture Models (GMM), whereas in [18] the data term is obtained by using the weighted sum of two likelihoods, one for color frames and the other for depth frames, both modelled with GMMs and learned using an Expectation-Maximization (EM) algorithm.

Focusing on people detection and tracking, video segmentation in both indoor environments and streets is reported in [21], [22]. In [21] the authors find tracklets (chronological sequences of observations that are likely to correspond to the same person), apply a Histogram-of-Oriented-Gradients (HOG) classifier to each color frame, and finally show that tracklets with scores exceeding a predefined threshold correspond to people. Instead, in [22] the foreground is extracted via a Mixture of Gaussians model, and pedestrians in each foreground region are then separated through a threshold selected from the histogram of the depth data.

B. ViBE and depth maps

Interestingly, the sample-based ViBE algorithm described in Section I has also been applied to color, depth and motion for achieving better results in video segmentation, specifically for people detection applications. The main results are shown in [10], where the combined segmentation process proposed by the authors basically consists in the individual application of ViBE to the color and depth video frames, in the combination of the extracted foregrounds by means of a logical "or", and in a final refinement step for noise removal by morphological filtering. The feasibility of the proposed algorithm has been demonstrated by means of an experimental setup made of an RGB camera and a 3D time-of-flight camera (an indoor PMD Photonic Mixer Device camera), mounted one on top of the other and both equipped with similar objectives. Results show that color and depth segmentation are able to compensate for each other's intrinsic limitations: color segmentation is helpful when the user is close to the background and the depth map is too noisy to produce a valid foreground mask; on the other side, the depth contribution is relevant in those situations
Fig. 1. Foreground extracted from the RGB (b) and DM (c) images by means of the ViBE algorithm, compared with the true silhouette obtained through manual segmentation (a).
where color segmentation usually fails, e.g. in the presence of illumination changes or when the colors of the user are similar to those of the background. However, the fusion of color and depth segmentation using PMD technology and an RGB camera brings not only advantages (pixels correctly recognized as foreground) but also drawbacks (pixels erroneously recognized as foreground), represented firstly by the persistence of motion when there is a fast movement, and secondly by the appearance of infrared shadows induced by the specific geometry of the PMD camera. The resolution of these drawbacks is the aim of [11], which exploits the same setup as [10] in order to improve the application of the ViBE algorithm to color and depth videos and offer better performance in unconstrained conditions. In spite of the good results achieved, in both papers the authors complain about the difficult alignment procedure between the PMD and RGB cameras, a step made even more complex by the low resolution of the PMD camera with respect to the RGB one. Besides, PMD cameras exhibit wiggling, i.e. a static error consisting of periodic noise affecting the depth measurement.
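As an illustration only, the combination strategy summarized above for [10] can be sketched as follows. Python with OpenCV is assumed here; fg_color and fg_depth stand for the two binary foreground masks produced by a background subtractor run separately on the color and depth streams, and the function name and kernel size are ours, not taken from [10].

```python
import cv2

def combine_color_depth_masks(fg_color, fg_depth, kernel_size=5):
    """Fuse two binary (0/255) foreground masks with a logical OR and
    remove small artifacts with a morphological opening."""
    combined = cv2.bitwise_or(fg_color, fg_depth)               # logical "or"
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    return cv2.morphologyEx(combined, cv2.MORPH_OPEN, kernel)   # noise removal
```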
III. SEGMENTATION BASED ON COLOR AND DEPTH: PROPOSED METHOD

A preliminary segmentation by the ViBE algorithm produces the video sequences DM_ViBE and RGB_ViBE. However, the individual segmentation of color and depth images produces unsatisfying results, as already underlined in [10], [11] and confirmed by the test frame shown in Fig. 1. In fact, a comparison between the resulting foregrounds in Fig. 1(b)-(c) and the ground truth obtained by manual delineation in Fig. 1(a) shows that both RGB_ViBE and DM_ViBE fail: Fig. 1(b) is highly accurate in the contours but presents multiple holes inside the silhouette because of camouflage and light changes, whereas Fig. 1(c) shows extremely low resolution and fails in recognizing hands and legs.

On the basis of the above considerations, the method we propose is based on two kinds of input images: an RGB color frame and the corresponding DM (depth map) frame. The method, whose main steps are sketched in Fig. 2, starts from the foreground mask RGB_ViBE, given its great reliability in contour definition, and then adds a compensation factor (CF) obtained from both color and depth information. The schematic description of the compensation factor algorithm is shown in Fig. 3. Once CF has been evaluated, the final foreground mask (F) is obtained after a logical "or" and a further noise-removal step (NR), which can be skipped if the result is already clean before applying it. Formally, considering a generic frame i (i = 1...n, with n the total number of frames), the main steps of the proposed method can be described as follows:

• The color and depth segmented frames are first converted to binary form (pixels taking only 0/1 values) and then subtracted, producing the subtraction mask S^i:

  S^i = DM^i_ViBE - RGB^i_ViBE    (1)

  We only consider positive pixel values in S^i, since our goal is to add something to the original mask RGB^i_ViBE without deleting any of its portions:

  S'^i(j,k) = S^i(j,k) if S^i(j,k) > 0, and 0 otherwise, for all (j,k)    (2)

  where (j,k) are the coordinates of a generic pixel, with k = 1...K (K the total number of columns) and j = 1...J (J the total number of rows). The result is a logical mask that contains only those regions of DM^i_ViBE which are not included in RGB^i_ViBE, with no consideration of the reliability of such regions.

• An averaging filter (AF) is applied to the original color frame RGB^i after its conversion from RGB to grayscale (producing the image GRAY^i), so that the gray level of each pixel is replaced by the average of the gray levels of the surrounding pixels in a square window. For a generic pixel with coordinates (j,k), the filtered gray level is given by:

  T^i(j,k) = (1 / 4M^2) * sum_{s=-M..M} sum_{t=-M..M} GRAY^i(j+s, k+t)    (3)

  where T^i is the filtered image and M = 30 is the half-side of the square kernel used for averaging. After this step, the intensity value T^i(j,k) is compared with a user-defined gray-level threshold TH, so that a logical value is assigned to that pixel according to the following criterion:

  D^i(j,k) = 1 if |GRAY^i(j,k) - T^i(j,k)| < TH, and 0 otherwise    (4)
Fig. 2. Schematic of the foreground extraction algorithm. CF stands for "depth-based compensation factor", NR stands for "noise removal", and F stands for "final foreground mask".
Fig. 3. Schematic of the depth-based compensation factor (CF) algorithm: S represents the "subtraction mask", AF stands for "averaging filter", D stands for "depth-enhanced mask" and TH is the decision threshold (in gray levels).
Experiments have shown that a unique threshold TH of 20 is appropriate for the tested image sequences, hence there is no need to adapt this value during video processing. Repeating this procedure for each pixel, a logical depth-enhanced mask D^i is built. The role of this mask is to provide a selection guideline about which pixels in the subtraction mask S'^i should be accepted and which should be rejected in order to obtain the compensation factor. The basic idea is to consider a pixel valid only if it comes from a uniform color region, so that its intensity is close to the mean intensity computed over the surrounding averaging window. To this aim, the mask D^i is applied to the subtraction mask S'^i by means of a logical "and", thus computing the compensation factor CF as:

  CF^i = D^i ∧ S'^i    (5)

• Finally, the final foreground F is obtained through the application of an "or" operator between the color-segmented foreground RGB_ViBE and the compensation factor CF:

  F^i = RGB^i_ViBE ∨ CF^i    (6)
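To make these steps concrete, the following minimal sketch chains the whole per-frame fusion of Eqs. (1)-(6). It is an illustrative reimplementation under our reading of the method, not the code used in the experiments; Python with OpenCV and numpy is assumed, the function name is ours, and the min_area value for the optional noise-removal step is a hypothetical choice (the paper only requires "a certain minimum area").

```python
import cv2
import numpy as np

def fuse_color_depth(rgb_vibe, dm_vibe, gray, M=30, TH=20.0, min_area=150):
    """Per-frame fusion of the binary ViBE masks of the color (rgb_vibe) and
    depth (dm_vibe) frames, using the grayscale color frame (gray)."""
    # Eqs. (1)-(2): subtraction mask, keeping only the pixels that the depth
    # mask adds with respect to the color mask (negative values discarded).
    s = (dm_vibe.astype(np.int8) - rgb_vibe.astype(np.int8)) > 0
    # Eq. (3): averaging filter AF over a (2M+1) x (2M+1) window
    # (cv2.blur normalizes by (2M+1)^2, essentially the 1/(4M^2) of Eq. (3)).
    t = cv2.blur(gray.astype(np.float32), (2 * M + 1, 2 * M + 1))
    # Eq. (4): depth-enhanced mask D, true where the pixel lies in a
    # uniform color region (intensity within TH of the local average).
    d = np.abs(gray.astype(np.float32) - t) < TH
    # Eq. (5): compensation factor CF = D AND S'.
    cf = np.logical_and(d, s)
    # Eq. (6): final foreground F = RGB_ViBE OR CF.
    f = np.logical_or(rgb_vibe.astype(bool), cf)
    # Optional NR step: keep only connected components above min_area pixels.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(f.astype(np.uint8))
    keep = [i for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]
    return np.isin(labels, keep)
```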
If necessary, the final foreground can be filtered to remove noise by keeping only the connected components above a certain minimum area. Holes inside the connected components can also be filled using a morphological closing with a small structuring element.

IV. EXPERIMENTAL RESULTS

A. Comparison between depth cameras

There are several kinds of depth camera that can be exploited to infer depth information from a video sequence, the most important of which are stereo cameras and time-of-flight (ToF) cameras.
Stereo vision systems exploit a passive method, since they use the information coming from two identical color cameras to localize the position of each pixel in both color images through simple geometrical considerations. On the contrary, a ToF camera is based on an active method, since it computes the depth information directly by measuring the time of flight of a light signal between the camera and the subject for each point of the image.

A comparison between the two systems shows that stereo cameras have high resolution and a long operative range; moreover, the two color cameras are pre-calibrated and pre-registered, thus allowing the simultaneous acquisition of aligned RGB and depth frames. However, a stereo camera requires parameter tuning for each new environment and often fails in retrieving depth information in non-textured regions. On the other side, a ToF camera is more compact (there is no need for a baseline), it acquires depth information directly on the camera hardware, and it shows faster processing times, making it ideally suited to real-time applications. However, ToF cameras have several limitations: besides being rather expensive, their depth measurements can be very noisy, with low resolution, and affected by ambient light, limiting their use to indoor scenarios with controlled lighting. Lastly, since a ToF camera typically provides only depth images, the user is forced to add and align an additional color camera in all those applications requiring both color and depth information. In this case, the use of a standard stereo camera still represents the most suitable and reliable solution.

B. Experimental setup

The experiments have been performed on three real indoor image sequences acquired with a moving stereo camera (Point Grey Bumblebee2, model 08S2C), having a maximum frame rate of 20 Hz and a selected resolution of 480 × 640 pixels. The processing has been implemented off-line on a Pentium IV computer with a 3.2 GHz CPU and 4 GB of RAM. The three sequences consist of 155, 106 and 140 frames respectively, and show a person moving at a distance of approximately 2.5 m from the camera. The first sequence has been acquired in controlled lighting conditions, i.e. in a hallway with only artificial lights. On the contrary, the second and third sequences have been recorded near the door and the window of an office, thus producing much more variable illumination conditions. The preliminary calibration of the stereo camera has guaranteed the geometrical alignment between the produced disparity map and one of the two color images (the left one in our case). As a consequence, no further alignment operations have been required.

Before the application of the segmentation algorithm, the color and depth video sequences have been recorded on the PC and separately processed with the ViBE executable file "vibe-rgb.exe" [23], choosing the following ViBE parameters as suggested in [8]:

• number of samples per pixel model, N = 20;
• cardinality, i.e. the number of close samples needed to classify a current pixel value as background, s = 2;
• subsampling factor, i.e. the amount of random temporal subsampling, f = 16;
• matching threshold used to compare pixel samples to the current pixel value, t = 20.
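For readers unfamiliar with ViBE, the following sketch illustrates the sample-based classification rule that these parameters control. It is a simplified illustration based on the description in [8], not the executable used in our experiments: the function name is ours, and ViBE's spatial propagation step (updating a neighboring pixel's model) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng()

def vibe_classify_and_update(samples, pixel, s=2, t=20.0, f=16):
    """Classify one pixel value against its ViBE sample model and, if it is
    background, update the model in place with probability 1/f.
    samples: array of the N stored background values for this pixel."""
    # Background if at least s stored samples lie within radius t
    # (the matching threshold) of the current value.
    close = np.abs(samples - pixel) < t
    is_background = np.count_nonzero(close) >= s
    # Conservative random update: a background pixel replaces one of its
    # own samples with probability 1/f (temporal subsampling).
    if is_background and rng.integers(f) == 0:
        samples[rng.integers(samples.size)] = pixel
    return is_background
```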
C. Results and discussion

The main results are reported in Fig. 4, which shows three sample frames, one for each tested sequence. The original color and depth images are illustrated in rows (a) and (b). The result of the standard segmentation approach, obtained by applying the ViBE algorithm to the color images, is reported in row (c). Finally, row (d) shows the result of our new segmentation approach. The qualitative comparison between the last two rows demonstrates that our method succeeds in providing a more complete figure segmentation, since it is able to fill several empty areas within the silhouette.

In order to have a quantitative estimation of the error, we have considered some common metrics usually used to assess the output of a foreground detection algorithm given a series of ground-truth segmentation maps. In detail, the attention has been focused on the Recall, the False Alarm Rate (FAR), and the Percentage of Correct Classifications (PCC), as proposed in [8], [24]:

  Recall = TP / (FN + TP)    (7)

  FAR = FP / (FP + TP)    (8)

  PCC = (TP + TN) / (TP + FP + TN + FN)    (9)

where TP (true positives) are the detected regions that correspond to moving objects; FP (false positives) are the detected regions that are erroneously classified as moving objects; TN (true negatives) are the regions that correspond to the background; and FN (false negatives) are moving objects that are not detected. In order to minimize errors, an image should exhibit PCC and Recall values as high as possible and a FAR value as low as possible.
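As an illustration, these metrics can be computed from a predicted binary foreground mask and the corresponding ground-truth mask as in the sketch below (numpy assumed; the function name is ours):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Recall, FAR and PCC (Eqs. 7-9) for a predicted binary mask `pred`
    against a ground-truth binary mask `gt`."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.count_nonzero(pred & gt)    # foreground detected as foreground
    fp = np.count_nonzero(pred & ~gt)   # background detected as foreground
    tn = np.count_nonzero(~pred & ~gt)  # background detected as background
    fn = np.count_nonzero(~pred & gt)   # foreground missed
    return {
        "Recall": tp / (fn + tp),
        "FAR": fp / (fp + tp),
        "PCC": (tp + tn) / (tp + fp + tn + fn),
    }
```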
TABLE I. EXPERIMENTAL RESULTS.

  Acquisition | R(%) color | R(%) color+depth | FAR(%) color | FAR(%) color+depth | PCC(%) color | PCC(%) color+depth
  1           | 80.89      | 91.12            |  1.23        |  3.05              | 96.82        | 98.14
  2           | 64.55      | 80.05            | 10.83        | 10.04              | 93.78        | 95.87
  3           | 67.38      | 94.34            | 40.69        | 36.42              | 93.08        | 94.64

Results for the three frames shown in Fig. 4 are listed in Table I and are compared with the results achievable by means of the standard segmentation approach based only on color images. The validity of the proposed method is confirmed by the high value of PCC, which is always greater than the PCC based only on color images. At the same time, we can note a remarkable improvement of the Recall value, which increases by about 10, 15 and 27 percentage points for the three tested sequences, respectively. Accordingly, the FAR parameter also decreases by about 1-4 percentage points, except in the first sequence; anyway, this is not a limitation, since the slight worsening of the FAR parameter is counterbalanced by the Recall, so that the PCC value is still better than the one based only on color images. Hence, it can be concluded that the proposed algorithm produces an overall improvement in the foreground segmentation
with respect to the standard color-based approach. The different numerical results in the three tested sequences can be ascribed to differences in the experimental conditions: Sequence 1 (recorded under controlled illumination) exhibits the highest absolute Recall value but a reduced relative increase with respect to the color segmentation; Sequences 2 and 3 (acquired under strongly variable illumination) exhibit a smaller absolute Recall value but a greater relative increase with respect to the standard color segmentation results.

V. CONCLUSION

This paper presents a method that improves the foreground segmentation of moving people by means of the fusion of two image modalities, color and depth. Starting from the foreground extracted by using color images only, the proposed algorithm exploits the depth information to enhance the result and achieve a more complete human silhouette detection. The method has been validated on three video sequences recorded by means of a standard stereo camera. Experimental results show a quantitative and qualitative improvement in the segmented foreground. The quantitative improvement is confirmed by an increase in the detected recall and precision. The qualitative improvement is demonstrated by the visual inspection of the foreground masks in Fig. 4, showing that the proposed method is able to detect the overall human silhouette, including limbs or parts of the chest which are not detected by the standard approach. This makes the method ideally suitable for all those applications which are based on action recognition and/or body skeletonization, e.g. video surveillance for security or safety. As future work, we are testing our approach on other challenging indoor sequences in the presence of more than one person, and we are also evaluating the possibility of using this approach to both handle occlusions and perform tracking in crowded scenes.

REFERENCES

[1] M. Piccardi, "Background subtraction techniques: a review," in Systems, Man and Cybernetics (SMC), vol. 4, Oct. 2004, pp. 3099-3104.
[2] Y. Boykov and M. Jolly, "Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images," in Computer Vision (ICCV), vol. 1, 2001.
[3] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: color image segmentation," in Computer Vision and Pattern Recognition (CVPR), Jun. 1997, pp. 750-755.
[4] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in Computer Vision and Pattern Recognition (CVPR), vol. 2, 1999.
Fig. 4. Three sample frames (one for each of Sequences 1, 2 and 3) used to test the proposed model: the four rows show the color image (RGB), the normalized depth image (DM), the segmented color image (RGB_ViBE), and the segmented image obtained using our new approach (F), respectively.
[5] A. M. Elgammal, D. Harwood, and L. S. Davis, "Non-parametric model for background subtraction," in European Conference on Computer Vision (ECCV). London, UK: Springer-Verlag, 2000, pp. 751-767.
[6] M. Harville, G. Gordon, and J. Woodfill, "Foreground segmentation using adaptive mixture models in color and depth," in IEEE Workshop on Detection and Recognition of Events in Video (EVENT'01), 2001, pp. 3-11.
[7] O. Barnich and M. Van Droogenbroeck, "ViBe: a powerful random technique to estimate the background in video sequences," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2009, pp. 945-948.
[8] O. Barnich and M. Van Droogenbroeck, "ViBe: A universal background subtraction algorithm for video sequences," Image Processing, IEEE Transactions on, vol. 20, no. 6, pp. 1709-1724, June 2011.
[9] M. Van Droogenbroeck and O. Paquot, "Background subtraction: Experiments and improvements for ViBe," in Computer Vision and Pattern Recognition Workshops (CVPRW), June 2012, pp. 32-37.
[10] J. Leens, S. Piérard, O. Barnich, M. Van Droogenbroeck, and J.-M. Wagner, "Combining color, depth, and motion for video segmentation," in International Conference on Computer Vision Systems (ICVS), ser. Lecture Notes in Computer Science, vol. 5815. Springer, 2009, pp. 104-113.
[11] S. Piérard and M. Van Droogenbroeck, "Techniques to improve the foreground segmentation with a 3D camera and a color camera," in Workshop on Circuits, Systems and Signal Processing (ProRISC), Veldhoven, The Netherlands, November 2009, pp. 247-250.
[12] G. Gordon, T. Darrell, M. Harville, and J. Woodfill, "Background estimation and removal based on range and color," in Computer Vision and Pattern Recognition (CVPR), vol. 2, 1999.
[13] A. Bleiweiss and M. Werman, "Fusing time-of-flight depth and color for real-time segmentation and tracking," in Dynamic 3D Imaging, ser. Lecture Notes in Computer Science, A. Kolb and R. Koch, Eds. Springer Berlin / Heidelberg, 2009, vol. 5742, pp. 58-69.
[14] C. D. Mutto, P. Zanuttigh, and G. M. Cortelazzo, "Scene segmentation by color and depth information and its application," in Streaming Day, Udine, Italy, September 2010.
[15] E. Mirante, M. Georgiev, and A. Gotchev, "A fast image segmentation algorithm using color and depth map," in The True Vision - Capture, Transmission and Display of 3D Video (3DTV), May 2011, pp. 1-4.
[16] R. Crabb, C. Tracey, A. Puranik, and J. Davis, "Real-time foreground segmentation via range and color imaging," in Computer Vision and Pattern Recognition Workshops (CVPRW), June 2008, pp. 1-5.
[17] M. Dahan, N. Chen, A. Shamir, and D. Cohen-Or, "Combining color and depth for enhanced image segmentation and retargeting," The Visual Computer, pp. 1-13, 2011.
[18] L. Wang, C. Zhang, R. Yang, and C. Zhang, "TofCut: Towards robust real-time foreground extraction using a time-of-flight camera," in 3D Data Processing, Visualization and Transmission (3DPVT), 2010.
[19] E. Francois and B. Chupeau, "Depth-based segmentation," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 7, no. 1, pp. 237-240, Feb. 1997.
[20] N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis, and S. Kollias, "Efficient summarization of stereoscopic video sequences," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 10, no. 4, pp. 501-517, June 2000.
[21] J. Salas and C. Tomasi, "People detection using color and depth images," in Pattern Recognition, ser. Lecture Notes in Computer Science, J. Martínez-Trinidad, J. Carrasco-Ochoa, C. Ben-Youssef Brants, and E. Hancock, Eds. Springer Berlin / Heidelberg, 2011, vol. 6718, pp. 127-135.
[22] M. Kawabe, J. K. Tan, H. Kim, S. Ishikawa, and T. Morie, "Extraction of individual pedestrians employing stereo camera images," in Control, Automation and Systems (ICCAS), Oct. 2011, pp. 1744-1747.
[23] University of Liege, "ViBe - a powerful technique for background detection and subtraction in video sequences," http://www2.ulg.ac.be/telecom/research/vibe, 2011.
[24] P. Spagnolo, M. Leo, T. D'Orazio, and A. Distante, "Robust moving objects segmentation by background subtraction," in Image Analysis for Multimedia Interactive Services (WIAMIS), Lisboa, Portugal, April 2004, pp. 1744-1747.