VISUAL TRACKING AND SEGMENTATION USING TIME-OF-FLIGHT SENSOR

Omar Arif, Wayne Daley, Patricio Vela, Jochen Teizer and John Stewart
Georgia Institute of Technology, Atlanta, GA 30332
[email protected], [email protected]

ABSTRACT

Time-of-Flight (TOF) sensors provide range information at each pixel in addition to intensity information, and they are becoming more widely available and more affordable. This paper examines the utility of dense TOF range data for image segmentation and tracking. Energy-based formulations for image segmentation are used, which consist of a data term and a smoothness term. The paper proposes novel methods to incorporate range information, obtained from the TOF sensor, into the data term and the smoothness term of the energy. Graph cut is used to minimize the energy.
Fig. 1. Segmentation of the box marked with + sign. (a) Segmentation without distance penalty. (b) Segmentation with distance penalty obtained from the range sensor. (c) Segmentation with distance penalty obtained using the image plane [10], weight β = 0.05. (d) Segmentation with distance penalty obtained using the image plane [10], weight β = 0.2.
Index Terms— Segmentation, Time-of-flight.

1. INTRODUCTION

The problem of visual tracking and segmentation can be expressed in terms of energy minimization [3, 9]. The energy consists of a data term (external energy) and a smoothness term (internal energy). The data term measures how consistent the proposed model is with the sensed data, and the smoothness term measures how piecewise smooth the solution is. The energy minimization task is carried out in the image domain, where the neighborhood system N of the pixels is defined on a 2-dimensional regular grid. Due to the projection equations associated with images, pixels that are in spatial proximity in the image domain can represent objects far from each other in the true scene.

TOF sensors are relatively new imaging devices capable of measuring dense depth information in real time along with intensity information. A TOF-based camera modulates its active illumination source and measures the phase of the returned modulated signal at each pixel. The distance at each pixel is taken as a fraction of the wavelength of the modulated signal. The intensity image corresponds to the amplitude of the reflected signal. Since the depth information and the intensity information are obtained using the same sensor, the depth map is registered with the intensity image and no additional processing is required for correspondence matching.

This paper examines the applicability of TOF sensor data to image segmentation and tracking. Specifically, we describe novel methods to incorporate the range information
into the energy-based formulation. The range information will be added either to the data term or to the smoothness term, and graph cuts [8] will be used to minimize the energy. In Figure 1, multiple objects are placed on a conveyor belt, occluding each other. The intensity values of the box marked with the + sign are used to segment the box. Using the energy formulation of [2], all objects of similar intensity to the target object are also segmented, as shown in Figure 1(a). In Figure 1(b), a distance penalty obtained using the TOF sensor is added to the energy, as explained in Section 4.1, which results in accurate segmentation of the box. A distance penalty based on distances in the image domain [10] cannot produce an accurate segmentation, as shown in Figures 1(c) and 1(d), where different weights are used for the distance penalty.

2. TOF SENSORS

In TOF systems, the time taken for light to travel from an active illumination source to the objects in the field of view and back to the sensor is measured. Given the speed of light c,
the distance can be determined directly from this round-trip time. The TOF sensor modulates its illumination LEDs with a modulation frequency f. A CMOS/CCD (complementary metal oxide semiconductor/charge-coupled device) sensor chip with the associated electronics is positioned to receive and measure the phase of the returned modulated signal at each pixel, resulting in a 176×144-pixel depth map. The distance at each pixel is the measured fraction of the wavelength λ_mod of the modulated signal, which limits the non-ambiguous range measurements to

    D = λ_mod / 2 = c / (2f).    (1)
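To make Equation (1) concrete, the following sketch converts per-pixel phase measurements into distances; the 20 MHz modulation frequency, the array shapes, and the function name are illustrative assumptions, not values reported in this paper.

```python
import numpy as np

C = 3.0e8        # speed of light (m/s)
F_MOD = 20.0e6   # assumed modulation frequency (Hz); the paper does not report f

# Non-ambiguous range from Equation (1): D = lambda_mod / 2 = c / (2 f).
D_MAX = C / (2.0 * F_MOD)  # 7.5 m for a 20 MHz modulation

def phase_to_distance(phase):
    """Map measured phase (radians, in [0, 2*pi)) to metric distance.

    The distance is the measured fraction of the modulation wavelength,
    so any object beyond D_MAX wraps around (phase ambiguity).
    """
    return (np.asarray(phase) / (2.0 * np.pi)) * D_MAX

# Example: a 176x144 array of phases yields a 176x144 depth map.
phases = np.random.uniform(0.0, 2.0 * np.pi, size=(144, 176))
depth = phase_to_distance(phases)
```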
The reflected signal is sampled four times per period, at phase shifts of 1/4 period (i.e., 90°). The four samples are taken as separate exposures, and the intensity/brightness information is the average of the four amplitude samples.

3. RELATED WORK

TOF sensors have found application in numerous computer vision tasks. For example, they have been employed for multiple-people tracking [1], where tracking is performed using a top-down view of the scene and the height from the ground, obtained using the TOF camera, is used as an indicator of the presence of a person. Since a top view of the ground floor limits applicability, as only a small portion of the scene is visible, [5] proposes a method for tracking people moving on a planar surface. The background model is built by fusing the intensity and depth information; tracking is performed by projecting the points onto the ground plane and using an expectation-maximization algorithm to track moving clusters of pixels significantly different from the background model. Gonsalves and Teizer [4] use depth information to track construction workers with a particle filter. The worker is modeled using a star skeleton structure, and motion analysis is performed to determine the variation in angles between the various segments of the model; this is then used for safety and health monitoring purposes. The problem of human pose estimation from depth images is also addressed in [12]. Holte et al. [6] use range and intensity information for view-invariant gesture recognition. The intensity image is used to select the region of interest for the range data, and the range data of the region of interest is represented using shape context, which is based on spherical histograms. The range image has also been used in conjunction with the intensity image within a graph-cut based segmentation framework to segment planar surfaces [7].

4. ENERGY FORMULATION

Let P be the set of all pixels in the image and let L = {0, 1} be the set of label assignments for P. The label 1 means the pixel belongs to the target, while the label 0 means it belongs to the
background. The segmentation problem is cast as that of finding a labeling w : P → L minimizing an energy E(w), modeled by

    E(w) = E_d(I, w) + E_s(w).    (2)

The data term E_d measures how well the labeled pixels fit the image model given the sensed data. Here, the data term follows [9]:

    E_d(w) = Σ_{p ∈ P} F_p(w_p),    (3)

where F_p measures how well the label w_p fits pixel p given the observed data; details are given in Section 4.1. The smoothness term E_s imposes regularity on the solution. Let N be a neighborhood system on P; then E_s is given by

    E_s = Σ_{{p,q} ∈ N} V(w_p, w_q).    (4)
The truncated L2 distance V(w_p, w_q) = min(γ, ‖w_p − w_q‖) or the Potts interaction penalty V = δ(w_p ≠ w_q) are candidate smoothness terms [2].

Graph Cut based Energy Minimization: Graph cut [8] will be used to minimize the energy in Equation (2). The energy is realized on a graph by considering each pixel to be a node, with two additional nodes representing the target and the background. The data term is realized on the graph by connecting each pixel to the target and background nodes, with edge weights derived from Equation (3). The smoothness term E_s is realized by connecting each pair of pixels {p, q} ∈ N with a non-negative edge weight derived from Equation (4). The min cut of the graph represents the segmentation that best separates the target from its background and minimizes the energy E(w). The next two sections describe the proposed methods for adding range information to the data term and the smoothness term.

4.1. Distance Penalty in the Data Term

A typical intensity-based data energy [3, 11] utilizes

    F_p(w_p) = (I_p − µ_t)²,  if w_p = 1,
               (I_p − µ_b)²,  if w_p = 0,    (5)
where µ_t and µ_b are the mean target and background intensities. In addition to the intensity energy, we add a distance penalty: each pixel is penalized based on its distance from the expected position of the object, computed using the range information obtained from the camera. The cost for each pixel is then

    F_p(w_p) = (I_p − µ_t)² + β(D_p − µ_d)²,  if w_p = 1,
               (I_p − µ_b)²,                  if w_p = 0,    (6)

where D_p denotes the 3D coordinates of pixel p and µ_d is the expected position of the target. The data term is realized in the
graph by connecting each pixel to the target and background terminal nodes with edge weights F_p(1) and F_p(0). For the smoothness term we use the Potts interaction model [2]. Malcolm et al. [10] use a distance penalty based on distances in the image plane. As mentioned before, two objects that are far from each other in the scene can appear close in the image plane. As seen in Figures 1(c) and 1(d), a distance penalty based on the image plane cannot accurately segment the box.
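As an illustration of this construction, the sketch below builds the graph with the data term of Equation (6) and a Potts smoothness term; the PyMaxflow library, the parameter values, and the function name are assumptions made for the example, not the authors' implementation.

```python
import numpy as np
import maxflow  # PyMaxflow; an assumed implementation choice, not named in the paper

def segment_with_range_penalty(I, points, mu_t, mu_b, mu_d, beta=0.05, lam=1.0):
    """Graph-cut segmentation with the range-based distance penalty of Eq. (6).

    I          : (H, W) intensity image
    points     : (H, W, 3) per-pixel 3D coordinates from the TOF sensor
    mu_t, mu_b : mean target / background intensity
    mu_d       : expected 3D position of the target
    beta, lam  : illustrative weights for the distance penalty and Potts term
    """
    # Per-pixel data costs: F_p(1) includes the 3D distance penalty, F_p(0) does not.
    F1 = (I - mu_t) ** 2 + beta * np.sum((points - mu_d) ** 2, axis=2)
    F0 = (I - mu_b) ** 2

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(I.shape)
    # Potts smoothness term on the pixel grid (uniform weight lam).
    g.add_grid_edges(nodes, weights=lam)
    # Terminal edges realize the data term: cutting a pixel away from a
    # terminal costs the F_p of the label it ends up with.
    g.add_grid_tedges(nodes, F1, F0)
    g.maxflow()
    # Boolean (H, W) map; under this construction the target lies on the
    # sink side of the cut.
    return g.get_grid_segments(nodes)
```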
Fig. 2. Tracking objects of similar intensity using range information. (a) Frame 1. (b) Frame 200. (c), (d) Segmented point clouds.
4.2. Distance Penalty in the Smoothness Term

This section proposes an image segmentation scheme that incorporates the range information in the smoothness term. Figures 4(a) and 4(b) show the intensity and range information as obtained from the TOF sensor. The intensity image alone cannot easily distinguish the robot. By using the range information in the smoothness term, which measures the interaction between neighboring pixels, an energy is proposed that segments contiguous regions in the image. In terms of graph cut, the energy is realized by connecting all pixels to the background terminal node with edge weight 1/n, where n is the total number of pixels in the image. One pixel belonging to the target is manually selected and connected to the target node with an edge weight of 1. Each pixel is connected to its 8 neighbors, selected on the regular 2-dimensional image grid; however, the edge weight between two neighboring pixels {p, q} ∈ N is given by

    V_{p,q} = exp(−D(p, q) / (2σ_s²)),    (7)

where D(p, q) is the distance between pixels p and q as obtained from the range data. To incorporate the intensity information as well, the weights can be modified to

    V_{p,q} = exp(−D(p, q) / (2σ_s²)) · exp(−(I_p − I_q)² / (2σ_i²)).    (8)
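A minimal sketch of this seeded construction follows, with pairwise weights from Equation (8); as before, PyMaxflow, the explicit neighbor loop (written for clarity rather than speed), and the parameter values are illustrative assumptions.

```python
import numpy as np
import maxflow  # same assumed min-cut library as in the previous sketch

def segment_from_seed(I, points, seed, sigma_s=0.05, sigma_i=10.0):
    """Seeded graph-cut segmentation with the pairwise weights of Eq. (8).

    I      : (H, W) intensity image
    points : (H, W, 3) per-pixel 3D coordinates from the TOF sensor
    seed   : (row, col) of the manually selected target pixel
    sigma_s, sigma_i : illustrative values; not reported in the paper
    """
    H, W = I.shape
    n = H * W
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((H, W))

    # 8-neighborhood on the image grid; each undirected edge added once.
    for dr, dc in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        for r in range(H):
            for c in range(W):
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < H and 0 <= c2 < W:
                    d = np.linalg.norm(points[r, c] - points[r2, c2])
                    w = (np.exp(-d / (2 * sigma_s ** 2)) *
                         np.exp(-(I[r, c] - I[r2, c2]) ** 2 / (2 * sigma_i ** 2)))
                    g.add_edge(int(nodes[r, c]), int(nodes[r2, c2]), w, w)

    # Every pixel is weakly tied to the background terminal (weight 1/n) ...
    g.add_grid_tedges(nodes, np.zeros((H, W)), np.full((H, W), 1.0 / n))
    # ... and the seed is strongly tied to the target terminal (weight 1).
    g.add_tedge(int(nodes[seed]), 1.0, 0.0)

    g.maxflow()
    # Here the background terminal is the sink, so the target is the complement.
    return ~g.get_grid_segments(nodes)
```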
Fig. 3. Tracking a box occluded by an object of similar intensity. (a) Frame 100. (b) Frame 340. (c) Frame 530.
5. EXPERIMENTS

This section presents tracking and segmentation results obtained by incorporating the range information as described in Sections 4.1 and 4.2.

Experiment 1: The first experiment deals with tracking objects in a scenario similar to that shown in Figure 2, where the target objects have similar intensities and the background intensity is also similar to that of the targets. The locations µ_d of the targets in the first frame are known beforehand; in subsequent frames, the location estimated in the previous frame is used. The graph-cut framework of Section 4.1 is used to perform tracking, with the range information incorporated in the data term. Similarly, in Figure 3, the box being tracked is occluded by another object of similar intensity, but the proposed method accurately tracks the occluded box.
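Concretely, this procedure amounts to re-running the segmentation each frame and re-estimating µ_d from the previous result; a minimal sketch, reusing the hypothetical segment_with_range_penalty from the earlier example:

```python
import numpy as np

def track(frames, mu_t, mu_b, mu_d0, beta=0.05):
    """Frame-to-frame tracking as in Experiment 1 (a sketch, not the authors' code).

    frames : iterable of (I, points) pairs from the TOF sensor
    mu_d0  : known 3D position of the target in the first frame
    """
    mu_d = np.asarray(mu_d0, dtype=float)
    masks = []
    for I, points in frames:
        # Segment with the data term of Eq. (6), using the current estimate of mu_d.
        mask = segment_with_range_penalty(I, points, mu_t, mu_b, mu_d, beta)
        # The mean 3D position of the segmented pixels seeds the next frame.
        if mask.any():
            mu_d = points[mask].mean(axis=0)
        masks.append(mask)
    return masks
```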
Fig. 4. Point cloud segmentation of the robot. (a) Intensity information. (b) Range information. (c) Segmentation of the object from background. (d) Segmented object.
Experiment 2: Next, we test the segmentation results obtained by incorporating the range information into the smoothness term. The graph structure follows the method proposed in Section 4.2: one pixel of the target object is manually selected and connected to the target terminal node, and the edge weights are given by Equation (7). Figure 4(a) shows the intensity image of a robot. The intensity values of the robot are quite similar to the background, which makes segmentation based on intensity information alone difficult. The framework of Section 4.1 also cannot be used here, because the robot has a non-convex shape. However, following Section 4.2, the robot is accurately segmented from the background (Figure 4(d)). Similarly, Figure 5 shows point clouds of some sample segmented objects, demonstrating that objects connected by thin regions can be accurately segmented.

Fig. 5. Sample segmented objects. (a) Segmented hammer. (b) Segmented tripod. (c) Segmented wrench.

Experiment 3: The final experiment consists of a person-tracking scenario. Figure 6 shows the tracking results, where the person's intensity is similar to the background.

Fig. 6. Tracking of the person, panels (a)–(h).

6. CONCLUSION
This paper proposed two methods to incorporate the range information obtained from a TOF imaging sensor into energy-based segmentation and visual tracking algorithms. Range information was incorporated into the data term and the smoothness term of the energy, with energy minimization carried out using graph cut. The proposed methods enable segmentation and tracking when intensity-based information alone provides insufficient discrimination.

7. REFERENCES

[1] A. Bevilacqua, L. D. Stefano, and P. Azzari. People tracking using a time-of-flight depth sensor. In IEEE International Conference on Video and Signal Based Surveillance, page 89, 2006.
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE TPAMI, pages 1222–1239, 2001.

[3] T. Chan and L. Vese. Active contours without edges. IEEE Trans. Image Proc., 10(2):266–277, 2001.

[4] R. Gonsalves and J. Teizer. Human motion analysis using 3D range imaging technology. In Int. Symp. on Automation and Robotics in Construction, 2009.
[5] D. W. Hansen, M. S. Hansen, M. Kirschmeyer, and R. Larsen. Cluster tracking with time-of-flight cameras. In IEEE CVPR Workshop, pages 1–6, 2008.

[6] M. Holte, T. Moeslund, and P. Fihl. Fusion of range and intensity information for view invariant gesture recognition. In IEEE CVPR Workshop, pages 1–7, 2008.

[7] O. Kahler, E. Rodner, and J. Denzler. On fusion of range and intensity information using graph-cut for planar patch segmentation. Int. J. of Intelligent Systems Technologies and Applications, 5:365–373, 2008.

[8] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE TPAMI, 26(2):147–159, 2004.

[9] S. Li. Markov random field models in computer vision. Lecture Notes in Computer Science, 801:361–370, 1994.

[10] J. Malcolm, Y. Rathi, and A. Tannenbaum. Tracking through clutter using graph cuts. In BMVC, 2007.

[11] X. Zeng, W. Chen, and Q. Peng. Efficiently solving the piecewise constant Mumford-Shah model using graph cuts. Technical report, Dept. of Computer Science, Zhejiang University, PR China, 2006.

[12] Y. Zhu, B. Dariush, and K. Fujimura. Controlled human pose estimation from depth image streams. In CVPR, pages 1–8, 2008.