Range imaging technology: new developments and applications for people identification and tracking Timo Kahlmann*, Fabio Remondino**, Sébastien Guillaume Institute of Geodesy and Photogrammetry, ETH Zurich, 8093 Zurich, SWITZERLAND *Chair of Geodetic Metrology and Engineering Geodesy -
[email protected] http://www.geometh.ethz.ch **Chair of Photogrammetry and Remote Sensing -
[email protected] http://www.photogrammetry.ethz.ch ABSTRACT Range Imaging (RIM) is a new suitable choice for measurement and modeling in many different applications. RIM is a fusion of two different technologies. According to the terminology, it integrates distance measurement as well as imaging aspects. The distance measurement principle is dominated by the time-of-flight principle while the imaging array (e.g. CMOS sensor) enables each pixel to store also the distance towards the corresponding object point. Due to the technology’s relatively new appearance on the market, with a few different realizations, the knowledge of its capabilities is very low. In this paper we present our investigations on the range imaging camera SwissRangerTM (realized by the Swiss Center for Electronics and Microtechnology - CSEM). Different calibration procedures are performed, including a photogrammetric camera calibration and a distance system calibration with respect to the reflectivity and the distance itself. Furthermore we report about measurement applications in the field of surveillance and biometrics. In particular, range imaging data of moving people are analyzed, to identify humans, detect their movements and recover 3D trajectories.
Keywords: range imaging, time-of-flight, CMOS, CCD, calibration, tracking
1. INTRODUCTION Range Imaging (RIM) is a fusion of two different technologies: as the terminology indicates, it integrates distance measurement as well as imaging aspects. In recent years range imaging has undergone tremendous growth, even if there are still some limitations, notably that current sensor dimensions are limited to a few thousand pixels. Providing image and range data at the same time with a typical frame rate of up to 55 Hz, RIM cameras are used in different fields, but their main employment is in the surveillance domain, in particular to detect movements. The tracking of objects using video sequences is still a major topic of investigation. It is generally performed with monocular [1] or multi-camera videos [2, 3] by means of probabilistic [4] or deterministic [5] approaches. Markerless approaches are still not as reliable and precise as techniques based on markers attached to the tracked object. The tracking of moving objects using RIM sensors is a promising and challenging alternative and has many advantages over the already developed approaches: range information with related intensity values is available in real time. At the moment, the raw data of RIM cameras do not reach the levels of accuracy generally required in robotics or industrial applications. Therefore, a calibration of the sensor is strongly required to recover more accurate metric information. In this contribution, after a review of the range imaging technology and the available products on the market (Section 2), the performed sensor calibrations are reported (Section 3), while the prototype realized to identify moving objects, track them and derive trajectories and 3D movements is presented in Section 4.
2. RANGE IMAGING This part introduces the basics of RIM technology. As the terminology indicates, a distance measurement is performed together with an image acquisition. Various electro-optical methods are known to measure a distance: interferometry, triangulation and Time-of-Flight (ToF) are the basic principles. In the case of range imaging cameras, two basic variations of the ToF principle are mostly implemented. The EPF Lausanne has introduced SPAD arrays which measure the distance by means of the direct measurement of the runtime of a traveled light pulse [6]. On the other hand, the RIM cameras investigated at the Institute of Geodesy and Photogrammetry at ETH Zurich are based on the indirect ToF principle. Amplitude-modulated light is emitted by means of a number of LEDs, travels to the object, is reflected, and is finally demodulated by means of a specialized CMOS/CCD pixel. Demodulation is the reconstruction of the received signal. In the case of a sinusoidal signal, three parameters have to be calculated: the intensity B, the amplitude A, and the phase φ. Four sampling points τ0, τ1, τ2 and τ3 (intensity measurements, each shifted by 90°) are triggered with respect to the emitted wave. Thus the phase shift φ can be calculated as:

$$\varphi = \arctan\left(\frac{c(\tau_3) - c(\tau_1)}{c(\tau_0) - c(\tau_2)}\right) \qquad (1)$$
The phase shift φ is directly proportional to the distance D the light has traveled:

$$D = \frac{\lambda_{mod}}{2} \cdot \frac{\varphi}{2\pi} \quad \text{with} \quad D \leq \frac{\lambda_{mod}}{2} \qquad (2)$$
where λmod is the modulation wavelength, which is about 15 m (default) in most RIM cameras. This gives a non-ambiguous distance of 7.5 m. Some of the newer cameras also allow longer wavelengths, but this decreases the accuracy of the measured distances. The distance measurement principles shown above are very well known; much more sophisticated and advanced versions can be outlined as well (e.g. [7] and [8]).
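To make the demodulation concrete, the following sketch (Python/NumPy; our own illustration, not the camera's API) shows how equations (1) and (2) yield intensity, amplitude, phase and distance from the four per-pixel samples c(τ0)...c(τ3); the modulation frequency of 20 MHz is an assumption corresponding to the 15 m wavelength mentioned above.

```python
import numpy as np

C_LIGHT = 299_792_458.0       # speed of light [m/s]
F_MOD = 20e6                  # assumed modulation frequency [Hz] -> lambda_mod ~ 15 m
LAMBDA_MOD = C_LIGHT / F_MOD  # modulation wavelength [m]

def demodulate(c0, c1, c2, c3):
    """Recover intensity B, amplitude A, phase phi and distance D per pixel
    from the four samples c(tau_0..tau_3), each shifted by 90 degrees."""
    c0, c1, c2, c3 = (np.asarray(c, dtype=float) for c in (c0, c1, c2, c3))
    phase = np.arctan2(c3 - c1, c0 - c2) % (2.0 * np.pi)  # phase shift phi, Eq. (1)
    amplitude = 0.5 * np.hypot(c3 - c1, c0 - c2)          # modulation amplitude A
    intensity = 0.25 * (c0 + c1 + c2 + c3)                # offset / intensity B
    distance = 0.5 * LAMBDA_MOD * phase / (2.0 * np.pi)   # Eq. (2), non-ambiguous up to 7.5 m
    return intensity, amplitude, phase, distance
```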
Figure 1: RIM cameras SR-2 (a) and SR-3000 (b) produced by CSEM. A typical 2D range (c) and intensity (d) image and the related 3D data (e).
In order to realize range imaging sensors, one of the main challenges is to merge the distance measurement principles shown above with imaging technology. RIM cameras integrate a CMOS sensor whose pixels are capable of demodulation; thus every pixel is a distance measurement system on its own. One emitting system, consisting of several LEDs, uniformly illuminates the scene within the field of view with amplitude-modulated NIR light. The light is reflected by the scene and mapped by means of optics onto the demodulation pixel array. Therefore every single pixel measures the distance towards the point in the field of view it is related to. From these distances and simple geometrical dependencies, 3D coordinates can be calculated. The Swiss Center for Electronics and Microtechnology (CSEM) in Zurich (Switzerland) has manufactured two high-resolution RIM cameras: the SR-2 and the SR-3000 (see Fig. 1). The cameras have a resolution of about 20'000 pixels (SR-2) and 25'000 pixels (SR-3000), with a distance precision of about 5 to 10 mm for pixels near the sensor center. The follow-up version SR-3000 is even able to suppress background radiation to avoid saturation effects, which enables it to be used in outdoor environments. Its second main advantage is the intelligent cooling, which reduces temperature effects. The cameras are described in closer detail in [9] and [10], respectively. Other realized RIM cameras are presented in [11, 12, 13, 14], for example.
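As a rough illustration of these "simple geometrical dependencies", the sketch below (our own, hedged) converts a range image into per-pixel Cartesian coordinates with a pinhole model; the interior orientation values (focal length in pixels, principal point) are placeholders, not the actual SR-2/SR-3000 specifications.

```python
import numpy as np

def range_to_xyz(dist, focal_px=280.0, cx=80.0, cy=62.0):
    """Convert a range image (radial distances along each pixel's viewing ray)
    into per-pixel 3D Cartesian coordinates in the camera frame.
    focal_px, cx, cy are placeholder interior orientation values in pixels."""
    h, w = dist.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # unit viewing ray of each pixel (pinhole model, z pointing forward)
    rays = np.dstack(((u - cx) / focal_px, (v - cy) / focal_px, np.ones((h, w))))
    rays /= np.linalg.norm(rays, axis=2, keepdims=True)
    return rays * dist[..., None]   # X, Y, Z per pixel, same unit as dist
```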
3. CALIBRATION In the laboratories of the Institute of Geodesy and Photogrammetry (IGP), ETH Zurich, different properties of the SwissRangerTM sensor were investigated (e.g. [15] and [16]): the influence of the internal and external temperature on the distance measurements, some systematic effects in the recovered 3D point clouds, the distance accuracy, and the photogrammetric sensor parameters. In the next sections, the most salient calibration results are reported. If not indicated otherwise, the (internal, decimal) integration time was 100. 3.1 Distance Measurement The basic element of RIM cameras is their distance measurement ability. Therefore a closer look at the precision that can be expected from these sensors is needed. Figure 2 outlines the distance measurement precision (standard deviation for a single distance measurement) for the whole sensor array of the RIM cameras SR-2 (left) and SR-3000 (right).
Figure 2: Precision measurements for the SR-2 (left) and the SR-3000 (right). The standard deviation for a single distance measurement with respect to the used sensor setup is shown.
For the measurements the cameras were set up perpendicular to a plain white wall. The perpendicular distance was 2.453 m (SR-2) and 2.463 m (SR-3000). Since lower optical energies (e.g. due to the LED radiation characteristics) cause lower precision, this result could be foreseen. In the center of the sensor the theoretical limit is nearly reached (compare [9]). As Figure 3 indicates, a warm-up period has to be taken into account. This is due to internal heating effects and the sensitivity of semiconductors/LEDs to temperature. Figure 3 shows the distance measurement variations of the SR-2 (left) and the SR-3000 (right) in the first minutes of acquisition. The intelligent cooling of the SR-3000 reduces the effect, and thus the warm-up period is reduced from about 12 to 20 minutes with the SR-2 down to about 6 minutes with the SR-3000.
[Plot: measured distance [m] and intensity [16 bit = 1.0] versus time [min]; raw and filtered distance and intensity curves.]
Figure 3: Warm-up sequence of the SR-2 distance offset (left) in comparison with the further developed SR-3000 raw and filtered distance measurements (right). Whereas the SR-2 needs about 20 minutes to reach temperature stability, the SR-3000 needs only about 6 minutes. The range in which the distance varies is reduced from about 10 cm to 5 cm.
The next aspect that was investigated is the distance measurement with respect to the integration time and the distance itself. In [17] it is already reported that odd harmonics within the modulation of the emitted light can cause significant cyclic deviations from the optimal linearity of the distance measurement. At the IGP a 50 m automated interferometric calibration track with a precision of several microns for the distance measurement is available; its accuracy is about 1 mm. The results of the calibration can be seen in Figure 4. Additive look-up-table (LUT) values are shown for both sensors (SR-2 on the left, SR-3000 in the middle). The predicted cyclic deviations can clearly be seen. With an amplitude of about 6 cm they definitely exceed the sensor's precision. Therefore, if higher accuracies are needed, these deviations can simply be calibrated by means of a LUT.
Figure 4: Additive look-up-table for the calibration of the distance measurements of the SR-2 (left) and the SR-3000 (middle). Fixed pattern noise (FPN) for the SR-2 at a perpendicular distance towards a wall at 2.453 m (right).
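As an illustration of how such an additive LUT could be applied in practice, the following sketch interpolates linearly between calibration support points and optionally subtracts a per-pixel offset map (the fixed-pattern noise discussed below); the support distances and correction values are invented for illustration and are not the measured calibration results.

```python
import numpy as np

# Hypothetical LUT: support distances [m] and additive corrections [m]
LUT_DIST = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
LUT_CORR = np.array([0.01, -0.03, 0.05, -0.02, 0.04, -0.05, 0.02, 0.00])

def correct_distance(measured, fpn_offset=None):
    """Apply the additive look-up-table correction (linear interpolation between
    support points) and, optionally, a per-pixel fixed-pattern offset map."""
    corrected = measured + np.interp(measured, LUT_DIST, LUT_CORR)
    if fpn_offset is not None:      # per-pixel offset, same shape as the range image
        corrected -= fpn_offset
    return corrected
```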
Because the distance calibration procedure shown above is very time consuming, a separate calibration of each pixel cannot be realized. Thus only the offset of every pixel is considered. Figure 4 (right) shows this offset (median filtered), also known as fixed-pattern noise (FPN), for the SR-2. The variation over the whole sensor array reaches about 30 cm at the edges of the field of view. 3.2 Photogrammetric camera calibration An accurate camera calibration is a necessary prerequisite for the extraction of precise and reliable 3D metric information from image data. A camera is considered calibrated if the focal length, principal point location and lens distortion parameters are known. In many applications, especially in computer vision, only the focal length is recovered, while for precise photogrammetric measurements all the calibration parameters are generally employed. The choice of the camera model is often related to the final application and the required accuracy. Projective or perspective camera models can be used, but generally a precise sensor orientation is performed by means of a least-squares bundle adjustment with additional parameters.
Figure 5: The final planar object with 25 infrared LEDs used to calibrate the RIM camera (left); the sensor, connected via USB to a laptop, is also visible in the foreground. The recovered camera poses during the calibration procedure (middle and right).
For the geometric calibration of the RIM camera SR-2, we performed many tests using different testfields available in our laboratories [16]. In the end, the calibration was performed by means of a planar testfield whose targets are represented by NIR LEDs. These active targets can be automatically extracted from the intensity image provided by the camera due to their high intensities in the image. They are extracted automatically by means of a threshold and a centroid operator and used as input measurements for the calibration procedure. The calibration network included 35 stations, well distributed, in particular in the depth direction since we used a planar testfield, and with some rotated images. After the self-calibrating bundle adjustment, we achieved a theoretical precision for the computed object points of 1.4 mm, 1.3 mm and 1.8 mm in the x, y and depth directions, respectively. This represents a relative accuracy of approximately 1:270 in the testfield plane and 1:120 in the depth direction. Among the additional parameters, only the first two terms of the radial distortion turned out to be statistically significant. The calibration network is shown in Figure 5. 3.3 Considerations All effects and properties shown above are highly systematic and more or less smooth. Therefore a calibration can be performed, resulting in more precise and accurate measured data. Better data also makes RIM sensors more suitable for many applications that need precise results. The area of laser scanning can especially be pointed out here, where sub-centimeter accuracy for derived 3D coordinates is normally aimed at. Thus RIM cameras have the potential to replace common close-range laser scanners, in particular due to the real-time RIM process. Their enormous speed of acquisition as well as their potentially low costs are good criteria for a significant presence in future 3D measurement systems.
4. DETECTION AND TRACKING OF MOVING OBJECTS IN RIM DATA 4.1 Introduction The increasing use and availability of video streams have increased the research community's attention towards the processing of these data for scene reconstruction or surveillance applications. In particular, the automatic surveillance of traffic areas, the navigation of robots, the surveillance of people, and motion capture for movies or video games are applications where detection and tracking procedures are nowadays reliable and well established. But the arrival of new imaging sensors such as RIM cameras opens new possibilities for object tracking. Although the resolution of this kind of sensor is still very far from that of current digital video cameras, it is very practical to detect and follow moving objects in these range images and determine the objects' characteristics in metric 3D space. In [18] tracking is defined as the estimation of the state of a moving object based on remote measurements. From this definition, we have three keywords that we have to consider in a given application:
• State: it is the final goal of the tracking process and describes the object in mathematical language. For example, if we want to track a tennis ball, the state can be a point with its velocity vector; if we want to track the contour of a mouth, the state can be the parameters of a B-Spline.
• Moving object: it is the object that we want to track, which moves in space with certain properties. For example, if we want to track a tennis ball during its movement, we know that the ball is approximately accelerated only by the gravitational field and the force that comes from air resistance. In other words, if we know its position at time t, we can predict it at time t + 1. To increase the robustness and the efficiency of the tracking process we use a system model that describes how the state changes over time.
• Measurements: if we want to track an object, we should recognize it in the images to improve the predicted state. The keyword for object recognition in the images is segmentation.
There are many algorithms to capture the motion of objects, but there is no universal tracking procedure for all applications and all kinds of objects. Some well-known methods can be categorized as:
• Direct Matching: the tracking in an image sequence is performed with a segmentation of the images to measure the object's state directly. This approach does not take into account a system model and is not successful when the object is occluded.
• Recursive Bayesian Filter: it is a probabilistic approach that uses Bayes' rule to estimate the a posteriori probability distribution of the state from the a priori "predicted" distribution and the measurements. The approach cannot be solved easily for all kinds of probability distributions. Nevertheless, there are solutions under some assumptions:
- If the system model and the observation equations are linear and all error sources are additive Gaussian noise with known variance-covariance matrix, the well-known analytical solution is the Kalman Filter.
- For the more general case, where there are no restrictions on the system model, the observation equations and the error source distributions, a sample-based version of the Recursive Bayesian Filter is the Condensation algorithm.
In our work we used a Recursive Bayesian Filter approach to track moving people in the RIM data, in particular the Condensation algorithm [19], as it is a more general solution for the tracking process, flexible, relatively simple, and supported by many promising examples in the literature [21].
4.2 Threshold filter One of the simplest ways to perform object tracking is to subtract the static background. Figure 6 depicts tracking by means of background subtraction and thresholding. The remaining "moving" pixels can be grouped by means of simple clustering, and the clusters can even be separated from each other. For the further processing, in order to avoid true 3D algorithms and their higher complexity, 2.5D range images were used.
Figure 6: Tracking by means of background subtraction and thresholding. The 3D shape and trajectory of the person can be easily outlined within the background.
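A minimal sketch of such a threshold filter is given below (our own illustration, using scipy.ndimage for the clustering); the background model is the per-pixel mean range, and the threshold and minimum cluster size are placeholder values.

```python
import numpy as np
from scipy import ndimage

def detect_moving(range_img, background, threshold=0.15, min_pixels=30):
    """Return a label image of moving-object clusters in a range frame.
    background: per-pixel mean range of the static scene [m]
    threshold:  minimum range deviation [m] to flag a pixel as moving."""
    moving = np.abs(range_img - background) > threshold
    labels, n = ndimage.label(moving)          # connected-component clustering
    # discard tiny clusters that are most likely caused by noise
    sizes = ndimage.sum(moving, labels, index=np.arange(1, n + 1))
    for i, size in enumerate(sizes, start=1):
        if size < min_pixels:
            labels[labels == i] = 0
    return labels
```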
4.3 The Condensation algorithm 4.3.1 Recursive Bayesian Filter When we consider a statistical, probabilistic tracking approach, an object is no longer represented by a single object state but by its probability distribution. With a Recursive Bayesian Filter, the goal is to estimate the a posteriori probability density with Bayes' theorem:

$$p(x_t \mid Z_t) = \frac{p(x_t \mid Z_{t-1})\, p(z_t \mid x_t)}{p(z_t \mid Z_{t-1})} = k\, p(x_t \mid Z_{t-1})\, p(z_t \mid x_t) \quad \text{(with independent measurements)} \qquad (3)$$

So, from a known density $p(x_{t-1} \mid Z_{t-1})$ we compute the propagated density $p(x_t \mid Z_{t-1})$ of the state $x$ with a dynamic model, and finally we compute the updated density $p(x_t \mid Z_t)$ with the observations $Z_t$:

$$p(x_{t-1} \mid Z_{t-1}) \xrightarrow{\text{propagate}} p(x_t \mid Z_{t-1}) \xrightarrow{\text{update}} p(x_t \mid Z_t)$$
4.3.2 Condensation algorithm To solve the Recursive Bayesian Filter in the general case, it is necessary to discretize the probability distributions and to solve (3) with numerical methods. It would be possible to approximate the densities at regular intervals and to compute (3) with numerical integration methods and simple multiplications. However, in practical problems we work with a large multi-dimensional state $x_t$, so this method would be very time consuming. For this reason the factored sampling technique is used to discretize and multiply the probability densities. The probability densities are represented by a weighted sample ("particle") set $S = \{(s_j, \pi_j) \mid j = 1 \ldots n\}$. Each sample consists of an element $s$ which represents the state vector and a corresponding weighting factor $\pi$; the size of the sample set is $n$. The first step is to propagate the initial particle set stochastically with the system model. The propagation is computed individually for all particles (Fig. 7a), with $f_{t-1}(x_{t-1}, w_{t-1})$ as the system model and $w_{t-1}$ as the vector of random variables with a known probability distribution that accounts for the model uncertainties (i.e. process noise). Afterwards, the second step is the updating (Fig. 7b). To compute the a posteriori density $p(x_t \mid Z_t)$, Bayes' theorem is used. Analytically this requires the multiplication of two probability densities: $p(x_t \mid Z_{t-1})$ and $p(z_t \mid x_t)$. We assume that one of these densities is given as a set of samples while the other can be evaluated directly with its analytic function (a Gaussian distribution, for example). We know that our propagated particle set is formed randomly with the probability $p(x_t \mid Z_{t-1})$. If we choose a sample from the sample set $s_t$ with probability $\pi^{(j)} = p(z_t \mid x_t) = p(z_t \mid s_t^{(j)})$, we obtain a sample set whose distribution tends to $p(x_t \mid Z_t)$ as $n \to \infty$. The probability density $p(z_t \mid s_t)$ can be calculated with the observation model $z_t = h_t(x_t, v_t)$, where $v_t$ is a vector of random variables that describes the observation noise. The last step is to compute the estimated state of the particle set:

$$E[S] = \sum_{j=1}^{n} s^{(j)} \pi^{(j)}$$
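The following skeleton is our own paraphrase of one Condensation iteration, not the authors' implementation: resample according to the weights, propagate each particle through the stochastic system model, re-weight with the observation likelihood, and form the weighted mean state.

```python
import numpy as np

def condensation_step(samples, weights, propagate, likelihood, rng=np.random):
    """One iteration of the Condensation (factored sampling) filter.
    samples:    (n, d) array of particle states s^(j)
    weights:    (n,) normalized weights pi^(j)
    propagate:  function s -> s', stochastic system model f(s, w)
    likelihood: function s' -> p(z_t | s'), the observation model"""
    n = len(samples)
    # 1) resample: draw n particles with probability proportional to their weights
    idx = rng.choice(n, size=n, p=weights)
    # 2) propagate each resampled particle through the (noisy) dynamic model
    propagated = np.array([propagate(samples[j]) for j in idx])
    # 3) update: weight each particle with the observation likelihood
    new_weights = np.array([likelihood(s) for s in propagated])
    new_weights = new_weights / (new_weights.sum() + 1e-12)
    # 4) estimated state: weighted mean E[S] = sum_j s^(j) * pi^(j)
    estimate = (propagated * new_weights[:, None]).sum(axis=0)
    return propagated, new_weights, estimate
```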
Figure 7: Individual propagation of particles (a) and updating step using the probability density of the observations (b).
4.3.3 The implemented tracking process The developed tracker [20] is composed of:
• Initialization phase
• The state vector
• The transition function
• The measurement vector
• The observation model
The choice of these elements is very important and critical for a good implementation of the tracker. There are many possibilities and degrees of freedom, but a general rule is to take the simplest possible model for the wanted result. It is not necessary to model the evolution of a skeleton if the goal is to detect and to follow the whole person; conversely, it is impossible to track the movement of all the limbs with only a simple box model. 4.3.3.1 The initialization procedure In this initial phase we have to define (1) the static background, (2) the target reduced range histogram and (3) the initial particle set that defines our moving object. These three tasks can be done manually or (semi-)automatically. A fully automatic initialization process is very difficult to develop for all possible cases, since each object has different properties in its dynamics as well as in its measurement model. The background computation (and subsequent subtraction) is performed by computing the mean range and the variance for each pixel of the available image sequence. Afterwards it is necessary to compute the reduced range histogram of the object that we want to track. To compute it, we take an image that contains the wanted object; then, for a given box and the histogram properties (number of bins, min and max range - see section 4.3.3.4), it is possible to calculate the reduced range histogram. Finally the initial particle set is generated randomly (uniformly) within a rectangle that surrounds the non-background pixels (with a certain tolerance). The velocity and the acceleration are set to 0 and the box size must be given beforehand. After that, the Bhattacharyya coefficient (see section 4.3.3.4.1) is calculated for each initial particle, and if more than 20 % of the particles have a coefficient > 0.8, the tracking can begin [21].
Figure 8: Automatic computation of the initial particle set (1028 particles). At the beginning, the particles are distributed between the two persons. Afterwards, most of the particles are concentrated on the left person, because the person on the right holds a box in his hands, which generates particles with smaller Bhattacharyya coefficients (see section 4.3.3.4.1). As can be seen, it should be possible to track more than one object with one tracker; the problem is to differentiate the different objects in the particle set, which can be done using a clustering algorithm.
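A sketch of this initialization rule follows; the helper `coeff_of` stands for the Bhattacharyya coefficient computation defined in section 4.3.3.4.1 and is assumed to exist, and the particle layout mirrors the 8-element state vector introduced in section 4.3.3.2.

```python
import numpy as np

def init_particles(rect, box_size, n=1028, rng=np.random):
    """Generate the initial particle set uniformly inside the rectangle that
    surrounds the non-background pixels. rect = (x_min, x_max, y_min, y_max)."""
    x = rng.uniform(rect[0], rect[1], n)
    y = rng.uniform(rect[2], rect[3], n)
    zeros = np.zeros(n)                   # velocities and accelerations start at 0
    lx = np.full(n, box_size[0])          # half-box dimensions are given a priori
    ly = np.full(n, box_size[1])
    return np.column_stack([x, y, zeros, zeros, zeros, zeros, lx, ly])

def ready_to_track(particles, coeff_of):
    """Start tracking once more than 20 % of the particles have a
    Bhattacharyya coefficient > 0.8 (coeff_of(s) returns that coefficient)."""
    coeffs = np.array([coeff_of(s) for s in particles])
    return np.mean(coeffs > 0.8) > 0.2
```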
4.3.3.2 The state vector As we want to identify a 2D bounding box around a moving object throughout the frames, the state of an object in a range image can be described with (1) a state in the 3D Cartesian space or (2) a state in the 2D image space. From the object point of view and considering its dynamics, the appropriate approach would be the 3D state. Without going into details, this solution was tested and demonstrated that the particle number must be very high to obtain a satisfactory result. Therefore we defined the state vector in the 2D image space (RIM). This solution makes the measurement model and the update stage easier, but takes the dynamic model away from reality, which can have an influence on the reliability of the final results. Formally the state of a person at time t is described with:

$$x_t = \left(x_t,\; y_t,\; v_{x_t},\; v_{y_t},\; a_{x_t},\; a_{y_t},\; lx_t,\; ly_t\right)^T$$

where:
$x_t$, $y_t$ = the center of the box in $e_x$, $e_y$ [mm]
$v_{x_t}$, $v_{y_t}$ = the velocity of the box in $e_x$, $e_y$ [mm/s]
$a_{x_t}$, $a_{y_t}$ = the acceleration of the box in $e_x$, $e_y$ [mm/s²]
$lx_t$, $ly_t$ = the half-box dimensions in reality in the projected directions $e_x$, $e_y$ [m]
Figure 9: State vector represented on a range image (left) and computed in the 3D Cartesian space (right).
4.3.3.3 The transition function To predict the state vector in the next frame, it is necessary to define a transition function. This function should predict the state as well as possible until the next update, i.e. the next image. With a high acquisition frequency, the transition function can be simpler than with a low acquisition frequency. In our work, the image acquisition frequency of the given SwissRanger data was 5 Hz. The transition function assumes that the acceleration between two images remains constant, yielding:

$$x_t = f_{t-1}(x_{t-1}, w_{t-1}) = \begin{pmatrix} 1 & 0 & \Delta t & 0 & \tfrac{1}{2}\Delta t^2 & 0 & 0 & 0 \\ 0 & 1 & 0 & \Delta t & 0 & \tfrac{1}{2}\Delta t^2 & 0 & 0 \\ 0 & 0 & 1 & 0 & \Delta t & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & \Delta t & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{t-1} \\ y_{t-1} \\ v_{x_{t-1}} \\ v_{y_{t-1}} \\ a_{x_{t-1}} \\ a_{y_{t-1}} \\ lx_{t-1} \\ ly_{t-1} \end{pmatrix} + \begin{pmatrix} w_{x,t-1} \\ w_{y,t-1} \\ w_{v_x,t-1} \\ w_{v_y,t-1} \\ w_{a_x,t-1} \\ w_{a_y,t-1} \\ w_{lx,t-1} \\ w_{ly,t-1} \end{pmatrix}$$

with $w_{\cdot,t-1} \sim N\!\left(0, \sigma^2_{f_\cdot,t-1}\right)$, where we assume that all $w_{\cdot,t-1}$ are independent.
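A sketch of this constant-acceleration transition (our own transcription of the matrix above into code) is given below; the process-noise standard deviations are placeholders, as the actually used values are not reported here.

```python
import numpy as np

def transition(x_prev, dt=0.2, noise_std=None, rng=np.random):
    """Constant-acceleration system model for the 8-element state
    (x, y, vx, vy, ax, ay, lx, ly); dt = 0.2 s corresponds to the 5 Hz frame rate."""
    F = np.eye(8)
    F[0, 2] = F[1, 3] = F[2, 4] = F[3, 5] = dt
    F[0, 4] = F[1, 5] = 0.5 * dt**2
    if noise_std is None:
        # placeholder process-noise standard deviations for each state element
        noise_std = np.array([20.0, 20.0, 50.0, 50.0, 100.0, 100.0, 0.01, 0.01])
    w = rng.normal(0.0, noise_std)     # independent Gaussian process noise
    return F @ x_prev + w
```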
4.3.3.4 The measurement vector and the observation model The updating step of our propagated particle set can only be done with measurements at each time-step. In this problem, we do not have a device that directly gives us observations of the tracked object; the "only" thing that we have are range and intensity images. To derive information from them, it is necessary to create an appropriate segmentation algorithm to identify and to localize the desired object. There are many methods to perform a segmentation. Most of them are developed for normal colour images; nevertheless, some basic principles can be used for range images. In our case, only the range information is used for the segmentation, as the intensity images do not have enough invariant characteristics to be useful. The quasi-invariant characteristic feature used in this application for updating the particle set is the reduced range histogram. After a non-background segmentation computed with the difference between the range image and a mean background, the image is reduced by the range of the central pixel of the box. Afterwards, a range histogram is computed with m bins between a minimum and a maximum range value; all values outside this interval are assigned to the minimum or maximum bin, respectively. Furthermore, to increase the reliability of the range distribution, smaller weights are assigned to the pixels that are further away from the box centre by employing a weighting function:

$$k(r) = \begin{cases} 1 - r^2 & : r < 1 \\ 0 & : \text{otherwise} \end{cases}$$

where r is the distance from the region center normalized with the half diagonal of the concerned box. The bins $p_x^u$ for a box centered at location x are calculated as:

$$p_x^u = \frac{1}{\sum_{i=1}^{I} k(r_i)} \sum_{j=1}^{J} k(r_j) \quad \text{if the pixel } j \text{ belongs to the bin } u$$

where:
$p_x^u$ = the value of the bin u for a box centered at x
$k(r_i), k(r_j)$ = the weight of the pixel i or j
$r_i, r_j$ = the normalized distance from the center to the pixel i or j
i = index of all pixels in the box
j = index of the pixels in the box that belong to the bin u
I = number of pixels in the box
J = number of pixels in the box that belong to the bin u
Figure 10: Range image of a person (left) with its reduced range histogram (right). The histogram has 50 bins, the min. range = -1 m and the max. range = +1 m. The first bin belongs to the background.
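A possible implementation of the reduced range histogram is sketched below (our own, under assumptions): the box is given by its centre pixel and half-sizes in pixels, the reduction uses the range of the central pixel, and the kernel k(r) is normalized with the half diagonal of the box as described above.

```python
import numpy as np

def reduced_range_histogram(range_img, center, half_w, half_h,
                            n_bins=50, r_min=-1.0, r_max=1.0):
    """Weighted histogram of ranges inside a box, reduced by the range of the
    central pixel; pixels far from the box centre get smaller weights k(r)."""
    cx, cy = center
    y0, y1 = max(0, cy - half_h), min(range_img.shape[0], cy + half_h + 1)
    x0, x1 = max(0, cx - half_w), min(range_img.shape[1], cx + half_w + 1)
    patch = range_img[y0:y1, x0:x1] - range_img[cy, cx]      # reduced ranges
    u, v = np.meshgrid(np.arange(x0, x1) - cx, np.arange(y0, y1) - cy)
    r = np.hypot(u, v) / np.hypot(half_w, half_h)            # normalized by half diagonal
    k = np.where(r < 1.0, 1.0 - r**2, 0.0)                   # weighting function k(r)
    # values outside [r_min, r_max] fall into the first / last bin
    reduced = np.clip(patch, r_min, r_max - 1e-9)
    bins = ((reduced - r_min) / (r_max - r_min) * n_bins).astype(int)
    hist = np.bincount(bins.ravel(), weights=k.ravel(), minlength=n_bins)
    return hist / k.sum()                                    # normalized bins p_x^u
```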
4.3.3.4.1 The similarity measurement Theoretically, for a given object, the reduced range histogram is invariant over time if a box with the appropriate dimensions is at the correct position. Naturally, that is never exactly the case, and we need a method to measure the resemblance of two histograms. We use the Bhattacharyya coefficient and the associated Bhattacharyya distance. If p(u) and q(u) are two histograms, the Bhattacharyya coefficient is defined as:

$$\rho[p, q] = \sum_{u=1}^{m} \sqrt{p(u)\, q(u)}$$

and the distance as:

$$d[p, q] = \sqrt{1 - \rho[p, q]}$$
Figure 11: Range image of a person (left, in the x, y image space) with the corresponding Bhattacharyya coefficient field (right, in line/column space) for a constant box dimension (in object space). The histograms have 50 bins, min. range = -1 m and max. range = +1 m.
4.3.3.4.2 The updating procedure After the propagation step, we update the particle set with the incoming measurements. This is realized by assigning to each sample a weight $\pi^{(j)}$ that is proportional to $p(z_t \mid s_t^{(j)})$ [21]. To do that in our application, the Bhattacharyya coefficient has to be computed between the target histogram and the histogram of each hypothesis. Each hypothetical region is specified by its state vector $s^{(j)}$. As we want to favor particles whose reduced range histograms are similar to the target model, small Bhattacharyya distances correspond to large weights. We use a Gaussian function with variance $\sigma^2$:

$$\pi^{(j)} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{d\left[p_{s^{(j)}},\, q\right]^2}{2\sigma^2}}$$

where:
$p_{s^{(j)}}$ = the reduced range histogram of the sample $s^{(j)}$
$q$ = the reduced range histogram of the target
$\sigma$ = determined empirically
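For illustration, a sketch of the similarity and weighting computations follows (our own; sigma = 0.2 is a placeholder, since the empirically determined value is not reported here).

```python
import numpy as np

def bhattacharyya_coeff(p, q):
    """Bhattacharyya coefficient rho[p, q] of two normalized histograms."""
    return np.sum(np.sqrt(p * q))

def particle_weight(p_sample, q_target, sigma=0.2):
    """Weight pi^(j) of a particle from the Bhattacharyya distance between its
    reduced range histogram and the target histogram (sigma chosen empirically)."""
    d = np.sqrt(1.0 - bhattacharyya_coeff(p_sample, q_target))
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
```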
4.3.4 Tracking example A typical example is reported below. The background is subtracted and then the tracking of the moving person starts. In Figure 12 the 3D trajectory is also reported, together with the bounding box at selected instants.
Figure 12: Tracking of a person who moves and sits. The estimated trajectory is in white.
5. CONCLUSION In this contribution we have reported on the RIM technology, its critical aspects, the developed calibration procedures and our first results of tracking moving objects from RIM data. With the high-resolution range imaging camera SwissRangerTM, a powerful new tool for many applications with a need for metrology is available. Many influencing parameters have an impact on the accuracy of the system, but due to their systematic appearance, they can be identified and removed within a calibration procedure. The achievable accuracy with respect to the calibrations described is around 1 cm. The recovered relative accuracy shows the potential of the ToF sensor for middle-accuracy applications, like autonomous navigation or real-time modeling and mapping. As far as the tracking process is concerned, the results are promising, even if more tests should be performed. The recovered bounding boxes and 3D trajectories are useful for movement analysis, also for psychological investigations where movement analysis helps the psychologist in the examination of specific disorders of people with psychological problems that sometimes manifest in movement anomalies.
REFERENCES
1. C. Sminchisescu, "Three dimensional human modeling and motion reconstruction in monocular video sequences", Ph.D. Thesis, INRIA Grenoble, 2002
2. D.M. Gavrila and L. Davis, "3D model-based tracking of humans in action: a multi-view approach", IEEE Proc. of CVPR, pp. 73-80, 1996
3. D'Apuzzo, N., "Surface measurement and tracking of human body parts from multi station video sequences", Ph.D. Thesis, Nr. 15271, Institute of Geodesy and Photogrammetry, ETH Zurich, Switzerland, 2003
4. Sidenbladh, H., M. Black, and D. Fleet, "Stochastic tracking of 3D human figures using 2D image motion", ECCV02, Springer Verlag, LNCS 1843, pp. 702-718, 2000
5. Remondino, F. and Roditakis, A., "Human figures reconstruction and modeling from single images or monocular video sequences", IEEE International 3DIM Conference, pp. 116-123, Ottawa, Canada, 2003
6. Niclass, C., Besse, P.-A., Charbon, E., "Arrays of Single Photon Avalanche Diodes in CMOS Technology: Picosecond Timing Resolution for Range Imaging", Proceedings of the 1st Range Imaging Research Day at ETH Zurich, Supplement to the Proceedings, Zurich, Switzerland, 2005
7. Hinderling, J., "Distanzmesser, Funktionsprinzipien und Demonstration von EDM", unpublished scriptum "Geodätische Sensorik", Institute of Geodesy and Photogrammetry, ETH Zurich, 2004
8. Rüeger, J. M., "Electronic Distance Measurement", Springer-Verlag, ISBN 3-540-51523-2, 1990
9. Oggier, T., M. Lehmann, R. Kaufmann, M. Schweizer, M. Richter, P. Metzler, G. Lang, F. Lustenberger and N. Blanc, "An all-solid-state optical range camera for 3D real-time imaging with sub-centimeter depth resolution (SwissRangerTM)", Proc. of SPIE Vol. 5249, Optical Design and Engineering, pp. 534-545, 2004
10. Oggier, T., B. Büttgen, F. Lustenberger, G. Becker, B. Rüegg, and A. Hodac, "SwissRanger SR3000 and First Experiences based on Miniaturized 3D-TOF Cameras", Proceedings of the First Range Imaging Research Day at ETH Zurich, ISBN 3-906467-57-0, 2005
11. Möller, T., H. Kraft, J. Frey, M. Albrecht, and R. Lange, "Robust 3D Measurement with PMD Sensors", Proceedings of the First Range Imaging Research Day at ETH Zurich, ISBN 3-906467-57-0, 2005
12. Gokturk, S.B., H. Yalcin and C. Bamji, "A Time-Of-Flight Depth Sensor - System Description, Issues and Solutions", IEEE Computer Vision and Pattern Recognition Workshop, 2004
13. Ushinaga, T., I. A. Halin, T. Sawada, S. Kawahito, M. Homma, and Y. Maeda, "A QVGA-size CMOS time-of-flight range image sensor with background light charge draining structure", Proc. SPIE Vol. 6056, 2006
14. Stoppa, D., Gonzo, L., Simoni, A., "Scannerless 3D imaging sensors", IEEE International Workshop on Imaging Systems and Techniques, 2005
15. Kahlmann, T., Ingensand, H., "Calibration and improvements of the high-resolution range-imaging camera SwissRangerTM", Proc. SPIE Vol. 5665, pp. 144-155, 2005
16. Kahlmann, T., Remondino, F., Ingensand, H., "Calibration for increased accuracy of the range imaging camera SwissRanger", International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXVI, part 5, pp. 136-141, Dresden, Germany, 2006
17. Lange, R., "3D Time-of-Flight Distance Measurement with Custom Solid-State Image Sensors in CMOS/CCD-Technology", Dissertation, University of Siegen, 2000
18. Bar-Shalom, Y. and X.-R. Li, "Estimation and Tracking: Principles, Techniques, and Software", Artech House, Boston MA, 1993
19. Isard, M. and A. Blake, "CONDENSATION - conditional density propagation for visual tracking", International Journal of Computer Vision, 29(1), pp. 5-28, 1998
20. Guillaume, S., "Identification and Tracking of moving People in Range-Imaging Data", Semester work, Institute of Geodesy and Photogrammetry, ETH Zurich, Switzerland, 2006
21. Koller-Meier, E., "Extending the Condensation Algorithm for Tracking Multiple Objects in Range Image Sequences", Diss. ETH No. 13548, Zürich, 2000