2012 Eighth International Conference on Signal Image Technology and Internet Based Systems
Temporal denoising of Kinect depth data

Kyis Essmaeel 1,2,3, Luigi Gallo 1, Ernesto Damiani 2, Giuseppe De Pietro 1, Albert Dipandà 3

1 Institute of High Performance Computing and Networking, National Research Council of Italy, Via Pietro Castellino 111, Naples, Italy. {kyis.essmaeel,luigi.gallo,giuseppe.depietro}@na.icar.cnr.it
2 Department of Computer Technology, University of Milan, Via Comelico 39-41, Milan, Italy. [email protected]
3 Laboratoire LE2I (CNRS-UMR 5158), Aile des Sciences de l'Ingénieur, Université de Bourgogne, 9 Avenue Alain Savary, BP 47870-21078 Dijon Cedex, France. [email protected]

Abstract—The release of the Microsoft Kinect has attracted the attention of researchers in a variety of computer science domains. Even though this device is still relatively new, its recent applications have shown some promising results in terms of replacing current conventional methods such as stereo-cameras for robot navigation, multi-camera systems for motion detection and laser scanners for 3D reconstruction. While most work around the Kinect is on how to take full advantage of its capabilities, so far only a few studies have been carried out on the limitations of this device, and fewer still provide solutions to enhance the precision of its measurements. In this paper, we review and analyse current work in this area, and present and evaluate a temporal denoising algorithm to reduce the instability of the depth measurements provided by the Kinect over different distances.

Keywords—temporal denoising; Kinect; depth instability; smoothing depth data; adaptive gain

I. INTRODUCTION

For many years, research fields like motion capture, 3D reconstruction and robotics have used well-assessed technologies like video cameras and laser scanners. Specifically, in the domain of motion capture and 3D reconstruction, we can identify two main types of devices: marker-based and marker-less. In marker-based devices, sensors are attached to a subject and the signals coming from these sensors are then processed [2]. Despite their capability to provide accurate measurements [3], their usage is limited to certain applications, since they are inconvenient for daily use and quite complex to set up. In contrast, marker-less devices are unencumbering and much more affordable, since webcams, DSLRs and video recorders can be customized and combined together to fit almost any kind of application. However, using marker-less devices usually results in software with a higher computational cost. Moreover, several problems arise from using and combining such devices. In stereo-camera systems, for example, achieving a good calibration is a critical step to obtain a reliable depth map.

The release of the Microsoft Kinect™ was probably one of the major leaps forward in terms of hardware in this research field. In fact, the Kinect represents a good replacement for stereo-cameras and even for multi-camera systems that deliver a depth map of the scene. Furthermore, unlike laser scanners and ToF (Time of Flight) cameras, the Kinect is an economically convenient device, easy to set up and handle. This is the main reason why, in recent years, many systems using this device have been presented in different application domains like robotics [7], human motion analysis [8] and human-computer interaction [9].

Among the research domains where the Kinect can be profitably used, human motion analysis [13] seems to be one of the most promising [14]. Here, researchers struggle to build appropriate human body models with conventional devices. In fact, modelling the human body is an important step towards accurate motion analysis, particularly in those applications where fine details must be extracted from the human motion.

Despite all the previously mentioned benefits, the Kinect still has its own disadvantages and limitations. Whereas it has proved suitable for gaming applications, its performance in applications that require highly accurate data has not been sufficiently analysed. While most of the work around the Kinect has focused on how to exploit its capabilities in different kinds of applications, there have been few studies on its limitations, such as the precision of its depth measurements. Moreover, in applications where multiple Kinects need to be combined, the calibration of multiple Kinects, with the associated interference effect, represents a challenging task that is still far from being solved.

Among the few studies focusing on such limitations, in [12] a comparison of Kinect-based systems with marker-based multiple-camera systems on postural control assessment is carried out. The reported results show that the Kinect is promising, but still not as good as the other systems. In particular, depth instability, which is due to the nature of structured light-based sensors, prevents the use of Kinect depth data in applications that require a high level of accuracy.
978-0-7695-4911-8/12 $26.00 © 2012 IEEE DOI 10.1109/SITIS.2012.18
The way the Kinect functions creates some problems. The emitted IR pattern is highly affected by lighting conditions, which cause inaccurate measurements. Even in proper lighting conditions, the measurements are unstable because of the interference between the structured light and the objects in the scene. Moreover, the Kinect has a limited coverage area of around 3.5 meters. To overcome this limitation, multiple Kinects can be combined but, in this case, the calibration and the interference between the projected patterns are issues that have to be faced. Most of the techniques used to calibrate multi-Kinect systems come from the conventional methods used to calibrate multi-camera systems [6], taking advantage of the fact that the Kinect registers the depth image to the RGB image: it is therefore only necessary to calibrate the RGB cameras to have the depth maps calibrated as well. However, such an approach does not take into account some of the peculiar characteristics of the Kinect, like the shift between the infra-red and depth images [10] and the geometrical model [11]. Interference is a serious challenge in multi-Kinect set-ups. The intersection of the IR projector fields of two or more Kinects leads to missing data, since the device is unable to correctly recognize the projected IR pattern. Figure 2 shows an example of the effects of interference between two Kinects. The first depth image is taken from one of the two Kinects while the other is turned off. In the second depth image, where both sensors are turned on, we can clearly notice the black holes, which are the noise in the depth image. While this may not be a concern for applications that simply require coarse skeletal tracking [22], it is a primary concern in critical applications, particularly in the health/medical field, such as posture analysis. Unlike interference, the instability of depth measurements is a relevant problem that even a single Kinect suffers from.
We can notice a sort of vibration in the acquired depth values even when the object is kept still. Figures 3 and 4 show the values measured for one pixel over a period of time in a motionless and a moving situation respectively; the magnitude of this vibration increases with the distance from the sensor. The peculiar properties of Kinect-based depth map generation make traditional image denoising algorithms only slightly effective [16], [17].
Figure 1. Standard deviation of depth values related to distance
In this paper, we present an algorithm to cope with the instability of the depth measurements provided by the Kinect. The goal is to enhance the stability and reliability of the Kinect depth map without requiring a high computational cost. The rest of the paper is organized as follows. In Section II, we describe the previously proposed methods for Kinect noise reduction and discuss their limitations. In Section III, we describe the proposed temporal noise reduction algorithm. In Section IV, we report the experimental results we have obtained by applying the filter to depth measurements of both steady and moving objects at varying distances. Finally, in Section V, we present our conclusions.

II. MOTIVATIONS AND RELATED WORK

The Kinect consists of three main parts: an IR projector, an IR camera and an RGB camera. The IR projector emits an IR pattern onto the scene and the IR camera captures this pattern. Then, by triangulation, the Kinect computes a depth map of the scene. The images coming from the RGB camera are then matched against the depth images. A statistical analysis of the dependence of the sensor precision on distance is given in [4]. In this work, the authors show that the relation between the distance of the target from the depth camera and the range/standard deviation of the measured depth values fits a quadratic function (see figure 1). Since each depth image has a fixed resolution, the depth point density is inversely proportional to the square of the distance from the sensor along the perpendicular camera axis [15].
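The quadratic dependence of the depth noise on distance can be sketched with a simple helper function; the coefficients below are illustrative placeholders, not the values fitted in [4] or [15]:

```cpp
#include <cassert>

// Illustrative noise model: the standard deviation of the measured depth
// grows roughly with the square of the distance from the sensor.
// The coefficients a, b, c are placeholders, NOT the values fitted in [4].
double depth_stddev_mm(double distance_mm,
                       double a = 2.85e-6, double b = 0.0, double c = 0.5) {
    return a * distance_mm * distance_mm + b * distance_mm + c;
}
```

In Section III, this distance-dependent standard deviation plays the role of the Dmax threshold.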
Figure 2. Noise in Kinect depth data
Figure 3. Depth variation with time of a steady object placed at 2 meters from the sensor
Figure 4. Depth variation with time of an object moving at 2 meters from the sensor
Figure 5. Noise reduction using the adaptive threshold method
Some approaches proposed in the literature make use of feature-preserving denoising algorithms [18], [19] or of the bilateral filtering algorithm [20] to process and denoise the depth data. Recently, a denoising approach aimed at obtaining a reliable and noise-free depth map has been proposed in [21]. In this work, a block-based hole filling scheme is employed to predict the invalid depth values, and then a bilateral filter is applied in the spatial and temporal domains. In [24], the authors describe a temporal filtering method for depth maps which also uses the RGB image. Such a method can fill the holes in the depth image by using temporal smoothing, but no details are provided on how the algorithm performs in reducing the vibrations of the depth data. However, the approach most used in the literature to deal with depth instability is temporal noise suppression, which consists in using an adaptive threshold that changes with the distance from the sensor [4]. A drawback of this method is the discontinuity in the measured values, which is particularly noticeable when objects move slowly at a distance from the sensor (see figure 5). To cope with the instability of depth measurements without filtering out significant depth data, in this paper we propose a denoising algorithm that works in the temporal domain at a per-pixel level. The goal is to produce a stable and reliable depth image with an extremely low computational cost, for those applications that require both a high level of accuracy and real-time performance.
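As a rough sketch, the adaptive threshold suppression of [4] can be written per pixel as follows; `noise_threshold()` is a hypothetical stand-in for the distance-dependent threshold, with placeholder coefficients:

```cpp
#include <cassert>
#include <cmath>

// Sketch of adaptive threshold temporal suppression, in the spirit of [4]:
// a new measurement replaces the stored value only when it differs from it
// by more than a distance-dependent threshold; smaller changes are treated
// as noise and discarded. noise_threshold() is a hypothetical stand-in
// with placeholder coefficients, not the values used in [4].
double noise_threshold(double depth_mm) {
    return 2.85e-6 * depth_mm * depth_mm + 0.5;
}

double adaptive_threshold_filter(double measured, double previous) {
    if (std::fabs(measured - previous) > noise_threshold(previous))
        return measured;   // large change: accept the new measurement
    return previous;       // small change: treated as noise, but this also
                           // discards genuinely slow movements
}
```

Note how a slow movement whose per-frame change stays below the threshold is frozen entirely; this is the discontinuity visible in figure 5.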
III. THE TEMPORAL DENOISING ALGORITHM

As introduced in the previous section, the depth measurements provided by the Kinect suffer from instability due to the nature of structured light-based depth sensors. The effect of this problem can be noticed as a flickering and vibration of the depth values, and becomes more noticeable on the contours of the objects present in the scene and on reflective surfaces. Moreover, this vibration increases with the distance from the Kinect. Figures 3 and 4 depict the depth instability problem by recording the measured depth values of an arbitrarily chosen pixel of an object in the scene. Figure 3 shows the depth measured at this pixel with the object kept steady at 2 meters from the sensor, whereas figure 4 shows the depth measurements from the same pixel with the object being slowly moved back and forth at a 2 meter distance from the sensor. In both cases, the effect of the vibration is clearly noticeable.

To smooth the Kinect depth measurements while avoiding filtering out significant depth data, as happens with the adaptive threshold approach, we have modified the Smoothed Pointing algorithm [5], a velocity-based precision enhancing technique for distal pointing. The Smoothed Pointing technique works by dynamically adjusting the control-to-display (C-D) gain [12] according to the average velocity of the user's movements. The rationale of the algorithm is to filter out hand tremors by using a variable C-D gain: changing the gain according to the average velocity filters out hand tremors, since they result in a null average velocity. This peculiar feature makes the method suitable for filtering out Kinect depth vibrations. However, the filter also has to preserve the user's mental model of absolute-device operation. To achieve this aim, the offset between the measured position and the displayed position, which results from using a variable gain, has to be kept confined. In Smoothed Pointing, the maximum offset value is chosen according to the distance between the user and the display so that, according to visual acuity theory, the user does not notice the difference between the measured and displayed positions while the movement is smoothed. When the offset exceeds this threshold, namely Dmax, a recovery process moves the pointer so as to bring the displayed position of the cursor up to the measured one.

In order to tailor a velocity-based technique to filtering Kinect depth data, some differences with respect to distal pointing have to be taken into account. Firstly, Smoothed Pointing filters only the cursor position, i.e., only one point, while in our case we need to filter every value of the depth map. Secondly, in Smoothed Pointing the Dmax value is set so that the user does not perceive the difference between the measured and displayed positions of the cursor. When it comes to Kinect depth data, the Dmax value has to be set so as to discriminate between vibrations due to inaccurate measurements and slow movements. In our filter, the Dmax value corresponds to the level of noise/vibration we expect at a specific distance from the Kinect.
Figure 6. Adaptive threshold filtering vs. temporal denoising filtering
Therefore, we set the Dmax value to be equal to the standard deviation of the depth values which, as shown in figure 1, is proportional to the square of the distance from the sensor. In our implementation, the threshold thus changes with the distance from the sensor of each measured value. Moreover, the offset recovery procedure completes in one frame. The algorithm starts by computing the following variables: Δz(t), the difference between the current measured value and the value measured T ms before; the offset d_z(t), the difference between the current measured value at time t and the filtered value at time t−1, where t−1 is the time the previous value was measured; and s_z(t), the difference between the values measured at time t and at time t−1:
Δz(t) = z_measured(t) − z_measured(t − T)    (1)
d_z(t) = z_measured(t) − z_filtered(t − 1)    (2)
s_z(t) = z_measured(t) − z_measured(t − 1)    (3)
Therefore, if |s_z(t)| is larger than Dmax there will be no filtering; otherwise the measured value will be filtered using the gain:

z_filtered(t) = z_measured(t)                           if |s_z(t)| > Dmax
z_filtered(t) = z_filtered(t − 1) + g_z(t) · s_z(t)     otherwise    (7)
The gain value is calculated using two velocity thresholds, v_min and v_max. As previously stated, Dmax is the standard deviation of the depth value, whereas v_t is the average velocity over a period of time T, v̂_t is the normalized velocity, d̂_z(t) is the normalized offset and m̂_z(t) is a hybrid parameter set as the maximum between the normalized velocity and the normalized offset:

Dmax = SD(z_measured(t)),    d̂_z(t) = |d_z(t)| / Dmax    (4)
v_t = |Δz(t)| / T,    v̂_t = (v_t − v_min) / (v_max − v_min)    (5)
m̂_z(t) = max(v̂_t, d̂_z(t))    (6)

The function to compute the gain consists of three parts:

g_z(t) = g_min + ((1 − g_min)/2) · (sin(m̂_z(t) · π − π/2) + 1)    if m̂_z(t) ≤ 1
g_z(t) = d_z(t) / s_z(t)                                          if s_z(t) ≠ 0
g_z(t) = 0                                                        otherwise    (8)

The first part of the gain function is based on a sinusoid and is in charge of reducing the vibration and smoothing the slow movements, as long as the average velocity is smaller than v_max and the offset is still lower than the standard deviation at that distance. Figure 6 illustrates how the filter works compared with the adaptive threshold method. Whereas the adaptive threshold cuts all the depth data that differ from the previous value by less than the standard deviation, thus filtering out slow movements of objects in the scene, the proposed temporal denoising algorithm smooths slow movements, filtering only the vibrations, and recovers the offset in one frame when the new measured value differs from the previous one by more than the standard deviation at that distance from the sensor.

It is worth noting that the proposed denoising method works at the pixel level. It is based on the changes in the depth value of each single pixel, i.e., no information from the neighbouring pixels is needed. In fact, the vibration at one pixel is not correlated with that of its neighbours, unlike in the RGB image, where neighbouring pixels suffer from the same kind of perturbations [23].
The aim is to smooth the depth vibrations, which are present when the change in the measured value in a single frame is lower than the standard deviation at the measured depth, namely Dmax.
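To make the per-pixel update of eqs. (1)-(8) concrete, it can be sketched as follows. The parameter values (g_min, v_min, v_max, T) and the quadratic Dmax model are illustrative placeholders, not the values tuned in our experiments, and the caller is assumed to refresh z_meas_T_ago from a T-long history buffer:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Per-pixel state and update for the temporal denoising filter of
// eqs. (1)-(8). All depths are in millimetres. Parameter values and the
// quadratic Dmax noise model are illustrative placeholders only.
struct PixelFilter {
    double z_filt_prev;   // z_filtered(t-1)
    double z_meas_prev;   // z_measured(t-1)
    double z_meas_T_ago;  // z_measured(t-T), refreshed by the caller
    double g_min = 0.1, v_min = 10.0, v_max = 200.0, T = 1.0;

    static double dmax(double z) {          // placeholder noise model
        return 2.85e-6 * z * z + 0.5;
    }

    double update(double z_meas) {
        const double pi = std::acos(-1.0);
        const double Dmax  = dmax(z_meas);
        const double dz    = z_meas - z_filt_prev;            // eq. (2)
        const double sz    = z_meas - z_meas_prev;            // eq. (3)
        const double vt    = std::fabs(z_meas - z_meas_T_ago) / T;
        const double v_hat = (vt - v_min) / (v_max - v_min);  // eq. (5)
        const double d_hat = std::fabs(dz) / Dmax;            // eq. (4)
        const double m_hat = std::max(v_hat, d_hat);          // eq. (6)

        double z_out;
        if (std::fabs(sz) > Dmax) {
            z_out = z_meas;                 // eq. (7): no filtering
        } else {
            double g;                       // adaptive gain, eq. (8)
            if (m_hat <= 1.0)
                g = g_min + 0.5 * (1.0 - g_min)
                          * (std::sin(m_hat * pi - pi / 2.0) + 1.0);
            else if (sz != 0.0)
                g = dz / sz;                // recover the offset in one frame
            else
                g = 0.0;
            z_out = z_filt_prev + g * sz;   // eq. (7), filtered branch
        }
        z_meas_prev = z_meas;
        z_filt_prev = z_out;
        return z_out;
    }
};
```

A vibration of a few millimetres around a steady 2 m target is damped toward the previous filtered value, while a jump larger than Dmax passes through unchanged.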
Figure 7. Applied filter on a pixel of a steady object
Figure 9. Measured and filtered values from a fast moving object
Figure 8. Applied filter on a pixel of a moving object
Figure 10. Comparison of the temporal denoising and the adaptive threshold algorithms
IV. PRELIMINARY EXPERIMENTAL RESULTS

In this section, we show the results of the denoising process on the depth measurements of an object in both steady and moving situations. The thresholds used in the algorithm, g_min, v_min and v_max, were determined by tuning the algorithm on recorded measurements of an object at different positions and with different velocities, while the period T over which the velocity is averaged was set to 1 second. In particular, the value of T was chosen by observation: we observed that a lower value results in insufficient vibration reduction, whereas a higher value causes a noticeable lag between the measured and filtered values. Figure 7 shows the result of applying the filter to an object placed at 3 meters from the Kinect, at the limit of its coverage area. The measured depth values are in blue, whereas the green line shows the filtered values. Figure 8 shows the same situation with the object being slowly moved away from the Kinect. Also in this case, the depth vibration is reduced but the object's movement is not filtered out. The vibration problem is only slightly perceptible during fast movements; however, also in this case the filtering process does not introduce a significant lag, as shown in figure 9. Finally, figure 10 shows a comparison with the adaptive threshold suppression algorithm. As previously reported, this algorithm works by verifying whether the difference between the previously recorded value and the current measured value is greater than a depth-adaptive threshold: if this is the case, it replaces the depth value with the current measured value; otherwise, it keeps the previously recorded value. As shown in figure 10, the problem with this algorithm is that slow movements that actually happen are completely discarded, since they are considered to be noise. This results in discontinuities in the depth measurements. In contrast, in our approach the vibrations are reduced and the slow movements are not filtered out.

The algorithm has been implemented in the C++ programming language as an extension of the RGBDemo software [6]. On a commodity PC equipped with a 3.40 GHz processor, 8 GB of RAM and the Microsoft Windows 7 OS, the average time required to filter the whole Kinect depth map is 35 ms. Therefore, in its current implementation, the filtering process causes a slight decrease of the visualization frame rate, which is 30 Hz. However, since the algorithm is entirely data-parallel, it can be implemented to run on GPUs so as to speed up the computations. Furthermore, it is worth noting that current implementations of the most used neighbourhood filters result in a higher execution time on commodity hardware. For example, on the commodity PC we used in the preliminary experimental campaign, the bilateral filter with a 5x5 window size runs in 120 ms.
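Because each pixel touches only its own state, filtering a whole map parallelizes trivially. The sketch below uses OpenMP; `filter_pixel()` is a simplified stand-in (a running average, not the actual adaptive gain filter) for the per-pixel update of Section III:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Applying an independent per-pixel temporal filter to a 640x480 depth map.
// Since no neighbouring pixels are read, the loop is embarrassingly
// data-parallel (OpenMP here; a one-thread-per-pixel GPU kernel works the
// same way). filter_pixel() is a simplified stand-in for the actual
// per-pixel update of Section III.
constexpr int W = 640, H = 480;

uint16_t filter_pixel(uint16_t measured, uint16_t &state) {
    uint16_t out = static_cast<uint16_t>((measured + state) / 2);
    state = measured;  // per-pixel state, owned by exactly one pixel
    return out;
}

void filter_frame(const std::vector<uint16_t> &in,
                  std::vector<uint16_t> &state,
                  std::vector<uint16_t> &out) {
    #pragma omp parallel for  // safe: iteration i touches only index i
    for (int i = 0; i < W * H; ++i)
        out[i] = filter_pixel(in[i], state[i]);
}
```

The pragma is a no-op when OpenMP is disabled, so the same code runs sequentially or in parallel without changes.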
V. CONCLUSIONS

In this paper, we have presented a temporal denoising algorithm, based on an adaptive gain approach, aimed at reducing the instability of Kinect depth measurements. The paper also reviews existing approaches to cope with the interference and depth variability problems, outlining their advantages and limitations when used in critical applications, which require a high level of accuracy and real-time applicability. Our preliminary experimental results show that the filter reduces the depth instability without filtering out slow movements of objects present in the scene, thus making the algorithm suitable for use in applications that need to capture small details in movements, as in the field of human motion analysis. The results also demonstrate the real-time applicability of the whole filtering process.
REFERENCES

[1] Microsoft. Kinect for X-BOX 360. http://www.xbox.com/en-US/Kinect (2010).
[2] Herda, L., and Urtasun, R., "Hierarchical implicit surface joint limits to constrain video-based motion capture", ECCV 2004, LNCS 3022, pp. 405–418, Springer-Verlag (2004).
[3] Cerveri, P., and Pedotti, A., "Robust recovery of human motion from video using Kalman filters and virtual humans", Human Movement Science, vol. 22, no. 3, pp. 377–404 (2003).
[4] Maimone, A., and Fuchs, H., "Enhanced personal autostereoscopic telepresence system using commodity depth camera", Computers & Graphics, vol. 36, no. 7, pp. 791–807 (2012).
[5] Gallo, L., and Minutolo, A., "Design and comparative evaluation of Smoothed Pointing: a velocity-oriented remote pointing enhancement technique", International Journal of Human-Computer Studies, vol. 70, no. 4, pp. 287–300 (2012).
[6] Burrus, N., "RGBDemo project", [Online]. Available: http://labs.manctl.com/rgbdemo (2011).
[7] Hartmann, J., Forouher, D., Litza, M., Klüssendorff, J.H., and Maehle, E., "Real-time visual SLAM using FastSLAM and the Microsoft Kinect camera", ROBOTIK (2012).
[8] Clark, R. A., and Bryant, A. L., "Validity of the Microsoft Kinect for assessment of postural control", Gait & Posture (2012).
[9] Gallo, L., Placitelli, A. P., and Ciampi, M., "Controller-free exploration of medical image data: experiencing the Kinect", IEEE CBMS, pp. 1–6, IEEE, Los Alamitos, CA, USA (2011).
[10] Konolige, K., and Mihelich, P., "Technical description of Kinect calibration", Tech. Rep., Willow Garage. [Online]. Available: http://www.ros.org/wiki/Kinect_calibration/technical (2011).
[11] Smisek, J., and Jancosek, M., "3D with Kinect", IEEE International Conference on Computer Vision Workshops, pp. 1154–1160, IEEE, New York (2011).
[12] Frees, S., Kessler, G.D., and Kay, E., "PRISM interaction for enhancing control in immersive virtual environments", ACM Transactions on Computer-Human Interaction, vol. 14, no. 1 (2007).
[13] Aggarwal, J. K., and Cai, Q., "Human motion analysis: a review", Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428–440 (1999).
[14] Essmaeel, K., Gallo, L., Damiani, E., De Pietro, G., and Dipandà, A., "Multiple structured light-based depth sensors for human motion analysis: a review", IWAAL, LNCS, Springer Berlin / Heidelberg (2012). In press.
[15] Khoshelham, K., and Elberink, S.O., "Accuracy and resolution of Kinect depth data for indoor mapping applications", Sensors, vol. 12, no. 2, pp. 1437–1454 (2012). [Online]. Available: http://www.mdpi.com/1424-8220/12/2/1437
[16] Rudin, L.I., Osher, S., and Fatemi, E., "Nonlinear total variation based noise removal algorithms", Physica D, vol. 60, no. 1-4, pp. 259–268 (1992).
[17] Buades, A., Coll, B., and Morel, J.M., "A non-local algorithm for image denoising", CVPR, pp. 60–65 (2005).
[18] Schall, O., Belyaev, A., and Seidel, H.P., "Feature-preserving non-local denoising of static and time-variant range data", ACM SPM, pp. 217–222, ACM (2007).
[19] Chan, D., "Noise vs Feature: Probabilistic Denoising of Time-of-Flight Range Data", [Online]. Available: http://cs229.stanford.edu/proj2008/ChanDProbabilisticDenoisingOfRangeData.pdf
[20] Fleishman, S., "Bilateral mesh denoising", ACM SIGGRAPH 2003, vol. 22, no. 3, July (2003).
[21] Fu, J., Wang, S., Lu, Y., Li, S., and Zeng, W., "Kinect-like depth denoising", IEEE ISCAS, pp. 512–515, IEEE, May (2012).
[22] Satyavolu, S., Bruder, G., Willemsen, P., and Steinicke, F., "Analysis of IR-based virtual reality tracking using multiple Kinects", Virtual Reality Workshops (VR), 2012 IEEE, pp. 149–150, March (2012).
[23] Buades, A., Coll, B., and Morel, J.-M., "Image and movie denoising by nonlocal means", Tech. Rep. 25, CMLA (2006).
[24] Matyunin, S., Vatolin, D., Berdnikov, Y., and Smirnov, M., "Temporal filtering for depth maps generated by Kinect depth camera", 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), pp. 1–4, May (2011).