BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Publicat de Universitatea Tehnică „Gheorghe Asachi" din Iaşi
Tomul LX (LXIV), Fasc. 3-4, 2014
Secţia AUTOMATICĂ şi CALCULATOARE
REGION DETECTION AND DEPTH LABELING ON KINECT STREAMS

by

ALEXANDRU BUTEAN* and OANA BĂLAN

University POLITEHNICA of Bucharest, Faculty of Automatic Control and Computer Science
Received: December 2, 2014
Accepted for publication: December 20, 2014
Abstract. Although Kinect was designed as a gaming tool, studies in recent years have shown that this sensor can be used for real-time environmental scanning, segmentation, classification and scene understanding. Our approach, based on Kinect or any other similar device, gathers depth and RGB data from the sensors and processes the information in near real time. The purpose is to divide the data into distinct regions based on depth and colour and then to calculate the distance to each detected area (depth labelling). To achieve good performance in real situations involving humans, we started, unlike other existing segmentation or depth calculation solutions, from the observation that humans differ from objects: most objects are static and are therefore unlikely to change their dimensions and location from frame to frame. We propose a method in which regions are detected by merging two different segmentation methods: human detection using skeletal tracking, and the RANSAC algorithm as a method for object detection. Our experimental results show that the solution, running on a mobile device (notebook), achieves only a modest improvement of at most 7% over the RANSAC object detection method.

Key words: depth labelling; 3D segmentation; human detection; Microsoft Kinect; scene understanding.

2010 Mathematics Subject Classification: 00A06.
*Corresponding author; e-mail: [email protected]
1. Introduction

When Kinect was first introduced on the market, in 2010, it was a technological wonder whose main purpose was to serve as a complementary device for the XBOX 360 console, providing a new kind of gaming experience and unique interaction capabilities. Two years later the SDK was released, a crucial moment that triggered a wave of research in many computer science areas such as image processing (Yang et al., 2012), video flows (Abramov et al., 2013), 3D reconstruction (Ren et al., 2013; Chen et al., 2012; Yang et al., 2012) and depth calculation (Andersen et al., 2012; Wang & Jia, 2012).
Fig. 1 − Kinect main parts.
Fig. 1 shows the main parts: a depth camera, an RGB colour camera, an infrared sensor and a microphone array. The stream delivers real-time data at a frame rate of up to 30 fps and a maximum resolution of 640×480.
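For orientation, a minimal sketch follows, assuming the Kinect for Windows SDK 1.x used later in this paper, of how these colour, depth and skeleton streams are typically enabled. The class and enum names are the SDK's; the surrounding program structure is ours.

// Minimal stream setup sketch (Kinect for Windows SDK 1.x).
using System.Linq;
using Microsoft.Kinect;

class StreamSetup
{
    static void Main()
    {
        // Pick the first connected sensor, if any.
        KinectSensor sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        if (sensor == null) return;

        // 640x480 at 30 fps for both colour and depth, matching the text above.
        sensor.ColorStream.Enable(ColorImageFormat.RgbResolution640x480Fps30);
        sensor.DepthStream.Enable(DepthImageFormat.Resolution640x480Fps30);
        sensor.SkeletonStream.Enable();   // native human tracking, used in Sect. 3.1

        sensor.Start();
    }
}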
Fig. 2 − Kinect sample output.
Fig. 2 shows samples of the Kinect output. From left to right: the RGB camera, the depth camera and the infrared sensor. Our points of interest for this paper and for future research in this area are real-time segmentation (Yang et al., 2012), object detection (Li et al., 2011), 3D reconstruction (Izadi et al., 2011), indoor modelling (Shao et al., 2012) and important results on innovative depth calculating techniques (Newcombe et al., 2011; Chen et al., 2012; Khoshelham & Oude Elberink, 2012). Existing methods in this area have proven very effective for specific purposes. Since we would like to develop a general-purpose method, our approach is to merge two of the existing solutions in order to obtain better overall results. This research area is still young and there are many results left to harvest.
We propose a depth labelling system that calculates the distances from the Kinect sensor to the humans and objects detected in the viewport. This is achieved by merging human detection methods with object detection algorithms. Every human and object is treated as a region; what makes our approach unique is the observation that, across consecutive frames, humans' positions change very fast, so they should be treated differently from static segmentation objects and should receive more processing power and attention. The system is thus aware of the existence of humans in the scene, allowing different perspectives for scene understanding. Where possible, we also seek to reduce the processing power needed for real-time segmentation, so that the solution can later be used in an integrated mobile assistive device.

2. Related Work

The depth data stream from a Kinect camera makes it possible to create 3D reconstructions of indoor scenes and to use the results in applications such as CAD or gaming (Izadi et al., 2011). The KinectFusion system has many other uses: it can act as a low-cost handheld scanner, allowing users to capture an object from different viewpoints and receive immediate feedback on screen; it offers the possibility to segment a desired object from the scene through direct interaction; and it supports geometry-aware augmented reality, where a 3D virtual world is overlaid on, and interacts with, the real-world representation. Using this approach, aspects of real-world physics can be simulated in application areas such as gaming and robotics. A very interesting application is to provide input for real-time object localization with embedded audio description (Gomez et al., 2011), an efficient assistance method that may evolve in the near future into an electronic travel aid for the visually impaired.

Many types of depth sensors, such as stereo-based RGB-D cameras, Time-of-Flight cameras and Kinect, can capture grayscale/colour images and corresponding per-pixel depth values of dynamic scenes simultaneously at up to 30 fps. Since the colour images used for stereo vision have higher resolution and better texture than the depth sensors provide, it is reasonable to fuse the depth data from the depth sensors with colour cameras to produce corresponding high-resolution depth maps (Wang & Jia, 2012). Another proposed method improves the depth map given by the Kinect camera by filling holes, refining edges and reducing noise. The method first detects the pixels with wrongly assigned depth, usually the ones near object boundaries, then fills the holes by combining the region growing method and a bilateral filter (Chen et al., 2012).
3. Method Description

3.1. Human Detection
After studying existing similar solutions, we realized that all of them apply effective algorithms, but in the segmentation and classification stages all objects are treated with the same priority. In contrast, we use Kinect's built-in ability to detect and perform skeletal tracking of players in games (Fig. 3).
Fig. 3 − Kinect Skeleton Tracking points.
Fig. 4 − Kinect Skeleton Tracking lines.
Default skeletal tracking (Kar et al., 2010), illustrated in Fig. 4, outputs centre axis lines for each body part, but when region detection is needed these central axes are not enough to delimit the human body. Turning stick-skeleton tracking into human region detection requires help from the depth camera, which provides similar depth values around a specific pixel; a sketch of this step follows below. Using this idea, we can confidently assume that a human was detected correctly and track them during their entire presence in the stream (MSDN, 2014). The described method works well for the first six humans inside a frame; with a single Kinect device, hardware limitations do not allow the method to track more. For now we consider that in most practical cases it is unlikely to find more humans within a 3-metre radius; if there are more, they will still be detected by the object detection methods, with a minor performance drawback and without being labelled as humans. Of course, for a more precise count of the humans in a scene, several devices can run in parallel, each one dealing with at most six humans.
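A rough sketch of this step, again assuming SDK 1.x (the event wiring and buffer names are ours): the runtime writes a per-pixel player index into the low bits of the depth stream, which is what allows a stick skeleton to be expanded to a full human region.

// Sketch: tracked skeletons plus the per-pixel player index (SDK 1.x).
// Up to six players receive an index; 0 means "no player".
using System.Linq;
using Microsoft.Kinect;

class HumanDetection
{
    // Assumed to be wired to sensor.AllFramesReady after enabling the streams.
    static void OnAllFramesReady(object sender, AllFramesReadyEventArgs e)
    {
        using (SkeletonFrame sf = e.OpenSkeletonFrame())
        using (DepthImageFrame df = e.OpenDepthImageFrame())
        {
            if (sf == null || df == null) return;

            Skeleton[] skeletons = new Skeleton[sf.SkeletonArrayLength];
            sf.CopySkeletonDataTo(skeletons);
            int humans = skeletons.Count(
                s => s.TrackingState == SkeletonTrackingState.Tracked);

            short[] depthPixels = new short[df.PixelDataLength];
            df.CopyPixelDataTo(depthPixels);

            for (int i = 0; i < depthPixels.Length; i++)
            {
                // Low bits: player index; remaining bits: depth in millimetres.
                int player = depthPixels[i] & DepthImageFrame.PlayerIndexBitmask;
                int depthMm = depthPixels[i] >> DepthImageFrame.PlayerIndexBitmaskWidth;
                // player > 0 marks this pixel as part of a detected human region.
            }
        }
    }
}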
3.2. Camera Calibration

In parallel with human detection, we evaluated several methods for object segmentation that apply to all other objects in the stream. Before applying the segmentation process, as can be seen in Fig. 5, it has to be considered that the RGB camera has a significant offset compared to the depth camera (Abramov et al., 2013).
Fig. 5 − Color and depth offset.
In order to align the colour and depth data coming from two different cameras, a calibration step was applied using an already implemented method from the OpenNI toolbox, with the output shown in Fig. 6 (Villaroman et al., 2011).
Fig. 6 − Output results after color-depth calibration with OpenNI.
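The paper uses OpenNI for this step; purely for illustration, an equivalent mapping is also available in Kinect SDK 1.x through the CoordinateMapper, roughly as sketched below. The method and types are the SDK's (same Microsoft.Kinect namespace as the earlier sketches); the surrounding variables are assumed.

// Sketch: depth-to-colour alignment with the SDK's CoordinateMapper,
// an alternative to the OpenNI method used in the paper.
static void AlignDepthToColor(KinectSensor sensor, DepthImageFrame depthFrame)
{
    DepthImagePixel[] depthData = new DepthImagePixel[depthFrame.PixelDataLength];
    ColorImagePoint[] colorPoints = new ColorImagePoint[depthFrame.PixelDataLength];
    depthFrame.CopyDepthImagePixelDataTo(depthData);

    sensor.CoordinateMapper.MapDepthFrameToColorFrame(
        DepthImageFormat.Resolution640x480Fps30, depthData,
        ColorImageFormat.RgbResolution640x480Fps30, colorPoints);

    // colorPoints[i] now holds the (X, Y) position in the RGB image that
    // corresponds to depth pixel i, compensating for the camera offset.
}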
Due to the depth measuring principle, the depth image contains optical noise, unmatched edges and invalid pixels that sometimes lead to holes. This could affect our segmentation process and lead to a high detection error, so a smoothing step is needed (Chen et al., 2012). Usually the wrong pixels are located between the edges of the depth map and the corresponding edges in the colour image. This particular problem can be solved with the region growing facility of the OpenNI toolbox: growing is applied from the depth image edge until it reaches the colour image edge, and the exact same process is carried out from the colour image towards the depth image. The final mask is obtained by applying an AND operator to those two results. Once the invalid pixels are detected using the mask, the next step is to fill the holes with pixel values estimated from the valid pixels in the neighbourhood. To polish the results, a bilateral filter is applied. As shown in Fig. 7, the difference is remarkable and the process reveals sharp details without noise.
Fig. 7 − Smoothing depth data using OpenNI toolbox.
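The OpenNI calls themselves are not reproduced here; the sketch below reimplements the two ideas in plain C# on a raw depth buffer (all names and the 0-means-invalid convention are our assumptions), to make the hole estimation and edge-preserving filtering concrete.

// Sketch: fill invalid depth pixels from valid neighbours, then apply a
// small bilateral filter. Plain C#; not the actual OpenNI implementation.
using System;

static class DepthSmoothing
{
    // Replace holes (value 0) with the average of valid 3x3 neighbours.
    public static void FillHoles(ushort[] depth, int w, int h)
    {
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++)
            {
                int i = y * w + x;
                if (depth[i] != 0) continue;           // only fill holes
                int sum = 0, n = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                    {
                        ushort d = depth[(y + dy) * w + (x + dx)];
                        if (d != 0) { sum += d; n++; }
                    }
                if (n > 0) depth[i] = (ushort)(sum / n);
            }
    }

    // Bilateral filter at (x, y): weights decay with both spatial distance
    // and depth difference, so edges are preserved. Caller keeps (x, y) at
    // least 2 pixels away from the image border.
    public static ushort Bilateral(ushort[] depth, int w, int x, int y,
                                   double sigmaS, double sigmaR)
    {
        double acc = 0, wsum = 0;
        int c = depth[y * w + x];
        for (int dy = -2; dy <= 2; dy++)
            for (int dx = -2; dx <= 2; dx++)
            {
                int d = depth[(y + dy) * w + (x + dx)];
                if (d == 0) continue;                  // skip invalid pixels
                double ws = Math.Exp(-(dx * dx + dy * dy) / (2 * sigmaS * sigmaS));
                double wr = Math.Exp(-(double)(d - c) * (d - c) / (2 * sigmaR * sigmaR));
                acc += ws * wr * d;
                wsum += ws * wr;
            }
        return wsum > 0 ? (ushort)(acc / wsum) : (ushort)c;
    }
}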
3.3. Object Detection
The object detection method consists of segmenting the acquired point cloud elements using an adjacency matrix. The matrix was built from the given data, taking into consideration the distance metric we imposed (how close points must be in order to compare them). The adjacency matrix thus allows an efficient lookup of points based on their locality. For each cell of the adjacency matrix we compute an average normal given by the RANSAC algorithm (Li et al., 2012).

The RANdom SAmple Consensus (RANSAC) algorithm (Derpanis, 2010) is a general parameter estimation approach designed to cope with outliers in the input data. This resampling technique takes the minimum number of data points required to estimate the underlying model parameters: the algorithm randomly selects the minimum number of points required for a solution, solves for the model parameters, then checks how many data points from the set fit the model within a predefined tolerance. If the fraction of inliers over the total number of data points exceeds a predefined threshold, the model parameters are re-estimated using all identified inliers and the algorithm stops; otherwise the previous steps are repeated at most N times. N is chosen high enough to ensure, with probability p, that at least one of the sets of random samples does not include an outlier. Considering u the probability that any selected data point is an inlier, v = 1 − u the probability of observing an outlier, and m the minimum number of points required, the number of iterations N is calculated as follows:

N = log(1 − p) / log(1 − (1 − v)^m)                    (1)

To compute the average (plane) normal, we simply feed the adjacency cell into RANSAC. We then compare each point with each point from the neighbouring adjacency cells and check whether the distance between the compared points is smaller than a predefined threshold distance and whether the angle between their normals is smaller than a predefined threshold angle. If this heuristic holds for a given pair of points, we consider those points to be in the same segment. Afterwards, all points considered to be in the same segment are merged and labelled with a random colour.
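A compact sketch of these two pieces follows: the iteration count from Eq. (1) and a minimal RANSAC plane fit over one adjacency cell. The point type, thresholds and default probabilities are our assumptions, not the authors' exact implementation.

// Sketch: Eq. (1) iteration count and a minimal RANSAC plane fit.
using System;
using System.Collections.Generic;

struct P3 { public double X, Y, Z; }

static class RansacPlane
{
    // Eq. (1): N = log(1 - p) / log(1 - (1 - v)^m).
    public static int Iterations(double p, double v, int m)
    {
        return (int)Math.Ceiling(Math.Log(1 - p) / Math.Log(1 - Math.Pow(1 - v, m)));
    }

    // Fit plane n·x + d = 0 to one cell's points; returns the inlier count.
    public static int Fit(IList<P3> pts, double tol, out P3 n, out double d)
    {
        Random rng = new Random();
        int best = -1; n = new P3(); d = 0;
        int N = Iterations(0.99, 0.5, 3);          // m = 3 points define a plane
        for (int it = 0; it < N; it++)
        {
            P3 a = pts[rng.Next(pts.Count)];
            P3 b = pts[rng.Next(pts.Count)];
            P3 c = pts[rng.Next(pts.Count)];
            P3 cand = Cross(Sub(b, a), Sub(c, a)); // candidate plane normal
            double len = Math.Sqrt(Dot(cand, cand));
            if (len < 1e-9) continue;              // degenerate sample, resample
            cand.X /= len; cand.Y /= len; cand.Z /= len;
            double off = -Dot(cand, a);
            int inliers = 0;                       // points within tolerance
            foreach (P3 q in pts)
                if (Math.Abs(Dot(cand, q) + off) < tol) inliers++;
            if (inliers > best) { best = inliers; n = cand; d = off; }
        }
        return best;
    }

    static P3 Sub(P3 a, P3 b) { return new P3 { X = a.X - b.X, Y = a.Y - b.Y, Z = a.Z - b.Z }; }
    static double Dot(P3 a, P3 b) { return a.X * b.X + a.Y * b.Y + a.Z * b.Z; }
    static P3 Cross(P3 a, P3 b)
    {
        return new P3 { X = a.Y * b.Z - a.Z * b.Y,
                        Y = a.Z * b.X - a.X * b.Z,
                        Z = a.X * b.Y - a.Y * b.X };
    }
}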
3.4. Merging Methods

Object detection was the first method we implemented; it is more general and allows us to detect any kind of object. Its problem is that when the scene gets crowded the frame rate drops badly. After adding the human detection method, the system logic changed: human detection is
a lot faster because it uses native Kinect capabilities. Unfortunately, Kinect has native functions only for humans. Our approach therefore mixes the two methods, guided by the following steps (see the sketch after this list):
• duplicate the matrices for both the RGB and depth streams;
• apply the human detection methods;
• remove human pixels and depth data from the matrix that serves as input for object detection;
• process the matrices again with the object detection methods;
• merge the outputs and establish regions.
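A sketch of the removal step in the list above (the method name and the aligned-buffers assumption are ours): pixels attributed to a player by the tracker are zeroed in both matrices before object detection runs.

// Sketch: zero out player pixels before object detection (SDK 1.x).
// Assumes colour and depth were already aligned (Sect. 3.2).
using Microsoft.Kinect;

static class HumanRemoval
{
    public static void RemoveHumanPixels(short[] depthPixels,
                                         byte[] rgbPixels, int bytesPerPixel)
    {
        for (int i = 0; i < depthPixels.Length; i++)
        {
            int player = depthPixels[i] & DepthImageFrame.PlayerIndexBitmask;
            if (player == 0) continue;     // not part of a detected human
            depthPixels[i] = 0;            // drop the depth sample
            for (int b = 0; b < bytesPerPixel; b++)
                rgbPixels[i * bytesPerPixel + b] = 0;   // and the colour
        }
    }
}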
3.5. Depth Labelling

Our conceptual idea is to merge two methods, but in the end the results from both human and object detection are treated as regions. For every calculated region, the system finds the pixel with the smallest depth and places the calculated value (converted to metres) directly in a copy of the colour matrix, which is overlaid on top of the current colour matrix. The overlay is necessary because editing the colour data flow by altering the original pixels would affect the detection process.
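A minimal sketch of this labelling step (the region mask and method name are ours): raw SDK 1.x depth values are millimetres once the player bits are shifted out, so the conversion to metres is a division by 1000.

// Sketch: closest valid depth within a region, converted to metres.
using Microsoft.Kinect;

static class DepthLabelling
{
    public static double ClosestDistanceMetres(short[] depthPixels, bool[] regionMask)
    {
        int minMm = int.MaxValue;
        for (int i = 0; i < depthPixels.Length; i++)
        {
            if (!regionMask[i]) continue;
            int mm = depthPixels[i] >> DepthImageFrame.PlayerIndexBitmaskWidth;
            if (mm > 0 && mm < minMm) minMm = mm;   // 0 = invalid reading
        }
        return minMm == int.MaxValue ? -1.0 : minMm / 1000.0;  // mm -> m
    }
}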
4. Results

We implemented our proposed solution using Kinect SDK 1.8; development was done in Microsoft Visual Studio 2013 with WPF and C#. The experimental results shown in this section were obtained using a Kinect device connected to a notebook with an Intel i5 1.8 GHz processor, 4 GB of RAM, a 256 GB SSD and an Intel HD Graphics 4000 integrated GPU. The Kinect stream has a resolution of 640×480 with a nominal frame rate of 30 fps.

Fig. 8 − Object detection results.
Fig. 9 − Human detection results.
Fig. 10 − Human detection + Object detection results.
As we can see in Fig. 8, the object detection methods work smoothly on their own. From our measurements, the frame rate varies between 2 and 10 frames per second (FPS), depending on the number of objects in the scene.
As future work we will test using a completely white room, adding objects one by one, in order to establish exactly when the FPS starts to drop drastically. Fig. 9 shows the human detection results, working at 30 FPS thanks to the extended skeleton detection. Here we can clearly see the offset between the RGB image and the depth overlay: we intentionally did not apply calibration, in order to keep the native functions that allow fast and precise detection. Fig. 10 shows the final results with both methods activated. The detection was precise, but the measured frame rate was between 6 and 12 FPS. With an improvement of only 7% compared to the object detection method, the results show that this approach is still far from real time (30 FPS).

5. Conclusions

In this paper we presented a mixed method for region detection and depth labelling using the Microsoft Kinect sensor. Our approach brings to attention an uncommon classification that considers humans to be different from ordinary objects, and therefore uses different segmentation methods for them. For object segmentation we use region-growing smoothing and the RANSAC algorithm; for detecting humans we use an extended version of basic skeleton tracking. Output data from both methods are merged into regions and labelled with distance values.

It is well known that RGB-D 3D object segmentation needs a high-volume processing environment. By mixing an existing segmentation method with a native detection method, we aimed to run the solution on a mobile device (notebook) that does not have extreme processing resources. Since the progress was modest, in order to achieve better results in the future we would like to try parallel GPU processing (Asavei et al., 2010). As an important result, human detection works together with object detection. This solution allows an autonomous system to be aware of the existence of humans in the scene, information that can be an influential improvement for scene understanding. An interesting idea would also be to test this solution with the new Kinect One, launched together with the Xbox One console, which comes with Full HD cameras and a highly improved SDK.

Acknowledgments. The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreements POSDRU/159/1.5/S/134398 and POSDRU/159/1.5/S/132395.
REFERENCES

* * * Microsoft Developer Network, Method for Using Skeleton Tracking. http://msdn.microsoft.com/en-us/library/jj131025.aspx, accessed 2014.
Abramov A., Pauwels K., Papon J., Wörgötter F., Dellen B., Depth-Supported Real-Time Video Segmentation with the Kinect. IEEE Workshop on Applications of Computer Vision (WACV), 457−464, 2013.
Andersen M.R., Jensen T., Lisouski P., Mortensen A.K., Hansen M.K., Gregersen T., Ahrendt P., Kinect Depth Sensor Evaluation for Computer Vision Applications. Electrical and Computer Engineering Technical Report ECE-TR-6, 2012.
Asavei V., Moldoveanu A., Moldoveanu F., Morar A., Egner A., GPGPU for Cheaper 3D MMO Servers. 9th WSEAS International Conference on Telecommunications and Informatics, Session Information Science and Applications, 238−243, 2010.
Chen L., Lin H., Li S., Depth Image Enhancement for Kinect Using Region Growing and Bilateral Filter. 21st International Conference on Pattern Recognition (ICPR), 3070−3073, 2012.
Derpanis K.G., Overview of the RANSAC Algorithm. Image Rochester NY, 4, 2-3, 2010.
Gomez J.D., Mohammed S., Bologna G., Pun T., Toward 3D Scene Understanding via Audio-Description: Kinect-iPad Fusion for the Visually Impaired. ASSETS '11, Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility, 293−294, 2011.
Izadi S., Kim D., Hilliges O., Molyneaux D., Newcombe R., Kohli P., Shotton J., Freeman D., Davison A., Fitzgibbon A., KinectFusion: Real-Time 3D Reconstruction and Interaction Using a Moving Depth Camera. Microsoft Research, 2011.
Kar A., Mukerjee A., Guha P., Skeletal Tracking Using Microsoft Kinect. Methodology, 1, 2010.
Khoshelham K., Oude Elberink S., Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications. Sensors, 12, 1437−1454, 2012.
Li T., Putchakayala P., Wilson M., 3D Object Detection with Kinect. 2012.
Newcombe R.A., Izadi S., Hilliges O., Molyneaux D., Kim D., Davison A.J., Kohli P., Shotton J., Hodges S., Fitzgibbon A., KinectFusion: Real-Time Dense Surface Mapping and Tracking. ISMAR '11, Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality, 127−136, 2011.
Ren C.Y., Prisacariu V., Murray D., Reid I., STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data. Proc. Int. Conf. on Computer Vision, Sydney, Australia, 2013.
Shao T., Xu W., Zhou K., Wang J., Li D., Guo B., An Interactive Approach to Semantic Modeling of Indoor Scenes with an RGBD Camera. ACM Transactions on Graphics (TOG), Proceedings of ACM SIGGRAPH Asia 2012, 31, 6, Article No. 136, 2012.
Villaroman N., Rowe D., Swan B., Teaching Natural User Interaction Using OpenNI and the Microsoft Kinect Sensor. SIGITE '11, Proceedings of the Conference on Information Technology Education, 227−232, 2011.
Wang Y., Jia Y., A Fusion Framework of Stereo Vision and Kinect for High-Quality Dense Depth Maps. ACCV'12, Proceedings of the 11th Asian Conference on Computer Vision, 2, 109−120, 2012.
Yang Z., Jin L., Tao D., Kinect Image Classification Using LLC. ICIMCS '12, Proceedings of the 4th International Conference on Internet Multimedia Computing and Service, 50−54, 2012.
DETECTION OF REGIONS OF INTEREST AND LABELLING OF DISTANCES USING KINECT DATA STREAMS

(Abstract)

Although Kinect was designed as a tool for the console gaming industry, in recent years studies have shown that this sensor can be used for real-time scanning and understanding of the surrounding environment, object segmentation and classification. Our approach uses the data from the depth and RGB cameras and processes the information in near real time. The goal is to divide the data into distinct regions based on depth and colour, and then to calculate the distance for each detected area (depth labelling). What makes this approach unique is that we considered humans in a scene to be different from objects: most objects are static, so they are unlikely to change their dimensions and location from frame to frame. We propose a method in which the regions of interest are detected by merging two different segmentation approaches: human detection using skeleton detection, and the RANSAC algorithm as an object detection method. The experimental results so far show that the solution runs on a mobile device (notebook) with a very modest improvement of at most 7% compared to the RANSAC object detection method. Methods for 3D object detection and segmentation using a depth camera usually run on systems with high processing power; our research aimed to run these methods on low-power, laptop-class mobile devices that do not benefit from such resources. 3D object detection systems based on the RANSAC algorithm run on our mobile configuration at maximum values of 10 FPS, so an improvement of only 7% using the mixed method does not fully solve the problem. These results nevertheless encourage us and lead us towards new optimization attempts using parallel processing methods on graphics card processors.