Fast and Robust Detection and Tracking of Multiple Persons on RGB-D Data fusing Spatio-Temporal Information

Sebastian Starke, Norman Hendrich, Hannes Bistry, Jianwei Zhang
Dept. of Informatics, University of Hamburg, Germany
{9starke,hendrich,bistry,zhang}@informatik.uni-hamburg.de

Abstract— In this paper, we present an efficient and adaptive method for detecting and tracking multiple persons that provides real-time capability and high robustness to outlier noise. Given an RGB-D image data sequence, our algorithm combines two independent approaches for person detection: first, a cluster-based segmentation and classification on RGB-D point clouds, and second, a face detection on RGB images, where each method itself is post-processed by spatio-temporal filtering for tracking and sensitivity purposes. Our analysis and experimental results show that the combined approach performs significantly better than the individual solutions and greatly reduces the number of false positives in situations where one detector fails.

Keywords: Person Detection, Information Fusion, Face Detection, Tracking, Mobile Robotics, RGB-D Data, Computer Vision, Pattern Recognition

I. INTRODUCTION

The problem of detecting, tracking and recognizing objects within a noisy real-world environment is surprisingly easy for humans. In contrast, understanding the underlying processes and finding patterns to create efficient and robust algorithms has so far proven difficult. This is especially the case when trying to detect deformable objects such as persons, which can dynamically change their pose and even differ in size, shape and other features. General problems are caused by unusual poses as well as by objects in the environment that closely resemble humans in shape. Changing lighting conditions pose a further problem. Additionally, it is often not sufficient only to detect persons; multiple persons must also be tracked at the same time. Large crowds can make this a very complex task, since it becomes highly difficult to extract separate, individually segmented objects. However, there is a great variety of modern applications – such as mobile robotics, security systems or video games – that increasingly rely on solving exactly this kind of problem. Those applications very often also require real-time capability and high reliability. This paper addresses both the computer vision and the robotics community; hence we also make links to the ROS framework [1], [2] and focus on person detection and tracking as a common task in modern mobile robotic applications, from which this paper originally evolved.

The ensuing sections are structured as follows. Section II summarizes the state of the art and describes recent approaches. Section III then introduces our algorithmic approach, describing the extended methods for cluster analysis and face filtering as well as the information fusion method. The corresponding experimental results and improvements by our approach are presented in section IV. Section V concludes with a summary, accompanied by an outlook to future work in section VI.

This work was partially supported by the European Commission in project Robot-Era FP7-288899, www.robot-era.eu/

II. RELATED WORK

Person detection in general is a classical problem and has been studied for decades. Accordingly, there are numerous approaches and probably thousands of publications addressing this subject. In this section, we highlight the work that is most relevant for our solution. Most classical algorithms for person detection are based on 2D image processing, either due to computational efficiency or sensor availability. A robust method for real-time object detection was introduced by Viola & Jones [3], which has become a landmark in this area of research. It performs a cascade classification on Haar-like features and was later also specialized for the task of face detection [4]. A corresponding implementation and documentation is provided by [5]. Overall, the algorithm obtains good results in detecting most faces in the scene, but the detection rate is not persistent over consecutive frames. A related main issue is the frequent occurrence of false positives within single frames. However, [6] recently showed that this solution is highly efficient from a computational point of view and hence very useful for many classification tasks in computer vision. [7] presented a method for object detection and recognition that computes specific HoG (Histogram of Oriented Gradients) features in order to describe representative shapes. Furthermore, the availability of low-cost depth sensors (e.g. the Kinect) has enabled novel approaches based on depth images and colored point clouds. Building on this, [8], [9] proposed an efficient method on RGB-D point clouds to detect and track standing or walking people on a planar ground plane in real time on a standard CPU. Objects are clustered in a way that persons standing close together and upright are reliably separated from each other, and lower-body occlusion is corrected by extending clusters at a certain height down to the ground. A linear soft-margin SVM (Support Vector Machine) classification on a pre-trained data set is then applied to each cluster in order to detect persons based on their HoG features. [10] provides a documented implementation. The overall classification accuracy is better than that of [5], but our tests still returned frequent false positives within single frames, and the detection rate for persons was highly dependent on their individual poses.

Moreover, the ROS package openni_tracker is available and was evaluated. It is based on the body and arm motion tracking originally developed for the Microsoft Kinect sensor and its games [11]. The software detects up to six standing and sitting persons simultaneously and can track moving people across occlusions. However, the tracker sometimes started to permanently track objects after they had been touched by, and thus merged with, tracked persons. In addition, the development of the open-source library has been discontinued, and the latest NiTE2 library can no longer be used with ongoing ROS development. Thus, it cannot be integrated into our project with mobile service robots, which is directly based on ROS.

While this paper primarily focuses on reliable detection of persons, recognition and user identification are closely related and essential tasks, for which first landmarks in this area of research were set by [12]. Recently, [13], [14] designed an efficient and robust real-time framework which was implemented on the mobile service robot Care-O-bot. Their solution covers all tasks from detection, tracking and recognition to interaction with persons. Detection and tracking were done by fusing temporal information of a person's head and face, followed by approaching an identified person. A ROS implementation, the cob_people_detection node, is readily available at [15].

Other popular person detection and recognition methods besides pure computer vision solutions using SVM, HoG and Haar-like features are rooted in biologically inspired neural processing. The most promising deep learning architectures for visual processing are convolutional neural networks (CNN), which reach current state-of-the-art results [16], [17] with accuracy rates of >90%. Unfortunately, the learned models are typically hard to interpret, and their computational performance may not be suitable for many tasks with real-time requirements. For a broader overview of current research in person detection and its evolution, see [18], [19].
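For reference, a minimal sketch of how the cascade classifier from [5] is typically invoked via OpenCV's Python bindings; the input image path is a placeholder and the parameter values are common defaults, not taken from the paper:

```python
import cv2

# Load the pre-trained frontal face cascade shipped with OpenCV [5].
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

image = cv2.imread("frame.png")  # placeholder input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale returns one (x, y, w, h) bounding box per detected face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5, minSize=(30, 30))
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 255, 255), 2)
```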

III. ALGORITHMIC SOLUTION

Our goal was to develop a method that can simultaneously detect and track multiple persons without strictly depending on their pose or the camera's viewing angle, while remaining robust to outlier noise and illumination disturbances. This automatically raises the question of which features are most representative for detecting a person, either by pure classification or by higher-level reasoning. Obviously, the shape is one of the most prominent features for a binary distinction between a person and other objects, meaning that many objects can be excluded before further processing. However, several objects, such as cupboards, pillars or even tall chairs, can closely resemble humans in shape. Hence, exclusive classification by shape can yield a huge number of false positives.

Considering our mobile service robot scenario, interaction between humans and robots typically involves a person looking directly at the robot. The presence of a face is therefore a striking and reliable indication of a person, particularly when certain facial components such as eyes, ears, nose or mouth are also detected. However, faces might also appear on plain photographs and – above all – can only be detected from certain viewing angles, which in turn depends on a person's pose. At the same time, a detected face on an object cluster whose shape is also representative of a person gives rise to high confidence in the classification result.

Following this idea, and generalizing the concept that persons within an immediate environment cannot simply disappear, our method builds on the algorithms presented in [4], [9], where both are individually and independently extended and improved by spatio-temporal filtering and finally combined. Accordingly, our information fusion approach explicitly integrates the color and depth information of consecutive frames obtained from the video stream of the robot's camera system. Each filtering stage results in more robust person tracking and the suppression of noise from static objects within the environment, while the fusion contributes to a higher detection rate for persons. Figure 1 illustrates the complete information fusion architecture of our person detection and tracking approach.

Figure 1: Information fusion processing scheme for the person detection and tracking approach.

Both the cluster-based and the face detection method can be run as separate, independent threads. This is greatly advantageous since new images and point clouds are usually available at different timestamps. Hence, larger timescale variations within the separate processing chains do not delay each other.

Accordingly, high-confidence states for detection and tracking can be achieved in less time than when running strictly in parallel within each frame.

In order to make our algorithms easier to comprehend, we will use the following formal notations:
• "∼": An object which has been directly obtained from one of the underlying basic algorithms and is not yet considered a true positive (candidate).
• "+": An object which is being tracked using spatio-temporal filtering (potential).
• "∗": A tracked object that has reached a high-confidence state and is considered a true positive (true).

A. Cluster Filtering

In our tests, the method presented in [8]–[10] obtained good results and was roughly able to keep track of moving persons using frame-by-frame classification. However, false positives belonging to chairs or walls were frequently detected within single frames, and a person's track was lost as soon as they adopted unusual poses or the illumination changed. More generally, it was not possible to detect sitting, kneeling, bending or crouching people. In order to obtain higher reliability in classification and continuous person tracking, our extension by spatio-temporal filtering applies the following heuristic: it calculates and analyzes the area around potential clusters using an adaptive distance threshold and releases true person clusters only after a certain minimum sum of counts has been reached, which acts as a secondary confidence measure.

Algorithm III.1 describes the filtering approach to determine the true person clusters PC∗ among all stored and tracked potential person clusters PC+. C∼ are the frame-selective candidate clusters obtained by [9]. Parameter c denotes the confidence level computed by the SVM that is required to keep observing any Ci∼, where this initial threshold can be relaxed to c′ if there is already a known PCi∗ within a given Euclidean distance threshold weighted by v′. Further, v weights the maximum distance that each closest cluster pair (PCi+, Ci∼) is allowed to have – in other words, how far any tracked cluster PCi+ may have moved when searching for corresponding clusters. Note that both v and v′, with v′ < v, are interpretable as velocity weights multiplied with the delta time ∆t between two consecutive frames. Lastly, kmin defines the sum of counts that is required to mark any PCi+ as PCi∗, bounded from above by kmax. It should be noted that even if the confidence ci+ of any PCi+ drops below c, it is still possible to reliably classify whether it represents a PCi∗ based on its heuristic sum of counts. This is enabled by trading a smaller confidence c′ against the presence of a true person PCi∗ within an immediate distance, which is in turn controlled by the constrained velocity weight v′ instead of v. Also, if any PCi∗ temporarily leaves the field of view and returns within a short time, it is possible to immediately reallocate it instead of adding a new cluster to PC+ and slowly increasing its belief state. Furthermore, a robust multimodal tracking is achieved where clusters in PC+ are strictly only allowed to be merged with those in C∼. Note that this also applies to PC∗, since the relation PC∗ ⊆ PC+ always holds.

Algorithm III.1: FilterClusters(C∼)
Parameters: c, c′, v, v′, kmin, kmax

for all i+ ∈ PC+ do
    SetUntracked(i+)
for all i∼ ∈ C∼ do
    if ¬( ci∼ ≥ c  or  ( ci∼ ≥ c′ and ∃ i∗ ∈ PC∗ : ‖i∗, i∼‖ ≤ v′ · ∆t ) ) then
        C∼ ← C∼ \ {i∼}
for all (i+, i∼) ∈ (PC+, C∼) with argmin(i+,i∼) ‖i+, i∼‖ ≤ v · ∆t do
    i+ ← i∼
    SetTracked(i+)
    C∼ ← C∼ \ {i∼}
for all i+ ∈ PC+ do
    if IsTracked(i+) then
        if ki+ < kmax then ki+ ← ki+ + 1
    else
        ki+ ← ki+ − 1
        if ki+ < kmin then PC+ ← PC+ \ {i+}
for all i∼ ∈ C∼ do
    PC+ ← PC+ ∪ {i+ ← i∼}
    ki+ ← 1
return PC∗ ← { i+ ∈ PC+ : ki+ ≥ kmin }

Alg. III.1: Cluster-based person filtering: The algorithm evaluates all potential person clusters PC+, adjusts confidence counts depending on nearby clusters and returns the list of high-confidence true persons PC∗.
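To make the bookkeeping of Algorithm III.1 concrete, the following Python sketch mirrors one filtering step; the Cluster class, the greedy nearest-neighbour pairing and all names are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

class Cluster:
    def __init__(self, centroid, confidence):
        self.centroid = np.asarray(centroid)  # 3D cluster centroid
        self.confidence = confidence          # SVM confidence from [9]
        self.count = 1                        # spatio-temporal count k
        self.tracked = False

def filter_clusters(candidates, potentials, trues, dt,
                    c=-1.5, c_relaxed=-3.0, v=2.5, v_relaxed=1.5,
                    k_min=5, k_max=10):
    """One spatio-temporal filtering step in the spirit of Algorithm III.1."""
    for p in potentials:
        p.tracked = False

    # Keep candidates with confidence >= c, or >= c_relaxed if a true
    # person lies within the constrained radius v_relaxed * dt.
    def keep(cand):
        if cand.confidence >= c:
            return True
        return cand.confidence >= c_relaxed and any(
            np.linalg.norm(t.centroid - cand.centroid) <= v_relaxed * dt
            for t in trues)
    candidates = [cand for cand in candidates if keep(cand)]

    # Greedily merge each potential with its closest remaining candidate
    # inside the velocity-weighted radius v * dt.
    for p in potentials:
        best = min(candidates, default=None,
                   key=lambda cand: np.linalg.norm(p.centroid - cand.centroid))
        if best is not None and np.linalg.norm(p.centroid - best.centroid) <= v * dt:
            p.centroid, p.confidence = best.centroid, best.confidence
            p.tracked = True
            candidates.remove(best)

    # Update counts; drop untracked potentials whose count falls below k_min.
    survivors = []
    for p in potentials:
        if p.tracked:
            p.count = min(p.count + 1, k_max)
            survivors.append(p)
        else:
            p.count -= 1
            if p.count >= k_min:
                survivors.append(p)

    # Unmatched candidates start new potential tracks with count 1.
    survivors.extend(candidates)

    trues = [p for p in survivors if p.count >= k_min]
    return survivors, trues
```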

those in P C ∼ . Note that this is also true for P C ∗ since the relation P C ∗ ⊆ P C + always holds. B. Face Filtering The method presented in [4] was originally developed for a frame-individual face detection on single static RGB images and thus without consideration of temporal correlation. Therefore, similar classification results with [5] occured as for [10] where false positives predominantly appeared within single frames caused by illumination and sensor noise. Also, many instances of incorrectly classified faces and their visualized bounding boxes appeared to have a very different size from the actual faces in the scene. Since our scenario in person detection on mobile robots receives a stream of RGB image sequences, we again decided to use the extension of spatio-temporal filtering in order to increase the classification sensitivity and to track the potential faces F + among all frame-selective candidate faces F ∼ obtained by [4]. See algorithm III.2 for further details. Parameter kmin similarly describes the required minimum number of consecutively detected faces corresponding to each other and is limited by the upper bound of kmax . This repeatedly represents the confidence information deciding whether to mark any Fi++ as a true face Fi∗∗ . The main distinction from the filtering approach in section III-A comes from the computation of spatial measures in rectangular-shaped

pixel lattices and thus from the decision on whether two detected bounding boxes (2i+ , 2i∼ ) of faces F + and F ∼ actually belong to each other. To solve this problem, we introduced two parameters to restrict unrealistically large changes in pixel area and distance. The first denotes the required minimum ratio of intersected area øA within a Fi++ by an overlapping Fi∼∼ and the second the maximum allowed deviation σA in relative size. This forces the algorithm only to merge faces which lie within same depth layers and are of similar size but also to allow partial overlapping. Since the face detection in [4] has originally evolved from [3] – which is a universal algorithm for object detection –, it can also be used for detecting certain facial components like eyes, nose or mouth. Hence, our solution optionally supplies an additional classification step where the confidence for each Fi++ can be adjusted by a weighted sum of detected components. This typically obtains hitting kmin in fewer frames but possibly enforces the trade-off in requiring higher computational cost what might result in a lower frame rate. This extension allows us to reliably filter out almost all false positives detected by the classical Viola & Jones face detection algorithm [4] in advance of further processing. C. Matching Since the cluster-based approach with additional spatiotemporal filtering already obtained quite reliable detection

Algorithm III.2: FilterFaces(F∼)
Parameters: øA, σA, kmin, kmax

for all i+ ∈ F+ do
    SetUntracked(i+)
for all (i+, i∼) ∈ (F+, F∼) with argmax(i+,i∼) A(□i+ ∩ □i∼) / Ai+ ≥ øA
        and 1 − σA ≤ Ai+ / Ai∼ ≤ 1 + σA do
    i+ ← i∼
    SetTracked(i+)
    F∼ ← F∼ \ {i∼}
for all i+ ∈ F+ do
    Optionally: detect facial components using [3] and adjust ki+ by a weighted sum
    if IsTracked(i+) then
        if ki+ < kmax then ki+ ← ki+ + 1
    else
        ki+ ← ki+ − 1
        if ki+ < kmin then F+ ← F+ \ {i+}
for all i∼ ∈ F∼ do
    F+ ← F+ ∪ {i+ ← i∼}
    ki+ ← 1
return F∗ ← { i+ ∈ F+ : ki+ ≥ kmin }

Alg. III.2: Image-based face filtering: The algorithm evaluates all potential faces F + , adjusts confidence levels depending on partially overlapping and equally sized faces and returns the list of high-confidence true faces F ∗ .
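The bounding-box association test of Algorithm III.2 can be sketched as follows; the (x, y, w, h) box format and the helper name are our own assumptions:

```python
def boxes_match(box_a, box_b, min_overlap=0.5, max_size_dev=0.2):
    """Decide whether a tracked face box and a candidate box belong together.

    Boxes are (x, y, w, h) pixel rectangles; min_overlap plays the role of
    the parameter o_A and max_size_dev the role of sigma_A."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Intersection area of the two axis-aligned rectangles.
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = ix * iy

    area_a, area_b = aw * ah, bw * bh
    overlap_ratio = intersection / float(area_a)  # share of box_a covered
    size_ratio = area_a / float(area_b)           # relative box size

    return (overlap_ratio >= min_overlap and
            1.0 - max_size_dev <= size_ratio <= 1.0 + max_size_dev)
```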

C. Matching

Since the cluster-based approach with additional spatio-temporal filtering already obtained quite reliable detection and surprisingly good tracking results, we decided to treat the extended face detection as a supplementary verification of raw unclassified candidate clusters. Also, the detection of faces is highly dependent on the camera's viewing angle, to which the cluster-based person detection using HoG features is less sensitive. Moreover, the presence of a face does not necessarily imply a corresponding cluster – e.g. due to photographs or resembling contrast patterns on flat surfaces – but more likely vice versa. Hence, it is preferable to adjust a candidate cluster's confidence before further exploring whether any corresponding face can be detected. Punishing a cluster candidate for which no matching face was found would, conversely, result in lower detection rates for persons that are not oriented towards the camera or are in unusual poses. This has to be taken into account in order not to decrease each algorithm's individual classification accuracy by fusing them.

Algorithm III.3 describes our information fusion approach, combining the high-confidence faces obtained from the preprocessed RGB image with the raw cluster candidates within the RGB-D point cloud. With F∗, we are given all corresponding 2D bounding boxes of faces which have reached a high-confidence state and are therefore considered true positives. The only required parameter is κ, which denotes the constant confidence gain for any candidate cluster Ci∼ if its 2D-projected cluster centroid Mi∼ is contained within at least one bounding box of F∗. Note that this confidence gain is directly added to the candidate cluster's confidence initially computed by the SVM. The true person clusters PC∗ can then be determined by taking the output C∼ with adjusted confidences ci∼ and applying the cluster filtering algorithm III.1. The beneficial outcome of this matching is a significantly increased detection rate in cases where an untracked cluster's HoG is not representative enough. Conversely, potential false positives are unlikely to be amplified, since a detected face without a corresponding cluster is strictly ignored.

Algorithm III.3: Matching(C∼, F∗)
Parameters: κ

for all i∼ ∈ C∼ do
    if ∃ i∗ ∈ F∗ : Mi∼ ⊆ □i∗ then
        ci∼ ← ci∼ + κ
return C∼

Alg. III.3: Matching: Each candidate cluster in C∼ gains a constant confidence κ if there is at least one corresponding true face in F∗.
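A sketch of the confidence adjustment of Algorithm III.3, reusing the Cluster class from the earlier sketch; the pinhole projection of the cluster centroid and the Kinect-like default intrinsics (fx, fy, cx, cy) are our own assumptions, since the paper does not spell out the projection step:

```python
def project_to_pixel(centroid, fx, fy, cx, cy):
    """Project a 3D centroid (in the camera's optical frame) to pixel
    coordinates using a standard pinhole model."""
    X, Y, Z = centroid
    return (fx * X / Z + cx, fy * Y / Z + cy)

def match_clusters_with_faces(candidates, true_faces, kappa=0.6,
                              fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Add the confidence gain kappa to every candidate cluster whose
    projected centroid falls inside at least one true face box."""
    for cand in candidates:
        u, v = project_to_pixel(cand.centroid, fx, fy, cx, cy)
        for (x, y, w, h) in true_faces:
            if x <= u <= x + w and y <= v <= y + h:
                cand.confidence += kappa
                break
    return candidates
```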

IV. EXPERIMENTS AND RESULTS

We implemented and tested our approach using ROS and a Microsoft Kinect camera sensor. The system runs on an Intel Quad-Core i5 (3.4 GHz) processor with 8 GB of memory. We achieve an average frame rate of ∼15 Hz at ∼250% CPU usage while tracking up to four persons simultaneously. Our combined method using the filtering algorithms III.1 and III.2 and the matching algorithm III.3 causes only a minor loss of ∼3% (0.5 Hz) compared to the cluster-based person detection implementation available at [10]. See Table 1 for further details.





Computational Performance

Algorithm                                                           | Frame Rate (Hz) | CPU (%)
--------------------------------------------------------------------|-----------------|--------
Cluster-based Person Detection (basic) [10]                         | ∼15.4           | ∼174
Viola & Jones Face Detection (basic) [5]                            | ∼7.6            | ∼182
Cluster-based Person Detection (filtered, tracking up to 4 persons) | ∼15.1           | ∼176
Viola & Jones Face Detection (filtered, tracking up to 3 faces)     | ∼7.5            | ∼183
Combined Person Detection (tracking up to 4 persons)                | ∼14.9           | ∼248

Table 1: Comparison of the computational performance of each basic and extended algorithm versus the combined method.

Our person detection node subscribes to the registered point cloud and color image data published by the OpenNI stack, i.e. /camera/depth_registered/points and /camera/rgb/image_color. For the basic algorithms [4] and [9], parts of the implementations available at [5] and [10] are used. For [5], we used the included classifier data haarcascade_frontalface_default.xml, and for [10] the accompanying .yaml file containing the pre-trained SVM parameters of HoG features. In order to compute the ground plane coefficients required by [10], we used a static transform of three known ground points to the camera's optical frame, i.e. /camera_rgb_optical_frame. When detecting on the mobile service robot PR2 [20], [21], the tf system of ROS was used instead, which gives the camera's offset and rotation above the ground and allows the environment's ground plane coefficients to be estimated directly in relation to the camera's optical frame. Back-transformation of tracked person positions to and from the robot's fixed base or world frame further made it possible to account for relative motion when the robot is moving.

Considering our parameter selection, the parameters for the cluster filtering extension were set to kmin = 5, kmax = 10, v = 2.5, v′ = 1.5, c = −1.5 and c′ = −3.0. Further, note that the basic cluster-based detection additionally requires two parameters defining the minimum and maximum height between which extracted clusters must lie; those were set to 0.5 m and 2.5 m. For the face filtering module, we chose øA = 0.5, σA = 0.2, kmin = 3 and kmax = 6. For matching candidate clusters with true faces, we set κ = 0.6.
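As a sketch of how such a node can be wired up in rospy under the settings above; the callback bodies, the queue sizes and the plane-from-three-points helper are our own illustrative assumptions:

```python
import numpy as np
import rospy
import cv2
from cv_bridge import CvBridge
from sensor_msgs.msg import Image, PointCloud2

def ground_plane_from_points(p1, p2, p3):
    """Plane coefficients (a, b, c, d) with ax + by + cz + d = 0, computed
    from three known ground points given in the camera's optical frame."""
    p1, p2, p3 = map(np.asarray, (p1, p2, p3))
    normal = np.cross(p2 - p1, p3 - p1)
    normal /= np.linalg.norm(normal)
    return (*normal, -normal.dot(p1))

bridge = CvBridge()
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

def image_callback(msg):
    # Face detection chain: candidates feed Algorithm III.2.
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    candidate_faces = face_cascade.detectMultiScale(gray, 1.1, 5)

def cloud_callback(msg):
    # Cluster-based chain (segmentation/classification from [9], [10]);
    # omitted here, runs independently of the image callback.
    pass

rospy.init_node("person_detection")
rospy.Subscriber("/camera/rgb/image_color", Image,
                 image_callback, queue_size=1)
rospy.Subscriber("/camera/depth_registered/points", PointCloud2,
                 cloud_callback, queue_size=1)
rospy.spin()
```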

Figures 2–4 illustrate the detection rate of our combined method compared to both basic algorithms in different scenarios. For reasons of clarity and comprehensibility, the meanings of the visualized objects are as follows:
• White Sphere: A candidate cluster which was classified as a person by the basic person detection but is considered only a potential, not yet a true, person by the combined person detection.
• Green Sphere: A tracked cluster which has reached a high-confidence state and is therefore considered a true person by the combined person detection.
• White Square: A face detected by the Viola & Jones face detection algorithm which is not yet considered a true face by the filtered face detection.
• Green Square: A tracked face which has reached a high-confidence state and is considered a true face.
• Blue Square: A 2D-projected candidate cluster which was matched with a tracked true face.

Note that green and white spheres overlap directly in the visualization; hence, whenever no white sphere is visible inside a green one, the basic person detection was no longer able to detect the person. First of all, Figure 2 shows the detection accuracy in a typical standard scenario. In the upper image (a), all algorithms succeeded and all persons together with their faces could be detected. In the lower image (b), however, the person in the foreground was lost by the basic person detection but was successfully tracked by the combined method. Additionally, the printed face in the middle was detected by the filtered face detection but was correctly rejected as a person by the combined person detection due to the information fusion.

Figure 2: All algorithms succeeded in the upper image (a) while in the lower image (b) the combined person detection successfully tracked the person in the foreground and rejected the printed face in the middle.

In addition, the basic person detection was mostly unable to detect persons who were not in an upright pose. With Figure 3 we show in particular that the combined method using spatio-temporal filtering is also able to track bending (a) as well as sitting people, even when they are not turned towards the camera (b). This is a striking result in terms of reliability. However, a person still has to be detected initially, either by the basic person detection or by the filtered face detection, before tracking can start. Once a person has been detected, though, the combined method persistently avoids losing track.

Figure 3: The combined person detection succeeds in tracking bending (a) as well as sitting persons who are turned away from the camera (b).

Figure 4 lastly illustrates the detection robustness and accuracy achieved by matching the tracked high-confidence faces with candidate clusters. The filtered face detection in the upper image pair (a) has successfully detected the faces in the scene. However, since the SVM cluster confidence for the printed faces is not high enough, or no cluster was matched at all, the combined method does not consider them to represent actual persons. In contrast, the person on the right, whose cluster has a higher SVM confidence, is correctly detected and tracked. The lower image pair (b) also illustrates how the filtered face detection algorithm successfully merges only faces at the same depth layers.

Figure 4: Information fusion of tracked high-confidence faces with candidate clusters provides higher robustness and accuracy in person detection.

V. CONCLUSIONS

This paper has addressed the topic of person detection and tracking and suggested individual extensions to both of the methods presented in [8], [9] and [3], [4]. For each approach, we designed independent spatio-temporal filtering algorithms, which were ultimately fused by matching candidate person clusters on RGB-D point clouds with corresponding faces in RGB images. Each filtering extension by itself already reduced the number of false positives produced by its basic algorithm dramatically and enabled smooth and robust tracking. Furthermore, once a person had reached a high-confidence state and was hence considered a true positive, it was even possible to keep track of them where both basic approaches individually would have failed. Matching candidate clusters with detected faces finally increased the detection rate significantly, to an extent that also allowed bending, sitting, crouching or lying persons to be tracked reliably. These improvements were possible without requiring notably more computational power; our implementation runs at a stable ∼15 Hz while tracking. Also note that the extended spatio-temporal filtering approaches are quite modular, since their required input only depends on the results of single frame-by-frame detections. Accordingly, they can be applied to any similar basic algorithm in order to achieve better detection and tracking rates in terms of reliability and robustness.

VI. OUTLOOK

In the future, we plan to improve our solution by a statistical evaluation and optimization of the chosen parameters. The combined method will be tested on several datasets. Unfortunately, only very few are available, since our approach explicitly requires video streams containing color and depth information instead of single static and temporally uncorrelated RGB images. To date, the adaptive potential-area thresholding with respect to the elapsed time between two consecutive frames is only applied to the cluster filtering algorithm, which suggests a corresponding planned improvement of the face filtering approach. This also goes along with an adaptive adjustment of the minimum pixel size analyzed by the Viola & Jones face detection algorithm, which can be achieved by integrating the approximate distance of a candidate cluster in order to increase the frame rate of the face detection, as sketched below. Ultimately, we aim to extend our approach by recognition of known persons as well as adaptive learning of previously unknown persons in order to serve the growing demand for autonomy in mobile robots.
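A back-of-the-envelope sketch of this distance-adaptive minimum face size, assuming a pinhole camera model; the focal length, nominal physical face height and safety margin are our own illustrative constants:

```python
def min_face_size_px(cluster_distance_m, focal_px=525.0,
                     face_height_m=0.20, margin=0.8):
    """Expected face height in pixels at a given distance, scaled by a
    safety margin; pass the result as minSize to detectMultiScale."""
    expected_px = focal_px * face_height_m / cluster_distance_m
    size = max(1, int(expected_px * margin))
    return (size, size)

# e.g. a candidate cluster at 2 m yields roughly (42, 42) pixels minimum,
# so the cascade can skip all smaller detection scales.
```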

REFERENCES

[1] ROS framework. [Online]. Available: http://www.ros.org
[2] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, and J. Leibs, "ROS: An Open-Source Robot Operating System," Open-Source Software Workshop of the IEEE International Conference on Robotics and Automation, 2009.
[3] P. Viola and M. Jones, "Robust Real-Time Object Detection," International Journal of Computer Vision, 2001.
[4] P. Viola and M. Jones, "Robust Real-Time Face Detection," International Journal of Computer Vision, 2004.
[5] OpenCV Cascade Classifier. [Online]. Available: http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html
[6] S. Zhang, C. Zhu, J. K. O. Sin, and P. K. T. Mok, "An Analysis of the Viola-Jones Face Detection Algorithm," Image Processing On Line, pp. 128–148, 2014.
[7] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.
[8] M. Munaro, F. Basso, and E. Menegatti, "Tracking people within groups with RGB-D data," Proceedings of the International Conference on Intelligent Robots and Systems, pp. 2101–2107, 2012.
[9] M. Munaro and E. Menegatti, "Fast RGB-D people tracking for service robots," Autonomous Robots, vol. 37, pp. 227–242, 2014.
[10] PCL Ground Based RGB-D People Detection. [Online]. Available: http://pointclouds.org/documentation/tutorials/ground_based_rgbd_people_detection.php
[11] OpenNI Programmer's Guide. [Online]. Available: http://structure.io/openni
[12] M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[13] N. Hu, R. Bormann, T. Zwölfer, and B. Kröse, "Multi-user identification and efficient user approaching by fusing robot and ambient sensors," IEEE International Conference on Robotics and Automation, pp. 5299–5306, 2014.
[14] R. Bormann, T. Zwölfer, J. Fischer, J. Hampp, and M. Hägele, "Person recognition for service robotics applications," IEEE-RAS International Conference on Humanoid Robots, pp. 260–267, 2013.
[15] ROS node cob_people_detection. [Online]. Available: http://wiki.ros.org/cob_people_detection
[16] S. S. Farfade, M. Saberian, and L. Li, "Multi-view Face Detection Using Deep Convolutional Neural Networks," International Conference on Multimedia Retrieval, 2015.
[17] J. Fan, W. Xu, Y. Wu, and Y. Gong, "Human Tracking Using Convolutional Neural Networks," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1610–1623, 2010.
[18] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 743–761, 2012.
[19] R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Ten Years of Pedestrian Detection, What Have We Learned?" in Computer Vision for Road Scene Understanding and Autonomous Driving (CVRSUAD, ECCV Workshop), September 2014.
[20] PR2 Mobile Service Robot. [Online]. Available: https://www.willowgarage.com/pages/pr2/overview
[21] S. Cousins, "ROS: An Open-Source Robot Operating System," IEEE Robotics and Automation Magazine, vol. 17, pp. 23–25, 2010.
